Social Media Mining and Bioinformatics (with R)

Arthur Charpentier

8 years ago

[This article was first published on Freakonometrics » R-english, and kindly contributed to R-bloggers].


In June and July, I received copies of two books.

For the first one, two recent and interesting books deal with the same topic. Reza Zafarani, Mohammad Ali Abbasi and Huan Liu published Social Media Mining: An Introduction last year. The book can in fact be downloaded from dmml.asu.edu. And, of course, there is Matthew A. Russell's Mining the Social Web. The main appeal of this new book seems to be that it should be just perfect for R users!

Now, to be honest, among Twitter, Facebook, LinkedIn, Google+, etc., my main interest so far is Twitter (which I am starting to understand well, I believe). I would love to start working on LinkedIn's connections, but so far I have not tried. I was a bit disappointed by Matthew's book, which does not simply present all the APIs, but on Twitter I did not find many (interesting) applications. I believe, however, that I have a biased and partial view of the book, which goes way beyond my usual interests. Ian Hopkinson published an interesting review of Matthew's book, and I believe that you can really do a lot of things on social media with it.

Now, to get back to the book I received, David Springate wrote a review on his blog that is extremely interesting (and I share most of his concerns). I did learn a lot of things about sentiment analysis and the tm package (for text mining), but I am clearly not an expert. On the other hand, I did not learn much about R (such as subtle points on manipulating strings and words). But I believe that this was not the goal of the book. By the way, the code can be found at https://github.com/SocialMediaMininginR, so anyone can play with it.

I should also probably mention that I would have expected less on the (basic) R language (such as plotting a histogram, or Anscombe's regression) and more on the roots of sentiment analysis, for instance on the algorithm, or on pitfalls (with some examples, such as irony or sarcasm, which are rather common on Twitter). There is a (short) chapter 5 about the theory (or sort of); it is very brief, but it gives hints about what is going on, and then there are applications in chapter 6. I would have preferred to have the theory and then the code, with some comments, in the same chapter. And maybe 120 pages, instead of 40. I have the feeling that several opportunities have been missed. It is clearly not a starting point for mining social media, but combined with another book, it might be interesting (if you are already an R user).
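To make the sarcasm pitfall concrete, here is a minimal sketch of a word-counting sentiment score in base R. It is not taken from the book: the two word lists, the function name score_sentiment and the example tweets are all made up for illustration, and a real analysis would rely on a proper lexicon and on tm for the preprocessing.

```r
# Toy lexicon-based sentiment scorer (base R only).
# Hypothetical word lists, for illustration; real lexicons are much larger.
positive_words <- c("great", "love", "perfect", "good")
negative_words <- c("bad", "hate", "awful", "terrible")

score_sentiment <- function(text) {
  # Lower-case, replace punctuation by spaces, split on whitespace
  cleaned <- gsub("[[:punct:]]", " ", tolower(text))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  # Score = number of positive words minus number of negative words
  sum(words %in% positive_words) - sum(words %in% negative_words)
}

# Made-up tweets: one sincere, one plainly negative, one sarcastic
tweets <- c(
  sincere   = "I love this book, the examples are great",
  negative  = "Awful delays again, I hate this airline",
  sarcastic = "Great, my flight is cancelled again. Just perfect."
)

scores <- sapply(tweets, score_sentiment)
print(scores)
```

The sarcastic tweet gets the same positive score as the sincere one, since the scorer only counts words and has no notion of context. That is the kind of limitation I would have liked to see discussed.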

About the second one, I have to admit (one more time) that my expertise in bioinformatics is rather limited. There is a really nice ebook on a similar topic, by Avril Coghlan, entitled A Little Book of R For Bioinformatics, also available online at http://a-little-book-of-r-for-bioinformatics.readthedocs.org/. Nevertheless, the models mentioned here are the same as the ones I use in my research, or teach in my courses.

For instance, there is a chapter on Machine Learning (in Bioinformatics). Now, let's be honest one more time: as claimed in the title, it is a cookbook. But it is a fair one. In the Machine Learning chapter, there is a section on cross validation. Let us look at it to see how it is structured (the structure is the same throughout the book).

We start with a brief introduction and description of the problem. Then, a short paragraph about the dataset used

Now, the core of the section is the following part,

(etc)

Here, we have the R code (with an introduction, to make sure we understand what we’re doing here).

Then, we have a wrap-up summary, where all the points are connected. But of course, alternative functions and packages can be used, and it is mentioned in the next paragraph,

And to conclude, there is a (really) brief list of references, to go further on theoretical aspects,
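For readers who have not met the technique that section covers, k-fold cross validation can be sketched in a few lines of base R. This is not the book's code: the mtcars dataset (shipped with R), the model mpg ~ wt and the choice of 5 folds are assumptions made only to keep the example self-contained.

```r
# 5-fold cross validation of a linear model, base R only.
# mtcars and the model mpg ~ wt are arbitrary choices for illustration.
set.seed(1)
k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of (nearly) equal size
folds <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  train <- mtcars[folds != i, ]   # fit on the other k - 1 folds
  test  <- mtcars[folds == i, ]   # evaluate on the held-out fold
  fit   <- lm(mpg ~ wt, data = train)
  pred  <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - pred)^2)) # out-of-sample RMSE for fold i
})

# Cross-validated error: average of the k held-out errors
cv_rmse <- mean(fold_rmse)
print(fold_rmse)
print(cv_rmse)
```

Each observation is used exactly once for testing, so the averaged error is an out-of-sample estimate, unlike the residual error of a model fitted on the whole dataset.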

Need to quickly find a function to get a ROC curve or to visualize clusters? I think that you will find an illustration in this book that lets you do it on your own. So I believe that it does the job. Now, just to be clear, 90% of the book is clearly outside my scope: I know nothing about "Protein Structure Analysis", and even if someday I might be interested in learning more on that topic, so far, I do not really care. Nevertheless, I am currently struggling to read a .sql file in R, so I went through the book to see if I could find a technique to read such a file, but I could not find anything helpful.

