I was prompted by a thread on the apparent decline of FriendFeed to look for evidence of declining participation in my networks.
First, a quick and dirty Ruby script, tls.rb, to grab the Life Scientists feed and count the likes and comments:
#!/usr/bin/ruby

require 'rubygems'
require 'json/pure'
require 'net/http'
require 'open-uri'

# split an ISO 8601 timestamp into "date,time"
def format_date(d)
  if d =~ /(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})Z/
    return "#{$1},#{$2}"
  else
    return d
  end
end

# likes/comments are missing from the JSON when there are none
def count_items(i)
  if i.nil?
    return 0
  else
    return i.count
  end
end

n = ARGV[0]
u = "http://friendfeed-api.com/v2/feed/the-life-scientists?start=#{n}"
f = open(u).read
j = JSON.parse(f)

j.each_pair do |k,v|
  if k == "entries"
    v.each do |entry|
      date     = format_date(entry['date'])
      likes    = count_items(entry['likes'])
      comments = count_items(entry['comments'])
      puts "#{entry['id']},#{date},#{likes},#{comments}"
    end
  end
end
By default, the API call returns the last 30 items, starting at zero. You can move back in time by running this script with, for example, “tls.rb 30”. Really, there should be a check that ARGV[0] is an integer, but in practice the argument can be absent (or anything at all) and it will simply be ignored. I did say quick and dirty.
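A minimal version of that missing check might look something like this (a sketch only; the fall-back to zero is my own choice, not part of the original script):

# hypothetical guard: accept ARGV[0] only if it is a non-negative integer, otherwise start at 0
n = (ARGV[0] =~ /\A\d+\z/) ? ARGV[0] : 0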
The script returns CSV with entry ID, date, time, likes count and comments count, looking like this:
e/701b62de37c751ccea1b746b53d00352,2009-12-11,02:50:49,2,0
e/7b7db83f08debbf92670c74700574b8c,2009-12-11,01:10:58,0,0
e/59efb928ec73ea9849beca02f0f86b48,2009-12-10,23:55:48,0,0
e/53f8222704468289608378ed17489156,2009-12-10,23:54:17,0,0
....
One big drawback of the FriendFeed API is that you cannot retrieve entries by date, or a range of dates. By experimenting with values of “?start=N” in the URL, it seemed that N=3600 retrieved entries from late 2008 onwards. And so:
for i in `seq 0 30 3600`; do ./tls.rb $i >> ffdata-raw.csv; done
Be aware that this will not retrieve all posts for 2009 and there will also be duplicate entries – which we can filter out by entry ID. To remove duplicates and 2008 entries:
sort -u ffdata-raw.csv | grep ",2009-" > ffdata-filtered.csv
We’re not quite there yet. We have unique records, but several can share the same date, so we need to sum the likes and comments for each date. Should have done that in the Ruby script really…but we can use awk to sum the likes, as follows:
awk -F"," '{OFS=",";cnt1[$2]+=$4}END{for (x in cnt1){print x,cnt1[x]}}' ffdata-filtered.csv > ffdata-likes.csv
Just substitute $5 to sum the comments.
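In full, that would be (the ffdata-comments.csv output filename is my choice):

awk -F"," '{OFS=",";cnt1[$2]+=$5}END{for (x in cnt1){print x,cnt1[x]}}' ffdata-filtered.csv > ffdata-comments.csv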
Last step: read the file into R, download Paul Bleicher’s calendarHeat.R code and generate plots:
> source("calendarHeat.R")
> fflikes <- read.csv("ffdata-likes.csv", check.names=F, header=F)
> png(filename="tls-likes.png", type="cairo", width=640)
> calendarHeat(fflikes$V1, fflikes$V2, varname="Likes", color="r2b")
> dev.off()
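Plotting the comments is the same recipe; a sketch, assuming you generated ffdata-comments.csv as above (the ffcomments and tls-comments.png names are mine):

> ffcomments <- read.csv("ffdata-comments.csv", check.names=F, header=F)
> png(filename="tls-comments.png", type="cairo", width=640)
> calendarHeat(ffcomments$V1, ffcomments$V2, varname="Comments", color="r2b")
> dev.off()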
That was quick, relatively easy and most of all, fun.
In contrast, I’ve been trying to mine microarray data from the NCBI GEO database for the best part of 8 months now.
There’s an API of sorts, but getting the results I want is neither quick nor easy, and most certainly not fun.
Is it any wonder that all the cool kids want to be web developers, not data scientists?