
Craig Venter’s first chromosome

[This article was first published on isomorphismes, and kindly contributed to R-bloggers].

is, I think, the one you can find at sacred-texts.com.

curl -O 'http://www.sacred-texts.com/dna/hgp011k.htm'  #get it

#BORING DATA JANITORSHIP
tail -n +15 hgp011k.htm > hgp011k   #remove the HTML head stuff .. up to <pre>
head -n -3 hgp011k | sponge hgp011k #remove the HTML tail

#the `sponge` nonsense is because `command < file > file` will just blank your file
#`sponge` holds the output in a temp/swap for a sec, then writes > to file
#you can also slow your shell down by wrapping `command` in this bit of nonsense:
echo "`head -n -3 hgp011k`" > hgp011k

#now it's almost clean … just tattied with needless line endings
tr -d '\r' < hgp011k | sponge hgp011k   #strip carriage returns
tr -d '\n' < hgp011k | sponge hgp011k   #strip newlines, leaving one long line

#ALL CLEAN!
less hgp011k
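
If `sponge` (it ships with moreutils) isn't on your machine, a temp file does the same job. A minimal sketch: the helper name `in_place` is made up for illustration, and the throwaway demo file only stands in for hgp011k.

```shell
# in_place FILE CMD [ARGS...]: run CMD with FILE on stdin, then replace FILE with the output.
# (in_place is a hypothetical helper, not a standard tool.)
in_place() {
  file=$1; shift
  tmp=$(mktemp) || return 1
  # Write to the temp file first, so FILE is not truncated before CMD has read it.
  if "$@" < "$file" > "$tmp"; then
    mv "$tmp" "$file"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Demo on a throwaway file with Windows-style line endings.
demo=$(mktemp)
printf 'ACGT\r\nTTGA\r\n' > "$demo"
in_place "$demo" tr -d '\r\n'   # both deletions in one pass
cat "$demo"
```

One wrinkle: the replaced file inherits the temp file's permissions, which `mktemp` keeps private to your user.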

So that’s a bit of unix 101 / datacleaning 101. Now open up an R terminal for the fun part:

craig.v <- scan(file='hgp011k',what='character')
table( strsplit( craig.v, '') )

    A     C     G     T 
14941 15080 15210 14769

A good time was had by all.
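
The same tally can be cross-checked without leaving the shell. A sketch on a short made-up sequence (not the real file): `fold -w1` puts one character per line, and `sort | uniq -c` plays the role of `table`. The function name `tally` is my own.

```shell
# tally STRING: count each distinct character, printed as "char count".
# (tally is a hypothetical helper name.)
tally() {
  printf '%s\n' "$1" | fold -w1 | sort | uniq -c | awk '{print $2, $1}'
}

# Made-up 13-base sample standing in for the cleaned hgp011k file.
seq_sample='ACGTTGCAAGCTA'
tally "$seq_sample"
```

On the real thing, `fold -w1 < hgp011k | sort | uniq -c` should give counts to compare against the R table.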

Why don’t we do the same thing with π? Unlike Dr V’s DNA, I don’t have to get all wet and bloody acquiring as much of this data as I want. I do have to set some limits on how long to run bc (the Unix arbitrary-precision calculator), though.

echo "scale=22222; a(1)*4" | bc -l  > pi.22222 #a(1) = arctan(1) = π/4, so a(1)*4 = π
less pi.22222   #needs cleanup
echo "scale=22222; a(1)*4" | bc -l | tr -d '\n' | tr -d '\\'  > pi.22222
#one-liner! and it feels so good…
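
For anyone wondering what the two deletions are for: `bc` wraps long numbers, ending each continued line with a backslash. A sketch with a hand-written stand-in for that wrapped output (so it runs even without `bc`):

```shell
# Hand-written stand-in for bc's wrapped output: each continued line ends in "\".
wrapped=$(printf '3.14159\\\n26535\\\n89793\n')
# Delete backslashes and newlines in a single tr pass.
clean=$(printf '%s\n' "$wrapped" | tr -d '\\\n')
echo "$clean"
```

With GNU bc, setting the environment variable `BC_LINE_LENGTH=0` should switch the wrapping off at the source; that is a GNU extension, so don't count on it elsewhere.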

That was comparatively easier than scrolling through the HTML file to find the beginning of what we really wanted. R me the rock:

pi.2 <- scan(file='pi.22222', what='character')
pi.2 <- strsplit(pi.2, '')    #R has no problem with the update-my-thing syntax! rainbow bash could learn a thing or two
table(pi.2) #could have also done table( strsplit( ... ))
   .    0    1    2    3    4    5    6    7    8    9 
   1 2186 2205 2179 2202 2259 2315 2254 2201 2194 2228 
   
#those are a bit hard to read so …

table(pi.2) / median(table(pi.2))

           .            0            1            2            3            4 
0.0004541326 0.9927338783 1.0013623978 0.9895549500 1.0000000000 1.0258855586 
           5            6            7            8            9 
1.0513169846 1.0236148955 0.9995458674 0.9963669391 1.0118074478


#still a bit inscrutable

round( table(pi.2) / median(table(pi.2)), 3)

    .     0     1     2     3     4     5     6     7     8     9 
0.000 0.993 1.001 0.990 1.000 1.026 1.051 1.024 1.000 0.996 1.012 



There we go. Pretty even distribution of digits of pi, and let’s leave the analysis of the dispersion for another day!

Obviously this was just an excuse for me to show off some unix tools like tr, curl, bc, tail -n +num, head -n -num, and some R functions like table, scan, and strsplit. But it works much better with a story, doesn’t it?!

Anyway: Dr Venter, your epidermis is showing!
