[This article was first published on "R-bloggers" via Tal Galili in Google Reader, and kindly contributed to R-bloggers].
Every time I read a paper that discusses the frequencies of single letters in English, I feel like I should sit down and calculate them for myself from a sample of English text. Today, I finally did. Here are the probabilities and negative log probabilities of the letters of English, estimated from the corpus of Shakespeare’s plays:
And, for those who care, here’s the code to generate the data from the plays, which I downloaded from Project Gutenberg:
```ruby
# Start every letter at zero so that letters never seen still appear in the output.
def initialize_letter_counts(letter_counts)
  ('a'..'z').each do |chr|
    letter_counts[chr] = 0
  end
end

# Read a file one character at a time and tally each ASCII letter, ignoring case.
def parse_file(filename, letter_counts)
  # The block form of File.open closes the file handle when the block exits.
  File.open(filename) do |f|
    f.each_char do |char|
      char = char.downcase
      letter_counts[char] += 1 if char.match(/[a-z]/)
    end
  end
  nil
end

directory = '/Users/johnmyleswhite/Princeton/Research/Letter Frequency'
Dir.chdir(directory)

letter_counts = {}
initialize_letter_counts(letter_counts)

# Tally every .txt file in the Data directory (the dot is escaped so it matches literally).
Dir.new('Data').entries.each do |entry|
  if entry.match(/\.txt$/)
    entry = File.expand_path(entry, directory + '/Data')
    parse_file(entry, letter_counts)
  end
end

# Print one CSV row per letter: 'letter',count
letter_counts.keys.sort.each do |key|
  puts "'#{key}',#{letter_counts[key]}"
end
```
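The script above only prints raw counts, so turning them into probabilities and negative log probabilities is a separate step. Here is a minimal sketch of how that step might look. The helper name `letter_statistics` and the toy counts are hypothetical, and base-2 logarithms are an assumption, since the post does not say which base was used.

```ruby
# Hypothetical helper (not from the original post): convert raw letter counts
# into probabilities and negative log probabilities.
# Assumption: base-2 logarithms; the original base is not stated.
def letter_statistics(letter_counts)
  total = letter_counts.values.sum.to_f
  return {} if total.zero? # no letters counted, so nothing to normalize

  stats = {}
  letter_counts.each do |letter, count|
    probability = count / total
    # A letter that never occurs has probability 0, so its negative log is infinite.
    neg_log = count.zero? ? Float::INFINITY : -Math.log2(probability)
    stats[letter] = { probability: probability, neg_log_probability: neg_log }
  end
  stats
end

# Toy counts for illustration only; these are not the Shakespeare results.
example_counts = { 'a' => 2, 'b' => 1, 'c' => 1 }
letter_statistics(example_counts).each do |letter, s|
  puts "#{letter},#{s[:probability]},#{s[:neg_log_probability]}"
end
```

Passing the `letter_counts` hash built by the script to this helper should give one probability and one negative log probability per letter, with the probabilities summing to 1.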