R-bloggers

Lists of English Words

[This article was first published on Byte Mining, and kindly contributed to R-bloggers.]
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

When I was a kid, I went through an 80s music phase…well, some things never change. “People just love to play with words…” Know that song? Anyway…

One of the biggest pains of text mining and NLP is colloquialism: language that is appropriate only in casual conversation and not in formal speech or writing. Informal contractions (“gonna”, “wanna”, “whatcha”, “ain’t”, “y’all”) are colloquialisms and are everywhere on the Web. There is also a great deal of slang common on the Web, including acronyms (“LOL”, “WTF”) and emoticons or smilies that add sentiment to text. There is also a less common slang called leetspeak that replaces letters with numbers or deliberately misspells words (“n00b” rather than “noob”, “pwned” instead of “owned” and “pr0n” instead of “porn”).
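As a rough illustration of the digit-for-letter part of leetspeak, here is a minimal Python sketch. The substitution table `LEET_DIGITS` and the function `normalize_leet` are hypothetical names made up for this example, and the table covers only a handful of common substitutions, not leetspeak in general.

```python
# Hypothetical digit-to-letter table (an assumption for illustration);
# real leetspeak has many more variants than these six.
LEET_DIGITS = str.maketrans(
    {"0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "7": "t"}
)


def normalize_leet(token: str) -> str:
    """Lowercase a token and undo simple digit-for-letter substitutions."""
    # Leave purely numeric tokens (years, counts) untouched.
    if token.isdigit():
        return token
    return token.lower().translate(LEET_DIGITS)


print(normalize_leet("n00b"))   # digits replaced by letters
print(normalize_leet("pr0n"))   # the transposed letters are NOT fixed
print(normalize_leet("pwned"))  # no digits, so nothing changes
```

A table like this only handles the number substitutions. Misspellings such as “pwned” or the transposed letters in “pr0n” would still need an explicit lookup, and legitimate tokens that mix letters and digits (“mp3”) would be mangled, so a check against a known word list before and after normalizing is probably wise.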

There are also regionalisms, which are a pain for semantic analysis but not so much for probabilistic analysis. Some examples are pancakes (“flapjacks”, “griddlecakes”) or carbonated beverages (“soda”, “pop”, “Coke”). Or, little did I know, “maple bars” vs. “Long Johns”. Now I am hungry. There are also words that have a formal and an informal meaning, such as “kid” (a young goat, or a child…same thing).


Source: http://popvssoda.com/

Linguists consider colloquialisms different from slang. Slang is informal language used by a specific group of people: Internet users, gamers, teenagers, college students, men/women, surfers, skaters, boarders, etc. These words can be used to put users into social groups, but that is beyond the scope of this post.

Text mining becomes a lot less overwhelming if we can filter out known English words and focus on mapping colloquialisms, slang and Internet jargon to known English words. A list of known English words lets us do just that. I got some great recommendations for lists of English words that go beyond the typical list of about 58,000 words. This list may evolve over time, but it is what I have for now, and it was very useful to me.
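The filtering idea can be sketched in a few lines of Python. Everything named here is an assumption for illustration: `KNOWN_WORDS` is a tiny stand-in for a real word list (which you would normally load from a file, one word per line), `COLLOQUIAL_MAP` is a hypothetical hand-built mapping, and `split_known_unknown` is just one possible way to organize the step.

```python
import re

# Tiny stand-in for a real English word list (assumption for illustration).
KNOWN_WORDS = {"i", "am", "going", "to", "want", "the", "party", "you", "all"}

# Hypothetical hand-built mapping from colloquialisms to known English words.
COLLOQUIAL_MAP = {"gonna": "going to", "wanna": "want to", "y'all": "you all"}


def split_known_unknown(text, known=KNOWN_WORDS, mapping=COLLOQUIAL_MAP):
    """Return (known_tokens, unknown_tokens) after mapping colloquialisms."""
    # Normalize case and curly apostrophes so "Y’all" matches "y'all".
    cleaned = text.lower().replace("\u2019", "'")
    tokens = re.findall(r"[a-z0-9']+", cleaned)

    known_tokens, unknown_tokens = [], []
    for token in tokens:
        # Replace a colloquialism with its mapped form, if one is listed.
        replacement = mapping.get(token, token)
        for word in replacement.split():
            if word in known:
                known_tokens.append(word)
            else:
                unknown_tokens.append(word)
    return known_tokens, unknown_tokens


known, unknown = split_known_unknown("I am gonna party, y'all! LOL")
print(known)
print(unknown)
```

Whatever ends up in the unknown list is the interesting part: those are the candidates for slang, jargon or typos that still need a mapping. The bigger and better the word list, the shorter that list gets.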

What about you? What are your favorite word lists?
