The Imperfection of Language

Mark Niemann-Ross

3 months ago

[This article was first published on R Programming Archives - Mark Niemann-Ross, and kindly contributed to R-bloggers.]


Human languages are notoriously ambiguous. Computer languages are notoriously unambiguous. Humans (mostly) are comfortable with uncertainty. Computers don’t even believe uncertainty is possible. It’s what led us to create unambiguous languages specifically for computers.

“One morning, I shot an elephant in my pajamas. How he got in my pajamas I don’t know.”
Groucho Marx

Natural Language Processing (NLP) is asking a computer to analyze human language, which really shouldn’t be possible. To do this, we have to provide the computer with a set of rules, as well as a set of exceptions to each rule. Even then, we have to accept that the results will be somewhat inaccurate.

Stemming is a tool used in Natural Language Processing. It reduces related forms of a word to a common stem, so they are counted together and the analysis of a corpus (body of text) is more accurate. Like the rest of NLP, it makes assumptions and is often inaccurate.

For example, consider walking, walked, walk, and walker. In a story about hiking, these four words should ideally be counted as four occurrences of the same word, rather than one occurrence each of four different words. We could then conclude the article is about a slow hike, versus running or biking down the trail.

Here’s the R code. Notice that the stemmer leaves “walker” alone, an early hint that stemming is imperfect.

> library(tm)
Loading required package: NLP
> stemDocument(c("walking", "walked", "walk", "walker"))
[1] "walk"   "walk"   "walk"   "walker"
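To see what this does to a word count, here is a small sketch that tallies the same four words before and after stemming. It assumes the tm package is installed along with SnowballC, which stemDocument() relies on; the variable names are only for illustration.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

words <- c("walking", "walked", "walk", "walker")

# Without stemming, every form is counted as its own word (one each)
table(words)

# With stemming, forms that share a stem are counted together
stems <- stemDocument(words)
counts <- table(stems)
counts
# walk: 3, walker: 1
```

Three of the four forms now land in the same bucket, which is the whole point of stemming, and “walker” is the one that gets away.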

Now consider the word “cheaply.” I ran a poll on LinkedIn and Mastodon about a natural language processing function called stemDocument(). It’s part of the R tm package. Here’s the poll…

Take Our Poll

Go ahead. Answer the question…

…I’ll wait.

…Still waiting

Time’s up

On LinkedIn and Mastodon, 100% of the answers were “cheap.”

But that’s the wrong answer! It’s “cheapli.”

Don’t believe me? Here’s some R code…

> library(tm)
Loading required package: NLP
> stemDocument("cheaply")
[1] "cheapli"

In what world would converting cheaply to cheapli be the correct answer? Let me answer this question by showing the algorithm for this change, and then a reason the algorithm exists.

cheaply = cheapli

The tm package in R relies on the revised Porter stemming algorithm, often called Porter2. The decision tree is described in The English (Porter2) stemming algorithm.

Step 1c of the algorithm is responsible for our confusion: “replace suffix y or Y by i if preceded by a non-vowel which is not the first letter of the word (so cry->cri, by->by, say->say)”
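Read literally, that rule fits in a single regular expression. The following is a minimal base-R sketch of step 1c only, not the code that tm actually runs. The function name step_1c is made up for illustration, and the sketch assumes lowercase input and treats a, e, i, o, u, and y as the vowels.

```r
# Hypothetical helper: a rough paraphrase of step 1c, nothing more.
# A trailing y (or Y) becomes i when the letter before it is a non-vowel
# and that non-vowel is not the first letter of the word.
step_1c <- function(word) {
  # ".+" forces at least one letter before the non-vowel,
  # so the non-vowel cannot be the first letter of the word.
  sub("^(.+[^aeiouy])[yY]$", "\\1i", word)
}

step_1c(c("cry", "by", "say", "cheaply"))
# [1] "cri"     "by"      "say"     "cheapli"
```

The first three results match the examples quoted in the rule, and the fourth is the “cheapli” that surprised everyone in the poll.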

Great – so “possibly” should become “possibli,” right? Nope. Here’s the code…

> stemDocument(c("possibly","possible"))
[1] "possibl" "possibl"

It turns out that “…bly” is also a special case, handled by a later step of the algorithm.
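As I read the Porter2 description, step 2 rewrites a “bli” suffix as “ble,” and deletes “li” only when it follows one of the letters c, d, e, g, h, k, m, n, r, or t. Here is a deliberately simplified base-R sketch of just those two rules, applied to words that have already been through step 1c. The function name step_2_sketch is hypothetical, and the sketch ignores the other step 2 suffixes and the algorithm’s region (R1) check, so treat it as an illustration rather than a stemmer.

```r
# Hypothetical helper: only two of the step 2 rules, with no R1 region check.
step_2_sketch <- function(word) {
  if (grepl("bli$", word)) {
    sub("bli$", "ble", word)        # bli -> ble
  } else if (grepl("[cdeghkmnrt]li$", word)) {
    sub("li$", "", word)            # drop li after a valid li-ending
  } else {
    word                            # anything else is left alone
  }
}

after_1c <- c("possibli", "cheapli", "manli")
vapply(after_1c, step_2_sketch, character(1), USE.NAMES = FALSE)
# [1] "possible" "cheapli"  "man"
```

The letter “p” is not on the list, so “cheapli” survives untouched. The final “e” of “possible” is removed by a later step that this sketch does not model, which is how stemDocument() ends up at “possibl.”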

But why?

There is no winning, but there is compromise. Consider this code example:

> stemDocument(c("many","man", "manly"))
[1] "mani" "man"  "man"

Although “many” and “man” are spelled almost the same, they have different meanings and shouldn’t be lumped together when analyzing a corpus. If the algorithm just cut off the “…y” suffix, “many” would collapse into “man.” So the suffix is replaced with “…i” instead, which keeps the two apart. (“Manly,” on the other hand, does get merged with “man.”) Which brings us around to “cheaply” again. The internals of the algorithm are confusing and frustrating – but then so is human language.
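The trade-off is easy to see with two one-line substitutions in base R. Neither is the real stemmer: the “drop the y” rule is a made-up naive alternative, and the “y to i” rule is only a stand-in for step 1c.

```r
words <- c("many", "man")

# Naive rule (hypothetical): simply cut off a trailing "y"
drop_y <- sub("y$", "", words)
drop_y
# [1] "man" "man"

# Stand-in for step 1c: turn the trailing "y" into "i"
y_to_i <- sub("y$", "i", words)
y_to_i
# [1] "mani" "man"
```

With the naive rule the two words become indistinguishable. With the substitution, “mani” looks odd, but it stays separate from “man,” and that is the compromise.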

Learn More

I’ve produced several courses on stemming in particular and natural language processing in general. Here are links to two of them.

“Stemming,” from the course Introduction to NLP Using R by Mark Niemann-Ross.

“Performing Natural Language Processing with R” on Educative.
