[This article was first published on Data Perspective, and kindly contributed to R-bloggers].
September 23, 2013
Recently I have developed an interest in analyzing data to find trends, predict future events, etc., and have started working on a few POCs in data analytics such as predictive analysis and text mining. This post is on data mining, more specifically document classification, using the R programming language, one of the most powerful languages for statistical analysis.
What is Document classification?
Document classification, or document categorization, is the task of assigning documents to one or more classes/categories, either manually or algorithmically. Today we classify algorithmically. Document classification is a supervised machine learning technique.
Technically speaking, we create a machine learning model using a number of text documents (called a corpus) as input and their corresponding classes/categories (called labels) as output. The model thus generated will be able to assign a class when a new text is supplied.
Inside the Black Box: Let's have a look at what happens inside the black box in the above figure. We can divide the process into the following steps:
- Creation of Corpus
- Preprocessing of Corpus
- Creation of Term Document Matrix
- Preparing Features & Labels for Model
- Creating Train & test data
- Running the model
- Testing the model
We have speeches of the US presidential contestants Mr. Obama and Mr. Romney. We need to create a classifier that can tell whether a particular new speech belongs to Mr. Obama or Mr. Romney.
Implementation
We implement the document classification using the tm and plyr packages. As a preliminary step, we need to load the required libraries into the R environment:
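A minimal sketch of the setup (assuming tm, plyr, and class are installed; class provides the kNN classifier used later in Step VI):

```r
# Load the packages used throughout this walkthrough
library(tm)     # corpora, cleaning, term-document matrices
library(plyr)   # data manipulation helpers
library(class)  # knn() classifier used in Step VI
```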
- Step I: Corpus creation:
In our case, we create two corpora, one for each contestant.
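A sketch of how the two corpora might be built with tm. The directory paths in the comments are hypothetical, and the in-memory speeches below are toy data so the snippet runs on its own:

```r
library(tm)
# In practice, read one plain-text speech per file, e.g.:
#   obama  <- VCorpus(DirSource("speeches/obama"))   # hypothetical path
#   romney <- VCorpus(DirSource("speeches/romney"))  # hypothetical path
# Toy in-memory corpora for illustration:
obama  <- VCorpus(VectorSource(c("Four more years of progress.",
                                 "We will rebuild the middle class.")))
romney <- VCorpus(VectorSource(c("Government has grown too big.",
                                 "We will restore our economy.")))
```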
- Step II: Preprocessing of Corpus
Preprocessing involves removal of punctuation, extra white space, and stop words such as is, the, for, etc.
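With tm, each cleaning step is a tm_map() transformation; a sketch on a toy corpus:

```r
library(tm)
corpus <- VCorpus(VectorSource(c("The economy is growing!",
                                 "We, the people, will decide.")))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # is, the, for, ...
corpus <- tm_map(corpus, stripWhitespace)                    # collapse blanks
content(corpus[[1]])
```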
- Step III: Term Document Matrix
This step involves creation of the Term Document Matrix, i.e. a matrix holding the frequency of each term across the collection of documents. For example:
D1 = "I love Data analysis"
D2 = "I love to create data models"
TDM:
Term      D1  D2
I          1   1
love       1   1
data       1   1
analysis   1   0
to         0   1
create     0   1
models     0   1
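The same example in code, lower-casing first so "Data" and "data" count as one term. Note that tm's default tokenizer drops terms shorter than three characters, so "I" and "to" do not appear in the matrix:

```r
library(tm)
docs <- VCorpus(VectorSource(c("I love Data analysis",
                               "I love to create data models")))
docs <- tm_map(docs, content_transformer(tolower))
tdm  <- TermDocumentMatrix(docs)
as.matrix(tdm)  # rows = terms, columns = documents, cells = frequencies
```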
- Step IV: Feature Extraction & Labels for the model:
In this step, we extract the input feature words that are useful in distinguishing the documents, and attach the corresponding classes as labels.
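A sketch of turning the term-document matrix into a feature table with one row per document plus a label column; the speeches and labels below are toy stand-ins, not the actual campaign data:

```r
library(tm)
speeches <- c("jobs for the middle class", "cut taxes on small business",
              "invest in the middle class", "help business cut taxes")
labels   <- c("Obama", "Romney", "Obama", "Romney")  # assumed classes
docs <- VCorpus(VectorSource(speeches))
tdm  <- TermDocumentMatrix(docs)
features <- as.data.frame(t(as.matrix(tdm)))  # documents as rows, terms as columns
features$Label <- labels                      # attach the class labels
```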
- Step V: Train & test data preparation
We split the features and labels into training (70%) and test (30%) sets before we feed them into our model.
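A common way to do the 70/30 split with sample(); `data` here is a toy stand-in for the feature/label frame built in Step IV:

```r
set.seed(42)  # reproducible split
data <- data.frame(matrix(runif(100 * 5), nrow = 100))  # toy feature columns
data$Label <- sample(c("Obama", "Romney"), 100, replace = TRUE)

train_idx <- sample(seq_len(nrow(data)), size = floor(0.7 * nrow(data)))
train <- data[train_idx, ]   # 70% for fitting the model
test  <- data[-train_idx, ]  # 30% held out for evaluation
```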
- Step VI: Running the model: We create our model using the training data separated in the earlier step. We use the kNN (k-nearest neighbours) model, whose description can be found here.
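A sketch of running knn() from the class package; since kNN predicts at query time, training and prediction happen in a single call. The numeric features here are toy data standing in for the term frequencies:

```r
library(class)
set.seed(1)
train_x <- matrix(runif(140), ncol = 2)                  # 70 training documents
train_y <- factor(rep(c("Obama", "Romney"), each = 35))  # their labels
test_x  <- matrix(runif(60), ncol = 2)                   # 30 test documents
# Classify each test document by majority vote of its 3 nearest neighbours
pred <- knn(train = train_x, test = test_x, cl = train_y, k = 3)
```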
- Step VII: Test Model: Now that the model is created, we test its accuracy using the test data created in Step V.
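Accuracy can be read off a confusion matrix of predicted vs. actual labels; the two vectors below are hypothetical outputs from Step VI:

```r
pred  <- factor(c("Obama", "Obama", "Romney", "Romney", "Obama"))
truth <- factor(c("Obama", "Romney", "Romney", "Romney", "Obama"))
conf <- table(Predicted = pred, Actual = truth)  # confusion matrix
accuracy <- sum(diag(conf)) / sum(conf)          # correct / total = 4/5
accuracy
```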
Find the complete code here.
To leave a comment for the author, please follow the link and comment on their blog: Data Perspective.