Preparing the data for modelling with R
Why data preparation?
One of the first things I came across while studying data science was that the three important steps in a data science project are data preparation, creating and testing the model, and reporting. It is widely accepted that data preparation takes up most of the time, followed by creating the model and then reporting. Some argue that we should reduce the time spent on data preparation so that more of it can go into creating and testing the model. But a model is only as good as the data on which it is built: a simple model based on clean data will most likely outperform a complicated model based on dirty or ambiguous data.
Big Mart Sales dataset
Using a regression problem to predict sales as an example, we can go through some of the common situations we face while building a good data set. The dataset I am using is taken from the Big Mart Sales prediction problem at http://www.analyticsvidhya.com/, which I have modified slightly to include some outliers in the response variable Item_Outlet_Sales.
Data description
Item_Identifier : Unique product ID
Item_Weight : Weight of product
Item_Fat_Content : Whether the product is low fat or not
Item_Visibility : The % of total display area of all products in a store allocated to the particular product
Item_Type : The category to which the product belongs
Item_MRP : Maximum Retail Price (list price) of the product
Outlet_Identifier : Unique store ID
Outlet_Establishment_Year : The year in which the store was established
Outlet_Size : The size of the store in terms of ground area covered
Outlet_Location_Type : The type of city in which the store is located
Outlet_Type : Whether the outlet is just a grocery store or some sort of supermarket
Item_Outlet_Sales : Sales of the product in the particular store. This is the outcome variable to be predicted
Dealing with missing values
First, let’s take a look at the missing values. If the number of observations with missing values is much lower than the total number of observations, there is little loss of information in dropping them. I am using the function complete.cases() to check for rows without missing values. The function returns a logical vector indicating which cases are complete, i.e., have no missing values. Please note that this function only looks for NA values; missing values may also appear in other forms, such as blanks in character or factor columns.
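To make that caveat concrete, here is a minimal sketch (assuming Data has already been read in) of how blank-coded entries can be counted alongside the NAs that complete.cases() detects:

#For each column, count entries that are NA or an empty/whitespace-only string;
#blanks are a common disguise for missing values that complete.cases() ignores
sapply(Data, function(x) sum(is.na(x) | trimws(as.character(x)) == ""))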
nrows <- nrow(Data)
ncomplete <- sum(complete.cases(Data))
ncomplete
## [1] 7060
ncomplete/nrows
## [1] 0.8283468
Here we can see that by dropping all the rows with missing values, we would lose about 17% of the data, so we cannot simply drop them.
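Before deciding how to impute, it can also help to see which columns the missing values sit in. A quick check (note that this only counts NAs, not blank-coded values):

#Number of NA values per column
colSums(is.na(Data))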
Data Exploration
Now let’s have a proper look at the data set. We can begin with the response variable; from the information we have, it is a continuous variable. I am using the ggplot2 package for data visualization, which I believe most of you will be familiar with. I will start by showing the distribution of the dependent variable Item_Outlet_Sales.
library(ggplot2)

#Plotting the dependent variable distribution
pl1 <- ggplot(Data, aes(Item_Outlet_Sales))
pl1 + geom_density(fill = "red", alpha = 0.7)
Here we can see that the distribution looks similar to a half-normal distribution. If we take a closer look, there is a sudden spike towards the right end of the graph. This might possibly be a sentinel value. A sentinel value is a special kind of bad numerical value: a value used to represent “unknown”, “not applicable” or other special cases in numeric data. One way to detect sentinel values is to look for sudden jumps in an otherwise smooth distribution of values. We can now take a look at the summary of the Item_Outlet_Sales variable to confirm this.
summary(Data$Item_Outlet_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   33.29  834.90 1798.00 2218.00 3104.00 33330.00
Here we can see that the maximum value is 33333 (displayed as 33330 because summary() rounds to four significant digits), which is nowhere close to the other values. We can now examine these observations to check whether they follow a pattern. If not, we can drop them.
I am now using the dplyr package. If anyone is not familiar with it, please go through the package help; the documentation is comprehensive. The filter() function in dplyr helps us subset the data based on column values.
library(dplyr)

#Creating a data frame with only the outliers
outlier <- Data %>% filter(Item_Outlet_Sales == 33333)
outlier
##    Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1            DRI11          NA          Low Fat      0.03423768
## 2            FDP25      15.200          Low Fat      0.02132748
## 3            FDU55      16.200          Low Fat      0.03598410
## 4            FDF44       7.170          Regular      0.05997133
## 5            FDE41          NA              reg      0.00000000
## 6            FDK04       7.360          Low Fat      0.05260793
## 7            FDJ53      10.500          Low Fat      0.07125791
## 8            FDX08      12.850          Low Fat      0.02264989
## 9            FDE53      10.895          Low Fat      0.02703220
## 10           FDA31       7.100          Low Fat      0.11023479
##                Item_Type Item_MRP Outlet_Identifier
## 1            Hard Drinks 113.2834            OUT027
## 2                 Canned 216.8824            OUT017
## 3  Fruits and Vegetables 260.6278            OUT045
## 4  Fruits and Vegetables 132.1968            OUT018
## 5           Frozen Foods  83.7566            OUT019
## 6           Frozen Foods  56.3588            OUT017
## 7           Frozen Foods 121.3098            OUT046
## 8  Fruits and Vegetables 179.3318            OUT045
## 9           Frozen Foods 106.3280            OUT017
## 10 Fruits and Vegetables 171.7080            OUT045
##    Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1                       1985      Medium               Tier 3
## 2                       2007                           Tier 2
## 3                       2002                           Tier 2
## 4                       2009      Medium               Tier 3
## 5                       1985       Small               Tier 1
## 6                       2007                           Tier 2
## 7                       1997       Small               Tier 1
## 8                       2002                           Tier 2
## 9                       2007                           Tier 2
## 10                      2002                           Tier 2
##          Outlet_Type Item_Outlet_Sales
## 1  Supermarket Type3             33333
## 2  Supermarket Type1             33333
## 3  Supermarket Type1             33333
## 4  Supermarket Type2             33333
## 5      Grocery Store             33333
## 6  Supermarket Type1             33333
## 7  Supermarket Type1             33333
## 8  Supermarket Type1             33333
## 9  Supermarket Type1             33333
## 10 Supermarket Type1             33333
There are only 10 observations with this sentinel value. We can see that Item_Type, Item_MRP, Outlet_Location_Type, Item_Weight and Outlet_Type all differ among these outliers, so it does not look like bad data coming from one particular store or location. Let’s drop them. In general, dealing with these types of values requires domain knowledge.
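As a quick sanity check on that claim, one could tabulate the sentinel rows by store and location before dropping them; a small sketch reusing the outlier data frame created above:

#Do the sentinel rows cluster in one store, location or outlet type?
table(outlier$Outlet_Identifier)
table(outlier$Outlet_Location_Type, outlier$Outlet_Type)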
#Removing the observations with sentinel value
Data <- Data %>% filter(Item_Outlet_Sales != 33333)

pl2 <- ggplot(Data, aes(Item_Outlet_Sales))
pl2 + geom_density(fill = "blue", alpha = 0.5)
summary(Data$Item_Outlet_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   33.29  833.60 1794.00 2182.00 3101.00 13090.00
Now we can explore the remaining variables.
summary(Data)
##  Item_Identifier   Item_Weight     Item_Fat_Content Item_Visibility  
##  FDG33  :  10    Min.   : 4.555   LF     : 316      Min.   :0.00000  
##  FDW13  :  10    1st Qu.: 8.775   low fat: 112      1st Qu.:0.02699  
##  DRE49  :   9    Median :12.600   Low Fat:5081      Median :0.05395  
##  DRN47  :   9    Mean   :12.860   reg    : 116      Mean   :0.06616  
##  FDD38  :   9    3rd Qu.:16.850   Regular:2888      3rd Qu.:0.09466  
##  FDF52  :   9    Max.   :21.350                     Max.   :0.32839  
##  (Other):8457    NA's   :1461                                        
##                   Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934      
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932      
##  Household            : 910   Median :143.02   OUT035 : 930      
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930      
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929      
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927      
##  (Other)              :2993                    (Other):2931      
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985                    :2404   Tier 1:2386         
##  1st Qu.:1987              High  : 932   Tier 2:2779         
##  Median :1999              Medium:2791   Tier 3:3348         
##  Mean   :1998              Small :2386                       
##  3rd Qu.:2004                                                
##  Max.   :2009                                                
##                                                              
##              Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082    Min.   :   33.29 
##  Supermarket Type1:5570    1st Qu.:  833.58 
##  Supermarket Type2: 927    Median : 1794.33 
##  Supermarket Type3: 934    Mean   : 2181.56 
##                            3rd Qu.: 3101.30 
##                            Max.   :13086.97 
## 
Looking at the Item_Weight variable, we can see that there are 1461 missing values. We can also see that the Item_Fat_Content variable is coded inconsistently: the same factor levels are written in different ways. There are 2404 missing values in the Outlet_Size variable. Another interesting point concerns the Item_Visibility variable: in my opinion, there can’t be any item with 0 visibility, as no item in a supermarket or grocery store is supposed to be completely invisible to customers.
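A quick check of how many observations carry that implausible zero (a one-liner on the cleaned Data from above):

#How many rows have Item_Visibility recorded as exactly zero?
sum(Data$Item_Visibility == 0)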
Let’s treat them one by one:
First let’s recode the Item_Fat_Content variable. There are really only 2 levels, Regular and low fat, but they are coded as 5 different levels: LF, low fat, Low Fat, reg and Regular. We can recode them into lowfat and Regular.
#with gsub replacing the levels with Regular or lowfat as required
Data$Item_Fat_Content <- gsub("LF", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("low fat", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("Low Fat", "lowfat", Data$Item_Fat_Content)
Data$Item_Fat_Content <- gsub("reg", "Regular", Data$Item_Fat_Content)
Data$Item_Fat_Content <- as.factor(Data$Item_Fat_Content)

summary(Data$Item_Fat_Content)
##  lowfat Regular 
##    5509    3004
Now let us replace the missing values in the Item_Weight variable. There are many ways to deal with missing values in a continuous variable, including mean replacement, median replacement, replacing with an arbitrary constant, regression methods, etc. I will be using mean replacement and regression in this example: mean replacement for Item_Weight and regression for Item_Visibility. In real projects, these methods are chosen based on requirements. Normally we use mean replacement for variables which have lower predictive power for the final response variable.
#Using mean to replace the missing values in Item_Weight variable
MeanItem_Weight <- mean(Data$Item_Weight[!is.na(Data$Item_Weight)])
Data$Item_Weight[is.na(Data$Item_Weight)] <- MeanItem_Weight

#Using regression to replace the zeros in Item_Visibility variable
Data_1 <- Data %>% filter(Item_Visibility != 0)
visibility_model <- lm(Item_Visibility ~ Item_Weight + Item_Fat_Content + Item_Type +
                         Item_MRP + Outlet_Establishment_Year + Outlet_Size +
                         Outlet_Location_Type + Item_Outlet_Sales, data = Data_1)
Data$Item_Visibility[Data$Item_Visibility == 0] <- predict(visibility_model,
                                                           newdata = Data[Data$Item_Visibility == 0, ])
Finally, we have to fill in the missing values in the Outlet_Size variable by classification. I am using the random forest algorithm: in my experience it works well for classification models, as it has the advantage of being an ensemble method. I am using the randomForest package, which has a very good implementation of the algorithm. The dataset is split into train and test sets using the caTools package, a handy tool for splitting datasets for machine learning. The function sample.split() does the splitting: it marks the rows as TRUE or FALSE, and normally we use the TRUE subset for training and the FALSE subset for testing.
library(caTools)
set.seed(100)

Data$Outlet_Size <- as.character(Data$Outlet_Size)
Storetypes <- subset(Data, Outlet_Size != "")

spl <- sample.split(Storetypes$Outlet_Size, SplitRatio = 0.8)
Train <- subset(Storetypes, spl == TRUE)
Test <- subset(Storetypes, spl == FALSE)

###Using Random Forest for classification
library(randomForest)
Train$Outlet_Size <- as.factor(Train$Outlet_Size)
Test$Outlet_Size <- as.factor(Test$Outlet_Size)

###Creating the model
SizeForest <- randomForest(Outlet_Size ~ . - Item_Outlet_Sales - Item_Identifier,
                           data = Train, nodesize = 25, ntree = 100)

###Predicting on the test set
PredictForest <- predict(SizeForest, newdata = Test)

#Confusion matrix for accuracy
table(Test$Outlet_Size, PredictForest)
##         PredictForest
##          High Medium Small
##   High    186      0     0
##   Medium    0    558     0
##   Small     0      0   477
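The confusion matrix above is diagonal, but it can still be handy to summarise it as a single accuracy number. A small sketch reusing the objects created above (conf_mat is just a helper name introduced here):

#Overall accuracy: correctly classified test cases / total test cases
conf_mat <- table(Test$Outlet_Size, PredictForest)
sum(diag(conf_mat)) / sum(conf_mat)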
###Classifying the missing values in the dataset
#Note: this predicts Outlet_Size for every row, overwriting the known values too;
#the confusion matrix above shows the model reproduces the known sizes exactly
Data$Outlet_Size <- predict(SizeForest, newdata = Data)
Now we can check the complete dataset once more. We can see that the problem of missing values is resolved and the factors are properly coded.
summary(Data)
##  Item_Identifier   Item_Weight     Item_Fat_Content Item_Visibility   
##  FDG33  :  10    Min.   : 4.555   lowfat :5509     Min.   :0.003575   
##  FDW13  :  10    1st Qu.: 9.310   Regular:3004     1st Qu.:0.033088   
##  DRE49  :   9    Median :12.860                    Median :0.060615   
##  DRN47  :   9    Mean   :12.860                    Mean   :0.070478   
##  FDD38  :   9    3rd Qu.:16.000                    3rd Qu.:0.096000   
##  FDF52  :   9    Max.   :21.350                    Max.   :0.328391   
##  (Other):8457                                                         
##                   Item_Type       Item_MRP      Outlet_Identifier
##  Fruits and Vegetables:1228   Min.   : 31.29   OUT027 : 934      
##  Snack Foods          :1200   1st Qu.: 93.81   OUT013 : 932      
##  Household            : 910   Median :143.02   OUT035 : 930      
##  Frozen Foods         : 852   Mean   :140.99   OUT049 : 930      
##  Dairy                : 682   3rd Qu.:185.66   OUT046 : 929      
##  Baking Goods         : 648   Max.   :266.89   OUT018 : 927      
##  (Other)              :2993                    (Other):2931      
##  Outlet_Establishment_Year Outlet_Size   Outlet_Location_Type
##  Min.   :1985              High  : 932   Tier 1:2386         
##  1st Qu.:1987              Medium:5195   Tier 2:2779         
##  Median :1999              Small :2386   Tier 3:3348         
##  Mean   :1998                                                
##  3rd Qu.:2004                                                
##  Max.   :2009                                                
##                                                              
##              Outlet_Type   Item_Outlet_Sales
##  Grocery Store    :1082    Min.   :   33.29 
##  Supermarket Type1:5570    1st Qu.:  833.58 
##  Supermarket Type2: 927    Median : 1794.33 
##  Supermarket Type3: 934    Mean   : 2181.56 
##                            3rd Qu.: 3101.30 
##                            Max.   :13086.97 
## 
The dataset is now ready for modelling.