Methods for dealing with imbalanced data

Modeling with R

3 years ago

[This article was first published on Modeling with R, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Introduction

The imbalanced data is the common feature of some type of data such as fraudulent credit card where the number of fraudulent cards is usually very small compared to the number of non fraudulent cards. The problem with imbalanced data is that the model being trained would be dominated by the majority class such as knn and svm models, and hence they would predict the majority class more effectively than the minority class which in turn would result in high value for sensitivity rate and low value for specificity rate (in binary classification).

The simple technique to reduce the negative impact of this problem is by subsampling the data. the common subsampling methods used in practice are the following.

Upsampling: this method increases the size of the minority class by sampling with replacement so that the classes will have the same size.
Downsampling: in contrast to the above method, this one decreases the size of the majority class to be the same or closer to the minority class size by just taking out a random sample.
Hybrid methods : The well known hybrid methods are ROSE (Random oversampling examples), and SMOTE (Synthetic minority oversampling technique), they downsample the majority class, and creat new artificial points in the minority class. For more detail about SMOTE method click here, and for ROSE click here.

Note: all the above methods should be applied only on the training set , the testing set must be never touched until the final model evaluation step.

Some type of models can handle imbalanced data such as deep learning model with the argument class_weight wich adds more weights to the minority class cases. Other models, however, such as svm or knn we have to make use of one of the above methods before training these type of models.

In this article we will make use of the creditcard data from kaggle website -click here to upload this data, which is highly imbalanced- and we will train a logistic regression model on the raw data and on the transformed data after applying the above methods and comparing the results. Also, we will use a simple deep learning model with and without taking into account the imbalanced problem.

First we call the data.

spsm(library(tidyverse))
data<-read.csv("../sparklyr/creditcard.csv",header=TRUE)

For privacy purposes the original features are replaced by the PCA variables from v1 to v28 and only Time and Amount features that are left from the original features.

Let’s first check Class variable levels frequency (after having been converted to a factor type).

data$Class<-as.factor(data$Class)
prop.table(table(data$Class))
## 
##           0           1 
## 0.998272514 0.001727486

As we see the minority class number “1” is only about 0.17% of the total cases. We also need to show the summary of the data to take an overall look at all the features to be aware of missing values or unusual outliers.

summary(data)
##       Time              V1                  V2                  V3          
##  Min.   :     0   Min.   :-56.40751   Min.   :-72.71573   Min.   :-48.3256  
##  1st Qu.: 54202   1st Qu.: -0.92037   1st Qu.: -0.59855   1st Qu.: -0.8904  
##  Median : 84692   Median :  0.01811   Median :  0.06549   Median :  0.1799  
##  Mean   : 94814   Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.0000  
##  3rd Qu.:139321   3rd Qu.:  1.31564   3rd Qu.:  0.80372   3rd Qu.:  1.0272  
##  Max.   :172792   Max.   :  2.45493   Max.   : 22.05773   Max.   :  9.3826  
##        V4                 V5                   V6                 V7          
##  Min.   :-5.68317   Min.   :-113.74331   Min.   :-26.1605   Min.   :-43.5572  
##  1st Qu.:-0.84864   1st Qu.:  -0.69160   1st Qu.: -0.7683   1st Qu.: -0.5541  
##  Median :-0.01985   Median :  -0.05434   Median : -0.2742   Median :  0.0401  
##  Mean   : 0.00000   Mean   :   0.00000   Mean   :  0.0000   Mean   :  0.0000  
##  3rd Qu.: 0.74334   3rd Qu.:   0.61193   3rd Qu.:  0.3986   3rd Qu.:  0.5704  
##  Max.   :16.87534   Max.   :  34.80167   Max.   : 73.3016   Max.   :120.5895  
##        V8                  V9                 V10                 V11          
##  Min.   :-73.21672   Min.   :-13.43407   Min.   :-24.58826   Min.   :-4.79747  
##  1st Qu.: -0.20863   1st Qu.: -0.64310   1st Qu.: -0.53543   1st Qu.:-0.76249  
##  Median :  0.02236   Median : -0.05143   Median : -0.09292   Median :-0.03276  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.32735   3rd Qu.:  0.59714   3rd Qu.:  0.45392   3rd Qu.: 0.73959  
##  Max.   : 20.00721   Max.   : 15.59500   Max.   : 23.74514   Max.   :12.01891  
##       V12                V13                V14                V15          
##  Min.   :-18.6837   Min.   :-5.79188   Min.   :-19.2143   Min.   :-4.49894  
##  1st Qu.: -0.4056   1st Qu.:-0.64854   1st Qu.: -0.4256   1st Qu.:-0.58288  
##  Median :  0.1400   Median :-0.01357   Median :  0.0506   Median : 0.04807  
##  Mean   :  0.0000   Mean   : 0.00000   Mean   :  0.0000   Mean   : 0.00000  
##  3rd Qu.:  0.6182   3rd Qu.: 0.66251   3rd Qu.:  0.4931   3rd Qu.: 0.64882  
##  Max.   :  7.8484   Max.   : 7.12688   Max.   : 10.5268   Max.   : 8.87774  
##       V16                 V17                 V18           
##  Min.   :-14.12985   Min.   :-25.16280   Min.   :-9.498746  
##  1st Qu.: -0.46804   1st Qu.: -0.48375   1st Qu.:-0.498850  
##  Median :  0.06641   Median : -0.06568   Median :-0.003636  
##  Mean   :  0.00000   Mean   :  0.00000   Mean   : 0.000000  
##  3rd Qu.:  0.52330   3rd Qu.:  0.39968   3rd Qu.: 0.500807  
##  Max.   : 17.31511   Max.   :  9.25353   Max.   : 5.041069  
##       V19                 V20                 V21           
##  Min.   :-7.213527   Min.   :-54.49772   Min.   :-34.83038  
##  1st Qu.:-0.456299   1st Qu.: -0.21172   1st Qu.: -0.22839  
##  Median : 0.003735   Median : -0.06248   Median : -0.02945  
##  Mean   : 0.000000   Mean   :  0.00000   Mean   :  0.00000  
##  3rd Qu.: 0.458949   3rd Qu.:  0.13304   3rd Qu.:  0.18638  
##  Max.   : 5.591971   Max.   : 39.42090   Max.   : 27.20284  
##       V22                  V23                 V24          
##  Min.   :-10.933144   Min.   :-44.80774   Min.   :-2.83663  
##  1st Qu.: -0.542350   1st Qu.: -0.16185   1st Qu.:-0.35459  
##  Median :  0.006782   Median : -0.01119   Median : 0.04098  
##  Mean   :  0.000000   Mean   :  0.00000   Mean   : 0.00000  
##  3rd Qu.:  0.528554   3rd Qu.:  0.14764   3rd Qu.: 0.43953  
##  Max.   : 10.503090   Max.   : 22.52841   Max.   : 4.58455  
##       V25                 V26                V27            
##  Min.   :-10.29540   Min.   :-2.60455   Min.   :-22.565679  
##  1st Qu.: -0.31715   1st Qu.:-0.32698   1st Qu.: -0.070840  
##  Median :  0.01659   Median :-0.05214   Median :  0.001342  
##  Mean   :  0.00000   Mean   : 0.00000   Mean   :  0.000000  
##  3rd Qu.:  0.35072   3rd Qu.: 0.24095   3rd Qu.:  0.091045  
##  Max.   :  7.51959   Max.   : 3.51735   Max.   : 31.612198  
##       V28                Amount         Class     
##  Min.   :-15.43008   Min.   :    0.00   0:284315  
##  1st Qu.: -0.05296   1st Qu.:    5.60   1:   492  
##  Median :  0.01124   Median :   22.00             
##  Mean   :  0.00000   Mean   :   88.35             
##  3rd Qu.:  0.07828   3rd Qu.:   77.17             
##  Max.   : 33.84781   Max.   :25691.16

looking at this summary, we do not have any critical issues like missing values for instance.

Data partition

Before applying any subsampling method we split the data first between the training set and the testing set and we use only the former to be subsampled.

spsm(library(caret))
set.seed(1234)
index<-createDataPartition(data$Class,p=0.8,list=FALSE)
train<-data[index,]
test<-data[-index,]

Subsampling the training data

Upsampling :

The caret package provides a function called upSample to perform upsampling technique.

set.seed(111)

trainup<-upSample(x=train[,-ncol(train)],
                  y=train$Class)

table(trainup$Class)
## 
##      0      1 
## 227452 227452

As we see the two classes now have the same size 227452

downsampling:

By the some way we make use of the caret function downSample

set.seed(111)
traindown<-downSample(x=train[,-ncol(train)],
                  y=train$Class)

table(traindown$Class)
## 
##   0   1 
## 394 394

now the size of each class is 394

ROSE:

To use this technique we have to call the ROSE package

spsm(library(ROSE))
set.seed(111)

trainrose<-ROSE(Class~.,data=train)$data

table(trainrose$Class)
## 
##      0      1 
## 113827 114019

since this technique add new synthetic data points to the minority class and daownsamples the majority class the size now is about 114019 for minority class and 113827 for the majority class.

SMOTE:

this technique requires the DMwR package.

spsm(library(DMwR))
set.seed(111)

trainsmote <- SMOTE(Class~.,data = train)

table(trainsmote$Class)
## 
##    0    1 
## 1576 1182

The size of the majority class is 113827 and for the minority class is 114019 .

training logistic regression model.

we are now ready to fit logit model to the original training set without subsampling, and to each of the above subsampled training sets.

without subsampling

set.seed(123)
model <- glm(Class~., data=train, family = "binomial")
                
summary(model)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9290  -0.0291  -0.0190  -0.0124   4.6028  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.486e+00  2.852e-01 -29.753  < 2e-16 ***
## Time        -2.673e-06  2.528e-06  -1.057  0.29037    
## V1           9.397e-02  4.794e-02   1.960  0.04996 *  
## V2           1.097e-02  6.706e-02   0.164  0.87006    
## V3           1.290e-03  5.949e-02   0.022  0.98270    
## V4           6.851e-01  8.408e-02   8.148 3.69e-16 ***
## V5           1.472e-01  7.301e-02   2.017  0.04372 *  
## V6          -8.450e-02  7.902e-02  -1.069  0.28491    
## V7          -1.098e-01  7.591e-02  -1.446  0.14816    
## V8          -1.718e-01  3.402e-02  -5.050 4.41e-07 ***
## V9          -1.926e-01  1.258e-01  -1.531  0.12579    
## V10         -8.073e-01  1.118e-01  -7.224 5.07e-13 ***
## V11         -3.920e-03  9.131e-02  -0.043  0.96575    
## V12          2.855e-02  9.432e-02   0.303  0.76210    
## V13         -3.064e-01  9.007e-02  -3.401  0.00067 ***
## V14         -5.308e-01  6.816e-02  -7.787 6.86e-15 ***
## V15         -1.285e-01  9.559e-02  -1.344  0.17903    
## V16         -2.164e-01  1.423e-01  -1.520  0.12840    
## V17          2.913e-02  7.729e-02   0.377  0.70624    
## V18         -3.642e-02  1.445e-01  -0.252  0.80095    
## V19          6.064e-02  1.094e-01   0.554  0.57938    
## V20         -4.449e-01  9.737e-02  -4.570 4.89e-06 ***
## V21          3.661e-01  6.709e-02   5.456 4.87e-08 ***
## V22          5.965e-01  1.519e-01   3.927 8.59e-05 ***
## V23         -1.157e-01  6.545e-02  -1.768  0.07706 .  
## V24          8.146e-02  1.625e-01   0.501  0.61622    
## V25          4.325e-02  1.482e-01   0.292  0.77043    
## V26         -2.679e-01  2.226e-01  -1.203  0.22893    
## V27         -7.280e-01  1.542e-01  -4.720 2.36e-06 ***
## V28         -2.817e-01  9.864e-02  -2.856  0.00429 ** 
## Amount       9.154e-04  4.379e-04   2.091  0.03656 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1768.0  on 227815  degrees of freedom
## AIC: 1830
## 
## Number of Fisher Scoring iterations: 12

At this step and to make things more simpler, we remove the insignificant variables (without asterix) and we keep the remaining ones to use in all the following models.

set.seed(123)
model1 <- glm(Class~.-Time-V2-V3-V6-V7-V9-V11-V12-V15-V16-V17-V18-V19-V24-V25-V26, data=train, family = "binomial")
                
summary(model1)
## 
## Call:
## glm(formula = Class ~ . - Time - V2 - V3 - V6 - V7 - V9 - V11 - 
##     V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6514  -0.0290  -0.0186  -0.0117   4.6192  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.763e+00  1.510e-01 -58.023  < 2e-16 ***
## V1           2.108e-02  2.918e-02   0.722 0.470129    
## V4           7.241e-01  6.306e-02  11.483  < 2e-16 ***
## V5           9.934e-02  3.566e-02   2.785 0.005346 ** 
## V8          -1.549e-01  2.178e-02  -7.115 1.12e-12 ***
## V10         -9.290e-01  9.305e-02  -9.985  < 2e-16 ***
## V13         -3.307e-01  8.577e-02  -3.855 0.000116 ***
## V14         -5.229e-01  5.566e-02  -9.396  < 2e-16 ***
## V20         -2.388e-01  6.005e-02  -3.976 7.01e-05 ***
## V21          4.811e-01  5.259e-02   9.148  < 2e-16 ***
## V22          7.675e-01  1.277e-01   6.011 1.84e-09 ***
## V23         -1.522e-01  5.925e-02  -2.569 0.010212 *  
## V27         -6.381e-01  1.295e-01  -4.927 8.34e-07 ***
## V28         -2.485e-01  9.881e-02  -2.515 0.011900 *  
## Amount       2.713e-07  1.290e-04   0.002 0.998323    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1798.7  on 227831  degrees of freedom
## AIC: 1828.7
## 
## Number of Fisher Scoring iterations: 11

We have now two predictors that are non significant V1 and Amount, they should be also removed.

set.seed(123)
finalmodel <- glm(Class~.-Time-V1-V2-V3-V6-V7-V9-V11-V12-V15-V16-V17-V18-V19-V24-V25-V26-Amount, data=train, family = "binomial")
                
summary(finalmodel)
## 
## Call:
## glm(formula = Class ~ . - Time - V1 - V2 - V3 - V6 - V7 - V9 - 
##     V11 - V12 - V15 - V16 - V17 - V18 - V19 - V24 - V25 - V26 - 
##     Amount, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.6285  -0.0289  -0.0186  -0.0117   4.5835  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.75058    0.14706 -59.505  < 2e-16 ***
## V4           0.69955    0.05265  13.288  < 2e-16 ***
## V5           0.10650    0.02586   4.119 3.81e-05 ***
## V8          -0.15525    0.01982  -7.833 4.76e-15 ***
## V10         -0.89573    0.07630 -11.740  < 2e-16 ***
## V13         -0.33583    0.08448  -3.975 7.02e-05 ***
## V14         -0.54238    0.04862 -11.155  < 2e-16 ***
## V20         -0.22318    0.04781  -4.668 3.04e-06 ***
## V21          0.47912    0.05205   9.204  < 2e-16 ***
## V22          0.78631    0.12439   6.321 2.60e-10 ***
## V23         -0.15046    0.05498  -2.736  0.00621 ** 
## V27         -0.58832    0.10411  -5.651 1.60e-08 ***
## V28         -0.23592    0.08901  -2.651  0.00804 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5799.1  on 227845  degrees of freedom
## Residual deviance: 1799.2  on 227833  degrees of freedom
## AIC: 1825.2
## 
## Number of Fisher Scoring iterations: 11

For the other training sets we will use only these significant predictors from the above model.

Now let’s get the final results from the confusion matrix.

pred <- predict(finalmodel,test, type="response")
pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),test$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56856    41
##          1     7    57
##                                           
##                Accuracy : 0.9992          
##                  95% CI : (0.9989, 0.9994)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : 1.581e-08       
##                                           
##                   Kappa : 0.7033          
##                                           
##  Mcnemar's Test P-Value : 1.906e-06       
##                                           
##             Sensitivity : 0.9999          
##             Specificity : 0.5816          
##          Pos Pred Value : 0.9993          
##          Neg Pred Value : 0.8906          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9982          
##    Detection Prevalence : 0.9989          
##       Balanced Accuracy : 0.7908          
##                                           
##        'Positive' Class : 0               
##

As we see we have a large accuracy rate about 99.92%. However, this rate is almost the same as the no information rate 99.83% (if we predict all the cases as class label 0). In other words this high rate is not due to the quality of the model but rather due to the imbalanced classes. if we look at the specificity rate. it is about 58.16% indicating that the model poorly predict the fraudulent cards which is the most important class label that we want to predict correctly. Among the available metrics, the best one for imbalanced data is cohen’s kappa statistic. and according to the scale of kappa value interpretation suggested by Landis & Koch (1977), the kappa value obtained here 0.7033 is a good score.

But here we stick with accuracy rate for pedagogic purposes to show the effectiveness of the above discussed methods.

Upsampling the train set

Now let’s use the training data resulted from the upsmpling method.

set.seed(123)
modelup <- glm(Class~V4+V5+V8+V10+V13+V14+V20+V21+V22+V23+V27+V28, data=trainup, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(modelup)
## 
## Call:
## glm(formula = Class ~ V4 + V5 + V8 + V10 + V13 + V14 + V20 + 
##     V21 + V22 + V23 + V27 + V28, family = "binomial", data = trainup)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.2906  -0.2785  -0.0001   0.0159   2.8055  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -3.271053   0.011741 -278.610  < 2e-16 ***
## V4           0.952941   0.005478  173.966  < 2e-16 ***
## V5           0.126627   0.003976   31.846  < 2e-16 ***
## V8          -0.289448   0.004368  -66.261  < 2e-16 ***
## V10         -0.710629   0.009150  -77.665  < 2e-16 ***
## V13         -0.479344   0.007352  -65.200  < 2e-16 ***
## V14         -0.802941   0.006825 -117.638  < 2e-16 ***
## V20         -0.090453   0.007955  -11.371  < 2e-16 ***
## V21          0.233604   0.007702   30.332  < 2e-16 ***
## V22          0.209203   0.010125   20.662  < 2e-16 ***
## V23         -0.320073   0.005299  -60.399  < 2e-16 ***
## V27         -0.238132   0.017019  -13.992  < 2e-16 ***
## V28         -0.152294   0.019922   -7.644  2.1e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 630631  on 454903  degrees of freedom
## Residual deviance: 136321  on 454891  degrees of freedom
## AIC: 136347
## 
## Number of Fisher Scoring iterations: 9
pred <- predict(modelup,test, type="response")
pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),test$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55334    12
##          1  1529    86
##                                           
##                Accuracy : 0.9729          
##                  95% CI : (0.9716, 0.9743)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0975          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.97311         
##             Specificity : 0.87755         
##          Pos Pred Value : 0.99978         
##          Neg Pred Value : 0.05325         
##              Prevalence : 0.99828         
##          Detection Rate : 0.97144         
##    Detection Prevalence : 0.97165         
##       Balanced Accuracy : 0.92533         
##                                           
##        'Positive' Class : 0               
##

Now we have a smaller accuracy rate 97.29%, but we have a larger specificity rate 87.75% which increases the power of the model to predict the fraudulent cards.

Down sampling the training set.

set.seed(123)
modeldown <- glm(Class~V4+V5+V8+V10+V13+V14+V20+V21+V22+V23+V27+V28, data=traindown, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred <- predict(modeldown,test, type="response")
pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),test$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 54837    12
##          1  2026    86
##                                           
##                Accuracy : 0.9642          
##                  95% CI : (0.9627, 0.9657)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0748          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.96437         
##             Specificity : 0.87755         
##          Pos Pred Value : 0.99978         
##          Neg Pred Value : 0.04072         
##              Prevalence : 0.99828         
##          Detection Rate : 0.96271         
##    Detection Prevalence : 0.96292         
##       Balanced Accuracy : 0.92096         
##                                           
##        'Positive' Class : 0               
##

With downsampling method, we get approximately the same specificity rate 87.75% with a slight decrease of the over all accuracy rate 96.42%, and the sensitivity rate has decreased to 96.43% since we have decreased the majority class size by downsampling.

subsampline the train set by ROSE technique

set.seed(123)
modelrose <- glm(Class~V4+V5+V8+V10+V13+V14+V20+V21+V22+V23+V27+V28, data=trainrose, family = "binomial")
pred <- predict(modelrose,test, type="response")
pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),test$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56080    14
##          1   783    84
##                                         
##                Accuracy : 0.986         
##                  95% CI : (0.985, 0.987)
##     No Information Rate : 0.9983        
##     P-Value [Acc > NIR] : 1             
##                                         
##                   Kappa : 0.1715        
##                                         
##  Mcnemar's Test P-Value : <2e-16        
##                                         
##             Sensitivity : 0.98623       
##             Specificity : 0.85714       
##          Pos Pred Value : 0.99975       
##          Neg Pred Value : 0.09689       
##              Prevalence : 0.99828       
##          Detection Rate : 0.98453       
##    Detection Prevalence : 0.98478       
##       Balanced Accuracy : 0.92169       
##                                         
##        'Positive' Class : 0             
##

Using this method the sensitivity rate is slightly smaller than the previous ones 85.71% but still a large improvement in predicting fraudulent cards compared to the model trained with the original imbalanced data.

Subsampling the train set by SMOTE technique

set.seed(123)
modelsmote <- glm(Class~V4+V5+V8+V10+V13+V14+V20+V21+V22+V23+V27+V28, data=trainsmote, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
pred <- predict(modelsmote,test, type="response")
pred <- as.integer(pred>0.5)
confusionMatrix(as.factor(pred),test$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55457    14
##          1  1406    84
##                                           
##                Accuracy : 0.9751          
##                  95% CI : (0.9738, 0.9763)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1029          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.97527         
##             Specificity : 0.85714         
##          Pos Pred Value : 0.99975         
##          Neg Pred Value : 0.05638         
##              Prevalence : 0.99828         
##          Detection Rate : 0.97360         
##    Detection Prevalence : 0.97384         
##       Balanced Accuracy : 0.91621         
##                                           
##        'Positive' Class : 0               
##

With this method we get the same specificity rate 85.71% such as ROSE method.

deep learning model (without class weight).

When we use deep learning models via some software we can assign a weight to the labels of the target variables. For us we will make use of keras package. We will first train the model without weighting the data , Then we retrain the same model after assigning weight to the minority class.
To train this model we should first convert the data (train and test sets) into numeric matrix and remove the column names (we convert also the Class to numeric type). However, in order to be inline with the above models we keep only their features, but this time it would be better to be normalized since this helps the gradient running more faster.

spsm(library(keras))
train1 <-train[,c('V4','V5','V8','V10','V13','V14','V20','V21','V22','V23','V27','V28','Class')]
test1 <-test[,c('V4','V5','V8','V10','V13','V14','V20','V21','V22','V23','V27','V28','Class')]
train1$Class<-as.numeric(train1$Class)
test1$Class<-as.numeric(test1$Class)
train1[,'Class']<-train1[,'Class']-1
test1[,'Class']<-test1[,'Class']-1
trainx <- train1[,-ncol(train1)]
testx <- test1[,-ncol(test1)]
trained<-as.matrix(trainx)
tested <- as.matrix(testx)
trainy <- train1$Class
testy <- test1$Class
dimnames(trained)<-NULL
dimnames(tested)<-NULL

then we apply one hot encoding on the target variable.

trainlabel<-to_categorical(trainy)
testlabel<-to_categorical(testy)

The final step now is normalizing the matrices (trained and tested)

trained1<-normalize(trained)
tested1<-normalize(tested)

Now we are ready to create the model with two hidden layers followed by dropout layers.

modeldeep <- keras_model_sequential()

modeldeep %>%
    layer_dense(units=32, activation = "relu",
              kernel_initializer = "he_normal",input_shape =c(12))%>%
    layer_dropout(rate=0.2)%>%
    layer_dense(units=64, activation = "relu",
              kernel_initializer = "he_normal")%>%
    layer_dropout(rate=0.4)%>%
        layer_dense(units=2, activation = "sigmoid")

summary(modeldeep)
## Model: "sequential"
## ________________________________________________________________________________
## Layer (type)                        Output Shape                    Param #     
## ================================================================================
## dense (Dense)                       (None, 32)                      416         
## ________________________________________________________________________________
## dropout (Dropout)                   (None, 32)                      0           
## ________________________________________________________________________________
## dense_1 (Dense)                     (None, 64)                      2112        
## ________________________________________________________________________________
## dropout_1 (Dropout)                 (None, 64)                      0           
## ________________________________________________________________________________
## dense_2 (Dense)                     (None, 2)                       130         
## ================================================================================
## Total params: 2,658
## Trainable params: 2,658
## Non-trainable params: 0
## ________________________________________________________________________________

we will use the accuracy rate as the metric. The loss function will be binary crossentropy since we deal with binary classification problem. and for the optimizer we will use adam optimizer.

modeldeep %>%
  compile(loss="binary_crossentropy",
          optimizer="adam",
          metric="accuracy")

During training, the model will use 10 epochs (the default), 5 sample as batch size to update the weights, and keep 20% of the inputs (training samples) out to assess the model

#history<- modeldeep %>%
  #fit(trained1,trainlabel,batch_size=5, validation_split=0.2)

You can run this model many times untill you get satisfied with the results, then it will be better to save it and load it again each time you need it as follows.

#save_model_hdf5(modeldeep,"modeldeep.h5")
modeldeep<-load_model_hdf5("modeldeep.h5")

All the above metric values are used in the training process, so they are not much reliable. The more reliable ones are those computed from unseen data.

pred<-  modeldeep %>%
  predict_classes(tested1)
confusionMatrix(as.factor(pred),as.factor(testy))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 56858    64
##          1     5    34
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9985, 0.9991)
##     No Information Rate : 0.9983          
##     P-Value [Acc > NIR] : 0.00125         
##                                           
##                   Kappa : 0.4959          
##                                           
##  Mcnemar's Test P-Value : 2.902e-12       
##                                           
##             Sensitivity : 0.9999          
##             Specificity : 0.3469          
##          Pos Pred Value : 0.9989          
##          Neg Pred Value : 0.8718          
##              Prevalence : 0.9983          
##          Detection Rate : 0.9982          
##    Detection Prevalence : 0.9993          
##       Balanced Accuracy : 0.6734          
##                                           
##        'Positive' Class : 0               
##

The same as the above models, the specificity rate is even worst than the other models 0.3469 which is also caused by the imbalanced data.

deep learning model with class weights

Now let’s try the previous model by taking into account the class imbalance

modeldeep1 <- keras_model_sequential()

modeldeep1 %>%
    layer_dense(units=32, activation = "relu",
              kernel_initializer = "he_normal",input_shape =c(12))%>%
    layer_dropout(rate=0.2)%>%
    layer_dense(units=64, activation = "relu",
              kernel_initializer = "he_normal")%>%
    layer_dropout(rate=0.4)%>%
        layer_dense(units=2, activation = "sigmoid")

modeldeep1 %>%
  compile(loss="binary_crossentropy",
          optimizer="adam",
          metric="accuracy")

To define the appropriate weight, we divide the fraction of the majority class by the fraction of the minority class to get how many times the former is larger than the latter.

prop.table(table(data$Class))[1]/prop.table(table(data$Class))[2]
##       0 
## 577.876

Now we include this value as weight in the class_weight argument.

#history1<- modeldeep1 %>%
  #fit(trained1,trainlabel,batch_size=5,  #validation_split=0.2,class_weight=list("0"=1,"1"=577))

Again i should save this model before kniting the document. For you if you want to run the above code just uncomment it.

#save_model_hdf5(modeldeep1,"modeldeep1.h5")
modeldeep1<-load_model_hdf5("modeldeep1.h5")

Now let’s get the confusion matrix.

pred<-  modeldeep1 %>%
  predict_classes(tested1)
confusionMatrix(as.factor(pred),as.factor(testy))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 55303    14
##          1  1560    84
##                                          
##                Accuracy : 0.9724         
##                  95% CI : (0.971, 0.9737)
##     No Information Rate : 0.9983         
##     P-Value [Acc > NIR] : 1              
##                                          
##                   Kappa : 0.0935         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.97257        
##             Specificity : 0.85714        
##          Pos Pred Value : 0.99975        
##          Neg Pred Value : 0.05109        
##              Prevalence : 0.99828        
##          Detection Rate : 0.97089        
##    Detection Prevalence : 0.97114        
##       Balanced Accuracy : 0.91485        
##                                          
##        'Positive' Class : 0              
##

Using this model we get less accuracy rate 0.9724, but the specificity rate is higher compared to the previous model so that this model can well predict the negative class label as well as the postive class label.

Conclusion

With the imbalanced data most machine learning model tend to more efficiently predict the majority class than the minority class. To correct thus this behavior we can use one of the above discussed methods to get more closer accuracy rates between classes. However, deep learning model can easily handle this problem by specifying the class weights.

To leave a comment for the author, please follow the link and comment on their blog: Modeling with R.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.

Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.