A comment on preparing data for classifiers
I have been working through (with some honest appreciation) a recent article comparing many classifiers on many data sets: “Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?” by Manuel Fernández-Delgado, Eva Cernadas, Senén Barro, and Dinani Amorim; Journal of Machine Learning Research 15(Oct):3133−3181, 2014 (which we will call “the DWN paper” in this note). This paper applies 179 popular classifiers to around 120 data sets (mostly from the UCI Machine Learning Repository). The work looks good and interesting, but we do have one quibble with the data preparation on 8 of the 123 shared data sets. Given the paper is already published (not just in pre-print), I think it is appropriate to comment publicly.
The DWN paper is an interesting empirical study that measures the performance of a good number of popular classifiers (179, by their own account) on about 120 data sets (mostly from UCI).
This actually represents a bit of work, as the UCI data sets are not all in exactly the same format. The data sets have varying file names, varying separators, varying missing-value symbols, varying quoting/escaping conventions, non-machine-readable headers, row-ids in some data sets, the column to be predicted in varying positions, some data in zip files, and many other painful variations. I have always described UCI as “not quite machine readable.” Working with any one data set is easy, but the prospect of building an adapter for each of a large number of such data sets is unappealing. Combined with the fact that the data sets are often small, and often artificial/synthetic (designed to show off one particular inference method), few people work with more than a few of these data sets. The authors of DWN worked with well over 100 and shared their fully machine-readable results (.arff and apparently standardized *_R.dat files) in a convenient single downloadable tar-file (see their paper for the URL).
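If you want to work with those shared files directly, here is a quick sketch (the local path below is hypothetical, assuming you have downloaded and untarred the bundle; read.arff() is from the foreign package):

# Sketch: read one of the shared .arff files from a local copy of the bundle.
library(foreign)
carArff <- read.arff('DWN_data/car/car.arff')  # hypothetical local path
summary(carArff)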
The stated conclusion of the paper is comforting, and not entirely unexpected: random forest methods are usually in the top 3 classifiers in terms of accuracy.
The problem is: we are always more accepting of an expected outcome. To confirm such a conclusion we will, of course, need more studies (on larger and more industry-typical data sets), better measures than accuracy (see here for some details), and a lot of digging in to methodology (including data preparation).
To be clear: I like the paper. The authors (as good scientists) publicly shared their data and a bit of their preparation code. This is something most authors do not do, and should in fact be our standard for accepting work for evaluation.
But, let us get down to quibbles. Let’s unpack the data and look at an example. Suppose we start with “car,” a synthetic data set we have often used in demonstrations. The UCI repository supplies 3 files: car.c45-names, car.data, and car.names:
car.names: free-form description of the data set and its format.
car.data: comma-separated data (without a header).
car.c45-names: presumably a machine-readable header for C4.5 packages.
The standard way to deal with this data is to inspect car.names or car.c45-names by hand and hand-build a custom command to load the data. Example R code to do this is given below:
library(RCurl)

# Pull the raw comma-separated data (no header) directly from UCI.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
tab <- read.table(text=getURL(url,write=basicTextGatherer()),
                  header=FALSE, sep=',')

# Column names transcribed by hand from car.names / car.c45-names.
colnames(tab) <- c('buying', 'maint', 'doors',
                   'persons', 'lug_boot', 'safety', 'class')

options(width=50)  # narrow the printed summary
print(summary(tab))
Which (assuming RCurl is properly installed) yields:
buying maint doors persons
high :432 high :432 2 :432 2 :576
low :432 low :432 3 :432 4 :576
med :432 med :432 4 :432 more:576
vhigh:432 vhigh:432 5more:432
lug_boot safety class
big :576 high:576 acc : 384
med :576 low :576 good : 69
small:576 med :576 unacc:1210
vgood: 65
For any one data set, having to read the documentation and adapt it into custom loading code is not a big deal. However, having to do this for over 100 data sets is a real effort. Let’s look into how the DWN paper did this.
The DWN paper’s car directory has 9 items:
car.data: original file from UCI.
car.names: original file from UCI.
le_datos.m: Matlab custom data-loading code.
car.txt: facts about the data set.
car.arff: derived .arff-format version of the data set.
car.cost: pricing of classification errors.
car_R.dat: derived standard tab-separated values file with header.
conxuntos.dat: likely a result file.
conxuntos_kfold.dat: likely a result file.
The files I am interested in are car_R.dat and le_datos.m. car_R.dat looks to be a TSV (tab-separated values) file with a header, likely intended to be read into R. It looks like the file is in a very regular format, with row numbers, feature columns first (named f*), and the category to be predicted last (named clase and re-encoded as an integer). Notice that all features (which in this case were originally strings or factors) have been re-encoded as floating point numbers. That is potentially a problem. Let’s try to dig into how this conversion may have been done. We look into le_datos.m and see the following code fragment:
for i_fich=1:n_fich
f=fopen(fich{i_fich}, 'r');
if -1==f
error('erro en fopen abrindo %s\n', fich{i_fich});
end
for i=1:n_patrons(i_fich)
fprintf(2,'%5.1f%%\r', 100*n_iter++/n_patrons_total);
for j = 1:n_entradas
t= fscanf(f,'%s',1);
if j==1 || j==2
val={'vhigh', 'high', 'med', 'low'};
elseif j==3
val={'2', '3', '4', '5-more'};
elseif j==4
val={'2', '4', 'more'};
elseif j==5
val={'small', 'med', 'big'};
elseif j==6
val={'low', 'med', 'high'};
end
n=length(val); a=2/(n-1); b=(1+n)/(1-n);
for k=1:n
if strcmp(t,val{k})
x(i_fich,i,j)=a*k+b; break
end
end
end
t = fscanf(f,'%s',1); % lectura da clase (read the class label)
for j=1:n_clases
if strcmp(t,clase{j})
cl(i_fich,i)=j; break
end
end
end
fclose(f);
end
It looks like for each categorical variable the researchers have hand-coded an ordered choice of levels. Each level is then replaced by an equally spaced code number from -1 through 1 (using the linear rule x(i_fich,i,j)=a*k+b). Then (in code not shown) possibly more transformations are applied to the numeric variables (such as centering and scaling to unit variance). This changes the original data, which looks like this:
buying maint doors persons lug_boot safety class
1 vhigh vhigh 2 2 small low unacc
2 vhigh vhigh 2 2 small med unacc
3 vhigh vhigh 2 2 small high unacc
4 vhigh vhigh 2 2 med low unacc
5 vhigh vhigh 2 2 med med unacc
6 vhigh vhigh 2 2 med high unacc
To this:
  f1       f2       f3       f4       f5       f6       clase
1 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439 -1.22439 1
2 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439 0        1
3 -1.34125 -1.34125 -1.52084 -1.22439 -1.22439 1.22439  1
4 -1.34125 -1.34125 -1.52084 -1.22439 0        -1.22439 1
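To make the rule concrete, here is a minimal R sketch (ours, not the authors’ Matlab; the centering/scaling step is an assumption on our part) that roughly reproduces the f6 (safety) column from the tab data frame we loaded earlier:

# Replace the ordered levels of one column with equally spaced codes in [-1, 1]
# via the a*k+b rule, then center and scale (the standardization is our guess).
equalSpaceCode <- function(x, orderedLevels) {
  k <- match(as.character(x), orderedLevels)  # position of each level: 1..n
  n <- length(orderedLevels)
  a <- 2/(n - 1)
  b <- (1 + n)/(1 - n)
  a*k + b                                     # k=1 maps to -1, k=n maps to +1
}

f6raw <- equalSpaceCode(tab$safety, c('low', 'med', 'high'))
head(scale(f6raw))  # roughly reproduces the f6 column of car_R.dat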
It appears as if one of the machine learning libraries the authors are using only accepts numeric features (I think some of the Python scikit-learn methods have this limitation), or the authors believe they are using such a package. Whoever prepared this data seemed to be unaware that the standard way to convert categorical variables to numeric is the introduction of multiple indicator variables (see page 33 of chapter 2 of Practical Data Science with R for more details).
(Figure: indicator variables encoding US Census reported levels of education.)
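For contrast, here is a short base-R sketch of the standard indicator-variable encoding, applied to the safety column of the tab data frame loaded earlier:

# Expand the categorical 'safety' column into 0/1 indicator columns.
# The '- 1' drops the intercept so every level gets its own column.
head(model.matrix(~ safety - 1, data=tab))

Each level becomes its own 0/1 column, so no artificial order or spacing is imposed on the levels.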
The point is: encoding multiple levels of a categorical variable into a single number may seem reversible to a person (as it is a 1-1 map), but some machine learning methods cannot undo the geometric detail lost in such an encoding. For example: with a linear method (be it regression, logistic regression, a linear SVM, and so on) we lose explanatory power unless the encoding has correctly guessed both the order of the levels and their relative magnitudes. Even tree-based methods (like decision trees, or even random forests) waste part of their explanatory power (roughly: degrees of freedom) trying to invert the encoding, leaving less power to explain the original relation in the data. This sort of ad-hoc encoding may not cause much harm in this one example, but it is exactly what you don’t want to do when there are a great number of levels, when the order isn’t obvious, or when you are comparing different methods (as different methods are damaged to different degrees by this encoding).
This sort of “convert categorical features through an arbitrary function” step is something we have seen a few times. It is one of the reasons we explicitly discuss indicator variables in “Practical Data Science with R” despite the common wisdom that “everybody already knows about them.” When you are trying to get the best possible results for a client, you don’t want to introduce avoidable errors in your data transforms.
If you absolutely don’t want to use indicator variables, consider impact coding or a safe automated transform such as vtreat (a hedged sketch follows below). In both cases the actual training data is used to estimate the order and relative magnitudes of an encoding that would be useful for downstream modeling.
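For completeness, a hedged sketch of the vtreat route, assuming vtreat’s designTreatmentsC()/prepare() interface (in practice you would design the treatments on a training split only, not on all the data as here):

# Sketch: let vtreat derive data-driven numeric encodings (impact codes and
# indicators) for the categorical columns, instead of hand-picking an order.
library(vtreat)
vars <- c('buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety')
treatPlan <- designTreatmentsC(tab, vars,
                               outcomename='class', outcometarget='unacc')
tabTreated <- prepare(treatPlan, tab, pruneSig=NULL)
str(tabTreated)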
Is there any actual damage in this encoding? Let’s load the processed data set and see.
# Load the DWN paper's processed version of the car data (tab separated, with header).
url2 <- 'http://winvector.github.io/uciCar/car_R.dat'
dTreated <- read.table(url2,
                       sep='\t', header=TRUE)
The original data set supports a pretty good logistic regression model for unacceptable cars:
set.seed(32353)
# Random 50/50 train/test split.
train <- rbinom(dim(tab)[[1]], 1, 0.5)==1

# Logistic regression: probability the car is rated 'unacc'.
m1 <- glm(class=='unacc' ~ buying+maint+doors+persons+lug_boot+safety,
          family=binomial(link='logit'),
          data=tab[train, ])
tab$pred <- predict(m1, newdata=tab, type="response")

# Confusion counts on the held-out rows.
print(table(class=tab[!train, 'class'],
            unnacPred=tab[!train, 'pred']>0.5))
## unnacPred
## class FALSE TRUE
## acc 181 18
## good 30 0
## unacc 22 577
## vgood 35 0
The transformed data set does not support as good a logistic regression model:
# Same model form, but on the numerically re-encoded features.
m2 <- glm(clase==1 ~ f1+f2+f3+f4+f5,
          family=binomial(link='logit'),
          data=dTreated[train, ])
dTreated$pred <- predict(m2, newdata=dTreated, type="response")
print(table(class=dTreated[!train, 'clase'],
            unnacPred=dTreated[!train, 'pred']>0.5))
## unnacPred
## class FALSE TRUE
## 0 28 7
## 1 69 530
## 2 64 135
## 3 23 7
Now obviously some modeling methods are more sensitive to this mis-coding than others. In fact, for a moderate number of levels you would expect random forest methods to largely invert the coding. But the fact that some methods are more affected than others is one reason why you don’t want to perform this encoding before making comparisons. As to the question of why ever use logistic regression: because when you have a proper encoding of the data, and the model structure is in fact somewhat linear, logistic regression can be a very good method.
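If you want to check the “random forest can invert a moderate encoding” claim yourself, here is a sketch of a comparison you could run (assuming the randomForest package; we make no claims here about the exact accuracies you will see):

# Fit random forests on both encodings and compare holdout accuracy.
library(randomForest)
rf1 <- randomForest(as.factor(class=='unacc') ~ buying+maint+doors+persons+lug_boot+safety,
                    data=tab[train, ])
rf2 <- randomForest(as.factor(clase==1) ~ f1+f2+f3+f4+f5+f6,
                    data=dTreated[train, ])
# Holdout accuracy for each encoding.
mean(predict(rf1, newdata=tab[!train, ])==as.factor(tab[!train, 'class']=='unacc'))
mean(predict(rf2, newdata=dTreated[!train, ])==as.factor(dTreated[!train, 'clase']==1))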
In the DWN paper 8 data sets (out of 123) have the a*k+b fragment in their le_datos.m file. So likely the study was largely driven by data sets that natively have only numeric features. Also, we emphasize that the DWN paper shared its data and a bit of its methods, which puts it light-years ahead of most published empirical studies. The only reason we can’t critique other authors in the same way is that many other authors don’t share their work.