The primary user-facing data types in the R statistical computing environment behave as vectors: one-dimensional arrays of scalar values that have a nice operational algebra. There are additional types (lists, data frames, matrices, environments, and so on) but the most common data types are vectors. In fact vectors are so common in R that scalar values such as the number 5 are actually represented as length-1 vectors. We commonly think about working over vectors of “logical”, “integer”, “numeric”, “complex”, “character”, and “factor” types. However, a “factor” is not an R vector. In fact “factor” is not a first-class citizen in R, which can lead to some ugly bugs.
For example, consider the following R code.
levels <- c('a','b','c')
f <- factor(c('c','a','a',NA,'b','a'),levels=levels)
print(f)
## [1] c a a <NA> b a
## Levels: a b c
print(class(f))
## [1] "factor"
This example encodes a series of six observations into a known set of factor levels ('a', 'b', and 'c'). As is the case with real data, some of the positions might hold missing/invalid values such as NA. One of the strengths of R is that we have a uniform explicit representation of bad values, so with appropriate domain knowledge we can find and fix such problems. Suppose we knew (by policy or domain experience) that the level 'a' was a suitable default value to use when the actual data is missing/invalid. You would think the following code would be a reasonable way to build a new revised data column.
fRevised <- ifelse(is.na(f),'a',f)
print(fRevised)
## [1] "3" "1" "1" "a" "2" "1"
print(class(fRevised))
## [1] "character"
Notice the new column fRevised is an absolute mess (and not even of class/type factor). This sort of fix would have worked if f had been a vector of characters or even a vector of integers, but for factors we get gibberish.
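One safer way to make the intended repair (a sketch, not the only option) is to assign into the factor directly: 'a' is already a declared level, so the assignment is accepted and f keeps its class and labels.
# repair the missing values while keeping f a factor
f <- factor(c('c','a','a',NA,'b','a'),levels=c('a','b','c'))
f[is.na(f)] <- 'a'   # 'a' is a valid level, so this is allowed
print(f)
## [1] c a a a b a
## Levels: a b c
print(class(f))
## [1] "factor"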
We are going to work through some more examples of this problem.
R is designed to support statistical computation. In R, analyses and calculations are often centered on a type called a data frame. A data frame is very much like a SQL table in that it is a sequence of rows (each row representing an instance of data) organized against a column schema. It is also very much like a spreadsheet where we have good column names and column types. (One caveat: in R, vectors that are all NA typically lose their type information and become type "logical".) An example of an R data frame is given below.
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
print(d)
## x y
## 1 1.0 a
## 2 -0.4 b
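The parenthetical caveat about all-NA columns can be seen directly (a tiny sketch; the name dAllNA is just for illustration): a column that is entirely NA carries no other type information, so it comes through as "logical".
dAllNA <- data.frame(z=c(NA,NA))
print(class(dAllNA$z))
## [1] "logical"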
An R data frame is actually implemented as a list of columns, each column being treated as a vector. This encourages a very powerful programming style where we specify transformations as operations over columns. An example of working over column vectors is given below:
d <- data.frame(x=c(1,-0.4),y=c('a','b'))
d$xSquared <- d$x^2
print(d)
## x y xSquared
## 1 1.0 a 1.00
## 2 -0.4 b 0.16
Notice that we did not need to specify any for-loop, iteration, or range over the rows. We work over column vectors to great advantage in clarity and speed. This is fairly clever as traditional databases tend to be row-oriented (define operations as traversing rows) and spreadsheets tend to be cell-oriented (define operations over ranges of cells). We can confirm R’s implementation of data frames is in fact a list of column vectors (not merely some other structure behaving as such) through the unclass-trick:
print(class(unclass(d)))
## [1] "list"
print(unclass(d))
## $x
## [1] 1.0 -0.4
##
## $y
## [1] a b
## Levels: a b
##
## $xSquared
## [1] 1.00 0.16
##
## attr(,"row.names")
## [1] 1 2
The data frame d is implemented as a class/type annotation over a list of columns (x, y, and xSquared). Let’s take a closer look at the class or type of the column y.
print(class(d$y))
## [1] "factor"
The class of y is "factor". We gave R a sequence of strings and it promoted or coerced them into a sequence of factor levels. For statistical work this makes a lot of sense; we are more likely to want to work over factors (which we will define soon) than over strings. And at first glance R seems to like factors more than strings. For example summary() works better with factors than with strings:
print(summary(d))
## x y xSquared
## Min. :-0.40 a:1 Min. :0.16
## 1st Qu.:-0.05 b:1 1st Qu.:0.37
## Median : 0.30 Median :0.58
## Mean : 0.30 Mean :0.58
## 3rd Qu.: 0.65 3rd Qu.:0.79
## Max. : 1.00 Max. :1.00
print(summary(data.frame(x=c(1,-0.4),y=c('a','b'),
stringsAsFactors=FALSE)))
## x y
## Min. :-0.40 Length:2
## 1st Qu.:-0.05 Class :character
## Median : 0.30 Mode :character
## Mean : 0.30
## 3rd Qu.: 0.65
## Max. : 1.00
Notice how if y is a factor column we get nice counts of how often each factor level occurred, but if y is a character column (forced by setting stringsAsFactors=FALSE to turn off conversion) we don’t get a usable summary. So as a default behavior R promotes strings/characters to factors and produces better summaries for factors than for strings/characters. This would make you think that factors are a preferred/safe data type in R. That turns out not to be completely the case. A careful R programmer must decide when and where they want to allow factors in their code.
What is a factor? In principle a factor is a value known to be taken from a finite set of possible values called levels. This is similar to an enumerated type. Typically we think of factor levels or categories as taking values from a fixed set of strings. Factors are very useful in encoding categorical responses or data. For example we can represent which continent a country is in with the factor levels "Asia", "Africa", "North America", "South America", "Antarctica", "Europe", and "Australia". When the data has been encoded as a factor (perhaps during ETL) you not only have the continents indicated, you also know the complete set of continents and have a guarantee of no ad-hoc alternate responses (such as “SA” for South America). Additional machine-readable knowledge and constraints make downstream code much more compact, powerful, and safe.
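A small sketch of that guarantee in action (the variable names here are just for illustration): values outside the declared level set are not silently accepted, they become NA.
continents <- c("Asia","Africa","North America","South America",
                "Antarctica","Europe","Australia")
obs <- factor(c("Asia","SA","Europe"),levels=continents)  # "SA" is an ad-hoc code
print(obs)
## [1] Asia <NA> Europe
## Levels: Asia Africa North America South America Antarctica Europe Australia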
You can think of a factor vector as a sequence of strings with an additional annotation as to what universe of strings the values are taken from. R actually implements a factor as a sequence of integers, where each integer is the index (starting from 1) of the corresponding string in the vector of possible levels.
print(class(unclass(d$y)))
## [1] "integer"
print(unclass(d$y))
## [1] 1 2
## attr(,"levels")
## [1] "a" "b"
This implementation difference should not matter, except R exposes implementation details (more on this later). Exposing implementation details is generally considered to be a bad thing as we don’t know if code that uses factors is using the declared properties and interfaces or is directly manipulating the implementation.
Down-stream users or programmers are supposed to mostly work over the supplied abstraction, not over the implementation. Users should not routinely have direct access to the implementation details and certainly should not be able to directly manipulate the underlying implementation. In many cases the user must be aware of some of the limitations of the implementation, but this is considered a necessary undesirable consequence of a leaky abstraction. An example of a necessarily leaky abstraction: abstracting base-2 floating point arithmetic as if it were arithmetic over the real numbers. For decent speed you need your numerics to be based on machine floating point (or some multi-precision extension of machine floating point), but you want to think of numerics abstractly as real numbers. With this leaky compromise the user doesn’t have to have the entire IEEE Standard for Floating-Point Arithmetic (IEEE 754) open on their desk at all times. But the user should know exceptions like: (3-2.9)<=0.1 tends to evaluate to FALSE (due to the implementation, and in violation of the claimed abstraction) and know the necessary defensive coding practices (such as being familiar with What Every Computer Scientist Should Know About Floating-Point Arithmetic).
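A quick demonstration of that particular leak (nothing factor-specific here, just the floating point example above):
print((3-2.9)<=0.1)
## [1] FALSE
# both sides print as 0.1, but the left side carries a tiny positive
# rounding error, so the comparison fails
print(3-2.9)
## [1] 0.1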
Now: factors can be implemented both efficiently and correctly, so they should be implemented correctly. At first glance it appears that they have been implemented correctly in R and the user is protected from the irrelevant implementation details. For example, if we try to manipulate the underlying integer array representing the factor levels, we get caught.
d$y[1] <- 2
## Warning message:
## In `[<-.factor`(`*tmp*`, 1, value = c(NA, 1L)) :
## invalid factor level, NA generated
This is good: when we tried to monkey with the implementation we got caught. This is how the R implementors try to ensure there is not a lot of user code directly monkeying with the current representation of factors (leaving open the possibility of future bug-fixes and implementation improvements). Likely this safety was achieved by overloading/patching the [<- operator. However, as with most fix-to-finish designs, a few code paths were missed and there are places where the user is exposed to the implementation of factors when they expected to be working over the abstraction. Here are a few examples:
f <- factor(c('a','b','a')) # make a factor example
print(class(f))
## [1] "factor"
print(f)
## [1] a b a
## Levels: a b
# c() operator collapses to implementation
print(class(c(f,f)))
## [1] "integer"
print(c(f,f))
## [1] 1 2 1 1 2 1
# ifelse(,,) operator collapses to implementation
print(ifelse(rep(TRUE,length(f)),f,f))
## [1] 1 2 1
# factors are not actually vectors
# this IS as claimed in help(vector)
print(is.vector(f))
## [1] FALSE
# factor implementations are not vectors either
# despite being "integer"
print(class(unclass(f)))
## [1] "integer"
print(is.vector(unclass(f)))
## [1] FALSE
# unlist of a factor is not a vector
# despite help(unlist):
# "Given a list structure x, unlist simplifies it to produce a vector:
print(is.vector(unlist(f)))
## [1] FALSE
print(unlist(f))
## [1] a b a
## Levels: a b
print(as.vector(f))
## [1] "a" "b" "a"
What we have done is find instances where a factor column does not behave as we would expect a character vector to behave. These defects in behavior are why I claim factors are not first class in R. They don’t get the full-service expected behavior from a number of basic R operations (such as passing through c() or ifelse(,,) without losing their class label). It is hard to say a factor is treated as a first-class citizen that correctly “supports all the operations generally available to other entities” (quote taken from Wikipedia: First-class_citizen). R doesn’t seem to trust leaving factor data in factor data types (which should give one pause about doing the same).
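One workaround sketch (one of several possibilities; the name fBoth is just for illustration): route such operations through character vectors and re-apply the levels afterwards, so the factor class is rebuilt explicitly rather than silently dropped.
f <- factor(c('a','b','a'))
# concatenate as characters, then re-establish the original level set
fBoth <- factor(c(as.character(f),as.character(f)),levels=levels(f))
print(class(fBoth))
## [1] "factor"
print(fBoth)
## [1] a b a a b a
## Levels: a b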
The reason these differences are not mere curiosities is: in any code where we are expecting one behavior and experience another, we have a bug. So these conversions or abstraction leaks cause system brittleness, which can lead to verbose, hard-to-test, overly defensive code (see Postel’s law; I am not sure who to be angry with for some of the downsides of being required to code defensively).
(image: Computer History Museum. September 9, 1947, Grace Murray Hopper’s log: “First actual case of bug being found.”)
Why should we expect a factor to behave like a character vector? Why not expect it to behave like an integer vector? The reason is: we supplied a character vector and R’s default behavior in data.frame() was to convert it to a factor. R’s behavior only makes sense under the assumption there is some commonality of behavior between factors and character vectors. Otherwise R has made a surprising substitution and violated the principle of least astonishment. To press the point further: from an object-oriented view (which is a common way to talk about the separation of concerns of interface and implementation), a valid substitution should at the very least follow some form of the Liskov substitution principle, with factor being a valid sub-type of character vector. But this is not possible between mutable versions of factor and character vector, so the substitution should not have been offered.
What we are trying to point out is: design is not always just a matter of taste. With enough design principles in mind (such as least astonishment, Liskov substitution, and a few others) you can actually say some design decisions are wrong (and maybe even, some day, that some other design decisions are right). There are very few general principles of software system design, so you really don’t want to ignore the few you have.
One possible criticism of my examples is: “You have done everything wrong; everybody knows to set stringsAsFactors=FALSE.” I call this the “Alice’s Adventures in Wonderland” defense. In my opinion the user is a guest, and it is fair for the guest to initially assume default settings are generally the correct or desirable settings. The relevant “Alice’s Adventures in Wonderland” quote is:
At this moment the King, who had been for some time busily writing in his note-book, cackled out ‘Silence!’ and read out from his book, ‘Rule Forty-two. All persons more than a mile high to leave the court.’
Everybody looked at Alice.
‘I’m not a mile high,’ said Alice.
‘You are,’ said the King.
‘Nearly two miles high,’ added the Queen.
‘Well, I shan’t go, at any rate,’ said Alice: ‘besides, that’s not a regular rule: you invented it just now.’
‘It’s the oldest rule in the book,’ said the King.
‘Then it ought to be Number One,’ said Alice.
(text: Project Gutenberg)
Another obvious criticism is: “You have worked hard to write bugs.” That is not the case; I have worked hard to make the consequences direct and obvious. Where I first noticed my bug was deep in the code of an actual project, similar to the following example. First let’s build a synthetic data set where y~f(x), with x a factor or categorical variable.
# build a synthetic data set
set.seed(36236)
n <- 50
d <- data.frame(x=sample(c('a','b','c','d','e'),n,replace=TRUE))
d$train <- FALSE
d$train[sample(1:n,n/2)] <- TRUE
print(summary(d$x))
## a b c d e
## 4 7 12 14 13
# build noisy = f(x), with f('a')==f('b')
vals <- rnorm(length(levels(d$x)))
vals[2] <- vals[1]
names(vals) <- levels(d$x)
d$y <- rnorm(n) + vals[d$x]
print(vals)
## a b c d e
## 1.3394631 1.3394631 0.3536642 1.6990172 -0.5423986
# build a model
model1 <- lm(y~0+x,data=subset(d,train))
d$pred1 <- predict(model1,newdata=d)
print(summary(model1))
##
## Call:
## lm(formula = y ~ 0 + x, data = subset(d, train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53459 -0.43303 -0.07942 0.49278 2.20614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xa 2.9830 0.7470 3.993 0.000715 ***
## xb 2.0506 0.5282 3.882 0.000926 ***
## xc 1.2824 0.3993 3.212 0.004378 **
## xd 2.3644 0.3993 5.922 8.6e-06 ***
## xe -1.1541 0.4724 -2.443 0.023974 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.056 on 20 degrees of freedom
## Multiple R-squared: 0.8046, Adjusted R-squared: 0.7558
## F-statistic: 16.47 on 5 and 20 DF, p-value: 1.714e-06
Our first model is good. But during the analysis phase we might come across some domain knowledge, such as: 'a' and 'b' are actually equivalent codes. We could reduce fitting variance by incorporating this knowledge in our feature engineering. In this example it won’t be much of an improvement; we are not merging much and not eliminating many degrees of freedom. In a real production example this can be a very important step, where you may have a domain-supplied roll-up dictionary that merges a large number of levels. However, what happens is our new merged column gets quietly converted to a column of integers, which is then treated as a numeric column in the following modeling step. So the merge is in fact disastrous: we lose the categorical structure of the variable. We can, of course, re-institute the structure by calling as.factor() if we know about the problem (which we might not), but even then we have traded the string labels for new integer level labels (making debugging even harder). Let’s see the failure we are anticipating; notice how the training adjusted R-squared disastrously drops from 0.7558 to 0.1417 after we attempt our “improvement.”
# try (and fail) to build an improved model
# using domain knowledge f('a')==f('b')
d$xMerged <- ifelse(d$x=='b',factor('a',levels=levels(d$x)),d$x)
print(summary(as.factor(d$xMerged)))
## 1 3 4 5
## 11 12 14 13
# disaster! xMerged is now class integer
# which is treated as numeric in lm, losing a lot of information
model2 <- lm(y~0+xMerged,data=subset(d,train))
print(summary(model2))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3193 -0.5818 0.8281 1.6237 3.5451
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xMerged 0.2564 0.1132 2.264 0.0329 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.98 on 24 degrees of freedom
## Multiple R-squared: 0.176, Adjusted R-squared: 0.1417
## F-statistic: 5.128 on 1 and 24 DF, p-value: 0.03286
There is an obvious method to merge the levels correctly: convert back to character (which we show below). The issue is: if you don’t know about the conversion to integer happening, you may not know to look for it and correct it.
# correct f('a')==f('b') merge
d$xMerged <- ifelse(d$x=='b','a',as.character(d$x))
model3 <- lm(y~0+xMerged,data=subset(d,train))
d$pred3 <- predict(model3,newdata=d)
print(summary(model3))
##
## Call:
## lm(formula = y ~ 0 + xMerged, data = subset(d, train))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53459 -0.51084 -0.05408 0.71385 2.20614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## xMergeda 2.3614 0.4317 5.470 1.99e-05 ***
## xMergedc 1.2824 0.3996 3.209 0.00422 **
## xMergedd 2.3644 0.3996 5.916 7.15e-06 ***
## xMergede -1.1541 0.4729 -2.441 0.02361 *
## ---
## Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
##
## Residual standard error: 1.057 on 21 degrees of freedom
## Multiple R-squared: 0.7945, Adjusted R-squared: 0.7553
## F-statistic: 20.3 on 4 and 21 DF, p-value: 5.693e-07
dTest <- subset(d,!train)
nTest <- dim(dTest)[[1]]
# Root Mean Square Error of original model on test data
print(sqrt(sum((dTest$y-dTest$pred1)^2)/nTest))
## [1] 1.330894
# Root Mean Square Error of f('a')==f('b') model on test data
print(sqrt(sum((dTest$y-dTest$pred3)^2)/nTest))
## [1] 1.297682
Factors are definitely useful, and I am glad R has them. I just wish they had fewer odd behaviors. My rule of thumb is to use them as late as possible: set stringsAsFactors=FALSE, and if you need factors in some place, convert from character near that place.
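A minimal sketch of that rule of thumb (the data frame d2 and its column are just for illustration): keep strings as characters while cleaning and merging, and only build the factor right before the step that needs it.
d2 <- data.frame(x=c('c','a','a',NA,'b','a'),stringsAsFactors=FALSE)
d2$x[is.na(d2$x)] <- 'a'                    # character fixes behave as expected
d2$x <- factor(d2$x,levels=c('a','b','c'))  # convert just before modeling
print(summary(d2$x))
## a b c
## 4 1 1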
Please see the following articles for more ideas on working with categorical variables and preparing data for analysis.
Related posts:
- R has some sharp corners
- R minitip: don’t use data.matrix when you mean model.matrix
- R style tip: prefer functions that return data frames