For better or worse I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you probably notice certain features in questions that recur frequently. It’s as though everyone is copying from one source – perhaps the one at the top of the search results. And it seems highest-ranked is not always best.
Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide “how not to create data frames”, aimed at beginners writing their first questions.
1. No need for vectors
There is no need to create vectors first and then add them as columns:
x <- 1:2
y <- 3:4
df <- data.frame(x, y)
# just do this!
df <- data.frame(x = 1:2, y = 3:4)
If you really need the columns as vectors, they can always be obtained using df$x or df$y.
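As a quick check of that claim, the sketch below builds the data frame in one call and pulls the columns back out as plain vectors. The names example_data, x_values and y_values are illustrative choices, not from the original examples.

```r
# Build the data frame in a single call; no intermediate vectors needed
example_data <- data.frame(x = 1:2, y = 3:4)

# Each column is still available as an ordinary vector
x_values <- example_data$x         # dollar-sign extraction
y_values <- example_data[["y"]]    # equivalent double-bracket extraction

print(x_values)
print(is.vector(y_values))
```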
While we’re here, that df thing…
2. …df is not a great variable name
Sure, you can call a variable df and R will know when you mean that variable and when you mean the function, df(). But why risk the confusion, when you could just call it something else? Like df1. Or mydata. Or example.
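To see what the confusion can look like in practice, here is a small sketch. It refers to the function explicitly as stats::df so that it behaves the same whether or not a variable named df already exists in your session. The names f_density, msg_function, msg_missing and mydata_not_created_yet are illustrative.

```r
# df() is the density function of the F distribution, from the stats package
f_density <- stats::df
print(is.function(f_density))

# If you forget to create your data frame, `df$x` tries to subset this
# function instead, so the error says nothing about a missing object
msg_function <- tryCatch(f_density$x, error = function(e) conditionMessage(e))
print(msg_function)

# With a distinct name, the error tells you the object does not exist
msg_missing <- tryCatch(
  get("mydata_not_created_yet"),
  error = function(e) conditionMessage(e)
)
print(msg_missing)
```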
3. No need to convert from a matrix
Here’s another rather bizarre way to make a data frame that I often see:
df1 <- matrix(1:4, ncol = 2, nrow = 2)
df1 <- as.data.frame(df1)
# or perhaps to name columns
df1 <- matrix(1:4, 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)
Which would again be better achieved simply using data.frame():
df1 <- data.frame(x = 1:2, y = 3:4)
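Using a matrix is especially problematic when you want to mix variable types, which is possible in data frames but not in matrices. Here, our numbers become characters in the matrix and hence factors in the data frame (in R before 4.0.0, where stringsAsFactors defaulted to TRUE; from 4.0.0 onwards they stay as characters, which is still not the integers you started with):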
df1 <- matrix(c(1:2, letters[1:2]), 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)
# oh look, your numbers are now factors, that's not what you want
str(df1)
'data.frame': 2 obs. of 2 variables:
 $ x: Factor w/ 2 levels "1","2": 1 2
  ..- attr(*, "names")= chr "1" "2"
 $ y: Factor w/ 2 levels "a","b": 1 2
  ..- attr(*, "names")= chr "1" "2"
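The coercion is easy to confirm for yourself. In this sketch (the object names m and mixed are illustrative), the matrix holds only character values, while a data frame built directly keeps each column's own type. stringsAsFactors = FALSE is set explicitly so the result should be the same on older and newer versions of R.

```r
# A matrix can hold only one type, so mixing numbers and letters
# coerces everything to character
m <- matrix(c(1:2, letters[1:2]), 2, 2, dimnames = list(NULL, c("x", "y")))
print(typeof(m))

# Building the data frame directly keeps one type per column
mixed <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)
print(sapply(mixed, class))
```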
Which brings us to…
4. …No strings as factors
Ever.
df1 <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)
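One reason factors bite: calling as.numeric() on a factor of number-like strings returns the internal level codes, not the numbers you see printed. A minimal sketch with illustrative names (f, codes, values); the factor is created explicitly so the behaviour does not depend on your R version's defaults.

```r
# Numbers that ended up stored as a factor
f <- factor(c("10", "20", "10"))

# as.numeric() on a factor gives the level codes, not the values
codes <- as.numeric(f)
print(codes)

# Go via character to recover the actual numbers
values <- as.numeric(as.character(f))
print(values)
```

Since R 4.0.0, stringsAsFactors = FALSE is the default for data.frame(), so this mostly matters for older code and older installations.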
5. Consider the alternatives and use the inbuilt help
You might consider the newer tibble, in which strings are never factors, amongst other advantages such as pretty printing with information about variables. The syntax is just the same:
library(tibble)
df1 <- tibble(x = 1:2, y = 3:4)
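For a quick look at the difference, this sketch builds a tibble with a character column and checks its type. It assumes the tibble package is installed and simply skips the example if it is not; the name tb is illustrative.

```r
if (requireNamespace("tibble", quietly = TRUE)) {
  # Character input stays character; no stringsAsFactors argument involved
  tb <- tibble::tibble(x = 1:2, y = letters[1:2])
  print(class(tb$y))

  # Printing shows the dimensions and column types alongside the data
  print(tb)
}
```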
And when you know the command name (data.frame, for example), help is only “?” + command_name away. It isn’t always the best documentation, but it does generally tell you all you need to know.
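A short sketch of what that looks like. The two help calls are left as comments because they open a help page in an interactive session; the last lines list the function's argument names without leaving the console. The name arg_names is illustrative.

```r
# Interactive lookups; both open the help page for data.frame
# ?data.frame
# help("data.frame")

# A non-interactive alternative: list the arguments the function accepts
arg_names <- names(formals(data.frame))
print(arg_names)
```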