
An absolute beginner’s guide to creating data frames for a Stack Overflow [r] question

[This article was first published on R – What You're Doing Is Rather Desperate, and kindly contributed to R-bloggers].

For better or worse I spend some time each day at Stack Overflow [r], reading and answering questions. If you do the same, you probably notice certain features in questions that recur frequently. It’s as though everyone is copying from one source – perhaps the one at the top of the search results. And it seems that the highest-ranked result is not always the best.

Nowhere is this more apparent to me than in the way many users create data frames. So here is my introductory guide to “how not to create data frames”, aimed at beginners writing their first questions.

1. No need for vectors

There is no need to create vectors first and then add them as columns:

x <- 1:2
y <- 3:4

df <- data.frame(x, y)

# just do this!
df <- data.frame(x = 1:2, y = 3:4)

If you really need the columns as vectors, they can always be obtained using df$x or df$y.
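As a small sketch of that last point, the columns of a data frame built in one call can be pulled back out as plain vectors. The names example1, x_vec and y_vec are only illustrative:

```r
# Build the data frame in one call; example1 is an illustrative name
example1 <- data.frame(x = 1:2, y = 3:4)

# Extract a column as a plain vector, using either $ or [[ ]]
x_vec <- example1$x
y_vec <- example1[["y"]]

# Both are ordinary vectors again
is.vector(x_vec)
is.vector(y_vec)
```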

While we’re here, that df thing…

2. …df is not a great variable name

Sure, you can call a variable df and R will know when you mean that variable and when you mean the function df() (the density of the F distribution, from the stats package). But why risk the confusion, when you could just call it something else? Like df1. Or mydata. Or example.
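To see the two meanings side by side, here is a minimal sketch, assuming a fresh session in which nothing called df has been assigned yet:

```r
# Before any assignment, df is a function from the stats package
is.function(df)

# After assignment, the same name also refers to your data
df <- data.frame(x = 1:2, y = 3:4)
class(df)

# R still finds the function when df is used in a call,
# which is exactly why the name is confusing to read
f_density <- df(1, df1 = 2, df2 = 3)
```

Both uses work, but a reader of your question has to stop and work out which one you mean each time.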

3. No need to convert from a matrix

Here’s another rather bizarre way to make a data frame that I often see:

df1 <- matrix(1:4, ncol = 2, nrow = 2)
df1 <- as.data.frame(df1)

# or perhaps to name columns
df1 <- matrix(1:4, 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)

Again, this is better achieved simply by using data.frame():

df1 <- data.frame(x = 1:2, y = 3:4)

Using a matrix is especially problematic when you want to mix variable types, which is possible in data frames but not in matrices. Here, our numbers become characters in the matrix and hence, in versions of R before 4.0.0 (where stringsAsFactors defaulted to TRUE), factors in the data frame:

df1 <- matrix(c(1:2, letters[1:2]), 2, 2, dimnames = list(c(1, 2), c("x", "y")))
df1 <- as.data.frame(df1)

# oh look, your numbers are now factors, that's not what you want
str(df1)

'data.frame':	2 obs. of  2 variables:
 $ x: Factor w/ 2 levels "1","2": 1 2
  ..- attr(*, "names")= chr  "1" "2"
 $ y: Factor w/ 2 levels "a","b": 1 2
  ..- attr(*, "names")= chr  "1" "2"
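The damage is done in the matrix itself, before as.data.frame() is ever called. A quick sketch to check this, where m is just an illustrative name:

```r
# A matrix holds a single type, so mixing numbers and letters
# coerces everything to character
m <- matrix(c(1:2, letters[1:2]), 2, 2, dimnames = list(c(1, 2), c("x", "y")))

typeof(m)             # storage type of the whole matrix
is.numeric(m[, "x"])  # the "numbers" column is no longer numeric
```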

Which brings us to…

4. …No strings as factors

Ever.

df1 <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)
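To see what the argument changes, this sketch builds the same data frame both ways and compares the class of the y column. The names with_factors and without_factors are illustrative. Note that from R 4.0.0 onwards FALSE is the default, so setting it explicitly mainly matters if your question might be run on an older version:

```r
# Same data, with and without conversion of strings to factors
with_factors    <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = TRUE)
without_factors <- data.frame(x = 1:2, y = letters[1:2], stringsAsFactors = FALSE)

class(with_factors$y)     # strings were converted to a factor
class(without_factors$y)  # strings stay as character
```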

5. Consider the alternatives and use the inbuilt help

You might consider the newer tibble in which strings are never factors, amongst other advantages such as pretty printing with information about variables. The syntax is just the same:

library(tibble)
df1 <- tibble(x = 1:2, y = 3:4)
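A short sketch of the strings point, assuming the tibble package is installed. The name example3 is illustrative:

```r
library(tibble)

# A character column stays character; no stringsAsFactors argument needed
example3 <- tibble(x = 1:2, y = letters[1:2])
class(example3$y)

# Printing shows the dimensions and the type of each column
example3
```

A tibble is still a data frame underneath, so code written for data frames will generally accept it.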

And when you know the command name (data.frame, for example), help is only “?” + command_name away. It isn’t always the best documentation, but it does generally tell you all you need to know.
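For data.frame that looks like the sketch below. The interactive() guard is my addition so that the snippet also runs cleanly as a script, and arg_names is an illustrative name:

```r
# Open the help page; these two calls are equivalent
if (interactive()) {
  ?data.frame
  help("data.frame")
}

# The argument names documented on that page can also be listed directly
arg_names <- names(formals(data.frame))
arg_names
```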
