How to Create a Dataframe in R with 30 Code Examples (2022)

[This article was first published on R tutorial Archives – Dataquest, and kindly contributed to R-bloggers].

DataFrames are essential data structures in the R programming language. In this tutorial, we’ll discuss how to create a dataframe in R.

A DataFrame in R is a tabular (i.e., 2-dimensional, rectangular) data structure used to store values of any data type. It’s a data structure of base R, meaning that we don’t have to install any specific package to create DataFrames and work with them.

As with any other table, every DataFrame consists of columns (representing variables, or attributes) and rows (representing data entries). In R terms, a DataFrame is a list of vectors of equal length. Being at the same time a 2-dimensional data structure, it resembles an R matrix, with one difference: a matrix can contain only one data type, while a DataFrame is more versatile since it can hold multiple data types. However, while different columns of a DataFrame can have different data types, all the values within one column must be of the same data type.

Creating a Dataframe in R from Vectors

To create a DataFrame in R from one or more vectors of the same length, we use the data.frame() function. Its most basic syntax is as follows:

df <- data.frame(vector_1, vector_2)

We can pass as many vectors as we want to this function. Each vector will represent a DataFrame column, and the length of any vector will correspond to the number of rows in the new DataFrame. It’s also possible to pass only one vector to the data.frame() function, in which case a DataFrame with a single column will be created.

One way to create an R DataFrame from vectors is to create each vector first and then pass them all in the necessary order to the data.frame() function:

rating <- 1:4
animal <- c('koala', 'hedgehog', 'sloth', 'panda') 
country <- c('Australia', 'Italy', 'Peru', 'China')
avg_sleep_hours <- c(21, 18, 17, 10)
super_sleepers <- data.frame(rating, animal, country, avg_sleep_hours)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10

(Side note: to be more precise, the geography of hedgehogs and sloths is a little wider, but let’s not be that picky!)

Alternatively, we can provide all the vectors directly inside the function call:

super_sleepers <- data.frame(rating=1:4, 
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'), 
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10

We got the same DataFrame as in the previous code.

Let’s confirm that the data structure we got is, indeed, a DataFrame:

print(class(super_sleepers))
[1] "data.frame"
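
The earlier description of a DataFrame as a list of equal-length vectors can be checked in the same way. The following is a minimal sketch that rebuilds the example table and inspects it with base R functions only; the variable name column_lengths is introduced here purely for illustration.

```r
# Rebuild the example table (same values as above)
super_sleepers <- data.frame(rating = 1:4,
                             animal = c('koala', 'hedgehog', 'sloth', 'panda'),
                             country = c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours = c(21, 18, 17, 10))

# A DataFrame is also a list with one element per column
print(is.list(super_sleepers))
print(length(super_sleepers))   # number of columns

# Every column has the same length, equal to the number of rows
column_lengths <- sapply(super_sleepers, length)
print(column_lengths)
print(dim(super_sleepers))      # rows, then columns
```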

Now, let’s explore its structure:

str(super_sleepers)
'data.frame':   4 obs. of  4 variables:
 $ rating         : int  1 2 3 4
 $ animal         : Factor w/ 4 levels "hedgehog","koala",..: 2 1 4 3
 $ country        : Factor w/ 4 levels "Australia","China",..: 1 3 4 2
 $ avg_sleep_hours: num  21 18 17 10

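We see that although animal and country were originally character vectors, the corresponding columns have a factor data type. In R versions before 4.0.0 (one of which produced the output above), this conversion is the default behavior of the data.frame() function. Since R 4.0.0, the default is stringsAsFactors=FALSE, and character vectors stay character columns. To suppress the conversion in older versions, or simply to make the behavior explicit, we add the optional parameter stringsAsFactors and set it to FALSE: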

super_sleepers <- data.frame(rating=1:4, 
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'), 
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             stringsAsFactors=FALSE)
str(super_sleepers)
'data.frame':   4 obs. of  4 variables:
 $ rating         : int  1 2 3 4
 $ animal         : chr  "koala" "hedgehog" "sloth" "panda"
 $ country        : chr  "Australia" "Italy" "Peru" "China"
 $ avg_sleep_hours: num  21 18 17 10

Now we see that the columns created from the character vectors are also of the character data type.
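
Sometimes only one column should be categorical. A possible approach, sketched below, is to keep stringsAsFactors=FALSE and wrap just that column in factor(). The name sleepers and the choice of country as the categorical column are assumptions made for illustration.

```r
# Keep 'animal' as character but store 'country' as a factor explicitly
sleepers <- data.frame(animal = c('koala', 'hedgehog', 'sloth', 'panda'),
                       country = factor(c('Australia', 'Italy', 'Peru', 'China')),
                       stringsAsFactors = FALSE)

# Inspect the class of every column
print(sapply(sleepers, class))

# A factor column can be converted back to character when needed
country_as_text <- as.character(sleepers$country)
print(country_as_text)
```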

It’s also possible to name the rows of a DataFrame (by default, the rows are just indexed with consecutive integers starting from 1). For this purpose, we use the optional parameter row.names, as follows:

super_sleepers <- data.frame(rating=1:4, 
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'), 
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             row.names=c('row_1', 'row_2', 'row_3', 'row_4'))
print(super_sleepers)
      rating   animal   country avg_sleep_hours
row_1      1    koala Australia              21
row_2      2 hedgehog     Italy              18
row_3      3    sloth      Peru              17
row_4      4    panda     China              10
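
Once rows are named, the names can be used for lookups. Here is a short sketch using standard base R indexing; stringsAsFactors=FALSE is added so that the result is the same in old and new R versions.

```r
super_sleepers <- data.frame(rating = 1:4,
                             animal = c('koala', 'hedgehog', 'sloth', 'panda'),
                             country = c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours = c(21, 18, 17, 10),
                             row.names = c('row_1', 'row_2', 'row_3', 'row_4'),
                             stringsAsFactors = FALSE)

# List the row names
print(rownames(super_sleepers))

# Select a whole row by its name
print(super_sleepers['row_3', ])

# Select a single cell by row name and column name
print(super_sleepers['row_3', 'animal'])
```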

Note that in an R DataFrame, the row names (if they exist) have to be unique, and data.frame() by default makes the column names unique as well. If, by mistake, we provide the same name for two columns, R will automatically add a suffix to the second of them:

# Adding by mistake 2 columns called 'animal'
super_sleepers <- data.frame(animal=1:4, 
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'), 
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers)
  animal animal.1   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
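
The automatic suffix comes from the check.names parameter of data.frame(), which is TRUE by default. The sketch below shows what happens when it is switched off, and uses make.unique() to illustrate the same style of suffix; the name dup is hypothetical. Keeping duplicate column names is rarely a good idea, so this is for illustration only.

```r
# With check.names = FALSE, duplicate column names are kept as they are
dup <- data.frame(animal = 1:4,
                  animal = c('koala', 'hedgehog', 'sloth', 'panda'),
                  check.names = FALSE)
print(names(dup))

# make.unique() produces the same ".1" style of suffix seen above
print(make.unique(c('animal', 'animal')))
```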

If, instead, we make a similar mistake with the row names, the program will throw an error:

# Naming by mistake 2 rows 'row_1'
super_sleepers <- data.frame(rating=1:4, 
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'), 
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             row.names=c('row_1', 'row_1', 'row_3', 'row_4'))
print(super_sleepers)
Error in data.frame(rating = 1:4, animal = c("koala", "hedgehog", "sloth", : duplicate row.names: row_1
Traceback:

1. data.frame(rating = 1:4, animal = c("koala", "hedgehog", "sloth", 
 .     "panda"), country = c("Australia", "Italy", "Peru", "China"), 
 .     avg_sleep_hours = c(21, 18, 17, 10), row.names = c("row_1", 
 .         "row_1", "row_3", "row_4"))

2. stop(gettextf("duplicate row.names: %s", paste(unique(row.names[duplicated(row.names)]), 
 .     collapse = ", ")), domain = NA)

If necessary, we can rename the columns of a DataFrame after its creation using the names() function:

names(super_sleepers) <- c('col_1', 'col_2', 'col_3', 'col_4')
print(super_sleepers)
  col_1    col_2     col_3 col_4
1     1    koala Australia    21
2     2 hedgehog     Italy    18
3     3    sloth      Peru    17
4     4    panda     China    10
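
To rename just one column instead of all of them, we can index into the result of names(). A minimal sketch, assuming a small two-column DataFrame called df created only for this illustration:

```r
df <- data.frame(col_1 = 1:4,
                 col_2 = c('koala', 'hedgehog', 'sloth', 'panda'))

# Rename a column by matching its current name
names(df)[names(df) == 'col_2'] <- 'animal'

# Rename a column by its position
names(df)[1] <- 'rating'

print(names(df))
```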

Creating a Dataframe in R from a Matrix

It’s possible to create an R DataFrame from a matrix. However, in this case, all the values of a new DataFrame will be of the same data type.

Let’s say we have the following matrix:

my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2)
print(my_matrix)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

We can create a DataFrame from it using the same data.frame() function:

df_from_matrix <- data.frame(my_matrix)
print(df_from_matrix)
  X1 X2 X3
1  1  3  5
2  2  4  6

Unfortunately, data.frame() has no optional parameter for setting the column names of a DataFrame created from a matrix (while, ironically, we can still rename the rows by using the row.names parameter). Since the default column names aren’t descriptive (or at least meaningful), we have to fix them after creating the DataFrame by applying the names() function:

names(df_from_matrix) <- c('col_1', 'col_2', 'col_3')
print(df_from_matrix)
  col_1 col_2 col_3
1     1     3     5
2     2     4     6
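
Two alternatives are sketched here, both using base R only: naming the matrix columns with colnames() before the conversion, or wrapping the conversion in setNames(). The name df_alt is introduced for illustration.

```r
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

# Option 1: name the matrix columns first; data.frame() keeps them
colnames(my_matrix) <- c('col_1', 'col_2', 'col_3')
df_from_matrix <- data.frame(my_matrix)
print(df_from_matrix)

# Option 2: convert first, then rename in a single expression
df_alt <- setNames(data.frame(matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)),
                   c('col_1', 'col_2', 'col_3'))
print(df_alt)
```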

Creating a Dataframe in R from a List of Vectors

Another way to create a DataFrame in R is to provide a list of vectors to the data.frame() function. Indeed, we can think of an R DataFrame as a particular case of a list of vectors where all the vectors are of the same length.

Let’s say we have the following list of vectors:

my_list <- list(rating=1:4,
                animal=c('koala', 'hedgehog', 'sloth', 'panda'), 
                country=c('Australia', 'Italy', 'Peru', 'China'),
                avg_sleep_hours=c(21, 18, 17, 10))
print(my_list)
$rating
[1] 1 2 3 4

$animal
[1] "koala"    "hedgehog" "sloth"    "panda"   

$country
[1] "Australia" "Italy"     "Peru"      "China"    

$avg_sleep_hours
[1] 21 18 17 10

Now, we want to create a DataFrame from it:

super_sleepers <- data.frame(my_list)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10

It’s important to emphasize once again that to be able to create a DataFrame from a list of vectors, each vector of the provided list should have the same number of items (R recycles a shorter vector only when its length evenly divides the number of rows); otherwise, the program throws an error. For example, if we remove ‘koala’ when creating my_list, we’ll still manage to create the list of vectors. However, when we try to use that list to create a DataFrame, we’ll get an error.
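
The ‘koala’ scenario can be reproduced as follows. This is a sketch that catches the error with tryCatch() so that the script keeps running; the names short_list and result are hypothetical.

```r
# The same list as before, but with 'koala' removed from the animal vector
short_list <- list(rating = 1:4,
                   animal = c('hedgehog', 'sloth', 'panda'),
                   country = c('Australia', 'Italy', 'Peru', 'China'),
                   avg_sleep_hours = c(21, 18, 17, 10))

# Creating the list works; the vectors simply have different lengths
print(lengths(short_list))

# Converting it to a DataFrame fails; capture the error message instead of stopping
result <- tryCatch(data.frame(short_list),
                   error = function(e) conditionMessage(e))
print(result)
```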

Another thing to note here is that the items of the list should be named (directly when creating the list of vectors, just as we did, or later, applying the names() function on the list). If we don’t do so and pass a list containing "nameless" vectors to the data.frame() function, we’ll get a DataFrame with quite meaningless column names:

my_list <- list(1:4,
                c('koala', 'hedgehog', 'sloth', 'panda'), 
                c('Australia', 'Italy', 'Peru', 'China'),
                c(21, 18, 17, 10))

super_sleepers <- data.frame(my_list)
print(super_sleepers)
  X1.4 c..koala....hedgehog....sloth....panda..
1    1                                    koala
2    2                                 hedgehog
3    3                                    sloth
4    4                                    panda
  c..Australia....Italy....Peru....China.. c.21..18..17..10.
1                                Australia                21
2                                    Italy                18
3                                     Peru                17
4                                    China                10
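
As mentioned, the names can also be added to the list afterwards with names(). A minimal sketch of that fix:

```r
my_list <- list(1:4,
                c('koala', 'hedgehog', 'sloth', 'panda'),
                c('Australia', 'Italy', 'Peru', 'China'),
                c(21, 18, 17, 10))

# Name the list items before converting the list to a DataFrame
names(my_list) <- c('rating', 'animal', 'country', 'avg_sleep_hours')

super_sleepers <- data.frame(my_list)
print(names(super_sleepers))
```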

Creating a Dataframe in R from Other Dataframes

We can create a DataFrame in R by combining two or more other DataFrames. We can do this horizontally or vertically.

To combine DataFrames horizontally (i.e., adding the columns of one dataframe to the columns of the other), we use the cbind() function, where we pass the necessary DataFrames.

Let’s say we have a DataFrame containing only the first two columns of our super_sleepers table:

super_sleepers_1 <- data.frame(rating=1:4, 
                               animal=c('koala', 'hedgehog', 'sloth', 'panda'))
print(super_sleepers_1)
  rating   animal
1      1    koala
2      2 hedgehog
3      3    sloth
4      4    panda

The last two columns of super_sleepers are saved in another DataFrame:

super_sleepers_2 <- data.frame(country=c('Australia', 'Italy', 'Peru', 'China'),
                               avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers_2)
    country avg_sleep_hours
1 Australia              21
2     Italy              18
3      Peru              17
4     China              10

Now, we’ll apply the cbind() function to concatenate both DataFrames and get the initial super_sleepers DataFrame:

super_sleepers <- cbind(super_sleepers_1, super_sleepers_2)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10

Note that in order to perform the above operation successfully, the DataFrames must have the same number of rows; otherwise, we’ll get an error.

Similarly, to combine DataFrames vertically (i.e., adding the rows of one DataFrame to the rows of the other), we use the rbind() function, where we pass the necessary DataFrames.

Let’s say we have a DataFrame containing only the first two rows of super_sleepers:

super_sleepers_1 <- data.frame(rating=1:2, 
                               animal=c('koala', 'hedgehog'), 
                               country=c('Australia', 'Italy'),
                               avg_sleep_hours=c(21, 18))
print(super_sleepers_1)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18

Another DataFrame contains the last two rows of super_sleepers:

super_sleepers_2 <- data.frame(rating=3:4, 
                               animal=c('sloth', 'panda'), 
                               country=c('Peru', 'China'),
                               avg_sleep_hours=c(17, 10))
print(super_sleepers_2)
  rating animal country avg_sleep_hours
1      3  sloth    Peru              17
2      4  panda   China              10

Let’s combine them vertically using the rbind() function to get our initial DataFrame:

super_sleepers <- rbind(super_sleepers_1, super_sleepers_2)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10

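Note that in order to perform this operation successfully, the DataFrames must have the same number of columns and the same column names. For DataFrames, rbind() matches columns by name rather than by position, so the column order may differ; the result follows the order of the first DataFrame. If the names don’t match, we’ll get an error.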
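
The sketch below illustrates how rbind() treats column names, using two small DataFrames (top and bottom, hypothetical names) whose columns are in a different order, followed by a deliberately mismatched one. The column name species and the value 'x' are placeholders.

```r
top <- data.frame(rating = 1:2,
                  animal = c('koala', 'hedgehog'),
                  stringsAsFactors = FALSE)

# Same column names, but in a different order
bottom <- data.frame(animal = c('sloth', 'panda'),
                     rating = 3:4,
                     stringsAsFactors = FALSE)

# Columns are matched by name; the result uses the column order of 'top'
combined <- rbind(top, bottom)
print(combined)

# A DataFrame with a different column name cannot be appended
mismatch <- tryCatch(rbind(top, data.frame(rating = 5L, species = 'x')),
                     error = function(e) conditionMessage(e))
print(mismatch)
```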

Creating an Empty Dataframe in R

In some cases, we may need to create an empty R DataFrame with only column names and column data types and no rows, for example, to fill it later in a for-loop. For this purpose, we again apply the data.frame() function, as follows:

super_sleepers_empty <- data.frame(rating=numeric(),
                                   animal=character(),
                                   country=character(),
                                   avg_sleep_hours=numeric())
print(super_sleepers_empty)
[1] rating          animal          country         avg_sleep_hours
 (or 0-length row.names)

Let’s check the data types of the columns of our new empty DataFrame:

str(super_sleepers_empty)
'data.frame':   0 obs. of  4 variables:
 $ rating         : num 
 $ animal         : Factor w/ 0 levels: 
 $ country        : Factor w/ 0 levels: 
 $ avg_sleep_hours: num 

As we saw earlier, in R versions before 4.0.0 the columns that we want to be of character data type are actually of factor data type due to the default conversion conducted by the data.frame() function. As earlier, we can fix it by setting the optional parameter stringsAsFactors=FALSE:

super_sleepers_empty <- data.frame(rating=numeric(),
                                   animal=character(),
                                   country=character(),
                                   avg_sleep_hours=numeric(),
                                   stringsAsFactors=FALSE)
str(super_sleepers_empty)
'data.frame':   0 obs. of  4 variables:
 $ rating         : num 
 $ animal         : chr 
 $ country        : chr 
 $ avg_sleep_hours: num 

Note: in R versions before 4.0.0, adding the stringsAsFactors=FALSE parameter is a good practice whenever we apply the data.frame() function. We haven’t used it much in this tutorial, only to avoid overloading the code and to focus on the main details. Since R 4.0.0, FALSE is the default, so the parameter is needed only when the code may also run on an older version, where it prevents the unwanted conversion of character columns to factors.
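
The for-loop use case mentioned earlier could look like the sketch below. The helper vectors (animals, countries, hours) and the name filled are assumptions made for illustration, and the loop appends rows with rbind(), which is only one of several possible approaches. Growing a DataFrame row by row is slow for large data, so this pattern suits small tables.

```r
super_sleepers_empty <- data.frame(rating = numeric(),
                                   animal = character(),
                                   country = character(),
                                   avg_sleep_hours = numeric(),
                                   stringsAsFactors = FALSE)

# Source values to be added one row at a time
animals <- c('koala', 'hedgehog', 'sloth', 'panda')
countries <- c('Australia', 'Italy', 'Peru', 'China')
hours <- c(21, 18, 17, 10)

filled <- super_sleepers_empty
for (i in seq_along(animals)) {
  # Build a one-row DataFrame with the same column names
  new_row <- data.frame(rating = i,
                        animal = animals[i],
                        country = countries[i],
                        avg_sleep_hours = hours[i],
                        stringsAsFactors = FALSE)
  # Append it to the rows collected so far
  filled <- rbind(filled, new_row)
}
print(filled)
```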

Another way to create an empty DataFrame in R is to create an empty "copy" of another DataFrame (practically meaning that we copy only the column names and their data types).

Let’s re-create our original super_sleepers (this time, using the stringsAsFactors=FALSE parameter):

super_sleepers <- data.frame(rating=1:4, 
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'), 
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             stringsAsFactors=FALSE)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10

Now, create an empty template of it as a new DataFrame using the following syntax:

super_sleepers_empty <- super_sleepers[FALSE, ]
print(super_sleepers_empty)
[1] rating          animal          country         avg_sleep_hours
 (or 0-length row.names)

Let’s double-check if the data types of the original DataFrame’s columns are preserved in the new empty DataFrame:

str(super_sleepers_empty)
'data.frame':   0 obs. of  4 variables:
 $ rating         : int 
 $ animal         : chr 
 $ country        : chr 
 $ avg_sleep_hours: num 

Finally, we can create an empty DataFrame from a matrix with no rows and the necessary number of columns and then assign the corresponding column names to it:

columns <- c('rating', 'animal', 'country', 'avg_sleep_hours')
super_sleepers_empty <- data.frame(matrix(nrow=0, ncol=length(columns)))
names(super_sleepers_empty) <- columns
print(super_sleepers_empty)
[1] rating          animal          country         avg_sleep_hours
 (or 0-length row.names)

One potential disadvantage of the last approach is that the data types of the columns aren’t set from the beginning:

str(super_sleepers_empty)
'data.frame':   0 obs. of  4 variables:
 $ rating         : logi 
 $ animal         : logi 
 $ country        : logi 
 $ avg_sleep_hours: logi 

Reading a DataFrame in R from a File

Apart from creating a DataFrame in R from scratch, we can import an already existing dataset in a tabular form and save it as a DataFrame. Indeed, this is the most common way to create an R DataFrame for real-world tasks.

To see how it works, let’s download the Kaggle dataset Oranges vs. Grapefruit to our local machine, save it in the same folder as this notebook, read it as a new DataFrame citrus, and display the first six rows of the DataFrame. Since the original dataset is a CSV file, we’ll use the read.csv() function to read it:

citrus <- read.csv('citrus.csv')
print(head(citrus))
    name diameter weight red green blue
1 orange     2.96  86.76 172    85    2
2 orange     3.91  88.05 166    78    3
3 orange     4.42  95.17 156    81    2
4 orange     4.47  95.60 163    81    4
5 orange     4.48  95.76 161    72    9
6 orange     4.59  95.86 142   100    2

Note: it’s not mandatory to save the downloaded file in the same folder as the working notebook. If the file is saved in another place, we simply need to provide the entire path to it instead of just the file’s name (e.g., 'C:/Users/User/Downloads/citrus.csv'). However, saving the dataset file in the same folder as the working R notebook is a good practice.

The above code reads the dataset 'citrus.csv' into the DataFrame called citrus.

It’s also possible to read file types other than CSV. For those, the following functions are useful: read.table() (for general delimited text files), read.delim() (for tab-delimited text files), and read.fwf() (for fixed-width formatted files).
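
To try these functions without downloading anything, the sketch below writes a tiny table to temporary files and reads it back. The two rows reuse values from the citrus output above; the file paths come from tempfile(), and the variable names are hypothetical.

```r
# A two-row sample with values taken from the citrus output above
sample_rows <- data.frame(name = c('orange', 'orange'),
                          diameter = c(2.96, 3.91),
                          weight = c(86.76, 88.05),
                          stringsAsFactors = FALSE)

# Write and read a comma-separated file
csv_path <- tempfile(fileext = '.csv')
write.csv(sample_rows, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

# Write a tab-delimited file and read it in two equivalent ways
tsv_path <- tempfile(fileext = '.tsv')
write.table(sample_rows, tsv_path, sep = '\t', row.names = FALSE, quote = FALSE)
from_delim <- read.delim(tsv_path, stringsAsFactors = FALSE)
from_table <- read.table(tsv_path, header = TRUE, sep = '\t',
                         stringsAsFactors = FALSE)

print(from_csv)
print(from_delim)
print(from_table)
```

Note that read.table() needs header=TRUE and an explicit sep here, while read.csv() and read.delim() already assume a header row and their respective separators.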

Conclusion

In this tutorial, we’ve explored different ways of creating a DataFrame in R: from one or more vectors, from a matrix, from a list of vectors, by combining other DataFrames horizontally or vertically, and by reading an available tabular dataset into a new DataFrame. In addition, we considered three different ways of creating an empty DataFrame in R and when these approaches are applicable. We paid special attention to the syntax of code and its variations, technical nuances, good and bad practices, possible pitfalls, and workarounds to fix or avoid them.
