DataFrames are essential data structures in the R programming language. In this tutorial, we’ll discuss how to create a dataframe in R.
A DataFrame in R is a tabular (i.e., 2-dimensional, rectangular) data structure used to store values of any data type. It’s a data structure of base R, meaning that we don’t have to install any specific package to create DataFrames and work with them.
As with any other table, every DataFrame consists of columns (representing variables, or attributes) and rows (representing data entries). Speaking in terms of R, a dataframe is a list of vectors of equal length. Being at the same time a 2-dimensional data structure, it resembles an R matrix, differing from it in the following way: a matrix can contain only one data type, while a DataFrame is more versatile since it can have multiple data types. However, while different columns of a DataFrame can have different data types, all the values within a single column must be of the same data type.
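To make the list-of-vectors view concrete, here is a minimal sketch. The names `df` and `m` and their values are illustrative only and aren't used elsewhere in this tutorial; `stringsAsFactors=FALSE` is set so the result doesn't depend on the R version.

```r
# A data frame is a list whose elements are equal-length column vectors
df <- data.frame(id = 1:3,
                 label = c('a', 'b', 'c'),
                 score = c(0.5, 1.5, 2.5),
                 stringsAsFactors = FALSE)

df_is_list  <- is.list(df)        # each column is one list element
col_lengths <- lengths(df)        # length of every column
col_classes <- sapply(df, class)  # each column keeps its own class

# A matrix, in contrast, coerces all values to one common type
m <- cbind(1:3, c('a', 'b', 'c'))
matrix_type <- typeof(m)

print(col_classes)
print(matrix_type)
```

Here the three columns keep the integer, character, and numeric types, while the matrix stores everything as character.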
Creating a Dataframe in R from Vectors
To create a DataFrame in R from one or more vectors of the same length, we use the data.frame() function. Its most basic syntax is as follows:
df <- data.frame(vector_1, vector_2)
We can pass as many vectors as we want to this function. Each vector will represent a DataFrame column, and the length of any vector will correspond to the number of rows in the new DataFrame. It’s also possible to pass only one vector to the data.frame() function, in which case a DataFrame with a single column will be created.
One way to create an R DataFrame from vectors is to create each vector first and then pass them all in the necessary order to the data.frame() function:
rating <- 1:4
animal <- c('koala', 'hedgehog', 'sloth', 'panda')
country <- c('Australia', 'Italy', 'Peru', 'China')
avg_sleep_hours <- c(21, 18, 17, 10)
super_sleepers <- data.frame(rating, animal, country, avg_sleep_hours)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
(Side note: to be more precise, the geography of hedgehogs and sloths is a little wider, but let’s not be that picky!)
Alternatively, we can provide all the vectors directly in the function call:
super_sleepers <- data.frame(rating=1:4,
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
We got the same DataFrame as in the previous code. Note the following features:
- The vectors for a DataFrame can be created using either the c() function (e.g., c('koala', 'hedgehog', 'sloth', 'panda')) or a range (e.g., 1:4).
- In the first case, we used the assignment operator <- to create the vectors. In the second case, we used the argument assignment operator =.
- In both cases, the names of the vectors became the column names of the resulting DataFrame.
- In the second example, we can include each vector name in quotation marks (e.g., 'rating'=1:4), but it’s not really necessary: the result will be the same.
Let’s confirm that the data structure we got is, indeed, a DataFrame:
print(class(super_sleepers))
[1] "data.frame"
Now, let’s explore its structure:
str(super_sleepers)
'data.frame': 4 obs. of 4 variables:
 $ rating         : int 1 2 3 4
 $ animal         : Factor w/ 4 levels "hedgehog","koala",..: 2 1 4 3
 $ country        : Factor w/ 4 levels "Australia","China",..: 1 3 4 2
 $ avg_sleep_hours: num 21 18 17 10
We see that despite the animal and country vectors being originally character vectors, the corresponding columns have a factor data type. This conversion was the default behavior of the data.frame() function in R versions before 4.0.0, which is where the output above comes from. Since R 4.0.0, the default is stringsAsFactors=FALSE, and character vectors stay character. To suppress the conversion on older versions, or to make the behavior explicit, we add the optional parameter stringsAsFactors and set it to FALSE:
super_sleepers <- data.frame(rating=1:4,
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             stringsAsFactors=FALSE)
str(super_sleepers)
'data.frame': 4 obs. of 4 variables:
 $ rating         : int 1 2 3 4
 $ animal         : chr "koala" "hedgehog" "sloth" "panda"
 $ country        : chr "Australia" "Italy" "Peru" "China"
 $ avg_sleep_hours: num 21 18 17 10
Now we see that the columns created from the character vectors are also of the character data type.
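Because the default for stringsAsFactors depends on the R version, a quick way to see both behaviors is to set the parameter explicitly each time. The following is a small sketch; the names `as_factor` and `as_chr` are made up for illustration.

```r
animal <- c('koala', 'hedgehog', 'sloth', 'panda')

# Setting the parameter explicitly gives the same result on any R version
as_factor <- data.frame(animal, stringsAsFactors = TRUE)
as_chr    <- data.frame(animal, stringsAsFactors = FALSE)

class_with_factors    <- class(as_factor$animal)
class_without_factors <- class(as_chr$animal)
print(c(class_with_factors, class_without_factors))

# An existing factor column can be converted back to character afterwards
as_factor$animal <- as.character(as_factor$animal)
```

The last line is useful when a DataFrame has already been created with factor columns and re-creating it isn't convenient.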
It’s also possible to name the rows of a DataFrame (by default, the rows are just indexed as consecutive integers starting from 1). For this purpose, we use the optional parameter row.names, as follows:
super_sleepers <- data.frame(rating=1:4,
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             row.names=c('row_1', 'row_2', 'row_3', 'row_4'))
print(super_sleepers)
      rating   animal   country avg_sleep_hours
row_1      1    koala Australia              21
row_2      2 hedgehog     Italy              18
row_3      3    sloth      Peru              17
row_4      4    panda     China              10
Note that in an R DataFrame, the row names (if they exist) have to be unique, and by default the data.frame() function makes the column names unique as well. If, by mistake, we provide the same name for two columns, R will automatically add a suffix to the second of them:
# Adding by mistake 2 columns called 'animal'
super_sleepers <- data.frame(animal=1:4,
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers)
  animal animal.1   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
If, instead, we make a similar mistake with the row names, the program will throw an error:
# Naming by mistake 2 rows 'row_1'
super_sleepers <- data.frame(rating=1:4,
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             row.names=c('row_1', 'row_1', 'row_3', 'row_4'))
print(super_sleepers)
Error in data.frame(rating = 1:4, animal = c("koala", "hedgehog", "sloth", : duplicate row.names: row_1
Traceback:
1. data.frame(rating = 1:4, animal = c("koala", "hedgehog", "sloth",
 .     "panda"), country = c("Australia", "Italy", "Peru", "China"),
 .     avg_sleep_hours = c(21, 18, 17, 10), row.names = c("row_1",
 .     "row_1", "row_3", "row_4"))
2. stop(gettextf("duplicate row.names: %s", paste(unique(row.names[duplicated(row.names)]),
 .     collapse = ", ")), domain = NA)
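In a script, it can be handy to catch the duplicate-row-names error instead of letting it stop execution. The sketch below uses base R's tryCatch() for that, and also shows the check.names parameter of data.frame(), which controls whether duplicated column names get a suffix. The tiny DataFrames and the names `msg` and `dup_cols` are illustrative only.

```r
# Duplicate row names: capture the error message instead of stopping the script
msg <- tryCatch(
  {
    data.frame(x = 1:2, row.names = c('row_1', 'row_1'))
    'no error'
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# Duplicate column names: check.names = FALSE keeps them exactly as given
dup_cols <- data.frame(animal = 1:2,
                       animal = c('koala', 'panda'),
                       check.names = FALSE)
print(names(dup_cols))
```

Keeping duplicated column names is rarely a good idea, since selecting a column by name then becomes ambiguous; the example is only meant to show where the automatic suffix comes from.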
If necessary, we can rename the columns of a DataFrame after its creation using the names() function:
names(super_sleepers) <- c('col_1', 'col_2', 'col_3', 'col_4')
print(super_sleepers)
  col_1    col_2     col_3 col_4
1     1    koala Australia    21
2     2 hedgehog     Italy    18
3     3    sloth      Peru    17
4     4    panda     China    10
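The names() function can also be used to rename just one column rather than all of them. Here is a short sketch on a throwaway DataFrame (the name `df` and its contents are illustrative):

```r
df <- data.frame(col_1 = 1:2, col_2 = c('koala', 'panda'))

# Rename a single column by looking up its current name...
names(df)[names(df) == 'col_2'] <- 'animal'

# ...or by its position
names(df)[1] <- 'rating'

print(names(df))
```

Renaming by name is usually the safer of the two, because it keeps working if the column order changes.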
Creating a Dataframe in R from a Matrix
It’s possible to create an R DataFrame from a matrix. However, in this case, all the values of a new DataFrame will be of the same data type.
Let’s say we have the following matrix:
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow=2)
print(my_matrix)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
We can create a DataFrame from it using the same data.frame() function:
df_from_matrix <- data.frame(my_matrix)
print(df_from_matrix)
  X1 X2 X3
1  1  3  5
2  2  4  6
Unfortunately, in this case, data.frame() has no optional parameter for setting the column names directly inside the call (while, ironically, we can still rename the rows by using the row.names parameter). Since the default column names aren’t descriptive, we have to fix them after creating the DataFrame by applying the names() function:
names(df_from_matrix) <- c('col_1', 'col_2', 'col_3')
print(df_from_matrix)
  col_1 col_2 col_3
1     1     3     5
2     2     4     6
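Two other possible workarounds, sketched below, are to name the matrix columns before the conversion with colnames(), or to wrap the call in setNames(). Both are base R functions; the variable names `df_named` and `df_named_2` are made up for this example.

```r
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

# Option 1: name the matrix columns first; data.frame() then reuses them
colnames(my_matrix) <- c('col_1', 'col_2', 'col_3')
df_named <- data.frame(my_matrix)

# Option 2: create and rename in a single expression with setNames()
df_named_2 <- setNames(data.frame(matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)),
                       c('col_1', 'col_2', 'col_3'))

print(df_named)
print(names(df_named_2))
```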
Creating a Dataframe in R from a List of Vectors
Another way to create a DataFrame in R is to provide a list of vectors to the data.frame() function. Indeed, we can think of an R DataFrame as a particular case of a list of vectors where all the vectors are of the same length.
Let’s say we have the following list of vectors:
my_list <- list(rating=1:4,
                animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                country=c('Australia', 'Italy', 'Peru', 'China'),
                avg_sleep_hours=c(21, 18, 17, 10))
print(my_list)
$rating
[1] 1 2 3 4

$animal
[1] "koala"    "hedgehog" "sloth"    "panda"

$country
[1] "Australia" "Italy"     "Peru"      "China"

$avg_sleep_hours
[1] 21 18 17 10
Now, we want to create a DataFrame from it:
super_sleepers <- data.frame(my_list)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
It’s important to emphasize once again that to create a DataFrame from a list of vectors, the vectors of the provided list should all have the same number of items. R recycles a shorter vector only when its length divides the number of rows evenly; otherwise, the program throws an error. For example, if we remove 'koala' when creating my_list, we’ll still manage to create the list of vectors. However, when we try to use that list to create a DataFrame, we’ll get an error.
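The failure described above can be reproduced with a shortened version of the list. This is a sketch only: `short_list` is a hypothetical name, and tryCatch() is used so that the error message is captured rather than stopping the script.

```r
# 'koala' is left out, so the vectors have lengths 4 and 3
short_list <- list(rating = 1:4,
                   animal = c('hedgehog', 'sloth', 'panda'))

# Creating the list works; the lengths simply differ
print(lengths(short_list))

# Converting it to a data frame fails, so we capture the error message
result <- tryCatch(data.frame(short_list),
                   error = function(e) conditionMessage(e))
print(result)
```

Checking lengths() on the list before calling data.frame() is a cheap way to spot this kind of problem early.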
Another thing to note here is that the items of the list should be named (directly when creating the list of vectors, just as we did, or later, applying the names() function on the list). If we don’t do so and pass a list containing "nameless" vectors to the data.frame() function, we’ll get a DataFrame with quite meaningless column names:
my_list <- list(1:4,
                c('koala', 'hedgehog', 'sloth', 'panda'),
                c('Australia', 'Italy', 'Peru', 'China'),
                c(21, 18, 17, 10))
super_sleepers <- data.frame(my_list)
print(super_sleepers)
  X1.4 c..koala....hedgehog....sloth....panda..
1    1                                    koala
2    2                                 hedgehog
3    3                                    sloth
4    4                                    panda
  c..Australia....Italy....Peru....China.. c.21..18..17..10.
1                                Australia                21
2                                    Italy                18
3                                     Peru                17
4                                    China                10
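For the "name it later" route mentioned above, a minimal sketch might look like this. It uses a two-element list to stay short, and the names `unnamed_list` and `named_df` are illustrative.

```r
unnamed_list <- list(1:4,
                     c('koala', 'hedgehog', 'sloth', 'panda'))

# Assign names to the list items before converting the list
names(unnamed_list) <- c('rating', 'animal')

named_df <- data.frame(unnamed_list)
print(names(named_df))
```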
Creating a Dataframe in R from Other Dataframes
We can create a DataFrame in R by combining two or more other DataFrames. We can do this horizontally or vertically.
To combine DataFrames horizontally (i.e., adding the columns of one DataFrame to the columns of the other), we use the cbind() function, where we pass the necessary DataFrames.
Let’s say we have a DataFrame containing only the first two columns of our super_sleepers table:
super_sleepers_1 <- data.frame(rating=1:4,
                               animal=c('koala', 'hedgehog', 'sloth', 'panda'))
print(super_sleepers_1)
  rating   animal
1      1    koala
2      2 hedgehog
3      3    sloth
4      4    panda
The last two columns of super_sleepers are saved in another DataFrame:
super_sleepers_2 <- data.frame(country=c('Australia', 'Italy', 'Peru', 'China'),
                               avg_sleep_hours=c(21, 18, 17, 10))
print(super_sleepers_2)
    country avg_sleep_hours
1 Australia              21
2     Italy              18
3      Peru              17
4     China              10
Now, we’ll apply the cbind() function to concatenate both DataFrames and get the initial super_sleepers DataFrame:
super_sleepers <- cbind(super_sleepers_1, super_sleepers_2)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
Note that in order to perform the above operation successfully, the DataFrames must have the same number of rows; otherwise, we’ll get an error.
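Besides whole DataFrames, cbind() also accepts a named vector, which is a compact way to append a single column. A small sketch, with illustrative names `left`, `right`, and `combined`:

```r
left  <- data.frame(rating = 1:2, animal = c('koala', 'hedgehog'))
right <- data.frame(country = c('Australia', 'Italy'))

# Combine two data frames side by side; rows are paired by position
combined <- cbind(left, right)

# Append one more column from a named vector of matching length
combined <- cbind(combined, avg_sleep_hours = c(21, 18))

print(dim(combined))
print(names(combined))
```

Since cbind() pairs rows purely by position, it's worth making sure both inputs are sorted the same way before combining them.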
Similarly, to combine DataFrames vertically (i.e., adding the rows of one DataFrame to the rows of the other), we use the rbind() function, where we pass the necessary DataFrames.
Let’s say we have a DataFrame containing only the first two rows of super_sleepers:
super_sleepers_1 <- data.frame(rating=1:2,
                               animal=c('koala', 'hedgehog'),
                               country=c('Australia', 'Italy'),
                               avg_sleep_hours=c(21, 18))
print(super_sleepers_1)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
Another DataFrame contains the last two rows of super_sleepers:
super_sleepers_2 <- data.frame(rating=3:4,
                               animal=c('sloth', 'panda'),
                               country=c('Peru', 'China'),
                               avg_sleep_hours=c(17, 10))
print(super_sleepers_2)
  rating animal country avg_sleep_hours
1      3  sloth    Peru              17
2      4  panda   China              10
Let’s combine them vertically using the rbind() function to get our initial DataFrame:
super_sleepers <- rbind(super_sleepers_1, super_sleepers_2)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
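Note that in order to perform this operation successfully, the DataFrames must have the same number of columns and the same column names. The columns are matched by name, so their order may differ between the DataFrames, although keeping the same order makes the code easier to follow. If the numbers of columns or the names don’t match, we’ll get an error.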
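As far as base R's rbind() method for DataFrames goes, columns are lined up by name rather than by position. The sketch below illustrates this with two small DataFrames whose columns come in a different order, and then shows what happens when a column name doesn't match. The names `top`, `bottom`, `other`, and the 'species' column are made up for the example.

```r
top    <- data.frame(rating = 1:2, animal = c('koala', 'hedgehog'),
                     stringsAsFactors = FALSE)
bottom <- data.frame(animal = c('sloth', 'panda'), rating = 3:4,
                     stringsAsFactors = FALSE)

# Same column names in a different order: columns are matched by name
stacked <- rbind(top, bottom)
print(stacked)

# A column name that doesn't match raises an error, captured here as text
other <- data.frame(rating = 5L, species = 'cat', stringsAsFactors = FALSE)
err <- tryCatch(rbind(top, other),
                error = function(e) conditionMessage(e))
print(err)
```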
Creating an Empty Dataframe in R
In some cases, we may need to create an empty R DataFrame with only column names and column data types and no rows, to fill it later using a for-loop. For this purpose, we again apply the data.frame() function, as follows:
super_sleepers_empty <- data.frame(rating=numeric(),
                                   animal=character(),
                                   country=character(),
                                   avg_sleep_hours=numeric())
print(super_sleepers_empty)
[1] rating          animal          country         avg_sleep_hours
<0 rows> (or 0-length row.names)
Let’s check the data types of the columns of our new empty DataFrame:
str(super_sleepers_empty)
'data.frame': 0 obs. of 4 variables:
 $ rating         : num
 $ animal         : Factor w/ 0 levels:
 $ country        : Factor w/ 0 levels:
 $ avg_sleep_hours: num
As we saw earlier, in R versions before 4.0.0 the columns that we want to be of character data type are actually of factor data type due to the default conversion conducted by the data.frame() function (since R 4.0.0, they stay character by default). As earlier, we can fix it by introducing the optional parameter stringsAsFactors=FALSE:
super_sleepers_empty <- data.frame(rating=numeric(),
                                   animal=character(),
                                   country=character(),
                                   avg_sleep_hours=numeric(),
                                   stringsAsFactors=FALSE)
str(super_sleepers_empty)
'data.frame': 0 obs. of 4 variables:
 $ rating         : num
 $ animal         : chr
 $ country        : chr
 $ avg_sleep_hours: num
Note: adding the stringsAsFactors=FALSE parameter is a good practice when applying the data.frame() function in code that may run on R versions before 4.0.0; from R 4.0.0 onward, it is already the default. We haven’t used it much in this tutorial, but only to avoid overloading the code and to focus on the main details. For real-world tasks that have to work across R versions, you should consider adding this parameter to prevent the undesirable conversion of character columns to factors.
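To illustrate the "create empty, then fill in a loop" pattern mentioned at the start of this section, here is one possible sketch. It appends rows with rbind(); the names `sleep_log`, `animals`, `hours`, and `new_row` are hypothetical, and this is only one of several ways to fill a DataFrame.

```r
# Start from an empty, typed data frame
sleep_log <- data.frame(animal = character(),
                        avg_sleep_hours = numeric(),
                        stringsAsFactors = FALSE)

animals <- c('koala', 'sloth')
hours   <- c(21, 17)

# Append one row per iteration
for (i in seq_along(animals)) {
  new_row <- data.frame(animal = animals[i],
                        avg_sleep_hours = hours[i],
                        stringsAsFactors = FALSE)
  sleep_log <- rbind(sleep_log, new_row)
}

print(sleep_log)
```

Growing a DataFrame row by row copies it on every iteration, so for large loops it's usually faster to collect the pieces in a list and combine them once at the end.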
Another way to create an empty DataFrame in R is to create an empty "copy" of another DataFrame (practically meaning that we copy only the column names and their data types).
Let’s re-create our original super_sleepers (this time, using the stringsAsFactors=FALSE parameter):
super_sleepers <- data.frame(rating=1:4,
                             animal=c('koala', 'hedgehog', 'sloth', 'panda'),
                             country=c('Australia', 'Italy', 'Peru', 'China'),
                             avg_sleep_hours=c(21, 18, 17, 10),
                             stringsAsFactors=FALSE)
print(super_sleepers)
  rating   animal   country avg_sleep_hours
1      1    koala Australia              21
2      2 hedgehog     Italy              18
3      3    sloth      Peru              17
4      4    panda     China              10
Now, create an empty template of it as a new DataFrame using the following syntax:
super_sleepers_empty <- super_sleepers[FALSE, ]
print(super_sleepers_empty)
[1] rating          animal          country         avg_sleep_hours
<0 rows> (or 0-length row.names)
Let’s double-check if the data types of the original DataFrame’s columns are preserved in the new empty DataFrame:
str(super_sleepers_empty)
'data.frame': 0 obs. of 4 variables:
 $ rating         : int
 $ animal         : chr
 $ country        : chr
 $ avg_sleep_hours: num
Finally, we can create an empty DataFrame from a matrix with no rows and the necessary number of columns and then assign the corresponding column names to it:
columns <- c('rating', 'animal', 'country', 'avg_sleep_hours')
super_sleepers_empty <- data.frame(matrix(nrow=0, ncol=length(columns)))
names(super_sleepers_empty) <- columns
print(super_sleepers_empty)
[1] rating          animal          country         avg_sleep_hours
<0 rows> (or 0-length row.names)
One potential disadvantage of the last approach is that the data types of the columns aren’t set from the beginning:
str(super_sleepers_empty)
'data.frame': 0 obs. of 4 variables:
 $ rating         : logi
 $ animal         : logi
 $ country        : logi
 $ avg_sleep_hours: logi
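If the matrix-based approach is still preferred, one possible workaround is to coerce each empty column to the intended type afterwards. The sketch below uses two columns only, and the name `empty_df` is illustrative.

```r
columns <- c('rating', 'animal')
empty_df <- data.frame(matrix(nrow = 0, ncol = length(columns)))
names(empty_df) <- columns

# Coerce each zero-length logical column to the intended type
empty_df$rating <- as.integer(empty_df$rating)
empty_df$animal <- as.character(empty_df$animal)

column_types <- sapply(empty_df, class)
print(column_types)
```

This takes one extra line per column, so for DataFrames with many columns, declaring the types directly in data.frame() is likely the simpler option.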
Reading a DataFrame in R from a File
Apart from creating a DataFrame in R from scratch, we can import an already existing dataset in a tabular form and save it as a DataFrame. Indeed, this is the most common way to create an R DataFrame for real-world tasks.
To see how it works, let’s download the Kaggle dataset Oranges vs. Grapefruit to our local machine, save it in the same folder as this notebook, read it as a new DataFrame citrus, and display the first six rows of the DataFrame. Since the original dataset exists as a csv file, we’ll use the read.csv() function to read it:
citrus <- read.csv('citrus.csv')
print(head(citrus))
    name diameter weight red green blue
1 orange     2.96  86.76 172    85    2
2 orange     3.91  88.05 166    78    3
3 orange     4.42  95.17 156    81    2
4 orange     4.47  95.60 163    81    4
5 orange     4.48  95.76 161    72    9
6 orange     4.59  95.86 142   100    2
Note: it’s not mandatory to save the downloaded file in the same folder as the working notebook. If the file is saved in another place, we simply need to provide the entire path to it instead of just the file’s name (e.g., 'C:/Users/User/Downloads/citrus.csv'). However, saving the dataset file in the same folder as the working R notebook is a good practice.
The above code reads the dataset 'citrus.csv' into the DataFrame called citrus.
It’s also possible to read file types other than csv. In such cases, the following functions can be useful: read.table() (for reading any kind of tabular data), read.delim() (for tab-delimited text files), and read.fwf() (for fixed width formatted files).
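As a rough illustration of two of these functions, the sketch below writes a tiny tab-delimited file to a temporary location and reads it back with read.delim() and read.table(). The file contents are made-up toy values, not taken from the citrus dataset, and the names `tmp`, `from_delim`, and `from_table` are illustrative.

```r
# Write a small tab-delimited file to a temporary location (toy data)
tmp <- tempfile(fileext = '.txt')
writeLines(c('name\tdiameter',
             'orange\t3.5',
             'grapefruit\t7.5'),
           tmp)

# read.delim() assumes a header row and a tab separator by default
from_delim <- read.delim(tmp)

# read.table() is more general, so the same options are spelled out
from_table <- read.table(tmp, header = TRUE, sep = '\t')

print(from_delim)
print(names(from_table))

# Clean up the temporary file
unlink(tmp)
```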
Conclusion
In this tutorial, we’ve explored different ways of creating a DataFrame in R: from one or more vectors, from a matrix, from a list of vectors, by combining other DataFrames horizontally or vertically, and by reading an available tabular dataset and assigning it to a new DataFrame. In addition, we considered three different ways of creating an empty DataFrame in R and when these approaches are applicable. We paid special attention to the syntax of the code and its variations, technical nuances, good and bad practices, possible pitfalls, and workarounds to fix or avoid them.