Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Introduction
Not all datasets are as clean and tidy as you would expect. Therefore, after importing your dataset into RStudio, most of the time you will need to prepare it before performing any statistical analyses. Data manipulation can even sometimes take longer than the actual analyses when the quality of the data is poor. Data manipulation include a broad range of tools and techniques. We present here in details the manipulations that you will most likely need for your projects. Do not hesitate to let me know (as a comment at the end of this article for example) if you find other data manipulations essential so that I can add them.
Dataset
In this article, we use the dataset cars
to illustrate the different data manipulation techniques. Note that the dataset is installed by default in RStudio (so you do not need to import it) and I use the generic name dat
as the name of the dataset throughout the article (see here why I always use a generic name instead of more specific names). Here is a table of the whole dataset:
dat <- cars # rename the cars dataset with a generic name dat # display the entire dataset ## speed dist ## 1 4 2 ## 2 4 10 ## 3 7 4 ## 4 7 22 ## 5 8 16 ## 6 9 10 ## 7 10 18 ## 8 10 26 ## 9 10 34 ## 10 11 17 ## 11 11 28 ## 12 12 14 ## 13 12 20 ## 14 12 24 ## 15 12 28 ## 16 13 26 ## 17 13 34 ## 18 13 34 ## 19 13 46 ## 20 14 26 ## 21 14 36 ## 22 14 60 ## 23 14 80 ## 24 15 20 ## 25 15 26 ## 26 15 54 ## 27 16 32 ## 28 16 40 ## 29 17 32 ## 30 17 40 ## 31 17 50 ## 32 18 42 ## 33 18 56 ## 34 18 76 ## 35 18 84 ## 36 19 36 ## 37 19 46 ## 38 19 68 ## 39 20 32 ## 40 20 48 ## 41 20 52 ## 42 20 56 ## 43 20 64 ## 44 22 66 ## 45 23 54 ## 46 24 70 ## 47 24 92 ## 48 24 93 ## 49 24 120 ## 50 25 85
This dataset has 50 observations with 2 variables (speed and distance). You can check the number of observations and variables with nrow(dat)
and ncol(dat)
, or dim(dat)
:
nrow(dat) # number of rows/observations ## [1] 50 ncol(dat) # number of columns/variables ## [1] 2 dim(dat) # dimension: number of rows/observations and columns/variables ## [1] 50 2< !-- library(pander) --> < !-- pander(dat) --> < !-- Note that the `pander()` command is only used for aesthetic reasons. You do not necessarily have to use them when you work in R Studio. -->
Subset a dataset
First or last observations
- To keep only the first 10 observations:
head(dat, n = 10) ## speed dist ## 1 4 2 ## 2 4 10 ## 3 7 4 ## 4 7 22 ## 5 8 16 ## 6 9 10 ## 7 10 18 ## 8 10 26 ## 9 10 34 ## 10 11 17
- To keep only the last 5 observations:
tail(dat, n = 5) ## speed dist ## 46 24 70 ## 47 24 92 ## 48 24 93 ## 49 24 120 ## 50 25 85
Random sample of observations
- To draw a sample of 4 observations without replacement:
library(dplyr) sample_n(dat, 4, replace = FALSE) ## speed dist ## 1 13 34 ## 2 15 20 ## 3 18 56 ## 4 20 32
Based on row or column numbers
If you know what observation(s) or column(s) you want to keep, you can use the row or column number(s) to subset your dataset. We illustrate this with several examples:
- keep all the variables for the \(3^{rd}\) observation:
dat[3, ]
- keep the \(2^{nd}\) variable for all observations:
dat[, 2]
- You can mix the two above methods to keep only the \(2^{nd}\) variable of the \(3^{rd}\) observation:
dat[3, 2] ## [1] 4
- keep several observations; for example observations \(1\) to \(5\), the \(10^{th}\) and the \(15^{th}\) observation for all variables:
dat[c(1:5, 10, 15), ] # do not forget c() ## speed dist ## 1 4 2 ## 2 4 10 ## 3 7 4 ## 4 7 22 ## 5 8 16 ## 10 11 17 ## 15 12 28
- remove observations 5 to 45:
dat[-c(5:45), ] ## speed dist ## 1 4 2 ## 2 4 10 ## 3 7 4 ## 4 7 22 ## 46 24 70 ## 47 24 92 ## 48 24 93 ## 49 24 120 ## 50 25 85
- tip: to keep only the last observation, use
nrow()
instead of the row number:
dat[nrow(dat), ] # nrow() gives the number of rows ## speed dist ## 50 25 85
This way, no matter the number of observations, you will always select the last one. This technique of using a piece of code instead of a specific value is to avoid “hard coding”. Hard coding is generally not recommended (unless you want to specify a parameter that you are sure will never change) because if your dataset changes, you will need to manually edit your code.
As you probably figured out by now, you can select observations and/or variables of a dataset by running dataset_name[row_number, column_number]
. When the row or column number is left empty, the entire row/column is selected. Note that all examples presented above also works for matrices:
mat <- matrix(c(-1, 2, 0, 3), ncol = 2, nrow = 2) mat ## [,1] [,2] ## [1,] -1 0 ## [2,] 2 3 mat[1, 2] ## [1] 0
Based on variable names
To select one variable of the dataset based on its name rather than on its column number, use dataset_name$variable_name
:
dat$speed ## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 ## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
Accessing variables inside a dataset with this second method is strongly recommended compared to the first if you intend to modify the structure of your database. Indeed, if a column is added or removed in the dataset, the numbering will change. Therefore, variables are generally referred to by its name rather than by its position (column number). In addition, it is easier to understand and interpret code with the name of the variable written (another reason to call variables with a concise but clear name). There is only one reason why I would still use the column number; if the variables names are expected to change while the structure of the dataset do not change.
To select variables, it is also possible to use the select()
command from the powerful dplyr
package (for compactness only the first 6 observations are displayed thanks to the head()
command):
head(select(dat, speed)) ## speed ## 1 4 ## 2 4 ## 3 7 ## 4 7 ## 5 8 ## 6 9
This is equivalent than removing the distance variable:
head(select(dat, -dist)) ## speed ## 1 4 ## 2 4 ## 3 7 ## 4 7 ## 5 8 ## 6 9
Based on one or multiple criterion
Instead of subsetting a dataset based on row/column numbers or variable names, you can also subset it based on one or multiple criterion:
- keep only observations with speed larger than 20. The first argument refers to the name of the dataset, while the second argument refers to the subset criteria:
subset(dat, dat$speed > 20) ## speed dist ## 44 22 66 ## 45 23 54 ## 46 24 70 ## 47 24 92 ## 48 24 93 ## 49 24 120 ## 50 25 85
- keep only observations with distance smaller than or equal to 50 and speed equal to 10. Note the
==
(and not=
) for the equal criteria:
subset(dat, dat$dist <= 50 & dat$speed == 10) ## speed dist ## 7 10 18 ## 8 10 26 ## 9 10 34
- use
|
to keep only observations with distance smaller than 20 or speed equal to 10:
subset(dat, dat$dist < 20 | dat$speed == 10) ## speed dist ## 1 4 2 ## 2 4 10 ## 3 7 4 ## 5 8 16 ## 6 9 10 ## 7 10 18 ## 8 10 26 ## 9 10 34 ## 10 11 17 ## 12 12 14
- to filter out some observations, use
!=
. For instance, to keep observations with speed not equal to 24 and distance not equal to 120 (for compactness only the last 6 observations are displayed thanks to thetail()
command):
tail(subset(dat, dat$speed != 24 & dat$dist != 120)) ## speed dist ## 41 20 52 ## 42 20 56 ## 43 20 64 ## 44 22 66 ## 45 23 54 ## 50 25 85
Create a new variable
Often a dataset can be enhanced by creating new variables based on other variables from the initial dataset. In this example, we create two new variables; one being the speed times the distance (which we call speed_dist
) and the other being a categorization of the speed (which we call speed_cat
). We then display the first 6 observations of this new dataset with the 4 variables:
# create new variable speed_dist dat$speed_dist <- dat$speed * dat$dist # create new variable speed_cat # the function ifelse() below means that if dat$speed > 7, then speed_cat is "high speed", otherwise it is "low_speed" dat$speed_cat <- ifelse(dat$speed > 7, "high speed", "low speed" ) # display first 6 observations head(dat) # 6 is the default in head() ## speed dist speed_dist speed_cat ## 1 4 2 8 low speed ## 2 4 10 40 low speed ## 3 7 4 28 low speed ## 4 7 22 154 low speed ## 5 8 16 128 high speed ## 6 9 10 90 high speed
Note than in programming, a character string is generally surrounded by quotes ("character string"
).
Discretize a continuous variable
To transform a continuous variable into a discrete variable:
dat$speed_discr <- cut(dat$speed, breaks = c(0, 12, 15, 19, 26), # cut points right = FALSE # closed on the left, open on the right ) dat[c(1:2, 23:24, 49:50), ] # display some observations ## speed dist speed_dist speed_cat speed_discr ## 1 4 2 8 low speed [0,12) ## 2 4 10 40 low speed [0,12) ## 23 14 80 1120 high speed [12,15) ## 24 15 20 300 high speed [15,19) ## 49 24 120 2880 high speed [19,26) ## 50 25 85 2125 high speed [19,26)
Categorical variables and labels management
For categorical variables, it is a good practice to use the factor format and to name the different levels of the variables.
- for this example, let’s create another new variable called
dist_cat
based on the distance and then change its format from numeric to factor (while also specifying the labels of the levels):
# create new variable dist_cat dat$dist_cat <- ifelse(dat$dist < 15, 1, 2 ) # change from numeric to factor and specify the labels dat$dist_cat <- factor(dat$dist_cat, levels = c(1, 2), labels = c("small distance", "big distance") # follow the order of the levels ) head(dat) ## speed dist speed_dist speed_cat speed_discr dist_cat ## 1 4 2 8 low speed [0,12) small distance ## 2 4 10 40 low speed [0,12) small distance ## 3 7 4 28 low speed [0,12) small distance ## 4 7 22 154 low speed [0,12) big distance ## 5 8 16 128 high speed [0,12) big distance ## 6 9 10 90 high speed [0,12) small distance
- to check the format of a variable:
class(dat$dist_cat) ## [1] "factor" # or str(dat$dist_cat) ## Factor w/ 2 levels "small distance",..: 1 1 1 2 2 1 2 2 2 2 ...
This will be sufficient if you need to format only a limited number of variables. However, if you need to do it for a large amount of categorical variables, it quickly becomes time consuming to write the same code many times. As you can imagine, it possible to format many variables without having to write the entire code for each variable one by one by using the within()
command:
dat <- within(dat, { speed_cat <- factor(speed_cat, labels = c( "high speed", "low speed" )) dist_cat <- factor(dist_cat, labels = c( "small distance", "big distance" )) }) head(dat) ## speed dist speed_dist speed_cat speed_discr dist_cat ## 1 4 2 8 low speed [0,12) small distance ## 2 4 10 40 low speed [0,12) small distance ## 3 7 4 28 low speed [0,12) small distance ## 4 7 22 154 low speed [0,12) big distance ## 5 8 16 128 high speed [0,12) big distance ## 6 9 10 90 high speed [0,12) small distance str(dat) ## 'data.frame': 50 obs. of 6 variables: ## $ speed : num 4 4 7 7 8 9 10 10 10 11 ... ## $ dist : num 2 10 4 22 16 10 18 26 34 17 ... ## $ speed_dist : num 8 40 28 154 128 90 180 260 340 187 ... ## $ speed_cat : Factor w/ 2 levels "high speed","low speed": 2 2 2 2 1 1 1 1 1 1 ... ## $ speed_discr: Factor w/ 4 levels "[0,12)","[12,15)",..: 1 1 1 1 1 1 1 1 1 1 ... ## $ dist_cat : Factor w/ 2 levels "small distance",..: 1 1 1 2 2 1 2 2 2 2 ...
Recode categorical variables
It is possible to recode labels of a categorical variable if you are not satisfied with the current labels. In this example, we change the labels as follows:
- “small distance” \(\rightarrow\) “short distance”
- “big distance” \(\rightarrow\) “large distance”
dat$dist_cat <- recode(dat$dist_cat, "small distance" = "short distance", "big distance" = "large distance" ) head(dat) ## speed dist speed_dist speed_cat speed_discr dist_cat ## 1 4 2 8 low speed [0,12) short distance ## 2 4 10 40 low speed [0,12) short distance ## 3 7 4 28 low speed [0,12) short distance ## 4 7 22 154 low speed [0,12) large distance ## 5 8 16 128 high speed [0,12) large distance ## 6 9 10 90 high speed [0,12) short distance< !-- This will be sufficient if you need to recode only one variable. However, if you need to recode more than one variables without having to recode them one by one, another piece of code is necessary: --> < !-- ```{r} --> < !-- dat2 < - select(dat, c(speed_cat, dist_cat)) --> < !-- dat2$speed_cat < - as.factor(dat$speed_cat) --> < !-- dat2 < - apply(dat2, 2, function(x) { --> < !-- x < - recode( --> < !-- x, --> < !-- "'small distance' = 'short distance'; --> < !-- 'big distance' = 'large distance'; --> < !-- 'low speed' = 'slow'; --> < !-- 'high speed' = 'fast'" --> < !-- ) --> < !-- x --> < !-- }) --> < !-- ``` -->
Change reference level
For some analyses, you might want to change the order of the levels. For example, if you analyze data about a control group and a treatment group, you may want to set the control group as the reference group. By default, levels are ordered by alphabetical order or by its numeric value if it was change from numeric to factor.
- to check the current order of the levels (the first level being the reference):
levels(dat$dist_cat) ## [1] "short distance" "large distance"
In this case, “short distance” being the first level it is the reference level. It is the first level because it was initially set with a value equal to 1 when creating the variable.
- to change the reference level:
dat$dist_cat <- relevel(dat$dist_cat, ref = "large distance") levels(dat$dist_cat) ## [1] "large distance" "short distance"
Large distance is now the first and thus the reference level.
Rename variable names
To rename variable names, use the rename()
command from the dplyr
package as follows:
- dist \(\rightarrow\) distance
- speed_dist \(\rightarrow\) speed_distance
- dist_cat \(\rightarrow\) distance_cat
dat <- rename(dat, distance = dist, speed_distance = speed_dist, distance_cat = dist_cat ) names(dat) # display variable names ## [1] "speed" "distance" "speed_distance" "speed_cat" ## [5] "speed_discr" "distance_cat"
Create a dataframe manually
Although most analyses are performed on an imported dataset, it is also possible to create a dataframe directly in R:
# Create the data frame named dat dat <- data.frame( "variable1" = c(6, 12, NA, 3), # presence of 1 missing value "variable2" = c(3, 7, 9, 1), stringsAsFactors = FALSE ) # Print the data frame dat ## variable1 variable2 ## 1 6 3 ## 2 12 7 ## 3 NA 9 ## 4 3 1
Missing values
Missing values (represented by NA in RStudio, for “Not Applicable”) are often problematic for many analyses. For instance, the mean of a series or variable with at least one NA will give a NA (the dataframe created in the previous section is used for this example):
mean(dat$variable1) ## [1] NA
It is however possible to compute most measures for variables including at least one NA thanks to the argument na.rm = TRUE
:
mean(dat$variable1, na.rm = TRUE) ## [1] 7
Nonetheless, datasets with NAs are still problematic for some types of analysis. Several alternatives exist to remove or impute missing values.
Remove NAs
A simple solution is to remove all observations (i.e., rows) containing at least one missing value. This is done by keeping observations with complete cases:
dat_complete <- dat[complete.cases(dat), ] dat_complete ## variable1 variable2 ## 1 6 3 ## 2 12 7 ## 4 3 1
Be careful before removing observations with missing values, especially if missing values are not “missing at random”. This is, however, beyond the scope of the present article.
Impute NAs
Instead of removing observations with at least one NA, it is possible to impute them, that is, replace them by some values such as the median or the mode of the variable. This can be done easily with the command impute()
from the package imputeMissings
:
library(imputeMissings) dat_imputed <- impute(dat) # default method is median/mode dat_imputed ## variable1 variable2 ## 1 6 3 ## 2 12 7 ## 3 6 9 ## 4 3 1
When the median/mode method is used (the default), character vectors and factors are imputed with the mode. Numeric and integer vectors are imputed with the median. Again, use imputations carefully. Other packages offer more advanced imputation techniques. However, we keep it simple and straightforward for this article as advanced imputations is beyond the scope of introductory data manipulations in R.
Thanks for reading. I hope this article helped you to manipulate your data in RStudio. See other articles related to R here. As always, if you find a mistake/bug or if you have any questions do not hesitate to let me know in the comment section below, raise an issue on GitHub or contact me. Get updates every time a new article is published by subscribing to this blog.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.