
R Training – Data Manipulation


This is the third module of a five-module training on R I conceived and taught to my ex-colleagues back in December 2016. RStudio is the suggested IDE to go through this module. For the raw code, example data and resources visit this repo on GitHub.

Setup

Load the packages in your workspace

library(data.table)
library(dplyr)

Let’s work with iris data

data("iris")

Convert to data.table

data.table is an efficient package to perform data manipulation. It is often convenient to convert a data.frame into a data.table to leverage the advanced data manipulation features of this package.

dt <- data.table(iris)

The good thing is that the data.frame class is not lost in the conversion:

class(dt)

Be careful when calling functions that expect one class or the other (e.g. ggplot2).
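To see why this matters, here is a small illustrative sketch: the same subsetting call means different things for the two classes.

iris[1:3] # data.frame: selects the first three COLUMNS
dt[1:3]   # data.table: selects the first three ROWS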

 

Removing NAs

Depending on the study we may want to simply exclude missing values.

iris <- iris[complete.cases(iris),]
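An equivalent base R shortcut, for reference, is na.omit(), which also drops every row containing at least one NA:

iris <- na.omit(iris)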

 

Coalescing NAs to 0s

Another option is to replace NAs with 0s. This function will do the trick using data.table syntax (set() modifies the table in place, by reference).

replaceNAs <- function(DT) {
  # for each column, find the NA positions and overwrite them with 0 by reference
  for (j in seq_len(ncol(DT)))
    set(DT, which(is.na(DT[[j]])), j, 0)
}
replaceNAs(iris)
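For comparison, a minimal dplyr sketch of the same idea (this assumes dplyr >= 1.0.0 for across() and where(); coalesce() returns its first argument with NAs replaced by the fallback value):

# purely illustrative: replace NAs with 0 in every numeric column
dt_filled <- dt %>%
  mutate(across(where(is.numeric), ~ coalesce(.x, 0)))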

 

data.table presentation

data.table is a package to handle data frames. It is particularly appreciated for its speed when operating on large data sets. It also makes data manipulation very easy for those who know the basics of SQL, since a data.table query has the general form DT[i, j, by]: i selects rows (like SQL's WHERE), j selects or computes columns (SELECT) and by defines the grouping (GROUP BY).
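A minimal sketch of this general form in action, using the dt object created above:

# DT[i, j, by]  ~  SELECT j FROM DT WHERE i GROUP BY by
dt[Species == "setosa", mean(Sepal.Length)]                # i filters rows, j computes on them
dt[, .(avg_sep_length = mean(Sepal.Length)), by = Species] # j + by: one summary per species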

 

dplyr presentation

Quoting from CRAN, dplyr is "a fast, consistent tool for working with data-frame-like objects". Its main verbs for manipulating data are select(), filter(), mutate(), arrange(), summarise() and group_by(), but there are also lesser-known tricks you can do with this package. Here we'll cover just a few examples to give you an idea of how it works.

Another awesome thing about dplyr is that it works seamlessly with the pipe operator.

The pipe operator %>% from the magrittr package (installed along with dplyr) allows you to create pipes: chains that take the output of one function and feed it as the first argument of the next. This makes queries on data frames very elegant and concise.
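For example, the following two calls are equivalent:

head(iris, 3)    # classic function call
iris %>% head(3) # the pipe feeds iris into head() as its first argument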

 

Filtering rows

data.table way

iris_setosa <- dt[Species=="setosa"] # only 'setosa' rows
iris_big <- dt[Petal.Width>2] # only rows with petal width greater than 2
iris_big <- dt[Petal.Width>2 | Petal.Length>3] # only rows with petal width greater than 2 or petal length greater than 3
setosaBig <- dt[Species=="setosa" & Petal.Width>0.5] # only 'setosa' rows with petal width greater than 0.5
iris_sub <- dt[Species %in% c("setosa", "virginica")] # only rows setosa or virginica

dplyr way

setosa <- dt %>% 
  filter(Species == "setosa") # only rows setosa
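Multiple conditions can be combined inside filter() too; a small sketch mirroring the data.table examples above (the object name is just illustrative):

setosa_big_dplyr <- dt %>%
  filter(Species == "setosa", Petal.Width > 0.5) # comma-separated conditions are combined with AND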

Subsetting columns

data.table way

keep <- c("Sepal.Length", "Petal.Length", "Species")
iris_keep <- dt[, keep, with = FALSE] # with = FALSE makes data.table treat 'keep' as a character vector of column names

dplyr way by name of variables

iris_keep_dplyr <- select(dt,Sepal.Length,Petal.Length,Species)

or using the pipeline operator

iris_keep_dplyr <- dt %>% select(Sepal.Length,Petal.Length,Species)

or by position of variables

iris_keep_dplyr <- dt %>% select(1,3,5)
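select() also understands column ranges and helpers such as starts_with(); a quick sketch:

iris_petal <- dt %>% select(starts_with("Petal"), Species) # columns whose name starts with "Petal", plus Species
iris_range <- dt %>% select(Sepal.Length:Petal.Length)     # a contiguous range of columns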

Group by

data frame way

tapply(dt$Petal.Length, INDEX = dt$Species,  FUN = sum)

data.table way

dt[,.(sum_pet_length=sum(Petal.Length)),by=Species]

dplyr way

dt %>% 
  group_by(Species) %>% 
  summarise(sum_pet_length = sum(Petal.Length))

Group by several variables, data.table way

head(dt[, list(avg_pet_width = mean(Petal.Width)),
        by = .(Species, Sepal.Length)])

Group by several variables, dplyr way

dt %>% 
  group_by(Species, Sepal.Length) %>% 
  summarise(avg_pet_width = mean(Petal.Width)) %>%
  head()
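Counting rows per group is another very common pattern; data.table has the special symbol .N and dplyr the function n(). A quick sketch:

dt[, .N, by = Species]                          # data.table: number of rows per species
dt %>% group_by(Species) %>% summarise(n = n()) # dplyr equivalent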

Merging tables

Typically a complex analysis will involve more than one table, and often we will need to merge them to perform our analyses. When the tables share the same columns or the same rows we may simply need to bind them.

Binding

Same columns: row binding

# Same columns, row binding
dt_1 <- dt
dt_total <- rbind(dt, dt_1 )

Same rows: column binding

vector <- 1:nrow(dt)
dt_total <- cbind(dt, vector)
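If you have many tables collected in a list, you don't need to bind them one by one: data.table::rbindlist() and dplyr::bind_rows() both accept a whole list (a sketch, reusing the objects above):

dt_list <- list(dt, dt_1)
dt_total_dt <- rbindlist(dt_list)    # data.table way
dt_total_dplyr <- bind_rows(dt_list) # dplyr way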

Joining

More often what we would like to do is to merge different tables based on the values of some common variable.

  • to explore left_outer_join we make up a dataset with some info (totally fake!) on iris species

infoIris <- data.frame(s = levels(iris$Species), discovery = c(1950, 1960, 1961), color = c("red", "yellow", "brown"))

  • to explore inner_join we make up another dataset

flowers <- data.frame(Species = c("setosa", "virginica", "tulipan", "margherita"), n = 1:4)

Joining with base functions

  • with the base merge() function the key variable has to have the same name in both tables

names(infoIris)[1] <- "Species"
left_join_base <- merge(dt, infoIris, by = "Species", all.x = T)

  • Inner join

inner_join <- merge(dt, flowers, by = "Species")

  • Right outer join

right_join_base <- merge(dt, flowers, by = "Species", all.y = T)
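For completeness, merge() can also perform a full outer join, keeping unmatched rows from both tables, by setting all = TRUE:

full_join_base <- merge(dt, flowers, by = "Species", all = TRUE)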

Joining, data.table way

  • Left outer join

setkey(dt, "Species") # We need to assign a key to join as sql way before properly doing it
infoIris_dt <- data.table(infoIris)
setkey(infoIris_dt, "Species") # event though in this case they already coincide with data.table there's no need for that because we are explicitly setting a common key
left_outer_join <- dt[infoIris_dt]

  • Inner join

flowers_dt <- data.table(flowers)
setkey(flowers_dt, "Species") # with data.table there is no need to rename columns as with base merge(): we simply set the key on both tables
inner_join <- dt[flowers_dt, nomatch = 0]

  • Right outer join

right_outer_join <- flowers_dt[dt] # X[Y] keeps all rows of Y, so here every row of dt is retained
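Note that flowers_dt[dt] keeps all rows of dt (versicolor simply gets NA for n). To reproduce the base merge(dt, flowers, all.y = TRUE) above, which keeps all rows of flowers, swap the two tables (the object name below is just illustrative):

right_outer_join_flowers <- dt[flowers_dt] # keeps tulipan and margherita, with NAs for the iris measurements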

Joining, dplyr way

  • left join

left_join_dplyr <- left_join(dt, infoIris, by = "Species")

  • inner join

inner_join_dplyr <- inner_join(dt, flowers, by = "Species")

  • right join

right_join_dplyr <- right_join(dt, flowers, by = "Species")

dplyr anti join

Suppose there is one full dataset (‘A’) and another that is a subset of the first (‘B’). A nice way to get only the rows of A that are not in B is the following:

t <- anti_join(dt, flowers, by = "Species")
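For reference, the data.table counterpart uses ! in i (relying on the keys set earlier), and dplyr also offers semi_join() for the opposite question, i.e. the rows of A that do have a match in B:

dt[!flowers_dt]                        # data.table anti join: rows of dt with no match in flowers_dt
semi_join(dt, flowers, by = "Species") # rows of dt that do have a match in flowers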

 

That’s it for this module! If you have gone through all this code you should have learnt the basics of two fundamental R packages for data manipulation: data.table and dplyr.

When you’re ready, go ahead with the fourth module: R training – data visualization.

 
