Data wrangling: Reshaping
Data wrangling is a task of great importance in data analysis. It is the process of importing, cleaning, and transforming raw data into actionable information for analysis. It is time-consuming and is estimated to take about 60-80% of an analyst’s time. This brief series walks through that process, with the goal of building the reader’s data wrangling skills. This second part covers reshaping data into a tidy form. By tidy form, we mean that each feature forms a column and each observation forms a row.
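As a rough illustration of what "tidy form" means, the sketch below reshapes a small made-up data frame. The data frame and its column names (city, jan, feb, month, temp) are hypothetical and are not part of the exercise data; it assumes the tidyr package is already installed.

```r
# Hypothetical wide data: one column per month, so "month" is hidden in the headers
wide <- data.frame(
  city = c("A", "B"),
  jan  = c(1, 2),
  feb  = c(3, 4),
  stringsAsFactors = FALSE
)

# Gather the month columns into key/value pairs:
# each row of the result is now a single (city, month) observation
long <- tidyr::gather(wide, key = "month", value = "temp", jan, feb)

print(long)
```

In the reshaped data frame each feature (city, month, temp) has its own column and each observation has its own row, which is the layout the exercises below aim for.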
Before proceeding, it might be helpful to look over the help pages for the spread, gather, unite, separate, replace_na, fill, and extract_numeric functions.
Moreover, please install (if you have not already) and load the following libraries.
install.packages("magrittr")
library(magrittr)
install.packages("tidyr")
library(tidyr)
Please run the code below in order to load the data set:
data <- airquality[4:6]
Answers to the exercises are available here.
If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.
Exercise 1
Print out the structure of the data frame.
Exercise 2
Let’s turn the data frame above into a wider form: turn the values of the Month variable into column headings and spread the Temp values across the months they relate to.
Exercise 3
Turn the wide (exercise 2) data frame back into its initial format using the gather function. Specify the columns you would like to gather by index number.
Exercise 4
Turn the wide (exercise 2) data frame back into its initial format using the gather function. Specify the columns you would like to gather by column name.
Exercise 5
Turn the wide (exercise 2) data frame back into its initial format using the gather function. This time, specify the columns by naming the remaining columns (the ones you do not gather).
Exercise 6
Unite the variables Day and Month into a new feature named Date with the format %d-%m.
Exercise 7
Restore the data frame from exercise 6 to its previous format: separate the variable you created before (Date) into Day and Month.
Exercise 8
Replace the missing values (NA) with 'Unknown'.
Exercise 9
Run the script below to create a new feature named year.
back2long_na$year <- rep(NA, nrow(back2long_na))
back2long_na$year[1] <- '2015'
back2long_na$year[as.integer(nrow(back2long_na)/3)] <- '2016'
back2long_na$year[as.integer(2*nrow(back2long_na)/3)] <- '2017'
You will have noticed that the new column has many missing values. Fill each NA with the non-missing value right above it (e.g. the NAs between the ‘2016’ and ‘2017’ values are assigned ‘2016’).
Hint: use the fill function.
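To show the behaviour the hint refers to, here is a minimal sketch on a made-up data frame rather than on the exercise data. The names df and filled and the values are hypothetical; it assumes tidyr is installed and relies on fill carrying values downward by default.

```r
# Hypothetical data: a year label is recorded only on the first row of each block
df <- data.frame(
  id   = 1:6,
  year = c("2015", NA, NA, "2016", NA, NA),
  stringsAsFactors = FALSE
)

# Carry the last non-missing value downward into the NAs that follow it
filled <- tidyr::fill(df, year)

print(filled$year)
# "2015" "2015" "2015" "2016" "2016" "2016"
```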
Exercise 10
Extract the numeric values from the Temp feature.
Hint: use extract_numeric. This function is very useful when the variable is a character with ‘noise’, for example ‘$40’, and you want to transform it to 40.
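The kind of clean-up described in the hint can be sketched in base R as below. This is an illustration of the idea, not the implementation of extract_numeric itself; the vector prices and its values are made up. Note that newer tidyr releases mark extract_numeric as deprecated and point to readr::parse_number() instead.

```r
# Hypothetical character values containing non-numeric "noise"
prices <- c("$40", "1,250 USD", "-3.5%")

# Drop every character that is not a digit, a decimal point, or a minus sign,
# then convert what is left to numeric
cleaned <- as.numeric(gsub("[^0-9.-]+", "", prices))

print(cleaned)
# values: 40, 1250, -3.5
```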