Descriptive Analytics Part 4: Data Manipulation

[This article was first published on R-exercises, and kindly contributed to R-bloggers.]

Descriptive Analytics is the examination of data or content, usually manually performed, to answer the question “What happened?”.

To solve this set of exercises you should have completed part 0, part 1, part 2, and part 3 of this series. You should also run this script, which contains some more data cleaning. If you haven’t solved the earlier parts, run this script on your machine; it contains the lines of code we used to modify our data set.

This is the fifth set in a series of exercises that aims to provide a descriptive analytics solution to the ‘2008’ data set from here. This data set, which contains the arrival and departure information for all domestic flights in the US in 2008, has become the “iris” data set for Big Data.

Descriptive analytics is all about answering questions, and the goal of this set of exercises is to ‘answer’ questions with very few lines of code using the dplyr package. dplyr is a great package for data manipulation (if you are familiar with SQL, it will be a piece of cake for you). Before proceeding, it might be helpful to look over the help pages for select, contains, filter, summarise, mutate, group_by, and arrange.

For this set of exercises you will need to install and load the packages dplyr and chron.

install.packages('dplyr')
library(dplyr)
install.packages('chron')
library(chron)
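
With dplyr loaded, it can help to try the verbs named above on a small made-up data frame before touching the flights data. The sketch below is only an illustration: the data frame `toy` and all of its column names are hypothetical and do not come from the ‘2008’ data set.

```r
library(dplyr)

# Hypothetical toy data; these column names are NOT from the flights data set
toy <- data.frame(
  shop      = c("A", "A", "B", "B", "C"),
  price_eur = c(10, 20, 5, 15, 30),
  price_usd = c(11, 22, 5.5, 16.5, 33),
  units     = c(3, 1, 4, 2, 1)
)

# select() keeps columns; contains() matches column names by substring
prices <- select(toy, shop, contains("price"))

# filter() keeps the rows that satisfy a condition
big_orders <- filter(toy, units > 1)

# group_by() + summarise() collapse each group to one row; arrange() sorts
per_shop <- toy %>%
  group_by(shop) %>%
  summarise(mean_price = mean(price_eur), n_rows = n()) %>%
  arrange(desc(mean_price))

# mutate() adds a column computed from existing ones
with_revenue <- mutate(toy, revenue = price_eur * units)
```

Each verb takes a data frame as its first argument and returns a new one, which is why the calls chain naturally with `%>%`.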

Since we use the dplyr package, we will also make our data frame a local data frame. (In dplyr 1.0.0 and later, tbl_df() is deprecated in favour of as_tibble().)
flights <- tbl_df(flights)
We do that because a local data frame has some useful properties. First of all, if we (accidentally) type ‘flights’ when it is a local data frame, R prints only the first 10 rows and the columns that fit on the screen. A regular data frame prints as many rows as the max.print option allows, which floods the console and, for a data set this large, can be slow. Another reason is that when we type the name of a local data frame, the output also shows the number of rows and columns and the type of each column.
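
The printing difference is easy to see on a small stand-in. This is a minimal sketch, assuming a recent dplyr where `as_tibble()` replaces `tbl_df()`; the data frame `df` is made up for the illustration and is not the flights data.

```r
library(dplyr)

# Hypothetical stand-in for the flights data: 1000 rows, 3 columns
df <- data.frame(
  id = 1:1000,
  x  = rnorm(1000),
  y  = rep(c("a", "b"), 500)
)

# Convert to a local data frame (tbl_df(df) in older dplyr versions)
local_df <- as_tibble(df)

class(df)        # "data.frame"
class(local_df)  # "tbl_df" "tbl" "data.frame"

# Prints the dimensions, the column types and only the first 10 rows
print(local_df)

# Ask for a different number of rows when needed
print(local_df, n = 3)

# One line per column, with its type and first few values
glimpse(local_df)
```

The local data frame is still a data frame, so the rest of your code keeps working; only the printing behaviour changes.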

Answers to the exercises are available here.

If you obtained a different (correct) answer than those listed on the solutions page, please feel free to post your answer as a comment on that page.

Exercise 1
Print the destination, the arrival delay and the air time of each flight.
Hint: Use the select function.

Exercise 2
Print the columns whose names contain the word ‘Delay’.

Exercise 3
Print the carrier name, the month and the day of the week of the flights whose carrier delay is higher than 180.

Exercise 4
Print out all the flights grouped by carrier names.

Exercise 5
Print out the mean of the arrival delay using the summarise function.

Exercise 6
Print out the minimum, mean, median, variance, standard deviation, maximum, and count of AirTime.

Exercise 7
Print out the mean delay and the number of flights of each carrier.

Exercise 8
Print out the mean delay and the number of flights of each carrier in descending order.

Exercise 9
This exercise is a bit out of context, but it will demonstrate a great way of manipulating data and it is a prerequisite for the next exercise.

Create a new column Full_Date which will contain the date of each flight and then print it out.
Hint: Use the mutate function.
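
As a rough illustration of the hint, the sketch below builds a date column from separate year, month and day columns on a made-up data frame. The column names `yr`, `mon`, `dy` and `full_date` are hypothetical, and the sketch uses base R's `as.Date()` rather than the chron package, so it is not necessarily the approach taken on the solutions page.

```r
library(dplyr)

# Hypothetical toy data; column names are NOT those of the flights data set
toy <- data.frame(
  yr  = c(2008L, 2008L),
  mon = c(1L, 12L),
  dy  = c(3L, 25L)
)

# Paste the parts into an ISO "YYYY-MM-DD" string, then convert to Date
toy <- mutate(
  toy,
  full_date = as.Date(sprintf("%04d-%02d-%02d", yr, mon, dy))
)

print(toy)
```

Storing the result as a real Date, not a string, means it sorts and groups chronologically.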

Exercise 10
Print out the dates that had the most flights and then print out the dates that had the highest ratio of cancelled flights.
