Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This post follows a previous tutorial on pivoting on text in Excel. In this post I will reproduce the exercise in R. This way you begin to see the similarities and differences of the program and begin to diversify your data skill base.
Related: 5 Things Excel Users Should Know About R — Free Mini-Course
Similar to the previous lesson, we will be using the 2017 .csv version of the Lahman baseball dataset. We will again be using the People.csv and AwardsPlayers.csv files.
1. Set up our environment
In addition to reading in the two .csv files, we need to call two libraries for data manipulation: plyr and dplyr.
library(plyr) library(dplyr) players <- read.csv("C:/RFiles/people.csv") awards <- read.csv("C:/RFiles/awardsplayers.csv")
2. Filter and merge tables
Using the filter and select functions from dplyr, we create a table containing all awards information since 2010 by player ID, along with that player’s first and last name.
I include the nrow function to verify that indeed the resulting table in R has the same number of rows as the Excel table: 502.
awards <- filter(awards, yearID >= 2010) players <- select(players, playerID, nameFirst, nameLast) tbl <- merge(awards, players, all.x = TRUE) nrow(tbl)
3. Derive “First Name Last Name” column
Using the paste function with a space delimiter, we create a new column containing each player’s first and last name.
tbl$fullname <- paste(tbl$nameFirst, tbl$nameLast, sep = " ")
4. Pivot and view results
Now we use the ddply function from plyr to pivot on text. Similar to creating a measure in the Excel data model (Step 5 of the Excel post), we create a table summarized by awardID and yearID with the player name’s concatenated with a space delimiter (the paste function in R).
The View function then launches a spreadsheet-like viewer of the data frame.
x <- ddply(tbl, .(awardID, yearID), summarize, names = paste(fullname, collapse = ", ")) View(x)
Below I show the results in R and Excel — they are the same. Excel, however, does include a “subtotal” measure where R does not, but the base information is the same.
Complete code below:
#read in files and call libraries library(plyr) library(dplyr) players <- read.csv("C:/RFiles/people.csv") awards <- read.csv("C:/RFiles/awardsplayers.csv") #filter and merge data frames awards <- filter(awards, yearID >= 2010) players <- select(players, playerID, nameFirst, nameLast) tbl <- merge(awards, players, all.x = TRUE) nrow(tbl) #create full-name column tbl$fullname <- paste(tbl$nameFirst, tbl$nameLast, sep = " ") #pivot on full-name text x <- ddply(tbl, .(awardID, yearID), summarize, names = paste(fullname, collapse = ", ")) #launch spreadsheet-like viewer of data frame View(x)
Related: 5 Things Excel Users Should Know About R — Free Mini-Course
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.