[This article was first published on business-science.io, and kindly contributed to R-bloggers.]
There is no doubt that the tidyverse, an opinionated collection of R packages, offers attractive, intuitive ways of wrangling data for data science. In earlier versions of the tidyverse, some elements of user control were sacrificed in favor of simpler functions that rookies could pick up and use easily. The 2020 updates to dplyr and tidyr made progress toward restoring some of that finer control.
This means that there are new methods available in the tidyverse that some may not be aware of. These methods let you transform your data directly into the shape you want and perform operations more flexibly. They also provide new, more readable ways to perform common tasks like nesting, modeling and graphing. Often users are only just scratching the surface of what can be done with the latest updates to this important set of packages.
It’s incumbent on any analyst to stay up to date with new methods. This post covers ten examples of approaches to common data tasks that are better served by the latest tidyverse updates. We will use the new Palmer Penguins dataset, a great all-round dataset for illustrating data wrangling.
First let’s load our tidyverse packages and the Palmer Penguins dataset and take a quick look at it. Please be sure to install the latest versions of these packages before trying to replicate the work here.
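The loading step can be sketched as follows. This is a minimal version, assuming the tidyverse and palmerpenguins packages are already installed from CRAN; the install line is left commented out.

```r
# install.packages(c("tidyverse", "palmerpenguins"))  # run once if needed
library(tidyverse)       # attaches dplyr, tidyr, ggplot2 and friends
library(palmerpenguins)  # provides the `penguins` data frame

# Quick look at the column names, types and first few values
glimpse(penguins)
```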
The dataset records bill, flipper and body mass measurements for penguins of different species, sexes and islands, along with the year in which the measurements were taken.
1. Selecting columns
tidyselect helper functions are now built in to allow you to save time by selecting columns using dplyr::select() based on common conditions. In this case, if we want to reduce the dataset to just bill measurements we can use this, noting that all measurement columns contain an underscore:
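A minimal sketch of this kind of selection, assuming we want to keep the non-measurement columns (those without an underscore) alongside the bill columns. The name bill_only is only illustrative.

```r
library(dplyr)
library(palmerpenguins)

bill_only <- penguins %>%
  select(
    !contains("_"),       # keep columns that are not measurements
    starts_with("bill")   # add back only the bill measurements
  )

names(bill_only)
```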
A full set of tidyselect helper functions can be found in the tidyselect documentation.
2. Reordering columns
dplyr::relocate() offers a new way to reorder specific columns or sets of columns. For example, if we want to make sure that all of the measurement columns are at the end of the dataset, we can use this, noting that the last column is year:
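One way this could look, as a sketch that moves every column containing an underscore to sit after year. The name penguins_reordered is only illustrative.

```r
library(dplyr)
library(palmerpenguins)

penguins_reordered <- penguins %>%
  relocate(contains("_"), .after = year)  # measurement columns go last

names(penguins_reordered)
```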
Similar to .after you can also use .before as an argument here.
3. Controlling mutated column locations
Note that the penguins dataset has no unique identifier for the individual penguins within each study group. This can be problematic when we have multiple penguins of the same species, island, sex and year in the dataset. To address this and prepare for later examples, let’s add an identifier using dplyr::mutate(), which also lets us illustrate how mutate() now allows us to position the new column in a similar way to relocate():
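A sketch of one way to do this, assuming the identifier is simply a running number within each combination of species, island, sex and year. The column name studygroupid is an illustrative choice, and the final ungroup() is added here for tidiness.

```r
library(dplyr)
library(palmerpenguins)

penguins_id <- penguins %>%
  group_by(species, island, sex, year) %>%
  mutate(
    studygroupid = row_number(),  # 1, 2, 3, ... within each study group
    .before = contains("_")       # place it ahead of the measurement columns
  ) %>%
  ungroup()

names(penguins_id)
```

Together with the four grouping columns, the new identifier distinguishes every row.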
4. Transforming from wide to long
The penguins dataset is clearly in a wide form, as it gives multiple observations across the columns. For many reasons we may want to transform data from wide to long. In long data, each observation has its own row. The older function gather() in tidyr was popular for this sort of task, but its successor pivot_longer() is even more powerful. In this case we have different body parts, measures and units inside these column names, but we can break them out very simply like this:
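A self-contained sketch of the reshaping, assuming a within-group identifier (here called studygroupid, an illustrative name without an underscore so that it is not caught by the column selection) has been added first so each penguin stays identifiable:

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_long <- penguins %>%
  group_by(species, island, sex, year) %>%
  mutate(studygroupid = row_number()) %>%  # illustrative identifier
  ungroup() %>%
  pivot_longer(
    contains("_"),                            # the four measurement columns
    names_to = c("part", "measure", "unit"),  # e.g. "bill", "length", "mm"
    names_sep = "_"                           # split the names on underscores
  )

head(penguins_long)
```

Each penguin now contributes four rows, one per measurement, with the measured value in the default value column.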
5. Transforming from long to wide
It’s just as easy to move back from long to wide. pivot_wider() gives much more flexibility compared to the older spread():
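The round trip can be sketched like this. The long table is rebuilt first so the example runs on its own; studygroupid is again an illustrative identifier that keeps rows uniquely identifiable when pivoting back.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

# Rebuild a long table so this example is self-contained
penguins_long <- penguins %>%
  group_by(species, island, sex, year) %>%
  mutate(studygroupid = row_number()) %>%
  ungroup() %>%
  pivot_longer(contains("_"),
               names_to = c("part", "measure", "unit"),
               names_sep = "_")

penguins_wide <- penguins_long %>%
  pivot_wider(
    names_from = c(part, measure, unit),  # recombine these into column names
    values_from = value,                  # fill the new columns from here
    names_sep = "_"                       # join the name parts with underscores
  )

names(penguins_wide)
```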
6. Running group statistics across multiple columns
dplyr can now apply multiple summary functions to grouped data using the across() adverb, helping you be more efficient. If we wanted to summarize all bill and flipper measurements in our penguins we would do this:
7. Controlling output column names when summarising columns
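A sketch of such a summary, assuming the mean and standard deviation are the statistics of interest and that missing values should be ignored:

```r
library(dplyr)
library(palmerpenguins)

penguin_stats <- penguins %>%
  group_by(species) %>%
  summarize(across(
    ends_with("mm"),                  # bill and flipper measurements
    list(~ mean(.x, na.rm = TRUE),    # first function: mean
         ~ sd(.x, na.rm = TRUE))      # second function: standard deviation
  ))

# Unnamed functions get positional suffixes such as bill_length_mm_1
names(penguin_stats)
```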
The columns in penguin_stats have been given default names which are not that intuitive. If we name our summary functions, we can then use the .names argument to control precisely how we want these columns named. This uses glue notation. For example, here we want to construct the new column names by taking the existing column names, removing any underscores and the ‘mm’ unit suffix, and pasting the result to the summary function name using an underscore:
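A sketch of the naming approach, using the documented {.col} and {.fn} glue placeholders. The regular expression is an assumption about how to strip the unit suffix and the underscores, and other patterns would work equally well.

```r
library(dplyr)
library(palmerpenguins)

penguin_stats <- penguins %>%
  group_by(species) %>%
  summarize(across(
    ends_with("mm"),
    list(mean = ~ mean(.x, na.rm = TRUE),  # named summary functions
         sd   = ~ sd(.x, na.rm = TRUE)),
    # drop a trailing "_mm" and any other underscores, then append the
    # function name, e.g. bill_length_mm -> billlength_mean
    .names = "{gsub('_mm$|_', '', .col)}_{.fn}"
  ))

names(penguin_stats)
```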
8. Running models across subsets
The output of summarize() can now be literally anything, because dplyr now allows different column types. We can generate summary vectors, dataframes or other objects like models or graphs.
If we wanted to run a model for each species we could do it like this:
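A sketch of per-species modeling, assuming a linear model of body mass on the three length measurements; the choice of formula is illustrative. Wrapping the model in list() lets summarize() store it in a list column, and the formula's variables are resolved against the current group through dplyr's data masking.

```r
library(dplyr)
library(palmerpenguins)

penguin_models <- penguins %>%
  group_by(species) %>%
  summarize(
    # one fitted lm object per species, stored in a list column
    model = list(
      lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm)
    )
  )

penguin_models
```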
It’s not usually that useful to keep model objects in a dataframe, but we could use other tidy-oriented packages to summarize the statistics of the models and return them all as nicely integrated dataframes:
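One such package is broom. As a sketch, assuming the same illustrative model formula, broom::glance() returns a one-row data frame of fit statistics, and an unnamed data frame inside summarize() is unpacked into ordinary columns:

```r
library(dplyr)
library(broom)
library(palmerpenguins)

penguin_model_stats <- penguins %>%
  group_by(species) %>%
  summarize(
    # glance() gives one row of model statistics per species
    glance(lm(body_mass_g ~ flipper_length_mm + bill_length_mm + bill_depth_mm))
  )

penguin_model_stats
```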
9. Nesting data
Often we have to work with subsets, and it can be useful to apply a common function across all subsets of the data. For example, maybe we want to take a look at our different species of penguins and make some different graphs of them. Grouping based on subsets would previously be achieved by the following somewhat awkward combination of tidyverse functions.
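A sketch of that older pattern, assuming we nest by species. The name penguins_nested_old is only illustrative.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

penguins_nested_old <- penguins %>%
  group_by(species) %>%
  nest() %>%     # one row per species, remaining columns packed into `data`
  rowwise()      # treat each row as its own group for later operations

penguins_nested_old
```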
The new function nest_by() provides a more intuitive way to do the same thing:
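As a minimal sketch, again nesting by species (penguins_nested is an illustrative name):

```r
library(dplyr)
library(palmerpenguins)

penguins_nested <- penguins %>%
  nest_by(species)  # row-wise data frame with one nested tibble per species

penguins_nested
```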
The nested data will be stored in a column called data unless we specify otherwise using a .key argument.
10. Graphing across subsets
Armed with nest_by() and the fact that we can now summarize or mutate virtually any type of object, we can generate graphs across subsets and store them in a dataframe for later use. Let’s scatter plot bill length and depth for our three penguin species:
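A sketch of how the plots could be built and stored, assuming a plain scatter plot with the species name as its title. The column name plot is an illustrative choice.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

penguin_scatters <- penguins %>%
  nest_by(species) %>%
  mutate(plot = list(
    # `data` is the nested tibble for the current species
    ggplot(data, aes(x = bill_length_mm, y = bill_depth_mm)) +
      geom_point(na.rm = TRUE) +          # silently skip missing measurements
      labs(title = as.character(species))
  ))

penguin_scatters
```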
Now we can easily display the different scatter plots to show, for example, that our penguins exemplify Simpson’s Paradox:
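One way to lay the plots out side by side is the patchwork package, which is an assumption here rather than something the text prescribes. The sketch below rebuilds the per-species plots so it runs on its own, adds a linear trend line to each, and places them next to a plot of all species combined; scatter() is a hypothetical helper defined only for this example.

```r
library(dplyr)
library(ggplot2)
library(patchwork)
library(palmerpenguins)

# Hypothetical helper: scatter plot of bill depth against bill length
# with a fitted straight line
scatter <- function(df, title) {
  ggplot(df, aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point(na.rm = TRUE) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE, na.rm = TRUE) +
    labs(title = title)
}

penguin_scatters <- penguins %>%
  nest_by(species) %>%
  mutate(plot = list(scatter(data, as.character(species))))

all_species <- scatter(penguins, "All species")

# Combine the overall plot with the three per-species plots
combined <- wrap_plots(c(list(all_species), penguin_scatters$plot))
combined
```

The trend line slopes downward in the combined plot but upward within each species, which is the reversal that characterizes Simpson’s Paradox.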
Author: Jim Gruman, Data Analytics Leader
Serving enterprise needs with innovators in mobile power, decision intelligence, and product management, Jim can be found at https://jimgruman.netlify.app.