
Simple practice: data wrangling the iris dataset

[This article was first published on r-bloggers – SHARP SIGHT LABS, and kindly contributed to R-bloggers.]

In last week’s post, I emphasized the importance of practicing R and the Tidyverse with small, simple problems, drilling them until you are competent.

In that post, I gave you a few very small scripts to practice (which I suggest that you memorize).

This week, I want to give you another small example. We’re going to clean up the `iris` dataset.

More specifically, we’re going to:

  1. Coerce the `iris` dataset from an old-school data frame into a tibble.
  2. Rename the variables so that all characters are lower case and underscores (“snake case”) replace the periods.

Like last week, this is a very simple example. However, as I’ve mentioned before, this is the sort of small task that you’ll need to be able to execute fluently if you want to work on larger projects.

If you want to do large, complex analyses, it really pays to first master techniques on a small scale using much simpler datasets.

Ok, let’s dive in.

First, let’s take a look at the complete block of code.

library(tidyverse)
library(stringr)


#------------------
# CONVERT TO TIBBLE
#------------------
# - the iris dataset is an old-school data frame
#   ... this means that by default, it prints out
#   large numbers of records.
# - By converting to a tibble, printing the object
#   shows only the first 10 records by default

df.iris <- as_tibble(iris)


#-----------------
# RENAME VARIABLES
#-----------------
# - here, we're just renaming these variables to be more
#   consistent with common R coding "style"
# - We're changing all characters to lower case
#   and changing variable names to "snake case"

colnames(df.iris) <- df.iris %>%
  colnames() %>%
  str_to_lower() %>%
  str_replace_all("\\.","_")

# INSPECT

df.iris %>% head()

What have we done here? We’ve combined several discrete functions of the Tidyverse together in order to perform a small amount of data wrangling.

Specifically, we’ve turned the `iris` dataset into a tibble, and we’ve renamed the variables to be more consistent with modern R code standards and naming conventions.

This example is quite simple, but useful. This is the sort of small task that you’ll need to be able to do in the context of a large analysis.
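As a quick check on the first step, the sketch below compares the class of the built-in `iris` data frame with the class of the converted tibble. The name `iris_tbl` is only an illustrative choice, and the block loads `tibble` on its own rather than the full Tidyverse to keep it small.

```r
library(tibble)

# iris ships with base R as a plain data frame
class(iris)
# "data.frame"

# as_tibble() keeps the same data but adds the tibble classes
iris_tbl <- as_tibble(iris)
class(iris_tbl)
# "tbl_df" "tbl" "data.frame"

# The contents are unchanged: still 150 rows and 5 columns
dim(iris_tbl)
# 150 5
```

Because a tibble is still a data frame underneath, functions that expect a data frame should generally continue to work on it.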

Breaking down the script

To make this a little clearer, let’s break this down into its component parts.

In the section where we renamed the variables, we only used three core functions: `colnames()`, `str_to_lower()`, and `str_replace_all()`.

Each of these individual pieces is pretty straightforward.

We are using `colnames()` to retrieve the column names.

Then, we pipe the output into the `stringr` function `str_to_lower()` to convert all the characters to lower case.
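To see these first two steps in isolation, here is a small sketch that runs them on the built-in `iris` data frame without the pipe. The variable names `original_names` and `lower_names` are only for illustration.

```r
library(stringr)

# Step 1: retrieve the column names as a character vector
original_names <- colnames(iris)
original_names
# "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"

# Step 2: convert every character to lower case
lower_names <- str_to_lower(original_names)
lower_names
# "sepal.length" "sepal.width" "petal.length" "petal.width" "species"
```

Note that the periods are still in place at this point; only the case has changed.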

Next, we use `str_replace_all()` to replace the periods (“.”) with underscores (“_”). This effectively transforms the variable names to “snake case.” (Keep in mind that `str_replace_all()` uses regular expressions, which is why the period is escaped as "\\." in the script; an unescaped period would match any character. You have learned regular expressions, right?)
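The regular-expression point is easy to verify on a single string. The sketch below compares an unescaped period, the escaped pattern used in the script, and `fixed()`, a `stringr` helper for literal matching that the original script does not use and that is shown here only as an alternative.

```r
library(stringr)

# Unescaped "." is a regex wildcard, so every character is replaced
wildcard <- str_replace_all("Sepal.Length", ".", "_")
wildcard
# "____________"

# Escaping the period matches only the literal "."
escaped <- str_replace_all("Sepal.Length", "\\.", "_")
escaped
# "Sepal_Length"

# fixed() treats the pattern as a literal string instead of a regex
literal <- str_replace_all("Sepal.Length", fixed("."), "_")
literal
# "Sepal_Length"
```

Either the escaped pattern or `fixed(".")` gives the result we want; which one to drill is a matter of preference.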

Finally, using the assignment operator (at the upper left-hand side of the code), we assign the transformed column names back to the tibble by using `colnames(df.iris)`.
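The assignment step can also be written out on its own, as in the sketch below. The second half uses `rename_with()` from `dplyr`, which is not part of the original script; it is included only as one possible alternative that produces the same names. The object names `df_iris`, `new_names`, and `df_alt` are illustrative.

```r
library(tibble)
library(stringr)
library(dplyr)

# Build the cleaned names first, then assign them back to the tibble
df_iris <- as_tibble(iris)
new_names <- str_replace_all(str_to_lower(colnames(df_iris)), "\\.", "_")
colnames(df_iris) <- new_names

colnames(df_iris)
# "sepal_length" "sepal_width" "petal_length" "petal_width" "species"

# Alternative sketch (not from the original post): rename_with() applies
# a function to the column names and returns the renamed tibble
df_alt <- as_tibble(iris) %>%
  rename_with(~ str_replace_all(str_to_lower(.x), "\\.", "_"))

identical(colnames(df_alt), colnames(df_iris))
# TRUE
```

The `colnames(x) <- value` form modifies the tibble in place, while `rename_with()` returns a new tibble, so it slots more naturally into a longer pipeline.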

I will point out that we have used these functions in a “waterfall” pattern: we have combined them using the pipe operator, `%>%`, so that the output of one step becomes the immediate input for the next step. This is a key feature of the Tidyverse. We can combine very simple functions together in new ways to accomplish tasks. This might not seem like a big deal, but it is extremely powerful. The modular nature of the Tidyverse functions, when used with the pipe operator, makes the Tidyverse flexible and syntactically powerful, while allowing the code to remain clear and easy to read.
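To make the “waterfall” idea concrete, the sketch below writes the same transformation twice: once as nested function calls, which must be read from the inside out, and once with the pipe, which reads top to bottom. The names `nested` and `piped` are illustrative, and `magrittr` is loaded explicitly here only so the block runs without the full Tidyverse.

```r
library(stringr)
library(magrittr)  # provides the %>% pipe operator

# Nested form: the innermost call runs first
nested <- str_replace_all(str_to_lower(colnames(iris)), "\\.", "_")

# Piped form: each step feeds its output to the next
piped <- iris %>%
  colnames() %>%
  str_to_lower() %>%
  str_replace_all("\\.", "_")

# Both forms produce exactly the same result
identical(nested, piped)
# TRUE
```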

A test of skill: can you write this fluently?

The functions that we just used are all critical for doing data science in R. With that in mind, this script is a good test of your skill: can you write code like this fluently, from memory?

That should be your goal.

To get there, you need to know how the individual functions work, which means you need to study them. But to be able to put them into practice, you need to drill them. So once you understand how each function works, drill it until you can write it from memory. Next, drill small scripts (like the one in this blog post). You ultimately want to be able to “put the pieces together” quickly and seamlessly in order to solve problems and get things done.

I’ve said it before: if you want a great data science job, you need to be one of the best. If you want to be one of the best, you need to master the toolkit. And to master the toolkit, you need to drill.

Sign up now, and discover how to rapidly master data science

To rapidly master data science, you need to practice.

You need to know what to practice, and you need to know how to practice.

Sharp Sight is dedicated to teaching you how to master the tools of data science as quickly as possible.

Sign up now for our email list, and you’ll receive regular tutorials and lessons.

If you sign up for our email list right now, you’ll also get access to our “Data Science Crash Course” for free.


The post Simple practice: data wrangling the iris dataset appeared first on SHARP SIGHT LABS.
