The 6 Most Fundamental Functions for Data Cleaning with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
In this blog post I’m going to show you the six most fundamental functions for your data cleaning journey. And as always, you can watch the video version of this blog post on YouTube:
Getting started
All of these functions come from the tidyverse. So this is why we are going to load that package.
library(tidyverse)
And of course, we’re going to need some data, and this is where the palmerpenguins package comes in. We can either load the palmerpenguins package like we did with the tidyverse, or we can just access the penguins data set via its full name, namely the palmerpenguins package name followed by ::
followed by the data set name.
penguins <- palmerpenguins::penguins penguins ## # A tibble: 344 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Adelie Torgersen 39.1 18.7 181 3750 ## 2 Adelie Torgersen 39.5 17.4 186 3800 ## 3 Adelie Torgersen 40.3 18 195 3250 ## 4 Adelie Torgersen NA NA NA NA ## 5 Adelie Torgersen 36.7 19.3 193 3450 ## 6 Adelie Torgersen 39.3 20.6 190 3650 ## 7 Adelie Torgersen 38.9 17.8 181 3625 ## 8 Adelie Torgersen 39.2 19.6 195 4675 ## 9 Adelie Torgersen 34.1 18.1 193 3475 ## 10 Adelie Torgersen 42 20.2 190 4250 ## # ℹ 334 more rows ## # ℹ 2 more variables: sex <fct>, year <int>
Count all the things with count()
Once we have that data set, we can pass it to the count()
function. Once we do that, we see in the console that we get a count.
penguins |> count() ## # A tibble: 1 × 1 ## n ## <int> ## 1 344
But this output right now isn’t particularly useful because it just shows us how many things are in this data set. You see, count()
becomes way more useful when you also add a column name from the dataset inside of count()
. That way, we could count how many different species are in the data set.
penguins |> count(species) ## # A tibble: 3 × 2 ## species n ## <fct> <int> ## 1 Adelie 152 ## 2 Chinstrap 68 ## 3 Gentoo 124
So that’s how we know how many Adelie, Gentoo and Chinstrap penguins are there. Furthermore, we can also sort the rows in this data set by also adding the sort = TRUE
argument inside of count()
. This is particularly helpful if you have a data set that has a whole bunch of different things inside one column and you want to get the most frequent ones at the top.
penguins |> count(species, sort = TRUE) ## # A tibble: 3 × 2 ## species n ## <fct> <int> ## 1 Adelie 152 ## 2 Gentoo 124 ## 3 Chinstrap 68
And we could even count combinations of different columns.
penguins |> count(species, island, sort = TRUE) ## # A tibble: 5 × 3 ## species island n ## <fct> <fct> <int> ## 1 Gentoo Biscoe 124 ## 2 Chinstrap Dream 68 ## 3 Adelie Dream 56 ## 4 Adelie Torgersen 52 ## 5 Adelie Biscoe 44
Let me note that count()
is super useful because obviously you can count things, but also I use it all of the time as intermediate steps when I filter or rearrange some part of the data. This helps me to check that the count of what is now inside of a data set matches what I expect after some data operation.
Get the right columns with select()
Next let us talk about the select()
function. This function helps you select columns from your data set. Say, you have a data set with a whole bunch of columns and you want to focus on a subset of that. Then you can just
- take your data,
- pass it to the
select()
function, and - list all of the things that you want to select.
penguins |> select(flipper_length_mm, bill_length_mm, species) ## # A tibble: 344 × 3 ## flipper_length_mm bill_length_mm species ## <int> <dbl> <fct> ## 1 181 39.1 Adelie ## 2 186 39.5 Adelie ## 3 195 40.3 Adelie ## 4 NA NA Adelie ## 5 193 36.7 Adelie ## 6 190 39.3 Adelie ## 7 181 38.9 Adelie ## 8 195 39.2 Adelie ## 9 193 34.1 Adelie ## 10 190 42 Adelie ## # ℹ 334 more rows
That’s was pretty easy, right? Here, I kind of want to leave it at this stage. But be aware that there are a whole bunch of advanced tricks that you can use in select()
to not have to list all of the things.
Instead, you can just describe the things that you want. For example, you can get all the columns
- that have a particular text in their name (like
_mm
in the penguins dataset) or - that are of specific type (e.g. numeric)
In these cases, you can use a couple of tricks in combination with select()
that make your life much easier. But here I really want to focus on the fundamentals. So let’s ignore those advanced tricks for now.
Reduce the rows to specific observations with filter()
The next essential function is filter()
. It is like select()
but for rows. Using that function you can filter your data according to specific criteria. That way, you get all of the columns of the data that you pass to filter()
, but you get only the rows that match a specific criteria.
For example, we can
- take our penguins data,
- pass it to the
filter()
function, and - state that we want to get only the penguins whose bill length is larger than 55 millimeters.
penguins |> filter(bill_length_mm > 55) ## # A tibble: 5 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Gentoo Biscoe 59.6 17 230 6050 ## 2 Gentoo Biscoe 55.9 17 228 5600 ## 3 Gentoo Biscoe 55.1 16 230 5850 ## 4 Chinstrap Dream 58 17.8 181 3700 ## 5 Chinstrap Dream 55.8 19.8 207 4000 ## # ℹ 2 more variables: sex <fct>, year <int>
I’ll explain how to read that code in a second (in case you don’t find it intuitive.) Let me add one more example first, though. For instance, we could also use the between()
function in combination with filter()
to find the penguins that have a bill length larger than 55 but smaller than 58.
penguins |> filter(between(bill_length_mm, 55, 58)) ## # A tibble: 4 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## <fct> <fct> <dbl> <dbl> <int> <int> ## 1 Gentoo Biscoe 55.9 17 228 5600 ## 2 Gentoo Biscoe 55.1 16 230 5850 ## 3 Chinstrap Dream 58 17.8 181 3700 ## 4 Chinstrap Dream 55.8 19.8 207 4000 ## # ℹ 2 more variables: sex <fct>, year <int>
Okay, here you might wonder how to understand this thing here. So, let me give you an insight of how filter()
function does things.
Basically, whenever you describe a condition inside of filter()
, then what happens is that it takes the column that you used and applies the command to get a vector full of TRUE
and FALSE
. Let’s try this out manually to see what happens. Let’s grab the bill lengths.
bill_length_mm ## Error in eval(expr, envir, enclos): object 'bill_length_mm' not found
As you can see, we can’t just access the data column via its name outside of the filter()
function. R simply doesn’t know where this column name is coming from. Inside of the filter()
function R knows to look at the data set. But outside of filter()
you will have to tell R that this thing comes from our penguins data set. You do that by via the data set name and the $
operator.
penguins$bill_length_mm ## [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6 ## [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5 ## [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0 ## [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6 ## [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5 ## [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9 ## [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9 ## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1 ## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1 ## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8 ## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5 ## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8 ## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3 ## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5 ## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2 ## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3 ## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2 ## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8 ## [271] 47.2 NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0 ## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6 ## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9 ## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7 ## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2
So now you see that we get a vector full of the length that we want to look at. Now, what happens inside of filter()
is that this vector is applied according to what you specified. In the first example, we just checked is this thing larger than 55. Here’s what that results in.
penguins$bill_length_mm > 55 ## [1] FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [181] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE ## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [253] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [265] FALSE FALSE FALSE TRUE FALSE FALSE FALSE NA FALSE FALSE FALSE FALSE ## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [289] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE ## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [337] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Now you see that what you get is a vector full of TRUE
s and FALSE
s. And as you can probably guess:
TRUE
corresponds to the occasions where the bill length was indeed larger than 55, andFALSE
corresponds to the cases where this wasn’t the case.
Really, what you put inside of filter()
are operations that at the end of the day deliver TRUE
s and FALSE
s. Then, filter()
just takes the rows that correspond to TRUE
. You could simulate this by taking these TRUE
s and FALSE
s that we’ve just captured manually and sticking that into the which()
function.
which(penguins$bill_length_mm > 55) ## [1] 186 254 268 294 340
This gives you all of the row numbers for which you had a TRUE
, and these are the rows that filter()
will return.
Now why did I explain this in so much detail when it was really a simple operation? The thing is, I want you to understand this as things become more complicated. For example, with our second example using the between()
function, you might wonder why this was valid syntax. Basically, you have to understand what between()
does. So here’s an example.
between(c(54, 55, 56, 57, 58), 55, 57) ## [1] FALSE TRUE TRUE TRUE FALSE
As you can see, between()
is a function that does exactly what we just talked about, i.e. it returns TRUE
s and FALSE
s. So really, this was just a different operation that has some slightly other syntax. But the syntax doesn’t matter to us as long as at the end of the day this operation that we use inside of filter()
will return TRUE
s and FALSE
s so that filter()
can do its thing.
Calculating and transforming columns with mutate()
Now let’s compute new things from our data. For example, we can compute a new column that takes the bill lengths that we’ve just looked at and scales them. Scaling is a technique used to transform the values of a variable so that they are centered around zero. This transformation allows you to get a better understanding of whether a value is high or low relative to the other values in the dataset (even if you are not familiar with the specific units of measurement.)
For example, if you scale the bill length variable, a scaled value of
- 0 would indicate an average bill length,
- while a positive value would indicate an above-average bill length, and
- a negative value would indicate a below-average bill length.
And to do this scaling, we need the mutate()
function. It allows you to modify existing columns or create new columns in a dataset. Here’s how it works:
Pass your data set to the
mutate()
function.Specify the column name and set it equal to the desired transformation.
penguins |> mutate(bill_length_mm = scale(bill_length_mm)[,1]) |> select(bill_length_mm) ## # A tibble: 344 × 1 ## bill_length_mm ## <dbl> ## 1 -0.883 ## 2 -0.810 ## 3 -0.663 ## 4 NA ## 5 -1.32 ## 6 -0.847 ## 7 -0.920 ## 8 -0.865 ## 9 -1.80 ## 10 -0.352 ## # ℹ 334 more rows
And via the same strategy, we can also create a new column. We just have to assign a new column name.
penguins |> mutate(scaled_bill_length = scale(bill_length_mm)[,1]) |> select(bill_length_mm, scaled_bill_length) ## # A tibble: 344 × 2 ## bill_length_mm scaled_bill_length ## <dbl> <dbl> ## 1 39.1 -0.883 ## 2 39.5 -0.810 ## 3 40.3 -0.663 ## 4 NA NA ## 5 36.7 -1.32 ## 6 39.3 -0.847 ## 7 38.9 -0.920 ## 8 39.2 -0.865 ## 9 34.1 -1.80 ## 10 42 -0.352 ## # ℹ 334 more rows
Behind the scenes, mutate()
takes the specified column…
penguins$bill_length_mm ## [1] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42.0 37.8 37.8 41.1 38.6 34.6 ## [16] 36.6 38.7 42.5 34.4 46.0 37.8 37.7 35.9 38.2 38.8 35.3 40.6 40.5 37.9 40.5 ## [31] 39.5 37.2 39.5 40.9 36.4 39.2 38.8 42.2 37.6 39.8 36.5 40.8 36.0 44.1 37.0 ## [46] 39.6 41.1 37.5 36.0 42.3 39.6 40.1 35.0 42.0 34.5 41.4 39.0 40.6 36.5 37.6 ## [61] 35.7 41.3 37.6 41.1 36.4 41.6 35.5 41.1 35.9 41.8 33.5 39.7 39.6 45.8 35.5 ## [76] 42.8 40.9 37.2 36.2 42.1 34.6 42.9 36.7 35.1 37.3 41.3 36.3 36.9 38.3 38.9 ## [91] 35.7 41.1 34.0 39.6 36.2 40.8 38.1 40.3 33.1 43.2 35.0 41.0 37.7 37.8 37.9 ## [106] 39.7 38.6 38.2 38.1 43.2 38.1 45.6 39.7 42.2 39.6 42.7 38.6 37.3 35.7 41.1 ## [121] 36.2 37.7 40.2 41.4 35.2 40.6 38.8 41.5 39.0 44.1 38.5 43.1 36.8 37.5 38.1 ## [136] 41.1 35.6 40.2 37.0 39.7 40.2 40.6 32.1 40.7 37.3 39.0 39.2 36.6 36.0 37.8 ## [151] 36.0 41.5 46.1 50.0 48.7 50.0 47.6 46.5 45.4 46.7 43.3 46.8 40.9 49.0 45.5 ## [166] 48.4 45.8 49.3 42.0 49.2 46.2 48.7 50.2 45.1 46.5 46.3 42.9 46.1 44.5 47.8 ## [181] 48.2 50.0 47.3 42.8 45.1 59.6 49.1 48.4 42.6 44.4 44.0 48.7 42.7 49.6 45.3 ## [196] 49.6 50.5 43.6 45.5 50.5 44.9 45.2 46.6 48.5 45.1 50.1 46.5 45.0 43.8 45.5 ## [211] 43.2 50.4 45.3 46.2 45.7 54.3 45.8 49.8 46.2 49.5 43.5 50.7 47.7 46.4 48.2 ## [226] 46.5 46.4 48.6 47.5 51.1 45.2 45.2 49.1 52.5 47.4 50.0 44.9 50.8 43.4 51.3 ## [241] 47.5 52.1 47.5 52.2 45.5 49.5 44.5 50.8 49.4 46.9 48.4 51.1 48.5 55.9 47.2 ## [256] 49.1 47.3 46.8 41.7 53.4 43.3 48.1 50.5 49.8 43.5 51.5 46.2 55.1 44.5 48.8 ## [271] 47.2 NA 46.8 50.4 45.2 49.9 46.5 50.0 51.3 45.4 52.7 45.2 46.1 51.3 46.0 ## [286] 51.3 46.6 51.7 47.0 52.0 45.9 50.5 50.3 58.0 46.4 49.2 42.4 48.5 43.2 50.6 ## [301] 46.7 52.0 50.5 49.5 46.4 52.8 40.9 54.2 42.5 51.0 49.7 47.5 47.6 52.0 46.9 ## [316] 53.5 49.0 46.2 50.9 45.5 50.9 50.8 50.1 49.0 51.5 49.8 48.1 51.4 45.7 50.7 ## [331] 42.5 52.2 45.2 49.3 50.2 45.6 51.9 46.8 45.7 55.8 43.5 49.6 50.8 50.2
…performs the requested transformation,
scale(penguins$bill_length_mm)[,1] ## [1] -0.88320467 -0.80993901 -0.66340769 NA -1.32279862 -0.84657184 ## [7] -0.91983750 -0.86488825 -1.79902541 -0.35202864 -1.12131806 -1.12131806 ## [13] -0.51687637 -0.97478674 -1.70744334 -1.34111504 -0.95647033 -0.26044656 ## [19] -1.74407616 0.38062795 -1.12131806 -1.13963448 -1.46932994 -1.04805240 ## [25] -0.93815391 -1.57922843 -0.60845845 -0.62677486 -1.10300165 -0.62677486 ## [31] -0.80993901 -1.23121655 -0.80993901 -0.55350920 -1.37774787 -0.86488825 ## [37] -0.93815391 -0.31539581 -1.15795089 -0.75498976 -1.35943145 -0.57182562 ## [43] -1.45101353 0.03261607 -1.26784938 -0.79162259 -0.51687637 -1.17626731 ## [49] -1.45101353 -0.29707939 -0.79162259 -0.70004052 -1.63417768 -0.35202864 ## [55] -1.72575975 -0.46192713 -0.90152108 -0.60845845 -1.35943145 -1.15795089 ## [61] -1.50596277 -0.48024354 -1.15795089 -0.51687637 -1.37774787 -0.42529430 ## [67] -1.54259560 -0.51687637 -1.46932994 -0.38866147 -1.90892390 -0.77330618 ## [73] -0.79162259 0.34399512 -1.54259560 -0.20549732 -0.55350920 -1.23121655 ## [79] -1.41438070 -0.33371222 -1.70744334 -0.18718091 -1.32279862 -1.61586126 ## [85] -1.21290014 -0.48024354 -1.39606428 -1.28616579 -1.02973599 -0.91983750 ## [91] -1.50596277 -0.51687637 -1.81734182 -0.79162259 -1.41438070 -0.57182562 ## [97] -1.06636882 -0.66340769 -1.98218956 -0.13223166 -1.63417768 -0.53519279 ## [103] -1.13963448 -1.12131806 -1.10300165 -0.77330618 -0.97478674 -1.04805240 ## [109] -1.06636882 -0.13223166 -1.06636882 0.30736229 -0.77330618 -0.31539581 ## [115] -0.79162259 -0.22381374 -0.97478674 -1.21290014 -1.50596277 -0.51687637 ## [121] -1.41438070 -1.13963448 -0.68172411 -0.46192713 -1.59754485 -0.60845845 ## [127] -0.93815391 -0.44361071 -0.90152108 0.03261607 -0.99310316 -0.15054808 ## [133] -1.30448221 -1.17626731 -1.06636882 -0.51687637 -1.52427919 -0.68172411 ## [139] -1.26784938 -0.77330618 -0.68172411 -0.60845845 -2.16535371 -0.59014203 ## [145] -1.21290014 -0.90152108 -0.86488825 -1.34111504 -1.45101353 -1.12131806 ## [151] -1.45101353 -0.44361071 0.39894437 1.11328455 0.87517115 1.11328455 ## [157] 0.67369059 0.47221003 0.27072946 0.50884286 -0.11391525 0.52715927 ## [163] -0.55350920 0.93012040 0.28904588 0.82022191 0.34399512 0.98506964 ## [169] -0.35202864 0.96675323 0.41726078 0.87517115 1.14991738 0.21578022 ## [175] 0.47221003 0.43557720 -0.18718091 0.39894437 0.10588173 0.71032342 ## [181] 0.78358908 1.11328455 0.61874135 -0.20549732 0.21578022 2.87166037 ## [187] 0.94843681 0.82022191 -0.24213015 0.08756532 0.01429966 0.87517115 ## [193] -0.22381374 1.04001889 0.25241305 1.04001889 1.20486662 -0.05896600 ## [199] 0.28904588 1.20486662 0.17914739 0.23409663 0.49052644 0.83853832 ## [205] 0.21578022 1.13160096 0.47221003 0.19746381 -0.02233317 0.28904588 ## [211] -0.13223166 1.18655021 0.25241305 0.41726078 0.32567871 1.90089038 ## [217] 0.34399512 1.07665172 0.41726078 1.02170247 -0.07728242 1.24149945 ## [223] 0.69200701 0.45389361 0.78358908 0.47221003 0.45389361 0.85685474 ## [229] 0.65537418 1.31476511 0.23409663 0.23409663 0.94843681 1.57119492 ## [235] 0.63705776 1.11328455 0.17914739 1.25981586 -0.09559883 1.35139794 ## [241] 0.65537418 1.49792926 0.65537418 1.51624567 0.28904588 1.02170247 ## [247] 0.10588173 1.25981586 1.00338606 0.54547569 0.82022191 1.31476511 ## [253] 0.83853832 2.19395302 0.60042493 0.94843681 0.61874135 0.52715927 ## [259] -0.40697788 1.73604265 -0.11391525 0.76527266 1.20486662 1.07665172 ## [265] -0.07728242 1.38803077 0.41726078 2.04742170 0.10588173 0.89348757 ## [271] 0.60042493 NA 0.52715927 1.18655021 0.23409663 1.09496813 ## [277] 0.47221003 1.11328455 1.35139794 0.27072946 1.60782775 0.23409663 ## [283] 0.39894437 1.35139794 0.38062795 1.35139794 0.49052644 1.42466360 ## [289] 0.56379210 1.47961284 0.36231154 1.20486662 1.16823379 2.57859773 ## [295] 0.45389361 0.96675323 -0.27876298 0.83853832 -0.13223166 1.22318303 ## [301] 0.50884286 1.47961284 1.20486662 1.02170247 0.45389361 1.62614416 ## [307] -0.55350920 1.88257397 -0.26044656 1.29644869 1.05833530 0.65537418 ## [313] 0.67369059 1.47961284 0.54547569 1.75435906 0.93012040 0.41726078 ## [319] 1.27813228 0.28904588 1.27813228 1.25981586 1.13160096 0.93012040 ## [325] 1.38803077 1.07665172 0.76527266 1.36971435 0.32567871 1.24149945 ## [331] -0.26044656 1.51624567 0.23409663 0.98506964 1.14991738 0.30736229 ## [337] 1.46129643 0.52715927 0.32567871 2.17563660 -0.07728242 1.04001889 ## [343] 1.25981586 1.14991738
and then assigns that vector do the desired column name.
Reduce your data to key quantities with summarize()
The summarize()
function in R allows you to summarize your data into single values. It works similarly to mutate()
, but instead of adding new columns, it reduces the data set to a single row. For example, you can use summarize()
to compute the mean of the bill_length_mm
and flipper_length_mm
columns
penguins |> summarize( mean_bill_length = mean(bill_length_mm, na.rm = TRUE), mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE) ) ## # A tibble: 1 × 2 ## mean_bill_length mean_flipper_length ## <dbl> <dbl> ## 1 43.9 201.
Note that we use na.rm = TRUE
to remove any missing values before computing the means. And if you want to repeat the calculations for different subsets of your data, such as different species of penguins, you can use the .by
argument:
penguins |> summarize( mean_bill_length = mean(bill_length_mm, na.rm = TRUE), mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE), .by = species ) ## # A tibble: 3 × 3 ## species mean_bill_length mean_flipper_length ## <fct> <dbl> <dbl> ## 1 Adelie 38.8 190. ## 2 Gentoo 47.5 217. ## 3 Chinstrap 48.8 196.
This will repeat the calculations separately for each species of penguin.
Sort your data with arrange()
The arrange()
function in R allows you to sort the rows of your dataset based on one or more columns. By default, it sorts in ascending order.
penguins |> summarize( mean_bill_length = mean(bill_length_mm, na.rm = TRUE), mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE), .by = species ) |> arrange(mean_flipper_length) ## # A tibble: 3 × 3 ## species mean_bill_length mean_flipper_length ## <fct> <dbl> <dbl> ## 1 Adelie 38.8 190. ## 2 Chinstrap 48.8 196. ## 3 Gentoo 47.5 217.
But you can use the desc()
function to sort in descending order:
penguins |> summarize( mean_bill_length = mean(bill_length_mm, na.rm = TRUE), mean_flipper_length = mean(flipper_length_mm, na.rm = TRUE), .by = species ) |> arrange(desc(mean_flipper_length)) ## # A tibble: 3 × 3 ## species mean_bill_length mean_flipper_length ## <fct> <dbl> <dbl> ## 1 Gentoo 47.5 217. ## 2 Chinstrap 48.8 196. ## 3 Adelie 38.8 190.
Conclusion
In this blog post/video, we covered the 6 most fundamental functions for data cleaning with R. These functions provide a solid foundation for transforming and manipulating data in R.
If you enjoyed this content and want to learn more advanced data cleaning techniques, be sure to check out my Data Cleaning with R Master Class. In this master class, we
- dive deeper into these functions,
- learn to clean specific data formats (such as Excel files),
- handling text data, and
- deal with date formats.
Be sure to check out the course. And if you found this helpful, here are some other ways I can help you:
- 3 Minute Wednesdays: A weekly newsletter with bite-sized tips and tricks for R users
- Insightful Data Visualizations for “Uncreative” R Users: A course that teaches you how to leverage
{ggplot2}
to make charts that communicate effectively without being a design expert.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.