Why should I use R: The Excel R Data Wrangling comparison: Part 1
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
This is part 1 of an ongoing series on why you should use R. Future blogs will be linked here as they are released.
The era of data manipulation and analysis using programming languages has arrived. But it can be tough to find the time and the right resources to fully switch over from more manual, time-consuming solutions, such as Excel. In this blog we will show a comparison between Excel and R to get you started!
When choosing between R and Excel, it is important to understand how both solutions can get you the results you need. However, one can make it an easy, reputable, convenient process, whereas the other can make it an extremely frustrating, time-consuming process prone to human errors.
R and Excel
When opening Excel and applying data manipulation techniques to your data, are you easily able to tell what manipulations have been made without clicking on the column or cells? If you were to share these Excel sheets with colleagues are they easily able to replicate your analyses without you telling them where to click or which formulas were applied?
With R all of these are possible. You automatically have all the code visible and in front of you in the form of scripts. Reading and understanding the code is possible because of its easy-to-use, easy-to-read syntax which allows you to track what the code is doing without having to be concerned about any hidden functions or modifications happening in the background.
Most people already learned the basics of Microsoft Excel in school. Once the data has been imported into an Excel sheet, using a point-and-click technique we can easily create basic graphs and charts. R, on the other hand, is a programming language with a steeper learning curve. It will take at most two weeks to become familiar with the basics of the language and the RStudio user interface. Luckily using R can easily become second-nature with practice.
Data comes in all shapes and sizes. It can often be difficult to know where to start. Whatever your problem, Jumping Rivers can help.
Replicating Analysis
R, while having a slightly steep learning curve, has the ability to reproduce analyses repeatedly and with different data sets. This is very helpful for large projects containing multiple data sets as it keeps our processes clean and consistent. Excel however, because of the point-and-click interface, allows us to rely frequently on memory and repetition, so we would have to repeat the same analyses multiple times by either copying and pasting or simply repeating the point-and-click process, which can be time-consuming, messy, and prone to human errors.
Unlike Excel, R is completely free and benefits from a large community of open-source contributors. To install R and the IDE (RStudio Desktop) to work with R, download and install the relevant versions for your operating system. Once you have successfully installed the IDE, the following user interface will be visible
The area on the left is where you will write R code in scripts, use terminals and run jobs. The right hand side of the IDE is comprised of two sections. The top is the environment that stores a list of defined variables and data sets, view the history, and connect to other database. The area below contains five different tabs: the Files tab which lists all of the folders within this project, the Plots tab, which displays any plots that have been generated; the Packages tab which allows you to manage packages within your environment; the Help tab which provides a manual; and the Viewer tab which allows you to view generated interactive content.
Loading the data sets
Excel
The data import steps in Excel are quite straightforward to a day-to-day Excel user, however, it is certainly not reproducible.
Steps:
1. Click the Data tab on the Ribbon 2. Click the Get Data button 3. Select From File 4. Select from TEXT/CSV 5. Select the file and click Import 6. Click Load
R
There are various ways to import data sets such as local files, online
datasets and even through database connections. We will use the
read_csv()
function from the {readr} package to import our csv files.
But first, what are packages? R packages are a collection of R
functions, compiled code and sample data that can be installed by R
users. Before using an R function such as read_csv()
to import the
data, we are required to install and load the {readr} package. Packages
are great because rather than having to have a huge programme containing
everything you could possibly need, the different packages specialise in
different things, and can be loaded in as and when you need them, saving
a lot of space.
# Installing the package install.packages("readr") # Loading the package library(readr) # Importing the data movies_data <- read_csv("blog data.csv")
Exploring our data
Before getting started with any data manipulation, let’s explore our data.
Excel
Excel has one basic data structure, which is the cell. These Excel cells are extremely flexible as they store data of various types (numeric, logical and characters). To obtain an overview of the data we could simply just scroll through the Excel data sheet. Now, let’s imagine a data set of 1 million rows and 200 columns, would it still be as easy to scroll through the data sheet to obtain an overview of data? Could we quickly and reliably view all the column names? To me, manually scrolling seems like a very time consuming, unreliable and messy process.
R
To view our data in R, we could simply click on it in the environment or
we could call the name of the data set in the script. If we are working
with a large data set, we can also view a subset of this data by using
functions like head()
and tail()
. We could also use the colnames
function to programmatically display the variable names within our data.
movies_data ## # A tibble: 26 × 5 ## Country Year Highest_profit Number_movies no_employees ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 England 2011 100 3 1500 ## 2 America 2012 150 2 2000 ## 3 America 2013 300 4 4000 ## 4 England 2013 130 2 4020 ## 5 South Korea 2013 177 3 5300 ## 6 America 2014 350 1 3150 ## 7 South Korea 2015 700 6 6000 ## 8 England 2016 650 2 5000 ## 9 America 2016 230 1 1420 ## 10 China 2016 440 3 5000 ## # … with 16 more rows str(movies_data) # Displays the structure of the data ## spc_tbl_ [26 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame) ## $ Country : chr [1:26] "England" "America" "America" "England" ... ## $ Year : num [1:26] 2011 2012 2013 2013 2013 ... ## $ Highest_profit: num [1:26] 100 150 300 130 177 350 700 650 230 440 ... ## $ Number_movies : num [1:26] 3 2 4 2 3 1 6 2 1 3 ... ## $ no_employees : num [1:26] 1500 2000 4000 4020 5300 3150 6000 5000 1420 5000 ... ## - attr(*, "spec")= ## .. cols( ## .. Country = col_character(), ## .. Year = col_double(), ## .. Highest_profit = col_double(), ## .. Number_movies = col_double(), ## .. no_employees = col_double() ## .. ) ## - attr(*, "problems")=<externalptr> head(movies_data) # Displays the first six rows of the data ## # A tibble: 6 × 5 ## Country Year Highest_profit Number_movies no_employees ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 England 2011 100 3 1500 ## 2 America 2012 150 2 2000 ## 3 America 2013 300 4 4000 ## 4 England 2013 130 2 4020 ## 5 South Korea 2013 177 3 5300 ## 6 America 2014 350 1 3150 tail(movies_data) # Displays the last six rows of data ## # A tibble: 6 × 5 ## Country Year Highest_profit Number_movies no_employees ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 England 2021 120 1 1325 ## 2 America 2021 800 3 6800 ## 3 America 2022 400 2 7200 ## 4 China 2021 230 2 3101 ## 5 South Korea 2022 320 1 4300 ## 6 England 2022 221 2 3999 colnames(movies_data) # Displays all the variable names ## [1] "Country" "Year" "Highest_profit" "Number_movies" ## [5] "no_employees"
The movies
data is comprised of five columns: country, year, highest
profit gained per movie, number of movies produced and number of
employees on set during production. It is clear that R programmatically
displays the output of our data whereas Excel requires of a lot of
eye-balling and manual scrolling. If we were interested in displaying a
subset of our data, in a report for example, using R we could simply use
the functions above. To do this in Excel we would have to copy and paste
the first 6 rows of the data and manually add it to the report document.
Summary Statistics
Now, let’s apply some summary statistics on our data. Summary statistics provide a quick summary of data and are particularly useful for comparing one project to another, or before and after.
Excel
It is very well known that Excel has a data storage limitation per spreadsheet. It can have a very limited amount of columns and rows, while R is made to handle larger data sets. Excel files are also known to crash when they exceed 20 tabs of data. Excel is able to handle a good chunk of data, but not much. This becomes very risky when you unknowingly start to lose data because the file has become too big and is unable to save. To generate summary statistics (such as the minimum and maximum values) of our data in Excel, we followed a few steps:
-
Scroll to the Home tab
-
In the Editing group, click the arrow next to AutoSum
-
Click Min
-
Click Max
-
Press Enter
These steps were quite easy to follow, however, I often forget where to click or which tab to select. After discussing this workflow with a colleague, we also discovered slight differences in the steps for different versions of Excel. This did not seem very effective or reproducible to us.
R
summary(movies_data) ## Country Year Highest_profit Number_movies ## Length:26 Min. :2011 Min. : 11.0 Min. :1.000 ## Class :character 1st Qu.:2013 1st Qu.:156.8 1st Qu.:2.000 ## Mode :character Median :2017 Median :320.0 Median :2.500 ## Mean :2017 Mean :350.0 Mean :2.654 ## 3rd Qu.:2021 3rd Qu.:485.0 3rd Qu.:3.000 ## Max. :2022 Max. :800.0 Max. :6.000 ## no_employees ## Min. :1325 ## 1st Qu.:2275 ## Median :4401 ## Mean :4338 ## 3rd Qu.:6375 ## Max. :7200 # Stardard deviation sd(movies_data$Highest_profit) ## [1] 224.3471 # Highest value of the Highest profit column min(movies_data$Highest_profit) ## [1] 11 # Highest value of the Highest profit column max(movies_data$Highest_profit) ## [1] 800
The dollar symbol, $
, used here simply dictates which data set and
column we are using for the analysis. It is evident that the source code
of R can be used repeatedly and with different data sets in ways that
Excel formulas cannot. R clearly shows the code (instructions), data and
columns used for an analysis in ways that Excel does not. If I were to
share this script with a colleague they would have a complete
understanding on how the summary statistics were generated because of
R’s human readable syntax.
Data Wrangling
Data manipulation tools assist us with modifying our data to make it easier to read and organise. For example, one of the easiest data manipulation tools in Excel is inserting columns and rows. The purpose of data manipulation is to create a consistent, organised and clean data set. With this in mind, let’s apply the following data manipulations in Excel and then R:
- Rename the columns into a consistent format
- Arrange the year column in ascending order
- Select and create a new column
- Remove a column from the data
- Select only the entries for the year 2014
- Remove only the entries from rows 4-11
1. Renaming columns in R and Excel
Excel
Renaming columns in R is a completely manual process, which makes it an extremely time-consuming and risky process especially if you are working between multiple messy Excel sheets.
R
For data manipulation in R, we use a powerful package in R called
dplyr
. Let’s load and install the package.
# Installing the packages install.packages("dplyr") # Loading the packages library(dplyr)
To rename the columns, there is a handy function called rename()
. We
simply pass this function the name of our data set (movies_data
), and
then rename each of the columns. There are other methods available in
other packages which can automatically make everything lower case, for
example, but for the purposes of this blog, we will stick with dplyr
.
# Renaming the column into a consistent format movies_data <- rename_with(movies_data, tolower)
2. Arrange the year column in ascending order
Excel
To change column to ascending order, we first had to:
-
Select the
year
column -
Direct to the
Sort and Filter tab
-
Select the option to sort from the largest to the smallest value
R
arrange(movies_data, year) ## # A tibble: 26 × 5 ## country year highest_profit no_movies no_employees ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 England 2011 100 3 1500 ## 2 America 2011 100 3 1500 ## 3 America 2012 150 2 2000 ## 4 South Korea 2012 11 5 1333 ## 5 America 2013 300 4 4000 ## 6 England 2013 130 2 4020 ## 7 South Korea 2013 177 3 5300 ## 8 America 2014 350 1 3150 ## 9 South Korea 2015 700 6 6000 ## 10 England 2016 650 2 5000 ## # … with 16 more rows
Again, with Excel representing a point-and-click nature, it is impossible to identify. by looking at a column, how the data was modified. If I were to replicate these steps in two years time I would likely have forgotten where to point and click. With R however, we have our code which clearly shows each step used to manipulate the data. If I were to return to my script in two years time, I would easily be able to replicate the analysis.
3. Selecting and adding a new column
Let’s reduce our data set by first selecting the country
, year
,
no_movies
and highest_profit
columns. Then we will generate a new
column called complete_profit
. The complete_profit
column should be
generated from taking the highest_profit
column divided by the
no_movies
column.
Excel
R
movies_data %>% select(country, year, no_movies, highest_profit) %>% mutate(complete_profit = highest_profit/no_movies) ## # A tibble: 26 × 5 ## country year no_movies highest_profit complete_profit ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 England 2011 3 100 33.3 ## 2 America 2012 2 150 75 ## 3 America 2013 4 300 75 ## 4 England 2013 2 130 65 ## 5 South Korea 2013 3 177 59 ## 6 America 2014 1 350 350 ## 7 South Korea 2015 6 700 117. ## 8 England 2016 2 650 325 ## 9 America 2016 1 230 230 ## 10 China 2016 3 440 147. ## # … with 16 more rows
4. Removing a column
Excel
In Excel, inserting or deleting a column is a manual process. First, we select the column then right-click at the top of a column and then select the Delete option.
R
select(movies_data, -year) ## # A tibble: 26 × 4 ## country highest_profit no_movies no_employees ## <chr> <dbl> <dbl> <dbl> ## 1 England 100 3 1500 ## 2 America 150 2 2000 ## 3 America 300 4 4000 ## 4 England 130 2 4020 ## 5 South Korea 177 3 5300 ## 6 America 350 1 3150 ## 7 South Korea 700 6 6000 ## 8 England 650 2 5000 ## 9 America 230 1 1420 ## 10 China 440 3 5000 ## # … with 16 more rows
In R, we simply used the select
function from the {dplyr} package to
select a column of our data frame. To remove a column we put a -
in
front of the variable to exclude it from our data.
5. Select only the entries for a particular year
Excel
Here we are interested in extracting the data collected only during the year 2021. Using Excel software, we first sort the year column and then manually select the years that we are interested in. While applying this manual technique of selecting pieces of data that we are interested in, it is very easy to select the wrong data or even accidentally delete data.
R
filter(movies_data, year == 2021) ## # A tibble: 4 × 5 ## country year highest_profit no_movies no_employees ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 America 2021 800 3 6800 ## 2 England 2021 120 1 1325 ## 3 America 2021 800 3 6800 ## 4 China 2021 230 2 3101
6. Remove only the row entries from 2-4
Excel
Removing rows in Excel is once again a manual process. We select the rows that we do not want to keep, then right click and delete those rows. These rows are now permanently deleted from the data sheet. If we were interested in adding them back into the sheet, we would have to find it (if we had a back up Excel sheet) and copy and paste it back into our data analysis Excel sheet. If we did not have a back up of the data that we had deleted, then this data would be completely lost.
R
In R we can use the slice
function to return a subset of rows based on their position. If you want to remove rows using slice
instead of retaining them you can just add a -
in front of the row indices you’re passing into the function. So, to remove rows 2, 3, and 4:
slice(movies_data, -(2:4)) ## # A tibble: 26 × 2 ## country no_employees ## <chr> <dbl> ## 1 England 1500 ## 2 America 2000 ## 3 America 4000 ## 4 England 4020 ## 5 South Korea 5300 ## 6 America 3150 ## 7 South Korea 6000 ## 8 England 5000 ## 9 America 1420 ## 10 China 5000 ## # … with 16 more rows
Using R and Excel
There are multiple ways in which data manipulation is used efficiently in data science. Data formatting is important and must be organised to be read by the various software programs, be it in R or Excel.
Excel is an excellent tool and is easy to use and at times it is the most appropriate tool. Excel is often used for data processing work under general and basic office requirements. However, Excel is limiting in that the data file itself can hold only approximately 1 million rows without the aid of other tools. The basic built in statistical analysis is too simple and has very little practical value. If you are an aspiring data analyst, you will need to expand your toolset and start thinking beyond the rows and columns of a spreadsheet. R functions cover almost any area where data is needed. Getting started with R is very simple especially because of the easy-to-use and understandable syntax. Most importantly, R facilitates reproducible analyses.
A hammer is great for driving nails, but it’s not the only tool out there.
If you’re interested in learning R, then attend our Introduction to R course.
For updates and revisions to this article, see the original post
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.