From excel to R: Part 1
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Welcome welcome.
In this blog post, we will be exploring:
- How to efficiently read data from multiple sheets in an Excel workbook and combine them into a single dataframe using R.
- Dealing with mixed date formats using the
janitor
package
By the end of this post, you & me should have a solid understanding of how to read and combine data from multiple sheets in an Excel workbook using R, and how to handle mixed date formats to ensure consistent data analysis.
Load library
Show code
library(readxl) library(dplyr) library(purrr) library(janitor)
Define path
First, let’s define the path to the Excel file that we want to read in. In this example, we’ll be using the here
package to make the path relative to the current project directory. This data set can be obtained from my github repository
Show code
## Define path path <- here::here("_posts/2023-03-08-excel-to-r/input/store-sales.xlsx")
Reading all sheets
To read all the sheets:
Show code
df_excel <- path %>% excel_sheets() %>% map_df(~read_excel(.x, path = path) %>% mutate(sheet_name = .x))
- The
excel_sheets()
function from thereadxl
package is used to extract the names of all the sheets in the Excel file. - The
map_df()
function from thepurrr
package meanwhile is used to apply a function to each element of a vector and combine the results into a data frame - In this case, we are using
map_df()
to read in each sheet of the Excel file; using theread_excel()
function from thereadxl package
. - Finally, we add a new column. This new column is called
sheet_name
and contains the name of the sheet that the data came from.
The dataframe:
Show code
df_excel %>% knitr::kable()
Year | Date | Store | Sales | sheet_name |
---|---|---|---|---|
2012 | 1/26/2012 | A | 2000 | Year1 |
2012 | 1/26/2012 | B | 2500 | Year1 |
2012 | 1/28/2012 | C | 3400 | Year1 |
2012 | Jan 21, 2012 | D | 6000 | Year1 |
2013 | 1/26/2013 | A | 2000 | Year2 |
2013 | 1/21/2013 | B | 2500 | Year2 |
2013 | 1/28/2013 | C | 3400 | Year2 |
2013 | Jan 25, 2013 | D | 6000 | Year2 |
2013 | 41302 | E | 5500 | Year2 |
2014 | 1/26/2014 | A | 2000 | Year3 |
2014 | 1/28/2014 | B | 2500 | Year3 |
2014 | 2/28/2014 | C | 3400 | Year3 |
2014 | March 25, 2014 | D | 6000 | Year3 |
2014 | 41789 | E | 5500 | Year3 |
Mixed date formats
We have successfully loaded the information into one dataframe. However, it appears that this dataframe contains a column (Date
) with mixed date formats that have been imported from Excel. This can often be a problem when working with date data, as it can make it difficult to perform consistent operations and analyses on the data. To fix this issue, we will need to standardize the date format across the entire column, so that all dates can be treated consistently. One way to accomplish this is to use a function like convert_to_date()
from the janitor
package, which can convert the dates to a consistent format while also handling any errors or inconsistencies in the data:
Show code
df_excel <- df_excel %>% mutate(Date_change = convert_to_date(Date, string_conversion_failure = "warning", character_fun = lubridate::mdy))
- To explain,
convert_to_date()
function from thejanitor
package is used to convert theDate
column to a consistent date format. - The
string_conversion_failure
parameter is set to “warning”, to indicate that any non-date strings in theDate
column will be converted toNA
values and a warning message will be displayed, just to be safe. - The
character_fun
parameter is set tolubridate::mdy
, which indicates the expected order of month, day, and year in the date string.
Please refer to Date_change
for the standardized date column:
Show code
## Show dataset df_excel %>% knitr::kable()
Year | Date | Store | Sales | sheet_name | Date_change |
---|---|---|---|---|---|
2012 | 1/26/2012 | A | 2000 | Year1 | 2012-01-26 |
2012 | 1/26/2012 | B | 2500 | Year1 | 2012-01-26 |
2012 | 1/28/2012 | C | 3400 | Year1 | 2012-01-28 |
2012 | Jan 21, 2012 | D | 6000 | Year1 | 2012-01-21 |
2013 | 1/26/2013 | A | 2000 | Year2 | 2013-01-26 |
2013 | 1/21/2013 | B | 2500 | Year2 | 2013-01-21 |
2013 | 1/28/2013 | C | 3400 | Year2 | 2013-01-28 |
2013 | Jan 25, 2013 | D | 6000 | Year2 | 2013-01-25 |
2013 | 41302 | E | 5500 | Year2 | 2013-01-28 |
2014 | 1/26/2014 | A | 2000 | Year3 | 2014-01-26 |
2014 | 1/28/2014 | B | 2500 | Year3 | 2014-01-28 |
2014 | 2/28/2014 | C | 3400 | Year3 | 2014-02-28 |
2014 | March 25, 2014 | D | 6000 | Year3 | 2014-03-25 |
2014 | 41789 | E | 5500 | Year3 | 2014-05-30 |
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.