How to prepare data for analysis in r
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Welcome to our first article. This article aims to bestow the audience with commands that R offers to prepare the data for analysis in R.
Why prepare data?
If you are lucky, you may not have to prepare data, though real-life data has gaps, errors, missing values, etc. The quality of the analysis depends on the quality of the data. Remember the golden rule, garbage in, garbage out.
In the majority of cases, we visualize data in tabular form, though tabular data is somewhat easy to work with, yet it’s not always the case. Sometimes data for analysis is available as a combination of different files and requires multiple operations before it is ready for use.
In this article, we aim to look at:
- Prerequisite & Setup
- Data
- Strategy to prepare data for analysis
- Reading multiple CSV files into R
- Assigning variables names
- Understand the structure of imported data
- Merge data frames
- Rearrange and Prepare final data frame
Prerequisite and Setup
To practice along with this article, you need:
- R installed on your machine
- R Studio (optional)
- Data files
Details of the R environment used in this article are as follows:
sessionInfo() ## R version 4.0.1 (2020-06-06) ## Platform: x86_64-apple-darwin17.0 (64-bit) ## Running under: macOS Catalina 10.15.5 ## ## Matrix products: default ## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib ## ## locale: ## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 ## ## attached base packages: ## [1] stats graphics grDevices utils datasets methods base ## ## other attached packages: ## [1] data.table_1.12.8 kableExtra_1.1.0 forcats_0.5.0 stringr_1.4.0 ## [5] dplyr_1.0.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.0 ## [9] tibble_3.0.1 ggplot2_3.3.1 tidyverse_1.3.0 ## ## loaded via a namespace (and not attached): ## [1] tidyselect_1.1.0 xfun_0.14 haven_2.3.1 lattice_0.20-41 ## [5] colorspace_1.4-1 vctrs_0.3.1 generics_0.0.2 viridisLite_0.3.0 ## [9] htmltools_0.5.0 yaml_2.2.1 blob_1.2.1 rlang_0.4.6 ## [13] pillar_1.4.4 glue_1.4.1 withr_2.2.0 DBI_1.1.0 ## [17] dbplyr_1.4.4 modelr_0.1.8 readxl_1.3.1 lifecycle_0.2.0 ## [21] munsell_0.5.0 gtable_0.3.0 cellranger_1.1.0 rvest_0.3.5 ## [25] evaluate_0.14 knitr_1.28 fansi_0.4.1 broom_0.5.6 ## [29] Rcpp_1.0.4.6 scales_1.1.1 backports_1.1.7 webshot_0.5.2 ## [33] jsonlite_1.7.0 fs_1.4.1 hms_0.5.3 digest_0.6.25 ## [37] stringi_1.4.6 grid_4.0.1 cli_2.0.2 tools_4.0.1 ## [41] magrittr_1.5 crayon_1.3.4 pkgconfig_2.0.3 ellipsis_0.3.1 ## [45] xml2_1.3.2 reprex_0.3.0 lubridate_1.7.9 assertthat_0.2.1 ## [49] rmarkdown_2.2 httr_1.4.1 rstudioapi_0.11 R6_2.4.1 ## [53] nlme_3.1-148 compiler_4.0.1
Data
Data used in the article is about s and p 500 companies financials. This data is available under Open Data Commons Public Domain Dedication and License.
Original “s and p 500 companies financials” data is available as one CSV file but for this exercise, we have split the data into five different files. These five files are as follows:
- constituents-financials-observations-1.csv
- constituents-financials-observations-2.csv
- constituents-financials-sectors.csv
- constituents-financials-symbol-descriptions.csv
- constituents-financials-variables.csv
Strategy
The strategy to prepare the data for analysis is to read files into R. Collect all observations into one data table followed by variable name assignment. Merge result data table with sector and symbols data tables to bring all variables together.
We will use the tidyverse package throughout this article.
Reading multiple CSV files into R
R presents several functions to read data. In this article to keep things simple and clear, we will use read.csv to read data into R. As per our strategy, let’s read the observation files and create data tables.
observatiion.files <- list.files(pattern = "^.*observations.*.csv$") observations.list <- lapply(observatiion.files, read.csv, header = FALSE) all.observations <- rbindlist(observations.list)
The first statement above creates a character vector of length two. Function list.files
search the working directory for any CSV file that has the word “observations” into its filename.
The second statement creates a list object with the help of lapply
function. It can take a vector or list as an input and applies a function over the elements. The function we have chosen here is read.csv
where the header option is set to false so that the first row is not skipped.
The third and final statement all.observations <- rbindlist(observations.list)
creates a data table object by binding all the rows of elements of the list observations.list
.
Let’s read other files using fread
function.
all.sectors <- fread("constituents-financials-sectors.csv") all.symbols <- fread("constituents-financials-symbol-descriptions.csv") variable.names <- fread("constituents-financials-variables.csv", header = FALSE)
Now look at the imported data:
str(all.observations) ## Classes 'data.table' and 'data.frame': 505 obs. of 12 variables: ## $ V1 : chr "MMM" "AOS" "ABT" "ABBV" ... ## $ V2 : num 222.9 60.2 56.3 108.5 150.5 ... ## $ V3 : num 24.3 27.8 22.5 19.4 25.5 ... ## $ V4 : num 2.33 1.15 1.91 2.5 1.71 ... ## $ V5 : num 7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ... ## $ V6 : num 259.8 68.4 64.6 125.9 162.6 ... ## $ V7 : num 175.5 48.9 42.3 60 114.8 ... ## $ V8 : num 1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ... ## $ V9 : num 9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ... ## $ V10: num 4.39 3.58 3.74 6.29 2.6 ... ## $ V11: num 11.34 6.35 3.19 26.14 10.62 ... ## $ V12: chr "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ... ## - attr(*, ".internal.selfref")=<externalptr> str(all.sectors) ## Classes 'data.table' and 'data.frame': 505 obs. of 2 variables: ## $ Symbol: chr "MMM" "AOS" "ABT" "ABBV" ... ## $ Sector: chr "Industrials" "Industrials" "Health Care" "Health Care" ... ## - attr(*, ".internal.selfref")=<externalptr> str(all.symbols) ## Classes 'data.table' and 'data.frame': 505 obs. of 2 variables: ## $ Symbol: chr "MMM" "AOS" "ABT" "ABBV" ... ## $ Name : chr "3M Company" "A.O. Smith Corp" "Abbott Laboratories" "AbbVie Inc." ... ## - attr(*, ".internal.selfref")=<externalptr> str(variable.names) ## Classes 'data.table' and 'data.frame': 1 obs. of 12 variables: ## $ V1 : chr "Symbol" ## $ V2 : chr "Price" ## $ V3 : chr "Price/Earnings" ## $ V4 : chr "Dividend Yield" ## $ V5 : chr "Earnings/Share" ## $ V6 : chr "52 Week Low" ## $ V7 : chr "52 Week High" ## $ V8 : chr "Market Cap" ## $ V9 : chr "EBITDA" ## $ V10: chr "Price/Sales" ## $ V11: chr "Price/Book" ## $ V12: chr "SEC Filings" ## - attr(*, ".internal.selfref")=<externalptr>
All four commands above show the structure of four data.table
that we have created using fread
function. Let’s look at the first and last data.table
above. Both of these data tables are missing meaningful variable (column) names. The column names for first data tables all.observations
are available in variable.names
. Let’s assign variable names to all.observations
using variable.names
data table.
Assigning variable names
To assign the variable name, we would extract the first row from the data table
variable.names
. The output should be converted into a character vector. As a result, all column values of variable.names
data table would become elements of a new character vector. We would use this new character vector to assign the column name to our all.observations
data table.
colnames(all.observations) <- as.character(variable.names[1,]) str(all.observations) ## Classes 'data.table' and 'data.frame': 505 obs. of 12 variables: ## $ Symbol : chr "MMM" "AOS" "ABT" "ABBV" ... ## $ Price : num 222.9 60.2 56.3 108.5 150.5 ... ## $ Price/Earnings: num 24.3 27.8 22.5 19.4 25.5 ... ## $ Dividend Yield: num 2.33 1.15 1.91 2.5 1.71 ... ## $ Earnings/Share: num 7.92 1.7 0.26 3.29 5.44 1.28 7.43 3.39 6.19 0.03 ... ## $ 52 Week Low : num 259.8 68.4 64.6 125.9 162.6 ... ## $ 52 Week High : num 175.5 48.9 42.3 60 114.8 ... ## $ Market Cap : num 1.39e+11 1.08e+10 1.02e+11 1.81e+11 9.88e+10 ... ## $ EBITDA : num 9.05e+09 6.01e+08 5.74e+09 1.03e+10 5.64e+09 ... ## $ Price/Sales : num 4.39 3.58 3.74 6.29 2.6 ... ## $ Price/Book : num 11.34 6.35 3.19 26.14 10.62 ... ## $ SEC Filings : chr "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=MMM" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AOS" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABT" "http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV" ... ## - attr(*, ".internal.selfref")=<externalptr>
Command str(all.observations)
above shows variable names are all assigned to 12 different columns respectively.
Understand the structure of imported data
At this step, we know that all imported files are available to us as data tables. Structure of all.observation
data table shows that there are only two character variables and all other variables are of type numeric. On the flip side, the other two data tables have all variables of type character.
Before we start merging data tables, let’s familiarize ourselves with the common variable names between data tables. This will help us join the two data tables. Let’s look at the first 5 rows of each of the data tables and find out commonality.
names(all.observations) ## [1] "Symbol" "Price" "Price/Earnings" "Dividend Yield" ## [5] "Earnings/Share" "52 Week Low" "52 Week High" "Market Cap" ## [9] "EBITDA" "Price/Sales" "Price/Book" "SEC Filings" names(all.sectors) ## [1] "Symbol" "Sector" names(all.symbols) ## [1] "Symbol" "Name"
The result above shows that there are common variables names to merge the data tables. To bring all the data into one data table, we should start merging all.observations
with all.symbols
using the Symbol column and again with all.sectors
on Symbol column name. Let’s merge the data tables.
Merge data frames
complete.observations <- merge(merge(all.observations, all.symbols, by = "Symbol"), all.sectors , by = "Symbol")
Rearrange and Prepare final data frame
Now when all the data tables are merged, we will rearrange the variables within the final data table to keep all qualitative and quantitative columns together.
complete.observations <- setcolorder(complete.observations, c("Symbol","Name","Sector","Price","Price/Earnings","Dividend Yield","Earnings/Share","52 Week Low","52 Week High","Market Cap","EBITDA","Price/Sales","Price/Book","SEC Filings")) head(complete.observations) ## Symbol Name Sector Price Price/Earnings ## 1: A Agilent Technologies Inc Health Care 65.05 27.45 ## 2: AAL American Airlines Group Industrials 48.60 9.92 ## 3: AAP Advance Auto Parts Consumer Discretionary 109.63 19.54 ## 4: AAPL Apple Inc. Information Technology 155.15 16.86 ## 5: ABBV AbbVie Inc. Health Care 108.48 19.41 ## 6: ABC AmerisourceBergen Corp Health Care 91.55 15.54 ## Dividend Yield Earnings/Share 52 Week Low 52 Week High Market Cap ## 1: 0.8756979 2.10 75.00 49.23 21984606918 ## 2: 0.7782101 3.91 59.08 39.21 24594852352 ## 3: 0.2183207 6.19 169.55 78.81 8123611867 ## 4: 1.5795412 9.20 180.10 131.12 809508034020 ## 5: 2.4995599 3.29 125.86 60.05 181386347059 ## 6: 1.6132456 1.64 106.27 71.90 20587704101 ## EBITDA Price/Sales Price/Book ## 1: 1094000000 6.4935630 4.56 ## 2: 5761000000 0.5802257 6.03 ## 3: 853941000 1.1301061 2.51 ## 4: 79386000000 3.4586093 5.66 ## 5: 10310000000 6.2915710 26.14 ## 6: 991884000 0.1739698 9.73 ## SEC Filings ## 1: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=A ## 2: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAL ## 3: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAP ## 4: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=AAPL ## 5: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABBV ## 6: http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=ABC
Eight steps above explain to us how to prepare data if the data is presented to us in multiple header-less CSV files.
Finally, when we have reached towards the end of the article, let’s compare the file we have prepared with the original file. This test will confirm that the operations we have performed are correct or not. Let’s read the original CSV file into R and compare it with the final data table we have prepared above.
original.file <- fread("constituents-financials_csv.csv") %>% arrange(Symbol,Name,Sector) complete.observations <- complete.observations %>% arrange(Symbol,Name,Sector) result <- colSums(original.file != complete.observations) which(!!result) ## named integer(0)
The result above confirms that the files are identical.
Lastly, the complete code is as follows:
### ---------------------------------------------------------------------------- ### Reading multiple header-less CSV files into R and preparing data to analyze ### Copyright - DataENQ.com ### Version - V1.0 ### ---------------------------------------------------------------------------- # finding all observation files into working directory observatiion.files <- list.files(pattern = "^.*observations.*.csv$") #reading all files into R using read.csv function observations.list <- lapply(observatiion.files, read.csv, header = FALSE) # creating one data table by appending all rows of list elements all.observations <- rbindlist(observations.list) # reading all other CSV files into R all.sectors <- fread("constituents-financials-sectors.csv") all.symbols <- fread("constituents-financials-symbol-descriptions.csv") variable.names <- fread("constituents-financials-variables.csv", header = FALSE) # assigning columns names by converting the contents of first row from variable.names data table. colnames(all.observations) <- as.character(variable.names[1,]) complete.observations <- merge(merge(all.observations, all.symbols, by = "Symbol"), all.sectors , by = "Symbol") complete.observations <- setcolorder(complete.observations, c("Symbol","Name","Sector","Price","Price/Earnings","Dividend Yield","Earnings/Share","52 Week Low","52 Week High","Market Cap","EBITDA","Price/Sales","Price/Book","SEC Filings")) original.file <- fread("constituents-financials_csv.csv") %>% arrange(Symbol,Name,Sector) complete.observations <- complete.observations %>% arrange(Symbol,Name,Sector) result <- colSums(original.file != complete.observations) which(!!result)Image Credit unsplash.com
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.