Inspecting Data Structures
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The first step of any data related task is to inspect the data we are dealing with. This is crucial for data wrangling as well, since we need to explore the current structure of the data, in order to identify the required transformations.
- Inspect tabular data interactively with
View()
- Examine the data structure of each object using
str()
View(___) str(___)
Interactive Inspection with View()
Before starting with any kind of data analysis, it is crucial to understand the data we are dealing with. Plotting is a very important tool to get a quick overview of the statistical properties of data and to detect possible outliers. However, visualization might not always be possible, due to the size or complexity of the data set.
As an alternative solution, it might be convenient to interactively dig through the data set. This could be done by a spreadsheet-like interface, similar to Microsoft Excel, which enables to filter, sort and inspect tabular data structures.
R provides the function View()
, which shows an interactive data viewer. Depending on the used platform and editor, this viewer might look differently. Below you can see an example of the View()
function in RStudio:
View(gapminder)
Quiz: Interactive Inspection with View()
Why should you inspect data sets withView()
before starting with your analysis?
- Get a first impression of the data quality.
- Find outliers and missing values.
- Interactively inspect the data set.
- Create reproducible outputs for reports.
Exercise: Interactive Inspection with View()
Use the View()
function on the gapminder
data set and determine the country with the highest life expectancy. Pay also attention to year the projection was made. Set the variables country
and year
accordingly!
Examining Data Structures with str()
Sometimes we need to analyze very large and complex data structures. Displaying these data sources may already be overwhelming and simply not possible with interactive tools. In these cases, the str()
function comes to the rescue and prints the structure, as well as the first few values of any R object. Even very large and complex data structures can easily be displayed in the console that way.
As an example, let’s take a look at structure of the TitanicSurvival
data set:
library(carData) str(TitanicSurvival) 'data.frame': 1309 obs. of 4 variables: $ survived : Factor w/ 2 levels "no","yes": 2 2 1 1 1 2 2 1 2 1 ... $ sex : Factor w/ 2 levels "female","male": 1 2 1 2 1 2 1 2 1 2 ... $ age : num 29 0.917 2 30 25 ... $ passengerClass: Factor w/ 3 levels "1st","2nd","3rd": 1 1 1 1 1 1 1 1 1 1 ...
It consists of three factor columns (survived
, sex
and passengerClass
) and one numeric column age
. Note, that for factor columns both the labels (e.g. "no"
,"yes"
) as well as the integer values are displayed.
Quiz: Examining Data Structures with str()
In which cases is it benefitial to use thestr()
function?
- Get an overview of highly complex data sets.
- Create summary statistics describing the data set.
- Plot histograms.
- Only for
data.frames
.str()
can only handledata.frames
and cannot be used for other objects.
Quiz: Interpret the Output of str()
library(babynames) str(babynames) tibble [1,924,665 × 5] (S3: tbl_df/tbl/data.frame) $ year: num [1:1924665] 1880 1880 1880 1880 1880 1880 1880 1880 1880 1880 ... $ sex : chr [1:1924665] "F" "F" "F" "F" ... $ name: chr [1:1924665] "Mary" "Anna" "Emma" "Elizabeth" ... $ n : int [1:1924665] 7065 2604 2003 1939 1746 1578 1472 1414 1320 1288 ... $ prop: num [1:1924665] 0.0724 0.0267 0.0205 0.0199 0.0179 ...Examine the output of the
str()
function with the babynames dataset above. Which statements about the data set are correct?
- The data set has five rows.
- The data set has five columns.
- The
prop
column is of typenumeric
. - The column
sex
is of typefactor
.
Inspecting Data Structures is an excerpt from the course Advanced Data Transformation, which is available at quantargo.com
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.