Examining Data with glimpse()

Exploring Data

2 years ago

[This article was first published on Exploring Data, and kindly contributed to R-bloggers.]

Source: https://bit.ly/3g9sNgK

Quick Overview

Exploring-Data is a place where I share easily digestible content aimed at making the wrangling and exploration of data more efficient (+fun).

And if you enjoy the post, be sure to share it.

Let’s Dive into an Example

I’m a nerd when it comes to programming (esp. in R) and love learning about new functions.

Matt Dancho, early in his Data Science Foundations with R Course, teaches students the importance of examining data before diving into wrangling and exploring it further.

One of the functions I’ve adopted from the course is the glimpse() function – it’s awesome and I use it all the time!

The `glimpse()` function

When your data has a small number of columns it’s easy to print + view them in the RStudio console; however, when there are many columns it’s difficult to digest the view returned.

Let’s look at an example so I can stress the value in using the dplyr::glimpse() function when examining your data.

Let’s Get Some Data

The Tidy Tuesday Project is an awesome repository of useful data for practicing your Data Wrangling skills.

We will work with the Volcano Eruptions data as a case study for using the glimpse() function for examining your data prior to further exploration.

For simplicity, we are only going to focus on the Volcano data from the repository.

# Load Libraries
library(tidyverse)
library(tidyquant)
library(rmarkdown)
library(leaflet)
library(tmap)
library(DataExplorer)

# Import Data
volcano_raw_tbl <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-05-12/volcano.csv')

Data Examination with `print()`

The tidyverse version of a data.frame is a tibble and when printed to the RStudio console an overview of your data is returned.

When our data has only a few columns the tibble printed to the console is awesome! The user gets a quick overview of their data e.g., data-types, column count, etc.

Let’s subset our volcano data to its first four columns to see what this looks like in action.

# Subset the first four columns + print three rows
print(volcano_raw_tbl[ , 1:4], n = 3)
## # A tibble: 958 x 4
##   volcano_number volcano_name primary_volcano_type last_eruption_year
##            <dbl> <chr>        <chr>                <chr>             
## 1         283001 Abu          Shield(s)            -6850             
## 2         355096 Acamarachi   Stratovolcano        Unknown           
## 3         342080 Acatenango   Stratovolcano(es)    1972              
## # … with 955 more rows

That’s awesome!

A clean view is returned with information that allows the user to quickly examine their data. But what if our data has many columns, as is the case with the full Volcano data?

print(volcano_raw_tbl, n = 5)
## # A tibble: 958 x 26
##   volcano_number volcano_name primary_volcano… last_eruption_y… country region
##            <dbl> <chr>        <chr>            <chr>            <chr>   <chr> 
## 1         283001 Abu          Shield(s)        -6850            Japan   Japan…
## 2         355096 Acamarachi   Stratovolcano    Unknown          Chile   South…
## 3         342080 Acatenango   Stratovolcano(e… 1972             Guatem… Méxic…
## 4         213004 Acigol-Nevs… Caldera          -2080            Turkey  Medit…
## 5         321040 Adams        Stratovolcano    950              United… Canad…
## # … with 953 more rows, and 20 more variables: subregion <chr>, latitude <dbl>,
## #   longitude <dbl>, elevation <dbl>, tectonic_settings <chr>,
## #   evidence_category <chr>, major_rock_1 <chr>, major_rock_2 <chr>,
## #   major_rock_3 <chr>, major_rock_4 <chr>, major_rock_5 <chr>,
## #   minor_rock_1 <chr>, minor_rock_2 <chr>, minor_rock_3 <chr>,
## #   minor_rock_4 <chr>, minor_rock_5 <chr>, population_within_5_km <dbl>,
## #   population_within_10_km <dbl>, population_within_30_km <dbl>,
## #   population_within_100_km <dbl>

Printing that many columns to the console makes examining the data cumbersome. Only the first six columns are shown with example values; the remaining 20 are wrapped into a footer that lists their names and data-types, which is difficult to scan and shows no example values at all.

So why not just use the tibble::view() function and open the data in the RStudio data viewer? That’s a great idea, but before we go there we really want to get a sense of our data and the associated data-types.

Data Examination with `glimpse()`

I recommend using the glimpse() function and using it early.

NOTE: On a mobile device (esp. a phone) this will not look good; check it out on a laptop or desktop.

volcano_raw_tbl %>% glimpse()
## Rows: 958
## Columns: 26
## $ volcano_number           <dbl> 283001, 355096, 342080, 213004, 321040, 2831…
## $ volcano_name             <chr> "Abu", "Acamarachi", "Acatenango", "Acigol-N…
## $ primary_volcano_type     <chr> "Shield(s)", "Stratovolcano", "Stratovolcano…
## $ last_eruption_year       <chr> "-6850", "Unknown", "1972", "-2080", "950", …
## $ country                  <chr> "Japan", "Chile", "Guatemala", "Turkey", "Un…
## $ region                   <chr> "Japan, Taiwan, Marianas", "South America", …
## $ subregion                <chr> "Honshu", "Northern Chile, Bolivia and Argen…
## $ latitude                 <dbl> 34.500, -23.292, 14.501, 38.537, 46.206, 37.…
## $ longitude                <dbl> 131.600, -67.618, -90.876, 34.621, -121.490,…
## $ elevation                <dbl> 641, 6023, 3976, 1683, 3742, 1728, 1733, 125…
## $ tectonic_settings        <chr> "Subduction zone / Continental crust (>25 km…
## $ evidence_category        <chr> "Eruption Dated", "Evidence Credible", "Erup…
## $ major_rock_1             <chr> "Andesite / Basaltic Andesite", "Dacite", "A…
## $ major_rock_2             <chr> "Basalt / Picro-Basalt", "Andesite / Basalti…
## $ major_rock_3             <chr> "Dacite", " ", " ", "Basalt / Picro-Basalt",…
## $ major_rock_4             <chr> " ", " ", " ", "Andesite / Basaltic Andesite…
## $ major_rock_5             <chr> " ", " ", " ", " ", " ", " ", " ", " ", " ",…
## $ minor_rock_1             <chr> " ", " ", "Basalt / Picro-Basalt", " ", "Dac…
## $ minor_rock_2             <chr> " ", " ", " ", " ", " ", "Basalt / Picro-Bas…
## $ minor_rock_3             <chr> " ", " ", " ", " ", " ", " ", " ", "Andesite…
## $ minor_rock_4             <chr> " ", " ", " ", " ", " ", " ", " ", " ", " ",…
## $ minor_rock_5             <chr> " ", " ", " ", " ", " ", " ", " ", " ", " ",…
## $ population_within_5_km   <dbl> 3597, 0, 4329, 127863, 0, 428, 101, 51, 0, 9…
## $ population_within_10_km  <dbl> 9594, 7, 60730, 127863, 70, 3936, 485, 6042,…
## $ population_within_30_km  <dbl> 117805, 294, 1042836, 218469, 4019, 717078, …
## $ population_within_100_km <dbl> 4071152, 9092, 7634778, 2253483, 393303, 502…

`glimpse()` Packs a Punch

This view (if on a large enough screen) is awesome!

We’ve now got a glimpse() of an incredible amount of detail to help inform our next steps in examining and exploring these data. I love that in one view we can quickly see row/column counts, each column printed cleanly, the data-type of each column, and a handful of example values for each column.
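
A handy detail: glimpse() returns its input invisibly, so it can sit in the middle of a pipeline without interrupting it. Here is a minimal sketch using R's built-in mtcars data rather than the volcano data; the `heavy_cars` name and the weight cutoff are just illustrative choices, and `width` is the glimpse() argument that controls how wide the printed output is.

```r
library(dplyr)

# glimpse() prints the overview, then passes the data along unchanged,
# so the filter() step still receives the full mtcars data frame.
heavy_cars <- mtcars %>%
  glimpse(width = 60) %>%   # narrower output for small screens
  filter(wt > 3.5)          # illustrative cutoff, not from the post

nrow(heavy_cars)
```

Setting a smaller `width` can also help when the default output is too wide for the screen.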

We immediately gain insights into potential issues, such as empty values being stored as spaces (e.g., " ") instead of NA values.
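
One possible way to handle those blank strings (not something the original analysis does) is to convert them to NA with dplyr::na_if(). The sketch below uses a small toy tibble, `rocks_tbl`, built from a few values visible in the glimpse() output above; it assumes dplyr 1.0 or later for across().

```r
library(dplyr)

# Toy stand-in for the rock columns (values copied from the output above)
rocks_tbl <- tibble::tibble(
  volcano_name = c("Abu", "Acamarachi"),
  major_rock_1 = c("Andesite / Basaltic Andesite", "Dacite"),
  major_rock_3 = c("Dacite", " ")
)

# Replace the single-space placeholder with a real NA in every character column
rocks_clean_tbl <- rocks_tbl %>%
  mutate(across(where(is.character), ~ na_if(.x, " ")))

print(rocks_clean_tbl)
```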

We also see that these data are not tidy because headers of our columns in some instances are values, not names of variables – this is the case with the rock features and the population features.

The glimpse() function allowed us to quickly examine our data – now let’s have fun!

Subsetting Data for Exploration

I’ve done GIS work before and so I’m really interested in mapping some of these data.

The population counts based on proximity to volcanoes sound really interesting so let’s start there.

volcano_raw_subset_tbl <- volcano_raw_tbl %>% 
    select(volcano_name, country, 
           contains('tude'), contains('population'))

print(volcano_raw_subset_tbl, n = 5)
## # A tibble: 958 x 8
##   volcano_name country latitude longitude population_with… population_with…
##   <chr>        <chr>      <dbl>     <dbl>            <dbl>            <dbl>
## 1 Abu          Japan       34.5     132.              3597             9594
## 2 Acamarachi   Chile      -23.3     -67.6                0                7
## 3 Acatenango   Guatem…     14.5     -90.9             4329            60730
## 4 Acigol-Nevs… Turkey      38.5      34.6           127863           127863
## 5 Adams        United…     46.2    -121.                 0               70
## # … with 953 more rows, and 2 more variables: population_within_30_km <dbl>,
## #   population_within_100_km <dbl>

Get another `glimpse()`

We’ve reduced our data down to just 8 columns but it’s still a bit difficult to get a full view of what it contains.

Let’s use the glimpse() function again to get a better view.

glimpse(volcano_raw_subset_tbl)
## Rows: 958
## Columns: 8
## $ volcano_name             <chr> "Abu", "Acamarachi", "Acatenango", "Acigol-N…
## $ country                  <chr> "Japan", "Chile", "Guatemala", "Turkey", "Un…
## $ latitude                 <dbl> 34.500, -23.292, 14.501, 38.537, 46.206, 37.…
## $ longitude                <dbl> 131.600, -67.618, -90.876, 34.621, -121.490,…
## $ population_within_5_km   <dbl> 3597, 0, 4329, 127863, 0, 428, 101, 51, 0, 9…
## $ population_within_10_km  <dbl> 9594, 7, 60730, 127863, 70, 3936, 485, 6042,…
## $ population_within_30_km  <dbl> 117805, 294, 1042836, 218469, 4019, 717078, …
## $ population_within_100_km <dbl> 4071152, 9092, 7634778, 2253483, 393303, 502…

Data Examination Cont.

Before exploring these data further, let’s examine them a little closer. Let’s see if there are any missing data in our subset of columns.

Any Missing Records?

volcano_raw_subset_tbl %>% 
    DataExplorer::profile_missing() %>% 
    arrange(desc(num_missing)) %>% 
    rmarkdown::paged_table()

Nothing missing, that’s a good sign. And based on our earlier glimpse() all of the data-types looked good too.
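
If DataExplorer isn't installed, a similar per-column NA count can be sketched with dplyr alone. The toy tibble `toy_tbl` below is made up for illustration (it is not the volcano data). Keep in mind that a check based on is.na() only counts true NA values, so placeholders such as " " or "Unknown" seen in the full data-set would not show up as missing here.

```r
library(dplyr)

# Made-up data with one NA in each numeric column
toy_tbl <- tibble::tibble(
  country                = c("Japan", "Chile", "Guatemala"),
  latitude               = c(34.5, NA, 14.5),
  population_within_5_km = c(3597, 0, NA)
)

# Count NA values per column
missing_counts <- toy_tbl %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

print(missing_counts)
```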

Volcano Counts by Country: Top 5

Let’s see which countries have the most volcanoes.

volcano_raw_subset_tbl %>% 
    count(country, sort = TRUE) %>% 
    top_n(5, wt = n) %>% rmarkdown::paged_table()

I’m shocked that the United States has the most (in this data-set). While that’s interesting, a quick Google search shows that Indonesia ranks high among countries with the most active volcanoes.

Let’s focus on Indonesia.

Data Wrangling

Let’s wrangle these data and transform them from a wide to long format – Tidy-Data.

# wrangle + tidy data
indonesia_tidy_tbl <- volcano_raw_subset_tbl %>% 
    
    # filter to country of interest
    filter(country == 'Indonesia') %>%
    
    # pivot data into 'tidy' format
    pivot_longer(cols = contains('pop'),
                 names_to = 'distance_km',
                 values_to = 'population') %>% 
    
    # reorder columns
    select(volcano_name, country, distance_km,
           population, everything())
 

# view tidy data  
indonesia_tidy_tbl %>% 
    
    # format table
    rmarkdown::paged_table(options = list(rows.print = 5))
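
To make the wide-to-long reshaping concrete, here is a small sketch on a two-row toy tibble, `wide_tbl`, using the first two volcanoes and two of the population columns from the output above. Each population column becomes its own row per volcano, with the old column name stored in distance_km.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per volcano, one column per distance band
wide_tbl <- tibble::tibble(
  volcano_name            = c("Abu", "Acamarachi"),
  population_within_5_km  = c(3597, 0),
  population_within_10_km = c(9594, 7)
)

# Long table: one row per volcano and distance band
long_tbl <- wide_tbl %>%
  pivot_longer(cols      = contains("pop"),
               names_to  = "distance_km",
               values_to = "population")

print(long_tbl)
```

Two volcanoes times two population columns gives four rows in the long table.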

Our tidy-data table is coming together but it still has a glaring issue: text is combined with numeric values in our distance_km column.

I’ve got just the trick for that: readr::parse_number()

Another Awesome Function

The readr::parse_number() function extracts numeric values from text – Simply Awesome!

# tidy up the distance feature
indonesia_tidy_tbl <- indonesia_tidy_tbl %>% 
    mutate(distance_km = parse_number(distance_km))

# format and view table
indonesia_tidy_tbl %>% paged_table(list(rows.print = 5))
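
To see what parse_number() does in isolation, here is a short sketch that applies it directly to the four population column names from this data-set. The vector name `distance_labels` is just for illustration.

```r
library(readr)

# The original column names that ended up in distance_km after pivoting
distance_labels <- c("population_within_5_km",
                     "population_within_10_km",
                     "population_within_30_km",
                     "population_within_100_km")

# parse_number() drops the surrounding text and keeps the first number it finds
distances <- parse_number(distance_labels)

print(distances)
```

The result is a numeric vector, which is why the later filter on distance_km == 5 works.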

Am I the only one who geeks out when they learn new functions?

Now that we’ve got our data wrangled, let’s see if we can generate a quick map. I’ve never done this in R and so this ought to be fun.

Map Indonesia Volcanoes

Let’s take a look at all volcanoes that have more than 10,000 people living within 5 km (~3 mi).

Click on a Volcano to see its name and population.

indonesia_tidy_tbl %>%
  
    # filter data
    filter(country == 'Indonesia', distance_km == 5,
        population > 10000) %>% 
  
    # build text for popup
    mutate(popup_text = str_glue(
      "<b>Volcano Name:</b> {volcano_name}<br/>
       <b>Population:</b> {scales::comma(population)}")) %>% 
  
    # build awesome volcano map
    leaflet(height = 300) %>% 
    addTiles() %>% 
    addCircleMarkers(
        lat = ~ latitude,
        lng = ~ longitude,
        radius = 5, color = 'red', 
        stroke = FALSE,
        fillOpacity = 0.5,
        popup = ~ popup_text) # click a marker to open its popup
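
To check the popup text without building a map, the string template can be tried on a single pair of values. This is only a sketch: the two values below are taken from the Acatenango row shown earlier (a Guatemalan volcano, used purely as sample input), and a `<br/>` line break separates the two fields.

```r
library(stringr)

# Sample inputs (from the Acatenango row printed earlier)
volcano_name <- "Acatenango"
population   <- 4329

# str_glue() fills in {…} with the values above;
# scales::comma() adds the thousands separator
popup_text <- str_glue(
  "<b>Volcano Name:</b> {volcano_name}<br/>",
  "<b>Population:</b> {scales::comma(population)}"
)

print(popup_text)
```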

In these data there are 28 volcanoes that have more than 10,000 people living within 3 miles.

I wonder how many of those volcanoes are active. At UC Davis I actually took a class on volcanoes and remember that cities near active volcanoes have extensive evacuation plans.

Wrap-Up

That’s it for today!

We used glimpse() to quickly examine our data prior to Wrangling and Exploring it further.

Get the code here: Github Repo.

Subscribe + Share

Stay tuned for more posts on Wrangling + Exploring data.

PS: Be Kind and Tidy your Data.
