Exploring Categorical Data With Inspectdf
What’s inspectdf and what’s it for?
I often find myself viewing and reviewing dataframes throughout the course of an analysis, and a substantial amount of time can be spent rewriting the same code to do this. inspectdf is an R package designed to make common exploratory tools a bit more useful and easy to use.
In particular, it’s very powerful to be able to quickly see the contents of categorical features. In this article, we’ll summarise how to use the inspect_cat() function from inspectdf for summarising and visualising categorical columns.
First of all, you’ll need to have the inspectdf package installed. You can get it from GitHub using:
library(devtools)
install_github("alastairrushworth/inspectdf")
Then load the package. We’ll also load dplyr for the starwars data and for the pipe %>%.
library(inspectdf)
library(dplyr)
# check out the starwars help file
?starwars
Tabular summaries using inspect_cat()
The starwars data that comes bundled with dplyr has 7 columns of character class, and is therefore a nice candidate for illustrating the use of inspect_cat(). We can see this quickly using the inspect_types() function from inspectdf.
starwars %>% inspect_types()
## # A tibble: 4 x 4
##   type        cnt  pcnt col_name
##   <chr>     <int> <dbl> <list>
## 1 character     7 53.8  <chr [7]>
## 2 list          3 23.1  <chr [3]>
## 3 numeric       2 15.4  <chr [2]>
## 4 integer       1  7.69 <chr [1]>
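
For comparison, a similar type breakdown can be sketched in base R. This is only an illustration of what the summary contains, not how inspect_types() is implemented; the names col_types, type_counts and type_pcnt are made up for this example, and the counts you see may differ from the output above depending on your version of dplyr.

```r
library(dplyr)  # only needed for the starwars data

# First class of each column, e.g. "character", "integer", "list"
col_types <- sapply(starwars, function(col) class(col)[1])

# Number of columns of each type, most frequent first
type_counts <- sort(table(col_types), decreasing = TRUE)

# Percentage of columns of each type
type_pcnt <- round(100 * type_counts / length(col_types), 1)

type_counts
type_pcnt
```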
Using inspect_cat() is very straightforward:
star_cat <- starwars %>% inspect_cat()
star_cat
## # A tibble: 7 x 5
##   col_name     cnt common common_pcnt levels
##   <chr>      <int> <chr>        <dbl> <list>
## 1 eye_color     15 brown        24.1  <tibble [15 × 3]>
## 2 gender         5 male         71.3  <tibble [5 × 3]>
## 3 hair_color    13 none         42.5  <tibble [13 × 3]>
## 4 homeworld     49 Naboo        12.6  <tibble [49 × 3]>
## 5 name          87 Ackbar        1.15 <tibble [87 × 3]>
## 6 skin_color    31 fair         19.5  <tibble [31 × 3]>
## 7 species       38 Human        40.2  <tibble [38 × 3]>
So what does this tell us? Each row in the tibble returned from inspect_cat() corresponds to one categorical column (factor, logical or character) in the starwars dataframe.
- The cnt column tells you how many unique levels there are for each column. For example, there are 15 unique entries in the eye_color column.
- The common column shows the most commonly occurring entry. For example, the most common eye_color is brown. Its percentage occurrence, 24.1%, is shown under common_pcnt.
- A full list of levels and their occurrence frequencies is provided in the list column levels.
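
To make the meaning of cnt, common and common_pcnt concrete, here is a rough base R sketch that computes the same three quantities for every character column. The helper summarise_column() is hypothetical and written just for this illustration; it is not part of inspectdf, and it assumes that missing values count as a level of their own.

```r
library(dplyr)  # only needed for the starwars data

# Hypothetical helper: summarise a single categorical vector
summarise_column <- function(x) {
  # Frequency of each entry, NA included, most frequent first
  counts <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
  data.frame(
    cnt         = length(counts),                 # number of unique levels
    common      = names(counts)[1],               # most common entry
    common_pcnt = 100 * counts[[1]] / length(x),  # its percentage share
    stringsAsFactors = FALSE
  )
}

# Apply the helper to each character column and stack the results
char_cols <- starwars[sapply(starwars, is.character)]
summary_tab <- do.call(rbind, lapply(char_cols, summarise_column))
summary_tab
```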
A table of relative frequencies of eye_color can be retrieved by typing:
star_cat$levels$eye_color
## # A tibble: 15 x 3
##    value           prop   cnt
##    <chr>          <dbl> <int>
##  1 brown         0.241     21
##  2 blue          0.218     19
##  3 yellow        0.126     11
##  4 black         0.115     10
##  5 orange        0.0920     8
##  6 red           0.0575     5
##  7 hazel         0.0345     3
##  8 unknown       0.0345     3
##  9 blue-gray     0.0115     1
## 10 dark          0.0115     1
## 11 gold          0.0115     1
## 12 green, yellow 0.0115     1
## 13 pink          0.0115     1
## 14 red, blue     0.0115     1
## 15 white         0.0115     1
There isn’t anything here that can’t be obtained by using the base table() function with some post-processing. inspect_cat() automates some of that functionality and wraps it into a single, convenient function.
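
As an illustration of the post-processing involved, here is a minimal base R sketch that builds a frequency table for eye_color with the same three columns as above. The object names counts and eye_tab are just for this example, and the ordering of tied entries may differ from what inspect_cat() returns.

```r
library(dplyr)  # only needed for the starwars data

# Count each eye colour, keeping NA as its own entry if any are present
counts <- table(starwars$eye_color, useNA = "ifany")
counts <- sort(counts, decreasing = TRUE)

# Reshape into a data frame of value, proportion and count
eye_tab <- data.frame(
  value = names(counts),
  prop  = as.numeric(counts) / sum(counts),
  cnt   = as.integer(counts),
  stringsAsFactors = FALSE
)
head(eye_tab)
```

Doing this for every categorical column, and keeping the results together, is the repetitive part that inspect_cat() takes care of.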
Visualising categorical columns with show_plot()
An important feature of inspectdf is the ability to visualise dataframe summaries. Visualising categories can be challenging, because categorical columns can be very rich and contain many unique levels. A simple stacked barplot can be produced using show_plot():
star_cat %>% show_plot()
As in the star_cat tibble returned by inspect_cat(), each row of the plot is a single column, split by the relative frequency of occurrence of each unique entry.
- Some of the bars are labelled, but where the bars are small, the labels are not shown. If you encounter categorical columns with really long strings, labels can be suppressed altogether with show_plot(text_labels = FALSE).
- Missing values or NAs are shown as gray bars. In this case, there are quite a few starwars characters whose homeworld is unknown or missing.
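
The gray bars can be cross-checked by counting missing values directly. The following is a small base R sketch, separate from inspectdf, with char_cols and na_counts as illustrative names:

```r
library(dplyr)  # only needed for the starwars data

# Keep only the character columns
char_cols <- starwars[sapply(starwars, is.character)]

# Number of NA entries in each character column, largest first
na_counts <- colSums(is.na(char_cols))
sort(na_counts, decreasing = TRUE)
```

Note that this counts true NA values only; entries recorded as the literal string "unknown" are treated as an ordinary level.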
Combining rare entries with show_plot()
Some of the categorical columns, like name, seem to have a lot of unique entries. We should expect this – names are often unique (or almost unique) in a small dataset. If we scaled this analysis up to a dataset with millions of rows, there would be so many names with very small relative frequencies that the name bars would be very difficult to see. show_plot() can help with this too!
star_cat %>% show_plot(high_cardinality = 1)
By setting the argument high_cardinality = 1, all entries that occur only once are combined into a single group labelled high cardinality. This makes it easier to see when some entries occur only once (or extremely rarely).
- In the above, it’s now obvious that no two people in the starwars data share the same name, and that many come from a unique homeworld or species.
- By setting high_cardinality = 2 or even greater, it’s possible to group the ‘long-tail’ of rare categories even further. With larger datasets, this becomes increasingly important for visualisation.
- A practical reason to combine rare entries is plotting speed – it can take a long time to render a plot with tens of thousands (or more) unique bars! Using the high_cardinality argument can reduce this dramatically.
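
The idea behind this grouping can be sketched in a few lines of base R. The helper lump_rare() below is hypothetical and written only to illustrate the concept on a toy vector; it is not how show_plot() is implemented internally.

```r
# Hypothetical helper: replace entries occurring at most k times
# with a single label
lump_rare <- function(x, k = 1, label = "high cardinality") {
  counts <- table(x)                    # frequency of each entry (NA ignored)
  rare <- names(counts)[counts <= k]    # entries seen k times or fewer
  x[x %in% rare] <- label
  x
}

# Toy example: "b" and "d" occur only once
x <- c("a", "a", "b", "c", "c", "c", "d")
lumped <- lump_rare(x, k = 1)
table(lumped)
```

Raising k to 2 would also fold "a" into the combined group, which mirrors the effect of increasing high_cardinality.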
Playing with color options in show_plot()
It’s been pointed out that the default ggplot color theme isn’t particularly friendly to color-blind audiences. A more color-blind friendly theme is available by specifying col_palette = 1:
star_cat %>% show_plot(col_palette = 1)
I’m also quite fond of the 80s theme, available by choosing col_palette = 2:
star_cat %>% show_plot(col_palette = 2)
There are 5 palettes at the moment, so have a play around. Note that the color palettes have not yet hit the CRAN version of inspectdf – that will come soon in an update, but for now you can get them from the GitHub version of the package using the code at the start of the article.
Comments? Suggestions? Issues?
Any feedback is welcome! Find me on twitter at rushworth_a or write a github issue.