There is usually more than one way in R

[This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers].

Python has a fairly famous design principle (from “PEP 20 — The Zen of Python”):

There should be one– and preferably only one –obvious way to do it.

Frankly, in R (especially once you add many packages) there is usually more than one way. As an example, we will talk about three common R functions: str(), head(), and the tibble package’s glimpse().

tibble::glimpse()

Consider the important task of inspecting a data.frame to see column types and a few example values. The dplyr/tibble/tidyverse way of doing this is as follows:

library("tibble")
glimpse(mtcars)

Observations: 32
Variables: 11
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10....
$ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
$ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 167.6, 167.6, 275.8, 275.8, 27...
$ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65,...
$ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92, 3.07, 3.07, 3.07, 2.93, 3.0...
$ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.440, 3.440, 4.070, 3.730, 3....
$ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18.30, 18.90, 17.40, 17.60, 18...
$ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1
$ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 4, 5, 5, 5, 5, 5, 4
$ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1, 2, 2, 4, 6, 8, 2

utils::str()

A common “base-R” (actually from the utils package) way to examine the data is:

str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

However, both of the above summaries have unfortunately obscured an important fact about the mtcars data.frame: the car names! This is because mtcars stores this important key as row-names instead of as a column. Even base::summary() will hide this from the analyst.
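
If the row names matter for your analysis, one option is to inspect them directly or promote them to an ordinary column. The following is only a minimal sketch in base R; the column name car_name is an illustrative choice, not something defined by mtcars.

```r
# mtcars ships with base R (datasets package), so no extra packages are needed.

# Look at the row names directly.
car_names <- rownames(mtcars)
print(head(car_names))

# Promote the row names to a regular column.
# "car_name" is a hypothetical column name chosen for this example.
d <- data.frame(car_name = rownames(mtcars),
                mtcars,
                stringsAsFactors = FALSE)
rownames(d) <- NULL  # drop the now-redundant row names

# str() now reports the car names like any other column.
str(d)
```

With the names stored as a column, summaries that only report columns no longer hide them.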

utils::head()

The base-R command head() (again from the utils package) provides a good way to examine the first few rows of data:

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

We are missing the size of the table and the column types, but those are easy to get with “dim(mtcars)” and “stack(vapply(mtcars, class, character(1)))”. And we can get something like the “columns on the side” presentation as follows:

t(head(mtcars))

     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout Valiant
mpg      21.00        21.000      22.80         21.400             18.70   18.10
cyl       6.00         6.000       4.00          6.000              8.00    6.00
disp    160.00       160.000     108.00        258.000            360.00  225.00
hp      110.00       110.000      93.00        110.000            175.00  105.00
drat      3.90         3.900       3.85          3.080              3.15    2.76
wt        2.62         2.875       2.32          3.215              3.44    3.46
qsec     16.46        17.020      18.61         19.440             17.02   20.22
vs        0.00         0.000       1.00          1.000              0.00    1.00
am        1.00         1.000       1.00          0.000              0.00    0.00
gear      4.00         4.000       4.00          3.000              3.00    3.00
carb      4.00         4.000       1.00          1.000              2.00    1.00
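
The size and column-type checks mentioned above can be sketched as follows. The variable name col_types is just an illustrative choice, and the vapply() call assumes every column has exactly one class (true for mtcars, but not for, say, POSIXct columns, which carry two).

```r
# Table size: number of rows, then number of columns.
print(dim(mtcars))

# One row per column, giving its class.
# vapply(..., character(1)) errors if a column has more than one class.
col_types <- stack(vapply(mtcars, class, character(1)))
print(col_types)
```

Here stack() is used only to turn the named character vector into a two-column data.frame (values and ind), which prints vertically.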

Also, head() is often adapted (through R’s S3 object system) to work with remote data sources.

library("sparklyr")
sc <- sparklyr::spark_connect(version='2.0.2',
                              master = "local")

dRemote <- copy_to(sc, mtcars)

head(dRemote)

# Source:   query [6 x 11]
# Database: spark connection master=local[4] app=sparklyr local=TRUE
#
# # A tibble: 6 x 11
#     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1  21.0     6   160   110  3.90 2.620 16.46     0     1     4     4
# 2  21.0     6   160   110  3.90 2.875 17.02     0     1     4     4
# 3  22.8     4   108    93  3.85 2.320 18.61     1     1     4     1
# 4  21.4     6   258   110  3.08 3.215 19.44     1     0     3     1
# 5  18.7     8   360   175  3.15 3.440 17.02     0     0     3     2
# 6  18.1     6   225   105  2.76 3.460 20.22     1     0     3     1

glimpse(dRemote)

# Observations: 32
# Variables: 11
# 
#  Error in if (width[i] <= max_width[i]) next : 
#   missing value where TRUE/FALSE needed 

broom::glance(dRemote)

#  Error: glance doesn't know how to deal with data of class tbl_sparktbl_sqltbl_lazytbl 
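
To see why head() adapts so readily to new data sources, here is a minimal sketch of S3 dispatch. The class fake_remote and the constructor new_fake_remote() are hypothetical stand-ins invented for this example; this is not how sparklyr itself implements head().

```r
# Hypothetical stand-in for a remote table: a list wrapping a local data.frame.
new_fake_remote <- function(d) {
  structure(list(data = d), class = "fake_remote")
}

# S3 method: utils::head() is a generic, so calling head() on an object of
# class "fake_remote" dispatches to this function.
head.fake_remote <- function(x, n = 6L, ...) {
  # A real remote backend would ask the server for only n rows;
  # here we simply delegate to the data.frame method.
  utils::head(x$data, n = n, ...)
}

r <- new_fake_remote(mtcars)
first_rows <- head(r, n = 3)
print(first_rows)
```

A function that is not an S3 generic, or a generic with no method for the new class, has no such hook, which is one plausible reading of the errors above.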

Conclusion

R often has more than one way to perform nearly the same task. When working in R, consider trying a few functions to see which one best fits your needs. Also be aware that base-R (R with the standard packages) often already has powerful capabilities.
