The R type system

[This article was first published on Digithead's Lab Notebook, and kindly contributed to R-bloggers].

R is a weird beast. Through its ancestor the S language, it claims a proud heritage reaching back to Bell Labs in the 1970s, when S was created as an interactive wrapper around a set of statistical and numerical subroutines. As a programming language, R takes ideas from Unix shell scripting, functional languages (Lisp and ML), and also a little from C. Programmers will usually have at least some background in these languages, but one aspect of R that might remain puzzling is its type system.

Because the purpose of R is programming with data, it has some fairly sophisticated tools to represent and manipulate data. First off, the basic unit of data in R is the vector. Even a single integer is represented as a vector of length 1. All elements in an atomic vector are of the same type. The sizes of integers and doubles are implementation dependent. Generic vectors, or lists, hold elements of varying types and can be nested to create compound data structures, as in Lisp-like languages.

Fundamental types

# a is a vector of length 1
> a <- 101
> length(a)
[1] 1

# the function c() combines its arguments
# construct a vector of numeric data and access its members
> ages <- c(40, 36, 2, 38, 27, 1)
> ages[2]
[1] 36
> ages[4:6]
[1] 38 27  1

# a list can mix types; the title is double-quoted because it contains an apostrophe
> movie <- list(title="Monty Python's The Meaning of Life", year=1983, cast=c('Graham Chapman','John Cleese','Terry Gilliam','Eric Idle','Terry Jones','Michael Palin'))
> movie
$title
[1] "Monty Python's The Meaning of Life"
$year
[1] 1983
$cast
[1] "Graham Chapman" "John Cleese"    "Terry Gilliam"  "Eric Idle"      "Terry Jones"    "Michael Palin"
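
As a small sketch of the difference between atomic vectors and lists, the script below checks element types with typeof() and pulls items out of a nested list. The shortened cast and the `nested` list with its `tags` element are made up for illustration; they are not from the examples above.

```r
# Atomic vectors hold a single type. Mixing types in c() coerces
# every element to the most general type present.
typeof(101)     # "double": numeric literals are doubles by default
typeof(101L)    # "integer": the L suffix asks for an integer
mixed <- c(1, "two", TRUE)
typeof(mixed)   # "character"

# Lists (generic vectors) can hold elements of different types,
# including other vectors and other lists.
movie <- list(
  title = "Monty Python's The Meaning of Life",
  year  = 1983,
  cast  = c("Graham Chapman", "John Cleese")
)
typeof(movie)        # "list"
movie$year           # 1983
movie[["cast"]][2]   # "John Cleese"

# Nesting a list inside another list (hypothetical structure)
nested <- list(film = movie, tags = c("comedy"))
nested$film$cast[1]  # "Graham Chapman"
```

Note the difference between `[[ ]]` (or `$`), which extracts an element, and `[ ]`, which returns a shorter list.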

Attributes

R objects can have attributes – arbitrary key/value pairs – attached to them. One use for this is that elements in vectors or lists can be named. R’s object system is based on the class attribute. (OK, I really mean the simpler of R’s two object systems, but let’s avoid that topic.) Attributes are also used to turn one-dimensional vectors into multi-dimensional structures by specifying their dimensions, as we’ll see next.
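
Here is a minimal sketch of attributes in action. The names, the `units` attribute, and the `ages` class with its print method are all hypothetical, chosen only to show the mechanics.

```r
x <- c(40, 36, 2)

# Naming elements just sets the "names" attribute
names(x) <- c("alice", "bob", "carol")   # hypothetical names
attributes(x)   # a list with one element, $names
x["bob"]        # elements can now be looked up by name

# attr() gets and sets arbitrary key/value pairs
attr(x, "units") <- "years"   # hypothetical attribute
attr(x, "units")              # "years"

# Setting the class attribute is all it takes to opt in to S3 dispatch
class(x) <- "ages"            # hypothetical class name
print.ages <- function(obj, ...) {
  cat("ages in", attr(obj, "units"), ":", unclass(obj), "\n")
  invisible(obj)
}
print(x)        # dispatches to print.ages()
```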

Matrices and arrays

Matrices and arrays are special types of vectors, distinguished by having a dim (dimensions) attribute. A matrix has two dimensions, so the value of its dim attribute is a vector of length 2 specifying the number of rows and columns in the matrix. Arrays are n-dimensional vectors, sometimes used like an OLAP data cube, with dimension vectors of length n.

# create some data series
> bac <- c(14.08, 7.05, 13.05, 16.21)
> hbc <- c(48.67, 29.51, 41.93, 55.82)
> jpm <- c(31.53, 28.14, 33.77, 41.37)

# create a matrix whose rows are companies and columns are quarters
# values in the matrix are closing stock prices on the first day of the quarter
> m <- matrix(c(bac,hbc,jpm), nrow=3, byrow=TRUE)
> rownames(m) <- c('bac','hbc','jpm')
> colnames(m) <- c('q1', 'q2', 'q3', 'q4')

> m
       q1    q2    q3    q4
bac 14.08  7.05 13.05 16.21
hbc 48.67 29.51 41.93 55.82
jpm 31.53 28.14 33.77 41.37

# check out the attributes
> attributes(m)
$dim
[1] 3 4
$dimnames
$dimnames[[1]]
[1] "bac" "hbc" "jpm"
$dimnames[[2]]
[1] "q1" "q2" "q3" "q4"
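
To see that the dim attribute really is all that separates a matrix from a plain vector, here is a small sketch using a throwaway vector `v` (not part of the stock example). Setting dim reshapes the same 24 values in place, filling column by column.

```r
v <- 1:24
is.matrix(v)           # FALSE: just a plain vector

dim(v) <- c(4, 6)      # same data, now a 4 x 6 matrix
is.matrix(v)           # TRUE
v[2, 3]                # 10: values fill down the columns first

dim(v) <- c(2, 3, 4)   # same data again, now a 2 x 3 x 4 array
is.matrix(v)           # FALSE: three dimensions, so not a matrix
is.array(v)            # TRUE
v[1, 2, 3]             # 15

dim(v) <- NULL         # drop the attribute to get the plain vector back
length(v)              # 24
```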

Factors

Statisticians divide data into four types: nominal, ordinal, interval and ratio. Factors are for the first two, depending on whether they are ordered or not. This makes a difference for some of the stats algorithms in R, but from a programmer’s point of view, a factor is just an enum. Older versions of R turn character vectors into factors at the slightest provocation; since R 4.0.0, data.frame() and read.table() leave strings alone unless you ask for factors with stringsAsFactors=TRUE. It’s sometimes necessary to coerce factors back to character strings, using as.character().
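
A short sketch of the factor-as-enum idea, using made-up size labels. It shows the integer codes behind a factor, an ordered factor, and the usual trap when converting number-like factors back to numbers.

```r
sizes <- c("small", "large", "medium", "small")   # hypothetical data

# Nominal: levels are sorted alphabetically by default
f <- factor(sizes)
levels(f)         # "large" "medium" "small"
as.integer(f)     # 3 1 2 3: the underlying enum codes
as.character(f)   # back to the original strings

# Ordinal: give the levels explicitly and mark the factor as ordered
o <- factor(sizes, levels = c("small", "medium", "large"), ordered = TRUE)
o[1] < o[2]       # TRUE: small < large

# Trap: converting a factor of digits straight to numbers yields the codes
n <- factor(c("10", "20", "10"))
as.integer(n)                  # 1 2 1, probably not what you wanted
as.numeric(as.character(n))    # 10 20 10
```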

Data frames

A data frame is a special list in which all elements are vectors of equal length. It is analogous to a table in a database, except that it’s column-oriented rather than row-oriented. Because the vectors are constrained to be of the same length, you can index any cell in a data frame by its row and column.

# make a simple data frame
> df <- data.frame(ticker=c('bac', 'hbc', 'jpm'), market.cap=c(137.37, 185.65, 157.80), yield=c(0.25,3.00,0.50))
> df
  ticker market.cap yield
1    bac     137.37  0.25
2    hbc     185.65  3.00
3    jpm     157.80  0.50
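
The sketch below rebuilds the same data frame and pokes at it both as a list of columns and as a table of cells. The explicit stringsAsFactors=FALSE and the 0.4 yield cutoff are my additions for illustration.

```r
df <- data.frame(
  ticker     = c("bac", "hbc", "jpm"),
  market.cap = c(137.37, 185.65, 157.80),
  yield      = c(0.25, 3.00, 0.50),
  stringsAsFactors = FALSE   # keep ticker as character in any R version
)

# It is a list of columns...
is.list(df)            # TRUE
length(df)             # 3: one list element per column
df$yield               # a plain numeric vector

# ...that can also be indexed by row and column
df[2, "market.cap"]    # 185.65
df[df$yield > 0.4, ]   # the hbc and jpm rows (arbitrary cutoff)

# Each column keeps its own type
sapply(df, class)
```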

There’s more, of course, but this gives you enough to be dangerous. Note that, because R natively works with vectors, many operations in R are vectorized, meaning they operate on whole vectors at once, rather than on a single scalar value. The key to performance in R is making good use of vectorized operations. Also, being functional, R inherits a full complement of higher-order functions: Map, Reduce, Filter and many forms of apply (lapply, sapply, and tapply). Mixing higher-order functions and vectorized operations can get confusing (and is the source of the proliferation of apply functions). Both these techniques, as well as the organization of the type system, encourage you to work with blocks of data as a unit. This is what John Chambers called high-level prototyping for computations with data.
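
To make the contrast concrete, here is a rough sketch that applies vectorized operations and a few higher-order functions to the ages vector from earlier. The `group` labels and the cutoff of 18 are invented for the example.

```r
ages <- c(40, 36, 2, 38, 27, 1)

# Vectorized: arithmetic, comparison and indexing work on the whole vector
ages + 1
ages[ages >= 18]   # 40 36 38 27
mean(ages)         # 24

# Higher-order: pass a function to another function
squares <- sapply(ages, function(a) a^2)
identical(squares, ages^2)          # TRUE, and the vectorized form is simpler
Filter(function(a) a < 18, ages)    # 2 1
Reduce(`+`, ages)                   # 144, same as sum(ages)

# tapply applies a function within groups defined by a second vector
group <- c("adult", "adult", "child", "adult", "adult", "child")  # hypothetical labels
tapply(ages, group, mean)           # adult 35.25, child 1.5
```

When a vectorized form exists, as with ages^2, it is usually the one to reach for; the apply family earns its keep when the function you need only works on one element at a time.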
