
R Vocabulary – Part 3

[This article was first published on Anindya Mozumdar, and kindly contributed to R-bloggers.]

This is the third part of the series of articles on R vocabulary. In this series, we explore most of the functions mentioned in Chapter 2 of the book Advanced R. The first part of the series can be read here and the second part of the series can be read here.

We start this article by looking at some functions which work on dates. The function strptime converts character vectors to an object of class POSIXlt. Internally, this is a list in which each component represents some aspect of a calendar date and time. The documentation for POSIXlt gives the list of components available. A couple of them are demonstrated in the example below; note that the year component counts years since 1900, so 2017 is stored as 117.

mydates_c <- c("01-Jan-2017", "05-May-2012", "07-Aug-2022")
mydates <- strptime(mydates_c, format = "%d-%b-%Y")
class(mydates)
## [1] "POSIXlt" "POSIXt"
mydates[1]
## [1] "2017-01-01 IST"
mydates[[1]]$year
## [1] 117
mydates[[1]]$sec
## [1] 0
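As a small sketch of how the POSIXlt components are encoded, the example below builds a date-time in a fixed time zone (an assumption made here so that the result does not depend on local settings) and reads a few components. Several of them are offsets rather than calendar values.

```r
# A fixed time zone keeps the example independent of the local machine.
d <- as.POSIXlt("2017-01-01 10:30:00", tz = "UTC")

d$year + 1900   # calendar year (year is stored as years since 1900)
d$mon + 1       # month number (mon is 0-based, 0 = January)
d$mday          # day of the month
d$wday          # day of the week, 0 = Sunday
d$yday          # day of the year, 0 = 1 January
```

Adding 1900 to year and 1 to mon is the usual way to get back to calendar values.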

strptime can convert character strings in a wide variety of formats to POSIXlt objects. In the example, the format string %d-%b-%Y represents the day of the month, the abbreviated month name and the 4-digit year respectively. The full list of accepted format strings can be found in the documentation for the function. strftime does the reverse and is demonstrated in the example below. The functions ISOdate and ISOdatetime provide convenient wrappers over strptime for building date-times from numeric components. Note that the default time zone for ISOdate is GMT and its default time of day is 12:00, whereas ISOdatetime uses the local time zone by default. Finally, the function date returns a character string of the current system date and time.

strftime(mydates, format = "%Y-%m-%d")
## [1] "2017-01-01" "2012-05-05" "2022-08-07"
strftime(mydates, format = "%d %B %y (%A)")
## [1] "01 January 17 (Sunday)" "05 May 12 (Saturday)"  
## [3] "07 August 22 (Sunday)"
ISOdate(2019, 2, 1)
## [1] "2019-02-01 12:00:00 GMT"
ISOdatetime(2019, 2, 1, 13, 23, 17)
## [1] "2019-02-01 13:23:17 IST"
date()
## [1] "Thu Apr 18 19:19:12 2019"

The function difftime computes the difference between two date-times in the specified units. The largest possible unit is “weeks”.

difftime(mydates, strptime("01-Oct-2000", "%d-%b-%Y"))
## Time differences in days
## [1] 5936 4234 7980
difftime(mydates, strptime("01-Oct-2000", "%d-%b-%Y"), units = "weeks")
## Time differences in weeks
## [1]  848.0000  604.8571 1140.0000
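A difftime object carries its units with it, which matters when the result is used in further arithmetic. The sketch below, using two fixed UTC dates chosen for illustration, shows one way to extract a plain number in a chosen unit and to change the units of an existing object.

```r
start <- as.POSIXct("2000-10-01", tz = "UTC")
end   <- as.POSIXct("2017-01-01", tz = "UTC")

d_days <- difftime(end, start, units = "days")
n_days  <- as.numeric(d_days)                   # plain number of days
n_weeks <- as.numeric(d_days, units = "weeks")  # same interval in weeks
n_days
n_weeks

d_hours <- d_days
units(d_hours) <- "hours"                       # convert the units in place
d_hours
```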

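The functions julian, months, quarters and weekdays can be used to extract parts from date-time objects. julian returns the number of days since some origin, which is 1 January 1970 by default. The fractional part in the output below arises because the dates are at midnight in the local time zone (IST), while the origin is in GMT. The lubridate package contains many functions that make it easier to handle dates and times.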

julian(mydates)
## Time differences in days
## [1] 17166.77 15464.77 19210.77
## attr(,"origin")
## [1] "1970-01-01 GMT"
months(mydates)
## [1] "January" "May"     "August"
quarters(mydates)
## [1] "Q1" "Q2" "Q3"
weekdays(mydates)
## [1] "Sunday"   "Saturday" "Sunday"

The next set of functions is useful for manipulating strings. grep and grepl look for a pattern in a character vector. The pattern can be a fixed string or a regular expression. grep returns the indices of the matching elements, or the matching elements themselves if the value argument is TRUE, while grepl returns a logical vector with one entry per element. The function agrep performs approximate matching using the generalized Levenshtein edit distance.

set.seed(123)
bnames <- sample(babynames::babynames$name, 50)
grep("An", bnames, fixed = TRUE, value = TRUE)
## [1] "Angel"
grep("An", bnames, fixed = TRUE, value = FALSE)
## [1] 29
grepl("An", bnames, fixed = TRUE)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
grep("^N", bnames, value = TRUE) # regular expression - starts with N
## [1] "Navil"
agrep("Angie", bnames, fixed = TRUE, value = TRUE)
## [1] "Angel"

The function gsub replaces every match of a pattern with a replacement string. In the example below, Angel is transformed to Pangel using the replacement provided. strsplit splits the elements of a character vector into substrings at the matches of a fixed string or regular expression.

gsub("An", "Pan", bnames[25:35], fixed = TRUE)
##  [1] "Jacques" "Breanne" "Rae"     "Aldin"   "Pangel"  "Curlie"  "Isaiyah"
##  [8] "Idania"  "Jaquawn" "Jillyan" "Ernest"
strsplit(c("A:B:C", "1:2:3"), ":", fixed = TRUE)
## [[1]]
## [1] "A" "B" "C"
## 
## [[2]]
## [1] "1" "2" "3"
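The examples above use fixed strings. As a hedged sketch of the regular-expression side, the following uses made-up date strings to show capture groups in gsub, the difference from sub (which replaces only the first match), and strsplit with a regular expression as the separator.

```r
x <- c("2017-01-01", "2012-05-05")

# Capture groups: reorder year-month-day into day/month/year.
swapped <- gsub("^(\\d{4})-(\\d{2})-(\\d{2})$", "\\3/\\2/\\1", x)
swapped

# sub() replaces only the first match; gsub() replaces all of them.
first_only <- sub("-", "/", x, fixed = TRUE)
all_hits   <- gsub("-", "/", x, fixed = TRUE)
first_only
all_hits

# Split on one or more spaces or commas.
parts <- strsplit("a, b,c   d", "[ ,]+")
parts
```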

chartr translates characters in character vectors. tolower and toupper convert to lower and upper case respectively. nchar counts the number of characters. substr extracts a part of a string using integer start and stop positions. It can also be used on the left-hand side of the assignment operator to replace that part.

chartr(":", "_", c("A:B:C", "1:2:3"))
## [1] "A_B_C" "1_2_3"
tolower("MAD DOG")
## [1] "mad dog"
toupper("radmuzom")
## [1] "RADMUZOM"
nchar(c("Anindya", "Mozumdar"))
## [1] 7 8
substr(c("Hello", "World"), 2, 3)
## [1] "el" "or"
x <- "Hello"
substr(x, 1, 1) <- "C"
x
## [1] "Cello"

paste concatenates strings with a separator. The default separator is a single space. paste0 is a useful variant which is equivalent to paste with sep = "", so there is no separator between the strings being concatenated. The function trimws removes leading or trailing whitespace from the elements of a character vector.

paste("Anindya", "Mozumdar")
## [1] "Anindya Mozumdar"
paste("Anindya", "Mozumdar", sep = ",")
## [1] "Anindya,Mozumdar"
paste0("Anindya", "Mozumdar")
## [1] "AnindyaMozumdar"
x <- c("  Too", "many   ", "    spaces ")
trimws(x)
## [1] "Too"    "many"   "spaces"
trimws(x, "right")
## [1] "  Too"      "many"       "    spaces"
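paste and paste0 are vectorised, and the separate collapse argument joins the resulting elements into a single string. The sketch below uses illustrative names that are not part of the original examples.

```r
first <- c("Ada", "Grace")
last  <- c("Lovelace", "Hopper")

full    <- paste(first, last)                   # element-wise, space-separated
joined  <- paste0(first, "_", last)             # element-wise, no separator
one_str <- paste(first, last, collapse = "; ")  # then collapsed to one string
ids     <- paste0("x", 1:3)                     # shorter arguments are recycled

full
joined
one_str
ids
```

Here sep controls what goes between the arguments within each element, while collapse controls what goes between the elements of the result.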

The stringr package provides a consistent set of string manipulation functions.

The next set of functions is related to factors. Factors are a basic data type in R. Internally, they are just integer vectors with a levels attribute, which is a character vector. Factors can be created using the function factor.

x <- factor(sample(letters[1:3], 10, replace = TRUE))
str(x)
##  Factor w/ 3 levels "a","b","c": 1 2 3 1 2 1 1 3 3 2
attributes(x)
## $levels
## [1] "a" "b" "c"
## 
## $class
## [1] "factor"
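To make the integer-plus-levels representation concrete, the sketch below (with made-up values) looks at the integer codes directly. It also shows why a factor whose levels look like numbers is best converted through as.character: as.integer returns the codes, not the original values.

```r
f <- factor(c("10", "20", "10"))

codes <- as.integer(f)                     # the underlying integer codes
codes
levels(f)[codes]                           # the codes index into the levels

values <- as.numeric(as.character(f))      # recover the original numbers
values
```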

The levels argument can be used to define the levels of a factor variable. The ordered argument specifies whether the levels should be regarded as ordered. If a value is not included in the levels argument, it is converted to NA.

x <- factor(sample(letters[1:3], 10, replace = TRUE),
            levels = c("a", "b", "c"), ordered = TRUE)
x
##  [1] b a b a c b c c c b
## Levels: a < b < c
y <- factor(sample(letters[1:3], 10, replace = TRUE),
            levels = c("a", "b"), ordered = TRUE)
y
##  [1] <NA> b    <NA> a    b    a    b    b    b    a   
## Levels: a < b

The function nlevels returns the number of levels of a factor. The function levels returns the factor levels as a character vector. It can also be used on the left-hand side of the assignment operator to change the levels.

nlevels(x)
## [1] 3
levels(x)
## [1] "a" "b" "c"
levels(x) <- c("p", "q", "r")
x
##  [1] q p q p r q r r r q
## Levels: p < q < r

The function reorder reorders the levels of a factor based on the values of a second variable, using a function applied to the second variable. In the example below, we create a factor x which has three levels: a, b and c. The 2nd argument n will be a random normal vector of 10 values. We then reorder the levels of x depending on the sum of the values in n which correspond to each level of x.

set.seed(123)
x <- factor(sample(letters[1:3], 10, replace = TRUE),
            levels = c("a", "b", "c"), ordered = TRUE)
x
##  [1] a c b c c a b c b b
## Levels: a < b < c
n <- rnorm(10)
n
##  [1]  1.7150650  0.4609162 -1.2650612 -0.6868529 -0.4456620  1.2240818
##  [7]  0.3598138  0.4007715  0.1106827 -0.5558411
y <- reorder(x, n, FUN = sum)
y
##  [1] a c b c c a b c b b
## attr(,"scores")
##          a          b          c 
##  2.9391468 -1.3504058 -0.2708272 
## Levels: b < c < a

The function relevel reorders the levels of an unordered factor so that a reference level comes first and the remaining levels are moved down. Note that it can only be applied to an unordered factor.

set.seed(123)
x <- factor(sample(letters[1:3], 10, replace = TRUE),
            levels = c("a", "b", "c"))
x
##  [1] a c b c c a b c b b
## Levels: a b c
x <- relevel(x, "b")
x
##  [1] a c b c c a b c b b
## Levels: b a c

cut converts a numeric variable into a factor by dividing its range into intervals and coding each value according to the interval in which it falls. By default, each interval is open on the left and closed on the right. Setting include.lowest = TRUE additionally closes the lowest interval on the left.

x <- rnorm(10)
x
##  [1]  1.7150650  0.4609162 -1.2650612 -0.6868529 -0.4456620  1.2240818
##  [7]  0.3598138  0.4007715  0.1106827 -0.5558411
cut(x, breaks = c(-Inf, 0.2, 0.5, 0.8, Inf))
##  [1] (0.8, Inf] (0.2,0.5]  (-Inf,0.2] (-Inf,0.2] (-Inf,0.2] (0.8, Inf]
##  [7] (0.2,0.5]  (0.2,0.5]  (-Inf,0.2] (-Inf,0.2]
## Levels: (-Inf,0.2] (0.2,0.5] (0.5,0.8] (0.8, Inf]
cut(x, breaks = c(-Inf, 0.2, 0.5, 0.8, Inf), labels = c("A", "B", "C", "D"))
##  [1] D B A A A D B B A A
## Levels: A B C D
cut(x, breaks = c(-Inf, 0.2, 0.5, 0.8, Inf), labels = c("A", "B", "C", "D"),
    include.lowest = TRUE)
##  [1] D B A A A D B B A A
## Levels: A B C D
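Because none of the random values above sits exactly on a break, the effect of the interval endpoints is not visible there. The following sketch uses values chosen to lie on the breaks (an illustrative setup, not from the original example) to show how the right and include.lowest arguments change the result.

```r
v <- c(0.2, 0.5, 0.8)
b <- c(0, 0.2, 0.5, 0.8)

# Default (right = TRUE): intervals are (a, b], so 0.2 falls in the first one.
r_default <- cut(v, breaks = b)
r_default

# right = FALSE: intervals are [a, b), so 0.8 lies outside and becomes NA.
r_left <- cut(v, breaks = b, right = FALSE)
r_left

# include.lowest = TRUE closes the lowest interval, so 0 is no longer NA.
r_lowest <- cut(c(0, v), breaks = b, include.lowest = TRUE)
r_lowest
```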

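findInterval finds the index of the interval containing each element of a numeric vector. In the example below, we take 10 random integers from -100 to 100 and find which interval they lie in. The intervals are determined by the 2nd argument and are, by default, closed on the left and open on the right: [-Inf, 0), [0, 5) and so on. With left.open = TRUE they become (-Inf, 0], (0, 5] and so on, which is why the value 80 moves from interval 5 to interval 4.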

x <- sample(-100:100, 10)
x
##  [1]  93  80  37  57 -96  -7  47 -59 -39 -56
findInterval(x, c(-Inf, 0, 5, 10, 80, Inf))
##  [1] 5 5 4 4 1 1 4 1 1 1
findInterval(x, c(-Inf, 0, 5, 10, 80, Inf), left.open = TRUE)
##  [1] 5 4 4 4 1 1 4 1 1 1
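Since findInterval returns plain integer indices, one common use is to index into a vector of labels. The sketch below assumes a hypothetical set of labels for the same break points and a few hand-picked values.

```r
breaks <- c(-Inf, 0, 5, 10, 80, Inf)
labels <- c("negative", "0 to <5", "5 to <10", "10 to <80", "80 and above")

x   <- c(-7, 0, 5, 37, 80)
idx <- findInterval(x, breaks)   # index of the interval for each value
idx
labels[idx]                      # look up the matching label
```

Unlike cut, this returns a character vector rather than a factor, which can be convenient when only the label is needed.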

interaction is used to find the interaction of two or more factors. By default, the . character is used to construct the new level names. It can be modified using the sep argument to interaction.

f1 <- factor(sample(letters[1:2], 10, replace = TRUE),
             levels = c("a", "b"))
f2 <- factor(sample(letters[25:26], 10, replace = TRUE),
             levels = c("y", "z"))
interaction(f1, f2)
##  [1] a.y a.y a.z a.y a.z a.y a.y a.z a.z b.y
## Levels: a.y b.y a.z b.z
interaction(f1, f2, sep = "|")
##  [1] a|y a|y a|z a|y a|z a|y a|y a|z a|z b|y
## Levels: a|y b|y a|z b|z

The next set of functions described in this chapter in the book pertain to statistics. We will be covering them in a separate article. In this article, we continue looking at functions which relate to working with R.

The function ls returns the names of the objects in a specified environment as a character vector. In the example below, it lists all the variables and functions we have defined till now. Inside a function, it returns the names of the function’s local variables.

ls()
## [1] "bnames"    "f1"        "f2"        "mydates"   "mydates_c" "n"        
## [7] "x"         "y"
f <- function(x, y) {
  print(ls())
  (x + y) ^ (x - y)
}
f(2, 3)
## [1] "x" "y"
## [1] 0.2

The exists function checks whether an object with the given name can be found in an environment. rm can be used to remove objects. Note that after the call to rm, the object f1 (a factor), which was previously displayed by ls, no longer exists.

exists("f1")
## [1] TRUE
rm("f1")
ls()
## [1] "bnames"    "f"         "f2"        "mydates"   "mydates_c" "n"        
## [7] "x"         "y"
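ls, exists and rm all accept an environment argument, so they are not limited to the global workspace. As a minimal sketch, the example below works in a fresh environment created with new.env so that it does not touch any of the objects defined above.

```r
e <- new.env()
assign("a", 1, envir = e)

before <- ls(e)                                    # names defined in e
found  <- exists("a", envir = e, inherits = FALSE) # look only in e itself
rm("a", envir = e)
after  <- ls(e)

before
found
after
```

Setting inherits = FALSE restricts the search to the given environment rather than also searching its enclosing environments.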

getwd retrieves the current working directory, while setwd sets it. The function quit, or its alias q, terminates the current R session. source reads and evaluates R expressions from a named file, URL or connection. It can be used to define functions which are stored in an external file when you don’t want to copy and paste them into your current R session.

install.packages is used to install an R package. library and require are used to load and attach packages. require is primarily for use inside functions; it gives a warning and returns FALSE if the package does not exist. library will throw an error if it cannot find the required package. To remove a package from the current search list, use detach; setting the argument unload to TRUE also attempts to unload its namespace.

library(ggplot2)
detach("package:ggplot2", unload = TRUE)
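The difference between library and require is easiest to see with a package that is not installed. The sketch below uses a made-up package name, which is assumed not to exist on your system.

```r
pkg <- "notARealPackage123"   # hypothetical name, assumed not to be installed

# require() returns FALSE (with a warning) when the package is missing.
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
ok

# library() signals an error instead, which tryCatch() can intercept.
res <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "not available"
)
res
```

The character.only = TRUE argument is needed because the package name is held in a variable rather than typed directly.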

apropos searches for objects on the search list by partial name, and find returns where a particular object can be found. RSiteSearch searches for key words or phrases in online R documentation and shows the results in a web browser.

apropos("xy")
## [1] "plot.xy"      "sortedXyData" "xy.coords"    "xyinch"      
## [5] "xyTable"      "xyz.coords"
find("xyTable")
## [1] "package:grDevices"

citation tells you how to cite R and R packages. demo() provides you with a list of topics for which demonstration scripts are available; run demo on a particular topic to view the demonstration. example allows you to run the examples in a particular help topic. vignette is used to list available vignettes or view a specific one.

citation(package = "ggplot2")
## 
## To cite ggplot2 in publications, please use:
## 
##   H. Wickham. ggplot2: Elegant Graphics for Data Analysis.
##   Springer-Verlag New York, 2016.
## 
## A BibTeX entry for LaTeX users is
## 
##   @Book{,
##     author = {Hadley Wickham},
##     title = {ggplot2: Elegant Graphics for Data Analysis},
##     publisher = {Springer-Verlag New York},
##     year = {2016},
##     isbn = {978-3-319-24277-4},
##     url = {https://ggplot2.tidyverse.org},
##   }

There are a set of functions which are primarily used to handle exceptions and debug R code. We won’t describe them here – it is recommended that you read the “Exceptions and debugging” chapter in the “Advanced R” book.

Next we look at some of the input/output functions. print is a generic function which prints its argument and returns it invisibly. cat concatenates the representations of the objects which are passed to it and outputs them.

f <- function(x) {
  print(x)
}
y <- f(2)
## [1] 2
y
## [1] 2
cat(c(3.3, 7.1), c("a", "b", "c"))
## 3.3 7.1 a b c

message and warning generate diagnostic and warning messages respectively. Note that the string “Warning:” is automatically added to the output of warning.

message("I am a message")
## I am a message
warning("I am a warning")
## Warning: I am a warning
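Both kinds of diagnostics can be silenced around a call when they are not wanted, using suppressMessages and suppressWarnings. The function noisy_abs below is a hypothetical helper written only for this illustration.

```r
# Hypothetical helper that emits a message and, for negative input, a warning.
noisy_abs <- function(x) {
  message("computing absolute value")
  if (x < 0) warning("negative input")
  abs(x)
}

# The value is still returned, but neither diagnostic is shown.
result <- suppressWarnings(suppressMessages(noisy_abs(-2)))
result
```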

dput creates a text representation of an R object and writes it to the console or to a file. It is especially useful if you want to share a small R object in an online forum while asking for help; when written to a file, the output can be read back by dget.

df <- data.frame(x = 1, y = 2)
z <- dput(df)
## structure(list(x = 1, y = 2), class = "data.frame", row.names = c(NA, 
## -1L))
z
##   x y
## 1 1 2
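The example above prints the representation to the console. As a sketch of the round trip through a file, the following writes a small data frame to a temporary file (the path comes from tempfile, so nothing is left in the working directory) and reads it back with dget.

```r
df <- data.frame(x = 1:2, y = c("a", "b"), stringsAsFactors = FALSE)

path <- tempfile(fileext = ".R")
dput(df, file = path)     # write the text representation to the file
df2 <- dget(path)         # parse and evaluate it again

all.equal(df, df2)        # compare the original with the restored copy
unlink(path)              # remove the temporary file
```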

format formats an R object for pretty printing. sprintf is a wrapper for the C function of the same name; it returns a character vector containing a formatted combination of text and variable values.

x <- rnorm(5) * 1000
format(x, nsmall = 1)
## [1] " 426.4642" "-295.0715" " 895.1257" " 878.1335" " 821.5811"
format(x, scientific = TRUE)
## [1] " 4.264642e+02" "-2.950715e+02" " 8.951257e+02" " 8.781335e+02"
## [5] " 8.215811e+02"
sprintf("N: %6.2g", x)
## [1] "N: 4.3e+02" "N: -3e+02"  "N:  9e+02"  "N: 8.8e+02" "N: 8.2e+02"
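Because the values above are random, it can be easier to see what the format specifications do with fixed inputs. The sketch below uses a few hand-picked numbers; the conversion specifications follow the usual C conventions.

```r
s_fixed  <- sprintf("%5.1f", 3.14159)           # width 5, 1 decimal place
s_padded <- sprintf("%05d", 42L)                # zero-padded integer
s_mixed  <- sprintf("%s has %d items", "cart", 3L)
s_fixed
s_padded
s_mixed

f_small <- format(2, nsmall = 2)                # at least 2 decimal places
f_big   <- format(1234567, big.mark = ",")      # thousands separator
f_small
f_big
```

Note that %d requires an integer-valued argument, which is why the values are written as 42L and 3L.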

sink and capture.output are used to send R output to a character string, file or connection. There are a number of functions for reading and writing external data sources. We won’t describe them in detail here. The packages readr, haven and foreign can be used to read data in a wide variety of formats. The functions with the file. prefix (file.path, file.copy, file.create, file.remove, file.rename, file.exists and file.info) provide an interface to the file system from within R code. dir.create can be used to create a directory in the file system.
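As a minimal sketch of these functions working together, the example below captures printed output, diverts output to a file with sink, and then uses a few of the file. functions and dir.create. The file and directory names are hypothetical, and everything is created inside tempdir() so that the working directory is left untouched.

```r
# capture.output() returns printed output as a character vector.
out <- capture.output(print(1:3))
out

# sink() diverts console output to a file until sink() is called again.
log_file <- file.path(tempdir(), "sink-demo.txt")   # hypothetical file name
sink(log_file)
print("sent to the file")
sink()
logged <- readLines(log_file)
logged

# A few file.* helpers and dir.create(), all inside the temporary directory.
demo_dir <- file.path(tempdir(), "demo-dir")         # hypothetical directory
dir.create(demo_dir, showWarnings = FALSE)
copied <- file.copy(log_file, file.path(demo_dir, "copy.txt"), overwrite = TRUE)
copy_exists <- file.exists(file.path(demo_dir, "copy.txt"))
file.info(log_file)$size

# Clean up the temporary file and directory.
file.remove(log_file)
unlink(demo_dir, recursive = TRUE)
```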
