Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
Logic will get you from A to B. Imagination will take you everywhere. (Einstein)
R can already take you everywhere. With it we can learn about the minutest particles and the largest galaxies. So, to celebrate the release of R 4.3 (“Already Tomorrow”, on April 21st, 2023), let’s reverse Einstein’s quote and take you from A to B with logic.
Two modes of comparison
In R, almost all of your data will be stored as a vector. Even a single value is still considered to be a vector by R. This is unlike many other languages, and getting comfortable “thinking for the whole vector” pays off in several ways: your code will be more concise, and it may even run quicker than an iterative approach to the same problem.
1:10 # A vector of integers
## [1] 1 2 3 4 5 6 7 8 9 10
is.vector(1:10)
## [1] TRUE
sum(1:10) # A vectorised computation
## [1] 55
integer(0) # An empty vector of integers
## integer(0)
1L # A single integer, stored as a vector
## [1] 1
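To make the contrast with an iterative approach concrete, here is a minimal sketch that adds one to every element of a vector in two ways. The variable names are purely illustrative.

```r
x <- 1:10

# Vectorised: the addition is applied to every element at once
y_vectorised <- x + 1

# Iterative equivalent: call a function on each element in turn
y_iterative <- sapply(x, function(z) z + 1)

# Both approaches produce the same values
identical(y_vectorised, y_iterative)
```

The vectorised form is shorter and leaves the looping to R itself.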
But the conciseness that R’s vectorised operations provide may trip you up unexpectedly. A typical case is when you think you are working with a scalar (a length-1 vector) but you are actually working with an empty or multivalued vector.
The logical values in R (TRUE, FALSE) are a little bit special. A vector of logical values might be used to represent some quality in a dataset, for example, to select those rows of a dataset that are to be kept in dplyr::filter().
library("tidyverse")
head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

head(diamonds$cut == "Ideal") # A logical vector
## [1]  TRUE FALSE FALSE FALSE FALSE FALSE

filter(diamonds, cut == "Ideal") # Subsetting a data-frame using a logical vector
## # A tibble: 21,551 × 10
##    carat cut   color clarity depth table price     x     y     z
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.23 Ideal J     VS1      62.8    56   340  3.93  3.9   2.46
##  3  0.31 Ideal J     SI2      62.2    54   344  4.35  4.37  2.71
##  4  0.3  Ideal I     SI2      62      54   348  4.31  4.34  2.68
##  5  0.33 Ideal I     SI2      61.8    55   403  4.49  4.51  2.78
##  6  0.33 Ideal I     SI2      61.2    56   403  4.49  4.5   2.75
##  7  0.33 Ideal J     SI1      61.1    56   403  4.49  4.55  2.76
##  8  0.23 Ideal G     VS1      61.9    54   404  3.93  3.95  2.44
##  9  0.32 Ideal I     SI1      60.9    55   404  4.45  4.48  2.72
## 10  0.3  Ideal I     SI2      61      59   405  4.3   4.33  2.63
## # ℹ 21,541 more rows

head(diamonds$carat > 0.3)
## [1] FALSE FALSE FALSE FALSE  TRUE FALSE

filter(diamonds, carat > 0.3)
## # A tibble: 49,737 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  2  0.31 Ideal     J     SI2      62.2    54   344  4.35  4.37  2.71
##  3  0.32 Premium   E     I1       60.9    58   345  4.38  4.42  2.68
##  4  0.31 Very Good J     SI1      59.4    62   353  4.39  4.43  2.62
##  5  0.31 Very Good J     SI1      58.1    62   353  4.44  4.47  2.59
##  6  0.31 Good      H     SI1      64      54   402  4.29  4.31  2.75
##  7  0.33 Ideal     I     SI2      61.8    55   403  4.49  4.51  2.78
##  8  0.33 Ideal     I     SI2      61.2    56   403  4.49  4.5   2.75
##  9  0.33 Ideal     J     SI1      61.1    56   403  4.49  4.55  2.76
## 10  0.32 Good      H     SI2      63.1    56   403  4.34  4.37  2.75
## # ℹ 49,727 more rows
But there are places where you use logical values, where it would make no sense (and could potentially be dangerous) to use a multivalued logical vector. We use if (...) {} and while (...) {} statements for flow control in R. The conditional expression in these statements (the ... in if (...) {}) should always evaluate to a logical scalar: either TRUE or FALSE.
When R 4.2.0 was released, stricter guarantees were placed on the length of these conditional expressions. We mentioned this in an earlier blog post. So in addition to getting an error when the conditional is empty, we now get an error when the conditional is too long:
# Comparison with an empty logical vector:
if (logical(0)) {
  print("I didn't expect to get here")
}
## Error in if (logical(0)) {: argument is of length zero

# Comparison with an over-sized logical vector:
numbers <- c(1, 3, 5, 6)
print(numbers %% 2 == 0) # Determine if even
## [1] FALSE FALSE FALSE  TRUE
if (numbers %% 2 == 0) {
  print("Should we ever be allowed to get here?")
}
## Error in if (numbers%%2 == 0) {: the condition has length > 1
Previously, R would use the first entry in a non-scalar conditional vector to decide whether to enter the if or while block.
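If a condition really does depend on a whole vector, one way to stay within the stricter rules is to collapse the vector to a single value before it reaches if. The sketch below reuses the numbers vector from the example above and uses the base functions any() and all(); which of the two is appropriate depends on what the code is meant to decide.

```r
numbers <- c(1, 3, 5, 6)
is_even <- numbers %% 2 == 0

# Collapse the logical vector to a scalar before using it as a condition
if (any(is_even)) {
  message("At least one number is even")
}

if (!all(is_even)) {
  message("Not every number is even")
}
```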
Strictly comparing
So, we have two main ways of using a logical vector, one of which now requires that the vector is a scalar.
Another place where it is really important to know the length of your vectors is when combining logical values together.
R has a number of ways to combine logical values together that build on the AND and OR operations in Boolean algebra:
- all and any for combining the values in a single vector (are all of the values TRUE; are any of the values TRUE)
- &, && (representing “AND”), |, and || (for “OR”) for combining two different vectors
is_april = TRUE
is_r_released = TRUE
is_already_tomorrow = FALSE

# Logical AND within a single vector
all(c(is_april, is_r_released, is_already_tomorrow))
## [1] FALSE

# Logical OR within a single vector
any(c(is_april, is_r_released, is_already_tomorrow))
## [1] TRUE

# Logical AND between vectors
is_april & is_r_released
## [1] TRUE
is_april && is_already_tomorrow
## [1] FALSE

# Logical OR between vectors
is_april | is_r_released
## [1] TRUE
is_april || is_already_tomorrow
## [1] TRUE
For scalars, there’s no difference between the single-character operators (&, |) and the two-character operators (&&, ||). So why have a pair of operators for each concept?
- && and || are intended for use solely with scalars; they return a single logical value.
- & and | work with multivalued vectors; they return a vector whose length matches their input arguments.
Since they always return a scalar logical, you should use && and || in your if/while conditional expressions (when needed). If an & or | is used, you may end up with a non-scalar vector inside if (...) {} and R will throw an error.
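A further property of && and || that helps in conditionals is that they evaluate left to right and stop as soon as the result is known. The sketch below uses this to guard a comparison behind a length check; is_single_positive() is a hypothetical helper written for this illustration, not part of R.

```r
# Hypothetical helper: is `x` a single positive number?
is_single_positive <- function(x) {
  # `&&` stops at the first FALSE, so `x > 0` is only evaluated
  # once we know that `x` is a length-1 numeric vector
  is.numeric(x) && length(x) == 1 && x > 0
}

is_single_positive(3)          # a scalar that passes every check
is_single_positive(c(1, 2))    # rejected by the length check
is_single_positive(numeric(0)) # rejected by the length check
is_single_positive("a")        # rejected by the type check
```

Because the length check comes before the comparison, a multivalued x never reaches `x > 0`, so the result is always a scalar that is safe to use inside if (...) {}.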
To illustrate the difference between the scalar operators and vectorised operators, here’s an example:
x = c(TRUE, TRUE, FALSE, FALSE)
y = c(TRUE, FALSE, TRUE, FALSE)
The vectorised operators apply AND/OR on matched pairs of elements:
x & y # c(x[1] && y[1], x[2] && y[2], ...)
## [1]  TRUE FALSE FALSE FALSE
x | y # c(x[1] || y[1], x[2] || y[2], ...)
## [1]  TRUE  TRUE  TRUE FALSE
In R 4.2.0, a warning is thrown when a non-scalar input is passed to the scalar operators, but a scalar logical is still returned (here, the result of x[1] && y[1]). In earlier versions of R, no warning was printed.
# R 4.2
x && y
[1] TRUE
Warning messages:
1: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'
2: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'
This could lead to hidden bugs. For example, if you used this code in an if conditional, a warning would be printed when a non-scalar vector was used but the code would continue happily:
# R 4.2
if (x && y) {
  print("The world can't end today...")
}
[1] "The world can't end today..."
Warning messages:
1: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'
2: In x && y : 'length(x) = 4 > 1' in coercion to 'logical(1)'
In R 4.3.0, this warning has been elevated to an error and no value is returned:
# R 4.3
x && y
Error in x && y : 'length = 4' in coercion to 'logical(1)'
This stricter version of the scalar comparison operators will help catch those bugs where you didn’t realise a logical variable could contain more than one entry.
To check whether the strict comparison operators will affect your existing code, before upgrading to R 4.3.0, you can set an environment variable before running it:
# In R:
Sys.setenv("_R_CHECK_LENGTH_1_LOGIC2_" = TRUE)
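To see which of the behaviours described above applies to the R session you are running, a small probe along the following lines may help. It is only a sketch: the labels "silent", "warning" and "error" are names chosen for this illustration.

```r
x <- c(TRUE, FALSE)

# Probe how the running R session treats a length-2 left-hand side of `&&`.
# tryCatch() turns each outcome into a label instead of stopping the script.
behaviour <- tryCatch(
  {
    x && TRUE
    "silent"
  },
  warning = function(w) "warning",
  error = function(e) "error"
)

behaviour
```

A result of "error" means the strict behaviour is active, so code that passes non-scalar values to && or || will stop rather than carry on.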
A more logical flow
Where else do we work with scalars in R? Many functions expect certain arguments to be scalars. For example, the seq() function complains with non-scalar arguments:
seq(from = 1:3, to = 4)
## Error in seq.default(from = 1:3, to = 4): 'from' must be of length 1
seq(from = 1, to = 4:5)
## Error in seq.default(from = 1, to = 4:5): 'to' must be of length 1
There are several other places where R will throw an error if we provide a value that is of the wrong size:
a_data_frame[[column_index]]  # column_index must be a scalar
a_matrix[rows, cols] = value  # value must match the size of the replaced element(s)
There are other places where R will throw a warning, and try to gracefully handle values that are of an unexpected size:
# R's recycling rules are used to match the size of the vector input
c(1, 3, 5) * c(2, 3) # c(1 * 2, 3 * 3, 5 * 2)
## Warning in c(1, 3, 5) * c(2, 3): longer object length is not a multiple of
## shorter object length
## [1]  2  9 10

# The smaller vector was recycled to match the size of the larger
# c(1, 3, 5) * c(2, 3, 2)
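If the recycling in the example above is what you actually intend, it can be spelled out so that the two vectors have matching lengths before they are multiplied. This is a minimal sketch using the base function rep_len(); the variable names are illustrative.

```r
a <- c(1, 3, 5)
b <- c(2, 3)

# Recycle `b` explicitly to the length of `a`, giving c(2, 3, 2)
b_recycled <- rep_len(b, length(a))

# Same values as a * b, but the lengths now match so no warning is raised
a * b_recycled
```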
An interesting case is the : operator, which, like seq(), can be used to create sequences of numbers.
3:5
## [1] 3 4 5
If we provide a non-scalar on either side of the operator, R will warn us:
# R 4.2
(1:2) : 5
[1] 1 2 3 4 5
Warning message:
In (1:2):5 : numerical expression has 2 elements: only the first used

# R 4.2
1 : (4:6)
[1] 1 2 3 4
Warning message:
In 1:(4:6) : numerical expression has 3 elements: only the first used
Now, because the output should be a single sequence, R has to pick a specific value for the start- and the end-point of that sequence from the arguments provided. It uses the first entry in each argument. So, (1:2) : 5 is equivalent to 1:5; and 1 : (4:6) is equivalent to 1:4.
If your code is providing non-scalar arguments to :, there may be a bug in your code or the packages that it depends upon. R 4.3.0 has introduced a stricter setting, which will catch the use of non-scalar values when constructing sequences with the : operator.
Much like with the stricter logic comparisons described above, the R developers have introduced this as an optional setting. After setting the environment variable _R_CHECK_LENGTH_COLON_ to a true value, R will throw an error whenever an oversized argument is passed into a:b.
# R 4.3
# Without the check enabled:
(1:2) : 5
[1] 1 2 3 4 5
Warning message:
In (1:2):5 : numerical expression has 2 elements: only the first used

# With the strict check enabled:
Sys.setenv("_R_CHECK_LENGTH_COLON_" = TRUE)
(1:2) : 5
Error in (1:2):5 : numerical expression has length > 1
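One way a non-scalar can slip into a:b is by passing a whole dimension vector where a single dimension was meant. The sketch below assumes that scenario, which is not taken from the examples above, and shows a form that avoids it with the base functions nrow() and seq_len().

```r
m <- matrix(0, nrow = 3, ncol = 2)

# dim(m) has two elements, so `1:dim(m)` would hand a length-2 value to `:`
length(dim(m))

# Asking for the one dimension that is wanted keeps the end-point a scalar
row_ids <- seq_len(nrow(m))
row_ids
```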
And finally: Extracting from a pipe
Have you started using the native pipe yet? In our blog post to celebrate the release of R 4.2.0, we showed this example:
mtcars |> lm(mpg ~ disp, data = _)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Coefficients:
## (Intercept)         disp
##    29.59985     -0.04122
Here the pipe |> passes the value on its left-hand side into the function on the right. By default that value will be used as the first argument to the right-hand function. But when an underscore is present, the piped-in value will replace that underscore. So the above is equivalent to:
lm(mpg ~ disp, data = mtcars)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Coefficients:
## (Intercept)         disp
##    29.59985     -0.04122
What if you want to extract values that are output by a pipeline, for example, the coef entry from the linear model above? One way would be to store the results in a variable and extract the coef from that:
model = mtcars |> lm(mpg ~ disp, data = _)
model$coef
## (Intercept)        disp
## 29.59985476 -0.04121512
Or you could wrap the pipeline in parentheses:
(
  mtcars |> lm(mpg ~ disp, data = _)
)$coef
## (Intercept)        disp
## 29.59985476 -0.04121512
R 4.3.0 provides a much neater solution: the underscore _ can now be used to refer to the final value from a pipeline, so an entry can be extracted from it directly:
mtcars |> lm(mpg ~ disp, data = _) |> _$coef
(Intercept)        disp
29.59985476 -0.04121512
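For this particular extraction there is also a function-based alternative that does not need the new syntax: the coefficients of a fitted model can be obtained with the accessor coef(), which can sit at the end of the pipeline like any other function. The sketch below assumes R 4.2.0 or later for the data = _ placeholder; fit_coefs is an illustrative name.

```r
# Fit the same model as above and extract its coefficients with coef()
fit_coefs <- mtcars |>
  lm(mpg ~ disp, data = _) |>
  coef()

fit_coefs
```

The $coef form in the earlier examples relies on partial matching of the model's coefficients entry, whereas coef() asks for the coefficients by name.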
Trying the latest version out for yourself
To take away the pain of installing the latest development version of R, you can use Docker. To use the devel version of R, you can use the following commands:
docker pull rstudio/r-base:devel-jammy
docker run --rm -it rstudio/r-base:devel-jammy
See the r-docker project for more details.
See also
Do you have nostalgia for previous versions of R? If so, check out our previous blog posts:
For updates and revisions to this article, see the original post