Booleans & NAs
Missing values are inevitable in data science, and handling them is a constant issue. Boolean logic treats them differently from a lot of other data types: depending on the operator and on the value of the other argument, a missing value may or may not carry through to the result. Whether this is useful or not depends on the scenario, but the behavior is something to keep in mind.
Most programming languages that I’m aware of support short-circuit evaluation, which lets certain Boolean expressions be evaluated from the first argument alone. Specifically, an OR will always be true if the first argument is true, and an AND will always be false if the first argument is false. Other operators like XOR can’t be short-circuited, because both values are needed to determine the output.
If the second argument is a missing value, you may expect the same rules to hold, and you’d be right. An AND whose first argument is true, or an OR whose first argument is false, can’t be short-circuited, so it isn’t surprising that those two cases come out missing.
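As a minimal sketch of short-circuiting, the Python snippet below uses a hypothetical helper called noisy (not part of any library) that records each time an operand is evaluated. Python's "or" and "and" skip the right-hand operand when the left-hand one already decides the result, while XOR (written here as != on two Booleans) always evaluates both.

```python
# Record which operands actually get evaluated.
evaluated = []

def noisy(label, value):
    """Hypothetical helper: log the label, then return the value."""
    evaluated.append(label)
    return value

# OR: a true first operand decides the result, so "b" is never evaluated.
or_result = noisy("a", True) or noisy("b", False)
or_trace = list(evaluated)

evaluated.clear()
# AND: a false first operand decides the result, so "d" is never evaluated.
and_result = noisy("c", False) and noisy("d", True)
and_trace = list(evaluated)

evaluated.clear()
# XOR: both operands are needed whatever the first one is.
xor_result = noisy("e", True) != noisy("f", False)
xor_trace = list(evaluated)

print(or_result, or_trace)
print(and_result, and_trace)
print(xor_result, xor_trace)
```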
# in Julia; R and pandas.NA in Python behave the same
> true & missing
missing
> false & missing
false
> true | missing
true
> false | missing
missing
If the missing data is the first argument, however, what should happen is less clear. There’s an argument to be made for Boolean expressions short-circuiting to “missing” or “NA” in that case, but it turns out that’s not what happens. Instead, the result is exactly the same as when the missing value is second:
# in R
> NA & TRUE
NA
> NA & FALSE
FALSE
> NA | TRUE
TRUE
> NA | FALSE
NA
R’s documentation describes this like so:
NA is a valid logical object. Where a component of x or y is NA, the result will be NA if the outcome is ambiguous. In other words NA & TRUE evaluates to NA, but NA & FALSE evaluates to FALSE.
Julia and pandas.NA operate in the same way, so presumably they use the same reasoning.
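The rule quoted from R's documentation can be sketched in a few lines of Python. This is only an illustration of the reasoning, not how R, Julia, or pandas actually implement it: None stands in for a missing value, and the hypothetical helper three_valued tries both possible values for each missing operand and returns None only when the outcomes disagree.

```python
import operator
from itertools import product

def three_valued(op, x, y):
    """Apply a two-argument Boolean op, treating None as 'missing'.

    Hypothetical helper for illustration: each missing operand is
    replaced by both True and False; if every combination gives the
    same answer, that answer is returned, otherwise None.
    """
    xs = (True, False) if x is None else (x,)
    ys = (True, False) if y is None else (y,)
    outcomes = {op(a, b) for a, b in product(xs, ys)}
    return outcomes.pop() if len(outcomes) == 1 else None

NA = None  # stand-in for R's NA, Julia's missing, or pandas.NA

and_table = [three_valued(operator.and_, NA, v) for v in (True, False)]
or_table = [three_valued(operator.or_, NA, v) for v in (True, False)]
xor_table = [three_valued(operator.xor, NA, v) for v in (True, False)]

print(and_table)  # NA & TRUE, NA & FALSE
print(or_table)   # NA | TRUE, NA | FALSE
print(xor_table)  # NA xor TRUE, NA xor FALSE
```

Because the helper never looks at which side the missing value is on, swapping the arguments gives the same answer, which is consistent with the order not mattering in the outputs above. XOR comes out missing in both cases, since it is always ambiguous when one value is unknown.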
A Note Regarding Pandas
Above, I always referred to Python’s missing data type as pandas.NA for one particular reason. In the past, Python’s pandas would use numpy.nan (also spelled numpy.NaN before NumPy 2.0) to fill in missing values. Those won’t work in Boolean operations: introducing one coerces the Boolean data to floats, which don’t support AND and OR.
> import numpy as np
> a = np.array([True, np.nan])
> a
array([ 1., nan])
> a.dtype
dtype('float64')
> a & a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
In version 1.0.0, however, pandas.NA was added, which can be used in conjunction with boolean operators:
> import pandas as pd
> pd.NA & True
<NA>
> pd.NA & False
False
> pd.NA | True
True
> pd.NA | False
<NA>
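The same logic applies element-wise to whole arrays, provided they use pandas' nullable "boolean" dtype rather than a float or object array. A small sketch, assuming pandas 1.0.0 or later is installed; the variable names are only illustrative:

```python
import pandas as pd

# Nullable Boolean arrays; pd.NA marks the missing entries.
left = pd.array([True, False, pd.NA, pd.NA], dtype="boolean")
right = pd.array([pd.NA, pd.NA, True, False], dtype="boolean")

both = left & right    # element-wise AND
either = left | right  # element-wise OR

print(both)
print(either)
```

Each position reproduces one of the scalar cases shown earlier: the AND is missing only where the known operand is true, and the OR is missing only where the known operand is false.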