Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
I have something of a fondness for ridiculous variable names, so it’s useful to be able to check whether my latest concoction is legitimate. More so if it is automatically generated.
Not having an is_valid_variable_name
function is one of those odd omissions from R, and the assign
function doesn’t check validity.
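To see what "doesn’t check validity" means in practice, here is a minimal sketch. The environment `e` and the silly name are made up for illustration; the point is only that assign will create a variable whose name you can then reach only via get or backticks.

```r
# A scratch environment, so the example does not clutter the workspace
e <- new.env()

# assign() happily accepts a name that is not syntactically valid
assign("1 bad name!", 42, envir = e)

exists("1 bad name!", envir = e)   # the variable really is there
get("1 bad name!", envir = e)      # retrieve it by string
e$`1 bad name!`                    # or with backticks
```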
To recap, there are a few rules on what makes a valid variable name. From ?name
Names are limited to 10,000 bytes (and were to 256 bytes in versions of R before 2.13.0).
The logic for this is pretty easy to deal with, but before I come to that, a note on the structure of is*
type functions. In scalary languages (C and its descendants), these functions seem to be standardised along the lines of
is_something <- function(x) {
  if(!some_condition) return(FALSE)
  if(!some_other_condition) return(FALSE)
  #etc.
  return(TRUE)
}
The advantage of this is that as soon as a condition fails, the function returns, so the function can be fast. In a vectory language like R, things aren’t quite as clean since different elements can fail on different conditions. The nearest equivalent function structure that I’ve come up with is something like:
is_something <- function(x) {
  #start by assuming every element passes
  ok <- rep.int(TRUE, length(x))
  ok[!some_condition] <- FALSE
  ok[!some_other_condition] <- FALSE
  #etc.
  ok
}
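As a concrete (and entirely hypothetical) instance of that vectorised pattern, here is a sketch of a predicate that checks whether each string is a short lowercase word. The function name and the `max_length` argument are invented for this example; the shape is what matters: start with everything TRUE, then knock out the elements that fail each condition.

```r
is_lowercase_word <- function(x, max_length = 10L) {
  # assume every element is fine until a condition says otherwise
  ok <- rep.int(TRUE, length(x))
  # condition 1: too long
  ok[nchar(x) > max_length] <- FALSE
  # condition 2: anything other than one or more lowercase ASCII letters
  ok[!grepl("^[a-z]+$", x)] <- FALSE
  ok
}

is_lowercase_word(c("foo", "Foo", "averyveryverylongword", ""))
# TRUE FALSE FALSE FALSE
```

Each element can fail for a different reason ("Foo" on the second condition, the long word on the first), which is exactly what the early-return style can’t express.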
So, back to our is_valid_variable_name
function. The first condition is easy to implement.
is_valid_variable_name <- function(x) {
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #More logic still to come
  ok
}
Now it gets trickier. In ?make.names
we have
A syntactically valid name consists of letters, numbers and the dot or underline characters and starts with a letter or the dot not followed by a number. Names such as “.2way” are not valid, and neither are the reserved words.
When you read this, your first thought should be “regular expressions will save the day”. The trouble is, regular expressions that are that complicated are hard to write and hard to understand. Which means that you need *lots* of testing to make sure that they are correct.
In the spirit of laziness I decided to see if someone else had done the legwork. It transpires that someone has (yey CRAN). The MSToolkit
package contains a function validNames
which tries to solve the problem with one big regex. Unfortunately (as of version 2.0) it doesn’t always work. Here’s the regex that that function uses.
"^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"
That translates as: start with (“^”) a dot (“\\.”) that is optional (“?”), followed by a letter (“[a-zA-Z]”), then zero or more (“*”) dots, letters or numbers (“[\\.0-9a-zA-Z]”), then finish (“$”).
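To check that reading of the regex, here is a small sketch that runs the pattern directly through grepl. It exercises only the regular expression quoted above, not the validNames function itself, and the candidate names are made up for illustration.

```r
# The pattern quoted above, stored under a name chosen for this example
mstoolkit_regex <- "^[\\.]?[a-zA-Z][\\.0-9a-zA-Z]*$"

candidates <- c("x", ".x", "x.1", "a.b.c2", "1x", ".1x")
matches <- grepl(mstoolkit_regex, candidates)
setNames(matches, candidates)
#      x     .x    x.1 a.b.c2     1x    .1x
#   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE
```

The last two fail because, with or without the optional leading dot, the next character has to be a letter.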
The first thing that pops into my mind when I see this is “what do French R programmers do?”. That is, we can define variables with accented characters, such as áçöíþ <- 1, that the regex a-zA-Z won’t pick up. There’s an easy fix here that nearly always works. We swap 0-9a-zA-Z for [:alnum:] and voilà! Locale-dependent letter and number matching. This isn’t quite perfect since, for example, in my UK English locale, I can define variables with Greek letters such as µ, but the “alpha” regex won’t match them.
grepl("[[:alpha:]]", "µ") # FALSE
Glossing over the small letter matching issues for now, there are bigger problems with the MSToolkit regex.
Underscores aren’t permitted…
validNames("foo_bar") #throws error
and neither are names consisting only of dots…
validNames("..") #throws error
but many of the reserved words (see ?Reserved
for the list) are:
validNames("if") #TRUE
I don’t want to discredit the authors of MSToolkit; writing complex regexes is a difficult task. What we need is an easier approach. Lots of smaller regexes for individual cases are easier to understand. One other tiny complication: the ellipsis argument, ..., and two dots followed by a number (which refers to the elements of the ellipsis) are valid variable names, but are reserved, so sometimes you want to think of them as valid, and sometimes you don’t.
is_valid_variable_name <- function(x, allow_reserved = TRUE) {
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved) {
    ok[x == "..."] <- FALSE
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE
  }

  #is it a reserved word? (whole-name match, so "iffy" is not caught by "if")
  reserved_words <- c("if", "else", "repeat", "while", "function", "for", "in", "next", "break", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA", "NA_integer_", "NA_real_", "NA_complex_", "NA_character_")
  ok[x %in% reserved_words] <- FALSE

  #are there any illegal characters?
  ok[!grepl("^[[:alnum:]_.]+$", x)] <- FALSE

  #does it start with underscore?
  ok[grepl("^_", x)] <- FALSE

  #does it start with dot then a number?
  ok[grepl("^\\.[[:digit:]]", x)] <- FALSE

  ok
}
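One thing worth spelling out about the reserved-word check: the comparison has to be against the whole name. Here is a rough sketch, using a shortened word list and made-up names, of why an unanchored alternation regex is too greedy and a plain %in% test is safer.

```r
# A shortened, illustrative list; see ?Reserved for the real one
reserved_words <- c("if", "else", "for", "in")
x <- c("if", "iffy", "information", "my_var")

# An unanchored alternation matches anywhere inside the name
loose <- grepl(paste(reserved_words, collapse = "|"), x)
loose
# TRUE TRUE TRUE FALSE

# %in% only matches when the whole name is a reserved word
exact <- x %in% reserved_words
exact
# TRUE FALSE FALSE FALSE
```

Anchoring the regex with “^(” and “)$” would also work, but %in% keeps the "lots of small, readable checks" spirit.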
So now we have lots of easier conditions to check. I was pretty pleased with myself after constructing this until I realised that the best way to solve this was to cheat. make.names, which I mentioned earlier, contains logic to check for valid variable names, so if a variable name is valid, then x will be the same as make.names(x). As a bonus, we can easily check for unique variable names.
is_valid_variable_name <- function(x, allow_reserved = TRUE, unique = FALSE) {
  ok <- rep.int(TRUE, length(x))

  #is name too long?
  max_name_length <- if(getRversion() < "2.13.0") 256L else 10000L
  ok[nchar(x) > max_name_length] <- FALSE

  #is it a reserved variable, i.e.
  #an ellipsis or two dots then a number?
  if(!allow_reserved) {
    ok[x == "..."] <- FALSE
    ok[grepl("^\\.{2}[[:digit:]]+$", x)] <- FALSE
  }

  #are names valid (and maybe unique)
  ok[x != make.names(x, unique = unique)] <- FALSE

  ok
}
While this answer isn’t quite as satisfactory because you can’t see what’s going on, it has the advantages that the locale-dependent letter problem vanishes, and if the specification for variable names changes, then make.names
will hopefully be updated to match it. And that makes it good enough for me.
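To make the trick concrete, here is a minimal sketch of the comparison on its own, using a handful of made-up names. A name is treated as valid exactly when make.names leaves it untouched, and passing unique = TRUE additionally flags repeats.

```r
x <- c("foo", "foo_bar", ".2way", "if", "..", "_x")

# Valid names come back from make.names() unchanged
valid <- x == make.names(x)
setNames(valid, x)
#     foo foo_bar   .2way      if      ..      _x
#    TRUE    TRUE   FALSE   FALSE    TRUE   FALSE

# With unique = TRUE, the second "a" gets renamed, so it no longer matches
y <- c("a", "a", "b")
valid_and_unique <- y == make.names(y, unique = TRUE)
valid_and_unique
# TRUE FALSE TRUE
```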
Tagged: r, regex