Introduction
Dan continues to crank out book chapter-length posts, which probably means that I should jump in before getting further behind…so here we go.
In the next few posts, I’d like to cover some work to help you to process aggregated proficiency testing (PT) data. Interpreting PT data from groups such as the College of American Pathologists (CAP) is, of course, a fundamental task for lab management. Comparing your lab’s results to peer group data from other users of the same instrumentation helps to ensure that your patients receive consistent results, and it provides at least a crude measure to ensure that your instrument performance is “in the ballpark”. Of course, many assays show significant differences between instrument models and manufacturers that can lead to results that are not comparable as a patient moves from institution to institution (or when your own lab changes instruments!). There are a number of standardization and harmonization initiatives underway (see http://harmonization.net, for example) to address this, and understanding which assays show significant bias compared to benchmark studies or national guidelines is a critical task for laboratorians. All of this is further complicated by the fact that sample matrix can significantly affect assay results, and sample commutability is one important reason why we can’t just take, say, CAP PT survey results (not counting the accuracy-based surveys) and determine which assays aren’t harmonized.
However.
With all of those caveats, it can still be useful to look through PT data in a systematic way to compare instruments. Ideally, we’d like to have everything in an R-friendly format that would allow us to ask systematic questions about data (things like “for how many assays does instrument X differ from instrument Y by >30% using PT material?”, or “how many PT materials give good concordance across all manufacturers?”). If we have good, commutable, accuracy-based testing materials, we can do even better. The first task in all of this fun, however, is getting the data into a format that R is happy with; no one I know likes the idea of retyping numbers from paper reports. I’m hoping to talk more about this in a future post, as there are lots of fun R text processing issues lurking here. In the meantime, though, we have a much more modest preliminary task to tackle.
Simple unit conversion
I’m currently staring at a CAP PT booklet. It happens to be D-dimer, but you can pick your own favorite analyte (and PT provider, for that matter). Some of the results are in ng/mL, some are ug/mL, and one is in mg/L. Let’s create an R function that allows us to convert between sets of comparable units. Now, although I know that Dan is in love with SI units (#murica), we’ll start by simply converting molar→molar and gravimetric→gravimetric. Yes, we can add fancy analyte-by-analyte conversion tables in the future…but right now we just want to get things on the same scale. In the process, we’ll cover three useful R command families.
First of all, we should probably decide how we want the final function to look. I’m thinking of something like this:
    results <- labunit.convert(2.3, "mg/dL", "g/L")
    results
    ## [1] 0.023
…which converts 2.3 mg/dL to 0.023 g/L. We should also give ourselves bonus points if we can make it work with vectors. For example, we may have this data frame:
    mydata
    ##   Value   Units Target.Units
    ## 1  2.30    g/dL         mg/L
    ## 2 47.00 nmol/mL      mmol/dL
    ## 3  0.19    IU/L        mIU/L
and we would like to be able to use our function like this:
    labunit.convert(mydata$Value, mydata$Units, mydata$Target.Units)
    ## [1] 2.3e+04 4.7e-03 1.9e+02
We should also handle things that are simpler:
    labunit.convert(0.23, "g", "mg")
    ## [1] 230
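If you want to follow along, here is one way to rebuild the example data frame shown above. This is only a sketch; the one real assumption is that the unit columns should be plain character vectors rather than factors, since the strsplit() call used later in this post refuses non-character input.

```r
# Rebuild the example data frame from the post.
# stringsAsFactors = FALSE keeps the unit columns as character vectors
# (before R 4.0.0 the default was to convert strings to factors).
mydata <- data.frame(
  Value        = c(2.3, 47, 0.19),
  Units        = c("g/dL", "nmol/mL", "IU/L"),
  Target.Units = c("mg/L", "mmol/dL", "mIU/L"),
  stringsAsFactors = FALSE
)

# Confirm the column types before handing them to any text-processing code
str(mydata)
```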
Getting started
Now that we know where we’re going, let’s start by writing a function that just converts between two units and returns the log difference. We’ll call this function convert.one.unit(), and it will take two arguments (the input unit and the output unit):
    convert.one.unit("mg", "ng")
    ## [1] 6
Basically, we want to take a character variable (like, say, “dL”) and break it into two pieces: the metric prefix (“d”) and the base unit (“L”). If it isn’t something we recognize, the function should quit and complain (you could also make it return ‘NA’ and just give a warning instead, but we’ll hold off on that for now). We’ll start with a list of things that we want to recognize.
    convert.one.unit <- function (unitin, unitout) {
      metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "",
                           "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
      metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0,
                                 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
      units.for.lab <- c("mol", "g", "L", "U", "IU")
Notice that the metric.prefixes variable contains the appropriate one- or two-character prefixes, and metric.logmultipliers has the corresponding log multiplier (for example, metric.prefixes[8] = “m”, and metric.logmultipliers[8] is -3). It’s also worth noting the "" (metric.prefixes[11]), which corresponds to a log multiplier of 0. The fact that "" is a zero-length string rather than NULL means that we can search for it in a vector…which will be very handy!
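To make the lookup idea concrete, here is a small sketch using the same two vectors. The idx variable is just a name picked for this illustration.

```r
metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "",
                     "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0,
                           1, 2, 3, 6, 9, 12, 15, 18, 21, 24)

# match() returns the position of each prefix in the lookup vector;
# note that the empty string "" is found just like any other entry
idx <- match(c("m", "", "k"), metric.prefixes)
idx                          # 8 11 14
metric.logmultipliers[idx]   # -3 0 3

# A prefix that isn't in the vector gives NA rather than an error
match("q", metric.prefixes)  # NA
```

Because match() is case sensitive, “m” (milli) and “M” (mega) land on different entries, which is exactly what we need here.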
And now for some regular expressions
This is the point where we tackle the first of the three command families that I told you about. If you’re not familiar with “regular expressions” in R or another language (Perl, Python, whatever), this is your entry point into some very useful text searching capabilities. Basically, a regular expression is a way of specifying a search for a matching text pattern, and it’s used with a number of R commands (grep(), grepl(), gsub(), regexpr(), regexec(), etc.). We’ll use gsub() as an example, since it’s one that many people are familiar with. Suppose that I have the character string “This is not a test”, and I want to strip out the “not a ” to leave “This is test”. I can feed gsub() a pattern that I want to recognize and some text that I want to use to replace the pattern. For example:
    my.string <- "This is not a test"
    my.altered.string <- gsub("not a ", "", my.string)  # replace "not a " with an empty string, ""
    my.altered.string
    ## [1] "This is test"
That’s fine as far as it goes, but we will drive ourselves crazy if we’re limited to explicit matches. What if, for example, we also want to recognize “This is not…a test”, or “This is not my kind of a test”? We could write three different gsub() statements, but that would get old fairly quickly. Instead of exactly matching the text, we’ll use a pattern. A regular expression that will match all three of our input statements is "not.+a ", so we can do the following:
    gsub("not.+a ", "", "This is not a test")
    ## [1] "This is test"
    gsub("not.+a ", "", "This is not my kind of a test")
    ## [1] "This is test"
You can read the regular expression "not.+a " as “match the letters ‘not’, followed by one or more characters of any kind (‘.’ means any character, and ‘+’ means one or more of the previous thing), followed by an ‘a’ and a space”. Because ‘+’ grabs as much as it can, the match runs all the way to the last ‘a ’ in the string. You can find some very nice tutorials on regular expressions through Google, but for the purposes of this brief lesson I’ll give you a mini-cheat sheet that probably handles 90% of the regular expressions that I have to write:
Special Character | Meaning |
---|---|
. | match any character |
\d | match any digit |
\D | match anything that isn’t a digit |
\s | match white space |
\S | match anything that isn’t white space |
\t | match a tab (less important in R, since you usually already have things in a data frame) |
^ | match the beginning of the string (i.e. “^Bob” matches “Bob my uncle” but not “Uncle Bob”) |
$ | match the end of the string |
* | match the previous thing when it occurs 0 or more times |
+ | match the previous thing when it occurs 1 or more times |
? | match the previous thing when it occurs 0 or 1 times |
( .. ) | (parentheses) enclose a group of choices or a particular substring in the match |
\| | match this OR that (e.g. “(Bob\|Pete)” matches “Dr. Bob Smith” or “Dr. Pete Jones” but not “Dr. Sam Jones”) |
It’s also important to remember for things like "\d" that R uses backslashes as the escape character inside strings…so you actually have to write a double backslash, like this: "\\d". A regular expression to match one or more digits would be "\\d+".
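Here are a few entries from the cheat sheet in action. The example strings are made up for illustration; the functions are all base R.

```r
x <- c("D-dimer 250 ng/mL", "no result", "0.5 ug/mL")

# "\\d+": does the string contain one or more digits?
has.digits <- grepl("\\d+", x)
has.digits                       # TRUE FALSE TRUE

# "^" anchors the match to the beginning of the string
starts.with.bob <- grepl("^Bob", c("Bob my uncle", "Uncle Bob"))
starts.with.bob                  # TRUE FALSE

# "(a|b)" matches either alternative
bob.or.pete <- grepl("(Bob|Pete)",
                     c("Dr. Bob Smith", "Dr. Pete Jones", "Dr. Sam Jones"))
bob.or.pete                      # TRUE TRUE FALSE

# sub() replaces only the first match; gsub() replaces every match
first.only <- sub("\\s", "", "a b c")    # "ab c"
all.gone   <- gsub("\\s", "", "a b c")   # "abc"
```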
OK, back to work. Our next step is to remove all white space from the unit text (we want "dL" to be handled the same way as " dL" or "dL "), so we’ll add the following lines:
      unitin <- gsub("\\s", "", unitin)
      unitout <- gsub("\\s", "", unitout)
See what we’ve done? We asked gsub() to replace every instance of white space (the regular expression is "\\s") with "". Easy.
Paste, briefly
Next, we want to put together a regular expression that will detect any of our metric.prefixes or units.for.lab. To save typing, we’ll do it with paste(), the second of our three R command families for the day. You probably already know about paste(), but if not, it’s basically the way to join R character variables into one big string. paste("Hi", "there") gives “Hi there” (paste() defaults to joining things with a space), paste("Super", "cali", "fragi", "listic", sep="") changes the separator to "" and gives us “Supercalifragilistic”. paste0() does the same thing as paste(..., sep=""). The little nuance that it’s worth noting today is that we are going to join together elements from a single vector rather than a bunch of separate variables…so we need to use the collapse = "..." option, where we set collapse to whatever character we want. You remember from the last section that | (OR) lets us put a bunch of alternative matches into our regular expression, so we will join all of the prefixes like this:
    prefix.combo <- paste0(metric.prefixes, collapse = "|")
    prefix.combo
    ## [1] "y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y"
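The difference between sep and collapse trips up a lot of people, so here is a quick side-by-side sketch (the parts vector and the result names are just for illustration):

```r
parts <- c("Super", "cali", "fragi", "listic")

# sep joins separate arguments, element by element...
greeting <- paste("Hi", "there")                  # "Hi there"
with.milli <- paste0("m", c("g", "L", "mol"))     # "mg" "mL" "mmol"

# ...while collapse squashes one vector down to a single string
one.word <- paste(parts, collapse = "")           # "Supercalifragilistic"
alternatives <- paste0(c("mol", "g", "L"), collapse = "|")   # "mol|g|L"

# sep on its own leaves a single vector as separate strings
still.four <- paste(parts, sep = "")
length(still.four)                                # 4
```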
What we’re really after is a regular expression that matches the beginning of the string, followed by 0 or 1 matches to one of the prefixes, followed by a match to one of the units. Soooo…
    prefix.combo <- paste0(metric.prefixes, collapse = "|")
    unit.combo <- paste0(units.for.lab, collapse = "|")
    unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")
    unit.search
    ## [1] "^(y|z|a|f|p|n|u|m|c|d||da|h|k|M|G|T|P|E|Z|Y)?(mol|g|L|U|IU)$"
So much nicer than trying to type that by hand. Next we’ll do actual pattern matching using the regexec() command. regexec(), as the documentation so nicely states, returns a list of vectors of substring matches. This is useful, since it means that we’ll get one match for the prefix (in the first set of parentheses of our regular expression), and one match for the units (in the second set of parentheses of our regular expression). I don’t want to belabor the details of this, but if we feed the output of regexec() to the regmatches() command, we can pull out one string for our prefix and another for our units. Since these are returned as a list, we’ll also use unlist() to coerce our results into one nice vector. If the length of that vector is 0, indicating no match, an error is generated.
      match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
      match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
      if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
      if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))
If we were to take a closer look at match.unit.in, we would see that the first entry is the full match, the second entry is the prefix match, and the third entry is the unit match. To make sure that the units agree (i.e. that we’re not trying to convert grams into liters or something similar), we use:
      if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")
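If you’d like to see what regexec() and regmatches() hand back, here is a sketch with a cut-down pattern. Note that toy.search is a simplified stand-in invented for this illustration (three prefixes, two base units), not the full unit.search expression.

```r
# Simplified, hypothetical pattern: three prefixes and two base units
toy.search <- "^(m|u|n)?(g|L)$"

# regmatches() returns a list with one character vector per input string
m <- regmatches("mg", regexec(toy.search, "mg"))
pieces <- unlist(m)
pieces        # "mg" "m" "g": full match, prefix, base unit

# When nothing matches, the list element is an empty character vector,
# so the unlisted result has length 0
no.match <- unlist(regmatches("furlong", regexec(toy.search, "furlong")))
length(no.match)   # 0
```

That zero-length result is what the length() checks in the function are testing for.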
…and then finish by using the match() command to find the index in the metric.prefixes vector corresponding to the correct prefix (note that if there’s no prefix matched, it matches the "" entry of the vector, which is very handy). That index allows us to pull out the corresponding log multiplier, and we then return the difference to get a conversion factor. Our final function looks like this¹:
    convert.one.unit <- function (unitin, unitout) {
      # the prefix codes for the metric system
      metric.prefixes <- c("y", "z", "a", "f", "p", "n", "u", "m", "c", "d", "",
                           "da", "h", "k", "M", "G", "T", "P", "E", "Z", "Y")
      # ...and their corresponding log multipliers
      metric.logmultipliers <- c(-24, -21, -18, -15, -12, -9, -6, -3, -2, -1, 0,
                                 1, 2, 3, 6, 9, 12, 15, 18, 21, 24)
      # The units that we'd like to detect. I guess we could add distance, but
      # that's not too relevant to most of the analytes that I can think of
      units.for.lab <- c("mol", "g", "L", "U", "IU")
      # remove white space
      unitin <- gsub("\\s", "", unitin)
      unitout <- gsub("\\s", "", unitout)
      # build the pieces of our regular expression...
      prefix.combo <- paste0(metric.prefixes, collapse = "|")
      unit.combo <- paste0(units.for.lab, collapse = "|")
      # ...and stitch it all together
      unit.search <- paste0("^(", prefix.combo, ")?(", unit.combo, ")$")
      # identify the matches
      match.unit.in <- unlist(regmatches(unitin, regexec(unit.search, unitin)))
      match.unit.out <- unlist(regmatches(unitout, regexec(unit.search, unitout)))
      if (length(match.unit.in) == 0) stop(paste0("Can't parse input units (", unitin, ")"))
      if (length(match.unit.out) == 0) stop(paste0("Can't parse output units (", unitout, ")"))
      if (match.unit.in[3] != match.unit.out[3]) stop("Base units don't match")
      # get the appropriate log multipliers
      logmult.in <- metric.logmultipliers[match(match.unit.in[2], metric.prefixes)]
      logmult.out <- metric.logmultipliers[match(match.unit.out[2], metric.prefixes)]
      # return the appropriate (log) conversion factor
      return(logmult.in - logmult.out)
    }
    # Try it out
    convert.one.unit("mL","L")
    ## [1] -3
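Earlier I mentioned that you could return NA with a warning instead of stopping. One way to get that behavior without touching the function is to wrap the call in tryCatch(). This is only a sketch: na.on.error() and toy.converter() are hypothetical names made up for this example, and toy.converter() is a tiny stand-in so that the block runs on its own. In practice you would pass convert.one.unit in its place.

```r
# Hypothetical helper: run any two-argument converter and turn an error
# into NA plus a warning that carries the original error message
na.on.error <- function (converter, unitin, unitout) {
  tryCatch(
    converter(unitin, unitout),
    error = function (e) {
      warning(conditionMessage(e), call. = FALSE)
      NA_real_
    }
  )
}

# Stand-in for convert.one.unit() (only knows "mg" and "g") that fails
# with the same kind of error message on anything else
toy.converter <- function (unitin, unitout) {
  known <- c(mg = -3, g = 0)
  if (!unitin %in% names(known)) stop("Can't parse input units (", unitin, ")")
  if (!unitout %in% names(known)) stop("Can't parse output units (", unitout, ")")
  unname(known[unitin] - known[unitout])
}

ok  <- na.on.error(toy.converter, "mg", "g")                          # -3
bad <- suppressWarnings(na.on.error(toy.converter, "furlong", "g"))   # NA
```

Whether a hard stop or a quiet NA is the better choice depends on how much you trust your input; for hand-typed PT data, I’d lean toward the hard stop.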
‘Apply’-ing yourself
We’re actually most of the way there now. The final family of commands that we’d like to use is apply(), with various flavors that allow you to repeatedly apply (no surprise) a function to many entries of a variable. Dan mentioned this in his last post. He also mentioned not understanding the bad press that for loops get when they’re small. I completely agree with him, but the issue tends to arise when you’re used to a language like C (yes, I know we’re talking about compiled vs. interpreted in that case), where your loops are blazingly fast. You come to R and try nested loops that run from 1:10000, and then you have to go for coffee. lapply(), sapply(), mapply(), apply(), etc. have advantages in the R world. Might as well go with the flow on this one.
We’re going to make a convert.multiple.units() function that takes unitsin and unitsout vectors, binds them together as two columns, and then runs apply() to feed them to convert.one.unit(). Because apply() lets us iterate a function over either dimension of a matrix, we can bind the two columns (a vector of original units and a vector of target units) and then iterate over each pair by rows (that’s what the 1 means as the second argument of apply(): it applies the function by row). If the anonymous function syntax throws you off…let us know in the comments, and we’ll cover it some time. For now, just understand that the last part of the line feeds values to the convert.one.unit() function.
    convert.multiple.units <- function (unitsin, unitsout) {
      apply(cbind(unitsin, unitsout), 1, function (x) {convert.one.unit(x[1], x[2])})
    }
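Here is the row-by-row idea on its own, using a toy function that just glues each pair together instead of calling convert.one.unit(). The mapply() version at the end is an alternative that I’m including for comparison; it is not what the function above uses.

```r
unitsin  <- c("mg", "dL")
unitsout <- c("g", "L")

# cbind() makes a 2 x 2 character matrix: one row per unit pair
pairs <- cbind(unitsin, unitsout)
pairs

# MARGIN = 1 hands each row to the anonymous function as a vector x,
# so x[1] is the input unit and x[2] is the output unit
by.row <- apply(pairs, 1, function (x) paste(x[1], "->", x[2]))
by.row    # "mg -> g" "dL -> L"

# mapply() walks the two vectors in parallel without building the matrix
by.pair <- mapply(function (a, b) paste(a, "->", b), unitsin, unitsout,
                  USE.NAMES = FALSE)
```

Both approaches visit the same pairs in the same order, so the choice is mostly a matter of taste.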
Finally, we’ll go back to our original labunit.convert() function. Our overall plan is to split each unit by recognizing the “/” character using strsplit(). This returns a list of vectors of split groups (i.e. “mg/dL” becomes a list whose first element is the character vector (“mg”, “dL”)). We then make sure that the lengths match (i.e. if the input is “mg/dL” and the output is “g/mL” that’s OK, but if the output is “g” then that’s a problem), obtain all the multipliers, and then add them all up. We add because they’re logs…and actually we mostly subtract, because we’re dividing. For cuteness points, we return 2*x[1] - sum(x), which will accurately calculate not only conversions like mg→g and mg/dL→g/L, but will even do crazy stuff like U/g/L→mU/kg/dL. Don’t ask me why you’d want to do that, but it works. The final multiplier is used to convert the vector of values (good for you if you notice that we didn’t check to make sure that the length of the values vector matched the unitsin vector…but we can always recycle our values that way).
    labunit.convert <- function (values, unitsin, unitsout) {
      insep <- strsplit(unitsin, "/")
      outsep <- strsplit(unitsout, "/")
      lengthsin <- sapply(insep, length)
      lengthsout <- sapply(outsep, length)
      if (!all(lengthsin == lengthsout)) stop("Input and output units can't be converted")
      # SIMPLIFY = FALSE keeps one vector of log multipliers per input, even
      # when every unit has a single component (e.g. c("g", "g") to c("mg", "mg"))
      multipliers <- mapply(convert.multiple.units, insep, outsep, SIMPLIFY = FALSE)
      final.multiplier <- sapply(multipliers, function (x) {2*x[1] - sum(x)})
      return(values * 10^final.multiplier)
    }
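If the 2*x[1] - sum(x) trick looks like sleight of hand, here is the arithmetic worked through by hand. The log multipliers below are typed in directly (they are what convert.one.unit() returns for each piece), so this sketch runs without any of the functions above.

```r
# Log multipliers for each "/"-separated piece, numerator first.
# mg/dL -> g/L: mg -> g is -3, dL -> L is -1
x <- c(-3, -1)

# The numerator counts positively and every denominator counts negatively,
# so x[1] - sum(x[-1]) is the same thing as 2 * x[1] - sum(x)
long.way  <- x[1] - sum(x[-1])    # -2
short.way <- 2 * x[1] - sum(x)    # -2

converted <- 2.3 * 10^short.way   # 0.023, i.e. 2.3 mg/dL is 0.023 g/L

# U/g/L -> mU/kg/dL: U -> mU is 3, g -> kg is -3, L -> dL is 1
y <- c(3, -3, 1)
crazy <- 2 * y[1] - sum(y)        # 5, so multiply by 10^5
```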
OK, enough. Back over to you, Dan. We now have a piece of code that we can use when we start comparing PT data from different instruments. That’s the immediate plan for future posts², and before long there may even be an entry with nice graphics like those of my Canadian colleague.
-SRM
1. I received a request to convert “G/L” to “M/mL”, which was interpreted as converting billions/L to millions/mL. This requires changing our convert.one.unit() function to handle a “no units” case. Actually, it’s not as difficult as it sounds; if we just add an empty string (i.e. "") to the end of the units.for.lab vector, our regular expression does the right thing. Your edited line would read units.for.lab <- c("mol", "g", "L", "U", "IU", ""). The reason this works, incidentally, is that there’s no overlap (except "") between the prefixes and the units, so the pattern match doesn’t have a chance to be confused.
2. Following Dan’s lead, I should point out a major caveat to any such plans is James 4:13-15. Double extra credit if you are interested enough to look it up.