Flat File Interface your Mass Spectrometer to the Laboratory Information System with R
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
The Problem
As Clinical Pathologists we work hard to create laboratory developed tests (LDTs) using liquid chromatography and tandem mass spectrometry (LC-MS/MS) that are robust, repeatable, accurate and have a wider dynamic range than commercial immunoassays. In our experience, properly developed LC-MS/MS assays are much less expensive and outperform their commercial immunoassay counterparts from an analytical standpoint.
However, despite mass spectrometry's communal obsession with analytical performance of our LDTs, sometimes we overlook the matter of handling the data we generate. Unlike traditional diagnostic companies (e.g. Siemens, Roche) who take care of upload and download of patient data and results via HL7 streams to the laboratory information system (LIS), mass spectrometry companies have not yet made this a priority. This leaves us either paying out a lot of money for custom middleware solutions or manually transcribing our LC-MS/MS results.
We might naively think, “How bad can the transcription be?” but over time, it becomes painfully evident that manual transcription of result is tedious, error–prone and inefficient use of tech–time.
Many LIS vendors offer what is called a “flat-file interface”. In this case, there is no HL7 stream generated using a communication socket between instrument and LIS. Rather, the results are saved in an ASCII text file with a pre-defined format and then transferred to the LIS via a secure shell (SSH) connection.
For this post, we are going to take some sample flat files from a SCIEX API5000 triple quadrupole mass spectrometer and prepare a flat file for the SunQuest LIS. Please note that this code is provided to you as is under the GNU Public Licence and without any guarantee. You know how all the LC-MS/MS vendors say their instruments are for “research use only”? –yeah, I'm giving this to you in the same spirit. If you use or modify it, you do so at your own risk. Any changes to how your flatfile is generated by your mass spectrometer or any upgrades to your LC-MS/MS software could make this code malfunction. You have been warned.
The Required Format
SunQuest requires the output file to be a comma separated values (CSV) file with a unique specimen or internal QC result in each row. The first column is the instrument ID, the second columns is the specimen container ID (an E followed by a 10–digit integer), the third is testcode and the fourth is the result. The file itself is required to have a time–stamp so that it has a traceable name and should have no header. For an instrument named PAPI (short for Providence API 5000) and a testcode TES (for testosterone), the file might look like this:
PAPI,E2324434511,TES,3.12 PAPI,E2324434542,TES,8.75 PAPI,E2324434565,TES,25.34 ...
The Starting Material
After we have completed an analytical run and reviewed all peaks to generate our fileable results, we can export the quatified sample batch to an ASCII text file. The file contains a whole lot of diagnostic information about the run like which multiple reaction monitoring (MRM) transitions we used, what the internal standard (IS) counts were, results from the quantifier and qualifier ion, fitted values for the calibrators etc. There are more than 80 columns in a typical file and we could talk about all the things we might do with this data but in this case, we are concerned with extracting and preparing the results file.
Dialogue Box
If we are actually going to make an R script usable by a human, it would be good to be able to choose which file we want to process and what test we want to extract using a simple graphical user interface (GUI). There are a number of tools one can use to build GUIs in R but the most rudimentary is TclTk. I have to confess that I find the language constructs for GUI creation both non–intuitive and boring. For this reason, I present without discussion, a modification of a recipe for creating a box with radio–buttons. We are going to choose which of three analytes (you can increase this number as you please) for which we wish to process a flat–file. These are: aldosterone, cortisol and testosterone. Please note that if you execute this code on a Mac, you will have to install XQuartz because Macs don't have native X-windows support despite the BSD Linux heritage of OSX.
library(tcltk2) #make a radiobutton widget #source for tk widget modifed from http://www.sciviews.org/recipes/tcltk/TclTk-radiobuttons/ #accessed Feb 10, 2016 win1 <- tktoplevel() win1$env$rb1 <- tk2radiobutton(win1) win1$env$rb2 <- tk2radiobutton(win1) win1$env$rb3 <- tk2radiobutton(win1) rbValue <- tclVar("Aldosterone") tkconfigure(win1$env$rb1, variable = rbValue, value = "Aldosterone") tkconfigure(win1$env$rb2, variable = rbValue, value = "Cortisol") tkconfigure(win1$env$rb3, variable = rbValue, value = "Testosterone") tkgrid(tk2label(win1, text = "Which analyte are you processing?"), columnspan = 2, padx = 10, pady = c(15, 5)) tkgrid(tk2label(win1, text = "Aldosterone"), win1$env$rb1, padx = 10, pady = c(0, 5)) tkgrid(tk2label(win1,text = "Cortisol"), win1$env$rb2, padx = 10, pady = c(0, 5)) tkgrid(tk2label(win1,text = "Testosterone"), win1$env$rb3, padx = 10, pady = c(0, 15)) onOK <- function() { rbVal <- as.character(tclvalue(rbValue)) tkdestroy(win1) } win1$env$butOK <- tk2button(win1, text = "OK", width = -6, command = onOK) tkgrid(win1$env$butOK, columnspan = 2, padx = 10, pady = c(5, 15)) tkfocus(win1) #this final line is necessary to prevent to the program from proceeding until this radio button widget has closed tkwait.window(win1)
This will give us the following pop-up window with radiobuttons in which I have selected testosterone.
You will notice that Tk windows do not appear native to the operating system. We can live with this because we are not shallow.
After you hit the OK button, the Tk widget then puts the chosen value into an Tk variable called rbValue
. We can determine the value using the command tclvalue(rbValue)
. The reason we need to know which analyte we are working with is because the name of the MRM we want to pull out of the flat file is dependent on the analyte of course. We will also need to replace results below the limit of quantitation (LoQ) with “< x”, whatever x happens to be, which will be a different threshold for each analyte.
In our case, the testcodes for aldosterone, cortisol and testosterone are ALD,CORT and TES respectively, the LoQs are 50 pmol/L, 1 nmol/L and 0.05 nmol/L and the MRM names are “Aldo 1”, “Aldo 2”, “Cortisol 1”, “Cortisol 2” and “Testo 1” and “Testo 2” as we defined them within SCIEX Analyst Software. We will use the switch()
function to define three variables (test.code
, LoQ
, and MRM.names
) which we will use later to process the flat–file. We will also define the name of the worksheet in a variable called worksheet
. These are the parameters you would have to change in order to modify the code for your purposes.
#set the testcode by test test.code <- switch(tclvalue(rbValue), "Aldosterone" = "ALD", "Testosterone" = "TES", "Cortisol" = "CORT" ) #set the LoQ by test LoQ <- switch(tclvalue(rbValue), "Aldosterone" = "<50", "Testosterone" = "<0.05", "Cortisol" = "<1" ) #set the MRM names by test MRM.names <- switch(tclvalue(rbValue), "Aldosterone" = c("Aldo 1", "Aldo 2"), "Testosterone" = c("Testo 1", "Testo 2"), "Cortisol" = c("Cortisol 1", "Cortisol 2") ) #set the worksheet name for your site worksheet <- "PAPI"
Building File Names
Now we will prompt the user to tell them that they are to choose an instrument flat–file and we will determine the path of the chosen file. We will need the path to both read in the appropriate file but also to write the output later.
#choose the flat file to process tkmessageBox(message="You are about to choose a flat file to process.") flat.file.path <- tk_choose.files(default = "", caption = "Select File", multi = FALSE, filters = NULL, index = 1) #determine the directory name of the chosen file flat.file.dir <- dirname(flat.file.path) #determine the file name of the chosen file flat.file.name <- basename(flat.file.path)
This code will create this message box:
and this file choice dialogue box:
and after a file is selected and the Open is pressed, the path to the flat–file is stored in the variable flat.file.path
.
Behold: The Data
So we chosen the file we want to read in but what does this file look like? To just get a gander at it, we could open it with Excel and see how it is laid out. But since we have broken up with Excel, we won't do this. SCIEX Analyst exports tab (not comma) delimited files. R has a built in function read.delim()
for reading these files but we will quickly discover that read.delim()
assumes the files have a rectangular structure, having the same number of columns in each row. R will make assumptions about the shape of the data file based on the first few rows and then try to read it in. In this case, it will fail and you will get gibberish. To get this to work for us we will need to tell R how many rows to skip before the real data starts or we will need to tell R the number of columns the file has (which is not guaranteed to be consistent between versions of vendor software). There are lots of ways to do this but I think the simplest is to use grep()
.
I did this by reading the file in with no parsing of the tabs using the readLines()
function. This function creates a vector for which each successive value is the entire content of the row of the file. I display the first 30 lines of the file. Suppose that we chose a testosterone flat file.
x <- readLines(flat.file.path) x[1:30]
## [1] "Peak Name: Testo-d3 2" ## [2] "Use as Internal Standard" ## [3] "Q1/Q3 Masses: 292.50/97.20 Da" ## [4] "" ## [5] "Peak Name: Testo 1" ## [6] "Internal Standard: Testo-d3 2" ## [7] "Q1/Q3 Masses: 289.50/97.20 Da" ## [8] "" ## [9] "FittQuadratictWeightingt1 / xtIteratetNo" ## [10] "a0t0.00658" ## [11] "a1t0.2" ## [12] "a2t-0.000443" ## [13] "Correlation coefficientt0.9999" ## [14] "Use Area" ## [15] "" ## [16] "Peak Name: Testo 2" ## [17] "Internal Standard: Testo-d3 2" ## [18] "Q1/Q3 Masses: 289.50/109.10 Da" ## [19] "" ## [20] "FittQuadratictWeightingt1 / xtIteratetNo" ## [21] "a0t0.00359" ## [22] "a1t0.17" ## [23] "a2t-0.000313" ## [24] "Correlation coefficientt0.9999" ## [25] "Use Area" ## [26] "" ## [27] "" ## [28] "" ## [29] "Sample NametSample IDtSample TypetSample CommenttSet NumbertAcquisition MethodtAcquisition DatetRack TypetRack PositiontVial PositiontPlate TypetPlate PositiontFile NametDilution FactortWeight To Volume RatiotSample AnnotationtDispositiontAnalyte Peak NametAnalyte UnitstAnalyte Peak Area (counts)tAnalyte Peak Area for DAD (mAU x min)tAnalyte Peak Height (cps)tAnalyte Peak Height for DAD (mAU)tAnalyte Concentration (nmol/L)tAnalyte Retention Time (min)tAnalyte Expected RT (min)tAnalyte RT Window (sec)tAnalyte Centroid Location (min)tAnalyte Start ScantAnalyte Start Time (min)tAnalyte Stop ScantAnalyte Stop Time (min)tAnalyte Integration TypetAnalyte Signal To NoisetAnalyte Peak Width (min)tStandard Query StatustAnalyte Mass Ranges (Da)tAnalyte Wavelength Ranges (nm)tArea RatiotHeight RatiotAnalyte AnnotationtAnalyte ChanneltAnalyte Peak Width at 50% Height (min)tAnalyte Slope of Baseline (%/min)tAnalyte Processing Alg.tAnalyte Peak AsymmetrytAnalyte Integration QualitytIS Peak NametIS UnitstIS Peak Area (counts)tIS Peak Area for DAD (mAU x min)tIS Peak Height (cps)tIS Peak Height for DAD (mAU)tIS Concentration (nmol/L)tIS Retention Time (min)tIS Expected RT (min)tIS RT Window (sec)tIS Centroid Location (min)tIS Start ScantIS Start Time (min)tIS Stop ScantIS Stop Time (min)tIS Integration TypetIS Signal To NoisetIS Peak Width (min)tIS Mass Ranges (Da)tIS Wavelength Ranges (nm)tIS ChanneltIS Peak Width at 50% Height (min)tIS Slope of Baseline (%/min)tIS Processing Alg.tIS Peak AsymmetrytIS Integration QualitytUse RecordtRecord ModifiedtCalculated Concentration (nmol/L)tCalculated Concentration for DAD (nmol/L)tRelative Retention TimetAccuracy (%)tResponse FactortAcq. Start Time (min)tInjection Volume usedt" ## [30] "BlankttBlanktBlankMPX_SAMPLE_ID:189001;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;t0tTesto_DWP_S2.damt2/11/2013 5:06:45 PMtDeep Well MTP 96 Cooledt1t1tDeep Well MTP 96 Cooledt2t140305B1.wifft1.00t0.00tttTesto 1tnmol/Lt0tN/At0.00e+000tN/At0.00t0.00t1.29t30.0t0.00t0t0.00t0t0.00tNo PeaktN/At0.00tN/At289.500/97.200 DatN/At0.00e+000t0.00e+000ttN/At0.00t0.00e+000tSpecify Parameters - MQIIIt0.00t0.00tTesto-d3 2tnmol/Lt158416tN/At5.58e+004tN/At1.00t1.27t1.29t20.0t1.28t115t1.18t139t1.43tBase To BasetN/At0.248t292.500/97.200 DatN/AtN/At4.36e-002t1.06e+000tSpecify Parameters - MQIIIt1.60t0.956tt0tN/AtN/At0.00tN/AtN/At2.85t25t"
All of the t
's that you see are the tabs in the file which are has read in literally when we use readLines()
. We can see that in this file nothing of use happens until line 29 but this is not consistent from file to file so we should not just assume that 29 is always the magic number where the good stuff begins. We can see that the line starting “Sample Name t Sample ID” is the real starting point so we can determine how many lines to skip by using grep()
and prepare for some error–handling with a variable called problem
by which we can deal with the circumstance that no approriate starting row is identified.
skip.val <- grep("Sample NametSample ID", x, fixed = TRUE) - 1 #if no such row is found, then the wrong file has been chosen if (length(skip.val)==0){ problem <- TRUE } else { problem <- FALSE } skip.val
## [1] 28
Now that we know how many lines to skip we can read in the data:
my.data <- read.delim(flat.file.path, sep = "t", strip.white = TRUE, skip = skip.val, header = TRUE, stringsAsFactors = FALSE)
We can have a look at the structure of this file
str(my.data)
## 'data.frame': 196 obs. of 83 variables: ## $ Sample.Name : chr "Blank" "Blank" "STD1" "STD1" ... ## $ Sample.ID : logi NA NA NA NA NA NA ... ## $ Sample.Type : chr "Blank" "Blank" "Standard" "Standard" ... ## $ Sample.Comment : chr "BlankMPX_SAMPLE_ID:189001;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" "BlankMPX_SAMPLE_ID:189001;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" "StandardMPX_SAMPLE_ID:189002;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" "StandardMPX_SAMPLE_ID:189002;Stream Number:2;Plate Code:Deep Well MTP 96 Cooled;Injection Volume:25;" ... ## $ Set.Number : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Acquisition.Method : chr "Testo_DWP_S2.dam" "Testo_DWP_S2.dam" "Testo_DWP_S2.dam" "Testo_DWP_S2.dam" ... ## $ Acquisition.Date : chr "2/11/2013 5:06:45 PM" "2/11/2013 5:06:45 PM" "2/11/2013 5:13:14 PM" "2/11/2013 5:13:14 PM" ... ## $ Rack.Type : chr "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" ... ## $ Rack.Position : int 1 1 1 1 1 1 1 1 1 1 ... ## $ Vial.Position : int 1 1 13 13 25 25 37 37 49 49 ... ## $ Plate.Type : chr "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" "Deep Well MTP 96 Cooled" ... ## $ Plate.Position : int 2 2 2 2 2 2 2 2 2 2 ... ## $ File.Name : chr "140305B1.wiff" "140305B1.wiff" "140305B1.wiff" "140305B1.wiff" ... ## $ Dilution.Factor : num 1 1 1 1 1 1 1 1 1 1 ... ## $ Weight.To.Volume.Ratio : num 0 0 0 0 0 0 0 0 0 0 ... ## $ Sample.Annotation : logi NA NA NA NA NA NA ... ## $ Disposition : logi NA NA NA NA NA NA ... ## $ Analyte.Peak.Name : chr "Testo 1" "Testo 2" "Testo 1" "Testo 2" ... ## $ Analyte.Units : chr "nmol/L" "nmol/L" "nmol/L" "nmol/L" ... ## $ Analyte.Peak.Area..counts. : int 0 0 5273 3464 19412 16195 37994 32722 87815 74821 ... ## $ Analyte.Peak.Area.for.DAD..mAU.x.min. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Peak.Height..cps. : num 0 0 1830 1300 6620 5700 13600 11400 30900 26100 ... ## $ Analyte.Peak.Height.for.DAD..mAU. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Concentration..nmol.L. : chr "0.00" "0.00" "0.108" "0.108" ... ## $ Analyte.Retention.Time..min. : num 0 0 1.29 1.29 1.29 1.28 1.29 1.29 1.29 1.29 ... ## $ Analyte.Expected.RT..min. : num 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 ... ## $ Analyte.RT.Window..sec. : num 30 30 30 30 30 30 30 30 30 30 ... ## $ Analyte.Centroid.Location..min. : num 0 0 1.3 1.29 1.29 1.29 1.3 1.3 1.3 1.29 ... ## $ Analyte.Start.Scan : int 0 0 119 120 119 118 120 120 119 119 ... ## $ Analyte.Start.Time..min. : num 0 0 1.22 1.23 1.22 1.21 1.23 1.23 1.22 1.22 ... ## $ Analyte.Stop.Scan : int 0 0 137 130 137 135 135 138 141 139 ... ## $ Analyte.Stop.Time..min. : num 0 0 1.41 1.33 1.41 1.39 1.39 1.42 1.45 1.43 ... ## $ Analyte.Integration.Type : chr "No Peak" "No Peak" "Base To Base" "Base To Base" ... ## $ Analyte.Signal.To.Noise : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Peak.Width..min. : num 0 0 0.186 0.103 0.186 0.176 0.155 0.186 0.227 0.207 ... ## $ Standard.Query.Status : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Mass.Ranges..Da. : chr "289.500/97.200 Da" "289.500/109.100 Da" "289.500/97.200 Da" "289.500/109.100 Da" ... ## $ Analyte.Wavelength.Ranges..nm. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Area.Ratio : num 0 0 0.0304 0.02 0.113 0.094 0.219 0.188 0.468 0.398 ... ## $ Height.Ratio : num 0 0 0.0304 0.0216 0.108 0.0933 0.225 0.189 0.471 0.398 ... ## $ Analyte.Annotation : logi NA NA NA NA NA NA ... ## $ Analyte.Channel : chr "N/A" "N/A" "N/A" "N/A" ... ## $ Analyte.Peak.Width.at.50..Height..min. : num 0 0 0.0447 0.0447 0.046 0.0432 0.0439 0.0454 0.0438 0.045 ... ## $ Analyte.Slope.of.Baseline....min. : num 0 0 16.4 23.2 2.89 7.54 4.65 3.46 0.631 3 ... ## $ Analyte.Processing.Alg. : chr "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" ... ## $ Analyte.Peak.Asymmetry : num 0 0 1.8 0.874 1.83 1.41 1.51 2.18 2.18 2 ... ## $ Analyte.Integration.Quality : num 0 0 0.379 0.233 0.697 0.62 0.794 0.765 0.907 0.875 ... ## $ IS.Peak.Name : chr "Testo-d3 2" "Testo-d3 2" "Testo-d3 2" "Testo-d3 2" ... ## $ IS.Units : chr "nmol/L" "nmol/L" "nmol/L" "nmol/L" ... ## $ IS.Peak.Area..counts. : int 158416 158416 173383 173383 172263 172263 173811 173811 187783 187783 ... ## $ IS.Peak.Area.for.DAD..mAU.x.min. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Peak.Height..cps. : num 55800 55800 60100 60100 61200 61200 60300 60300 65700 65700 ... ## $ IS.Peak.Height.for.DAD..mAU. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Concentration..nmol.L. : num 1 1 1 1 1 1 1 1 1 1 ... ## $ IS.Retention.Time..min. : num 1.27 1.27 1.27 1.27 1.27 1.27 1.28 1.28 1.28 1.28 ... ## $ IS.Expected.RT..min. : num 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 1.29 ... ## $ IS.RT.Window..sec. : num 20 20 20 20 20 20 20 20 20 20 ... ## $ IS.Centroid.Location..min. : num 1.28 1.28 1.28 1.28 1.28 1.28 1.28 1.28 1.28 1.28 ... ## $ IS.Start.Scan : int 115 115 117 117 115 115 117 117 118 118 ... ## $ IS.Start.Time..min. : num 1.18 1.18 1.2 1.2 1.18 1.18 1.2 1.2 1.21 1.21 ... ## $ IS.Stop.Scan : int 139 139 140 140 140 140 139 139 139 139 ... ## $ IS.Stop.Time..min. : num 1.43 1.43 1.44 1.44 1.44 1.44 1.43 1.43 1.43 1.43 ... ## $ IS.Integration.Type : chr "Base To Base" "Base To Base" "Base To Base" "Base To Base" ... ## $ IS.Signal.To.Noise : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Peak.Width..min. : num 0.248 0.248 0.238 0.238 0.258 0.258 0.227 0.227 0.217 0.217 ... ## $ IS.Mass.Ranges..Da. : chr "292.500/97.200 Da" "292.500/97.200 Da" "292.500/97.200 Da" "292.500/97.200 Da" ... ## $ IS.Wavelength.Ranges..nm. : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Channel : chr "N/A" "N/A" "N/A" "N/A" ... ## $ IS.Peak.Width.at.50..Height..min. : num 0.0436 0.0436 0.0445 0.0445 0.0435 0.0435 0.0455 0.0455 0.0451 0.0451 ... ## $ IS.Slope.of.Baseline....min. : num 1.06 1.06 1.17 1.17 1.39 1.39 1.51 1.51 1.93 1.93 ... ## $ IS.Processing.Alg. : chr "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" "Specify Parameters - MQIII" ... ## $ IS.Peak.Asymmetry : num 1.6 1.6 2.19 2.19 1.77 1.77 1.88 1.88 2.18 2.18 ... ## $ IS.Integration.Quality : num 0.956 0.956 0.971 0.971 0.968 0.968 0.969 0.969 0.97 0.97 ... ## $ Use.Record : int NA NA 1 1 1 1 1 1 1 1 ... ## $ Record.Modified : int 0 0 0 0 0 0 0 0 0 0 ... ## $ Calculated.Concentration..nmol.L. : chr "N/A" "N/A" "0.119" "0.0962" ... ## $ Calculated.Concentration.for.DAD..nmol.L.: chr "N/A" "N/A" "N/A" "N/A" ... ## $ Relative.Retention.Time : num 0 0 1.01 1.01 1.01 1.01 1.01 1.01 1.01 1.01 ... ## $ Accuracy.... : chr "N/A" "N/A" "111." "89.1" ... ## $ Response.Factor : chr "N/A" "N/A" "0.282" "0.185" ... ## $ Acq..Start.Time..min. : num 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 2.85 ... ## $ Injection.Volume.used : int 25 25 25 25 25 25 25 25 25 25 ... ## $ X : logi NA NA NA NA NA NA ...
Just Tell Me the Results
And we see that there is lots of stuff we don't need. What we do need are the columns titled “Sample.Name” (which is the specimen container ID in this case), the “Analyte.Peak.Name” (which is the MRM, either quantifier or qualifier), and the one whose name starts with “Calculated.Concentration..”. The last of these also contains the units of measure which is analyte–dependent. To get rid of this analyte–dependence of the column name, we can find out which column this is and rename it:
conc.col.num <- grep("Calculated.Concentration..",names(my.data), fixed = TRUE) names(my.data)[conc.col.num]<- "Calculated.Concentration"
Now we can pull out the three columns of interest and put them into a dataframe named results.
#pull out the columns of interest results <- my.data[,c("Sample.Name", "Analyte.Peak.Name","Calculated.Concentration")] names(results) <- c("sampleID", "mrm", "conc")
Now we only need the quantifier ion results which we were defined by the user with Tk GUI, so we can pull them out with grep. I will pull out the qualifiers also but we do not need them unless we wanted to compute ion-ratios, for example.
#handle non-numeric results quantifiers <- results[grep(MRM.names[1], results$mrm),] quantifiers$conc <- as.numeric(quantifiers$conc)
## Warning: NAs introduced by coercion
qualifiers <- results[grep(MRM.names[2], results$mrm),] qualifiers$conc <- as.numeric(qualifiers$conc)
## Warning: NAs introduced by coercion
Having pulled out the MRM of interest, we can define which rows correspond to standards, QC and patients by appropriate use of grep()
. It happens that the CIDs all start with E followed by a 10 digit number so we can search for this pattern with a simple regular expression. Since we only need the QCs and patient data, the variable standards
is calculated only as a matter of completeness.
#separate out sample types standards <- grep("Blank|STD",quantifiers$sampleID) qc <- grep("C-", quantifiers$sampleID) #create a regular expression to identify samples (E followed by 10 digits) regexp<-"(^E)([[:digit:]]{10})" patients <-grep(pattern=regexp,quantifiers$sampleID) output.data <- quantifiers[c(qc,patients),]
Preparing Data for Output
Now we can prepare to write a dataframe corresponding to the required format of the output file. To do so, we'll need to find out how many rows we are writing and then prepare a vector of the same length repeating the name of the worksheet and testcode:
#prepare the final data num.rows <- length(output.data$sampleID) final.output.data <- data.frame(rep(worksheet,num.rows), output.data$sampleID, rep(test.code, num.rows), output.data$conc) names(final.output.data) <- c("worksheet","sample","test","conc")
Now we can replace all the NA values that replaced “No Peak” with the correct LoQ according to which analyte we are looking at.
#to put LOQs in, we need to convert to character #this assumes that all non numeric results are undetectable final.output.data$conc <- as.character(final.output.data$conc) final.output.data$conc[is.na(final.output.data$conc)] <- LoQ
Our final.output.data
dataframe looks like it behaved properly.
head(final.output.data,10)
## worksheet sample test conc ## 1 PAPI C-LY1LR TES 0.557 ## 2 PAPI C-LY1 TES 5.65 ## 3 PAPI C-LY2 TES 20.6 ## 4 PAPI C-LY3 TES 28.1 ## 5 PAPI C-PTES TES 0.737 ## 6 PAPI E1234083035 TES 1.04 ## 7 PAPI E1234109065 TES 14.1 ## 8 PAPI E1234086634 TES 19.2 ## 9 PAPI E1234107491 TES 13 ## 10 PAPI E1234114052 TES 18.6
Timestamping, Writing and Archiving
And finally, we create directories to archive our data (if those directories do not exist) and write the files with an appropriate timestamp determined using Sys.time()
. Since colons (i.e : ) don't play nice in all operating systems as filenames, we can use gsub()
to get rid of them. We also pass along error messages or confirmation messages to the user as appropriate.
#If the data file happens to be empty because you selected the wrong file, abort if(nrow(final.output.data)==0){ tkmessageBox(message="Your flat file contained no patient data. Aborting file output") } else if (nrow(final.output.data)>0) { #create the output directory if it does not exist if(!dir.exists(file.path(flat.file.dir, "Processed"))){ dir.create(file.path(flat.file.dir, "Processed")) } if(!dir.exists(file.path(flat.file.dir, "Raw"))){ dir.create(file.path(flat.file.dir, "Raw")) } #create a ISO 8601 compliant timestamp - get rid of spaces and colons time.stamp <- gsub(":","", Sys.time(), fixed = TRUE) time.stamp <- gsub(" ","T", time.stamp, fixed = TRUE) #save a copy of the input file flat.file.copy.name <- paste(test.code,"_",time.stamp, "_Raw.txt", sep="") file.copy(flat.file.path, file.path(flat.file.dir,"Raw", flat.file.copy.name )) #write the final output file final.output.name <- paste(test.code,"_",time.stamp, ".txt", sep="") final.output.path <- file.path(flat.file.dir,"Processed" ,final.output.name) write.table(file = final.output.path, final.output.data, quote = FALSE, row.names = FALSE, col.names = FALSE, sep = ",") #check that the file was created as expected if(file.exists(final.output.path)){ tkmessageBox(message="Data successfully processed n Check Processed directory") } else { tkmessageBox(message="Your file was not created. There was a problem") } }
Finally, we would wrap all of the directory–creation and file–operation in an if statement tied to the variable called problem
we created previously. You will see this in the final source–code linked below.
Other Things You Can Do
Now, you can easily modify this to deal with multiple anlytes that are always on the same run, such as Vitamin D2 and Vitamin D3. If you wanted to suppress results failing ion ratio criteria (which could be concentration–dependent of course) or if you had specimens unexpectedly low IS counts, you could easily censor them to prevent their upload and then review them manually. You could also append canned comments to your results with a dash between your result and the comment. In fact, you could theoretically develop very elaborate middleware for QC evaluation and interpretation. You could also use RMarkdown to generate PDF reports for the run which could include calibration curve plots, plots of quantifier results vs qualifier results, and results that fail various criteria.
Source
You can download the source code and three example flat files here. Setting the source–code up as a “clickable” script is somewhat dependent on the operating system you are working on. Since most of you will be on a windows system you can follow this tutorial. You can also use a windows batch file to call your script.
Final Thought
Now that your file is generated, it is read to upload via ssh. This is usually performed manually but could be automated. Don't implement this code into routine use unless you know what you are doing and you have tested it extensively. By using and/or modifying it, you become entirely responsible for its correct operation. Excel is like a butter knife and R is like Swiss Army Knife. You must be careful with it because…
From everyone who has been given much, much will be demanded; and from the one who has been entrusted with much, much more will be asked.
Luke 12:48
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.