Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.
by Joseph Rickert
Tuesday's post on a new Kaggle contest mentioned that Revolution Analytics offers a free trial for using Revolution R Enterprise in the Amazon cloud. One reason this might be of interest to contestants is the rxImport() function which reads delimited text data, fixed format text data, and with an appropriate ODBC driver, data stored in a database. (rxImport() also directly reads SAS and SPSS files, but I'm guessing that this feature is not lilely to be of interest to contestants). As it turns out, rxImport()is also useful for dealing for semistructure text data such as log files. For example, here are the first three lines of internet log file complements of gVim.
190.12.51.140 - - [24/Feb/2013:01:44:32 -0600] "GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0" 200 510166 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)" 190.12.51.140 - - [24/Feb/2013:01:44:39 -0600] "GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0" 200 404275 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)" 190.12.51.140 - - [24/Feb/2013:01:44:45 -0600] "GET /bin/macosx/leopard/contrib/2.12/fields_6.6.tgz HTTP/1.0" 200 2852202 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"
It is not quite space delimited, but it appears that the spaces may be useful. Indeed, reading the first five lines of the file, one line at a time, using just default parameters for rxImport()
# Point to file
dataDir <- "C:/DATA/REVO CRAN LOG"
file <- file.path(dataDir,"sLog.txt")
#——————————————-
# Read 5 rows to see how rxImport handles things
rxImport(inData=file,outFile="test",
rowsPerRead=1,
numRows=5,
overwrite=TRUE)
rxGetInfo(data="test",getVarInfo=TRUE,numRows=2)
produces a binary XDF file in the following form:
Number of observations: 5 Number of variables: 10 Number of blocks: 5 Compression type: zlib Variable information: Var 1: V1, Type: character Var 2: V2, Type: integer, Low/High: (NA, NA) Var 3: V3, Type: integer, Low/High: (NA, NA) Var 4: V4, Type: character Var 5: V5, Type: character Var 6: V6, Type: character Var 7: V7, Type: integer, Low/High: (200, 404) Var 8: V8, Type: integer, Low/High: (1051, 2852202) Var 9: V9, Type: character Var 10: V10, Type: character Data (2 rows starting with row 1): V1 V2 V3 V4 V5 1 190.12.51.140 NA NA [24/Feb/2013:01:44:32 -0600] 2 190.12.51.140 NA NA [24/Feb/2013:01:44:39 -0600] V6 V7 V8 V9 1 GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0 200 510166 - 2 GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0 200 404275 - V10 1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)
Not perfect – but clearly useful! Moreover, an added bonus is that rxImport() assigns variable names to the columns: "V1", "V2", etc. which can be used in the import process. The following code, imports the file and does a bit of cleaning along the way, removing some columns and renaming others.
# Import data
colX <- list("V1" = list(type="character",newName = "IP"),
"V7" = list(type="character", newName = "Status"),
"V8" = list(type="integer", newName = "NoClue"),
"V10" = list(type="character", newName = "R_version"))
rxImport(inData=file,outFile="logData",
colInfo=colX,
varsToDrop=c("V2","V3","V9"),
transformVars = c("V4","V5"),
transforms=list(Date = substr(V4,2,12),
UTC = substr(V4,14,21),
Offset = as.numeric(substr(V5,1,5))),
overwrite=TRUE)
rxGetInfo(data="logData",getVarInfo=TRUE,numRows=2)
Notice that the "transforms" parameter is doing some elementary text processing on each chunk of data that is being read. The output from this step looks like:
IP V4 V5 1 190.12.51.140 [24/Feb/2013:01:44:32 -0600] 2 190.12.51.140 [24/Feb/2013:01:44:39 -0600] V6 Status NoClue 1 GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0 200 510166 2 GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0 200 404275 R_version Date UTC 1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:32 2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:39 Offset 1 -600 2 -600
Now we are getting somewhere. We can use the rxDataStep() function to remove the columns V4 and V5, which are no longer needed, and further process the data. The following code uses a transform function in the data step to break apart the V6 column into some meaningful fields.
rxDataStep(inData="logData",outFile="logData_2",
varsToDrop=c("V4","V5"),
transformVars = c("V6"),
transformFunc = function(data) {
temp <- unlist(strsplit(data$V6, ' '));
temp.1 <- seq(from = 1, to = length(temp), by = 3);
temp.2 <- seq(from = 2, to = length(temp), by = 3);
temp.3 <- seq(from = 3, to = length(temp), by = 3);
data$Command <- temp[temp.1];
data$File <- temp[temp.2];
data$Protocol <- temp[temp.3];
data },
overwrite=TRUE)
IP Status NoClue 1 190.12.51.140 200 510166 2 190.12.51.140 200 404275 R_version Date UTC 1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:32 2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:39 Offset Command File Protocol 1 -600 GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0 2 -600 GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0
The Transform function, transformFunc(), in the last block of code may look a bit mysterious. The key to understanding what it does is to realize that rxDataStep()reads a big file a chunk at a time. Each chunk holds the data in a list, and processing must take this structure into account. If the structure of the list is not clear, it is easy enough to print things out and take a look. The following code reads in 5 lines of the file 4 lines to a chunk and prints out the contents of the chunk.
# Look at what is going on in the chunks
rxImport(inData=file,outFile="test",
transformFunc = function(data) {
print(data);
# Internal variables can tell you aboutthe chunk
print(paste("chunk starts with row",.rxStartRow,"of file"));
print(paste("chunk number = ",.rxChunkNum));
print(paste("number of rows read = ",.rxNumRows));
data },
rowsPerRead=4, # reads 4 rows into a chunk if available
numRows=5, # only read 5 rows from the file
overwrite=TRUE) # overwrite the file if it exists
The code also points out some internal variables that may be useful in writing transform functions to process each chunk.
.rxStartRow contains the row of the file that begins the chunk.
.rxChunkNum contains the number of the chunk
.rxNumRows contains the number of rows in a chunk
Download Output to have a look at chunk printed out by the last block of code. In a future post, I'll look into squeezing some information out of this file.
R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.