Importing a log file with rxImport()

[This article was first published on Revolutions, and kindly contributed to R-bloggers.]

by Joseph Rickert

Tuesday's post on a new Kaggle contest mentioned that Revolution Analytics offers a free trial of Revolution R Enterprise in the Amazon cloud. One reason this might be of interest to contestants is the rxImport() function, which reads delimited text data, fixed-format text data and, with an appropriate ODBC driver, data stored in a database. (rxImport() also reads SAS and SPSS files directly, but I'm guessing that this feature is not likely to be of interest to contestants.) As it turns out, rxImport() is also useful for dealing with semi-structured text data such as log files. For example, here are the first three lines of an internet log file, compliments of gVim.

190.12.51.140 - - [24/Feb/2013:01:44:32 -0600] "GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0" 200 510166 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"
190.12.51.140 - - [24/Feb/2013:01:44:39 -0600] "GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0" 200 404275 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"
190.12.51.140 - - [24/Feb/2013:01:44:45 -0600] "GET /bin/macosx/leopard/contrib/2.12/fields_6.6.tgz HTTP/1.0" 200 2852202 "-" "R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"

It is not quite space delimited, but it appears that the spaces may be useful. Indeed, reading the first five lines of the file, one line at a time, using just default parameters for rxImport()

# Point to file
dataDir <- "C:/DATA/REVO CRAN LOG"
file <- file.path(dataDir,"sLog.txt")
#-----------------------------------------------
# Read 5 rows to see how rxImport handles things
rxImport(inData=file,outFile="test",
rowsPerRead=1,
numRows=5,
overwrite=TRUE)

rxGetInfo(data="test",getVarInfo=TRUE,numRows=2)

produces a binary XDF file in the following form:

Number of observations: 5 
Number of variables: 10 
Number of blocks: 5 
Compression type: zlib 
Variable information: 
Var 1: V1, Type: character
Var 2: V2, Type: integer, Low/High: (NA, NA)
Var 3: V3, Type: integer, Low/High: (NA, NA)
Var 4: V4, Type: character
Var 5: V5, Type: character
Var 6: V6, Type: character
Var 7: V7, Type: integer, Low/High: (200, 404)
Var 8: V8, Type: integer, Low/High: (1051, 2852202)
Var 9: V9, Type: character
Var 10: V10, Type: character
Data (2 rows starting with row 1):
             V1 V2 V3                    V4     V5
1 190.12.51.140 NA NA [24/Feb/2013:01:44:32 -0600]
2 190.12.51.140 NA NA [24/Feb/2013:01:44:39 -0600]
                                                            V6  V7     V8 V9
1    GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0 200 510166  -
2 GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0 200 404275  -
                                                     V10
1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)
2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)
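
To see where the ten variables come from, it can help to mimic the tokenization in base R. The sketch below is only an illustration and not what rxImport() does internally: it assumes that fields are separated by whitespace and that double-quoted strings are kept together, and it uses base R's scan() rather than any RevoScaleR function.

```r
# One line from the log shown above
line <- paste0(
  '190.12.51.140 - - [24/Feb/2013:01:44:32 -0600] ',
  '"GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0" ',
  '200 510166 "-" ',
  '"R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0)"'
)

# Split on whitespace, treating double-quoted strings as single fields
fields <- scan(text = line, what = "", quote = "\"", quiet = TRUE)

length(fields)   # 10 fields, matching V1..V10
fields[4:6]      # the two halves of the timestamp and the quoted request
```

Under these assumptions the bracketed timestamp is cut in two at its embedded space, which matches the V4 and V5 columns above, while the quoted request and user-agent strings each stay in one piece.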

Not perfect, but clearly useful! Moreover, an added bonus is that rxImport() assigns variable names to the columns ("V1", "V2", etc.), which can be used in the import process. The following code imports the file and does a bit of cleaning along the way, removing some columns and renaming others.

# Import data
colX <- list("V1" = list(type="character",newName = "IP"),
"V7" = list(type="character", newName = "Status"),
"V8" = list(type="integer", newName = "NoClue"),
"V10" = list(type="character", newName = "R_version"))

rxImport(inData=file,outFile="logData",
colInfo=colX,
varsToDrop=c("V2","V3","V9"),
transformVars = c("V4","V5"),
transforms=list(Date = substr(V4,2,12),
UTC = substr(V4,14,21),
Offset = as.numeric(substr(V5,1,5))),
overwrite=TRUE)

rxGetInfo(data="logData",getVarInfo=TRUE,numRows=2)

Notice that the "transforms" parameter is doing some elementary text processing on each chunk of data that is being read. The output from this step looks like:

             IP                    V4     V5
1 190.12.51.140 [24/Feb/2013:01:44:32 -0600]
2 190.12.51.140 [24/Feb/2013:01:44:39 -0600]
                                                            V6 Status NoClue
1    GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0    200 510166
2 GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0    200 404275
                                               R_version        Date      UTC
1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:32
2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:39
  Offset
1   -600
2   -600
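
The substr() offsets used in the transforms list are easy to check in plain R on the values shown above. The snippet below is a small sketch using only base R; the variable names are made up for illustration. It also shows one possible way to express the offset in minutes, since as.numeric("-0600") yields -600, which is an hours-and-minutes code rather than a count of minutes.

```r
# Sample values of V4 and V5 taken from the output above
V4 <- "[24/Feb/2013:01:44:32"
V5 <- "-0600]"

date_part <- substr(V4, 2, 12)             # "24/Feb/2013"
time_part <- substr(V4, 14, 21)            # "01:44:32"
offset    <- as.numeric(substr(V5, 1, 5))  # -600

# Hypothetical extra step: express the offset in minutes
# (-0600 means minus 6 hours, 0 minutes)
sgn <- if (substr(V5, 1, 1) == "-") -1 else 1
offset_minutes <- sgn * (as.integer(substr(V5, 2, 3)) * 60 +
                         as.integer(substr(V5, 4, 5)))   # -360
```

Note that the time string is the time as recorded in the log; the offset column is what relates it to UTC.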

Now we are getting somewhere. We can use the rxDataStep() function to remove the columns V4 and V5, which are no longer needed, and further process the data. The following code uses a transform function in the data step to break apart the V6 column into some meaningful fields.

rxDataStep(inData="logData",outFile="logData_2",
            varsToDrop=c("V4","V5"),
            transformVars = c("V6"),
            transformFunc = function(data) {
                temp <- unlist(strsplit(data$V6, ' '));
                temp.1 <- seq(from = 1, to = length(temp), by = 3);
                temp.2 <- seq(from = 2, to = length(temp), by = 3);
                temp.3 <- seq(from = 3, to = length(temp), by = 3);
                data$Command <- temp[temp.1];
                data$File <- temp[temp.2];
                data$Protocol <- temp[temp.3];
            data },
            overwrite=TRUE)

The first two rows of logData_2 now look like this:

             IP Status NoClue
1 190.12.51.140    200 510166
2 190.12.51.140    200 404275
                                               R_version        Date      UTC
1 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:32
2 R (2.12.2 x86_64-apple-darwin9.8.0 x86_64 darwin9.8.0) 24/Feb/2013 01:44:39
  Offset Command                                            File Protocol
1   -600     GET    /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0
2   -600     GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0
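
The transform function above relies on every V6 value splitting into exactly three space-separated pieces; because the pieces are pooled with unlist(), a single request with more or fewer pieces would shift the values for every later row in the chunk. A more defensive variant is sketched below in base R on a hand-made chunk. The chunk is represented as a plain list, which is an assumption based on the description in this post, and split_request is a hypothetical helper name, not part of RevoScaleR.

```r
# A hand-made stand-in for one chunk: a list with a character vector V6
chunk <- list(V6 = c(
  "GET /bin/macosx/leopard/contrib/2.12/FGN_1.5.tgz HTTP/1.0",
  "GET /bin/macosx/leopard/contrib/2.12/fgui_1.0-2.tgz HTTP/1.0",
  "-"   # a malformed request, included to show the fallback
))

# Hypothetical helper: split each request row by row and return NA
# for rows that do not have exactly three pieces
split_request <- function(data) {
  parts <- strsplit(data$V6, " ", fixed = TRUE)
  piece <- function(i) {
    vapply(parts,
           function(p) if (length(p) == 3) p[i] else NA_character_,
           character(1))
  }
  data$Command  <- piece(1)
  data$File     <- piece(2)
  data$Protocol <- piece(3)
  data
}

result <- split_request(chunk)
result$Command    # "GET" "GET" NA
```

A function of this shape could presumably be passed as transformFunc in place of the one above, although that has not been tested here.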

The transform function passed as transformFunc in the last block of code may look a bit mysterious. The key to understanding what it does is to realize that rxDataStep() reads a big file a chunk at a time. Each chunk holds the data in a list, and processing must take this structure into account. If the structure of the list is not clear, it is easy enough to print things out and take a look. The following code reads in 5 lines of the file, 4 lines to a chunk, and prints out the contents of each chunk.

# Look at what is going on in the chunks
rxImport(inData=file,outFile="test",
        transformFunc = function(data) {
        print(data);
# Internal variables can tell you about the chunk
        print(paste("chunk starts with row",.rxStartRow,"of file"));
        print(paste("chunk number = ",.rxChunkNum));
        print(paste("number of rows read = ",.rxNumRows));
        data },
        rowsPerRead=4, # reads 4 rows into a chunk if available
        numRows=5, # only read 5 rows from the file
        overwrite=TRUE) # overwrite the file if it exists

The code also points out some internal variables that may be useful in writing transform functions to process each chunk.

.rxStartRow contains the row of the file that begins the chunk.
.rxChunkNum contains the number of the chunk.
.rxNumRows contains the number of rows in the chunk.
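
To make these three quantities concrete, the chunk bookkeeping for the example above can be emulated in base R. This is only a sketch: the names rows_per_read, num_rows, start_row, chunk_num and rows_read are made up for illustration, and the values are what one would expect from the descriptions above rather than output captured from RevoScaleR.

```r
rows_per_read <- 4   # mirrors rowsPerRead in the call above
num_rows      <- 5   # mirrors numRows in the call above

# First row of each chunk, the chunk index, and the rows in each chunk
start_row <- seq(from = 1, to = num_rows, by = rows_per_read)
chunk_num <- seq_along(start_row)
rows_read <- pmin(rows_per_read, num_rows - start_row + 1)

chunks <- data.frame(chunk_num, start_row, rows_read)
print(chunks)
```

Under these assumptions the five rows arrive as two chunks: rows 1 to 4, and then row 5 on its own.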

Download Output to have a look at the chunks printed out by the last block of code. In a future post, I'll look into squeezing some information out of this file.

 
