Analyzing OWASP Mod Security Audit Log with R

[This article was first published on Programming Utility codes, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Web servers are frequently scanned or attacked by remote HTTP clients. Attackers send HTTP requests via algorithms or by direct “human” intervention. In order to protect web servers against these threats, automated protection systems block potentially malicious HTTP requests. Every time a request is blocked, its description is logged in a audit file. Often, audit files are difficult-to-read/large text files.  For that reason, it is difficult to have a clear overview of the threats blocked by security mechanisms.  

In this tutorial I describe an algorithm that I wrote in order to analyse the HTTP requests blocked by mod security for Apache HTTPD.  In my environment Mod Security is interfaced with OWASP Core Rule Set version 3 but the algorithm should work for other rules. 

1. Parse the Audit File

Mod Security audit log file is organised in sections.  Each section starts on a new line and can be composed of multiple lines.  Sections are separated by a line specifying the transaction IDs and section code. A description of each section can be found here: https://www.feistyduck.com

 --38c0471f-A--  
 [04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs x.x.x.x 46351 x.x.x.x 443  
 --38c0471f-B--  
 GET /sitemap.xml HTTP/1.1  
 Host: 155.198.29.141  
 Accept-Encoding: identity  

For performance issue, we decided to read the audit file line by line in R. This means that our algorithm acts pretty much as a “human” reader would. A set of functions “understand” the content of the current line given the information retrieved in the previous lines. For example, if the previous line was 

 --38c0471f-A-- 

Then, the current line must indicates the time of the request, its ID and  the source and destination IP addresses and ports.  (this is because -A– is an Audit Log Header)

 [04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs x.x.x.x 46351 x.x.x.x 443  

Using regex we created a set of function that validate the type of line and extract the relevant content.

1.a Check and extract the content of the Section header and the Audit Log Header (A)

In order to check the presence of a specific text pattern or extract text within a text pattern we use regular expressions. We imported the package “stringr”, a great library that makes our life easier. The first function below evaluates if the current line is announcing a new section. The second function extracts the transaction ID.  The third function extracts the section type. The function “auditLogGetTransactionDate” retrieves the date of the transaction as an R Date object. The last function retrieves the source and destination IPs and Ports as two String (i.e. [“ipSource portSource”, “ipDest portDest”].

 # Load Regular Expression Library  
 require('stringr')  
    
 # Returns true if line is an Audit Log Header  
 auditLogIsTransaction <- function(line){  
  return(str_detect(line, "--[:xdigit:]{8}-[ABCDEFGHIJKZ]--"))  
 }  
   
 # Returns The transaction ID from the Audit Log Header  
 auditLogGetTransactionId <- function(line){  
  return(str_extract(line, "[:xdigit:]{8}"))  
 }  
   
 # Returns The section type from the Audit Log Header  
 auditLogGetTransactionSection <- function(line){  
  return(gsub("-","",str_extract(line, "-([ABCDEFGHIJKZ])-")))  
 }  
   
 # Returns the Transaction date  
 auditLogGetTransactionDate <- function(line){  
  extract <- str_extract(line, "[0123][0-9]/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/20(17|18|19|20)")  
  extractDate <- as.Date(extract,"%d/%b/%Y")  
  return( extractDate )  
 }  
 # Returns the IP addresses as an array. (i.e. [IP Source, IP Destination])  
 auditLogGetIPAddressesAndPorts <- function(line){  
  return(str_extract_all(line, "[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\s[0-9]{2,5}"))  
 }  
   

In our example, we can now recognize and extract the piece of information highlighted in yellow.

 --38c0471f-A--  
 [04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs 66.x.x.x 46351 x.x.x.x 443  
 --38c0471f-B--  
 GET /sitemap.xml HTTP/1.1  
 Host: 155.198.29.141  
 Accept-Encoding: identity  

We now need to extract the bit of information explaining why the HTTP request was deemed potentially malicious by Mod Security.

1.b Check and extract OWASP Rule information in Section H

The section H of the Audit log gives the reason why an HTTP request was blocked. It gives both the rule's ID and the rule's description.  

 --38c0471f-H--  
 Message:... [id "920350"] [rev "2"] [msg "Host header is a numeric IP address"] ... 
 Apache-Error: [file "apache2_util.c"] ... [client %s] ModSecurity: %s%s [uri "%s"]%s  
 Stopwatch: 1522856155328261 2511 (- - -)  
 Stopwatch2: 1522856155328261 2511; combined=1679,..., sr=122, sw=0, l=0, gc=0  
 Response-Body-Transformed: Dechunked  
 Producer: ModSecurity for Apache/2.9.1 (http://www.modsecurity.org/); OWASP_CRS/3.0.2.  
 Server: Apache  
 Engine-Mode: "ENABLED"

If the last section header read by the algorithm was of type "H" then the current line could potentially contains the rule's Id and the rule's message. Testing for the presence and extracting the rule's Id is much easier than the rule's message.

The two following functions test if the line string contains a rule id and extracts the rule id.
 # Returns true if contains a rule id  
 auditLogHasErrorCodeMessage <- function(line){  
  require('stringr')  
  return(str_detect(line, '\\[id "\\d{3,10}"\\]'))  
 }  
 # Returns the id of the rule  
 auditLogGetErrorCodeMessage <- function(line){  
  require('stringr')  
  return(str_extract(str_extract(line, '\\[id "[0-9]{3,10}"\\]'),"[0-9]{3,10}"))  
 }  

1.c Read *and make sense of* the current Line 

Now we put all the functions described above together in order to make sense of what happens in the current line. A Data Frame called "data" acts as the memory of the reader.  At the first line data would be null and a new data frame needs  to be created. Line by line the reader gains new information that is stored in "data". The information stored in "data" helps the reader to understand the next line.

   
 processAuditLogLine <- function(fileLine, data){  
  # Create structured storage if absent  
  if(is.null(data)){  
   data <- data.frame(txId = NA, txSection = NA, txDate = NA, txSourceIp = NA, txSourceIpLocation = NA, txSourcePort = NA, txDestIp = NA, txDestPort = NA, txRule = NA )  
  }  
  # Check if line is transaction   
  if(auditLogIsTransaction(fileLine)){  
   # Retrieve Transaction Id  
   txId <- auditLogGetTransactionId(fileLine)  
   # Retrieve transaction Type  
   txSection <- auditLogGetTransactionSection(fileLine)  
   if(txSection == 'A'){  
    data <- rbind(data, data.frame(txId = txId, txSection = txSection, txDate = NA, txSourceIp = NA, txSourceIpLocation = NA, txSourcePort = NA, txDestIp = NA, txDestPort = NA, txRule = NA))  
   }else{  
    # retrive current data row  
    txRow <- dim(data)[1]  
    data[txRow,'txSection'] <- txSection   
   }  
     
  }else{  
    
   # retrive current data row  
   txRow <- dim(data)[1]  
   
   # Check line not empty  
   if(fileLine !=''){  
      
    # Retrieve current Section Type  
    currentSection <- data[txRow,'txSection']  
       # Check if section is for a new HTTP Request (A)  
    if(currentSection == 'A'){  
      # Retrieve Date  
      data[txRow,'txDate'] <- auditLogGetTransactionDate(fileLine)  
      # Retrieve IP addresses and ports  
      ipAddressesArr <- unlist(auditLogGetIPAddressesAndPorts(fileLine))  
      source <- unlist(strsplit(ipAddressesArr[1]," "))  
      dest <- unlist(strsplit(ipAddressesArr[2]," "))  
            # Store IP addesses and ports in data  
      data[txRow,'txSourceIp'] <- source[1]  
      data[txRow,'txSourcePort'] <- source[2]  
      data[txRow,'txDestIp'] <- dest[1]  
      data[txRow,'txDestPort'] <- dest[2]  
    }  
        # Check if section is giving information about the Rule (H)  
    if(currentSection == 'H'){  
           #   
     if(startsWith(fileLine, 'Message: ')){  
      data[txRow,'txRule'] <- auditLogGetErrorCodeMessage(fileLine)  
     }  
    }  
   }  
  }  
  return(data)  
 }  
   

1.d Read the file 

So far, we have seen how we can recognize the information and extract it in R. We also saw how, in one function, we can read and make sense of a line given the previous knowledge gained in the document.  The function below shows how the file is actually read.

At every line of the file we execute the method called "method" and pass the data in parameter.  Note that data is first declared as NULL.

 processLogFile <- function(filepath, method) {  
  con = file(filepath, "r")  
  data <- NULL  
  while ( TRUE ) {  
   line = readLines(con, n = 1)  
   if ( length(line) == 0 ) {  
    break  
   }  
   # execute method on line   
   data <- method(line, data)  
  }  
  close(con)  
   
  return(data)  
 }  
   

Now to get a data frame containing all the transactions of the audit log we can call:
 # File Location  
 pathAudit <- 'C:/Temp/Logs/modsec_audit_20181001.log'   
 # Parse/Read Audit Log  
 data <- processLogFile(pathAudit, processAuditLogLine)  
 # Display data  
 head(data, 9)  

The 9 first rows are displayed below. Of course source IP addresses have been changed to protect the anonymity of our attackers and ours (A bit of GDPR compliance in this world of savage would not hurt!)

txIdtxSectiontxDatetxSourceIptxSourceIpLocationtxSourcePorttxDestIptxDestPorttxRule
1bde4ba7fZ17801184.x.x.xNA41434x.x.x.x443920350
2b6c7467bZ17801198.x.x.xNA22434x.x.x.x443920440
3ad702d79Z1780278.x.x.xNA15438x.x.x.x443920280
45efba84cZ17802139.x.x.xNA40714x.x.x.x443920350
5d358e435Z17802220.x.x.xNA2580x.x.x.x443930130
66f227755Z17802220.x.x.xNA54458x.x.x.x443920350
7c207da7eZ17802220.x.x.xNA50019x.x.x.x443920280
830e96b28Z17802198.x.x.xNA54382x.x.x.x443920280
9bdf1496cZ1780260.x.x.xNA58764x.x.x.x443920350

In this table we can see that all sections have been read up to section Z in the column "txSection". Thats because section Z announces the end of a transaction/HTTP request.  The dates in column txDate are actually stored in number of days since the "1970-01-01". 

We already have a lot of information but we would also like to retrieve the country where each Source IP address is registered. 

2.a Add additional Insight from API. (i.e. IP Address Geolocation). 

We would like to extend the data set obtained previously by adding the IP addresses geolocations.  In order to do so, we have to fetch the countries name from an online IP geolocation API.  


In this tutorial we use the API proposed by  http://api.ipstack.com/ which worked really well. You will need to replace the API_Key by your own in order to use the API. The free account provides 10,000 API requests by month. 

 retrieveIPsLocations <- function(threatsData){  
  # Library  
  require('jsonlite')  
    
  # Find unique Ips  
  unIps <- unique(threatsData$txSourceIp)  
  countriesForIpsDataFrame <- data.frame(txSourceIp = unIps, txSourceCountry = rep(NA, length(unIps)))  
  # Using https://ipstack.com  
  API_Key <- "########################"  
  countriesForIpsDataFrame$txSourceCountry <- unlist(apply(countriesForIpsDataFrame, 1, FUN = function(x){  
   # API URL for SPECIFIC IP  
   URL <- paste("http://api.ipstack.com/",x[1],"?access_key=",API_Key, sep = "")  
   jsonObj <- fromJSON(URL)  
   if(is.null(jsonObj) || is.null(jsonObj$country_name))  
    return(NA)  
   else  
    return(jsonObj$country_name)  
  }))  
    
  threatsData <- merge(threatsData, countriesForIpsDataFrame, by='txSourceIp')  
  return(threatsData)  
 }  
   

We can think of other API uses here. For example we could check which IP addresses are on  blacklist. Many online APIs provide this info for small fees.

3 Analyzing the Data

By running the code above we should have one row per HTTP request. In each row the following information are stored:  the country, date and security rules. 

It is time to unleash our data analysis skills! 

3.a Analyzing the Geography of the attacks

Using the package "rworldmap", a simple world visualization library we can easily plot the amount of request by country.

The function below does just that.

 mapThreats <- function(threathDataset){  
  require('rworldmap')  
  #Aggregate By countries  
  threathDataset <- threats %>% group_by(txSourceCountry) %>% summarise(ThreatsCount=n()) %>% arrange(desc(ThreatsCount))  
  spMapDF <- joinCountryData2Map(threathDataset, joinCode = "NAME", nameJoinColumn = "txSourceCountry")  
  mapCountryData( spMapDF, nameColumnToPlot="ThreatsCount" )  
 }  

 # Visualise threats count by country  
 mapThreats(threats)  

For our audit log we got, *unsurprisingly* the following graph:



The rule 920350  prevents any remote clients to access the web server with an IP address only. While it is a really good rule to be protected against Malicious web vulnerability scanners, it is prone to false positive. Imagine for example that google is trying to index your website by using its IP address.

In order to have a clearer picture we filter the request that were not blocked by the rule 920350.  We use "dplyr" package for filtering but any other package (or none) could be used instead. 

 # Filter for IP Rule   
 threats <- threats %>% filter(txRule!=920350)  
 # Visualise threats count by country  
 mapThreats(threats)  


We can see that most request have been filtered out. We passed from 2,253 request in the country with the highest number of blocked request to just 102.  The result however are consistent. Most of the request originate from the USA, China, Sweden, Poland, Romania, France, Germany and the Netherlands.

3.b Analyzing the Type of attack

From section 3.a it is obvious that by far the most frequently triggered security rule is 920350. We will now visualize the other rules in a bar plot.   
 barPlotModSecRules <- function(threatsDataset){  
  require('ggplot2')  
  rules <- as.data.frame(threats %>% group_by(txRule) %>% summarise(ThreatsCount=n()) %>% arrange(desc(ThreatsCount)))  
  rules$txRule <- as.factor(rules$txRule)  
  ggplot(data=rules[1:10,], aes(x=txRule, y=ThreatsCount)) +  
   geom_bar(stat="identity", width=0.5)  
 }  
 # Visualize rules in a Bar plot  
 barPlotModSecRules(threats)  



The description and count of the rules in the bar plot are given below:

Rule
Message
Warnings
920350
host header is a numeric IP address (could be search engines)
3723
930130
Restricted File Access Attempt
98
920280
Request Missing a Host Header
95
980130
Anomaly score correlation rule
42
941100
XSS Attack Detected via libinjection
34
920210
Multiple/Conflicting Connection Header Data Found.
28
920440
URL file extension is restricted by policy
24
913100
Found User-Agent associated with security scanner
18
920170
GET or HEAD Request with Body Content.
6
920100
Invalid HTTP Request Line
4
930110
Path Traversal Attack (/../)
3
933160
PHP Injection Attack: High-Risk PHP Function Call Found
3
920340
Request Containing Content, but Missing Content-Type header
1
932150
Remote Command Execution: Direct Unix Command Execution
1

It seems that attacker mainly attempts to access restricted files. In total over the period of the log, 95 requests were made without a proper host header, this can be the work of Scanner Hacker or misconfigured HTTP clients (false positive)   

3.c Analyzing the chronology of attacks

Using our data.frame, we can visualize the chronology of attacks and search for seasonal patterns. In addition we can search for trends or stationarity.

 # Count blocked HTTP requests for each day in our dataset  
 dataAtHand <- threats %>% group_by(txDate) %>% summarise(count=n()) %>% arrange(txDate)  
 # Create dataset with all days between the first and last day of the log  
 allDates <-data.frame(txDate= seq(as.Date(unlist(head(dataAtHand,1)[1])), as.Date(unlist(tail(dataAtHand,1)[1])), by="days"))  
 # Merge data frame containing all the dates with the data frame containing the count.   
 datesData <- merge(allDates, dataAtHand , by="txDate" ,all.x=TRUE) %>% arrange(txDate)  
 # Set count equal to zero when no request for one day  
 datesData[is.na(datesData$count),'count']<- 0  
   
 # Plot time series  
 par(mar=c(8,4.1,4.1,2.1))  
 plot(ts(datesData$count), axes=F, ylab="Warnings",xlab="")  
 title(xlab="Time", line=6, cex.lab=1.2)  
 labelsIndexes <-seq(1,length(datesData$txDate), by=30)  
 axis(1, at = labelsIndexes, labels = datesData$txDate[labelsIndexes], las=3)  
 axis(2)  
   

The code above returns the following time series plot:

No clear seasonality and trend could be found be inspecting the graph visually.  Further test statistics could be used to assess whether either seasonality or trend effect are present.

4. Conclusion

It is difficult to have an overview of the threat by looking at the log files. With the code above we can not only have an overview of the attacks but also analyse our attackers and their behaviors. We can monitor the time of their attack, the type of their attack and their geographic location.   

The resulting data set can be used to test statistical hypothesis and improve our defense by understanding our attackers.

This post is intended to be used on https://www.r-bloggers.com

To leave a comment for the author, please follow the link and comment on their blog: Programming Utility codes.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)