Analyzing OWASP Mod Security Audit Log with R
Web servers are frequently scanned or attacked by remote HTTP clients. Attackers send HTTP requests either through automated tools or by direct “human” intervention. To protect web servers against these threats, automated protection systems block potentially malicious HTTP requests. Every time a request is blocked, its description is logged in an audit file. Audit files are often large, difficult-to-read text files, which makes it hard to get a clear overview of the threats blocked by these security mechanisms.
In this tutorial I describe an algorithm that I wrote to analyse the HTTP requests blocked by ModSecurity for Apache HTTPD. In my environment ModSecurity is interfaced with the OWASP Core Rule Set version 3, but the algorithm should work with other rule sets.
1. Parse the Audit File
The ModSecurity audit log file is organised in sections. Each section starts on a new line and can span multiple lines. Sections are separated by a line specifying the transaction ID and the section code. A description of each section can be found here: https://www.feistyduck.com
--38c0471f-A--
[04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs x.x.x.x 46351 x.x.x.x 443
--38c0471f-B--
GET /sitemap.xml HTTP/1.1
Host: 155.198.29.141
Accept-Encoding: identity
For performance reasons, we decided to read the audit file line by line in R. This means that our algorithm acts pretty much as a “human” reader would: a set of functions “understands” the content of the current line given the information retrieved in the previous lines. For example, if the previous line was
--38c0471f-A--
[04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs x.x.x.x 46351 x.x.x.x 443
then the reader knows which section it is in and what kind of information to expect next. Using regular expressions, we created a set of functions that validate the type of the current line and extract the relevant content.
1.a Check and extract the content of the Section header and the Audit Log Header (A)
In order to check for the presence of a specific text pattern, or to extract text matching a pattern, we use regular expressions. We imported the package “stringr”, a great library that makes our life easier. The first function below evaluates whether the current line announces a new section. The second function extracts the transaction ID. The third function extracts the section type. The function “auditLogGetTransactionDate” retrieves the date of the transaction as an R Date object. The last function retrieves the source and destination IPs and ports as two strings (i.e. [“ipSource portSource”, “ipDest portDest”]).

# Load Regular Expression Library
require('stringr')

# Returns TRUE if the line is an Audit Log section header
auditLogIsTransaction <- function(line){
  return(str_detect(line, "--[:xdigit:]{8}-[ABCDEFGHIJKZ]--"))
}

# Returns the transaction ID from the Audit Log section header
auditLogGetTransactionId <- function(line){
  return(str_extract(line, "[:xdigit:]{8}"))
}

# Returns the section type from the Audit Log section header
auditLogGetTransactionSection <- function(line){
  return(gsub("-", "", str_extract(line, "-([ABCDEFGHIJKZ])-")))
}

# Returns the transaction date as an R Date object
auditLogGetTransactionDate <- function(line){
  extract <- str_extract(line, "[0123][0-9]/(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/20(17|18|19|20)")
  extractDate <- as.Date(extract, "%d/%b/%Y")
  return(extractDate)
}

# Returns the IP address/port pairs (i.e. ["ipSource portSource", "ipDest portDest"])
auditLogGetIPAddressesAndPorts <- function(line){
  return(str_extract_all(line, "[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\s[0-9]{2,5}"))
}
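As a quick sanity check, these helpers can be applied to the two sample lines shown earlier. The IP addresses below are illustrative placeholders (the real addresses are anonymised in this post), and the expected results are shown as comments:

# Sample lines (the IP addresses are illustrative placeholders)
headerLine <- "--38c0471f-A--"
auditLine  <- "[04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs 192.0.2.10 46351 192.0.2.20 443"

auditLogIsTransaction(headerLine)          # TRUE
auditLogGetTransactionId(headerLine)       # "38c0471f"
auditLogGetTransactionSection(headerLine)  # "A"
auditLogGetTransactionDate(auditLine)      # "2018-04-04" (in an English locale)
auditLogGetIPAddressesAndPorts(auditLine)  # [[1]]: "192.0.2.10 46351" "192.0.2.20 443"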
In our example, we can now recognize and extract the relevant pieces of information (transaction ID, section type, date, IP addresses and ports):
--38c0471f-A--
[04/Apr/2018:15:35:55 +0000] WsTw20sPICSQr9pUrF5lpQAAAAs 66.x.x.x 46351 x.x.x.x 443
--38c0471f-B--
GET /sitemap.xml HTTP/1.1
Host: 155.198.29.141
Accept-Encoding: identity
We now need to extract the bit of information explaining why the HTTP request was deemed potentially malicious by Mod Security.
1.b Check and extract OWASP Rule information in Section H
Section H of the audit log gives the reason why an HTTP request was blocked: it contains both the rule's ID and the rule's description.
--38c0471f-H--
Message:... [id "920350"] [rev "2"] [msg "Host header is a numeric IP address"] ...
Apache-Error: [file "apache2_util.c"] ... [client %s] ModSecurity: %s%s [uri "%s"]%s
Stopwatch: 1522856155328261 2511 (- - -)
Stopwatch2: 1522856155328261 2511; combined=1679,..., sr=122, sw=0, l=0, gc=0
Response-Body-Transformed: Dechunked
Producer: ModSecurity for Apache/2.9.1 (http://www.modsecurity.org/); OWASP_CRS/3.0.2.
Server: Apache
Engine-Mode: "ENABLED"
If the last section header read by the algorithm was of type "H", then the current line could potentially contain the rule's ID and the rule's message. Testing for the presence of the rule's ID, and extracting it, is much easier than doing the same for the rule's message.
The two following functions test whether the line contains a rule ID and extract that ID.
# Returns TRUE if the line contains a rule ID
auditLogHasErrorCodeMessage <- function(line){
  require('stringr')
  return(str_detect(line, '\\[id "\\d{3,10}"\\]'))
}

# Returns the ID of the rule
auditLogGetErrorCodeMessage <- function(line){
  require('stringr')
  return(str_extract(str_extract(line, '\\[id "[0-9]{3,10}"\\]'), "[0-9]{3,10}"))
}
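Applied to the "Message:" line of the section H excerpt above, the helpers behave as follows (a short sketch using an abbreviated copy of that line):

# An abbreviated copy of the "Message:" line from section H above
messageLine <- 'Message:... [id "920350"] [rev "2"] [msg "Host header is a numeric IP address"] ...'

auditLogHasErrorCodeMessage(messageLine)  # TRUE
auditLogGetErrorCodeMessage(messageLine)  # "920350"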
1.c Read *and make sense of* the current Line
Now we put all the functions described above together in order to make sense of the current line. A data frame called "data" acts as the memory of the reader. At the first line, "data" is NULL and a new data frame needs to be created. Line by line, the reader gains new information that is stored in "data", and the information already stored in "data" helps the reader understand the next line.

processAuditLogLine <- function(fileLine, data){
  # Create structured storage if absent
  if(is.null(data)){
    data <- data.frame(txId = NA, txSection = NA, txDate = NA,
                       txSourceIp = NA, txSourceIpLocation = NA, txSourcePort = NA,
                       txDestIp = NA, txDestPort = NA, txRule = NA)
  }
  # Check if the line is a section header
  if(auditLogIsTransaction(fileLine)){
    # Retrieve the transaction ID
    txId <- auditLogGetTransactionId(fileLine)
    # Retrieve the section type
    txSection <- auditLogGetTransactionSection(fileLine)
    if(txSection == 'A'){
      # A new transaction starts: add a new row
      data <- rbind(data, data.frame(txId = txId, txSection = txSection, txDate = NA,
                                     txSourceIp = NA, txSourceIpLocation = NA, txSourcePort = NA,
                                     txDestIp = NA, txDestPort = NA, txRule = NA))
    }else{
      # Retrieve the current data row and update its section
      txRow <- dim(data)[1]
      data[txRow, 'txSection'] <- txSection
    }
  }else{
    # Retrieve the current data row
    txRow <- dim(data)[1]
    # Check that the line is not empty
    if(fileLine != ''){
      # Retrieve the current section type
      currentSection <- data[txRow, 'txSection']
      # Check if the section describes a new HTTP request (A)
      if(currentSection == 'A'){
        # Retrieve the date
        data[txRow, 'txDate'] <- auditLogGetTransactionDate(fileLine)
        # Retrieve the IP addresses and ports
        ipAddressesArr <- unlist(auditLogGetIPAddressesAndPorts(fileLine))
        source <- unlist(strsplit(ipAddressesArr[1], " "))
        dest <- unlist(strsplit(ipAddressesArr[2], " "))
        # Store the IP addresses and ports in data
        data[txRow, 'txSourceIp'] <- source[1]
        data[txRow, 'txSourcePort'] <- source[2]
        data[txRow, 'txDestIp'] <- dest[1]
        data[txRow, 'txDestPort'] <- dest[2]
      }
      # Check if the section gives information about the rule (H)
      if(currentSection == 'H'){
        # Only lines containing a rule ID are of interest here
        if(auditLogHasErrorCodeMessage(fileLine)){
          data[txRow, 'txRule'] <- auditLogGetErrorCodeMessage(fileLine)
        }
      }
    }
  }
  return(data)
}
1.d Read the file
So far, we have seen how to recognise the information and extract it in R. We have also seen how, in a single function, we can read a line and make sense of it given the knowledge previously gained from the document. The function below shows how the file is actually read.
For every line of the file, we execute the function passed as the "method" argument and pass "data" as a parameter. Note that "data" is first declared as NULL.
processLogFile <- function(filepath, method) {
  con <- file(filepath, "r")
  data <- NULL
  while(TRUE){
    line <- readLines(con, n = 1)
    if(length(line) == 0){
      break
    }
    # Execute method on the current line
    data <- method(line, data)
  }
  close(con)
  return(data)
}
Now to get a data frame containing all the transactions of the audit log we can call:
# File location
pathAudit <- 'C:/Temp/Logs/modsec_audit_20181001.log'

# Parse/read the audit log
data <- processLogFile(pathAudit, processAuditLogLine)

# Display data
head(data, 9)
The first 9 rows are displayed below. Of course, the source IP addresses have been changed to protect the anonymity of our attackers, and ours (a bit of GDPR compliance would not hurt!).
| | txId | txSection | txDate | txSourceIp | txSourceIpLocation | txSourcePort | txDestIp | txDestPort | txRule |
|---|---|---|---|---|---|---|---|---|---|
| 1 | bde4ba7f | Z | 17801 | 184.x.x.x | NA | 41434 | x.x.x.x | 443 | 920350 |
| 2 | b6c7467b | Z | 17801 | 198.x.x.x | NA | 22434 | x.x.x.x | 443 | 920440 |
| 3 | ad702d79 | Z | 17802 | 78.x.x.x | NA | 15438 | x.x.x.x | 443 | 920280 |
| 4 | 5efba84c | Z | 17802 | 139.x.x.x | NA | 40714 | x.x.x.x | 443 | 920350 |
| 5 | d358e435 | Z | 17802 | 220.x.x.x | NA | 2580 | x.x.x.x | 443 | 930130 |
| 6 | 6f227755 | Z | 17802 | 220.x.x.x | NA | 54458 | x.x.x.x | 443 | 920350 |
| 7 | c207da7e | Z | 17802 | 220.x.x.x | NA | 50019 | x.x.x.x | 443 | 920280 |
| 8 | 30e96b28 | Z | 17802 | 198.x.x.x | NA | 54382 | x.x.x.x | 443 | 920280 |
| 9 | bdf1496c | Z | 17802 | 60.x.x.x | NA | 58764 | x.x.x.x | 443 | 920350 |
In this table we can see that all sections have been read up to section Z in the column "txSection". That is because section Z announces the end of a transaction/HTTP request. The dates in the column txDate are stored as the number of days since 1970-01-01.
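If a human-readable date is preferred, the numeric txDate column can be converted back to an R Date, for example:

# Convert the numeric day count back to an R Date
data$txDate <- as.Date(data$txDate, origin = "1970-01-01")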
We already have a lot of information but we would also like to retrieve the country where each Source IP address is registered.
2.a Add Additional Insight from an API (i.e. IP Address Geolocation)
We would like to extend the data set obtained previously by adding the geolocation of each source IP address. In order to do so, we fetch the country names from an online IP geolocation API.
In this tutorial we use the API provided by ipstack (http://api.ipstack.com/), which worked really well. You will need to replace API_Key with your own key in order to use the API. The free account provides 10,000 API requests per month.
retrieveIPsLocations <- function(threatsData){
  # Library
  require('jsonlite')
  # Find the unique source IPs
  unIps <- unique(threatsData$txSourceIp)
  countriesForIpsDataFrame <- data.frame(txSourceIp = unIps,
                                         txSourceCountry = rep(NA, length(unIps)))
  # Using the https://ipstack.com API
  API_Key <- "########################"
  countriesForIpsDataFrame$txSourceCountry <- unlist(apply(countriesForIpsDataFrame, 1, FUN = function(x){
    # API URL for a specific IP
    URL <- paste("http://api.ipstack.com/", x[1], "?access_key=", API_Key, sep = "")
    jsonObj <- fromJSON(URL)
    if(is.null(jsonObj) || is.null(jsonObj$country_name))
      return(NA)
    else
      return(jsonObj$country_name)
  }))
  threatsData <- merge(threatsData, countriesForIpsDataFrame, by = 'txSourceIp')
  return(threatsData)
}
We can think of other API uses here. For example, we could check which IP addresses appear on a blacklist; many online APIs provide this information for a small fee.
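As an illustration only, such a check could follow the same pattern as retrieveIPsLocations. The endpoint, the "listed" response field and the function name below are hypothetical placeholders, not a real service:

# Hypothetical sketch: flag source IPs found on a blacklist.
# "https://blacklist.example.com/check/" and the "listed" field are placeholders,
# not a real API; adapt them to whatever reputation service you actually use.
retrieveIPsBlacklistStatus <- function(threatsData){
  require('jsonlite')
  unIps <- unique(threatsData$txSourceIp)
  blacklistDataFrame <- data.frame(txSourceIp = unIps,
                                   txSourceBlacklisted = rep(NA, length(unIps)))
  blacklistDataFrame$txSourceBlacklisted <- unlist(apply(blacklistDataFrame, 1, FUN = function(x){
    URL <- paste("https://blacklist.example.com/check/", x[1], sep = "")
    jsonObj <- fromJSON(URL)
    if(is.null(jsonObj) || is.null(jsonObj$listed)) return(NA) else return(jsonObj$listed)
  }))
  return(merge(threatsData, blacklistDataFrame, by = 'txSourceIp'))
}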
3. Analyzing the Data
By running the code above we should obtain one row per blocked HTTP request. Each row stores, among other things, the source country, the date and the security rule that was triggered.
It is time to unleash our data analysis skills!
3.a Analyzing the Geography of the attacks
The function below aggregates the blocked requests by source country and plots the counts on a world map.
mapThreats <- function(threathDataset){
  require('rworldmap')
  require('dplyr')
  # Aggregate by country
  threathDataset <- threathDataset %>%
    group_by(txSourceCountry) %>%
    summarise(ThreatsCount = n()) %>%
    arrange(desc(ThreatsCount))
  spMapDF <- joinCountryData2Map(threathDataset,
                                 joinCode = "NAME",
                                 nameJoinColumn = "txSourceCountry")
  mapCountryData(spMapDF, nameColumnToPlot = "ThreatsCount")
}

# Visualise the threat count by country
mapThreats(threats)
For our audit log we got, *unsurprisingly*, the following graph:
The rule 920350 prevents remote clients from accessing the web server by its IP address alone. While it is a really good rule for protection against malicious web vulnerability scanners, it is prone to false positives. Imagine, for example, that Google is trying to index your website using its IP address.
In order to get a clearer picture, we filter out the requests that were blocked by rule 920350 and keep only the others. We use the "dplyr" package for filtering, but any other package (or none) could be used instead.
# Filter out the numeric-IP rule
require('dplyr')
threats <- threats %>% filter(txRule != 920350)

# Visualise the threat count by country
mapThreats(threats)
We can see that most requests have been filtered out: the country with the highest number of blocked requests drops from 2,253 requests to just 102. The results, however, are consistent. Most of the requests originate from the USA, China, Sweden, Poland, Romania, France, Germany and the Netherlands.
3.b Analyzing the Type of attack
From section 3.a it is obvious that by far the most frequently triggered security rule is 920350. We will now visualize the other rules in a bar plot.
barPlotModSecRules <- function(threatsDataset){
  require('ggplot2')
  require('dplyr')
  # Count the blocked requests per rule
  rules <- as.data.frame(threatsDataset %>%
                           group_by(txRule) %>%
                           summarise(ThreatsCount = n()) %>%
                           arrange(desc(ThreatsCount)))
  rules$txRule <- as.factor(rules$txRule)
  # Bar plot of the ten most frequent rules
  ggplot(data = rules[1:10,], aes(x = txRule, y = ThreatsCount)) +
    geom_bar(stat = "identity", width = 0.5)
}

# Visualize the rules in a bar plot
barPlotModSecRules(threats)
The description and count of the rules in the bar plot are given below:
| Rule | Message | Warnings |
|---|---|---|
| 920350 | Host header is a numeric IP address (could be search engines) | 3723 |
| 930130 | Restricted File Access Attempt | 98 |
| 920280 | Request Missing a Host Header | 95 |
| 980130 | Anomaly score correlation rule | 42 |
| 941100 | XSS Attack Detected via libinjection | 34 |
| 920210 | Multiple/Conflicting Connection Header Data Found | 28 |
| 920440 | URL file extension is restricted by policy | 24 |
| 913100 | Found User-Agent associated with security scanner | 18 |
| 920170 | GET or HEAD Request with Body Content | 6 |
| 920100 | Invalid HTTP Request Line | 4 |
| 930110 | Path Traversal Attack (/../) | 3 |
| 933160 | PHP Injection Attack: High-Risk PHP Function Call Found | 3 |
| 920340 | Request Containing Content, but Missing Content-Type header | 1 |
| 932150 | Remote Command Execution: Direct Unix Command Execution | 1 |
It seems that attackers mainly attempt to access restricted files. In total, over the period covered by the log, 95 requests were made without a proper Host header; these can be the work of scanners or of misconfigured HTTP clients (false positives).
3.c Analyzing the chronology of attacks
Using our data frame, we can visualize the chronology of the attacks and search for seasonal patterns. In addition, we can look for trends or stationarity.
The code below counts the number of blocked requests per day, fills in the days with no blocked requests, and plots the resulting time series.
require('dplyr')

# Count blocked HTTP requests for each day in our dataset
dataAtHand <- threats %>% group_by(txDate) %>% summarise(count = n()) %>% arrange(txDate)

# txDate is stored as the number of days since 1970-01-01: convert it back to a Date
dataAtHand$txDate <- as.Date(dataAtHand$txDate, origin = "1970-01-01")

# Create a dataset with all days between the first and last day of the log
allDates <- data.frame(txDate = seq(min(dataAtHand$txDate), max(dataAtHand$txDate), by = "days"))

# Merge the data frame containing all the dates with the data frame containing the counts
datesData <- merge(allDates, dataAtHand, by = "txDate", all.x = TRUE) %>% arrange(txDate)

# Set the count to zero for days without blocked requests
datesData[is.na(datesData$count), 'count'] <- 0

# Plot the time series
par(mar = c(8, 4.1, 4.1, 2.1))
plot(ts(datesData$count), axes = FALSE, ylab = "Warnings", xlab = "")
title(xlab = "Time", line = 6, cex.lab = 1.2)
labelsIndexes <- seq(1, length(datesData$txDate), by = 30)
axis(1, at = labelsIndexes, labels = datesData$txDate[labelsIndexes], las = 3)
axis(2)
The code above returns the following time series plot:
No clear seasonality or trend can be found by inspecting the graph visually. Formal statistical tests could be used to assess whether a seasonal or trend effect is present.
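As a minimal sketch of such a test (not part of the original analysis), base R's Box.test can check the daily counts for serial dependence, which a trend or weekly seasonality would induce; the lag of 7 targets weekly effects and is an assumption:

# Ljung-Box test on the daily counts; lag = 7 targets weekly effects (an assumption)
dailyCounts <- ts(datesData$count)
Box.test(dailyCounts, lag = 7, type = "Ljung-Box")
# A small p-value would hint at serial dependence worth investigating further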
4. Conclusion
It is difficult to get an overview of the threats by looking at the raw log files. With the code above we can not only get an overview of the attacks but also analyse our attackers and their behaviour: we can monitor the timing of their attacks, the type of attack and their geographic location.
The resulting data set can be used to test statistical hypotheses and to improve our defences by understanding our attackers.