MICAD: A new algorithm/R package for anomaly detection
Overview
Anomaly detection algorithms are core to many fraud and security applications and business solutions. Identifying cases where specific values fall outside the norm is useful for outlier detection (as a precursor to predictive modeling) and for finding cases of interest when labeled data is not available for supervised learning. For example, an insurance company might run anomaly detection against a claims database in the hope of identifying potentially fraudulent (anomalous) claims. If the medical bills for a personal injury claim are anomalously high given the other characteristics of the claim, that claim can be reviewed further by a claims adjuster. Finally, these newly human-labeled claims could be used to train a supervised model that predicts fraud.
The Most Common Approach to Anomaly Detection
Probably the most common approach to identifying anomalies is a case-wise, value-by-value comparison to peer group averages. For example, we can take a personal injury claim and compare its medical bill total to the average medical bill total of its peer claims. If the claim has one or more extreme variable values when compared to the cluster distribution for the same variable, it can be considered an outlier. Here’s some pseudocode for a naive modeling algorithm based on this approach:
cluster cases;
for each cluster
    for each variable
        calculate variable average;
        calculate variable standard deviation;
    endfor;
endfor;
for each case
    score case using cluster model to determine appropriate cluster;
    for each variable
        if abs(case variable - cluster average for variable) > 4.0 * cluster stddev for variable
            anomaly score += variable weight;
        endif;
    endfor;
endfor;
High scores indicate anomalies. Supplying variable weights to the above algorithm lets you tune the overall score, so that (subjectively) important variables contribute more heavily to the total. Once a supervised model is built, these weights can be tuned using the model's variable importance measures.
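As a rough illustration, the naive approach can be sketched in base R. This is only a sketch under some assumptions of mine: the function name naive_anomaly_score is made up, k-means with three clusters stands in for the clustering step, and the equal weights are arbitrary.

```r
# Naive peer-group anomaly score (illustrative sketch, not part of MICAD).
# x:        data frame or matrix of numeric variables
# cluster:  cluster (peer group) label for each case
# weights:  one weight per variable, in column order
naive_anomaly_score <- function(x, cluster, weights, threshold = 4.0) {
  x <- as.matrix(x)
  stopifnot(length(cluster) == nrow(x), length(weights) == ncol(x))
  score <- numeric(nrow(x))
  for (cl in unique(cluster)) {
    rows <- which(cluster == cl)
    for (j in seq_len(ncol(x))) {
      mu <- mean(x[rows, j])
      sigma <- sd(x[rows, j])
      # Skip single-case clusters and variables with no spread.
      if (is.na(sigma) || sigma == 0) next
      # Flag values more than `threshold` standard deviations from the mean.
      extreme <- abs(x[rows, j] - mu) > threshold * sigma
      score[rows[extreme]] <- score[rows[extreme]] + weights[j]
    }
  }
  score
}

# Assumed setup: k-means on the scaled iris measurements, equal weights.
set.seed(42)
measurements <- iris[, 1:4]
clusters <- kmeans(scale(measurements), centers = 3, nstart = 10)$cluster
scores <- naive_anomaly_score(measurements, clusters, weights = c(10, 10, 10, 10))
table(scores)
```

On a small, clean data set, a threshold of 4 standard deviations may flag few or no cases. The threshold argument is exposed so you can experiment with lower values.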
Again, this approach is pretty naive, and it faces several challenges:
- How do we handle nominal/unordered (factor) variables?
- What if the data distributions are strongly skewed?
- What about variable interactions? Might outlier values be perfectly predictable (and normal) if we included variable interactions?
MICAD
MICAD is an attempt to improve upon the above naive approach. The simplest explanation of MICAD is:
Multiple imputation comparison anomaly detection (MICAD) is an algorithm that compares the imputed (or predicted) value of each variable to the actual value. If the predicted value != the actual value, the anomaly score is incremented by the variable weight.
Imputation of values is done using a random forest (or a similar predictive model). The predictors are the remaining variables in the case. For example, using the Iris data set, we can impute Sepal Length from Species, Petal Length, Petal Width and Sepal Width.
Here is the pseudocode for MICAD:
# data preparation
for each variable
    if (type of variable is numeric)
        convert variable to quartile;
    endif;
endfor;
# model building
for each variable
    build randomForest classification model to predict variable using remaining variables;
    store randomForest model;
endfor;
# model scoring
for each variable
    retrieve randomForest model;
    score randomForest model for all cases;
    if (predicted class != actual class)
        anomaly score += variable weight;
    endif;
endfor;
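To make the three steps concrete, here is a minimal R sketch. It is not the code inside the micad package. The helper names (quartile_breaks, bin_data, micad_sketch), the choice of 200 trees, and the hand-made odd case are all my own assumptions, and the demo only runs if the randomForest package is installed.

```r
# Data preparation: quartile breaks learned from the training data.
# Outer breaks are opened to +/-Inf so unseen extreme values still get a bin.
quartile_breaks <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  c(-Inf, unique(q), Inf)
}

# Replace each numeric variable named in `breaks` with its quartile bin.
bin_data <- function(data, breaks) {
  for (v in names(breaks)) {
    data[[v]] <- cut(data[[v]], breaks = breaks[[v]])
  }
  data
}

# weights: named numeric vector, one entry per variable to model.
micad_sketch <- function(train, new, weights, ntree = 200) {
  vars <- names(weights)
  numeric_vars <- vars[vapply(train[vars], is.numeric, logical(1))]
  breaks <- lapply(train[numeric_vars], quartile_breaks)
  train_b <- bin_data(train[vars], breaks)
  new_b <- bin_data(new[vars], breaks)

  score <- numeric(nrow(new_b))
  for (v in vars) {
    predictors <- setdiff(vars, v)
    # Model building: classify variable v from the remaining variables.
    model <- randomForest::randomForest(x = train_b[predictors],
                                        y = train_b[[v]], ntree = ntree)
    # Model scoring: a mismatch adds the variable's weight to the score.
    predicted <- predict(model, newdata = new_b[predictors])
    mismatch <- as.character(predicted) != as.character(new_b[[v]])
    score <- score + weights[[v]] * mismatch
  }
  score
}

if (requireNamespace("randomForest", quietly = TRUE)) {
  set.seed(1)
  # Hypothetical odd case: a "setosa" with virginica-sized petals.
  odd <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                    Petal.Length = 6.0, Petal.Width = 2.3,
                    Species = factor("setosa", levels = levels(iris$Species)))
  new_cases <- rbind(iris, odd)
  w <- c(Sepal.Length = 10, Sepal.Width = 10, Petal.Length = 10,
         Petal.Width = 10, Species = 20)
  scores <- micad_sketch(train = iris, new = new_cases, weights = w)
  print(tail(scores))
}
```

Learning the quartile breaks on the training data and reusing them for the scored data keeps the class labels consistent between the two, which is one possible design choice rather than a requirement of the algorithm.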
An Example Using an Appended Iris Data Set
Downloading & Installing MICAD
install.packages("devtools")
library("devtools")
install_github("smutchler/micad/micad")
library("micad")
Loading the Appended Iris Data Set
data(iris_anomaly)
Building the MICAD S3 Model
micad.model <- micad(x=iris_anomaly[iris_anomaly$ANOMALY==0,],
                     vars=c("SEPAL_LENGTH","SEPAL_WIDTH",
                            "PETAL_LENGTH","PETAL_WIDTH",
                            "SPECIES"),
                     weights=c(10,10,10,10,20))
print(micad.model)
We build the model while excluding the anomaly records. We do this because Iris is a small data set, so a few anomaly records would have a large impact on the models being built. In production data sets, the effects of a few anomaly records will likely be much smaller.
The weights are initially driven by subject matter expertise. Once a supervised model can be built, the weights could be adjusted using its variable importances.
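One simple way to turn variable importances into weights is to rescale them so the total weight stays the same. The sketch below assumes that approach; the rescale_weights helper is hypothetical and the importance values are made-up numbers for illustration, not output from any real model.

```r
# Rescale importances so the new weights sum to the same total as before.
rescale_weights <- function(importance, total) {
  total * importance / sum(importance)
}

initial_weights <- c(SEPAL_LENGTH = 10, SEPAL_WIDTH = 10, PETAL_LENGTH = 10,
                     PETAL_WIDTH = 10, SPECIES = 20)

# Made-up importances standing in for a supervised model's output.
importance <- c(SEPAL_LENGTH = 2, SEPAL_WIDTH = 1, PETAL_LENGTH = 4,
                PETAL_WIDTH = 3, SPECIES = 10)

new_weights <- rescale_weights(importance, total = sum(initial_weights))
print(new_weights)
```

Keeping the total fixed means anomaly scores from before and after the adjustment stay on a comparable scale.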
Scoring the Iris Anomaly Data Set
scored.data <- predict(micad.model, iris_anomaly)
tail(scored.data)
In the output, the last 4 cases are labeled ANOMALY = 1, and the appended A$_SCORE column shows high aggregate scores for these anomaly cases.