Isolation Forests and unsupervised outlier detection

Introduction

This is the next article in my collection of blogs on anomaly detection. This time we will be taking a look at unsupervised learning using the Isolation Forest algorithm for outlier detection. I’ve mentioned this before, but this time we will look at some of the details more closely. Let’s start by framing our problem.

Goal

This is very similar to the description here. We want to identify outliers in our dataset and previously we have used an ensemble of predictor classes to do this. That approach involved building a collection of models. Each model in the collection predicts the value of a single variable based on all the other variables. Doing this in a round-robin fashion gives us X models for X variables. Training these models and combining their errors in effect gives us a measure of outlier-ness.

Think of a group of friends regularly going out for dinner together. They meet at different times of year and frequent different restaurants based on a number of factors (climate, public holidays, birthdays etc.). What they order – as a group – will vary accordingly. We might think of the different occasions as forming “modes” or clusters, and we can expect that orders from the menu will reflect that. So in this context an “outlier” might be if some members of the group order something that is totally out-of-season – out of mode, if you like.

Divide and …cluster

We can achieve the same result using an Isolation Forest algorithm, although it works slightly differently. It partitions the data randomly. The fewer partitions that are needed to isolate a particular data point, the more anomalous that point is deemed to be (as it is easier to partition off, or isolate, from the rest). Loosely speaking, an Isolation Forest treats the data as one big cluster and measures how far individual points lie from the cluster’s center of gravity, although it does so by counting splits rather than by computing distances. The model does not attempt to make a judgement about what constitutes an outlier. That responsibility is left to the observer, but can easily be achieved by setting up some kind of statistical boundary and checking when that is crossed.
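To make the "fewer partitions means more anomalous" idea concrete, here is a minimal pure-Python sketch. It is not how H2O implements the algorithm: the function names (isolation_depth, mean_depth) and the toy data are my own assumptions, and for brevity it only follows the branch containing the point of interest rather than building full trees on sub-samples.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=20):
    """Count the random splits needed to isolate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Pick a random feature and a random split value within its range.
    feature = rng.randrange(len(point))
    low = min(row[feature] for row in data)
    high = max(row[feature] for row in data)
    if low == high:
        return depth  # simplification: give up if this feature cannot split
    split = rng.uniform(low, high)
    # Keep only the rows that fall on the same side of the split as `point`.
    same_side = [row for row in data
                 if (row[feature] < split) == (point[feature] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random partitionings."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Toy data (hypothetical): one tight cluster plus a single far-away point.
rng = random.Random(42)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

inlier_depth = mean_depth(cluster[0], data)
outlier_depth = mean_depth(outlier, data)
print(f"inlier: {inlier_depth:.1f} splits, outlier: {outlier_depth:.1f} splits")
```

The far-away point should need noticeably fewer splits on average than a point inside the cluster, which is exactly the signal the forest uses.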

Let’s try and extend our analogy. This time, our friends frequent a single restaurant on a particular day of the year, every year. Their menu orders therefore form a single cluster. Now, outlier behaviour can be measured by how far a particular evening’s orders depart from an observed “average”.

Cluster coverage

One observation as an aside. It is reasonable to expect that both approaches are only as good as their “cluster coverage” – that is to say, the extent to which training data includes data from all expected modes of behaviour. The ensemble of class predictors might be less susceptible to this as it may still be able to capture as-yet-unseen-but-legitimate behaviour based on patterns between variables that are preserved across all modes. An Isolation Forest will have no choice but to allocate unseen data based on splits defined on known data.

Data

We need some data to train our model. I am going to use data from a Balluff Condition Monitor sensor once more. There are not many features, but we can augment them by introducing some statistics taken over a rolling window. I have added standard deviation and variance to the average for each of the 4 features, so we now have 12 measurements in all (4 x {average, stdev, variance}). Everything is nicely numerical and also non-categorical. I haven’t normalized or standardized the data, as that is not necessary for a decision tree-based algorithm. And note that we deliberately remove the timestamp. I just want to check the shape of our data and the relationship of features to one another, without introducing any time-series element into the analysis.
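As a rough illustration of the rolling-window augmentation, the sketch below computes the average, standard deviation and variance over a sliding window using only the standard library. The window size, the sample readings and the use of sample (rather than population) statistics are all assumptions on my part; the real features were derived from the sensor data and may have been computed differently.

```python
from collections import deque
from statistics import mean, stdev, variance

def rolling_features(values, window):
    """Yield (avg, std, var) for each full sliding window of readings."""
    buf = deque(maxlen=window)
    for value in values:
        buf.append(value)
        if len(buf) == window:
            # Sample statistics (n - 1 denominator); an assumption.
            yield mean(buf), stdev(buf), variance(buf)

# Hypothetical temperature readings, not taken from the real sensor.
temperature = [43.0, 43.5, 44.0, 43.5, 43.0]
features = list(rolling_features(temperature, window=3))
for avg, std, var in features:
    print(f"avg={avg:.3f} std={std:.3f} var={var:.3f}")
```

Repeating this for each of the four raw features gives the twelve columns described above.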

Model definition

The following example uses the H2O library, but one could equally well use the algorithm available with scikit-learn.

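import h2o
from h2o.estimators import H2OIsolationForestEstimator

# start (or connect to) a local H2O cluster
h2o.init()
#-------------------------------
# where df is a pandas DataFrame
#-------------------------------
dff = h2o.H2OFrame(df)
# hold back roughly 10% of the rows for testing
df_train, df_test = dff.split_frame(ratios=[0.9])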
#-----------------------------------------------------------------------
# see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html
#-----------------------------------------------------------------------
model = H2OIsolationForestEstimator(
    model_id='model_name_goes_here',
    sample_rate = 0.1,
    max_depth = 20,
    ntrees = 40
)
model.train(training_frame=df_train)
predictions = model.predict(df_test)

I initialize the model, setting just a few of the available parameters, and then run a prediction over our test data set. The prediction frame contains two columns: predict, a normalized anomaly score, and mean_length, the average number of partitions needed to isolate the record from all other records. Remember that the lower mean_length is, the more anomalous the data (and the higher the predict score). The next part is a bit trial-and-error as we want to set a limit on this value. However, since an Isolation Forest does not attempt to say what “normal” data looks like, we are free to do that after the model has been trained. We can do this easily in H2O by computing a quantile of the scores at a particular level, like this:

quantile = 0.999
quantile_frame = predictions.quantile([quantile])

We can then derive a classification for each data point based on whether it is above or below this threshold:

threshold = quantile_frame[0, "predictQuantiles"]
predictions["predicted_class"] = predictions["predict"] > threshold
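The same thresholding logic can be sketched without H2O. The snippet below is only an illustration: the quantile function uses simple linear interpolation (H2O's own quantile computation may differ in detail), and the scores are made-up values in which one record stands out.

```python
def quantile(values, q):
    """Linearly interpolated quantile of `values` for 0 <= q <= 1."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

# Hypothetical anomaly scores; higher means more anomalous.
scores = [0.05, 0.07, 0.04, 0.06, 0.05, 0.08, 0.06, 0.05, 0.07, 0.90]
threshold = quantile(scores, 0.9)

# Flag every record whose score exceeds the chosen quantile.
flags = [score > threshold for score in scores]
print(threshold, flags)
```

Raising the quantile level flags fewer records; lowering it flags more.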

I’d like the outliers to have a higher anomaly score, so we simply take the inverse of the branch-count:

dfpred = predictions.as_data_frame()
dfpred['err'] = (1 / dfpred['mean_length']).abs()

Plotting these values for a subset of our data (I’ve taken 400 samples) will yield something like this:

We can see a clear peak to the left, and some mid-range flutter off over to the right. As a next step we could set our threshold value according to our use-case and/or observations and then use this value as the basis for triggering alerts.

Further visualization

If we had 2 or 3 dimensions, we could plot this data and see if the visual presentation helps us. We have 12 features, but we can project these onto fewer dimensions using Principal Component Analysis (PCA). Doing this in three dimensions gives us this:
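For readers who want to see what the projection involves, here is a small pure-Python sketch of PCA using power iteration with deflation. It is not the code used to produce the plots, and in practice a library implementation (for example scikit-learn's PCA) would be the sensible choice. The helper names and the three-dimensional toy data are assumptions made for illustration.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def covariance_matrix(rows):
    """Sample covariance matrix of rows that are already mean-centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenvector(matrix, rng, iterations=500):
    """Power iteration: repeatedly apply the matrix and renormalize."""
    v = [rng.uniform(-1, 1) for _ in matrix]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = dot(w, w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def pca_project(rows, n_components=2, seed=0):
    """Project rows onto their first `n_components` principal components."""
    rng = random.Random(seed)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = covariance_matrix(centered)
    components = []
    for _ in range(n_components):
        v = leading_eigenvector(cov, rng)
        eigenvalue = dot(v, [dot(row, v) for row in cov])
        components.append(v)
        # Deflate: remove the variance already explained by this component.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    projected = [[dot(r, c) for c in components] for r in centered]
    return projected, components

# Toy data (hypothetical): three features, two of them strongly correlated.
rng = random.Random(1)
rows = []
for _ in range(200):
    t = rng.uniform(-1, 1)
    rows.append([t, 2 * t + rng.gauss(0, 0.1), rng.gauss(0, 0.3)])

projected, components = pca_project(rows)
print(projected[0])
```

Each row of projected holds the two coordinates that would be plotted, and the points could then be coloured by their predicted class.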

Interpretation and application

The outliers are shown in red, and the rest in green. We could set our quantile threshold according to our use-case. Take online retail fraud detection, for example. Here, a low threshold (i.e. more outliers) may be appropriate if it is used just to add friction to the process. We can add friction by using a looks-like-fraud score to limit the payment options, without blocking the user entirely (for which we would use a high threshold). Adding friction introduces low-level inconvenience without losing business, and it may be better to let a few suspect cases slide than to penalise innocent shoppers (who may never return).
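The two-threshold idea from the fraud example could be sketched as follows. The function name, the action labels and the threshold values are purely hypothetical; the only assumption carried over from the model is that a higher score looks more like fraud.

```python
def payment_action(score, friction_threshold, block_threshold):
    """Map an anomaly score to an action; higher scores look more like fraud."""
    if score > block_threshold:
        return "block"
    if score > friction_threshold:
        return "limit_payment_options"
    return "allow"

# Hypothetical thresholds: a low one for friction, a high one for blocking.
for score in (0.10, 0.55, 0.95):
    print(score, payment_action(score, friction_threshold=0.5, block_threshold=0.9))
```

In practice both thresholds would be derived from quantiles of the observed scores rather than hard-coded.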

We can repeat the PCA in two dimensions which makes the picture even clearer:

H2O gives us a nice option for deploying a model like this. We can export the model, either as a Java class (export as POJO), or as a resource binary (export as MOJO). The former option may be appropriate for small models. For flexibility, however, it is better to access the model from outside of your Java code, so that the model can be swapped out by simply changing the property defining its path. Loading and scoring with a MOJO would look something like this:

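// imports needed by the snippet below (from the h2o-genmodel library)
import hex.genmodel.ModelMojoReader;
import hex.genmodel.MojoModel;
import hex.genmodel.MojoReaderBackend;
import hex.genmodel.MojoReaderBackendFactory;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.AnomalyDetectionPrediction;
import java.net.URL;

        // load the MOJO from the classpath and wrap it for row-wise scoring
        URL mojoUrl = this.getClass().getClassLoader().getResource("model/IFest.zip");
        MojoReaderBackend reader = MojoReaderBackendFactory.createReaderBackend(mojoUrl,
                MojoReaderBackendFactory.CachingStrategy.MEMORY);
        MojoModel model = ModelMojoReader.readFrom(reader);
        EasyPredictModelWrapper modelWrapper = new EasyPredictModelWrapper(model);

        // build a single row using the same feature names as the training data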
        RowData testRow = new RowData();
        testRow.put("temp_avg", 43.5);
        testRow.put("temp_std", 0.000425553);
        testRow.put("temp_var", 0.000000181095);
        testRow.put("x_avg", 0.955458);
        testRow.put("x_std", 0.0756441);
        testRow.put("x_var", 0.00572202);
        testRow.put("y_avg", 0.285716);
        testRow.put("y_std", 0.0293099);
        testRow.put("y_var", 0.000859071);
        testRow.put("z_avg", 0.488643);
        testRow.put("z_std", 0.0281854);
        testRow.put("z_var", 0.000794415);

        AnomalyDetectionPrediction prediction = (AnomalyDetectionPrediction) modelWrapper.predict(testRow);

Note: the MOJO option is not available for all H2O models.

Conclusion

To summarize, then: H2O offers us some nice options for building Isolation Forest models and deploying them to enterprise Java code:

  • No need to normalize data (applies to most decision-tree based algorithms)
  • If we have categorical data, then that can be encoded on-the-fly, using a parameter in the model constructor
  • We can export our model either as a Java class or as a Java resource

Further reading / references

http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/if.html

https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/isolation-forest/isolation-forest.ipynb

https://github.com/h2oai/h2o-tutorials/blob/master/tutorials/isolation-forest/interpreting_isolation-forest.ipynb

https://en.wikipedia.org/wiki/Isolation_forest

https://arxiv.org/pdf/1811.02141.pdf
