In my first article on unsupervised anomaly detection we looked at building an ensemble of random forest classifiers. Another way of approaching this problem is to dispense with training models altogether and to instead analyze the data inline. This has a number of advantages:
- We don’t need much data – in fact we only need enough data to determine any trends or cyclical behavior
- And we don’t even need any cyclical data
- Furthermore, we don’t need to know anything about our data
This is all great because we can develop a solution that is generic and just about as much “data-analysis-to-go” as is possible. However, for the (rightly) skeptical amongst my audience, let’s go through those claims in a little more detail.
Not all models have to be pre-trained. We can design our microservice to cache the data as it moves through a configured time window, and once a minimum cache size has been reached, we can start to fit curves to our signal. We can self-tune our configuration at given intervals. This would be equivalent to a training phase, but one done “on the fly”, so to speak.
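As a rough sketch of that caching step (not the actual microservice code), the window could be held in a simple double-ended queue. The class name `WindowCache` and the parameters `window_seconds` and `min_size` are hypothetical names chosen for illustration, and timestamps are assumed to be plain numbers in seconds.

```python
from collections import deque


class WindowCache:
    """Keeps only the readings that fall inside a sliding time window."""

    def __init__(self, window_seconds, min_size):
        self.window_seconds = window_seconds  # hypothetical config value
        self.min_size = min_size              # hypothetical config value
        self._items = deque()                 # (timestamp, value), oldest first

    def add(self, timestamp, value):
        self._items.append((timestamp, value))
        # Evict readings that have dropped out of the window.
        cutoff = timestamp - self.window_seconds
        while self._items and self._items[0][0] < cutoff:
            self._items.popleft()

    def ready(self):
        """True once there is enough data to start fitting curves."""
        return len(self._items) >= self.min_size

    def values(self):
        return [value for _, value in self._items]


cache = WindowCache(window_seconds=60, min_size=5)
for t in range(100):
    cache.add(timestamp=t, value=float(t))
```

Once `cache.ready()` returns true, `cache.values()` can be handed to whichever curve-fitting step we choose.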
The algorithms will expect cyclical behavior, but we can make some sensible estimates as to what this looks like by sampling the data beforehand (so, ok, my claim about pre-training was not entirely accurate, but this step is also an optional one). Any continuous signal over a period of time can be expressed as a spectral wave set. If that seems counter-intuitive then take a quick look at this demonstration, which shows how a drawing of Homer Simpson can be constructed by combining waves of different amplitude and frequency.
Cyclical data that I know about
These two algorithms both presuppose univariate data. That just means that we are applying each algorithm to a single variable. We are trying to fit the algorithm to a single line of data as it moves through time. In fact, the algorithm does not need to know anything about the “line” in question. It’s just a shape that has a trend and a form, and that’s enough.
Each time-window is inspected to deduce information about the shape of our line – and of course that time-window could be unbounded to account for all known data. So each window is independent of any other data and we have no “training” phase where we seek to understand the relationships between variables. Neither do we have to understand what the line represents (at least as far as it relates to fitting a curve).
Inline predictions – two approaches
Right, so we have an algorithm and a signal against which to fit it. We can take one of two approaches when we extrapolate the curve we have fitted. It can be used for either:
- a prediction of future measurements, or
- a measurement of anomaly.
The specifics of our use-case will largely determine which of these options makes the most sense. If the signal is clearly cyclical in nature, has a low frequency and few harmonics then a “long” prediction may be valuable. Changes in the signal shape will be slow-moving in nature and this “stability” means that our extrapolation is robust and predictive. If on the other hand the signal is volatile we may only be able to make very short-term predictions, which will not allow us time to react. In this case we can remember our short-term predictions and compare them to the actual values provided when our signal “catches up” with the predicted measurements. In other words – longer predictions allow us to act; shorter predictions allow us to analyse.
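The “remember and compare” idea for short-term predictions could look something like the sketch below. `PredictionLedger` is a hypothetical helper, and it assumes that predictions and readings share exactly the same timestamps, which a real stream may not guarantee.

```python
class PredictionLedger:
    """Remembers short-term predictions until the real readings arrive."""

    def __init__(self):
        self._pending = {}  # timestamp -> predicted value

    def remember(self, timestamp, predicted):
        self._pending[timestamp] = predicted

    def resolve(self, timestamp, actual):
        """Return actual minus predicted, or None if nothing was predicted."""
        predicted = self._pending.pop(timestamp, None)
        if predicted is None:
            return None
        return actual - predicted


ledger = PredictionLedger()
ledger.remember(timestamp=101, predicted=20.5)
ledger.remember(timestamp=102, predicted=20.7)

# Later, when the signal has "caught up" with the prediction:
deviation = ledger.resolve(timestamp=101, actual=21.0)
missing = ledger.resolve(timestamp=150, actual=19.0)  # nothing was predicted here
```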
Implementation – Fast Fourier Transform
In the example later we will use two algorithms – the Fast Fourier Transform (FFT) and Holt-Winters triple exponential smoothing (HWES). For the sake of brevity I will only discuss the first of these in detail. With the FFT algorithm we first remove any linear trend in the signal before extracting the wave set. Then we remove noise and restore the signal so we can extrapolate it. “Noise” in this context could be either amplitude- or frequency-based. Let’s decide to keep the waves with the N largest amplitudes, stripping out those with low amplitude.
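The steps just described (detrend, extract the wave set, keep the N largest amplitudes, restore and extrapolate) can be sketched with NumPy as follows. This is a minimal illustration rather than the code behind the plots in this article: the function name `fft_extrapolate` and its parameters are my own, the samples are assumed to be evenly spaced, and the synthetic test signal is made up.

```python
import numpy as np


def fft_extrapolate(signal, n_harmonics, n_predict):
    """Fit a trend plus the strongest harmonics, then extend n_predict steps."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    t = np.arange(n)

    # 1. Remove the linear trend with a least-squares line.
    slope, intercept = np.polyfit(t, signal, 1)
    detrended = signal - (slope * t + intercept)

    # 2. Extract the wave set (one-sided spectrum of a real signal).
    spectrum = np.fft.rfft(detrended)
    freqs = np.fft.rfftfreq(n)  # in cycles per sample
    amplitudes = np.abs(spectrum)

    # 3. Remove "noise": keep the N largest amplitudes, skipping the
    #    constant term at index 0 (it is ~0 after detrending).
    keep = np.argsort(amplitudes[1:])[::-1][:n_harmonics] + 1

    # 4. Restore the signal from the kept waves over an extended time axis.
    t_ext = np.arange(n + n_predict)
    restored = slope * t_ext + intercept
    for k in keep:
        # Each one-sided bin stands for a conjugate pair, hence the factor 2,
        # except the Nyquist bin, which exists only once when n is even.
        scale = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
        restored = restored + scale * amplitudes[k] / n * np.cos(
            2 * np.pi * freqs[k] * t_ext + np.angle(spectrum[k]))
    return restored


# Synthetic example: a rising trend with a single 20-sample cycle.
t = np.arange(200)
observed = 0.05 * t + 2.0 + 3.0 * np.cos(2 * np.pi * t / 20)
fitted = fft_extrapolate(observed, n_harmonics=1, n_predict=20)
```

The first 200 values of `fitted` are the fit to the observed data and the last 20 are the extrapolation.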
But here we need to be careful – if we keep too many harmonics we will in effect “over-fit” our wave set, and if we take too few we will lose specificity. This is best shown with an example. Here we have a set of readings from a Balluff Condition Monitoring (BCM) sensor – we have plotted (dark blue) and fitted (red) our wave set before extrapolating it a few steps into the future (turquoise). The deviation from the actual value is plotted at the bottom in green. Ideally we want this green line to be as flat (=well-fitted) and consistent (= not over-fitted) as possible.
With one harmonic:
With 24 harmonics:
That looks a bit better, so let’s just try significantly more. Below we have several hundred harmonics: but we are now overfitting! There is a perfect overlap for our observed data, but reduced ability to track as-yet-unseen data (see how the green error metric at the bottom suddenly takes off with new data):
As implied above, there are a few things which we have to consider. How large a time window will we apply? Will we remove noise? How many harmonics should be retained when doing this? Can we do better than just guessing? Yes – we can use other tools to show us the wave-set breakdown and use that as a basis for our design. FFT can do this as well, but let’s look at another possibility.
Inspecting data with PyEMD
PyEMD is a Python implementation of Empirical Mode Decomposition (EMD). EMD breaks a signal down into separate Intrinsic Mode Functions (IMFs). These are a little similar to harmonic functions, except that they can display trend and can vary in amplitude and/or frequency. Each IMF spans the entirety of the input data set. As the process is serial in nature, EMD is inherently slow. So it’s a poor candidate for in-line analysis, but a good one for a preliminary examination of data. One of the issues with EMD is that the signal components are not always separated into the same mode each time (the “mixed mode” effect). This effect is partially mitigated by an ensemble approach (EEMD: Ensemble Empirical Mode Decomposition), which adds white noise to the signal over many trials and averages the results.
Let’s look at an example with an extract taken from the same dataset we used above.
If we expect our signal to be cyclical then maybe we can capture the essential shape of our data using a subset of the IMFs. In the image above, it looks like IMFs 4-9 provide this without over-fitting too much. So a good place to start is to set the number of harmonics to 6 (i.e. the bottom 6 graphs).
For our FFT we can select different values of a) the time window to cache and b) the number of harmonics to retain when fitting and extrapolating our curve. At defined intervals we will apply all combinations of these variables to the most recent window. The winner is then selected for the next interval (until re-fitting). By “winner” we mean this: extrapolate the fitted curve and then derive a score (such as RMSE: root mean squared error) for every combination of our variables. The combination with the lowest score is selected. How often we can do this and how many combinations are feasible depends on the specific use case and the available resources.
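One way to sketch this self-tuning step is a small grid search. The assumption here (mine, not necessarily how the original service scores candidates) is that we hold back the last few cached readings, fit each combination on the data before them, and score the extrapolation against the held-back values. `self_tune`, `rmse` and the compact `fft_extrapolate` helper are illustrative names, and the data is synthetic.

```python
import itertools

import numpy as np


def fft_extrapolate(signal, n_harmonics, n_predict):
    """Detrend, keep the strongest harmonics, rebuild and extend the signal."""
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    t = np.arange(n)
    slope, intercept = np.polyfit(t, signal, 1)
    spectrum = np.fft.rfft(signal - (slope * t + intercept))
    freqs = np.fft.rfftfreq(n)
    amplitudes = np.abs(spectrum)
    keep = np.argsort(amplitudes[1:])[::-1][:n_harmonics] + 1
    t_ext = np.arange(n + n_predict)
    restored = slope * t_ext + intercept
    for k in keep:
        scale = 1.0 if (n % 2 == 0 and k == n // 2) else 2.0
        restored = restored + scale * amplitudes[k] / n * np.cos(
            2 * np.pi * freqs[k] * t_ext + np.angle(spectrum[k]))
    return restored


def rmse(predicted, actual):
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def self_tune(history, window_sizes, harmonic_counts, holdout):
    """Return (score, window, n_harmonics) with the lowest RMSE, or None."""
    history = np.asarray(history, dtype=float)
    target = history[-holdout:]  # readings held back for scoring
    best = None
    for window, n_harm in itertools.product(window_sizes, harmonic_counts):
        if window + holdout > history.size:
            continue  # not enough cached data for this combination
        train = history[-(window + holdout):-holdout]
        forecast = fft_extrapolate(train, n_harm, holdout)[-holdout:]
        score = rmse(forecast, target)
        if best is None or score < best[0]:
            best = (score, window, n_harm)
    return best


# Synthetic history: trend, two cycles and a little seeded noise.
rng = np.random.default_rng(0)
t = np.arange(400)
history = (0.02 * t + 3.0 * np.cos(2 * np.pi * t / 20)
           + np.cos(2 * np.pi * t / 50) + rng.normal(0.0, 0.2, t.size))
best = self_tune(history, window_sizes=(100, 200),
                 harmonic_counts=(1, 2, 8), holdout=20)
```

The winning window and harmonic count in `best` would then be used until the next re-fitting interval.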
Implementation – Holt-Winters Exponential Smoothing
Without going into as much detail, here is a brief summary of our second algorithm. Holt-Winters is a triple exponential smoothing algorithm. This means there are three levels of smoothing: a weighted average, a trend and seasonality. Just like FFT, we don’t have to pre-configure much, but we do have to provide a value for how many cycles (seasons) are expected in a given time window. If we are not sure, we can be guided by EEMD, as shown above. Or we can provide different values for our self-tuning function to iterate over.
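To make the three levels of smoothing a little more concrete, here is a minimal hand-rolled sketch of the additive form of Holt-Winters. It is only an illustration: the smoothing factors `alpha`, `beta` and `gamma` are fixed, arbitrary values (libraries such as statsmodels normally optimise them), the initialisation from the first two seasons is one simple choice among several, and the function name is my own.

```python
import math


def holt_winters_additive(series, season_length, n_predict,
                          alpha=0.5, beta=0.1, gamma=0.1):
    """Additive triple exponential smoothing; returns n_predict forecasts."""
    m = season_length
    n = len(series)
    if n < 2 * m:
        raise ValueError("need at least two full seasons of data")

    # Initialise from the first two seasons.
    first = sum(series[:m]) / m
    second = sum(series[m:2 * m]) / m
    trend = (second - first) / m
    centre = (m - 1) / 2.0  # the season mean sits at the season's midpoint
    seasonal = [series[i] - (first + trend * (i - centre)) for i in range(m)]
    level = first + trend * (m - 1 - centre)  # level at the end of season one

    for i in range(m, n):
        value = series[i]
        previous_level = level
        season = seasonal[i % m]
        # 1. Level: weighted average of the de-seasonalised reading.
        level = alpha * (value - season) + (1 - alpha) * (level + trend)
        # 2. Trend: smoothed change in level.
        trend = beta * (level - previous_level) + (1 - beta) * trend
        # 3. Seasonality: smoothed offset for this position in the cycle.
        seasonal[i % m] = gamma * (value - level) + (1 - gamma) * season

    return [level + h * trend + seasonal[(n - 1 + h) % m]
            for h in range(1, n_predict + 1)]


# Synthetic example: rising trend with a 12-sample season.
observed = [10 + 0.1 * t + 5 * math.sin(2 * math.pi * t / 12)
            for t in range(120)]
forecast = holt_winters_additive(observed, season_length=12, n_predict=12)
```

Here `season_length` plays the role of the expected cycle length, so it is exactly the kind of value the self-tuning function could iterate over.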
Example and interpretation
Here is an example of BCM temperature readings that are provided via MQTT. We have enriched them with two fitted algorithms (FFT and HWES). The averaged difference between the two algorithms and the actual values is plotted in orange. Note the different scale on the right of the graph. Despite a steady increase in temperature, the fit remains good and the deviation fairly consistent. This is roughly what we would expect: changes in the signal are not in and of themselves significant, but rather changes that are not in harmony (pun intended) with the current cyclical behaviour. If sudden or sustained peaks start to occur in the orange line, then it would be time to investigate the process further.
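A deviation line of this kind could be computed along the following lines. I am assuming that “averaged difference” means the mean of the two absolute errors at each point in time; the function name and the sample numbers are invented purely for illustration.

```python
import numpy as np


def anomaly_measure(actual, fft_fit, hwes_fit):
    """Mean absolute deviation of the two fitted curves from the readings."""
    actual = np.asarray(actual, dtype=float)
    fft_error = np.abs(np.asarray(fft_fit, dtype=float) - actual)
    hwes_error = np.abs(np.asarray(hwes_fit, dtype=float) - actual)
    return (fft_error + hwes_error) / 2.0


# Made-up readings: the last value departs from what both curves expect.
actual = [20.0, 20.5, 21.0, 25.0]
fft_fit = [20.1, 20.4, 21.1, 21.5]
hwes_fit = [19.9, 20.6, 20.9, 21.7]
measure = anomaly_measure(actual, fft_fit, hwes_fit)
```

The result is a measurement rather than a verdict: deciding when a peak counts as sudden or sustained is left to whoever is watching the line.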
In this article we looked at fitting a curve to data without the need for prior training. This fitting can be trained “inline”, as it were, by regularly applying different sets of configurable variables and then intermittently re-selecting these using our self-tuning process. Any clear deviation between predicted and actual values can be regarded as an anomaly measurement – as opposed to an anomaly detection – and as such is a useful starting point for further investigation. This investigation will normally be iterative in nature, but at least we have a method for reducing the size of the haystack in which we search for the proverbial needle!