We live in a world of unprecedented connectivity and computational power. It is not unreasonable to assume that bigger, more complex models will provide better forecasting power. However, the “M” competitions (also known as the Makridakis Competitions, named after their founder Spyros Makridakis), which have run in different forms since the 1980s, show that this is often not the case. These open competitions compare and evaluate different approaches to, and implementations of, time-series forecasting.
The most recent competition – M5 – took place in 2020 and you can read the provisional findings here. I’d like to share some impressions gleaned from the paper and from looking at some of the available solutions here:
Competition context and background
The competition has increased in size over the years from 6 participants (M1: 1982) to 20,000 (M5: 2020). Nassim Taleb references the M-competitions in his 2007 book “The Black Swan”. He quotes Makridakis as saying that (certainly up to and including M3, completed in 1999):
“statistically sophisticated and complex methods do not necessarily provide more accurate forecasts than simpler ones.” (Nassim Taleb, “The Black Swan”)
Indeed it was not until M4 (2018) that more sophisticated models began out-performing the more classical, statistically-based ones. This trend continued in M5, where most of the top-50 placed teams worked with sophisticated machine-learning models (most notably, LightGBM). From the provisional findings paper linked to above:
“LightGBM proved that it can be used to effectively process numerous, correlated series and exogenous/explanatory variables and reduce forecast error. Moreover, deep learning methods, like DeepAR and N-BEATS, that provide advanced, state-of-the-art ML implementations, showed potential for further improving forecasting accuracy in hierarchical retail sales applications.” (Spyros Makridakis, M5 Review paper)
Competition results: intuitives and counter-intuitives
The majority of the winners were neither experts in the retail domain nor forecasting theory, but more than compensated for this with experimentation and thinking deeply about the data. They focused their efforts on data preparation and in selecting the best ensemble of models. All competitions have shown the importance of combining models. This applies both to the use of ensembles (e.g. boosted decision trees) and also combinations of different algorithms.
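To make the point about combining models concrete, here is a minimal sketch with invented toy numbers (they are not M5 data) and two hypothetical models whose errors point in different directions. It only illustrates the mechanism of averaging forecasts and does not reproduce any team's approach.

```python
# Toy illustration only: the numbers below are invented, not M5 data.
actual = [10, 12, 14, 16]
model_a = [12, 13, 17, 15]  # hypothetical model that mostly over-forecasts
model_b = [9, 10, 12, 18]   # hypothetical model that mostly under-forecasts


def mae(forecast, truth):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(f - t) for f, t in zip(forecast, truth)) / len(truth)


# Simple ensemble: the unweighted average of the two forecasts.
combined = [(a + b) / 2 for a, b in zip(model_a, model_b)]

mae_a = mae(model_a, actual)
mae_b = mae(model_b, actual)
mae_combined = mae(combined, actual)
print(mae_a, mae_b, mae_combined)
```

In this toy case the individual errors partly cancel, so the combined forecast has a lower error than either model on its own. With real data the benefit depends on how correlated the models' errors are.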
Moreover, the winner didn’t seem to spend excessive time on model parameter tuning, instead investing time in data/feature preparation and the treatment of external data sources (e.g. climatic or calendrical patterns). The recent advent of automated self-tuning strategies (e.g. hyperopt, Auto-ML) *does* of course mean that some of the heavy lifting shifts from the data scientist onto the model-training framework, but it would be a mistake to over-emphasize this or to think of it as amounting to some kind of a “secret sauce”.
The sheer number-crunching and iterative nature of model tuning does not require the same insight into the data that data wrangling does. For instance, one has to decide how best to modulate, combine or pivot the data – or what is an appropriate retention depth for time-series-related features. AI is not going to replace this type of work anytime soon!
The specific goals of the competition are also noteworthy. Each model was to conduct a forecast over a window of 28 days. The data provider – Walmart – wanted a measure of forecast accuracy over this time range. This may make perfect, intuitive sense to the layperson – anything guaranteeing excessive long-term accuracy would be greeted with suspicion. But it does call into question the prediction ranges that are at times touted by enterprise analytics tools. Just because a prediction is possible months or even years into the future, does not mean that such a forecast has any predictive value.
The second aspect of the competition is relevant here. In another branch of the competition, the teams submitted a measure of uncertainty for their predictions. This was a fairly novel requirement and is worth highlighting as a general principle. We should know our models and their limitations so that we can challenge their conclusions. It is preferable to break our models and expose them in order to discover more about reality, than to avoid questioning reality because “the model said so”. We should be pursuing a paradigm of model usage that allows maximum transparency as to what exactly the model is doing internally.
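One common way to score a forecast that comes with a measure of uncertainty is to ask for quantiles and evaluate them with the pinball (quantile) loss. The sketch below is an assumption about how such a score can be computed in general: the function name and the numbers are illustrative, and the competition's own metric adds weighting and scaling that are not shown here.

```python
def pinball_loss(actual, forecast, quantile):
    """Pinball loss of a single quantile forecast (lower is better)."""
    diff = actual - forecast
    # Under-forecasts are penalised by `quantile`, over-forecasts by 1 - quantile.
    return quantile * diff if diff >= 0 else (quantile - 1) * diff


# Illustrative values: a 90% quantile forecast should rarely be exceeded,
# so falling below the actual value costs more than overshooting it.
loss_under = pinball_loss(actual=10, forecast=8, quantile=0.9)
loss_over = pinball_loss(actual=10, forecast=12, quantile=0.9)
print(loss_under, loss_over)
```

The asymmetry is the point: for a high quantile, being too low is punished far more than being too high, which forces the model to state how wide its uncertainty really is.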
Open vs. closed models
There is a point at which we no longer view technologies as being “open” – where our trust has not yet been fully earned – but rather as “closed”. What does this mean exactly? I was the proud owner of an early Texas Instruments pocket calculator (though back then the pockets had to be fairly large!) in the 1970s – the TI-30 (https://en.wikipedia.org/wiki/TI-30) or one of its cousins – which had a bug in the calculation of inverse tangents. You couldn’t trust the thing blindly. But no-one thinks twice about using a calculator today: it just “works”.
We may not know when we reached this level of trust. We may not even know if it is really warranted. But that we did, at some point, cannot be disputed. It’s the same with elevators. Gone are the days when an operator was ever-present, ostensibly to operate the buttons (but in reality to provide assurance that the contraption was sufficiently safe). We haven’t reached that point yet with forecasting models, and yet may sometimes be too quick to just “trust the (data) science”. Sometimes we need to retain a healthy level of scepticism! This is easier said than done, but it is one reason why the European Commission has started these two initiatives. Joe Norman, the founder of Applied Complexity Science LLC, puts it this way:
“We don’t need “models that can give us reliable answers”. We need to understand how NOT to use models, that is as stand ins for real systems that can “give us reliable answers”. Models should be first and foremost used for probing their assumptions–we should be busting models!” (https://twitter.com/normonics/status/1332095414976307201)
Nuts and bolts
Walmart provided its sales data in “flattened” form, i.e. lots of columns and relatively few rows. Each row represents a sales entity (e.g. a combination of store and product) with a column for each day of sales data. Each row is independent of the others. This data structure retains the time-series element, although it is laid out in-line across the columns, and in this form there is not enough data (i.e. not enough rows) for model training.
Many teams transformed the basic structure of the data before they did anything else. If each row is pivoted to represent a day, then aggregate columns can be added to hold a certain number of lags and rolling averages. This makes it difficult to vary the batch during training, but model tuning did not seem to be as significant as getting the right paradigm for model combination.
Pivoting the data means we go from about 12k rows to 23m. This is still manageable on a single machine, and we now have about 50 or so columns instead of 1900. Some of the candidate models applied different levels of aggregation to capture different types of information. Indeed, the review paper describes how in some cases different models yielded good results for different levels of aggregation.
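The reshaping described above can be sketched with pandas as follows. The column names (item_id, d_1, d_2, ...), the toy values, and the choice of a one-day lag and a two-day rolling mean are assumptions for illustration; the actual lag depths and window lengths are a modelling decision.

```python
import pandas as pd

# Toy stand-in for the wide layout: one row per item, one column per day.
# Column names and values are assumptions for illustration.
wide = pd.DataFrame({
    "item_id": ["A", "B"],
    "d_1": [3, 0],
    "d_2": [1, 2],
    "d_3": [4, 0],
    "d_4": [2, 5],
    "d_5": [0, 1],
})

# Pivot to long form: one row per (item, day).
long_df = wide.melt(id_vars="item_id", var_name="day", value_name="sales")
long_df["day"] = long_df["day"].str.replace("d_", "", regex=False).astype(int)
long_df = long_df.sort_values(["item_id", "day"]).reset_index(drop=True)

# Add time-series features per item. The rolling mean is computed on the
# shifted series so that each row only sees values from earlier days.
grouped = long_df.groupby("item_id")["sales"]
long_df["lag_1"] = grouped.shift(1)
long_df["roll_mean_2"] = grouped.transform(
    lambda s: s.shift(1).rolling(2).mean()
)

print(long_df)
```

The first rows of each item have missing lag values, which is one reason the retention depth of such features needs some thought: the deeper the lag, the more leading rows carry no information.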
The competition rules specified the last 28 days of figures as validation data. In many cases teams ignored the first two years of data as they considered any trends to be too stale for predictive accuracy.
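A time-based split along these lines might look as follows. The day count of 1900 and the two-year cut-off of 730 days are rounded assumptions taken from the description above, and split_days is a hypothetical helper, not part of any library.

```python
def split_days(n_days, horizon=28, skip_first=0):
    """Split day numbers 1..n_days into training and validation ranges.

    The last `horizon` days are held out for validation and the first
    `skip_first` days are dropped from training as too stale.
    """
    first_valid_day = n_days - horizon + 1
    train = list(range(skip_first + 1, first_valid_day))
    valid = list(range(first_valid_day, n_days + 1))
    return train, valid


# Assumed figures: roughly 1900 days of history, first two years ignored.
train_days, valid_days = split_days(n_days=1900, horizon=28, skip_first=730)
print(len(train_days), len(valid_days))
```

Unlike a random split, this keeps the validation window strictly after the training window, which mirrors how the forecast is used in practice.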
Competition winner: LightGBM
LightGBM was the clear winner, featuring in all but one of the top 5 solutions. It is similar to XGBoost in that both libraries make use of boosted decision trees. There are a few differences that account for the increased efficiency of LightGBM:
- LightGBM does not build its trees level-by-level, but leaf-by-leaf. This can lead to overfitting when the data is sparse. We can use parameters such as the maximum tree depth (max_depth) and the maximum number of leaves (num_leaves) to limit this.
- LightGBM determines the split point for tree nodes by using histogram algorithms instead of pre-sorted feature values. A search for the splits is efficient as the number of bins is less than the number of values. XGBoost offers a histogram algorithm too, but LightGBM can additionally reduce the data considered during training by sampling instances in favour of those with large gradients, i.e. those that are under-trained (Gradient-based One-Side Sampling, or GOSS).
- If features are mutually exclusive (they rarely take non-zero values at the same time), they can be combined into a single feature. LightGBM does this (Exclusive Feature Bundling) to reduce the overall number of features, which also aids training.
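The histogram idea from the list above can be sketched in a few lines of NumPy. This is a deliberately simplified illustration, not LightGBM's implementation: it uses equal-width bins, works on raw target sums rather than gradient statistics, and scores a split by the reduction in squared error. The function name best_histogram_split is hypothetical.

```python
import numpy as np


def best_histogram_split(x, y, max_bin=16):
    """Return (threshold, gain) of the best split found on bin boundaries."""
    edges = np.linspace(x.min(), x.max(), max_bin + 1)
    # Bin index in [0, max_bin - 1] for every sample.
    bins = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, max_bin - 1)
    # One pass over the data builds the per-bin count and target-sum histograms.
    count = np.bincount(bins, minlength=max_bin)
    total = np.bincount(bins, weights=y, minlength=max_bin)
    # Cumulative sums give the left-side statistics for each candidate boundary,
    # so only max_bin - 1 candidates are evaluated instead of one per value.
    n_left = np.cumsum(count)[:-1]
    s_left = np.cumsum(total)[:-1]
    n_right = len(y) - n_left
    s_right = y.sum() - s_left
    valid = (n_left > 0) & (n_right > 0)
    gain = np.full(max_bin - 1, -np.inf)
    gain[valid] = (s_left[valid] ** 2 / n_left[valid]
                   + s_right[valid] ** 2 / n_right[valid]
                   - y.sum() ** 2 / len(y))
    k = int(np.argmax(gain))
    return edges[k + 1], gain[k]


# Toy data (an assumption for illustration): a step in the target at x = 50.
x = np.arange(100, dtype=float)
y = np.where(x >= 50, 10.0, 0.0)
threshold, gain = best_histogram_split(x, y)
print(threshold, gain)
```

With 100 distinct values and 16 bins, only 15 candidate thresholds are scored, which is where the efficiency comes from. The price is that a split can only fall on a bin boundary.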
Additionally, there are not that many parameters to optimize when training a LightGBM model. Makridakis mentions 5 parameters in the review paper. These are: learning rate, number of iterations, maximum number of bins that feature values will be bucketed in, number of estimators, and loss function.
It is worth noting that LightGBM is a fairly new player in the forecasting arena. Microsoft started the project in 2016, and the collaborators presented the foundational paper in 2017. Any new library needs to have its tyres kicked and prove itself before it is adopted by established frameworks. At the time of writing, Anaconda does not distribute LightGBM – although it can be installed with pip. There may then be a lag between the point at which developers are willing and eager to utilize such a library (early adoption), and the point at which all-in-one provider solutions make it available (tried and tested). Solution providers may thus offer stability and familiarity at the expense of the latest cutting-edge developments.
The volume of M5 training data after pivoting is still sufficiently small to allow processing on a single machine. However, some competitors commented on the Kaggle board that in-memory processing was reaching its limit. I could only find one comment regarding using LightGBM with Dask (Dask is a Python framework based on pandas which allows out-of-memory processing). Without going into detail regarding Dask, there is a dask-lightgbm library under development that could facilitate the processing of data that exceeds the available RAM.
As Dask is based on pandas, it is often possible to use very similar code to process data. Dask also has wrappers around some of the sklearn classes such as GridSearchCV. However, the distributed nature of Dask means that you have to give more thought to things like sorting and splitting. Here is a simple example showing how to train a classifier in both LightGBM and Dask-LightGBM. The LightGBM example is largely borrowed from this article.
```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from lightgbm import LGBMClassifier

# synthetic binary classification problem
X, y = make_classification(n_samples=10000, n_features=10, n_informative=5,
                           n_redundant=5, random_state=1)

# evaluate the model with repeated stratified k-fold cross-validation
model = LGBMClassifier()
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv,
                           n_jobs=-1, error_score='raise')
print('Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

# fit the model on the whole dataset
model = LGBMClassifier()
model.fit(X, y)

# make a single prediction (predict returns an array, so take its first element)
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951,
        -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = model.predict(row)
print('Prediction: %d' % yhat[0])

# predict on the whole data set
y_pred = model.predict(X)
acc_score = (y == y_pred).sum() / len(y)
print(acc_score)
```
Which yields the following output:
```
Accuracy: 0.941 (0.007)
Prediction: 1
0.9772
```
The equivalent using Dask-LightGBM:

```python
import numpy as np
import dask.array as da
import dask_lightgbm.core as dlgbm
from dask.distributed import Client, LocalCluster
from dask_ml.datasets import make_classification as make_classification_dask

# local cluster with a single worker and an explicit memory limit
cluster = LocalCluster(n_workers=1, threads_per_worker=1, memory_limit='4GB')
client = Client(cluster)

# synthetic binary classification problem, generated as chunked Dask arrays
X_dask, y_dask = make_classification_dask(n_samples=10000, n_features=10,
                                          n_informative=5, n_redundant=5,
                                          random_state=1, chunks=10)

dmodel = dlgbm.LGBMClassifier(n_estimators=400)
dmodel.fit(X_dask, y_dask)

# make a single prediction (the result is lazy, so compute it before printing)
row = [[2.56999479, -0.13019997, 3.16075093, -4.35936352, -1.61271951,
        -1.39352057, -2.48924933, -1.93094078, 3.26130366, 2.05692145]]
yhat = dmodel.predict(da.from_array(np.array(row)))
print('Prediction: %d' % yhat.compute()[0])

# predict on the whole data set
y_pred = dmodel.predict(X_dask, client=client)
acc_score = (y_dask == y_pred).sum() / len(y_dask)
acc_score = acc_score.compute()
print(acc_score)
```
```
Prediction: 1
0.9797
```
The prediction across the entire data set is there simply to compare metrics, as it is not easy to perform cross-validation with Dask. Note the make_classification_dask wrapper around sklearn’s make_classification and the ability to specify exactly how much RAM we want to make available to the Dask worker processes. The Dask example takes significantly longer on my laptop. That is expected as the data easily fits in memory, so we have the Dask overhead without benefiting from its strengths.
Conclusions
- The effort invested in data wrangling and feature engineering far outweighs the effort in model tuning. It may even approximate to the 80:20 rule.
- Being able to quantify our confidence in a forecast is as important as being able to forecast. It is therefore essential to pursue an “open model” paradigm – or at least to have a commitment to understanding as much as we can about model internals, even if it means we try to break them in the process.
- Adoption by established frameworks will inevitably lag behind up-to-date technologies. Data scientists will need to decide how best to balance this trade-off use-case by use-case.
- Dask has significant potential to allow out-of-memory and distributed processing with a familiar syntax (as it uses pandas behind the scenes). It has backing from Anaconda and is therefore not purely community-driven.