Introduction
When we work with classification or regression models we are seeking to predict either a discrete value or a value along a continuum. The accuracy of these models is implicit in the metric we use when training: either in the cost function during the actual training, or in something like an R² score or AUC when we assess our model’s predictive power against our test dataset. However, sometimes it would be nice for the model to emit a measure of uncertainty along with a prediction – to make this uncertainty explicit. This is achieved by predicting a probability distribution rather than a single value: a confidence interval is then inherent in the prediction.
This does not exclude the prediction of a specific value, as we can, of course, extract that value from the distribution, but we have more flexibility in how we use the result. The steps to achieve this are as follows:
- Identify what kind of distribution best fits our use-case
- Define the model so that it outputs parameters that define the chosen distribution
- Define a cost function that allows the model to fit itself to the training data
- Train the model
- Plot sample results to illustrate usage
Distributions
There are many probability distributions to choose from. See this list for examples. We will be using the ubiquitous Boston housing dataset, which you can load directly from scikit-learn. It’s a very small dataset (about 500 records), so we are using it with Keras just by way of illustration. The target feature is the house price, which lies on a continuum. We will look at two distributions, both of which will predict a value along a continuum (i.e. regression). Knowing what our output will be means that we can narrow down the candidate distributions.
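For reference, a minimal way to load the data might look like this (load_boston has been removed from recent scikit-learn releases, so this sketch assumes an older version is installed):
from sklearn.datasets import load_boston

# X: 13 numeric features, y: median house price
X, y = load_boston(return_X_y=True)
print(X.shape, y.shape)  # (506, 13) (506,)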
Discrete
Our prediction will be:
- Discrete (i.e. we are predicting an integer value)
- Non-negative (as the house prices are not negative)
- Unbounded (in the positive direction: there is theoretically no upper limit)
Based on this a good candidate is the negative binomial distribution.
Continuous
As above, except that the predicted value will be a decimal value (continuous) and we will use the Weibull distribution. Note the following:
- Our example will be for a continuous distribution, but we can define loss functions for both discrete and continuous cases (and the discrete case is just a special case of the continuous one anyway)
- The Weibull distribution can be used for so-called “censored” data: referring to data life-cycles that are not complete. For example, for survival functions (relating to life-expectancy, or time-to-failure) we can use open-ended cases as the Weibull distribution takes this property into account when making predictions
- We won’t be dealing with censored data in this article, so our Weibull cost function will be accordingly simplified
Negative binomial distribution
Both distributions require 2 parameters. For the negative binomial distribution these are:
- R: number of failures until the experiment is stopped
- this is a positive integer
- P: success probability in each experiment
- this must lie between 0 and 1
These properties will be important when we look at the next step, defining our activation function.
The model
Since neural networks are flexible with regard to the output of the final layer, we will be using a neural network implemented with the Keras library. TensorFlow will be the backend, but Keras offers a simpler model API.
The important bit to note is the final Dense layer and the Activation function that follows, as this is where we emit the two values needed to define our distribution. Here is the model definition:
from keras import regularizers
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential

model = Sequential()
model.add(Dense(12, input_dim=X.shape[1], kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.001), activation='relu'))
model.add(Dropout(rate=0.01))
model.add(Dense(8, kernel_initializer='glorot_uniform', activation='relu'))
model.add(Dropout(rate=0.01))
model.add(Dense(8, kernel_initializer='glorot_uniform', activation='relu'))
model.add(Dropout(rate=0.01))
# Final layer: two raw outputs, mapped onto the distribution parameters by the custom activation below
model.add(Dense(2))
model.add(Activation(activate_negative_binomial))
Activation function
We output two values (Dense(2)) and then transform them into the two parameters R and P. As R must be positive (the distribution generalises to real-valued R via the gamma function), we will use a Softplus function. For P (which must lie between 0 and 1) we will use a sigmoid function. An overview of different activation functions and their output shapes is given here (we could have chosen other functions: the important thing is to ensure the outputs are constrained in the same way that the distribution parameters are constrained). These two functions can be combined in a single activation function like this:
from keras import backend as k

def activate_negative_binomial(x):
    # First output -> R: softplus keeps it strictly positive
    a = k.softplus(x[:, 0])
    # Second output -> P: sigmoid squashes it into (0, 1)
    b = k.sigmoid(x[:, 1])
    a = k.reshape(a, (k.shape(a)[0], 1))
    b = k.reshape(b, (k.shape(b)[0], 1))
    return k.concatenate((a, b), axis=1)
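As a quick sanity check (with arbitrary input values), we can push a couple of raw network outputs through this activation and confirm that the two columns land in the required ranges:
import numpy as np
from keras import backend as k

# Two hypothetical rows of raw output from the final Dense(2) layer
raw = k.constant(np.array([[-3.0, 2.0],
                           [ 1.5, -0.5]], dtype='float32'))
print(k.eval(activate_negative_binomial(raw)))
# First column (softplus): strictly positive, usable as R
# Second column (sigmoid): squashed into (0, 1), usable as P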
Normally when we train a neural network for regression (output of a value along a continuum), we have one predicted value and a known target value for the given inputs. We can then use a standard measurement (e.g. root mean squared error) to measure the error and adjust our parameters to try and reduce it. How do we do this with distribution parameters? Well, we want to choose R and P such that the distribution assigns the highest probability to the target value, so we can use the distribution’s probability mass function, given by:
P(X = k) = \frac{(k + r - 1)!}{k!\,(r - 1)!}\, p^r (1 - p)^k
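To make this concrete, we can evaluate the PMF for some arbitrarily chosen parameters using scipy, whose nbinom distribution has the same functional form (its n parameter playing the role of R here):
from scipy.stats import nbinom

# Hypothetical parameters, purely to illustrate the shape of the PMF
r, p = 13, 0.4
for count in (5, 15, 19, 25, 40):
    print(count, nbinom.pmf(count, r, p))
# The probabilities peak around the distribution's mode and fall away on either
# side; training should nudge r and p so that this peak sits near the target value.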
Loss function
But, by definition, the probability mass function has a peak around certain values. If we start training with parameter values a long way from this peak, it will be difficult for them to converge, as the gradients become smaller the further we are from the peak. We want the opposite: a curve with steep gradients away from the peak (drawing us towards it) and shallow ones near it. It is easier for a ball to settle in the centre of a bowl because the shape of the bowl – steep gradients at the extremes, and flat in the “optimum” middle – “pushes” the ball towards its final resting place.
If our bowl were the shape of our PMF, it would do the exact opposite. The solution is to take the negative log of the PMF: this is called the negative log-likelihood and looks like this:
-\log\left(\frac{(k + r - 1)!}{k!\,(r - 1)!}\, p^r (1 - p)^k\right)
= -\log\big((k + r - 1)!\big) - r\log p - k\log(1 - p) + \log(k!) + \log\big((r - 1)!\big)
TensorFlow contains utility functions for things like the log-gamma function (and since k! = Γ(k + 1), each factorial term above becomes a call to tf.math.lgamma), so our Python code is:
import tensorflow as tf

def loss_negative_binomial(y_true, y_pred):
    # Reshape everything to (batch, 1) so the arithmetic lines up element-wise
    n = tf.reshape(y_pred[:, 0], (-1, 1))
    p = tf.reshape(y_pred[:, 1], (-1, 1))
    y = tf.reshape(y_true, (-1, 1))
    # Negative log-likelihood, with the factorials expressed via lgamma
    return (
        tf.math.lgamma(n)
        + tf.math.lgamma(y + 1)
        - tf.math.lgamma(n + y)
        - n * tf.math.log(p)
        - y * tf.math.log(1 - p)
    )
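As a sanity check (with hypothetical numbers, and assuming TensorFlow 2.x eager execution so the function can be called directly), the loss should agree with the negative of scipy's nbinom.logpmf for the same parameters:
import tensorflow as tf
from scipy.stats import nbinom

# Hypothetical target count and predicted distribution parameters
target, r, p = 19.0, 13.0, 0.4
y_true = tf.constant([[target]])
y_pred = tf.constant([[r, p]])

print(loss_negative_binomial(y_true, y_pred).numpy().item())
print(-nbinom.logpmf(target, r, p))  # the two values should match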
We can now use this as our cost function, like this:
from keras.optimizers import Adam

model.compile(loss=loss_negative_binomial, optimizer=Adam(), metrics=['mae'])
and then train the model:
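(A minimal sketch; the epoch and batch-size settings below are arbitrary.)
history = model.fit(X, y, epochs=500, batch_size=32, validation_split=0.2, verbose=0)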

Prediction and interpretation
To understand what our prediction has produced, let’s take a single case.
from scipy.stats import nbinom

# Perform inference for a single sample
element = 9
prediction_nbinom = model.predict(X[element:element+1])
y_actual_nbinom = y[element:element+1]
print(f'y_actual is {y_actual_nbinom[0]}')
n = prediction_nbinom[:, 0]; p = prediction_nbinom[:, 1]
print(f'distribution parameters (n/p) for this element are {n} and {p}')
t_nbinom = range(int(y_actual_nbinom[0] * 2))
probs = nbinom.pmf(t_nbinom, n, p)
y_prediction_nbinom = nbinom.median(n, p)
confint = 80
lower_nbinom, upper_nbinom = nbinom.interval(confint/100, n, p)
print(f'{confint}% bounds are {lower_nbinom[0]} --> {upper_nbinom[0]}')
Which yields this output:
y_actual is 18.9
distribution parameters (n/p) for this element are [13.300168] and [0.39923918]
80% bounds are 11.0 --> 29.0
We can plot this:
import matplotlib.pyplot as plt

plt.plot(t_nbinom, probs, label='pmf')
plt.axvline(lower_nbinom[0], ls='--', label='lower ({}%)'.format(confint))
plt.axvline(upper_nbinom[0], ls='--', color='green', label='upper ({}%)'.format(confint))
plt.axvline(y_prediction_nbinom[0], ls='--', color='red', label='median')
plt.axvline(y_actual_nbinom[0], color='black', label='actual')
plt.legend();

This shows us that the actual value (black line) lies within our 80% confidence limit. We could extend this for different confidence limits by adding further pairs of upper/lower bands.
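For example, a sketch of stacking a few interval levels (the levels and colours chosen here are arbitrary) might look like this:
plt.plot(t_nbinom, probs, label='pmf')
for ci, colour in zip((50, 80, 95), ('orange', 'blue', 'green')):
    lo, hi = nbinom.interval(ci / 100, n, p)
    plt.axvline(lo[0], ls='--', color=colour, label='{}% interval'.format(ci))
    plt.axvline(hi[0], ls='--', color=colour)
plt.axvline(y_actual_nbinom[0], color='black', label='actual')
plt.legend();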
Let’s try the same thing with the Weibull distribution.
Weibull distribution
The model
The model is more-or-less unchanged: we are still emitting two values from the final layer, but the activation function is specific to Weibull:
weibull_model = Sequential()
weibull_model.add(Dense(12, input_dim=X.shape[1], kernel_initializer='glorot_uniform', kernel_regularizer=regularizers.l2(0.001), activation='relu'))
weibull_model.add(Dropout(rate=0.01))
weibull_model.add(Dense(8, kernel_initializer='glorot_uniform', activation='relu'))
weibull_model.add(Dropout(rate=0.01))
weibull_model.add(Dense(8, kernel_initializer='glorot_uniform', activation='relu'))
weibull_model.add(Dropout(rate=0.01))
weibull_model.add(Dense(2))
weibull_model.add(Activation(activate_weibull))
Activation function
The activation function is given below.
def activate_weibull(ab):
    # Scale parameter (a): exponential keeps it strictly positive
    a = k.exp(ab[:, 0])
    # Shape parameter (b): softplus also keeps it strictly positive
    b = k.softplus(ab[:, 1])
    a = k.reshape(a, (k.shape(a)[0], 1))
    b = k.reshape(b, (k.shape(b)[0], 1))
    return k.concatenate((a, b), axis=1)
Loss function
Next comes the log-likelihood loss function for continuous values (for the activation, loss, and plotting functions I have used the ones here and here):
def weibull_loglik_continuous(y_true, ab_pred, name=None):
    y_ = k.flatten(y_true)  # flatten so the targets align with the per-sample parameters
    u_ = 1                  # "event observed" flag; always 1 here as we have no censored data
    a_ = ab_pred[:, 0]      # scale
    b_ = ab_pred[:, 1]      # shape
    ya = (y_ + 1e-35) / a_  # small epsilon guards against log(0)
    # Negative mean log-likelihood of the (uncensored) Weibull distribution
    return -1 * k.mean(u_ * (k.log(b_) + b_ * k.log(ya)) - k.pow(ya, b_))
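As with the negative binomial loss, we can sanity-check this against scipy (with hypothetical parameter values; scipy's weibull_min takes the shape as its first argument and the scale as a keyword, corresponding to b and a here). The loss differs from the exact negative log-density only by a log(y) term that does not depend on the parameters, so it has the same minimum:
import numpy as np
import tensorflow as tf
from scipy.stats import weibull_min

a_test, b_test, y_test = 20.0, 3.0, 18.9  # hypothetical scale, shape and target
loss_val = weibull_loglik_continuous(tf.constant([y_test]),
                                     tf.constant([[a_test, b_test]])).numpy()
reference = -(weibull_min.logpdf(y_test, b_test, scale=a_test) + np.log(y_test))
print(loss_val, reference)  # the two values should agree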
We can now add that to our model’s compile step:
from keras.optimizers import RMSprop

weibull_model.compile(loss=weibull_loglik_continuous, optimizer=RMSprop(lr=.001), metrics=['mae'])
Prediction and interpretation
As before with the negative binomial distribution, we pick a single element (the same one as before, so we can compare):
# for a single prediction:
element = 9
prediction = weibull_model.predict(X[element:element+1])
y_actual = y[element:element+1]
a = prediction.T[0,0]
b = prediction.T[1,0]
print(f'showing results for actual value {y_actual} using weibull parameters (scale/shape): {a}, {b}')
Yielding:
showing results for actual value [18.9] using weibull parameters (scale/shape): 20.299489974975586, 3.0945098400115967
Let’s plot a confidence interval (80%, as before):
import numpy as np

def weibull_pdf(a, b, t):
    return (b/a) * (t/a)**(b-1) * np.exp(-(t/a)**b)

def weibull_quantiles(a, b, p):
    return a * np.power(-np.log(1.0-p), 1.0/b)

def weibull_median(a, b):
    return a * (-np.log(.5))**(1/b)

# Evaluate the pdf over a non-negative range of prices around the actual value
t = np.arange(0, int(y_actual[0]) + 30)
confint = 80
median = weibull_median(a, b)
# For an 80% central interval we need the 10% and 90% quantiles
tail = (100 - confint) / 200
lower = weibull_quantiles(a, b, tail)
upper = weibull_quantiles(a, b, 1 - tail)
plt.plot(t, weibull_pdf(a, b, t), label='pdf')
plt.axvline(median, ls='--', color='red', label='median')
plt.axvline(lower, ls='--', label='lower ({}%)'.format(confint))
plt.axvline(upper, ls='--', color='green', label='upper ({}%)'.format(confint))
plt.axvline(y_actual[0], color='black', label='actual')
plt.legend();
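As an aside, these hand-rolled helpers can be cross-checked against scipy.stats.weibull_min (whose shape argument corresponds to b and whose scale corresponds to a); this is just a quick verification, not part of the workflow:
from scipy.stats import weibull_min

assert np.isclose(weibull_median(a, b), weibull_min.median(b, scale=a))
assert np.isclose(weibull_quantiles(a, b, 0.9), weibull_min.ppf(0.9, b, scale=a))
assert np.allclose(weibull_pdf(a, b, t[1:]), weibull_min.pdf(t[1:], b, scale=a))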

This is slightly different from the result from the negative binomial distribution, but the accuracy is similar. We can combine these graphs to show their respective shapes and confidence bands:

Summary
In this article we have looked at how to adapt a Keras model to output the parameters of a probability distribution rather than a single prediction. We need to choose the distribution wisely, but once the model is trained, we can make statements about the uncertainty of each prediction in a way that is independent of the training stage.
Links
https://github.com/ragulpr/wtte-rnn/
https://github.com/gm-spacagna/deep-ttf
