Evaluating a prediction interval#
In practice, producing a prediction interval that offers the correct level of coverage is tricky. A poorly fitting model will tend to produce PIs that are too wide, while more complex methods tend to be overconfident and produce intervals that are too narrow.
Imports#
import pandas as pd
import numpy as np
from forecast_tools.baseline import SNaive, Naive1
from forecast_tools.metrics import (winkler_score,
                                    absolute_coverage_difference,
                                    coverage)
from forecast_tools.datasets import load_emergency_dept
import warnings
warnings.filterwarnings('ignore')
Empirical Coverage#
A simple way to evaluate a prediction interval is to quantify its empirical coverage of a test set.
The intervals in the example below have 83% coverage of the true values.
intervals = np.array([[37520, 58225],
                      [29059, 49764],
                      [47325, 68030],
                      [36432, 57137],
                      [35865, 56570],
                      [33419, 54124]])
y_true = np.array([37463, 40828, 56148, 45342, 43741, 45907])
mean_cov = coverage(y_true, intervals)
print(round(mean_cov, 2))
0.83
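As a cross-check, the same figure can be reproduced directly from the arrays above. This is a minimal sketch of the calculation rather than forecast_tools' implementation:
# sketch: proportion of true values that fall inside their interval
lower, upper = intervals[:, 0], intervals[:, 1]
inside = (y_true >= lower) & (y_true <= upper)
print(round(inside.mean(), 2))  # 0.83 (5 of the 6 observations are covered)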
Absolute Coverage Difference (ACD)#
You can enhance the basic coverage metric by comparing it to the desired (or target) coverage that the method promises. The absolute coverage difference (ACD) is the absolute difference between the average coverage achieved and the target coverage level.
For example, if the average coverage is 83% and the target is 95% coverage, then the absolute coverage difference is
\(|0.83 - 0.95| = 0.12\)
intervals = np.array([[37520, 58225],
                      [29059, 49764],
                      [47325, 68030],
                      [36432, 57137],
                      [35865, 56570],
                      [33419, 54124]])
y_true = np.array([37463, 40828, 56148, 45342, 43741, 45907])
acd = absolute_coverage_difference(y_true, intervals, target=0.95)
print(round(acd, 2))
0.12
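Equivalently, the ACD can be computed by hand from the coverage value (a quick sketch of the same arithmetic):
# sketch: ACD = |achieved coverage - target coverage|
print(round(abs(coverage(y_true, intervals) - 0.95), 2))  # 0.12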
Winkler score (also called interval score)#
An alternative way to assess prediction intervals for a specific time series is to use the Winkler score.
The Winkler score is defined as
\[
W_{\alpha, t} =
\begin{cases}
(u_{\alpha, t} - l_{\alpha, t}) + \frac{2}{\alpha}(l_{\alpha, t} - y_t) & \text{if } y_t < l_{\alpha, t} \\
(u_{\alpha, t} - l_{\alpha, t}) & \text{if } l_{\alpha, t} \leq y_t \leq u_{\alpha, t} \\
(u_{\alpha, t} - l_{\alpha, t}) + \frac{2}{\alpha}(y_t - u_{\alpha, t}) & \text{if } y_t > u_{\alpha, t}
\end{cases}
\]
Where
\(u_{\alpha, t}\) is the upper prediction interval value for \(\alpha\) and horizon \(t\)
\(l_{\alpha, t}\) is the lower prediction interval value for \(\alpha\) and horizon \(t\)
\(y_t\) is the ground truth observation at horizon \(t\)
Sources
https://www.tandfonline.com/doi/abs/10.1198/016214506000001437
https://www.tandfonline.com/doi/pdf/10.1080/01621459.1972.10481224?needAccess=true
Interpretation#
A Winkler score is the width of the interval plus a penalty equal to \(2/\alpha\) times the deviation of the observation above or below the interval.
Smaller Winkler scores are better!
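To make the definition concrete, here is a minimal sketch of the score for a single observation, written directly from the piecewise form above (in practice forecast_tools' winkler_score provides this, including the multi-step case):
def winkler(interval, y, alpha):
    """Sketch of the Winkler score for one observation."""
    lower, upper = interval
    width = upper - lower
    if y < lower:
        # observation below the interval: penalise the shortfall
        return width + (2 / alpha) * (lower - y)
    if y > upper:
        # observation above the interval: penalise the excess
        return width + (2 / alpha) * (y - upper)
    # observation inside the interval: score is just the width
    return width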
Winkler score for an individual observation#
Assume our model has generated a prediction interval of [744.54, 773.22], and that the true observation, 741.84, falls below the lower bound of the interval. The Winkler score is therefore:
alpha = 0.2
interval = [744.54, 773.22]
y_t = 741.84
ws = winkler_score(interval, y_t, alpha)
print(round(ws, 2))
55.68
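This matches the definition: the observation falls below the lower bound, so the score is the interval width plus the scaled shortfall, \((773.22 - 744.54) + \frac{2}{0.2}(744.54 - 741.84) = 28.68 + 27.00 = 55.68\).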
If instead \(y_t\) fell within the interval, the Winkler score would simply equal the width of the interval.
interval = [744.54, 773.22]
y_t = 745.0
ws = winkler_score(interval, y_t, alpha)
print(f'Winkler score: {round(ws, 2)}; Interval width: {round(interval[1]-interval[0], 2)}')
Winkler score: 28.68; Interval width: 28.68
Winkler score for multiple step prediction#
More commonly, a model will generate predictions over a multi-step horizon. In the example below, Seasonal Naive is used to predict a 7 day holdout sample from the ED dataset.
HOLDOUT = 7
PERIOD = 7
attends = load_emergency_dept()
# train-test split
train, test = attends[:-HOLDOUT], attends[-HOLDOUT:]
model = SNaive(PERIOD)
# returns 80 and 90% prediction intervals by default.
preds, intervals = model.fit_predict(train, HOLDOUT, return_predict_int=True)
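# intervals[0] holds the 80% PIs, hence alpha=0.2 below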
ws = winkler_score(intervals[0], test, alpha=0.2)
print(f'Winkler score: {ws:.2f}')
Winkler score: 130.75
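Because smaller scores are better, the Winkler score can also be used to compare interval forecasts from competing methods on the same holdout. As an illustrative sketch (assuming Naive1 exposes the same fit_predict interface as SNaive):
# sketch: score a simpler benchmark's 80% intervals on the same holdout
model_n1 = Naive1()
preds_n1, intervals_n1 = model_n1.fit_predict(train, HOLDOUT,
                                              return_predict_int=True)
ws_n1 = winkler_score(intervals_n1[0], test, alpha=0.2)
print(f'Naive1 Winkler score: {ws_n1:.2f}')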