# Automatically selecting a naive model to use as a benchmark

forecast-tools provides an `auto_naive` function that uses point-forecast cross-validation to select the 'best' naive model to use as a benchmark.

The function tests all of the naive `Forecast` methods.

This notebook covers how to use `auto_naive` and how to troubleshoot its use when parameters conflict.
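For orientation, the sketch below gives textbook versions of four common naive benchmarks (last value, seasonal naive, average and drift) in plain Python. The function names are hypothetical and are not the forecast-tools classes; which methods `auto_naive` actually evaluates is determined by the library, not by this sketch.

```python
from statistics import mean


def naive1(y, horizon):
    """Repeat the last observation for every step ahead."""
    return [y[-1]] * horizon


def seasonal_naive(y, horizon, period):
    """Repeat the most recent seasonal cycle of length `period`."""
    last_cycle = y[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


def average(y, horizon):
    """Repeat the mean of the training data for every step ahead."""
    return [mean(y)] * horizon


def drift(y, horizon):
    """Extend the straight line joining the first and last observations."""
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [y[-1] + slope * h for h in range(1, horizon + 1)]


y = [10, 12, 14, 11, 13, 15]
print(naive1(y, 3))             # [15, 15, 15]
print(seasonal_naive(y, 4, 3))  # [11, 13, 15, 11]
print(average(y, 2))            # [12.5, 12.5]
print(drift(y, 2))              # [16.0, 17.0]
```

Each method needs no fitting beyond reading the training data, which is what makes them suitable as a baseline that more complex models should beat.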

## Imports

In [None]:
import sys

# if running in Google Colab install forecast-tools
if 'google.colab' in sys.modules:
    !pip install forecast-tools

In [1]:
import numpy as np
from forecast_tools.datasets import load_emergency_dept
from forecast_tools.model_selection import auto_naive                                    

In [2]:
help(auto_naive)

Help on function auto_naive in module forecast_tools.model_selection:

auto_naive(y_train, horizon=1, seasonal_period=1, min_train_size='auto', method='cv', step=1, window_size='auto', metric='mae')
    Automatic selection of the 'best' naive benchmark on a 'single' series
    
    The selection process uses out-of-sample cv performance.
    
    By default auto_naive uses cross validation to estimate the mean
    point forecast peformance of all naive methods.  It selects the method
    with the lowest point forecast metric on average.
    
    If there is limited data for training a basic holdout sample could be
    used.
    
    Dev note: the plan is to update this to work with multiple series.
    It would be best to use MASE for multiple series comparison.
    
    Parameters:
    ----------
    y_train: array-like
        training data.  typically in a pandas.Series, pandas.DataFrame
        or numpy.ndarray format.
    
    horizon: int, optional (default=1)
        Forecast ho

## Load the training data

In [3]:
y_train = load_emergency_dept()

## Select the best naive model for a forecast horizon of 7 days.

Let's select a method for the daily emergency department series to predict 7 days ahead.  By default the function uses the **mean absolute error** to evaluate forecast accuracy.

In [4]:
best = auto_naive(y_train, horizon=7, seasonal_period=7)
best

{'model': Average(), 'mae': 19.679856211931035}

In [5]:
y_preds = best['model'].fit_predict(y_train, horizon=7)
y_preds

array([221.06395349, 221.06395349, 221.06395349, 221.06395349,
       221.06395349, 221.06395349, 221.06395349])

## Using a different forecasting error metric

In [6]:
best = auto_naive(y_train, horizon=7, seasonal_period=7, metric='mape')
best

{'model': Average(), 'mape': 8.69955926909263}
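As a reminder of what the two metrics measure, here is a minimal sketch of MAE and MAPE in plain Python. The helper names are hypothetical, and scaling MAPE by 100 is an assumption (it is consistent with the value of about 8.7 reported above, but the library's own implementation is not shown here).

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the units of the series."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error (assumed to be scaled by 100).

    Undefined if any actual value is zero.
    """
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [200, 250, 100]
y_pred = [220, 225, 110]
print(round(mae(y_true, y_pred), 2))   # 18.33
print(round(mape(y_true, y_pred), 2))  # 10.0
```

MAE is expressed in the units of the data (here, attendances per day), whereas MAPE is scale free, so the two metrics are not guaranteed to rank methods in the same order.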

## Using a single train-test split when data are limited.

If the series is too short, relative to the forecast horizon, for h-step cross-validation to be feasible, you can select the benchmark using a single holdout sample instead.

In [7]:
best = auto_naive(y_train, horizon=7, seasonal_period=7, method='holdout')
best

{'model': Average(), 'mae': 30.182280627384486}
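To make the difference between the two approaches concrete, the sketch below enumerates the train/validation boundaries for an expanding-window (rolling origin) scheme and for a single holdout. It is an illustration under stated assumptions: the function names are hypothetical, the holdout is assumed to be the final `horizon` observations, and the exact splitting rules used inside forecast-tools may differ.

```python
def rolling_origin_splits(n_obs, min_train_size, horizon, step=1):
    """Yield (train_end, valid_end) pairs for an expanding training window.

    Training uses y[:train_end]; validation uses y[train_end:valid_end].
    """
    train_end = min_train_size
    while train_end + horizon <= n_obs:
        yield train_end, train_end + horizon
        train_end += step


def holdout_split(n_obs, horizon):
    """Single split: hold out the final `horizon` observations (assumed)."""
    return n_obs - horizon, n_obs


print(list(rolling_origin_splits(n_obs=20, min_train_size=10, horizon=7)))
# [(10, 17), (11, 18), (12, 19), (13, 20)]
print(holdout_split(n_obs=20, horizon=7))
# (13, 20)
```

Cross-validation averages the error over several forecast origins, while a holdout relies on one, so a holdout score is typically a noisier estimate. That is one plausible reason the MAE reported for `method='holdout'` differs from the cross-validated value obtained earlier.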

## Troubleshooting the use of `auto_naive`

**Problem 1:** The training data are shorter than `min_train_size` + `horizon`

For any validation to take place, including a simple holdout, the time series must be long enough to allow at least one train-test split.  This can be a problem when `seasonal_period` is set to a value close to the length of the time series.

In [8]:
# generate a synthetic daily time series of exactly one year in length.
y_train = np.random.randint(100, 250, size=365)

Let's set `seasonal_period=365` (the length of the time series) and `horizon=7`.

We will also manually set `min_train_size=365`.

This will raise a `ValueError` reporting that "The training data is shorter than min_train_size + horizon  No validation can be performed."

In [9]:
best = auto_naive(y_train, horizon=7, seasonal_period=365, method='ro', 
                  min_train_size=365, metric='mae')

best

ValueError: The training data is shorter than min_train_size=365 + horizon=7 No validation can be performed. 

A longer time series, or a shorter seasonal period combined with a smaller `min_train_size`, will fix this problem.
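A quick pre-flight check can catch this before calling `auto_naive`. The sketch below assumes the rule implied by the error message, namely that at least `min_train_size + horizon` observations are needed; the helper names are hypothetical and are not part of forecast-tools.

```python
def can_validate(n_obs, min_train_size, horizon):
    """True if at least one train/validation split fits in the series."""
    return n_obs >= min_train_size + horizon


def largest_min_train_size(n_obs, horizon):
    """Largest min_train_size that still leaves room for one split."""
    return n_obs - horizon


# The failing case above: 365 observations, min_train_size=365, horizon=7
print(can_validate(365, 365, 7))       # False
# Fix 1: a series that is 7 days longer
print(can_validate(365 + 7, 365, 7))   # True
# Fix 2: a smaller minimum training size
print(can_validate(365, 7, 7))         # True
print(largest_min_train_size(365, 7))  # 358
```

Note that a series of exactly `min_train_size + horizon` observations leaves room for only one split, so the resulting score rests on a single forecast origin.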

In [10]:
# a longer synthetic time series.
y_train = np.random.randint(100, 250, size=365+7)
best = auto_naive(y_train, horizon=7, seasonal_period=365, method='ro', 
                  min_train_size=365, metric='mae')

best

{'model': Average(), 'mae': 43.29549902152642}

In [11]:
# a shorter seasonal period and minimum training size
y_train = np.random.randint(100, 250, size=365)
best = auto_naive(y_train, horizon=7, seasonal_period=7, method='ro', 
                  min_train_size=7, metric='mae')

best

{'model': Average(), 'mae': 37.50786553941686}