Empirial Distributions

Empirial Distributions#

In many applications of simulation, theoretical distributons may not be a good fit to the data; for example, if the data are bimodal. A number of empirical distribution options are provided by sim-tools to support modelling in these situations

RawContinuousEmpirical
RawDiscreteEmpirical
GroupedContinuousEmpirical
DiscreteEmpirical

from sim_tools.distributions import (
    RawContinuousEmpirical,
    GroupedContinuousEmpirical,
    RawDiscreteEmpirical,
    DiscreteEmpirical,
    DistributionRegistry
)

import math
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "notebook"

1. Sampling from the raw data with interpolation#

If a user has access to the raw sample data they can pass this to RawContinuousEmpirical. The class implements the method described in Law and Kelton’s Simulation Modeling and Analysis. The approach creates a piecewise linear model where interpolation is used for gaps between data samples. The maximum and minimum values of the sample are the bounds of the distribution.

1.1 A simple example#

The example below illustrates the difference between the standard Empirical Cumulative Distribution Function (a stepped function) and Law and Kelton’s method of linear interpolation between points that is used in the sampling.

# Generate a simple dataset with 10 values representing length of stay in days
# Values chosen to be spread out to clearly show linear lines between points
simple_los = np.array([1.0, 3.5, 5.0, 7.2, 10.0, 12.5, 15.0, 18.0, 22.0, 30.0])

simple_dist = RawContinuousEmpirical(simple_los)

# plot the standard ECDF
fig = simple_dist.plotly_ecdf_standard(showlegend=False)

# Display the plot
fig.show()

# Create the empirical distribution
simple_dist = RawContinuousEmpirical(data=simple_los, random_seed=42)

# plot the linear interpolation
fig = simple_dist.plotly_ecdf_linear_interpolation()

# Display the plot
fig.show()

1.2 More complex data#

We will generate four datasets to illustrate the distributon in use. One is approx uniform , one right skewed (exponential), one bimodal (a mix of two normal distributions) and finally one with outliers.

# Create a dictionary of distribution configurations for batch creation
dist_configs = {
    # 1. Basic uniform distribution for Hospital A
    'uniform_dist': {
        'class_name': 'Uniform',
        'params': {
            'low': 0.5,
            'high': 45.5
        }
    },
    
    # 2. Exponential distribution for Hospital B
    'exponential_dist': {
        'class_name': 'Exponential',
        'params': {
            'mean': 5
        }
    },
    
    # 3. Normal distributions for bimodal Hospital C
    'normal_short': {
        'class_name': 'Normal',
        'params': {
            'mean': 3,
            'sigma': 1
        }
    },
    'normal_long': {
        'class_name': 'Normal',
        'params': {
            'mean': 20,
            'sigma': 6
        }
    },
    
    # 4. Distributions for the realistic Hospital D
    'lognormal_dist': {
        'class_name': 'Lognormal',
        'params': {
            'mean': 5.4,
            'stdev': 3.5
        }
    },
    'uniform_outliers': {
        'class_name': 'Uniform',
        'params': {
            'low': 30,
            'high': 45
        }
    }
}

# Create distributions using the DistributionRegistry create_batch method
# Setting only the main_seed parameter - the class handles individual seeds automatically
distributions = DistributionRegistry.create_batch(dist_configs, main_seed=42)

# Sample from distributions to create our hospital length of stay data
# 1. Hospital A - uniform distribution
los_uniform = distributions['uniform_dist'].sample(100)

# 2. Hospital B - exponential distribution with offset and clipping
los_skewed = distributions['exponential_dist'].sample(100) + 0.5
los_skewed = np.clip(los_skewed, 0.5, 45.5)

# 3. Hospital C - bimodal distribution (combining two normal distributions)
los_bimodal = np.concatenate([
    distributions['normal_short'].sample(70),  # Short stays centered around 3 days
    distributions['normal_long'].sample(30)    # Longer stays centered around 20 days
])
los_bimodal = np.clip(los_bimodal, 0.5, 45.5)  # Clip to our desired range

# 4. Hospital D - realistic distribution with outliers
los_realistic = np.concatenate([
    distributions['lognormal_dist'].sample(95),  # Main distribution
    distributions['uniform_outliers'].sample(5)  # Few outliers with very long stays
])
los_realistic = np.clip(los_realistic, 0.5, 45.5)  # Clip to our desired range

# Create a DataFrame to organize our data
hospital_data = pd.DataFrame({
    'Hospital A': los_uniform,
    'Hospital B': los_skewed,
    'Hospital C': los_bimodal,
    'Hospital D': los_realistic
})

# A summary of the synthetic dataset
hospital_data.describe()

	Hospital A	Hospital B	Hospital C	Hospital D
count	100.000000	100.000000	100.000000	100.000000
mean	22.816191	6.210169	8.031861	6.724240
std	13.473663	5.504185	8.356952	6.686789
min	1.156456	0.524006	0.863552	1.558499
25%	10.534497	2.255529	2.709289	3.443588
50%	23.637873	4.358754	3.721741	4.683944
75%	33.637560	8.140663	13.543078	7.103679
max	45.348266	25.440116	32.305724	36.572435

The function create_combined_plotly_figure is included as a helper to merge four plots of the hospital datasets together.

def create_combined_plotly_figure(
    distributions_dict,
    plot_method_name="plotly_ecdf_linear_interpolation",
    cols=2,
    plot_args=None, # Dictionary to pass arguments to the plot method
    subplot_title_prefix="",
    main_title="Combined Distribution Plots",
    share_xaxes=False,
    share_yaxes=False,
    layout_options=None, # Dictionary for fig.update_layout
    ) -> go.Figure:
    """
    Combines multiple Plotly figures from a dictionary of distributions
    into a single figure with subplots.

    Parameters
    ----------
    distributions_dict : dict
        A dictionary where keys are names (e.g., 'Hospital A') and values
        are instances of RawContinuousEmpirical (or similar).
    plot_method_name : str, default="plot_ecdf_linear_interpolation_plotly"
        The name of the method on the distribution object that returns
        a Plotly figure (e.g., 'plotly_ecdf_standard_plot').
    cols : int, default=2
        Number of columns in the subplot grid. Rows are calculated automatically.
    plot_args : dict, optional
        Keyword arguments to pass to the individual plotting method specified
        by plot_method_name.
    subplot_title_prefix : str, default=""
        A prefix to add before the distribution name in subplot titles.
    main_title : str, default="Combined Distribution Plots"
        The main title for the combined figure.
    share_xaxes : bool, default=False
        Whether to share x-axes across subplots [2].
    share_yaxes : bool, default=False
        Whether to share y-axes across subplots [2].
    layout_options : dict, optional
        Additional layout options to apply to the final figure using
        fig.update_layout().

    Returns
    -------
    plotly.graph_objects.Figure
        The combined Plotly figure with subplots.
    """
    if not distributions_dict:
        return go.Figure().update_layout(title_text="No distributions provided")

    if plot_args is None:
        plot_args = {}

    dist_names = list(distributions_dict.keys())
    num_plots = len(dist_names)
    rows = math.ceil(num_plots / cols)

    subplot_titles = [f"{subplot_title_prefix}{name}" for name in dist_names]

    # Create the subplot figure structure [2, 3]
    fig = make_subplots(
        rows=rows,
        cols=cols,
        subplot_titles=subplot_titles,
        shared_xaxes=share_xaxes,
        shared_yaxes=share_yaxes
    )

    # Iterate through distributions, generate plots, and add traces [1, 4]
    for i, name in enumerate(dist_names):
        distribution = distributions_dict[name]
        plot_method = getattr(distribution, plot_method_name)

        # Generate the individual figure for this distribution
        # Pass any specific plot arguments
        individual_fig = plot_method(**plot_args)

        # Calculate subplot position (1-based index)
        current_row = i // cols + 1
        current_col = i % cols + 1

        # Add each trace from the individual figure to the main subplot figure [1, 5]
        for trace in individual_fig.data:
            # Preserve legend grouping if the individual plot used it
            trace.legendgroup = name
            # Optionally hide duplicate legend entries if desired,
            # though Plotly often handles this with legendgroup
            if i > 0:
                 trace.showlegend = plot_args.get('showlegend', True) # Keep legend status consistent

            fig.add_trace(trace, row=current_row, col=current_col)

    # Update overall layout
    fig.update_layout(title_text=main_title)
    # Hide duplicate legends if traces weren't automatically grouped well
    # fig.update_layout(showlegend=True) # Or False if too cluttered

    if layout_options:
        fig.update_layout(layout_options)


    return fig

# create the datasets
distributions = {}
for hospital in hospital_data.columns:
    distributions[hospital] = RawContinuousEmpirical(
        data=hospital_data[hospital],
        random_seed=42
    )

# plot the linear interpolation function
plot_args = {"showlegend": False}
combined_fig = create_combined_plotly_figure(
    distributions, 
    plot_method_name='plotly_ecdf_linear_interpolation',
    plot_args=plot_args,
    cols=2,
    main_title="Linear Interpolation ECDFs",
)
combined_fig.show()

Lastly, we can check the sampling is working as expected using the function plotly_original_versus_sampled. This will compare the original raw data with the samples in a histogram. Change the hospital name to see a different chart.

def plotly_original_versus_sampled(sample_dict, original_data):
    """
    Creates a Plotly chart with a dropdown to switch between datasets in sample_dict.

    Args:
        sample_dict (dict): Dictionary where keys are dataset names and values are the data.
        original_data (dict): Dictionary containing the original data. Same keys as sample_dict.

    Returns:
        plotly.graph_objects.Figure: A Plotly figure with the original data and a dropdown menu.
    """

    # Create a figure with subplots
    fig = make_subplots(rows=1, cols=2, subplot_titles=("Original Data", "Sampled Data"))

    first_key = next(iter(sample_dict))

    # Add original data histogram
    fig.add_trace(
        go.Histogram(x=original_data[first_key], nbinsx=20, name="Original", histnorm='probability density'),
        row=1, col=1
    )

    # Add initial sampled data histogram (first key in the dictionary)
    
    fig.add_trace(
        go.Histogram(x=sample_dict[first_key], nbinsx=20, name="Sampled", histnorm='probability density'),
        row=1, col=2
    )

    # Create dropdown buttons
    dropdown_buttons = []
    for key in sample_dict:
        n = len(sample_dict[key])
        button = dict(method='update',
                      label=key,
                      args=[{'x': [original_data[key], sample_dict[key]]},
                            {'title': f"{key}: Original vs Sampled Data ({key}; n = {n})"}])
        dropdown_buttons.append(button)


    # Update layout with dropdown menu
    fig.update_layout(
        updatemenus=[dict(active=0,
                          buttons=dropdown_buttons,
                          x=1.0,          # Center horizontally
                          xanchor='right', # Anchor point is the center
                          y=1.1,          # Position vertically (adjust as needed)
                          yanchor='top')], # Anchor point is the top
        title_text=f"{first_key}: Original vs Sampled Data ({first_key})",
        xaxis_title_text="Length of Stay (Days)",
        yaxis_title_text="Relative Freq",
        height=500,
        width=900
    )

    return fig

# Generate samples from each distribution
samples = {}
for hospital in hospital_data.columns:
    samples[hospital] = distributions[hospital].sample(size=1000)


fig = plotly_original_versus_sampled(samples, hospital_data)


fig.show()

2. Sampling from the Raw data directly#

If the raw data is available then an alternative to Law and Kelton’s interpolation method israw data directly using RawDiscreteEmpirical.

distributions = {}
samples = {}
for hospital in hospital_data.columns:
    distributions[hospital] = RawDiscreteEmpirical(hospital_data[hospital])
    samples[hospital] = distributions[hospital].sample(size=1000)

fig = plotly_original_versus_sampled(samples, hospital_data)

fig.show()

3. Sampling from grouped empirical data#

In some circumstances a modeller may need to sample from simpler grouped data. For example, if the raw data are not available (e.g. if taken from an academic research paper) or if the dataset is large and causes memory issues or a bottleneck in model run time.

GroupedContinuousEmpirical implements the interpolation method described in Law and Kelton’s Simulation Modeling and Analysis.

We need to provide the data as k adjacent intervals

\[[a_0, a1), [a_1, a_2), \dots, [a_{k-1}, a_k)\]

In practice we provide GroupedContinuousEmpirical with two arrays: lower_bounds and upper_bounds. Note that the intervals do no need to be equally spaced.

A third array freq contains the number of observations in each interval.

grp_dist = GroupedContinuousEmpirical(
    lower_bounds=[0, 5, 10, 15, 30, 45, 60, 120, 180, 240, 480],
    upper_bounds=[5, 10, 15, 30, 45, 60, 120, 180, 240, 480, 2880],
    freq=[34, 4, 8, 13, 15, 13, 19, 13, 9, 12, 73],
    random_seed=42
)

# sample will interpolate across and within each group.
grp_dist.sample()

1297.0730413946362