Year of publication#

Primary research questions:#

The notebook partly answers the following research question:

How is sharing affected by FOSS, Covid-19, publication type and year of publication?

Specifically, we look at the number of shared models split into general health and Covid-19 subgroups. We also analyse the number of shared models as a proportion of the literature.

1. Imports#

1.1. Standard Imports#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

# for converting svg to tif
from PIL import Image

# set up plot style as ggplot
plt.style.use('ggplot')

1.2 Imports from preprocessing module#

# function for loading full dataset
from preprocessing import load_clean_dataset

2. Constants#

FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/share_sim_data_extract.zip'

RG_LABEL = 'reporting_guidelines_mention'
NONE = 'None'
WIDTH = 0.5

3. Functions#

3.1. Functions to create summary statistics#

Two functions are used together to generate the high-level results by year.

high_level_metrics - takes a subgroup of the dataset and generates summary counts and proportions.
analysis_by_year - loops through the years, passing each to high_level_metrics, and concatenates the results at the end.

def high_level_metrics(df, name='None'):
    '''A simple high level summary of the review.
    
    Returns a pd.Series containing simple high level counts
    and proportions in the data.
    
    Params:
    -------
    df: pd.DataFrame 
        A cleaned dataset.  Could be overall or subgroups/categories
        
    name: str, optional (default='None')
        Name given to the returned pd.Series (e.g. the publication year)
        
    Returns:
    --------
        pd.Series 
    '''
    results = {}
    included = df[df['study_included'] == 1]
    available = included[included['model_code_available'] == 1]
    results['n_included'] = len(included[included['study_included'] == 1])
    results['n_foss'] = len(included[included['foss_sim'] == '1'])
    results['n_covid'] = len(included[included['covid'] == 1])
    results['n_avail'] = len(included[included['model_code_available'] == 1])
    results['n_foss_avail'] = len(available[available['foss_sim'] == '1'])
    results['n_covid_avail'] = len(available[available['covid'] == 1])
    results['per_foss'] = results['n_foss'] / results['n_included']
    results['per_covid'] = results['n_covid'] / results['n_included']
    results['per_avail'] = results['n_avail'] / results['n_included']
    results['per_foss_avail'] = results['n_foss_avail'] / results['n_foss']
    
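    # 2019 has no Covid-19 studies (`n_covid` = 0), so the proportion is
    # undefined; record it as NaN instead of dividing by zero
    if results['n_covid'] > 0:
        results['per_covid_avail'] = \
            results['n_covid_avail'] / results['n_covid']
    else:
        results['per_covid_avail'] = np.nan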
    results['reporting_guide'] = \
        len(included[included['reporting_guidelines_mention'] != 'None'])
    results['per_reporting_guide'] = \
        results['reporting_guide'] / results['n_included']
    return pd.Series(results, name=name)

def analysis_by_year(df_clean, decimals=4):
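    '''
    Conducts a high level analysis by year of publication
    2019-2022
    
    Params:
    -------
    df_clean: pd.DataFrame
        Assumes a cleaned version of the dataset.
        
    decimals: int, optional (default=4)
        Number of decimal places used to round the results.
    
    Returns: 
    --------
    pd.DataFrame
        Containing the result summary.  Metrics are rows; the columns are
        the publication years followed by 'overall'.
        
    '''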
    overall_results = high_level_metrics(df_clean, 'overall')
    year_results = []
    years = df_clean['pub_yr'].unique().tolist()
    for year in years:
        subset = df_clean[df_clean['pub_yr'] == year]
        year_results.append(high_level_metrics(subset, name=str(year)))
    year_results = [overall_results] + year_results
    year_results = pd.DataFrame(year_results).T.round(decimals)
    return year_results[sorted(year_results.columns.tolist())]
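The loop-and-concatenate pattern above can also be illustrated on a toy dataset. The following is a minimal sketch, not the notebook's implementation: it swaps the explicit loop for a pandas `groupby`, uses made-up rows, and assumes `study_included` and `model_code_available` are coded as 0/1 integers. The names `toy`, `by_year` and `toy_summary` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values, not taken from the review dataset)
toy = pd.DataFrame({
    'pub_yr': [2019, 2019, 2020, 2020, 2020],
    'study_included': [1, 1, 1, 1, 0],
    'model_code_available': [0, 1, 1, 0, 1],
})

# keep included studies only, as high_level_metrics does
included = toy[toy['study_included'] == 1]

# per-year counts of included studies and shared models
# (summing the flag assumes it is coded 0/1)
by_year = included.groupby('pub_yr').agg(
    n_included=('study_included', 'size'),
    n_avail=('model_code_available', 'sum'),
)
by_year['per_avail'] = by_year['n_avail'] / by_year['n_included']

# transpose so that metrics are rows and years are columns
toy_summary = by_year.T
toy_summary.columns = toy_summary.columns.astype(str)
print(toy_summary)
```

The transposed layout mirrors the table returned by `analysis_by_year`, with one column per publication year.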

3.2 Functions to plot results#

get_subgroups_as_dataframes - creates simpler-to-use pandas objects containing annual counts of shared and non-shared models, split into non-covid and covid subgroups.
plot_sharing_by_year - plot with 2 subfigures. The top subfigure is a stacked bar chart of general health models + covid-19 models that have been shared, by year of publication. The lower subfigure presents yearly results as a percentage of total papers included in the study.
plot_all_subgroups_by_year - as above, but the top subfigure stacks all four subgroups (shared and non-shared, general health and covid-19).

def get_subgroups_as_dataframes(yr_summary):
    '''
    Create separate subgroups from the yr_summary table
    and return as independent DataFrame objects.  The overall column
    is dropped and only annual figures are included.
    
    Makes assumptions about the naming of columns in yr_summary.
    
    1. Non covid models shared
    2. Covid models shared
    3. Covid models not shared
    4. Non covid models not shared.
    
    Params:
    -------
    yr_summary: pd.DataFrame
        Dataframe containing summary of metrics by publication year
        
    Returns:
    -------
    tuple
    non-covid shared, covid shared, covid not shared, non covid not shared.
    '''
    
    # non covid shared models 
    non_covid_shared = yr_summary.T['n_avail'] \
        - yr_summary.T['n_covid_avail']
    
    # covid shared models
    covid_shared = yr_summary.T['n_covid_avail']
    
    # Covid models not shared
    covid_not_shared = yr_summary.T['n_covid'] \
        - yr_summary.T['n_covid_avail']
    
    # non covid not shared: all non covid studies minus non covid shared
    non_covid_not_shared = yr_summary.T['n_included'] \
        - yr_summary.T['n_covid'] - non_covid_shared
    
    # trim the overall column
    return non_covid_shared[:-1], covid_shared[:-1], covid_not_shared[:-1], \
           non_covid_not_shared[:-1]
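One way to sanity-check the subgroup arithmetic is to confirm that the four subgroups partition the included studies in each year. The sketch below is illustrative only: it uses the 2019-2022 counts reported in the summary table of section 5.1, computes non-covid not shared as (non-covid studies) minus (non-covid shared), and the name `counts` is hypothetical.

```python
import pandas as pd

# 2019-2022 counts copied from the summary table in section 5.1
counts = pd.DataFrame(
    {'n_included': [126, 135, 158, 145],
     'n_covid': [0, 9, 35, 25],
     'n_avail': [5, 14, 15, 13],
     'n_covid_avail': [0, 4, 11, 2]},
    index=['2019', '2020', '2021', '2022'])

non_covid_shared = counts['n_avail'] - counts['n_covid_avail']
covid_shared = counts['n_covid_avail']
covid_not_shared = counts['n_covid'] - counts['n_covid_avail']
# all non-covid studies minus the non-covid studies that shared a model
non_covid_not_shared = (counts['n_included'] - counts['n_covid']
                        - non_covid_shared)

# the four subgroups should add up to the included studies in every year
total = (non_covid_shared + covid_shared
         + covid_not_shared + non_covid_not_shared)
assert (total == counts['n_included']).all()
print(non_covid_not_shared.to_dict())
```

For example, in 2020 there are 135 - 9 = 126 non-covid studies, of which 14 - 4 = 10 shared a model, leaving 116 that did not.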

def plot_sharing_by_year(n_general_models, n_covid_models, 
                         per_models, xlabels, width=WIDTH, grid=True,
                         figsize=(6, 5)):
    '''
    Plot the number of models shared by year of publication (in citation).
    
    In this function we split models into two subgroups: covid and non-covid
    Non-covid models are referred to as 'general health models'
    
    The function creates a figure with two subplots.  
    
    Subplot 1: A stacked barchart of general health models (non covid) 
    + covid-19 models that have been shared by year of publication.  A legend is
    provided to distinguish between general and covid models.
    
    Subplot 2: A bar chart expressing the total number of shared models as a 
    percentage of the number of included studies.
    
    Params:
    -------
    n_general_models: pd.DataFrame
        Count of non covid-19 models that have been shared by year of pub.
        
    n_covid_models: pd.DataFrame
         Count of covid-19 models that have been shared by year of pub.
         
    per_models: pd.DataFrame
        Percent of total studies included that have been shared by year of pub.
        
    xlabels: list:
        The years that represent the x-axis
        
    width: float, optional (default=WIDTH)
        A parameter to manipulate the width of the bars (applies to both plots)
        
    grid: bool, optional (default=True)
        Display grid lines (applies to both subplots)
        
    figsize: tuple(int, int), optional (default=(6, 5))
        The size of the figure.
        
    Returns:
    --------
    fig, ax

    '''
    fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=False, figsize=figsize)

    # include x, y grid 
    if grid:
        ax1.grid(ls='--', axis='y')
        ax2.grid(ls='--', axis='y')
    
    # plot one: stacked bar chart.  general health + covid models
    y1 = ax1.bar(xlabels, n_general_models, 
                label='General Health', width=width)

    y2 = ax1.bar(xlabels, n_covid_models,
               label='Covid-19', bottom=n_general_models, width=width)

    _ = ax1.set_ylabel('Shared Models')
    _ = ax1.legend(ncols=1, loc='upper left', fontsize=8)
    
    # force integer values on y axis
    ax1.yaxis.set_major_locator(MaxNLocator(integer=True))

        
    ax1.bar_label(y2)
    
    # set size of x, y ticks
    ax1.tick_params(axis='both', labelsize=9)
    ax2.tick_params(axis='both', labelsize=9)

    # plot 2: percentage available by year
    y2 = ax2.bar(xlabels, per_models * 100.0, label='All models', 
                width=width, color='green')
    
    labels = [f'{m:.1f}' for m in (per_models * 100.0)]
    
    ax2.bar_label(ax2.containers[0], labels=labels)


    _ = ax2.set_ylabel('Percent of literature')
    
    return fig, (ax1, ax2)
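The stacking in the top subplot relies on the `bottom` argument of `Axes.bar`, and `Axes.bar_label` applied to the upper container then annotates each bar with the stacked total. The following is a minimal standalone sketch of that technique rather than a call to `plot_sharing_by_year`. The counts are derived from the summary table in section 5.1, and the `Agg` backend is an assumption made so the example runs without a display.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

years = ['2019', '2020', '2021', '2022']
general = [5, 10, 4, 11]   # n_avail - n_covid_avail from the 5.1 table
covid = [0, 4, 11, 2]      # n_covid_avail from the 5.1 table

fig, ax = plt.subplots(figsize=(6, 3))
lower = ax.bar(years, general, width=0.5, label='General Health')
# `bottom` offsets the second series so it sits on top of the first
upper = ax.bar(years, covid, width=0.5, bottom=general, label='Covid-19')
# with the default edge placement, the label shows the end-point of each
# bar segment, which for the upper container is the stacked total
ax.bar_label(upper)
ax.legend()

# top edge of each stacked bar = general + covid
stack_tops = [rect.get_y() + rect.get_height() for rect in upper]
print(stack_tops)
```

The stacked totals equal the `n_avail` row of the summary table, which is why labelling only the upper container is enough.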

def plot_all_subgroups_by_year(non_covid_shared, covid_shared, 
                               per_models, non_covid_not_shared, 
                               covid_not_shared,
                               xlabels, width=WIDTH, grid=True,
                               figsize=(6, 6)):
    '''
    Plot the number of models across subgroups by year of publication (in citation).
    
    In this function we split models into four subgroups: covid and non-covid 
    and shared and non-shared.
    
    Non-covid models are referred to as 'general health models'
    
    The function creates a figure with two subplots.  
    
    Subplot 1: A stacked barchart of unshared general health models (non covid)
    + unshared covid-19 + shared general health + shared covid-19 models
    by year of publication.  
    
    A legend is provided to distinguish between the four subgroups
    
    Subplot 2: A bar chart expressing the total number of shared models as a 
    percentage of the number of included studies.
    
    Params:
    -------
    non_covid_shared: pd.DataFrame
        Count of non covid-19 models that have been shared by year of pub.
        
    covid_shared: pd.DataFrame
         Count of covid-19 models that have been shared by year of pub.
         
    non_covid_not_shared: pd.DataFrame
        Count of non covid-19 models that have NOT been shared by year of pub.
        
    covid_not_shared: pd.DataFrame
         Count of covid-19 models that have NOT been shared by year of pub.
         
    per_models: pd.DataFrame
        Percent of total studies included that have been shared by year of pub.
        
    xlabels: list:
        The years that represent the x-axis
        
    width: float, optional (default=WIDTH)
        A parameter to manipulate the width of the bars (applies to both plots)
        
    grid: bool, optional (default=True)
        Display grid lines (applies to both subplots)
        
    figsize: tuple(int, int), optional (default=(6, 6))
        The size of the figure.
        
    Returns:
    --------
    fig, ax

    '''
    
    fig, (ax1, ax2) = plt.subplots(nrows=2, sharex=True, figsize=figsize)

    # include x, y grid 
    if grid:
        ax1.grid(ls='--', axis='y')
        ax2.grid(ls='--', axis='y')
    
    # plot one: stacked bar chart.  general health + covid models
    y1 = ax1.bar(xlabels, non_covid_not_shared, 
                label='Closed General', width=width)

    y2 = ax1.bar(xlabels, covid_not_shared,
                 label='Closed Covid-19', bottom=non_covid_not_shared, 
                 width=width)
    
    y3 = ax1.bar(xlabels, non_covid_shared,
                 label='Open General', 
                 bottom=non_covid_not_shared+covid_not_shared, width=width)

    y4 = ax1.bar(xlabels, covid_shared,
                 label='Open Covid-19', 
                 bottom=non_covid_not_shared+covid_not_shared+
                 non_covid_shared, width=width)

    _ = ax1.set_ylabel('Models')
    
    _ = fig.legend(loc='upper left', ncol=2, bbox_to_anchor=(0.11, 1.0), 
                   fontsize=9)

        
    ax1.bar_label(y4)

    # set size of x, y ticks
    ax1.tick_params(axis='both', labelsize=9)
    ax2.tick_params(axis='both', labelsize=9)

    # plot 2: percentage available by year
    _ = ax2.bar(xlabels, per_models, label='All models', 
                width=width, color='green')

    _ = ax2.set_ylabel('Percent of literature')
    
    return fig, (ax1, ax2)

4. Read in data#

clean = load_clean_dataset(FILE_NAME)

5. Results#

5.1 Overall summary table#

# overall
year_summary = analysis_by_year(clean)
year_summary

                         2019      2020      2021      2022   overall
n_included           126.0000  135.0000  158.0000  145.0000  564.0000
n_foss                18.0000   21.0000   35.0000   27.0000  101.0000
n_covid                0.0000    9.0000   35.0000   25.0000   69.0000
n_avail                5.0000   14.0000   15.0000   13.0000   47.0000
n_foss_avail           5.0000    7.0000   10.0000    7.0000   29.0000
n_covid_avail          0.0000    4.0000   11.0000    2.0000   17.0000
per_foss               0.1429    0.1556    0.2215    0.1862    0.1791
per_covid              0.0000    0.0667    0.2215    0.1724    0.1223
per_avail              0.0397    0.1037    0.0949    0.0897    0.0833
per_foss_avail         0.2778    0.3333    0.2857    0.2593    0.2871
per_covid_avail           NaN    0.4444    0.3143    0.0800    0.2464
reporting_guide       10.0000   21.0000   12.0000   29.0000   72.0000
per_reporting_guide    0.0794    0.1556    0.0759    0.2000    0.1277
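The proportion rows in the table can be reproduced from the count rows. The sketch below is a worked check under stated assumptions, not part of the original notebook: the counts are copied from the table, and the Series names are reused here only for illustration.

```python
import numpy as np
import pandas as pd

cols = ['2019', '2020', '2021', '2022', 'overall']
n_included = pd.Series([126, 135, 158, 145, 564], index=cols)
n_avail = pd.Series([5, 14, 15, 13, 47], index=cols)
n_covid = pd.Series([0, 9, 35, 25, 69], index=cols)
n_covid_avail = pd.Series([0, 4, 11, 2, 17], index=cols)

# shared models as a proportion of included studies
per_avail = (n_avail / n_included).round(4)

# swapping zero denominators for NaN makes the undefined 2019 value
# explicit (and would avoid inf if a numerator were non-zero)
per_covid_avail = (n_covid_avail / n_covid.replace(0, np.nan)).round(4)

print(per_avail.to_dict())
print(per_covid_avail.to_dict())
```

Overall, 47 of the 564 included studies shared a model (about 8.3%), and the 2019 Covid-19 proportion is undefined because no Covid-19 studies were published that year.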

Average included and Covid-19 studies by year.#

year_summary.loc['n_included'][:-1].mean()

141.0

year_summary.loc['n_covid'][1:-1].mean()

23.0
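The two slices above select different sets of columns: `[:-1]` drops the trailing `overall` column, while `[1:-1]` also drops 2019, which has no Covid-19 studies. The sketch below illustrates this with the counts reported in the summary table. It uses `.iloc` to make the positional slicing explicit, which is a stylistic choice rather than the notebook's own code.

```python
import pandas as pd

cols = ['2019', '2020', '2021', '2022', 'overall']
n_included = pd.Series([126, 135, 158, 145, 564], index=cols)
n_covid = pd.Series([0, 9, 35, 25, 69], index=cols)

# mean included studies per year: all four years, excluding 'overall'
mean_included = n_included.iloc[:-1].mean()

# mean Covid-19 studies per year: 2020-2022 only
mean_covid = n_covid.iloc[1:-1].mean()

print(mean_included, mean_covid)
```

That is, 564 / 4 = 141 included studies per year on average, and 69 / 3 = 23 Covid-19 studies per year across 2020-2022.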

Model and code sharing practices in healthcare discrete-event simulation - a systematic review

5.3. Sharing of models by year of publication#

5.4 All sharing subgroups by year of publication.#