Overall summary

This notebook summarises how well the shared models meet our best practices for enabling others to use and execute a simulation model. It answers the following research question:

  1. To what extent does the DES health community follow best practice for open science when sharing computer models?

Data used in analysis

The dataset is a subset of the main review, limited to the models that were shared. The type of model shared is coded as Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) versus CODE (e.g. Matlab, Python, SimPy, Java, R Simmer).
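To make the VIM versus CODE coding concrete, here is a minimal sketch of a lookup built only from the example software named above. The helper `code_model_format` and both sets are hypothetical; the review's actual coding procedure is not shown in this notebook and may have been done by hand.

```python
# Illustrative lookup only: the software names mirror the examples in the text.
VIM_SOFTWARE = {'anylogic', 'simul8', 'arena'}
CODE_SOFTWARE = {'matlab', 'python', 'simpy', 'java', 'r simmer'}


def code_model_format(sim_software):
    '''Hypothetical helper: map a software name to 'VIM' or 'CODE'.'''
    name = sim_software.strip().lower()
    if name in VIM_SOFTWARE:
        return 'VIM'
    if name in CODE_SOFTWARE:
        return 'CODE'
    # unknown software is left uncoded for manual review
    return None


examples = ['Simul8', 'SimPy', 'Arena', 'R Simmer']
coded = [code_model_format(s) for s in examples]
print(coded)
```

Returning `None` for unrecognised software keeps ambiguous cases visible rather than silently assigning them to a category.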

The following fields are analysed in this notebook.

  • model_format - VIM or CODE

  • model_has_doi - do the model artefacts have their own minted DOI? (0/1)

  • orcid - do the researchers provide an ORCID with the model? (0/1)

  • license - does the model have an explicit license defining how it can be used? (str)

  • readme - is there an obvious file(s) where a user would look first? (0/1)

  • steps_run - are there steps to run a model? (0/1)

  • formal_dep_mgt - has the model been shared with formal software dependency management? (0/1)

  • informal_dep_mgt - have any informal methods of dependency management been shared? E.g. a list of software requirements. (0/1)

  • evidence_testing - do the model and artefacts in the repository contain any evidence that they have been tested? (0/1)

  • downloadable - can the model and artefacts be downloaded and executed locally? (0/1)

  • interactive_online - can the model and its artefacts be executed online without local installation? (0/1)

1. Imports

1.1. Standard

import pandas as pd
import numpy as np

1.2. Preprocessing

from preprocessing import load_clean_bpa, drop_columns

2. Constants

FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/bp_audit.zip'
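The `load_clean_bpa` helper that consumes `FILE_NAME` lives in the `preprocessing` module and is not shown here. As a rough sketch of the first step it might perform, pandas can read a zip archive directly when it holds a single CSV file. The function name `load_zipped_csv` and the assumption that the archive contains one CSV are illustrative only; a small local archive stands in for the remote file so the example runs offline.

```python
import os
import tempfile
import zipfile

import pandas as pd


def load_zipped_csv(path_or_url):
    '''Hypothetical helper: read a zip archive holding a single CSV file.'''
    # pandas infers zip compression from the '.zip' extension
    return pd.read_csv(path_or_url)


# build a tiny stand-in archive locally (made-up rows, not the audit data)
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'demo_audit.zip')
with zipfile.ZipFile(demo_path, 'w') as archive:
    archive.writestr('demo_audit.csv',
                     'model_format,readme\nCODE,1\nVIM,0\n')

demo = load_zipped_csv(demo_path)
print(demo)
```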

3. Analysis functions

A number of simple functions to conduct the analysis and format output.

def balance_of_model_format(df):
    '''
    Returns the counts of VIM versus code
    
    Params:
    -------
    df: pd.DataFrame
        Subset of the best practice dataset to analyse
        
    Returns:
    (labels: List, counts: List)
    '''
    unique_elements, counts_elements = np.unique(df['model_format'], 
                                                 return_counts=True)
    return unique_elements, counts_elements
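A short usage sketch on made-up data may help here. It restates the function compactly so that it runs on its own; the toy rows are illustrative and are not taken from the audit dataset.

```python
import numpy as np
import pandas as pd


def balance_of_model_format(df):
    '''Compact restatement of the function above.'''
    return np.unique(df['model_format'], return_counts=True)


# toy data: three CODE models and two VIM models
toy = pd.DataFrame({'model_format': ['VIM', 'CODE', 'CODE', 'VIM', 'CODE']})

labels, counts = balance_of_model_format(toy)
print(list(labels), list(counts))
```

`np.unique` returns the labels in sorted order, so `CODE` always precedes `VIM`. Any caller that unpacks the counts positionally as `(n_code, n_vim)` depends on that ordering.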

def category_frequencies_by_model_format(df, cols):
    '''
    Calculate the frequencies of 0/1s for VIM versus code.
    Return concatenated in a pandas dataframe
    
    Params:
    ------
    df: pd.DataFrame
        Dataframe containing subset of best practice audit to summarise.

    cols: List
        Names of the 0/1 criteria columns in df to summarise.
    
    Returns:
    -------
    pd.DataFrame
    '''

    # key to select fields where category is 1.
    key = [('CODE', 1), ('VIM', 1)]

    # use a separate name for the output so the input df is not overwritten
    summary = pd.DataFrame()

    # operation needs to be done separately on each criterion then combined.
    for col in cols:
        # group by VIM and code and get frequencies of 1/0
        results = df.groupby('model_format')[col].value_counts(dropna=False)
        # concat to single dataframe
        summary = pd.concat([summary, results.loc[key]], axis=1)
        
    # drop multi-index, transpose and relabel
    summary = summary.reset_index()
    summary = summary.T
    summary = summary.drop(['level_0', 'level_1'])
    summary.columns = ['CODE', 'VIM']
    
    # add percentages
    # get total number of code and vim based models.
    _, (n_code, n_vim) = balance_of_model_format(df)
    per_cols = ['CODE_%', 'VIM_%']
    summary[per_cols[0]] = (summary['CODE'] / n_code * 100).map('{:,.1f}'.format)
    summary[per_cols[1]] = (summary['VIM'] / n_vim * 100).map('{:,.1f}'.format)
    
    return summary
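The same count-and-percentage summary can be sketched more compactly on toy data. This is an alternative illustration, not the notebook's method: it assumes the criteria are stored as plain integer 0/1 columns (the real dataset stores them as `category` dtype, so a conversion would be needed first), and the toy values are made up.

```python
import pandas as pd

# toy data: three CODE models and two VIM models
toy = pd.DataFrame({
    'model_format': ['CODE', 'CODE', 'CODE', 'VIM', 'VIM'],
    'readme':       [1, 1, 0, 1, 0],
    'steps_run':    [1, 0, 0, 0, 0],
})
criteria = ['readme', 'steps_run']

# summing 0/1 columns counts the models meeting each criterion;
# transpose so rows are criteria and columns are model formats
counts = toy.groupby('model_format')[criteria].sum().T

# denominators: total number of models of each format
n_models = toy['model_format'].value_counts()

# dividing a DataFrame by a Series aligns on the column labels (CODE, VIM)
percent = (counts / n_models * 100).round(1)

print(counts)
print(percent)
```

Because the denominator is the total number of models per format, a model with a missing value for a criterion is treated as not meeting it.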

def model_has_license(license):
    '''
    Recode a single license value from multiple categories down to binary.
    "None" = 0 else 1.  Intended for use with pd.Series.apply
    
    Params:
    ------
    license: str
        The license recorded for one model ("None" if unlicensed)
        
    Returns:
    -------
    int
    '''
    if license == "None":
        return 0
    else:
        return 1
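A brief usage sketch shows the recoding applied element by element to a series. The license names are made up for illustration, and the function is restated compactly so the block runs on its own. The vectorised comparison at the end is an equivalent alternative, not the notebook's approach.

```python
import pandas as pd


def model_has_license(license):
    '''Compact restatement of the function above.'''
    return 0 if license == "None" else 1


# illustrative values only; unlicensed models carry the string "None"
licenses = pd.Series(['MIT', 'None', 'GPL-3.0', 'None'])

has_license = licenses.apply(model_has_license)
print(has_license.tolist())

# equivalent vectorised form
has_license_vec = (licenses != 'None').astype(int)
```

Note that the check is against the string `"None"`, not Python's `None` or a missing value, so a genuinely missing entry would be coded as licensed.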

def format_bpa_results(summary):
    '''
    Convert 4 column table of n and % into two 
    columns where each column is n (%)
    
    Params:
    -------
    summary: pd.DataFrame
        The unformatted table.  Assumes 4 cols and index.
        
    Returns:
    -------
    pd.DataFrame
    '''
    
    row_headings = ['Model has DOI',
                    'ORCID',
                    'Licensed',
                    'Readme',
                    'Steps to run',
                    'Formal Dep Mgt',
                    'Informal Dep Mgt',
                    'Evidence of testing',
                    'Model downloadable',
                    'Model interactive online']
    
    # raw strings keep the LaTeX-escaped \% without an invalid escape warning
    summary[r'CODE (\%)'] = summary['CODE'].map('{:,.0f}'.format) \
        + ' (' + summary['CODE_%'] + ')'
    
    summary[r'VIM (\%)'] = summary['VIM'].map('{:,.0f}'.format) \
        + ' (' + summary['VIM_%'] + ')'
    
    summary = summary.drop(['CODE', 'VIM', 'CODE_%', 'VIM_%'], axis=1)
    summary['criteria'] = row_headings
    summary = summary.set_index('criteria')
    return summary
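The core of the formatting step is string concatenation of a count with a percentage that is already held as text. A minimal sketch for the CODE column only, using two rows whose values appear in the results table of this notebook:

```python
import pandas as pd

# counts are numeric, percentages are already formatted strings
summary = pd.DataFrame({'CODE': [4, 21],
                        'CODE_%': ['12.9', '67.7']},
                       index=['model_has_doi', 'readme'])

# format the count with no decimals, then append the percentage in brackets
n_and_percent = summary['CODE'].map('{:,.0f}'.format) \
    + ' (' + summary['CODE_%'] + ')'

print(n_and_percent.tolist())
```

The concatenation only works because the percentage columns are strings. If they were floats, pandas would raise an error when adding them to text.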

4. Load and inspect dataset

The clean data set has 27 fields included. These are listed below.

clean = load_clean_bpa(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   model_format                  47 non-null     category
 1   key                           47 non-null     object  
 2   item_type                     47 non-null     category
 3   pub_yr                        47 non-null     int64   
 4   author                        47 non-null     object  
 5   doi                           46 non-null     object  
 6   reporting_guidelines_mention  47 non-null     category
 7   covid                         47 non-null     category
 8   sim_software                  47 non-null     object  
 9   foss_sim                      47 non-null     category
 10  model_archive                 5 non-null      object  
 11  model_repo                    21 non-null     object  
 12  model_journal_supp            10 non-null     object  
 13  model_personal_org            6 non-null      object  
 14  model_platform                11 non-null     object  
 15  github_url                    21 non-null     object  
 16  model_has_doi                 47 non-null     category
 17  orcid                         46 non-null     category
 18  license                       47 non-null     object  
 19  readme                        47 non-null     category
 20  link_to_paper                 37 non-null     category
 21  steps_run                     47 non-null     category
 22  formal_dep_mgt                47 non-null     category
 23  informal_dep_mgt              47 non-null     category
 24  evidence_testing              25 non-null     category
 25  downloadable                  47 non-null     category
 26  interactive_online            47 non-null     category
dtypes: category(15), int64(1), object(11)
memory usage: 7.1+ KB
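Several fields in the listing above have fewer than 47 non-null entries (for example, `evidence_testing` has 25). A minimal sketch of how the missing values could be counted, overall and by model format, is given below. The toy frame is made up and stands in for `clean`.

```python
import numpy as np
import pandas as pd

# toy stand-in for the audit data, with two missing values
toy = pd.DataFrame({
    'model_format': ['CODE', 'CODE', 'VIM', 'VIM'],
    'evidence_testing': [1, np.nan, np.nan, 0],
    'readme': [1, 0, 1, 1],
})

# number of missing values in each column
missing = toy.isna().sum()

# missing values for one criterion, split by model format
missing_by_format = (toy['evidence_testing'].isna()
                     .groupby(toy['model_format']).sum())

print(missing)
print(missing_by_format)
```

Checking this before summarising matters because the percentages reported later divide by the total number of models of each format, not by the number of non-missing entries.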

5. Results

5.1 Summary table

The table reports the number and percentage of CODE and VIM models that meet each best practice criterion.

cols = ['model_has_doi', 'orcid', 'license_y', 'readme', 'steps_run', 
        'formal_dep_mgt', 'informal_dep_mgt', 
        'evidence_testing', 'downloadable', 'interactive_online']

clean['license_y'] = clean['license'].apply(model_has_license)
unformatted = category_frequencies_by_model_format(clean, cols)
unformatted
                    CODE  VIM  CODE_%  VIM_%
model_has_doi          4    3    12.9   18.8
orcid                  3    3     9.7   18.8
license_y             15    6    48.4   37.5
readme                21    7    67.7   43.8
steps_run             13    3    41.9   18.8
formal_dep_mgt         7    0    22.6    0.0
informal_dep_mgt       7    8    22.6   50.0
evidence_testing       3    0     9.7    0.0
downloadable          31   11   100.0   68.8
interactive_online     4    6    12.9   37.5
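The percentages can be checked by hand. The denominators used below are inferred rather than stated: 31 downloadable CODE models at 100.0% implies 31 CODE models, which leaves 16 VIM models out of the 47 entries. The helper `percent_of` is a hypothetical name used only for this check.

```python
# inferred denominators (see the assumption stated above)
n_code, n_vim = 31, 16


def percent_of(count, total):
    '''Hypothetical helper: percentage formatted to one decimal place.'''
    return '{:,.1f}'.format(count / total * 100)


# (CODE count, VIM count) for three rows of the table
rows = {'model_has_doi': (4, 3),
        'downloadable': (31, 11),
        'interactive_online': (4, 6)}

for name, (code_count, vim_count) in rows.items():
    print(name, percent_of(code_count, n_code), percent_of(vim_count, n_vim))
```

For example, 4 / 31 = 0.129 gives 12.9% for CODE models with a DOI, and 3 / 16 = 0.1875 gives 18.8% for VIM models.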

5.2 Formatted results for paper + LaTeX

table = format_bpa_results(unformatted)
table
                          CODE (\%)    VIM (\%)
criteria
Model has DOI             4 (12.9)     3 (18.8)
ORCID                     3 (9.7)      3 (18.8)
Licensed                  15 (48.4)    6 (37.5)
Readme                    21 (67.7)    7 (43.8)
Steps to run              13 (41.9)    3 (18.8)
Formal Dep Mgt            7 (22.6)     0 (0.0)
Informal Dep Mgt          7 (22.6)     8 (50.0)
Evidence of testing       3 (9.7)      0 (0.0)
Model downloadable        31 (100.0)   11 (68.8)
Model interactive online  4 (12.9)     6 (37.5)
# output as latex
print(table.style.to_latex(hrules=True, 
                          label="Table:bpa_results", 
                          caption="Best practice audit results"))
\begin{table}
\caption{Best practice audit results}
\label{Table:bpa_results}
\begin{tabular}{lll}
\toprule
 & CODE (\%) & VIM (\%) \\
criteria &  &  \\
\midrule
Model has DOI & 4 (12.9) & 3 (18.8) \\
ORCID & 3 (9.7) & 3 (18.8) \\
Licensed & 15 (48.4) & 6 (37.5) \\
Readme & 21 (67.7) & 7 (43.8) \\
Steps to run & 13 (41.9) & 3 (18.8) \\
Formal Dep Mgt & 7 (22.6) & 0 (0.0) \\
Informal Dep Mgt & 7 (22.6) & 8 (50.0) \\
Evidence of testing & 3 (9.7) & 0 (0.0) \\
Model downloadable & 31 (100.0) & 11 (68.8) \\
Model interactive online & 4 (12.9) & 6 (37.5) \\
\bottomrule
\end{tabular}
\end{table}