Preprocessing#
This notebook provides an overview of the code to read in the data extracted from the best practice audit of model sharing.
The dataset is a subset of the main review, limited to studies that shared a model. The type of model shared is coded as Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) versus CODE based (e.g. MATLAB, Python, SimPy, Java, R Simmer).
Additional fields have been extracted as part of the best practice review.
model_format - VIM or CODE
model_has_doi - do the model artefacts have their own minted DOI? (0/1)
orcid - do the researchers provide an ORCID with the model? (0/1)
license - does the model have an explicit license defining how it can be used? (str)
readme - is there an obvious file(s) where a user would look first? (0/1)
link_to_paper - does the model repository contain a link back to the pre-print or peer reviewed article? (0/1)
steps_run - are there steps to run the model? (0/1)
formal_dep_mgt - has the model been shared with formal software dependency management? (0/1)
informal_dep_mgt - have any informal methods of dependency management been shared, e.g. a list of software requirements? (0/1)
evidence_testing - do the model and artefacts in the repository contain any evidence that they have been tested? (0/1)
downloadable - can the model and artefacts be downloaded and executed locally? (0/1)
interactive_online - can the model and its artefacts be executed online without local installation? (0/1)
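As an illustration of this coding scheme, a single audit record might look like the following. The values here are hypothetical, chosen only to show the shape of the data, and are not drawn from the audit itself:

```python
import pandas as pd

# Hypothetical example record - values are illustrative only,
# not taken from the best practice audit dataset.
example = pd.DataFrame([{
    'model_format': 'CODE',    # VIM or CODE
    'model_has_doi': 1,        # artefacts have a minted DOI
    'orcid': 0,                # no ORCID provided with the model
    'license': 'MIT',          # explicit license (str)
    'readme': 1,               # obvious entry-point file present
    'link_to_paper': 1,        # repository links back to the article
    'steps_run': 1,            # steps to run the model provided
    'formal_dep_mgt': 0,       # no formal dependency management
    'informal_dep_mgt': 1,     # e.g. a list of software requirements
    'evidence_testing': 0,     # no evidence of testing
    'downloadable': 1,         # can be downloaded and run locally
    'interactive_online': 0,   # cannot be executed online
}])
print(example.shape)  # one record, twelve audit fields
```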
1. Imports#
import pandas as pd
import numpy as np
2. Constants#
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
            + 'des_sharing_lit_review/main/data/bp_audit.zip'

# used to drop redundant fields not needed in analysis.
COLS_TO_DROP = [5, 6, 8, 18, 24, 29, 33]
3. Functions to read and clean dataset#
def cols_to_lower(df):
    '''
    Convert all column names in a dataframe to lower case

    Params:
    ------
    df - pandas.DataFrame

    Returns:
    -------
    pandas.DataFrame
    '''
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df
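A quick check of `cols_to_lower` on a toy frame (the column names below are invented for illustration, mimicking the raw Zotero-style headers):

```python
import pandas as pd

def cols_to_lower(df):
    '''Convert all column names in a dataframe to lower case.'''
    df.columns = [c.lower() for c in df.columns]
    return df

# toy frame with mixed-case headers
toy = pd.DataFrame({'Item Type': ['journalArticle'], 'DOI.1': [1]})
print(list(cols_to_lower(toy).columns))  # ['item type', 'doi.1']
```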
def drop_columns(df, to_drop):
    '''
    Remove fields that are not needed for the clean
    analysis best practice dataset.

    Params:
    -------
    df - pd.DataFrame
        The raw data

    to_drop - list
        Positional indexes of the columns to remove
        (the COLS_TO_DROP constant in this notebook).

    Returns:
    --------
    pd.DataFrame
    '''
    return df.drop(df.columns[to_drop], axis=1)
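Because `to_drop` holds positional indexes rather than names, `df.columns[to_drop]` first resolves them to labels before dropping. A minimal sketch with a toy frame:

```python
import pandas as pd

def drop_columns(df, to_drop):
    '''Remove columns by positional index.'''
    return df.drop(df.columns[to_drop], axis=1)

# toy frame: drop the column at position 1 ('drop_b')
toy = pd.DataFrame({'keep_a': [1], 'drop_b': [2], 'keep_c': [3]})
print(list(drop_columns(toy, [1]).columns))  # ['keep_a', 'keep_c']
```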
def load_clean_bpa(file_name):
    '''
    Loads a cleaned version of the BEST PRACTICE AUDIT dataset.

    1. Replaces spaces in column names with "_" and renames the model DOI column.
    2. Converts all column names to lower case.
    3. Drops columns not needed for analysis.
    4. Recodes values in the Janssen method columns.
    5. Converts relevant columns to the Categorical data type.
    6. Performs remaining type conversions.
    '''
    labels = {'DOI.1': 'model_has_doi',
              'Item Type': 'item_type',
              'Publication Year': 'pub_yr',
              'Publication Title': 'pub_title'}

    recoded = {'model_repo': {'Github': 'GitHub'},
               'model_journal_supp': {'R model in word file': 'Word doc'},
               'model_personal_org': {'personex': 'Organisational website',
                                      'Personex ': 'Organisational website',
                                      'Personex': 'Organisational website',
                                      'https://resp.core.ubc.ca/research/Specific_Projects/EPIC':
                                      'Organisational website'},
               'model_platform': {'Anylogic Cloud': 'AnyLogic Cloud'}}

    clean = (pd.read_csv(file_name)
             .rename(columns=labels)
             .pipe(cols_to_lower)
             .pipe(drop_columns, COLS_TO_DROP)
             .replace(recoded)
             .assign(model_format=lambda x: pd.Categorical(x['model_format']),
                     reporting_guidelines_mention=lambda x:
                         pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']),
                     model_has_doi=lambda x: pd.Categorical(x['model_has_doi']),
                     orcid=lambda x: pd.Categorical(x['orcid']),
                     readme=lambda x: pd.Categorical(x['readme']),
                     link_to_paper=lambda x: pd.Categorical(x['link_to_paper']),
                     steps_run=lambda x: pd.Categorical(x['steps_run']),
                     formal_dep_mgt=lambda x:
                         pd.Categorical(x['formal_dep_mgt']),
                     informal_dep_mgt=lambda x:
                         pd.Categorical(x['informal_dep_mgt']),
                     evidence_testing=lambda x:
                         pd.Categorical(x['evidence_testing']),
                     downloadable=lambda x: pd.Categorical(x['downloadable']),
                     interactive_online=lambda x:
                         pd.Categorical(x['interactive_online']))
             )
    return clean
4. Example load and inspection of cleaned dataset#
4.1 Fields#
The clean dataset has 27 fields. These are listed below.
clean = load_clean_bpa(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model_format 47 non-null category
1 key 47 non-null object
2 item_type 47 non-null category
3 pub_yr 47 non-null int64
4 author 47 non-null object
5 doi 46 non-null object
6 reporting_guidelines_mention 47 non-null category
7 covid 47 non-null category
8 sim_software 47 non-null object
9 foss_sim 47 non-null category
10 model_archive 5 non-null object
11 model_repo 21 non-null object
12 model_journal_supp 10 non-null object
13 model_personal_org 6 non-null object
14 model_platform 11 non-null object
15 github_url 21 non-null object
16 model_has_doi 47 non-null category
17 orcid 46 non-null category
18 license 47 non-null object
19 readme 47 non-null category
20 link_to_paper 37 non-null category
21 steps_run 47 non-null category
22 formal_dep_mgt 47 non-null category
23 informal_dep_mgt 47 non-null category
24 evidence_testing 25 non-null category
25 downloadable 47 non-null category
26 interactive_online 47 non-null category
dtypes: category(15), int64(1), object(11)
memory usage: 7.1+ KB
4.2 Balance of the classes in model_format#
unique_elements, counts_elements = np.unique(clean['model_format'],
                                             return_counts=True)
print(unique_elements, counts_elements)
print(counts_elements[0]/len(clean))
print(counts_elements[1]/len(clean))
['CODE' 'VIM'] [31 16]
0.6595744680851063
0.3404255319148936
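The same proportions can be obtained more directly with pandas `value_counts`. A minimal sketch, reconstructing the class counts reported above (31 CODE, 16 VIM) rather than reloading the dataset:

```python
import pandas as pd

# Reconstruct the class counts reported above: 31 CODE, 16 VIM.
model_format = pd.Series(['CODE'] * 31 + ['VIM'] * 16)

# normalize=True returns proportions rather than raw counts.
proportions = model_format.value_counts(normalize=True)
print(proportions.round(3))
```

`value_counts(normalize=True)` avoids the manual division by `len(clean)` and keeps the class labels attached to their proportions.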