Dataset pre-processing

This notebook provides an overview of the code to read in the data extracted from the review.

The dataset is held in a CSV file that has been extracted from a Zotero library (TODO: INSERT Zotero library link). The following data were then extracted from each paper:

  • study_included - has the study been included in the final analysis

  • model_code_available - is the model made publicly available in some manner

  • reporting_guidelines_mention - have reporting guidelines been mentioned or explicitly cited.

  • covid - is DES being used to tackle COVID-19

  • sim_software - name of simulation software or programming language if stated.

  • foss_sim - free and open source simulation software? 0/1

  • model_archive - name of archive if used

  • model_repo - name of model repo if used

  • model_journal_supp - what is stored in the journal supplementary material

  • model_personal_org - name of personal or organisational website if used

  • model_platform - name of cloud platform used (e.g. Binder or Anylogic cloud)

  • excluded_reason - one of four reasons that the study was excluded.

1. Imports

import pandas as pd
import numpy as np

2. Constants

FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/share_sim_data_extract.zip'

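# Used to drop redundant manuscript fields output by Zotero,
# e.g. keywords and abstracts.
# NOTE: index 52 is listed twice, so the column at that position
# (model_journal_supp) appears twice in the cleaned DataFrame.
COLS_TO_KEEP = [2, 3, 4, 5, 6, 7, 10, 11, 44, 45, 46, 47,
                48, 49, 50, 51, 52, 52, 53, 54, 55, 57]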
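Selecting columns from a list of positions keeps any repeated position, which matters here because COLS_TO_KEEP contains 52 twice. The following is a minimal sketch of that behaviour on a made-up frame; the names `toy`, `keep`, `subset` and `deduped` are illustrative assumptions and are not part of the review code or data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw Zotero export.
toy = pd.DataFrame({'Key': ['A1', 'B2'],
                    'Abstract': ['...', '...'],
                    'Title': ['Paper one', 'Paper two']})

# Position 2 is listed twice, so 'Title' is selected twice.
keep = [0, 2, 2]
subset = toy[toy.columns[keep]]
print(list(subset.columns))

# Optional clean-up: keep only the first occurrence of each column name.
deduped = subset.loc[:, ~subset.columns.duplicated()]
print(list(deduped.columns))
```

If the repeated column is not wanted, removing the second 52 from the list or applying a de-duplication step like the one sketched above would both work.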

3. Function to read and clean dataset

We have implemented the read and clean-up of the dataset using pandas.
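FILE_NAME points to a .zip archive rather than a plain CSV file. By default pd.read_csv infers the compression from the file extension, so an archive that holds a single CSV can be read directly. The sketch below illustrates this with a temporary file; the frame contents and the file names `toy_extract.zip` and `toy_extract.csv` are assumptions made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row extract; not the review data.
toy = pd.DataFrame({'Key': ['A1', 'B2'],
                    'Publication Year': [2020, 2021]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy_extract.zip')
    # Write a zip archive that holds a single CSV file.
    toy.to_csv(path, index=False,
               compression={'method': 'zip',
                            'archive_name': 'toy_extract.csv'})
    # compression='infer' is the default, so the '.zip' suffix
    # is enough for pandas to decompress before parsing.
    loaded = pd.read_csv(path)

print(loaded)
```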

3.1 Cleaning helper functions

Two supporting functions are defined for the main routine. These trim redundant columns and convert all column names to lower case.

def trim_columns(df):
    '''
    Remove fields that are not needed for the clean
    analysis dataset.
    
    Uses the COLS_TO_KEEP constant list.
    
    Params:
    -------
    df - pd.DataFrame
        The raw data
    
    Returns:
    --------
    pd.DataFrame
    
    '''
    return df[df.columns[COLS_TO_KEEP]]


def cols_to_lower(df):
    '''
    Convert all column names in a dataframe to lower case
    
    Params:
    ------
    df - pandas.DataFrame
    
    Returns:
    -------
    pandas.DataFrame
    '''
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df
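One detail worth knowing about cols_to_lower is that it assigns to df.columns, so the frame passed in is modified as well as returned. Inside the chained pipeline this is harmless, because the preceding steps already return new frames. The sketch below contrasts that behaviour with a non-mutating variant; `cols_to_lower_copy` and the `toy` frame are hypothetical names introduced only for this illustration.

```python
import pandas as pd


def cols_to_lower(df):
    # Same logic as the helper above: assigns to df.columns in place.
    df.columns = [c.lower() for c in df.columns]
    return df


def cols_to_lower_copy(df):
    # Hypothetical alternative: rename returns a new frame
    # and leaves the input untouched.
    return df.rename(columns=str.lower)


toy = pd.DataFrame({'Key': ['A1'], 'Title': ['Paper one']})

copied = cols_to_lower_copy(toy)
before = list(toy.columns)      # input is unchanged at this point
print(before, list(copied.columns))

result = cols_to_lower(toy)
after = list(toy.columns)       # input has now been modified
print(after, result is toy)
```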

3.2 Main load and clean function

The main function uses pandas method chaining.

def load_clean_dataset(file_name):
    '''
    Loads a cleaned version of the dataset
    
    1.  Trims the columns to only those relevant to the analysis
    2.  Renames the Zotero fields that contain spaces (e.g. 'Item Type'
        becomes 'item_type')
    3.  Converts all column names to lower case
    4.  Recodes inconsistent labels (e.g. 'Matlab' becomes 'MATLAB')
    5.  Converts relevant cols to Categorical data type
    6.  Performs remaining type conversions.
    '''
    labels = {'Item Type': 'item_type',
               'Publication Year': 'pub_yr',
               'Publication Title': 'pub_title'}

    type_conversions = {'pub_yr': 'int'}
    
    recoded_types = {'item_type': {'bookSection':'book'},
                     'reporting_guidelines_mention': {'ISPOR-SMDM': 'ISPOR',
                                                      '0': 'None'},
                     'sim_software': {'Anylogic': 'AnyLogic',
                                      'Treeage': 'TreeAge',
                                      'Matlab Simulink':'MATLAB',
                                      'Matlab SimEvents':'MATLAB',
                                      'Matlab':'MATLAB',
                                      'MatLab SimEvents':'MATLAB',
                                      'MatLab':'MATLAB'}}

    clean = (pd.read_csv(file_name)
             .pipe(trim_columns)
             .rename(columns=labels) 
             .pipe(cols_to_lower)
             .replace(recoded_types)
             .assign(study_included=lambda x: 
                         pd.Categorical(x['study_included']),
                     model_code_available=lambda x: 
                         pd.Categorical(x['model_code_available']),
                     reporting_guidelines_mention=lambda x: 
                         pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']))
             .astype(type_conversions)
    )

    return clean
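The recoding, categorical conversion and type conversion steps of the chain can be tried in isolation. The following is a minimal sketch on a made-up frame, assuming the same column names as the cleaned dataset; the values in `toy` are invented and the shortened recode dictionary is only a subset of the one used above.

```python
import pandas as pd

# Hypothetical three-row frame; values are invented for illustration.
toy = pd.DataFrame({'pub_yr': [2020.0, 2021.0, 2021.0],
                    'sim_software': ['Anylogic', 'Matlab', 'SimPy'],
                    'covid': [0, 1, 1]})

recoded = (toy
           # Nested dict: {column: {old_value: new_value}}.
           .replace({'sim_software': {'Anylogic': 'AnyLogic',
                                      'Matlab': 'MATLAB'}})
           # Convert a 0/1 field to a Categorical column.
           .assign(covid=lambda x: pd.Categorical(x['covid']))
           # Remaining type conversions, keyed by column name.
           .astype({'pub_yr': 'int'}))

print(recoded.dtypes)
print(recoded)
```

Each step returns a new DataFrame, so `toy` itself is left unchanged and the steps can be reordered or removed while experimenting.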

4. Example read in and clean

Here we run the preprocessing of the main dataset, then examine the DataFrame information and peek at the head and tail.

clean = load_clean_dataset(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   key                           665 non-null    object  
 1   item_type                     665 non-null    category
 2   pub_yr                        665 non-null    int64   
 3   author                        664 non-null    object  
 4   title                         665 non-null    object  
 5   pub_title                     636 non-null    object  
 6   doi                           588 non-null    object  
 7   url                           450 non-null    object  
 8   study_included                665 non-null    category
 9   model_code_available          572 non-null    category
 10  reporting_guidelines_mention  571 non-null    category
 11  covid                         575 non-null    category
 12  sim_software                  574 non-null    object  
 13  foss_sim                      573 non-null    category
 14  model_archive                 5 non-null      object  
 15  model_repo                    24 non-null     object  
 16  model_journal_supp            7 non-null      object  
 17  model_journal_supp            7 non-null      object  
 18  model_personal_org            5 non-null      object  
 19  model_platform                11 non-null     object  
 20  available_on_req              66 non-null     object  
 21  excluded_reason               100 non-null    object  
dtypes: category(6), int64(1), object(15)
memory usage: 88.1+ KB
clean.head(2)
|  | key | item_type | pub_yr | author | title | pub_title | doi | url | study_included | model_code_available | ... | sim_software | foss_sim | model_archive | model_repo | model_journal_supp | model_journal_supp | model_personal_org | model_platform | available_on_req | excluded_reason |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6CYNDDIL | journalArticle | 2021 | Saidani, M.; Kim, H. | A Discrete Event Simulation-Based Model to Opt... | Simulation in healthcare : journal of the Soci... | 10.1097/SIH.0000000000000565 | https://www.scopus.com/inward/record.uri?eid=2... | 1 | 1.0 | ... | AnyLogic | 0 | NaN | NaN | File | File | NaN | NaN | NaN | NaN |
| 1 | WJR7T7VY | book | 2021 | Kenny, E.; Hassanzadeh, H.; Khanna, S.; Boyle,... | Patient flow simulation using historically inf... | NaN | NaN | https://www.scopus.com/inward/record.uri?eid=2... | 1 | 0.0 | ... | SimPy | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

2 rows × 22 columns

clean.tail(2)
|  | key | item_type | pub_yr | author | title | pub_title | doi | url | study_included | model_code_available | ... | sim_software | foss_sim | model_archive | model_repo | model_journal_supp | model_journal_supp | model_personal_org | model_platform | available_on_req | excluded_reason |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 663 | AY6AYBAM | journalArticle | 2021 | Jaime, J.; Möller, J.; Santhirapala, V.; Gill,... | Predicting Hospital Resource Use During COVID-... | Value in health : the journal of the Internati... | 10.1016/j.jval.2021.05.023 | https://www.sciencedirect.com/science/article/... | 0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Not DES |
| 664 | ZAX8CEH7 | journalArticle | 2021 | Lu, Y.; Guan, Y.; Zhong, X.; Fishe, JN.; Hogan... | CASE - Hospital Beds Planning and Admission Co... | 2021 IEEE 17th International Conference on Aut... | 10.1109/case49439.2021.9551589 | https://search.bvsalud.org/global-literature-o... | 1 | 0.0 | ... | Arena | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |

2 rows × 22 columns