Dataset pre-processing#

This notebook provides an overview of the code to read in the data extracted from the review.

The dataset is held in a CSV file that has been extracted from a Zotero library (TODO: INSERT Zotero library link). The following data were then extracted from each paper:

study_included - has the study been included in the final analysis
model_code_available - is the model made publicly available in some manner
reporting_guidelines_mention - have reporting guidelines been mentioned or explicitly cited.
covid - is discrete-event simulation (DES) being used to tackle COVID-19
sim_software - name of simulation software or programming language if stated.
foss_sim - free and open source simulation software? 0/1
model_archive - name of archive if used
model_repo - name of model repo if used
model_journal_supp - what is stored in the journal supplementary material
model_personal_org - name of personal or organisational website if used
model_platform - name of cloud platform used (e.g. Binder or Anylogic cloud)
excluded_reason - One of four reasons that the study was excluded.

1. Imports#

import pandas as pd
import numpy as np

2. Constants#

FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/share_sim_data_extract.zip'

# used to drop redundant manuscript fields output by Zotero
# e.g. keywords and abstracts.
COLS_TO_KEEP = [2, 3, 4, 5, 6, 7, 10, 11, 44, 45, 46, 47, 
                48, 49, 50, 51, 52, 52, 53, 54, 55, 57]
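
Position 52 appears twice in COLS_TO_KEEP, and selecting columns by position keeps repeated positions, which is why model_journal_supp is listed twice in the output later in this notebook. The sketch below uses a toy frame to show that behaviour and one possible way to drop repeats while preserving order. The column names `a`, `b`, `c` and the variable names are hypothetical and are not part of the real extract.

```python
import pandas as pd

# Toy frame standing in for the raw Zotero export (hypothetical columns).
raw = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Position 1 is repeated, in the same way that 52 is repeated in COLS_TO_KEEP.
keep = [0, 1, 1]

# Same selection pattern as trim_columns: repeated positions give repeated columns.
subset = raw[raw.columns[keep]]
print(list(subset.columns))

# One option (an assumption, not the notebook's method): de-duplicate the
# positions while preserving their order before selecting.
unique_keep = list(dict.fromkeys(keep))
deduped = raw.iloc[:, unique_keep]
print(list(deduped.columns))
```

If the repeated column is not intended, removing the second 52 from the list would have the same effect as the de-duplication step shown here.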

3. Function to read and clean dataset#

The read and clean-up of the dataset are implemented using pandas.
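
FILE_NAME points to a .zip archive rather than a bare .csv file. By default `pd.read_csv` infers the compression from the file extension, and for zip archives it expects the archive to contain a single data file. The sketch below illustrates this with a small local archive built in a temporary directory; the file names and the two-column contents are made up for the example and do not reflect the real extract.

```python
import os
import tempfile
import zipfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical archive containing exactly one CSV file.
    zip_path = os.path.join(tmp, 'toy_extract.zip')
    with zipfile.ZipFile(zip_path, 'w') as zf:
        zf.writestr('toy_extract.csv',
                    'Key,Publication Year\nABC123,2021\n')

    # compression='infer' is the default, so the .zip extension is enough.
    toy = pd.read_csv(zip_path)

print(toy)
```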

3.1 Cleaning helper functions#

Two supporting functions are defined for the main routine. These trim redundant columns and convert all column names to lower case.

def trim_columns(df):
    '''
    Remove fields that are not needed for the clean
    analysis dataset.
    
    Uses the COLS_TO_KEEP constant list.
    
    Params:
    -------
    df - pd.DataFrame
        The raw data
    
    Returns:
    --------
    pd.DataFrame
    
    '''
    return df[df.columns[COLS_TO_KEEP]]

def cols_to_lower(df):
    '''
    Convert all column names in a dataframe to lower case
    
    Params:
    ------
    df - pandas.DataFrame
    
    Returns:
    -------
    pandas.DataFrame
    '''
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df
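
Note that cols_to_lower assigns to `df.columns`, so it also changes the frame that was passed in. That is harmless inside a method chain where the intermediate frame is discarded. If the caller's frame needs to stay untouched, a non-mutating variant is possible with `DataFrame.rename`. The function name `cols_to_lower_copy` and the toy data below are hypothetical.

```python
import pandas as pd


def cols_to_lower_copy(df):
    '''
    Return a new dataframe with lower case column names.
    The input dataframe is left unchanged.
    '''
    return df.rename(columns=str.lower)


# Toy frame with Zotero-style capitalised headers (hypothetical values).
raw = pd.DataFrame({'Key': ['A1'], 'Item Type': ['book']})
lowered = raw.pipe(cols_to_lower_copy)

print(list(lowered.columns))
print(list(raw.columns))
```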

3.2 Main load and clean function#

The main function chains pandas methods to apply each cleaning step in turn.

def load_clean_dataset(file_name):
    '''
    Loads a cleaned version of the dataset
    
    1.  Trims the columns to only those relevant to the analysis
    2.  Renames selected columns to short snake_case labels
    3.  Converts all column names to lower case
    4.  Recodes inconsistent labels (e.g. simulation software names)
    5.  Converts relevant cols to Categorical data type
    6.  Performs remaining type conversions.
    '''
    labels = {'Item Type': 'item_type',
               'Publication Year': 'pub_yr',
               'Publication Title': 'pub_title'}

    type_conversions = {'pub_yr': 'int'}
    
    recoded_types = {'item_type': {'bookSection':'book'},
                     'reporting_guidelines_mention': {'ISPOR-SMDM': 'ISPOR',
                                                      '0': 'None'},
                     'sim_software': {'Anylogic': 'AnyLogic',
                                      'Treeage': 'TreeAge',
                                      'Matlab Simulink':'MATLAB',
                                      'Matlab SimEvents':'MATLAB',
                                      'Matlab':'MATLAB',
                                      'MatLab SimEvents':'MATLAB',
                                      'MatLab':'MATLAB'}}

    clean = (pd.read_csv(file_name)
             .pipe(trim_columns)
             .rename(columns=labels) 
             .pipe(cols_to_lower)
             .replace(recoded_types)
             .assign(study_included=lambda x: 
                         pd.Categorical(x['study_included']),
                     model_code_available=lambda x: 
                         pd.Categorical(x['model_code_available']),
                     reporting_guidelines_mention=lambda x: 
                         pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']))
             .astype(type_conversions)
            )

    return clean
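
The recoding, categorical conversion and type conversion steps can be tried in isolation on a small frame. The sketch below is a minimal illustration with made-up rows; only the column names and a few of the recode pairs are taken from the function above. A nested dictionary passed to `replace` maps each column name to its own old-to-new value mapping.

```python
import pandas as pd

# Toy rows (hypothetical) using three of the cleaned column names.
toy = pd.DataFrame({'item_type': ['bookSection', 'journalArticle'],
                    'sim_software': ['Anylogic', 'Matlab'],
                    'pub_yr': [2020.0, 2021.0]})

# Outer keys are column names, inner dicts map old labels to new labels.
recode = {'item_type': {'bookSection': 'book'},
          'sim_software': {'Anylogic': 'AnyLogic', 'Matlab': 'MATLAB'}}

tidy = (toy.replace(recode)
           .assign(item_type=lambda x: pd.Categorical(x['item_type']))
           .astype({'pub_yr': 'int'})
        )

print(tidy)
print(tidy.dtypes)
```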

4. Example read in, clean.#

Here we run the preprocessing of the main dataset, examine the DataFrame information, and peek at the head and tail.

clean = load_clean_dataset(FILE_NAME)
clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 22 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   key                           665 non-null    object  
 1   item_type                     665 non-null    category
 2   pub_yr                        665 non-null    int64   
 3   author                        664 non-null    object  
 4   title                         665 non-null    object  
 5   pub_title                     636 non-null    object  
 6   doi                           588 non-null    object  
 7   url                           450 non-null    object  
 8   study_included                665 non-null    category
 9   model_code_available          572 non-null    category
 10  reporting_guidelines_mention  571 non-null    category
 11  covid                         575 non-null    category
 12  sim_software                  574 non-null    object  
 13  foss_sim                      573 non-null    category
 14  model_archive                 5 non-null      object  
 15  model_repo                    24 non-null     object  
 16  model_journal_supp            7 non-null      object  
 17  model_journal_supp            7 non-null      object  
 18  model_personal_org            5 non-null      object  
 19  model_platform                11 non-null     object  
 20  available_on_req              66 non-null     object  
 21  excluded_reason               100 non-null    object  
dtypes: category(6), int64(1), object(15)
memory usage: 88.1+ KB

clean.head(2)

	key	item_type	pub_yr	author	title	pub_title	doi	url	study_included	model_code_available	...	sim_software	foss_sim	model_archive	model_repo	model_journal_supp	model_journal_supp	model_personal_org	model_platform	available_on_req	excluded_reason
0	6CYNDDIL	journalArticle	2021	Saidani, M.; Kim, H.	A Discrete Event Simulation-Based Model to Opt...	Simulation in healthcare : journal of the Soci...	10.1097/SIH.0000000000000565	https://www.scopus.com/inward/record.uri?eid=2...	1	1.0	...	AnyLogic	0	NaN	NaN	File	File	NaN	NaN	NaN	NaN
1	WJR7T7VY	book	2021	Kenny, E.; Hassanzadeh, H.; Khanna, S.; Boyle,...	Patient flow simulation using historically inf...	NaN	NaN	https://www.scopus.com/inward/record.uri?eid=2...	1	0.0	...	SimPy	1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

2 rows × 22 columns

clean.tail(2)

	key	item_type	pub_yr	author	title	pub_title	doi	url	study_included	model_code_available	...	sim_software	foss_sim	model_archive	model_repo	model_journal_supp	model_journal_supp	model_personal_org	model_platform	available_on_req	excluded_reason
663	AY6AYBAM	journalArticle	2021	Jaime, J.; Möller, J.; Santhirapala, V.; Gill,...	Predicting Hospital Resource Use During COVID-...	Value in health : the journal of the Internati...	10.1016/j.jval.2021.05.023	https://www.sciencedirect.com/science/article/...	0	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	Not DES
664	ZAX8CEH7	journalArticle	2021	Lu, Y.; Guan, Y.; Zhong, X.; Fishe, JN.; Hogan...	CASE - Hospital Beds Planning and Admission Co...	2021 IEEE 17th International Conference on Aut...	10.1109/case49439.2021.9551589	https://search.bvsalud.org/global-literature-o...	1	0.0	...	Arena	0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
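
Once cleaned, the categorical fields can be used to subset and summarise the studies. The sketch below shows one way this might look, using a four-row toy frame rather than the real data: the values are invented for illustration and the resulting counts say nothing about the review itself.

```python
import pandas as pd

# Toy frame (hypothetical values) mirroring three of the cleaned columns.
toy = pd.DataFrame({
    'study_included': pd.Categorical([1, 1, 0, 1]),
    'model_code_available': pd.Categorical([1.0, 0.0, None, 0.0]),
    'sim_software': ['AnyLogic', 'SimPy', None, 'Arena'],
})

# Keep only the studies flagged as included in the final analysis.
included = toy[toy['study_included'] == 1]
n_included = len(included)

# Count how many included studies did or did not share model code.
counts = included['model_code_available'].value_counts()

print(n_included)
print(counts)
```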