Dataset pre-processing
This notebook provides an overview of the code to read in the data extracted from the review.
The dataset is held in a CSV file that has been extracted from a Zotero library (TODO: INSERT Zotero library link). The following fields were then extracted from each paper:
- study_included: has the study been included in the final analysis
- model_code_available: is the model made publicly available in some manner
- reporting_guidelines_mention: have reporting guidelines been mentioned or explicitly cited
- covid: is DES being used to tackle COVID-19
- sim_software: name of simulation software or programming language, if stated
- foss_sim: is the simulation software free and open source? 0/1
- model_archive: name of archive, if used
- model_repo: name of model repo, if used
- model_journal_supp: what is stored in the journal supplementary material
- model_personal_org: name of personal or organisational website, if used
- model_platform: name of cloud platform used (e.g. Binder or AnyLogic Cloud)
- excluded_reason: one of four reasons that the study was excluded
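As a minimal sketch, the field list above can be turned into a quick check that a loaded DataFrame contains the expected extraction columns. The helper name `missing_fields` and the toy frame are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Extraction fields as named in the list above.
EXPECTED_FIELDS = [
    'study_included', 'model_code_available',
    'reporting_guidelines_mention', 'covid', 'sim_software', 'foss_sim',
    'model_archive', 'model_repo', 'model_journal_supp',
    'model_personal_org', 'model_platform', 'excluded_reason',
]


def missing_fields(df, expected=EXPECTED_FIELDS):
    '''Hypothetical helper: return the expected fields absent from df.'''
    return [name for name in expected if name not in df.columns]


# Toy frame standing in for the real extract (illustrative values only).
toy = pd.DataFrame({'study_included': [1, 0], 'covid': [0, 1]})
absent = missing_fields(toy)
print(absent)
```

An empty list would indicate that every listed field is present.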
1. Imports#
import pandas as pd
import numpy as np
2. Constants#
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/share_sim_data_extract.zip'

# used to drop redundant manuscript fields output by Zotero
# e.g. keywords and abstracts.
# NOTE: position 52 is listed twice, so model_journal_supp is selected
# twice (see the duplicated column in the output in section 4).
COLS_TO_KEEP = [2, 3, 4, 5, 6, 7, 10, 11, 44, 45, 46, 47,
                48, 49, 50, 51, 52, 52, 53, 54, 55, 57]
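COLS_TO_KEEP lists position 52 twice. The sketch below, using an invented four-column frame, illustrates how a repeated position selects the same column twice, and one possible way to remove repeats while keeping the order. It is an illustration only, not a change the original notebook makes.

```python
import pandas as pd

# Toy frame; the column names are invented for illustration.
toy = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

# A repeated position selects the same column twice.
positions = [0, 2, 2, 3]
selected = toy[toy.columns[positions]]
print(list(selected.columns))  # ['a', 'c', 'c', 'd']

# dict.fromkeys keeps the first occurrence of each position, in order.
unique_positions = list(dict.fromkeys(positions))
deduped = toy[toy.columns[unique_positions]]
print(list(deduped.columns))   # ['a', 'c', 'd']
```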
3. Function to read and clean dataset#
We read and clean the dataset using pandas.
3.1 Cleaning helper functions#
Two supporting functions are defined for the main routine. These trim redundant columns and convert all column names to lower case.
def trim_columns(df):
    '''
    Remove fields that are not needed for the clean
    analysis dataset.

    Uses the COLS_TO_KEEP constant list.

    Params:
    -------
    df - pd.DataFrame
        The raw data

    Returns:
    --------
    pd.DataFrame
    '''
    return df[df.columns[COLS_TO_KEEP]]


def cols_to_lower(df):
    '''
    Convert all column names in a dataframe to lower case

    Params:
    -------
    df - pandas.DataFrame

    Returns:
    --------
    pandas.DataFrame
    '''
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df
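Note that `cols_to_lower` assigns to `df.columns`, so it renames the columns of the DataFrame it is given rather than a copy. Inside a method chain this only touches an intermediate result. If the function were called directly on a frame that is reused elsewhere, a non-mutating variant might be preferable. The sketch below is one such variant; the name `cols_to_lower_copy` is hypothetical.

```python
import pandas as pd


def cols_to_lower_copy(df):
    '''
    Hypothetical variant of cols_to_lower that returns a new
    DataFrame and leaves the caller's DataFrame unchanged.
    '''
    # rename accepts a function that is applied to each column label
    return df.rename(columns=str.lower)


raw = pd.DataFrame({'Key': ['A1'], 'Title': ['Example']})
lowered = cols_to_lower_copy(raw)

print(list(raw.columns))      # original labels are untouched
print(list(lowered.columns))  # lower-case labels
```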
3.2 Main load and clean function#
The main function uses pandas method chaining.
def load_clean_dataset(file_name):
    '''
    Loads a cleaned version of the dataset

    1. Trims the columns to only those relevant to the analysis
    2. Renames the columns that contain spaces
       (e.g. 'Publication Year' becomes 'pub_yr')
    3. Converts all column names to lower case
    4. Recodes inconsistent labels (e.g. 'Matlab' becomes 'MATLAB')
    5. Converts relevant cols to Categorical data type
    6. Performs remaining type conversions.

    Params:
    -------
    file_name - str
        Path or URL of the data extract

    Returns:
    --------
    pd.DataFrame
    '''
    labels = {'Item Type': 'item_type',
              'Publication Year': 'pub_yr',
              'Publication Title': 'pub_title'}

    type_conversions = {'pub_yr': 'int'}

    # outer key = column name; inner dict maps old value -> new value
    recoded_types = {'item_type': {'bookSection': 'book'},
                     'reporting_guidelines_mention': {'ISPOR-SMDM': 'ISPOR',
                                                      '0': 'None'},
                     'sim_software': {'Anylogic': 'AnyLogic',
                                      'Treeage': 'TreeAge',
                                      'Matlab Simulink': 'MATLAB',
                                      'Matlab SimEvents': 'MATLAB',
                                      'Matlab': 'MATLAB',
                                      'MatLab SimEvents': 'MATLAB',
                                      'MatLab': 'MATLAB'}}

    clean = (pd.read_csv(file_name)
             .pipe(trim_columns)
             .rename(columns=labels)
             .pipe(cols_to_lower)
             .replace(recoded_types)
             .assign(study_included=lambda x:
                     pd.Categorical(x['study_included']),
                     model_code_available=lambda x:
                     pd.Categorical(x['model_code_available']),
                     reporting_guidelines_mention=lambda x:
                     pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']))
             .astype(type_conversions)
             )
    return clean
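To make the chain easier to follow, here is a minimal sketch of the same pattern (rename, recode with a nested dict, convert to Categorical, cast types) applied to a small invented frame. The data values are made up for illustration and the sketch covers only a subset of the steps in `load_clean_dataset`.

```python
import pandas as pd

# Toy data; values are invented for illustration.
toy = pd.DataFrame({
    'Publication Year': ['2020', '2021', '2021'],
    'sim_software': ['Matlab', 'Anylogic', 'SimPy'],
    'covid': [0, 1, 0],
})

# Nested dict: outer key = column, inner dict = old value -> new value.
recode = {'sim_software': {'Matlab': 'MATLAB', 'Anylogic': 'AnyLogic'}}

tidy = (toy
        .rename(columns={'Publication Year': 'pub_yr'})
        .replace(recode)
        .assign(covid=lambda x: pd.Categorical(x['covid']))
        .astype({'pub_yr': 'int'})
        )

print(tidy['sim_software'].tolist())
print(tidy.dtypes)
```

Each step returns a new DataFrame, so the toy input is left unchanged and the steps read top to bottom in the order they are applied.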
4. Example: read in and clean#
Here we run the preprocessing of the main dataset, examine the DataFrame information, and peek at the head and tail.
clean = load_clean_dataset(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 665 entries, 0 to 664
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 key 665 non-null object
1 item_type 665 non-null category
2 pub_yr 665 non-null int64
3 author 664 non-null object
4 title 665 non-null object
5 pub_title 636 non-null object
6 doi 588 non-null object
7 url 450 non-null object
8 study_included 665 non-null category
9 model_code_available 572 non-null category
10 reporting_guidelines_mention 571 non-null category
11 covid 575 non-null category
12 sim_software 574 non-null object
13 foss_sim 573 non-null category
14 model_archive 5 non-null object
15 model_repo 24 non-null object
16 model_journal_supp 7 non-null object
17 model_journal_supp 7 non-null object
18 model_personal_org 5 non-null object
19 model_platform 11 non-null object
20 available_on_req 66 non-null object
21 excluded_reason 100 non-null object
dtypes: category(6), int64(1), object(15)
memory usage: 88.1+ KB
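The summary above lists `model_journal_supp` twice and shows that several columns are mostly empty. As a hedged sketch, checks like the following could flag repeated column labels and count missing values per column. The toy frame is invented and only mimics the shape of the real data.

```python
import pandas as pd

# Toy frame with a repeated column label (illustrative values only).
toy = pd.DataFrame(
    [[1, 'File', 'File'], [0, None, None]],
    columns=['study_included', 'model_journal_supp', 'model_journal_supp'],
)

# Labels that occur more than once (the first occurrence is not flagged).
dupes = toy.columns[toy.columns.duplicated()].tolist()
print(dupes)

# Number of missing values in each column.
missing = toy.isna().sum()
print(missing)
```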
clean.head(2)
| | key | item_type | pub_yr | author | title | pub_title | doi | url | study_included | model_code_available | ... | sim_software | foss_sim | model_archive | model_repo | model_journal_supp | model_journal_supp | model_personal_org | model_platform | available_on_req | excluded_reason |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 6CYNDDIL | journalArticle | 2021 | Saidani, M.; Kim, H. | A Discrete Event Simulation-Based Model to Opt... | Simulation in healthcare : journal of the Soci... | 10.1097/SIH.0000000000000565 | https://www.scopus.com/inward/record.uri?eid=2... | 1 | 1.0 | ... | AnyLogic | 0 | NaN | NaN | File | File | NaN | NaN | NaN | NaN |
1 | WJR7T7VY | book | 2021 | Kenny, E.; Hassanzadeh, H.; Khanna, S.; Boyle,... | Patient flow simulation using historically inf... | NaN | NaN | https://www.scopus.com/inward/record.uri?eid=2... | 1 | 0.0 | ... | SimPy | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 22 columns
clean.tail(2)
| | key | item_type | pub_yr | author | title | pub_title | doi | url | study_included | model_code_available | ... | sim_software | foss_sim | model_archive | model_repo | model_journal_supp | model_journal_supp | model_personal_org | model_platform | available_on_req | excluded_reason |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
663 | AY6AYBAM | journalArticle | 2021 | Jaime, J.; Möller, J.; Santhirapala, V.; Gill,... | Predicting Hospital Resource Use During COVID-... | Value in health : the journal of the Internati... | 10.1016/j.jval.2021.05.023 | https://www.sciencedirect.com/science/article/... | 0 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | Not DES |
664 | ZAX8CEH7 | journalArticle | 2021 | Lu, Y.; Guan, Y.; Zhong, X.; Fishe, JN.; Hogan... | CASE - Hospital Beds Planning and Admission Co... | 2021 IEEE 17th International Conference on Aut... | 10.1109/case49439.2021.9551589 | https://search.bvsalud.org/global-literature-o... | 1 | 0.0 | ... | Arena | 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 rows × 22 columns