Preprocessing#
This notebook provides an overview of the code to read in the data extracted from the best practice audit of model sharing.
The dataset is a subset of the main review, limited to studies that shared a model. The type of model shared is coded as Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) versus CODE based (e.g. MATLAB, Python, SimPy, Java, R Simmer).
Additional fields have been extracted as part of the best practice review.
model_format - VIM or CODE
model_has_doi - do the model artefacts have their own minted DOI? (0/1)
orcid - do the researchers provide an ORCID with the model? (0/1)
license - does the model have an explicit license defining how it can be used? (str)
readme - is there an obvious file(s) where a user would look first? (0/1)
link_to_paper - does the model repository contain a link back to the pre-print or peer reviewed article? (0/1)
steps_run - are there steps to run the model? (0/1)
formal_dep_mgt - has the model been shared with formal software dependency management? (0/1)
informal_dep_mgt - have any informal methods of dependency management been shared, e.g. a list of software requirements? (0/1)
evidence_testing - do the model and artefacts in the repository contain any evidence that they have been tested? (0/1)
downloadable - can the model and artefacts be downloaded and executed locally? (0/1)
interactive_online - can the model and its artefacts be executed online without local installation? (0/1)
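As an illustration of this coding scheme, a single audit record might look like the following. The values here are hypothetical, chosen only to show the shape of the data, and are not drawn from the audit itself:

```python
import pandas as pd

# Hypothetical example record - values are illustrative only,
# not taken from the best practice audit dataset.
example = pd.DataFrame([{
    'model_format': 'CODE',    # VIM or CODE
    'model_has_doi': 1,        # artefacts have a minted DOI
    'orcid': 0,                # no ORCID provided with the model
    'license': 'MIT',          # explicit license (str)
    'readme': 1,               # obvious entry-point file present
    'link_to_paper': 1,        # repository links back to the article
    'steps_run': 1,            # steps to run the model provided
    'formal_dep_mgt': 0,       # no formal dependency management
    'informal_dep_mgt': 1,     # e.g. a list of software requirements
    'evidence_testing': 0,     # no evidence of testing
    'downloadable': 1,         # can be downloaded and run locally
    'interactive_online': 0,   # cannot be executed online
}])
print(example.shape)  # one record, twelve audit fields
```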
1. Imports#
import pandas as pd
import numpy as np
2. Constants#
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
            + 'des_sharing_lit_review/main/data/bp_audit.zip'

# used to drop redundant fields not needed in analysis.
COLS_TO_DROP = [5, 6, 8, 18, 24, 29, 33]
3. Functions to read and clean dataset#
def cols_to_lower(df):
    '''
    Convert all column names in a dataframe to lower case

    Params:
    ------
    df - pandas.DataFrame

    Returns:
    -------
    pandas.DataFrame
    '''
    new_cols = [c.lower() for c in df.columns]
    df.columns = new_cols
    return df
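A quick check of `cols_to_lower` on a toy frame (the column names below are invented for illustration, mimicking the raw Zotero-style headers):

```python
import pandas as pd

def cols_to_lower(df):
    '''Convert all column names in a dataframe to lower case.'''
    df.columns = [c.lower() for c in df.columns]
    return df

# toy frame with mixed-case headers
toy = pd.DataFrame({'Item Type': ['journalArticle'], 'DOI.1': [1]})
print(list(cols_to_lower(toy).columns))  # ['item type', 'doi.1']
```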
def drop_columns(df, to_drop):
    '''
    Remove fields that are not needed for the clean
    analysis best practice dataset.

    Params:
    -------
    df - pd.DataFrame
        The raw data

    to_drop - list
        Positional indexes of the columns to remove
        (the COLS_TO_DROP constant in this notebook).

    Returns:
    --------
    pd.DataFrame
    '''
    return df.drop(df.columns[to_drop], axis=1)
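Because `to_drop` holds positional indexes rather than names, `df.columns[to_drop]` first resolves them to labels before dropping. A minimal sketch with a toy frame:

```python
import pandas as pd

def drop_columns(df, to_drop):
    '''Remove columns by positional index.'''
    return df.drop(df.columns[to_drop], axis=1)

# toy frame: drop the column at position 1 ('drop_b')
toy = pd.DataFrame({'keep_a': [1], 'drop_b': [2], 'keep_c': [3]})
print(list(drop_columns(toy, [1]).columns))  # ['keep_a', 'keep_c']
```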
def load_clean_bpa(file_name):
    '''
    Loads a cleaned version of the BEST PRACTICE AUDIT dataset.

    1. Replaces spaces in column names with "_" and renames the model DOI column.
    2. Converts all column names to lower case.
    3. Drops columns not needed for analysis.
    4. Recodes values in the Janssen method columns.
    5. Converts relevant columns to the Categorical data type.
    6. Performs remaining type conversions.
    '''
    labels = {'DOI.1': 'model_has_doi',
              'Item Type': 'item_type',
              'Publication Year': 'pub_yr',
              'Publication Title': 'pub_title'}

    recoded = {'model_repo': {'Github': 'GitHub'},
               'model_journal_supp': {'R model in word file': 'Word doc'},
               'model_personal_org': {'personex': 'Organisational website',
                                      'Personex ': 'Organisational website',
                                      'Personex': 'Organisational website',
                                      'https://resp.core.ubc.ca/research/Specific_Projects/EPIC':
                                      'Organisational website'},
               'model_platform': {'Anylogic Cloud': 'AnyLogic Cloud'}}

    clean = (pd.read_csv(file_name)
             .rename(columns=labels)
             .pipe(cols_to_lower)
             .pipe(drop_columns, COLS_TO_DROP)
             .replace(recoded)
             .assign(model_format=lambda x: pd.Categorical(x['model_format']),
                     reporting_guidelines_mention=lambda x:
                         pd.Categorical(x['reporting_guidelines_mention']),
                     covid=lambda x: pd.Categorical(x['covid']),
                     foss_sim=lambda x: pd.Categorical(x['foss_sim']),
                     item_type=lambda x: pd.Categorical(x['item_type']),
                     model_has_doi=lambda x: pd.Categorical(x['model_has_doi']),
                     orcid=lambda x: pd.Categorical(x['orcid']),
                     readme=lambda x: pd.Categorical(x['readme']),
                     link_to_paper=lambda x: pd.Categorical(x['link_to_paper']),
                     steps_run=lambda x: pd.Categorical(x['steps_run']),
                     formal_dep_mgt=lambda x:
                         pd.Categorical(x['formal_dep_mgt']),
                     informal_dep_mgt=lambda x:
                         pd.Categorical(x['informal_dep_mgt']),
                     evidence_testing=lambda x:
                         pd.Categorical(x['evidence_testing']),
                     downloadable=lambda x: pd.Categorical(x['downloadable']),
                     interactive_online=lambda x:
                         pd.Categorical(x['interactive_online']))
             )
    return clean
4. Example load and inspection of cleaned dataset#
4.1 Fields#
The clean dataset has 27 fields. These are listed below.
clean = load_clean_bpa(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model_format 47 non-null category
1 key 47 non-null object
2 item_type 47 non-null category
3 pub_yr 47 non-null int64
4 author 47 non-null object
5 doi 46 non-null object
6 reporting_guidelines_mention 47 non-null category
7 covid 47 non-null category
8 sim_software 47 non-null object
9 foss_sim 47 non-null category
10 model_archive 5 non-null object
11 model_repo 21 non-null object
12 model_journal_supp 10 non-null object
13 model_personal_org 6 non-null object
14 model_platform 11 non-null object
15 github_url 21 non-null object
16 model_has_doi 47 non-null category
17 orcid 46 non-null category
18 license 47 non-null object
19 readme 47 non-null category
20 link_to_paper 37 non-null category
21 steps_run 47 non-null category
22 formal_dep_mgt 47 non-null category
23 informal_dep_mgt 47 non-null category
24 evidence_testing 25 non-null category
25 downloadable 47 non-null category
26 interactive_online 47 non-null category
dtypes: category(15), int64(1), object(11)
memory usage: 7.1+ KB
4.2 Balance of the classes in model_format#
unique_elements, counts_elements = np.unique(clean['model_format'],
                                             return_counts=True)
print(unique_elements, counts_elements)
print(counts_elements[0]/len(clean))
print(counts_elements[1]/len(clean))
['CODE' 'VIM'] [31 16]
0.6595744680851063
0.3404255319148936
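The same proportions can be obtained more directly with pandas `value_counts`. A minimal sketch, reconstructing the class counts reported above (31 CODE, 16 VIM) rather than reloading the dataset:

```python
import pandas as pd

# Reconstruct the class counts reported above: 31 CODE, 16 VIM.
model_format = pd.Series(['CODE'] * 31 + ['VIM'] * 16)

# normalize=True returns proportions rather than raw counts.
proportions = model_format.value_counts(normalize=True)
print(proportions.round(3))
```

`value_counts(normalize=True)` avoids the manual division by `len(clean)` and keeps the class labels attached to their proportions.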