Overall summary#
This notebook provides an overall summary of how models were shared, assessed against our best practices for enabling others to use and execute a simulation model. It answers the following research question:
To what extent do the DES health community follow best practice for open science when sharing computer models?
Data used in analysis#
The dataset is a subset of the main review, limited to the studies that shared a model. The type of model shared is coded as Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) versus CODE based (e.g. MATLAB, Python, SimPy, Java, R Simmer).
The data can be found here: https://raw.githubusercontent.com/TomMonks/des_sharing_lit_review/main/data/bp_audit.zip
The following fields are analysed in this notebook. A short data loading sketch follows the field list.
model_format
- VIM or CODE

model_has_doi
- do the model artefacts have their own minted DOI? (0/1)

orcid
- do the researchers provide an ORCID with the model? (0/1)

license
- does the model have an explicit license defining how it can be used? (str)

readme
- is there an obvious file(s) where a user would look first? (0/1)

steps_run
- are there steps to run a model? (0/1)

formal_dep_mgt
- has the model been shared with formal software dependency management? (0/1)

informal_dep_mgt
- have any informal methods of dependency management been shared? E.g. a list of software requirements. (0/1)

evidence_testing
- do the model and artefacts in the repository contain any evidence that they have been tested? (0/1)

downloadable
- can the model and artefacts be downloaded and executed locally? (0/1)

interactive_online
- can the model and its artefacts be executed online without local installation? (0/1)
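As a quick illustration of how the data can be accessed, the archive can be read directly with pandas. This is a minimal sketch only: it assumes bp_audit.zip holds a single CSV whose columns include the fields listed above. The analysis itself uses the cleaned data returned by load_clean_bpa in section 4.

import pandas as pd

# illustrative only: pandas infers the zip compression from the file
# extension, assuming the archive holds a single CSV file.
url = ('https://raw.githubusercontent.com/TomMonks/'
       'des_sharing_lit_review/main/data/bp_audit.zip')
raw = pd.read_csv(url)

# balance of VIM versus CODE based models
# (assumes the raw file already contains the 'model_format' field)
print(raw['model_format'].value_counts())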
1. Imports#
1.1. Standard#
import pandas as pd
import numpy as np
1.2 Preprocessing#
from preprocessing import load_clean_bpa, drop_columns
2. Constants#
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
+ 'des_sharing_lit_review/main/data/bp_audit.zip'
3. Analysis functions#
A number of simple functions to conduct the analysis and format output.
def balance_of_model_format(df):
    '''
    Returns the counts of VIM versus CODE based models.

    Params:
    -------
    df: pd.DataFrame
        Subset of the best practice dataset to analyse

    Returns:
    -------
    (labels: np.ndarray, counts: np.ndarray)
    '''
    unique_elements, counts_elements = np.unique(df['model_format'],
                                                 return_counts=True)
    return unique_elements, counts_elements
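For example, with a toy dataframe (values invented for illustration), the function returns the alphabetically sorted labels and their counts:

# toy data: three CODE and one VIM based model (invented values)
toy = pd.DataFrame({'model_format': ['CODE', 'VIM', 'CODE', 'CODE']})
labels, counts = balance_of_model_format(toy)
print(labels, counts)  # ['CODE' 'VIM'] [3 1]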
def category_frequencies_by_model_format(df, cols):
    '''
    Calculate the frequencies of 0/1s for VIM versus CODE based models
    and return them concatenated in a pandas dataframe.

    Params:
    ------
    df: pd.DataFrame
        Dataframe containing the subset of the best practice audit
        to summarise.

    cols: list
        Names of the binary (0/1) criteria columns to summarise.

    Returns:
    -------
    pd.DataFrame
    '''
    # key to select fields where category is 1.
    key = [('CODE', 1), ('VIM', 1)]
    summary = pd.DataFrame()

    # operation needs to be done separately on each criterion then combined.
    for col in cols:
        # group by VIM and CODE and get frequencies of 1/0
        results = df.groupby('model_format')[col].value_counts(dropna=False)
        # concat to single dataframe
        summary = pd.concat([summary, results.loc[key]], axis=1)

    # drop multi-index, transpose and relabel
    summary = summary.reset_index()
    summary = summary.T
    summary = summary.drop(['level_0', 'level_1'])
    summary.columns = ['CODE', 'VIM']

    # add percentages: total number of CODE and VIM based models.
    _, (n_code, n_vim) = balance_of_model_format(df)
    summary['CODE_%'] = (summary['CODE'] / n_code * 100).map('{:,.1f}'.format)
    summary['VIM_%'] = (summary['VIM'] / n_vim * 100).map('{:,.1f}'.format)
    return summary
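A minimal sketch of the function on made-up records (values chosen so that both formats record at least one 1 for each criterion); the call on the real audit data is in section 5:

# toy audit records (invented) with two binary criteria
toy = pd.DataFrame({'model_format': ['CODE', 'CODE', 'VIM', 'VIM', 'CODE'],
                    'readme': [1, 0, 1, 1, 1],
                    'steps_run': [1, 1, 0, 1, 0]})
category_frequencies_by_model_format(toy, ['readme', 'steps_run'])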
def model_has_license(license):
    '''
    Recode a license value from multiple categories down to binary.
    "None" = 0 else 1. Applied elementwise to the license column.

    Params:
    ------
    license: str
        The license recorded for a single model.

    Returns:
    -------
    int
    '''
    if license == "None":
        return 0
    else:
        return 1
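For example, applied elementwise to a small invented series of license values:

# 'MIT' and 'GPL-3.0' are placeholder license strings for illustration
licenses = pd.Series(['MIT', 'None', 'GPL-3.0'])
print(licenses.apply(model_has_license).tolist())  # [1, 0, 1]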
def format_bpa_results(summary):
    '''
    Convert a four column table of n and % into two
    columns where each entry is formatted as n (%).

    Params:
    -------
    summary: pd.DataFrame
        The unformatted table. Assumes 4 cols and an index.

    Returns:
    -------
    pd.DataFrame
    '''
    row_headings = ['Model has DOI',
                    'ORCID',
                    'Licensed',
                    'Readme',
                    'Steps to run',
                    'Formal Dep Mgt',
                    'Informal Dep Mgt',
                    'Evidence of testing',
                    'Model downloadable',
                    'Model interactive online']

    summary[r'CODE (\%)'] = summary['CODE'].map('{:,.0f}'.format) \
        + ' (' + summary['CODE_%'] + ')'
    summary[r'VIM (\%)'] = summary['VIM'].map('{:,.0f}'.format) \
        + ' (' + summary['VIM_%'] + ')'

    summary = summary.drop(['CODE', 'VIM', 'CODE_%', 'VIM_%'], axis=1)
    summary['criteria'] = row_headings
    summary = summary.set_index('criteria')
    return summary
4. Load and inspect dataset#
The cleaned dataset contains 27 fields. These are listed below.
clean = load_clean_bpa(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model_format 47 non-null category
1 key 47 non-null object
2 item_type 47 non-null category
3 pub_yr 47 non-null int64
4 author 47 non-null object
5 doi 46 non-null object
6 reporting_guidelines_mention 47 non-null category
7 covid 47 non-null category
8 sim_software 47 non-null object
9 foss_sim 47 non-null category
10 model_archive 5 non-null object
11 model_repo 21 non-null object
12 model_journal_supp 10 non-null object
13 model_personal_org 6 non-null object
14 model_platform 11 non-null object
15 github_url 21 non-null object
16 model_has_doi 47 non-null category
17 orcid 46 non-null category
18 license 47 non-null object
19 readme 47 non-null category
20 link_to_paper 37 non-null category
21 steps_run 47 non-null category
22 formal_dep_mgt 47 non-null category
23 informal_dep_mgt 47 non-null category
24 evidence_testing 25 non-null category
25 downloadable 47 non-null category
26 interactive_online 47 non-null category
dtypes: category(15), int64(1), object(11)
memory usage: 7.1+ KB
5. Results#
5.1 Summary table#
The table reports, for each best practice criterion, the number and percentage of CODE and VIM based models that meet it.
cols = ['model_has_doi', 'orcid', 'license_y', 'readme', 'steps_run',
'formal_dep_mgt', 'informal_dep_mgt',
'evidence_testing', 'downloadable', 'interactive_online']
clean['license_y'] = clean['license'].apply(model_has_license)
unformatted = category_frequencies_by_model_format(clean, cols)
unformatted
|                    | CODE | VIM | CODE_% | VIM_% |
|--------------------|------|-----|--------|-------|
| model_has_doi      | 4    | 3   | 12.9   | 18.8  |
| orcid              | 3    | 3   | 9.7    | 18.8  |
| license_y          | 15   | 6   | 48.4   | 37.5  |
| readme             | 21   | 7   | 67.7   | 43.8  |
| steps_run          | 13   | 3   | 41.9   | 18.8  |
| formal_dep_mgt     | 7    | 0   | 22.6   | 0.0   |
| informal_dep_mgt   | 7    | 8   | 22.6   | 50.0  |
| evidence_testing   | 3    | 0   | 9.7    | 0.0   |
| downloadable       | 31   | 11  | 100.0  | 68.8  |
| interactive_online | 4    | 6   | 12.9   | 37.5  |
5.2 Formatted Results for paper + \(\LaTeX\)#
table = format_bpa_results(unformatted)
table
| criteria                 | CODE (\%)  | VIM (\%)  |
|--------------------------|------------|-----------|
| Model has DOI            | 4 (12.9)   | 3 (18.8)  |
| ORCID                    | 3 (9.7)    | 3 (18.8)  |
| Licensed                 | 15 (48.4)  | 6 (37.5)  |
| Readme                   | 21 (67.7)  | 7 (43.8)  |
| Steps to run             | 13 (41.9)  | 3 (18.8)  |
| Formal Dep Mgt           | 7 (22.6)   | 0 (0.0)   |
| Informal Dep Mgt         | 7 (22.6)   | 8 (50.0)  |
| Evidence of testing      | 3 (9.7)    | 0 (0.0)   |
| Model downloadable       | 31 (100.0) | 11 (68.8) |
| Model interactive online | 4 (12.9)   | 6 (37.5)  |
# output as latex
print(table.style.to_latex(hrules=True,
label="Table:bpa_results",
caption="Best practice audit results"))
\begin{table}
\caption{Best practice audit results}
\label{Table:bpa_results}
\begin{tabular}{lll}
\toprule
& CODE (\%) & VIM (\%) \\
criteria & & \\
\midrule
Model has DOI & 4 (12.9) & 3 (18.8) \\
ORCID & 3 (9.7) & 3 (18.8) \\
Licensed & 15 (48.4) & 6 (37.5) \\
Readme & 21 (67.7) & 7 (43.8) \\
Steps to run & 13 (41.9) & 3 (18.8) \\
Formal Dep Mgt & 7 (22.6) & 0 (0.0) \\
Informal Dep Mgt & 7 (22.6) & 8 (50.0) \\
Evidence of testing & 3 (9.7) & 0 (0.0) \\
Model downloadable & 31 (100.0) & 11 (68.8) \\
Model interactive online & 4 (12.9) & 6 (37.5) \\
\bottomrule
\end{tabular}
\end{table}
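If a standalone file is needed for the manuscript, the same string can be written to disk (a minimal sketch; bpa_table.tex is an illustrative filename, not one used elsewhere in the review):

# write the formatted LaTeX table to a file for inclusion in a paper
latex_table = table.style.to_latex(hrules=True,
                                   label="Table:bpa_results",
                                   caption="Best practice audit results")
with open('bpa_table.tex', 'w') as f:
    f.write(latex_table)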