Model archiving
This notebook summarises how discrete-event simulation (DES) models were shared. It uses the methodology of Janssen et al. (2020) to classify model archiving. The summary is broken into the following categories: open science archives, online code repositories, journal supplementary material, personal or organisational websites, and online platforms. Each category is further summarised by whether models were developed with code-based tools or Visual Interactive Modelling (VIM) software. The latter is typically shared as a single file.
The notebook answers the following research question:
What methods, tools, and resources did authors use to share their computer models and code?
Data used in analysis#
The dataset is a subset of the main review, limited to the models that were shared. The type of model shared is coded as Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) or CODE (e.g. Matlab, Python, SimPy, Java, R Simmer).
The data can be found here: https://raw.githubusercontent.com/TomMonks/des_sharing_lit_review/main/data/bp_audit.zip
1. Imports#
1.1. Standard#
import pandas as pd
import numpy as np
1.2 Preprocessing#
from preprocessing import load_clean_bpa, drop_columns
2. Constants#
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/bp_audit.zip'
3. Analysis functions#
A number of simple functions to conduct the analysis and format output.
def get_counts(df, column):
    '''
    For a specified column return a DataFrame indexed by method with a
    single column n. The methods are unique and n is the number of
    instances of each method in the dataset. Missing values are ignored.

    Params:
    ------
    df: pd.DataFrame
        The pandas dataframe containing the cohort of interest

    column: str
        The column containing the values to count.

    Returns:
    -------
    pd.DataFrame
    '''
    # keep only the rows where the column has a recorded value
    method = df[~df[column].isna()][column]
    unique_elements, counts_elements = np.unique(method, return_counts=True)
    unique_elements, counts_elements = pd.DataFrame(unique_elements), \
        pd.DataFrame(counts_elements)
    results = pd.concat([unique_elements, counts_elements], axis=1)
    results.columns = ['method', 'n']
    return results.set_index('method').sort_values('n', ascending=False)
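As a quick check of what `get_counts` returns, the sketch below applies the same logic to a small made-up dataframe and compares the result with pandas' built-in `value_counts`, which also ignores missing values by default. The toy values are illustrative only and are not taken from the review dataset.

```python
import numpy as np
import pandas as pd


def get_counts(df, column):
    """Count non-missing values in `column`, most frequent first."""
    method = df[~df[column].isna()][column]
    unique_elements, counts_elements = np.unique(method, return_counts=True)
    results = pd.concat([pd.DataFrame(unique_elements),
                         pd.DataFrame(counts_elements)], axis=1)
    results.columns = ['method', 'n']
    return results.set_index('method').sort_values('n', ascending=False)


# Hypothetical toy data: four models, one with no repository recorded
toy = pd.DataFrame({'model_repo': ['GitHub', 'GitLab', None, 'GitHub']})

counts = get_counts(toy, 'model_repo')
print(counts)

# value_counts drops missing values by default, so the totals should agree
builtin = toy['model_repo'].value_counts()
print(builtin)
```

If the two results agree, `value_counts` could be used as a shorter alternative, although it returns a Series rather than a DataFrame with an `n` column.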
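def get_model_format_summary(model_format, df_code, df_vim, category):
    '''
    Combine the counts for CODE and VIM models in one column into a
    single table indexed by (category, method).
    '''
    code = get_counts(df_code, model_format)
    vim = get_counts(df_vim, model_format)
    # align on method; a method used by only one format becomes NaN
    comb = pd.concat([code, vim], axis=1)
    comb.columns = ['CODE', 'VIM']
    comb = comb.fillna(0).astype('int')
    comb['category'] = category
    comb = comb.reset_index()
    return comb.set_index(['category', 'method'])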
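The key step in `get_model_format_summary` is the column-wise `pd.concat`, which aligns the two count tables on their `method` index. A method that appears for only one model format has no partner row, so it is filled with zero. A minimal sketch with hand-built count tables (the frames below are constructed for illustration rather than produced by `get_counts`):

```python
import pandas as pd

# Hypothetical per-format counts, indexed by archiving method
code = pd.DataFrame({'n': [19, 1]},
                    index=pd.Index(['GitHub', 'GitLab'], name='method'))
vim = pd.DataFrame({'n': [1]},
                   index=pd.Index(['GitHub'], name='method'))

# Column-wise concat aligns rows on the index (an outer join by default)
comb = pd.concat([code, vim], axis=1)
comb.columns = ['CODE', 'VIM']

# GitLab has no VIM entry, so it is NaN until filled with zero
comb = comb.fillna(0).astype('int')
comb['category'] = 'Repository'
comb = comb.reset_index().set_index(['category', 'method'])
print(comb)
```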
def multiple_archive_methods(df, jansson_method):
    '''
    Identifies whether one or more models are shared by multiple
    archiving methods. For example, Zenodo + GitHub.
    Returns all archive methods used in combination as a list.

    Params:
    ------
    df: pd.DataFrame
        The pandas dataframe containing the cohort of interest

    jansson_method: list
        A list of jansson method fields. Assumes the first field is
        model_format and this is excluded from analysis.

    Returns:
    -------
    list
    '''
    jansson = df[jansson_method[1:]].fillna(0)
    # all non zeros to 1 (via bool -> int)
    jansson = jansson.astype(bool).astype(int)
    # rows (models) with more than one archiving method recorded
    multiple_archived = df[jansson.sum(axis=1) > 1][jansson_method]

    # loop through columns and get uniques
    results = []
    for col in jansson_method[1:]:
        results += get_counts(multiple_archived, col).index.tolist()
    return results
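The detection step in `multiple_archive_methods` can be seen in isolation on a toy dataframe. The sketch below uses made-up rows and only three of the category columns. It converts each cell to a 0/1 flag and sums across the row, so a total above 1 marks a model shared in more than one way. The frame is created with `dtype=object` as an assumption about how the real columns are stored.

```python
import pandas as pd

# Hypothetical rows: one model per row, one column per archiving category
toy = pd.DataFrame({
    'model_archive': ['Zenodo', None, None],
    'model_repo': ['GitHub', 'GitHub', None],
    'model_platform': [None, None, 'AnyLogic Cloud'],
}, dtype=object)

# Missing -> 0, any recorded method -> 1 (via bool -> int)
flags = toy.fillna(0).astype(bool).astype(int)

# A row total above 1 means the model was shared in more than one way
n_methods = flags.sum(axis=1)
multiple = toy[n_methods > 1]
print(multiple)
```

Only the first row (Zenodo + GitHub) is selected here. The other two models each use a single method.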
4. Load and inspect dataset#
The dataframe clean contains the full dataset used in the best practice audit.
clean = load_clean_bpa(FILE_NAME)
Split the data into CODE and VIM dataframes to assist in creating the main summary.
jansson_method = ['model_format', 'model_archive', 'model_repo', 'model_journal_supp',
'model_personal_org', 'model_platform']
df_code = clean[jansson_method]
df_code = df_code[df_code['model_format'] == 'CODE']
df_vim = clean[jansson_method]
df_vim = df_vim[df_vim['model_format'] == 'VIM']
5. Results#
The main aim of the results section is to summarise all archiving methods in a single table (in the same style as Janssen et al., 2020). This is built up category by category and then all tables are combined.
5.1 Overall numeric summary#
clean[jansson_method].groupby(by='model_format').count().T
model_format | CODE | VIM |
---|---|---|
model_archive | 2 | 3 |
model_repo | 20 | 1 |
model_journal_supp | 6 | 4 |
model_personal_org | 4 | 2 |
model_platform | 5 | 6 |
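The counts in this table come from `count()`, which tallies the non-missing values in each column within each `model_format` group. A model shared in two ways therefore contributes to two rows. The sketch below reproduces the pattern on a small made-up dataframe with two of the category columns.

```python
import pandas as pd

# Hypothetical data: two CODE models and one VIM model
toy = pd.DataFrame({
    'model_format': ['CODE', 'CODE', 'VIM'],
    'model_archive': ['Zenodo', None, None],
    'model_repo': ['GitHub', 'GitHub', None],
}, dtype=object)

# Non-missing entries per column within each format, then transpose so
# that categories are rows and model formats are columns
summary = toy.groupby(by='model_format').count().T
print(summary)
```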
5.2 Open science archives#
ARCHIVE = 'model_archive'
archive_results = get_counts(clean[jansson_method], ARCHIVE)
archive_results
method | n |
---|---|
Zenodo | 2 |
Institutional | 1 |
Mendeley | 1 |
Research Square | 1 |
archive_comb = get_model_format_summary('model_archive', df_code, df_vim,
'Archive')
archive_comb
category | method | CODE | VIM |
---|---|---|---|
Archive | Institutional | 1 | 0 |
Archive | Zenodo | 1 | 1 |
Archive | Mendeley | 0 | 1 |
Archive | Research Square | 0 | 1 |
5.2 Model repositories#
repo_results = get_counts(clean[jansson_method], 'model_repo')
repo_results
method | n |
---|---|
GitHub | 20 |
GitLab | 1 |
repo_comb = get_model_format_summary('model_repo', df_code, df_vim,
'Repository')
repo_comb
category | method | CODE | VIM |
---|---|---|---|
Repository | GitHub | 19 | 1 |
Repository | GitLab | 1 | 0 |
5.3 Format of models stored in journal supplementary material#
supp_results = get_counts(clean[jansson_method], 'model_journal_supp')
supp_results
method | n |
---|---|
File | 5 |
Word doc | 3 |
PDF | 1 |
r script | 1 |
supp_comb = get_model_format_summary('model_journal_supp', df_code, df_vim,
'Journal')
supp_comb
category | method | CODE | VIM |
---|---|---|---|
Journal | Word doc | 3 | 0 |
Journal | File | 1 | 4 |
Journal | PDF | 1 | 0 |
Journal | r script | 1 | 0 |
5.4 Personal and organisational websites#
org_results = get_counts(clean[jansson_method], 'model_personal_org')
org_results
method | n |
---|---|
Organisational website | 4 |
Google Drive | 2 |
org_comb = get_model_format_summary('model_personal_org', df_code, df_vim,
'Personal or Organisational')
org_comb
category | method | CODE | VIM |
---|---|---|---|
Personal or Organisational | Organisational website | 4 | 0 |
Personal or Organisational | Google Drive | 0 | 2 |
5.5 Platform#
platform_results = get_counts(clean[jansson_method], 'model_platform')
platform_results
method | n |
---|---|
AnyLogic Cloud | 6 |
CRAN | 2 |
BinderHub | 1 |
Google Colab | 1 |
R Shiney | 1 |
platform_comb = get_model_format_summary('model_platform', df_code, df_vim,
'Platform')
platform_comb
category | method | CODE | VIM |
---|---|---|---|
Platform | CRAN | 2 | 0 |
Platform | BinderHub | 1 | 0 |
Platform | Google Colab | 1 | 0 |
Platform | R Shiney | 1 | 0 |
Platform | AnyLogic Cloud | 0 | 6 |
5.7 Overall summary table#
jansson_table = pd.concat([archive_comb, repo_comb,
supp_comb, org_comb, platform_comb])
jansson_table
category | method | CODE | VIM |
---|---|---|---|
Archive | Institutional | 1 | 0 |
Archive | Zenodo | 1 | 1 |
Archive | Mendeley | 0 | 1 |
Archive | Research Square | 0 | 1 |
Repository | GitHub | 19 | 1 |
Repository | GitLab | 1 | 0 |
Journal | Word doc | 3 | 0 |
Journal | File | 1 | 4 |
Journal | PDF | 1 | 0 |
Journal | r script | 1 | 0 |
Personal or Organisational | Organisational website | 4 | 0 |
Personal or Organisational | Google Drive | 0 | 2 |
Platform | CRAN | 2 | 0 |
Platform | BinderHub | 1 | 0 |
Platform | Google Colab | 1 | 0 |
Platform | R Shiney | 1 | 0 |
Platform | AnyLogic Cloud | 0 | 6 |
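Because every per-category table shares the same `(category, method)` index and the same `CODE` and `VIM` columns, a row-wise `pd.concat` simply stacks them. The sketch below rebuilds a subset of the table by hand to show the stacking, and then sums by index level as one possible way to get per-category totals. The helper `toy_block` is hypothetical and exists only for this illustration.

```python
import pandas as pd


def toy_block(category, rows):
    """Build a (category, method)-indexed table from {method: (code, vim)}."""
    index = pd.MultiIndex.from_tuples([(category, m) for m in rows],
                                      names=['category', 'method'])
    return pd.DataFrame(list(rows.values()), index=index,
                        columns=['CODE', 'VIM'])


# Subset of the counts reported above, entered by hand
archive = toy_block('Archive', {'Zenodo': (1, 1), 'Mendeley': (0, 1)})
repo = toy_block('Repository', {'GitHub': (19, 1), 'GitLab': (1, 0)})

# Row-wise concat stacks the blocks and keeps the two-level index
table = pd.concat([archive, repo])
print(table)

# Totals per category for this subset only
totals = table.groupby(level='category').sum()
print(totals)
```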
5.8 Modify Janssen table to indicate combinations#
The table below makes one small change to the summary table: every archiving method that was used in combination with another is flagged with an asterisk.
multi_methods = multiple_archive_methods(clean, jansson_method)
recode = {'method':{}}
for method in multi_methods:
    recode['method'][method] = f'{method}*'
recode
jansson_table = jansson_table.reset_index()
jansson_table = jansson_table.replace(recode)
jansson_table = jansson_table.set_index(['category', 'method'])
jansson_table
category | method | CODE | VIM |
---|---|---|---|
Archive | Institutional* | 1 | 0 |
Archive | Zenodo* | 1 | 1 |
Archive | Mendeley | 0 | 1 |
Archive | Research Square | 0 | 1 |
Repository | GitHub* | 19 | 1 |
Repository | GitLab* | 1 | 0 |
Journal | Word doc | 3 | 0 |
Journal | File | 1 | 4 |
Journal | PDF | 1 | 0 |
Journal | r script | 1 | 0 |
Personal or Organisational | Organisational website* | 4 | 0 |
Personal or Organisational | Google Drive | 0 | 2 |
Platform | CRAN | 2 | 0 |
Platform | BinderHub* | 1 | 0 |
Platform | Google Colab | 1 | 0 |
Platform | R Shiney* | 1 | 0 |
Platform | AnyLogic Cloud | 0 | 6 |
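The asterisks are added with a nested dictionary passed to `DataFrame.replace`: the outer key names the column to search and the inner dictionary maps each old value to its replacement. A minimal sketch on a toy table, where the list of combined methods is made up for illustration:

```python
import pandas as pd

table = pd.DataFrame({'method': ['GitHub', 'GitLab', 'CRAN'],
                      'CODE': [19, 1, 2]})

# Hypothetical list of methods used in combination; a repeated entry
# just overwrites the same dictionary key
multi_methods = ['GitHub', 'GitHub', 'GitLab']

# {column: {old_value: new_value}} restricts the replacement to 'method'
recode = {'method': {m: f'{m}*' for m in multi_methods}}
flagged = table.replace(recode)
print(flagged)
```

Values are matched exactly, so methods that are not in the dictionary (CRAN here) and the numeric columns are left unchanged.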
6. Output table as LaTeX#
print(jansson_table.style.to_latex(hrules=True,
                                   label="Table:4",
                                   caption="Janssen et al. classification of model archiving"))
\begin{table}
\caption{Janssen et al. classification of model archiving}
\label{Table:4}
\begin{tabular}{llrr}
\toprule
& & CODE & VIM \\
category & method & & \\
\midrule
\multirow[c]{4}{*}{Archive} & Institutional* & 1 & 0 \\
& Zenodo* & 1 & 1 \\
& Mendeley & 0 & 1 \\
& Research Square & 0 & 1 \\
\multirow[c]{2}{*}{Repository} & GitHub* & 19 & 1 \\
& GitLab* & 1 & 0 \\
\multirow[c]{4}{*}{Journal} & Word doc & 3 & 0 \\
& File & 1 & 4 \\
& PDF & 1 & 0 \\
& r script & 1 & 0 \\
\multirow[c]{2}{*}{Personal or Organisational} & Organisational website* & 4 & 0 \\
& Google Drive & 0 & 2 \\
\multirow[c]{5}{*}{Platform} & CRAN & 2 & 0 \\
& BinderHub* & 1 & 0 \\
& Google Colab & 1 & 0 \\
& R Shiney* & 1 & 0 \\
& AnyLogic Cloud & 0 & 6 \\
\bottomrule
\end{tabular}
\end{table}