Open Research#
This notebook analyses how models were shared, measured against our best practices for open research. In summary, best practice is defined as:
Shared models have their own DOI and hence guarantees on persistence;
The authors of shared models and artefacts can be uniquely identified by ORCIDs;
Models are shared with an open license that sets out how the model can be used/adapted, author liability and if credit is needed.
Notebook aims#
The notebook analyses the following questions related to best practice:
What proportion of the shared model artefacts have a DOI and guarantees on persistence?
What proportion of artefacts are linked to the researcher via ORCID(s)?
What proportion of models have an open license?
When a model is licensed what was the most popular license?
How do licenses relate to approaches to sharing models?
Data used in analysis#
The dataset is a subset of the main review, limited to the models that were shared. The type of model shared is coded as Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) versus CODE (e.g. Matlab, Python, SimPy, Java, R Simmer).
The data can be found here: https://raw.githubusercontent.com/TomMonks/des_sharing_lit_review/main/data/bp_audit.zip
The following fields are analysed in this notebook.
model_format - VIM or CODE
model_has_doi - do the model artefacts have their own minted DOI? (0/1)
orcid - do the researchers provide an ORCID with the model? (0/1)
license - does the model have an explicit license defining how it can be used? (str)
model_archive - name of archive if used (0/1)
model_repo - name of model repo if used (0/1)
model_journal_supp - what is stored in the journal supplementary material (0/1)
model_personal_org - name of personal or organisational website if used (0/1)
model_platform - name of cloud platform used (e.g. Binder or Anylogic cloud) (0/1)
1. Imports#
1.1. Standard#
import pandas as pd
import numpy as np
1.2 Preprocessing#
from preprocessing import load_clean_bpa
2. Constants#
FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
+ 'des_sharing_lit_review/main/data/bp_audit.zip'
LICENSE_LABEL = 'license'
NONE_LABEL = 'None'
3. Analysis functions#
A number of simple functions to conduct the analysis and format output.
def balance_of_model_format(df):
    '''
    Return the unique model formats in the dataset and the number of
    models recorded for each.
    '''
    unique_elements, counts_elements = np.unique(df['model_format'],
                                                 return_counts=True)
    return unique_elements, counts_elements
def license_versus_no_license(df):
    '''
    Returns a tuple containing the (number of licensed models, not licensed)
    contained within the dataset.

    Parameters:
    -----------
    df: pd.DataFrame
        A dataset to analyse. Could be full dataset or a partial subset

    Returns:
    --------
    tuple (int, int)
    '''
    # models with no license are coded with NONE_LABEL
    n_not_licensed = len(df[df[LICENSE_LABEL] == NONE_LABEL])
    return len(df) - n_not_licensed, n_not_licensed
def field_by_sharing_tools(df, field=LICENSE_LABEL):
    '''
    Return a DataFrame containing the values of a field, by default
    licenses, (rows) by type of sharing (columns)
    i.e. archive, cloud repo, journal supp, personal/org website, platform.

    Parameters:
    -----------
    df: pd.DataFrame
        Contains data to analyse. E.g. full dataset or subset

    field: str, optional (default=LICENSE_LABEL)
        Name of the column to group by

    Returns:
    -------
    pd.DataFrame
        One row per value of `field`, one column per type of sharing
    '''
    selected_columns = ['model_archive', 'model_repo', 'model_journal_supp',
                        'model_personal_org', 'model_platform']
    # count() tallies the non-null entries in each sharing column
    license_by_sharing = df.groupby(by=field)[selected_columns].count()
    return license_by_sharing.sort_values(by='model_repo',
                                          ascending=False)
def format_license_table(df):
    '''
    Format the license table.
    '''
    column_headers = ['Archive', 'Repository',
                      'Journal', 'Personal/org', 'Platform']
    df.columns = column_headers
    return df
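To illustrate how the grouped count in field_by_sharing_tools behaves, here is a minimal sketch on a toy DataFrame. The rows are invented for illustration and are not taken from the review dataset; only two of the five sharing columns are included to keep the example short.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not from the review dataset).
toy = pd.DataFrame({
    'license': ['MIT', 'MIT', 'None', 'GPL-3'],
    'model_archive': ['Zenodo', np.nan, np.nan, np.nan],
    'model_repo': ['GitHub', 'GitHub', 'GitLab', np.nan],
})

# count() tallies non-null entries per group, so a model contributes to a
# column only when a name has been recorded for that sharing approach.
by_license = toy.groupby(by='license')[['model_archive', 'model_repo']].count()
print(by_license)
```

Because the count is of non-null cells, a sharing column must be left empty (NaN) when that approach was not used. A license that appears in the data but has no entry in a sharing column shows up as a row of zeros.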
3. Load and inspect dataset#
The clean dataset includes 27 fields. These are listed below.
clean = load_clean_bpa(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 27 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model_format 47 non-null category
1 key 47 non-null object
2 item_type 47 non-null category
3 pub_yr 47 non-null int64
4 author 47 non-null object
5 doi 46 non-null object
6 reporting_guidelines_mention 47 non-null category
7 covid 47 non-null category
8 sim_software 47 non-null object
9 foss_sim 47 non-null category
10 model_archive 5 non-null object
11 model_repo 21 non-null object
12 model_journal_supp 10 non-null object
13 model_personal_org 6 non-null object
14 model_platform 11 non-null object
15 github_url 21 non-null object
16 model_has_doi 47 non-null category
17 orcid 46 non-null category
18 license 47 non-null object
19 readme 47 non-null category
20 link_to_paper 37 non-null category
21 steps_run 47 non-null category
22 formal_dep_mgt 47 non-null category
23 informal_dep_mgt 47 non-null category
24 evidence_testing 25 non-null category
25 downloadable 47 non-null category
26 interactive_online 47 non-null category
dtypes: category(15), int64(1), object(11)
memory usage: 7.1+ KB
4. Results#
4.2 What proportion of artefacts are linked to the researcher via ORCID(s)?#
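unique_elements, counts_elements = np.unique(clean['orcid'],
                                             return_counts=True)
# counts are ordered by value (0 then 1), so index 1 holds models with an ORCID
has_orcid = counts_elements[1]
has_orcid_percent = (has_orcid / len(clean)) * 100
# '\\%' keeps the LaTeX-escaped percent sign without an invalid escape sequence
orcid_result = f'A total of {has_orcid} ({has_orcid_percent:.1f}\\%) models ' \
    + 'were provided were linked to a researcher via an ORCID.'
print(orcid_result)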
A total of 6 (12.8\%) models were provided were linked to a researcher via an ORCID.
Of this small number, what was the format of the shared models, and how were they shared?
orcids = clean[clean['orcid'] == 1]
model_format, counts = balance_of_model_format(orcids)
print(model_format, counts)
['CODE' 'VIM'] [3 3]
format_license_table(field_by_sharing_tools(orcids, field='orcid'))
orcid | Archive | Repository | Journal | Personal/org | Platform |
---|---|---|---|---|---|
0.0 | 0 | 0 | 0 | 0 | 0 |
1.0 | 2 | 0 | 4 | 0 | 0 |
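The ORCID count above reads counts_elements[1], which relies on the value 1 sitting in the second position of the np.unique output. As a hedged alternative, the sketch below compares by value instead. The helper name proportion_flagged and the toy data are assumptions made for illustration; they are not part of the notebook or its preprocessing module.

```python
import numpy as np
import pandas as pd

def proportion_flagged(df, field):
    '''Hypothetical helper: return (count, percent) of rows where field == 1.'''
    # Compare by value rather than by position in the np.unique output.
    # Missing values compare unequal to 1, so they are not counted.
    n_flagged = int((df[field] == 1).sum())
    return n_flagged, (n_flagged / len(df)) * 100

# Toy data, invented for illustration: two flagged rows, one missing value.
toy = pd.DataFrame({'orcid': pd.Categorical([1.0, 0.0, np.nan, 1.0])})
n, pct = proportion_flagged(toy, 'orcid')
print(f'{n} ({pct:.1f}%)')
```

The same helper could be applied to any 0/1 field, for example model_has_doi, under the assumption that the field is coded in the same way as orcid.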
4.3 What proportion of models have an open license?#
We extracted the type of license included with each shared model. When no license was included we recorded this as None. For one model shared as supplementary material with a journal we were unable to determine what license had been applied. We labelled this as Unknown. When a model was published as journal supplementary material we assigned the same license as applied to the paper if it was not explicitly stated. For example, if a paper was published under a CC-BY 4.0 license and there was no explicit license attached to supplementary material we assumed the same license for the model.
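The coding rule described above can be sketched as a small function. This is one possible reading of the rule, not the authors' actual coding procedure (which was manual): the function name assign_license and its parameters are hypothetical, and the sketch assumes that a journal supplement whose paper license could not be determined is the case labelled Unknown.

```python
NONE_LABEL = 'None'
UNKNOWN_LABEL = 'Unknown'

def assign_license(explicit_license=None, paper_license=None,
                   is_journal_supp=False):
    '''Hypothetical sketch of the license coding rule.'''
    # An explicit license on the model always takes precedence.
    if explicit_license:
        return explicit_license
    # Journal supplementary material inherits the paper's license;
    # assumption: if that cannot be determined, code it as Unknown.
    if is_journal_supp:
        return paper_license if paper_license else UNKNOWN_LABEL
    # Otherwise no license was included with the model.
    return NONE_LABEL

print(assign_license(explicit_license='MIT'))
print(assign_license(paper_license='CC-BY 4.0', is_journal_supp=True))
print(assign_license(is_journal_supp=True))
print(assign_license())
```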
licensed, not_licensed = license_versus_no_license(clean)
per_licensed = (licensed / len(clean)) * 100
per_not_licensed = (not_licensed / len(clean)) * 100
# '\\%' keeps the LaTeX-escaped percent sign without an invalid escape sequence
license_txt = f'Of the models shared a total of {licensed} ({per_licensed:.1f}\\%)' \
    + 'had an open license attached.'
print(license_txt)
Of the models shared a total of 21 (44.7\%)had an open license attached.
4.4 When a model is licensed what was the most popular license?#
licenses, n_license = np.unique(clean[LICENSE_LABEL],
return_counts=True)
license_results = pd.concat([pd.Series(licenses), pd.Series(n_license)], axis=1)
license_results.columns = ['License', 'n']
license_results = license_results.set_index('License')
# drop none from the results
license_results = license_results.drop(NONE_LABEL)
license_results.sort_values(by='n', ascending=False)
License | n |
---|---|
CC-BY 4.0 | 6 |
GPL-3 | 5 |
MIT | 3 |
Apache | 1 |
BSD-3 | 1 |
CC BY-NC 4.0 | 1 |
CC BY-NC-ND 4.0 | 1 |
CC BY-NC-SA 4.0 | 1 |
CC-BY-NC 4.0 | 1 |
Unknown | 1 |
Creative Commons (CC) type licenses are the most popular overall.
# select licenses whose label starts with 'CC' and sum the 'n' column by label
cc_licenses = [x for x in license_results.index if x.startswith('CC')]
n_cc_licenses = license_results.loc[cc_licenses, 'n'].sum()
print(n_cc_licenses)
10
pop_license = 'The most popular type of license were the creative commons variants' \
+ f' with a total of {n_cc_licenses} out of {clean.shape[0]} models.'
print(pop_license)
The most popular type of license were the creative commons variants with a total of 10 out of 47 models.
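The license table lists both "CC BY-NC 4.0" and "CC-BY-NC 4.0". These look like spelling variants of the same license, although that is an assumption and not something the dataset states. If so, a small normalisation step before counting would merge them. The helper normalise_cc_label below is hypothetical, and the counts are copied from three rows of the license table.

```python
import re
import pandas as pd

def normalise_cc_label(label):
    '''Hypothetical helper: write CC labels as "CC <terms> <version>".'''
    # Replace the separator after the leading "CC" (hyphen or spaces)
    # with a single space; other labels are returned unchanged.
    return re.sub(r'^CC[\s-]+', 'CC ', label)

# Counts copied from the license table for three of the CC labels.
counts = pd.Series({'CC-BY 4.0': 6, 'CC BY-NC 4.0': 1, 'CC-BY-NC 4.0': 1})

# Group by the normalised label so spelling variants are summed together.
merged = counts.groupby(normalise_cc_label).sum()
print(merged)
```

The total number of Creative Commons licenses is unaffected by this step; only the breakdown by variant changes.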
4.5 How do licenses relate to approaches to sharing models?#
Note that our results reflect that a model might be shared by a combination of approaches, for example Zenodo plus GitHub. The license may be attached to one (e.g. Zenodo) but not visible in another (e.g. GitHub).
format_license_table(field_by_sharing_tools(clean))
license | Archive | Repository | Journal | Personal/org | Platform |
---|---|---|---|---|---|
None | 0 | 13 | 4 | 6 | 6 |
GPL-3 | 0 | 3 | 0 | 0 | 2 |
MIT | 1 | 3 | 0 | 0 | 1 |
Apache | 0 | 1 | 0 | 0 | 0 |
BSD-3 | 1 | 1 | 0 | 0 | 0 |
CC BY-NC 4.0 | 0 | 0 | 0 | 0 | 1 |
CC BY-NC-ND 4.0 | 0 | 0 | 1 | 0 | 0 |
CC BY-NC-SA 4.0 | 0 | 0 | 0 | 0 | 1 |
CC-BY 4.0 | 3 | 0 | 3 | 0 | 0 |
CC-BY-NC 4.0 | 0 | 0 | 1 | 0 | 0 |
Unknown | 0 | 0 | 1 | 0 | 0 |
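Because a model can be shared through more than one approach, the row totals in this table can exceed the number of models with that license. The None row, for example, sums to 29 entries for 26 unlicensed models. A minimal sketch of how the number of approaches per model could be counted is given below; the toy rows are invented for illustration and assume, as in the dataset, that an unused approach is stored as a missing value.

```python
import numpy as np
import pandas as pd

SHARING_COLUMNS = ['model_archive', 'model_repo', 'model_journal_supp',
                   'model_personal_org', 'model_platform']

# Toy rows, invented for illustration: one model on Zenodo and GitHub,
# one on GitHub only, one as journal supplementary material only.
toy = pd.DataFrame([
    {'license': 'MIT', 'model_archive': 'Zenodo', 'model_repo': 'GitHub'},
    {'license': 'None', 'model_repo': 'GitHub'},
    {'license': 'CC-BY 4.0', 'model_journal_supp': 'Model file'},
]).reindex(columns=['license'] + SHARING_COLUMNS)

# Number of sharing approaches per model (non-null cells in each row).
n_approaches = toy[SHARING_COLUMNS].notna().sum(axis=1)
toy['n_approaches'] = n_approaches
print(toy[['license', 'n_approaches']])
```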
5. Summary of results#
summary_txt = doi_result + ' ' + orcid_result + ' ' + license_txt + ' ' + pop_license
print(summary_txt)
A total of 7 (14.9\%) models were provided with a DOI. A total of 6 (12.8\%) models were provided were linked to a researcher via an ORCID. Of the models shared a total of 21 (44.7\%)had an open license attached. The most popular type of license were the creative commons variants with a total of 10 out of 47 models.