Open Research#

This notebook analyses how models were shared against our best practices for open research. In summary, best practice is defined as:

Shared models have their own DOI and hence guarantees on persistence;
The authors of shared models and artefacts can be uniquely identified by ORCIDs;
Models are shared with an open license that sets out how the model can be used or adapted, the authors' liability, and whether credit is needed.

Notebook aims#

The notebook analyses the following questions related to best practice:

What proportion of the shared model artefacts have a DOI and guarantees on persistence?
What proportion of artefacts are linked to the researcher via ORCID(s)?
What proportion of models have an open license?
When a model is licensed what was the most popular license?
How do licenses relate to approaches to sharing models?

Data used in analysis#

The dataset is a subset of the main review, limited to studies that shared a model. The type of model shared is coded as either Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) or CODE (e.g. Matlab, Python, SimPy, Java, R Simmer).

The data can be found here: https://raw.githubusercontent.com/TomMonks/des_sharing_lit_review/main/data/bp_audit.zip

The following fields are analysed in this notebook.

model_format - VIM or CODE
model_has_doi - do the model artefacts have their own minted DOI? (0/1)
orcid - do the researchers provide an ORCID with the model? (0/1)
license - name of the explicit license defining how the model can be used; None if no license was included (str)
model_archive - name of archive if used (str)
model_repo - name of model repo if used (str)
model_journal_supp - what is stored in the journal supplementary material (str)
model_personal_org - name of personal or organisational website if used (str)
model_platform - name of cloud platform used, e.g. Binder or AnyLogic Cloud (str)

1. Imports#

1.1. Standard#

import pandas as pd
import numpy as np

1.2 Preprocessing#

from preprocessing import load_clean_bpa

2. Constants#

FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/bp_audit.zip'

LICENSE_LABEL = 'license'
NONE_LABEL = 'None'

3. Analysis functions#

A number of simple functions to conduct the analysis and format output.

def balance_of_model_format(df):
    '''
    Return the unique model formats in df and the count of each.
    '''
    unique_elements, counts_elements = np.unique(df['model_format'],
                                                 return_counts=True)
    return unique_elements, counts_elements

def license_versus_no_license(df):
    '''
    Returns a tuple containing the (number of licensed models, not licensed)
    contained within the dataset.
    
    Parameters:
    -----------
    df: pd.DataFrame
        A dataset to analyse.  Could be full dataset or a partial subset
        
    Returns:
    --------
    tuple (int, int)
    
    '''
    n_not_licensed = len(df[df[LICENSE_LABEL] == NONE_LABEL])
    return len(df) - n_not_licensed, n_not_licensed
        

def field_by_sharing_tools(df, field=LICENSE_LABEL):
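    '''
    Return a DataFrame of counts with the values of `field` as rows (licenses
    by default) and the type of sharing as columns, i.e. archive, cloud repo,
    journal supplement, personal/org website, platform.
    
    Parameters:
    -----------
    df: pd.DataFrame
        Contains the data to analyse, e.g. the full dataset or a subset.
        
    field: str, optional (default=LICENSE_LABEL)
        Name of the column to group by.
        
    Returns:
    --------
    pd.DataFrame
        One row per value of `field` and one column per sharing approach,
        sorted by the repository count in descending order.
    '''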
    selected_columns = ['model_archive', 'model_repo', 'model_journal_supp',
                        'model_personal_org', 'model_platform']
    license_by_sharing = df.groupby(by=field)[selected_columns].count()
    return license_by_sharing.sort_values(by='model_repo', 
                                          ascending=False)

def format_license_table(df):
    '''
    Format the license table.
    '''
    column_headers = ['Archive', 'Repository', 
                      'Journal', 'Personal/org', 'Platform']
    df.columns = column_headers
    return df
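
The counting step in field_by_sharing_tools relies on each sharing column holding a name when that route was used and a missing value otherwise, so that count() tallies the models using each route. The following is a minimal sketch of that idea on a hypothetical toy dataset (the rows and the two-column subset are invented for illustration and are not taken from the review data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a sharing column holds a name (str) when that
# route was used and NaN when it was not. 'None' is the string label
# used for models without a license.
toy = pd.DataFrame({
    'license': ['MIT', 'None', 'None', 'MIT', 'GPL-3'],
    'model_archive': ['Zenodo', np.nan, np.nan, np.nan, np.nan],
    'model_repo': ['GitHub', 'GitHub', np.nan, 'GitLab', 'GitHub'],
})

# count() tallies the non-null cells per group, i.e. the number of
# models with each license that used each sharing route.
by_license = toy.groupby('license')[['model_archive', 'model_repo']].count()
by_license = by_license.sort_values(by='model_repo', ascending=False)
print(by_license)
```

Note that count() would give misleading totals if the sharing columns were coded 0/1, because zeros are non-null; in that case sum() would be the appropriate aggregation.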

3. Load and inspect dataset#

The cleaned dataset contains 27 fields, which are listed below.

clean = load_clean_bpa(FILE_NAME)
clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   model_format                  47 non-null     category
 1   key                           47 non-null     object
 2   item_type                     47 non-null     category
 3   pub_yr                        47 non-null     int64
 4   author                        47 non-null     object
 5   doi                           46 non-null     object
 6   reporting_guidelines_mention  47 non-null     category
 7   covid                         47 non-null     category
 8   sim_software                  47 non-null     object
 9   foss_sim                      47 non-null     category
 10  model_archive                 5 non-null      object
 11  model_repo                    21 non-null     object
 12  model_journal_supp            10 non-null     object
 13  model_personal_org            6 non-null      object
 14  model_platform                11 non-null     object
 15  github_url                    21 non-null     object
 16  model_has_doi                 47 non-null     category
 17  orcid                         46 non-null     category
 18  license                       47 non-null     object
 19  readme                        47 non-null     category
 20  link_to_paper                 37 non-null     category
 21  steps_run                     47 non-null     category
 22  formal_dep_mgt                47 non-null     category
 23  informal_dep_mgt              47 non-null     category
 24  evidence_testing              25 non-null     category
 25  downloadable                  47 non-null     category
 26  interactive_online            47 non-null     category
dtypes: category(15), int64(1), object(11)
memory usage: 7.1+ KB

4. Results#

4.2 What proportion of artefacts are linked to the researcher via ORCID(s)?#

unique_elements, counts_elements = np.unique(clean['orcid'],
                                             return_counts=True)

# index 1 holds the count of models coded 1 (ORCID provided)
has_orcid = counts_elements[1]
has_orcid_percent = (has_orcid / len(clean)) * 100
# '\\%' prints a literal '\%'
orcid_result = f'A total of {has_orcid} ({has_orcid_percent:.1f}\\%) models ' \
    + 'were linked to a researcher via an ORCID.'
print(orcid_result)

A total of 6 (12.8\%) models were linked to a researcher via an ORCID.

Of this small number, what was the format of the shared models?

orcids = clean[clean['orcid'] == 1]
model_format, counts = balance_of_model_format(orcids)
print(model_format, counts)

['CODE' 'VIM'] [3 3]

format_license_table(field_by_sharing_tools(orcids, field='orcid'))

	Archive	Repository	Journal	Personal/org	Platform
orcid
0.0	0	0	0	0	0
1.0	2	0	4	0	0
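
The ORCID count above reads position 1 of the np.unique counts, which assumes that both 0 and 1 occur in the column. A minimal sketch of a more defensive alternative follows; count_and_percent is a hypothetical helper (not part of the notebook or its preprocessing module) and the toy series is invented for illustration. Like the calculation above, it divides by the total number of rows, so a missing value counts as "not provided".

```python
import numpy as np
import pandas as pd


def count_and_percent(flags, positive=1):
    """
    Hypothetical helper: return (n_positive, percent of all rows).
    Missing values are never equal to `positive`, so they are treated
    as not positive while still contributing to the denominator.
    """
    flags = pd.Series(flags)
    n_positive = int((flags == positive).sum())
    percent = 100 * n_positive / len(flags)
    return n_positive, percent


# Toy 0/1 flags with one missing value
toy_orcid = pd.Series([1, 0, 0, np.nan, 1, 0, 0, 0])
n, pct = count_and_percent(toy_orcid)
print(f'{n} ({pct:.1f}%)')
```

The same helper could be applied to any field coded 0/1, for example model_has_doi, and it returns a count of zero rather than raising an IndexError when no model is coded 1.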

4.3 What proportion of models have an open license?#

We extracted the type of license included with each shared model. When no license was included we recorded this as None. For one model shared as supplementary material with a journal we were unable to determine what license had been applied. We labelled this as Unknown. When a model was published as journal supplementary material we assigned the same license as applied to the paper if it was not explicitly stated. For example, if a paper was published under a CC-BY 4.0 license and there was no explicit license attached to supplementary material we assumed the same license for the model.
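
The coding rule described above can be written down as a small function. This is a sketch of one reading of the rule, not the procedure actually used in the review (the coding was done manually); the function code_license and its parameters are hypothetical.

```python
def code_license(explicit_license=None, paper_license=None,
                 in_journal_supp=False, undetermined=False):
    """
    Hypothetical helper mirroring the license coding rule.

    explicit_license: license stated with the model, if any
    paper_license: license of the paper, if any
    in_journal_supp: True if the model is journal supplementary material
    undetermined: True if the applicable license could not be determined
    """
    if undetermined:
        return 'Unknown'
    if explicit_license:
        # an explicit license on the model always takes precedence
        return explicit_license
    if in_journal_supp and paper_license:
        # supplementary material inherits the license of the paper
        return paper_license
    return 'None'


print(code_license(paper_license='CC-BY 4.0', in_journal_supp=True))
print(code_license())
```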

licensed, not_licensed = license_versus_no_license(clean)
per_licensed = (licensed / len(clean)) * 100
per_not_licensed = (not_licensed / len(clean)) * 100

license_txt = f'Of the models shared a total of {licensed} ({per_licensed:.1f}\\%) ' \
    + 'had an open license attached.'

print(license_txt)

Of the models shared a total of 21 (44.7\%) had an open license attached.

4.4 When a model is licensed what was the most popular license?#

licenses, n_license = np.unique(clean[LICENSE_LABEL], 
                               return_counts=True)

license_results = pd.concat([pd.Series(licenses), pd.Series(n_license)], axis=1)
license_results.columns = ['License', 'n']     
license_results = license_results.set_index('License')
# drop none from the results
license_results = license_results.drop(NONE_LABEL)
license_results.sort_values(by='n', ascending=False)

	n
License
CC-BY 4.0	6
GPL-3	5
MIT	3
Apache	1
BSD-3	1
CC BY-NC 4.0	1
CC BY-NC-ND 4.0	1
CC BY-NC-SA 4.0	1
CC-BY-NC 4.0	1
Unknown	1

Creative Commons (CC) type licenses are the most popular overall.

cc_licenses = [x for x in license_results.index if x.startswith('CC')]
# select the count column by label before summing
n_cc_licenses = license_results.loc[cc_licenses, 'n'].sum()
print(n_cc_licenses)

pop_license = 'The most popular licenses were the Creative Commons variants' \
              + f' with a total of {n_cc_licenses} out of {clean.shape[0]} models.'
print(pop_license)

10
The most popular licenses were the Creative Commons variants with a total of 10 out of 47 models.
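
The license table above lists CC BY-NC 4.0 and CC-BY-NC 4.0 as separate rows, which appear to be two spellings of the same license. A minimal sketch of how the labels could be harmonised before counting is given below. The helper normalise_cc_label and the target spelling "CC BY-NC 4.0" are assumptions made for illustration; the counts are copied from the table above. Merging the spellings changes the per-license rows but not the Creative Commons total.

```python
import re
from collections import Counter

# 'CC', an optional separator, the license code, an optional version
CC_PATTERN = re.compile(r'CC[\s-]*([A-Za-z-]+)\s*(\d+(?:\.\d+)?)?')


def normalise_cc_label(label):
    """Hypothetical helper: map Creative Commons spellings to 'CC <CODE> <version>'."""
    match = CC_PATTERN.fullmatch(label.strip())
    if match is None:
        # not a recognisable Creative Commons label: leave unchanged
        return label
    code = match.group(1).upper().strip('-')
    version = match.group(2)
    return f'CC {code} {version}' if version else f'CC {code}'


# Counts copied from the license table above
license_counts = {
    'CC-BY 4.0': 6, 'GPL-3': 5, 'MIT': 3, 'Apache': 1, 'BSD-3': 1,
    'CC BY-NC 4.0': 1, 'CC BY-NC-ND 4.0': 1, 'CC BY-NC-SA 4.0': 1,
    'CC-BY-NC 4.0': 1, 'Unknown': 1,
}

merged = Counter()
for label, n in license_counts.items():
    merged[normalise_cc_label(label)] += n

n_cc = sum(n for label, n in merged.items() if label.startswith('CC'))
print(merged.most_common())
print(n_cc)
```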

4.5 How do licenses relate to approaches to sharing models?#

Note that a model might be shared by a combination of approaches, for example Zenodo plus GitHub, so a single model can be counted in more than one column. The license may be attached in one place (e.g. Zenodo) but not be visible in another (e.g. GitHub).

format_license_table(field_by_sharing_tools(clean))

	Archive	Repository	Journal	Personal/org	Platform
license
None	0	13	4	6	6
GPL-3	0	3	0	0	2
MIT	1	3	0	0	1
Apache	0	1	0	0	0
BSD-3	1	1	0	0	0
CC BY-NC 4.0	0	0	0	0	1
CC BY-NC-ND 4.0	0	0	1	0	0
CC BY-NC-SA 4.0	0	0	0	0	1
CC-BY 4.0	3	0	3	0	0
CC-BY-NC 4.0	0	0	1	0	0
Unknown	0	0	1	0	0
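
Because a model can appear in more than one column of the table above, the row totals can exceed the number of models. A minimal sketch of how the number of sharing routes per model could be checked is shown below, assuming (as in the data description) that each sharing column holds a name when used and is missing otherwise. The toy rows are hypothetical and are not taken from the review data.

```python
import numpy as np
import pandas as pd

SHARING_COLUMNS = ['model_archive', 'model_repo', 'model_journal_supp',
                   'model_personal_org', 'model_platform']

# Hypothetical toy data: one model shared via two routes, two via one
toy = pd.DataFrame({
    'model_archive': ['Zenodo', np.nan, np.nan],
    'model_repo': ['GitHub', 'GitHub', np.nan],
    'model_journal_supp': [np.nan, np.nan, np.nan],
    'model_personal_org': [np.nan, np.nan, np.nan],
    'model_platform': [np.nan, np.nan, 'AnyLogic Cloud'],
})

# Number of non-missing sharing columns per model (row)
n_routes = toy[SHARING_COLUMNS].notna().sum(axis=1)
n_multi = int((n_routes > 1).sum())
print(n_routes.tolist(), n_multi)
```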

5. Summary of results#

# doi_result: result string for the DOI question (first notebook aim)
summary_txt = doi_result + ' ' + orcid_result + ' ' + license_txt + ' ' + pop_license
print(summary_txt)

A total of 7 (14.9\%) models were provided with a DOI. A total of 6 (12.8\%) models were linked to a researcher via an ORCID. Of the models shared a total of 21 (44.7\%) had an open license attached. The most popular licenses were the Creative Commons variants with a total of 10 out of 47 models.

Model and code sharing practices in healthcare discrete-event simulation - a systematic review
