Open Research#

This notebook analyses how models are shared against our best practices for open research. In summary, best practice is defined as:

  1. Shared models have their own DOI and hence guarantees on persistence;

  2. The authors of shared models and artefacts can be uniquely identified by ORCIDs;

  3. Models are shared with an open license that sets out how the model can be used/adapted, author liability and if credit is needed.

Notebook aims#

The notebook analyses the following questions related to best practice:

  1. What proportion of the shared model artefacts have a DOI and guarantees on persistence?

  2. What proportion of artefacts are linked to the researcher via ORCID(s)?

  3. What proportion of models have an open license?

  4. When a model is licensed what was the most popular license?

  5. How do licenses relate to approaches to sharing models?

Data used in analysis#

The dataset is a subset of the main review, limited to the models that were shared. The type of model shared is coded as Visual Interactive Modelling (VIM) based (e.g. AnyLogic, Simul8, Arena) versus CODE (e.g. Matlab, Python, SimPy, Java, R Simmer).

The following fields are analysed in this notebook.

  • model_format - VIM or CODE

  • model_has_doi - do the model artefacts have their own minted DOI? (0/1)

  • orcid - do the researchers provide an ORCID with the model? (0/1)

  • license - does the model have an explicit license defining how it can be used? (str)

  • model_archive - name of archive if used (0/1)

  • model_repo - name of model repo if used (0/1)

  • model_journal_supp - what is stored in the journal supplementary material (0/1)

  • model_personal_org - name of personal or organisational website if used (0/1)

  • model_platform - name of cloud platform used (e.g. Binder or AnyLogic Cloud) (0/1)
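To make the layout concrete, the sketch below builds a small, entirely hypothetical DataFrame with the same field names and counts how many models use each sharing route. The rows are invented for illustration and are not taken from the review. It assumes the sharing columns hold a name when the route is used and a missing value otherwise, which appears consistent with the non-null counts reported by `clean.info()` further down.

```python
import pandas as pd

# Hypothetical rows for illustration only; not data from the review.
toy = pd.DataFrame({
    'model_format': ['CODE', 'VIM', 'CODE'],
    'model_has_doi': [1, 0, 0],
    'orcid': [1, 0, 0],
    'license': ['MIT', 'None', 'GPL-3'],
    'model_archive': ['Zenodo', None, None],
    'model_repo': ['GitHub', None, 'GitHub'],
    'model_journal_supp': [None, 'model file', None],
    'model_personal_org': [None, None, None],
    'model_platform': [None, None, 'Binder'],
})

sharing_columns = ['model_archive', 'model_repo', 'model_journal_supp',
                   'model_personal_org', 'model_platform']

# count() ignores missing values, so this gives the number of models per route.
models_per_route = toy[sharing_columns].count()
print(models_per_route)
```

Note that a missing license is recorded as the string `'None'` rather than as a missing value, so it is counted like any other label.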

1. Imports#

1.1. Standard#

import pandas as pd
import numpy as np

1.2 Preprocessing#

from preprocessing import load_clean_bpa

2. Constants#

FILE_NAME = 'https://raw.githubusercontent.com/TomMonks/' \
    + 'des_sharing_lit_review/main/data/bp_audit.zip'

LICENSE_LABEL = 'license'
NONE_LABEL = 'None'

3. Analysis functions#

A number of simple functions to conduct the analysis and format output.

def balance_of_model_format(df):
    '''
    Return the unique model formats in `df` and the count of each.
    '''
    unique_elements, counts_elements = np.unique(df['model_format'], 
                                                 return_counts=True)
    return unique_elements, counts_elements


def license_versus_no_license(df):
    '''
    Returns a tuple containing the (number of licensed models, not licensed)
    contained within the dataset.
    
    Parameters:
    -----------
    df: pd.DataFrame
        A dataset to analyse.  Could be full dataset or a partial subset
        
    Returns:
    --------
    tuple (int, int)
    
    '''
    # Models without a license are recorded with the string label 'None'.
    n_not_licensed = len(df[df[LICENSE_LABEL] == NONE_LABEL])
    return len(df) - n_not_licensed, n_not_licensed
        
def field_by_sharing_tools(df, field=LICENSE_LABEL):
    '''
    Return a DataFrame containing the values of `field` (rows) by type of
    sharing, i.e. archive, cloud repo, journal supp, personal/org website,
    platform.
    
    Parameters:
    -----------
    df: pd.DataFrame
        Contains the data to analyse, e.g. the full dataset or a subset.

    field: str, optional (default=LICENSE_LABEL)
        Name of the column to group by.
        
    Returns:
    -------
    pd.DataFrame
        One row per value of `field`, one column per sharing approach.
    '''
    selected_columns = ['model_archive', 'model_repo', 'model_journal_supp',
                        'model_personal_org', 'model_platform']
    # count() tallies the non-null entries in each sharing column.
    license_by_sharing = df.groupby(by=field)[selected_columns].count()
    return license_by_sharing.sort_values(by='model_repo', 
                                          ascending=False)


def format_license_table(df):
    '''
    Format the license table by renaming its columns (modifies `df` in place).
    '''
    column_headers = ['Archive', 'Repository', 
                      'Journal', 'Personal/org', 'Platform']
    df.columns = column_headers
    return df
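
As a usage illustration, the sketch below applies the same logic as the functions above to a small hypothetical dataset. The toy rows are invented and only two of the sharing columns are included, so it illustrates the technique and does not reproduce the notebook's results.

```python
import pandas as pd

# Hypothetical rows for illustration only; not data from the review.
toy = pd.DataFrame({
    'license': ['MIT', 'None', 'GPL-3', 'None'],
    'model_archive': [None, None, 'Zenodo', None],
    'model_repo': ['GitHub', 'GitHub', None, None],
})

# Same idea as license_versus_no_license: 'None' is a string label.
n_not_licensed = int((toy['license'] == 'None').sum())
n_licensed = len(toy) - n_not_licensed

# Same idea as field_by_sharing_tools: count non-null entries per license.
by_route = toy.groupby('license')[['model_archive', 'model_repo']].count()

print(n_licensed, n_not_licensed)
print(by_route)
```

Under this logic any label other than `'None'`, including `'Unknown'`, is counted as licensed.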

3. Load and inspect dataset#

The clean dataset contains 47 models and 27 fields. The fields are listed below.

clean = load_clean_bpa(FILE_NAME)
clean.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47 entries, 0 to 46
Data columns (total 27 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   model_format                  47 non-null     category
 1   key                           47 non-null     object  
 2   item_type                     47 non-null     category
 3   pub_yr                        47 non-null     int64   
 4   author                        47 non-null     object  
 5   doi                           46 non-null     object  
 6   reporting_guidelines_mention  47 non-null     category
 7   covid                         47 non-null     category
 8   sim_software                  47 non-null     object  
 9   foss_sim                      47 non-null     category
 10  model_archive                 5 non-null      object  
 11  model_repo                    21 non-null     object  
 12  model_journal_supp            10 non-null     object  
 13  model_personal_org            6 non-null      object  
 14  model_platform                11 non-null     object  
 15  github_url                    21 non-null     object  
 16  model_has_doi                 47 non-null     category
 17  orcid                         46 non-null     category
 18  license                       47 non-null     object  
 19  readme                        47 non-null     category
 20  link_to_paper                 37 non-null     category
 21  steps_run                     47 non-null     category
 22  formal_dep_mgt                47 non-null     category
 23  informal_dep_mgt              47 non-null     category
 24  evidence_testing              25 non-null     category
 25  downloadable                  47 non-null     category
 26  interactive_online            47 non-null     category
dtypes: category(15), int64(1), object(11)
memory usage: 7.1+ KB

4. Results#

4.1 What proportion of the shared model artefacts have a DOI and guarantees on persistence?#

unique_elements, counts_elements = np.unique(clean['model_has_doi'], 
                                                   return_counts=True)

has_doi = counts_elements[1]
has_doi_percent = (has_doi / len(clean)) * 100
# '\\%' keeps the literal backslash in the output without an invalid escape.
doi_result = f'A total of {has_doi} ({has_doi_percent:.1f}\\%) models ' \
    + 'were provided with a DOI.'
print(doi_result)
A total of 7 (14.9\%) models were provided with a DOI.
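
The cell above takes the second element of the `np.unique` counts, which relies on the value 1 sorting second and on both values being present. A label-based count avoids that positional assumption. The sketch below is one possible alternative, not the notebook's own code; `count_and_percent` and the toy series are hypothetical.

```python
import pandas as pd


def count_and_percent(series, positive=1):
    """Return (count, percent) of entries equal to `positive`.

    The denominator is the full length of the series, so missing values
    count as "not positive", matching the use of len(clean) in the notebook.
    """
    n_positive = int((series == positive).sum())
    percent = 100 * n_positive / len(series)
    return n_positive, percent


# Hypothetical 0/1 flags stored as a categorical, as in the cleaned dataset.
toy_has_doi = pd.Series([1, 0, 0, 1, 0], dtype='category')
n, pct = count_and_percent(toy_has_doi)
print(f'{n} ({pct:.1f}%)')
```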

4.2 What proportion of artefacts are linked to the researcher via ORCID(s)?#

unique_elements, counts_elements = np.unique(clean['orcid'], 
                                                   return_counts=True)

has_orcid = counts_elements[1]
has_orcid_percent = (has_orcid / len(clean)) * 100
# '\\%' keeps the literal backslash in the output without an invalid escape.
orcid_result = f'A total of {has_orcid} ({has_orcid_percent:.1f}\\%) models ' \
    + 'were provided were linked to a researcher via an ORCID.'
print(orcid_result)
A total of 6 (12.8\%) models were provided were linked to a researcher via an ORCID.

Of this small number, what was the format of the shared models?

orcids = clean[clean['orcid'] == 1]
model_format, counts = balance_of_model_format(orcids)
print(model_format, counts)
['CODE' 'VIM'] [3 3]
format_license_table(field_by_sharing_tools(orcids, field='orcid'))
       Archive  Repository  Journal  Personal/org  Platform
orcid
0.0          0           0        0             0         0
1.0          2           0        4             0         0

4.3 What proportion of models have an open license?#

We extracted the type of license included with each shared model. When no license was included we recorded this as None. For one model shared as journal supplementary material we were unable to determine which license had been applied; we labelled this as Unknown. When a model was published as journal supplementary material without an explicit license, we assigned the license that applied to the paper. For example, if a paper was published under a CC-BY 4.0 license and there was no explicit license attached to the supplementary material, we assumed the same license for the model.

licensed, not_licensed = license_versus_no_license(clean)
per_licensed = (licensed / len(clean)) * 100
per_not_licensed = (not_licensed / len(clean)) * 100
# '\\%' keeps the literal backslash in the output without an invalid escape.
license_txt = f'Of the models shared a total of {licensed} ({per_licensed:.1f}\\%)' \
    + 'had an open license attached.' 

print(license_txt)
Of the models shared a total of 21 (44.7\%)had an open license attached.
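
Question 4 from the notebook aims (the most popular license) has no cell in this section, and the `pop_license` string used in the summary is not defined in the cells shown. As a minimal sketch only, and not the authors' code, license labels could be grouped into families and counted as follows. The `license_family` helper, its grouping rule and the toy labels are all hypothetical.

```python
import pandas as pd


def license_family(label):
    """Map a license label to a coarse family (hypothetical grouping rule)."""
    if label.upper().startswith('CC'):
        return 'Creative Commons'
    return label


# Hypothetical labels for illustration only; not data from the review.
toy_licenses = pd.Series(['MIT', 'CC-BY 4.0', 'CC BY-NC 4.0',
                          'None', 'GPL-3', 'CC-BY 4.0'])

# Exclude models with no license, then count each family.
licensed_only = toy_licenses[toy_licenses != 'None']
family_counts = licensed_only.map(license_family).value_counts()

most_popular = family_counts.idxmax()
n_most_popular = int(family_counts.max())
print(f'{most_popular}: {n_most_popular} of {len(toy_licenses)} models')
```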

4.5 How do licenses relate to approaches to sharing models?#

Note that a model might be shared by a combination of approaches, for example Zenodo and GitHub. The license may be attached to one (e.g. Zenodo) but not visible in another (e.g. GitHub).

format_license_table(field_by_sharing_tools(clean))
                 Archive  Repository  Journal  Personal/org  Platform
license
None                   0          13        4             6         6
GPL-3                  0           3        0             0         2
MIT                    1           3        0             0         1
Apache                 0           1        0             0         0
BSD-3                  1           1        0             0         0
CC BY-NC 4.0           0           0        0             0         1
CC BY-NC-ND 4.0        0           0        1             0         0
CC BY-NC-SA 4.0        0           0        0             0         1
CC-BY 4.0              3           0        3             0         0
CC-BY-NC 4.0           0           0        1             0         0
Unknown                0           0        1             0         0
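
Because a model can be shared by more than one route, the row totals in a table like the one above can exceed the number of models with that license. A minimal sketch of how the number of routes per model could be checked is given below. The toy rows are hypothetical and only two sharing columns are included.

```python
import pandas as pd

sharing_columns = ['model_archive', 'model_repo']

# Hypothetical rows for illustration only; not data from the review.
toy = pd.DataFrame({
    'license': ['MIT', 'MIT', 'None'],
    'model_archive': ['Zenodo', None, None],
    'model_repo': ['GitHub', 'GitHub', 'GitHub'],
})

# Number of sharing routes used by each model (non-null entries per row).
routes_per_model = toy[sharing_columns].notna().sum(axis=1)

# Models shared by more than one route appear in more than one table column.
n_multi_route = int((routes_per_model > 1).sum())
print(routes_per_model.tolist(), n_multi_route)
```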

5. Summary of results#

summary_txt = doi_result + ' ' + orcid_result + ' ' + license_txt + ' ' + pop_license
print(summary_txt)
A total of 7 (14.9\%) models were provided with a DOI. A total of 6 (12.8\%) models were provided were linked to a researcher via an ORCID. Of the models shared a total of 21 (44.7\%)had an open license attached. The most popular type of license were the creative commons variants with a total of 10 out of 47 models.