Monday 24 February 2020

The need for a Sample to Data standard

Experimental design is a cornerstone of modern science, especially for data scientists. But how can an experimental design be captured for better reuse, reproducibility, and understanding of the original results? How can we write the complexity of our experimental design into a file?

This post is about that.




Why do we need a standard file format to capture the experimental design? 

  • Standards within a scientific domain have the potential to provide uniformity and consistency in the data generated by different researchers, organizations, and technologies.  
  • Standards facilitate more effective reuse, integration, and mining of those data by other researchers and third-party software applications, and enable easier collaboration between different groups. 
  • Software analysis tools, which of necessity require some sort of regularized data input, are very often designed to process data that conform to public data formatting standards when such standards are available for the domain of interest.

To understand how the conclusions of a study were obtained, I see different levels of information that MUST be available in standard, easy-to-use, and readable formats:


Dataset Metadata levels (Dataset General Description, Sample to Data file, and Data files)

The dataset general description, or general metadata, describes the minimum information about the experiment. Title, description, date of publication, and type of experiment (e.g. proteomics, transcriptomics) are some of the properties that are part of the general description of the dataset. The dataset description is mainly what most databases and repositories (e.g. figshare) ask for when you deposit data. One example is the ProteomeXchange submission.px file: when submitting data to a proteomics database, general information about the experiment is requested. 

The data files are all the files associated with the dataset including the raw data output from instruments, the database files (e.g. fasta files), the processed and data analysis files (e.g. peptide identification files, bam sequence alignment files), workflow files, configuration files, and plot files. 

The sample to data file is the standard that defines the major characteristics of the samples and how they relate to each file, especially the instrument raw files. 

These three metadata components are combined in multiple standards such as ISA-TAB or MAGE-TAB. 

Existing file formats in omics domains. The most popular and active ones are MAGE-TAB and ISA-TAB. 

While general metadata and data files are well understood and provided for most datasets in omics archives and repositories, the sample to data file is not always present in public data, making it difficult to associate the samples with the instrument data files and ultimately to reproduce the experimental design. 

Two Approaches to annotate sample metadata 

I have seen two major approaches to annotate sample metadata, including clinical information such as disease, phenotype, and age of the patients: 
  • Sample metadata file formats (e.g. ISA-TAB, MAGE-TAB): In this approach, a tab-delimited file is used to annotate the metadata around the sample and the link to the data file (most of the time an instrument output file, e.g. a RAW file). 
  • Encode the sample metadata into the data files (e.g. VCF): In this approach, the most relevant sample information is propagated to the actual data files.  
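
To make the first approach concrete, here is a minimal Python sketch of how a tab-delimited sample-to-data table could be read to recover which data files belong to which sample. The column names (sample, disease, age, data_file), the file names, and the helper files_by_sample are hypothetical and chosen only for illustration; they do not come from ISA-TAB or MAGE-TAB.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample-to-data table; column names and values are illustrative only.
TABLE = (
    "sample\tdisease\tage\tdata_file\n"
    "patient_1\tbreast cancer\t54\trun_01.raw\n"
    "patient_1\tbreast cancer\t54\trun_02.raw\n"
    "patient_2\tnormal\t61\trun_03.raw\n"
)


def files_by_sample(text):
    """Group the data files listed in a tab-delimited table by sample name."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        grouped[row["sample"]].append(row["data_file"])
    return dict(grouped)


mapping = files_by_sample(TABLE)
print(mapping)
```

Each row links one sample to one data file, so a sample measured in several runs simply appears on several rows. That one-row-per-file layout is what lets a reader reconstruct the experimental design from the table alone.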

Proteomics: where we are and where we should be

In proteomics, the sample to data file information is missing for all ProteomeXchange (PX) datasets. While the sample metadata was originally modeled in the mzML files, none of the instrument providers or open-source tools include the sample information in the mzML files. 
A second attempt tried to add more sample metadata to mzTab files (metadata section), but again most of the tools producing these files do not annotate this information properly. 

But what about trying the first approach (sample to data files)? Recently, a group of researchers created a GitHub repository (https://github.com/bigbio/proteomics-metadata-standard) to model what a sample to data format should look like for proteomics. Based on SDRF, a well-known file format in transcriptomics, the current proposal captures the sample metadata and also important information about the RAW files, such as the post-translational modifications under study (variable and fixed), the instrument used, etc. 

The three major features of the proposal are: 

  • The tab-delimited format is compatible with the existing ProteomeXchange submission systems. 
  • It is compatible with existing formats such as MAGE-TAB and ISA-TAB, which will eventually enable multi-omics studies. 
  • It is ontology-based for most of the properties but also enables free-text annotations. 
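
As a rough illustration of how one tab-delimited header can carry both sample metadata and information about the RAW files, the Python sketch below splits an SDRF-style header into the two groups by column prefix. The header shown is an assumption made for this example, not a copy of the proposal; the authoritative column names are the ones defined in the specification in the repository. The helpers split_columns and property_name are hypothetical.

```python
# Illustrative header only; the authoritative column names are defined in the
# specification in the repository, not here.
HEADER = [
    "source name",
    "characteristics[organism]",
    "characteristics[disease]",
    "assay name",
    "comment[instrument]",
    "comment[modification parameters]",
    "comment[data file]",
]


def split_columns(header):
    """Separate sample-level columns from data-file-level columns by prefix."""
    sample_cols = [c for c in header if c.startswith("characteristics[")]
    file_cols = [c for c in header if c.startswith("comment[")]
    return sample_cols, file_cols


def property_name(column):
    """Return the text between the square brackets, e.g. 'organism'."""
    return column[column.index("[") + 1 : column.rindex("]")]


sample_cols, file_cols = split_columns(HEADER)
print([property_name(c) for c in sample_cols])
print([property_name(c) for c in file_cols])
```

Keeping the property name inside the brackets means new sample properties can be added as extra columns without changing how a reader parses the file.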

SDRF (sample to data file format) in a Nutshell. 

What is next? 

  1. We are calling on the proteomics community, biologists and bioinformaticians, to contribute ideas to improve the file format. (If you read this blog post, tweet about it, create an issue in the repository, or present the idea at your next lab meeting.)
  2. We are calling on developers of commercial and open-source tools to take a look at the file format, give us feedback about it and, most importantly:
    1. Try to convert your current experimental design representation and sample metadata into this file format. 
    2. Try to convert this file format into your current experimental design representation. 
  3. We are calling the community to annotate public datasets using this representation following these steps: 
    1. Read the SDRF specification
    2. Depending on the type of dataset, choose the appropriate sample template
    3. Annotate the corresponding ProteomeXchange (PX) dataset following the guidelines
    4. Validate your SDRF
      1. To validate your SDRF, you can install the sdrfchek tool in Python (pip install sdrfchek)
      2. Validate the SDRF (python sdrfchecker.py validate-sdrf --sdrf_file sdrf.txt)
    5. Fork the current repository and add a folder named with the ProteomeXchange accession that contains the annotated sdrf.txt

We have collected a list of the 100 most downloaded datasets in PRIDE, which is a good list to start with: https://docs.google.com/spreadsheets/d/1-kb_rP0MEt2eTlUPnZe4llTwJi4xdKYnJtBQ5cXKnhw/edit
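
Before running the validator mentioned in the steps above, a quick structural check of an annotated file can catch the most obvious problems. The Python sketch below is not the validator itself and does not reproduce its rules. It only assumes that a small set of columns (the hypothetical REQUIRED list) must be present and non-empty; the real requirements come from the specification and the chosen sample template.

```python
import csv
import io

# Assumed minimal set of columns; the real requirements come from the
# specification and the chosen sample template.
REQUIRED = ["source name", "assay name", "comment[data file]"]


def check_sdrf(text, required=REQUIRED):
    """Return a list of human-readable problems found in a tab-delimited table."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    problems = [f"missing column: {c}" for c in required if c not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for col in required:
            # Short rows yield None for missing cells, so normalise to "".
            if not (row[col] or "").strip():
                problems.append(f"line {line_no}: empty value in '{col}'")
    return problems


good = "source name\tassay name\tcomment[data file]\nsample 1\trun 1\trun_01.raw\n"
bad = "source name\tassay name\tcomment[data file]\nsample 1\trun 1\t\n"
print(check_sdrf(good))
print(check_sdrf(bad))
```

An empty list means the table passed this basic check; it says nothing about ontology terms or template-specific rules, which remain the job of the dedicated validator.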



