Science requires trust – reproducibility in bioinformatics

Reproducibility is an important topic in bioinformatics and computational biology. People expect that, given your raw data and analysis procedures, they can obtain the same results as you did. To be precise, in this blog reproducibility is defined as “a third party using the same data, models, and code as in your publication obtains the same results as you”.

In recent years, the standards for sharing data and code have risen. When I started my research in 2017, sharing data and code was not very common. But when I first submitted a manuscript in 2021, many journals required that data and code be shared. I naively didn’t follow this, and unsurprisingly received criticism from the reviewers and a rejection!

Science requires trust. Sharing data and code is one of the key ways to show that your results are accurate and trustworthy. It lets the scientific community examine your data and code and understand how you produced your results.

Besides, your analysis may also be useful to others working with their own data: they may re-use your code or integrate it into their own analyses. Sharing data and code can therefore help accelerate scientific research, and you will also get credit through citations.

Be clear, be careful, and remember to document

Up to this point, reproducibility sounds easy, right? You have your data and code; just share them. In practice, however, it is not that easy.

  1. Your data may be hard to share: raw files can come in awkward formats, and sensitive data you have collected cannot be shared at all (please be extra careful with this).
  2. Your code may be badly organized and hard to understand. You know how to run your own code, but how can others? A good code structure not only helps you run your analysis, but also helps others understand what you did. Try to avoid unnecessary manual steps.
  3. Software, tools, packages, and databases all come in versions. For software, tools, and packages, bugs get fixed and new features get added in newer versions; for databases, new items get added in subsequent releases. The same resources can therefore produce different results across versions, so version information should be reported clearly (a version-recording sketch follows this list).
  4. The parameters used in the analysis must be clear. Computational tools and packages often have many parameters, and small changes can lead to different results. For example, a different random seed can change your output, so use a fixed “seed” for random number generation (see the seeding sketch after this list). Report the key parameters, and state explicitly when you kept the defaults.
  5. Document how you did your analyses in a clear way. For example, comment your scripts to explain each step, document the expected input and output, and order your scripts as step 1, step 2, and so on (the last sketch after this list shows one way).
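
For versions, the easiest safeguard is to record them automatically. Below is a minimal Python sketch in which numpy and pandas are stand-ins for whatever your analysis actually imports; tools such as pip freeze, conda env export, or R’s sessionInfo() capture the same information more thoroughly.

```python
# record_versions.py -- write the interpreter and package versions
# behind an analysis run to a file you can publish with the code.
# numpy and pandas are placeholders for your real dependencies.
import platform

import numpy
import pandas

with open("versions.txt", "w") as fh:
    fh.write(f"python {platform.python_version()}\n")
    fh.write(f"numpy {numpy.__version__}\n")
    fh.write(f"pandas {pandas.__version__}\n")
```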
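
For seeds and parameters, one pattern is to expose every knob on the command line with a documented default, then log the values actually used. Here is a toy sketch; the script name and options are hypothetical.

```python
# run_subsample.py -- a toy analysis step with a fixed seed and
# explicit, logged parameters (names here are hypothetical).
import argparse

import numpy as np

parser = argparse.ArgumentParser(description="Toy subsampling step")
parser.add_argument("--seed", type=int, default=42,
                    help="random seed; report it in your methods")
parser.add_argument("--n", type=int, default=5,
                    help="number of items to draw")
args = parser.parse_args()

# A seeded generator returns the same "random" draw on every run.
rng = np.random.default_rng(args.seed)

# Log the parameters actually used, defaults included.
print(f"seed={args.seed} n={args.n}")
print("sample:", rng.choice(100, size=args.n, replace=False))
```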
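
And for documentation, a header that states the input and output of each numbered step goes a long way. The file names and quality cutoff below are hypothetical.

```python
"""step2_filter_variants.py -- hypothetical step 2 of a pipeline.

Input : results/step1_calls.tsv    (tab-separated; columns: id, qual)
Output: results/step2_filtered.tsv (same columns, qual >= 30 only)
"""
import csv

with open("results/step1_calls.tsv") as fin, \
        open("results/step2_filtered.tsv", "w", newline="") as fout:
    reader = csv.DictReader(fin, delimiter="\t")
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                            delimiter="\t")
    writer.writeheader()
    for row in reader:
        # Keep only calls at or above the quality cutoff.
        if float(row["qual"]) >= 30:
            writer.writerow(row)
```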

Making your data and code reproducible is an endless effort: today’s medium standard might be tomorrow’s minimum, so we have to keep learning and practicing reproducibility in our research. One sign of this is how widely workflow systems such as Nextflow and Snakemake are now used.
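
To give a flavour of what these systems look like, here is a single Snakemake rule wrapping the hypothetical step-2 script from above. Snakefiles use a Python-based syntax; Snakemake tracks which inputs produce which outputs and re-runs a step only when its inputs change.

```python
# Snakefile -- one rule of a hypothetical pipeline.
rule filter_variants:
    input:
        "results/step1_calls.tsv"
    output:
        "results/step2_filtered.tsv"
    shell:
        "python step2_filter_variants.py"
```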

Further reading:

Ten Simple Rules for Reproducible Computational Research (https://doi.org/10.1371/journal.pcbi.1003285)

Reproducibility standards for machine learning in the life sciences

Ning Wang, bioinformatician