Reproducibility
Reproducibility is a cornerstone of science: it enables other scientists to critically assess the correctness of scientific claims and of the conclusions drawn from them. Conventionally, it was sufficient to describe the research approach in enough detail that other scientists with similar skills and means could follow the steps in the published work and obtain the same results within the margins of experimental error.
However, this approach should also encompass the underpinning research data (from their original form) and the ability to repeat an analysis of a particular dataset and obtain the same or similar results. As a starting point, you should aim to ensure you can repeat every step from the original state of the data through to the dataset(s) being analysed. This repeatability is invaluable to you as the original researcher(s), as you may need to check and repeat the workflow before finalising your research. The workflow can be tested as part of a pilot of your experimental approach and then used, with confidence, to repeat the approach efficiently.
Reproducibility goes a step further: it is the ability of other researchers to reproduce what you have done.
It is essential to plan for reproducibility from the outset, because the approaches and tasks that support it need to be in place before any data handling takes place. For example:
Set out a file-naming convention for all parts of the workflow: data files, models, code, analysis runs, results tables etc. This ensures everyone in the project can identify the correct resources and how they relate to each other.
Document your data and every step of your workflow, from raw data through to the analyses and the values in your results tables. This forms your audit trail, captures the provenance of each dataset (which will be useful to refer back to) and will help if you need to re-run anything.
Use code where possible. Instead of pointing and clicking, use programming (through tools such as R, SAS or MATLAB) to download, transform, sub-select, join and output your data, and save your command-line scripts.
Use open-source tools. Code transparency is key to reproducibility, so use open-source tools whenever possible.
Track versions. Use version control for all documents, data and code. Consider version control software such as Git, together with a hosting platform such as GitHub.
Archive your data and obtain a persistent identifier such as a DOI. Ideally archive all the originating data rather than just a subset. Archive when the data are complete or, where datasets are being continually updated, archive a version at key points, e.g. when submitting an article for publication.
Replicate your environment. Use software containers, e.g. Docker, to package code, data and a computing environment together, making it easy for users to recreate the developer's system.
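As an illustration of the first three points (naming, documenting and scripting), the following is a minimal Python sketch, not a prescribed method. The naming pattern, the stage names and the helper functions build_name, sha256_of and record_step are all hypothetical and chosen only for this example. The script composes file names that follow one possible convention and appends a checksum for each input and output file to a plain-text log, so that each step leaves an audit trail.

```python
import hashlib
import json
import re
from datetime import date
from pathlib import Path

# Hypothetical convention (an assumption for this sketch):
#   <project>_<stage>_<YYYYMMDD>_v<NN>.<ext>
NAME_PATTERN = re.compile(
    r"^[a-z0-9]+_(raw|clean|analysis|results)_\d{8}_v\d{2}\.[a-z0-9]+$"
)


def build_name(project: str, stage: str, day: date, version: int, ext: str) -> str:
    """Compose a file name and check it against the illustrative convention."""
    name = f"{project}_{stage}_{day:%Y%m%d}_v{version:02d}.{ext}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"name does not follow the convention: {name}")
    return name


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_step(log_path: Path, step: str, inputs: list[Path], outputs: list[Path]) -> dict:
    """Append one workflow step, with file checksums, to a JSON Lines log."""
    entry = {
        "step": step,
        "inputs": {p.name: sha256_of(p) for p in inputs},
        "outputs": {p.name: sha256_of(p) for p in outputs},
    }
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry, sort_keys=True) + "\n")
    return entry
```

Calling record_step after each transformation gives a line-by-line record of which files went in and which came out. Comparing checksums later shows whether a re-run reproduced the same outputs. The log itself can be kept under version control alongside the code.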