Data Documentation
Documenting your data
Comprehensive data documentation is part of best practice in creating, organising and managing research data. It is a key component of the reproducibility of your research, i.e., how you got from capturing the data through to your results. It will be invaluable if you need to re-run any part of your work, it supports confidence in your results, and it allows you to publish a dataset as a science output. To ensure accurate documentation, begin at the outset of a project and continue throughout the research process.
Documentation of data should explain their lineage and provenance: how the data were created or where they were acquired, their content and structure, and any manipulations or alterations that have taken place. It ensures that data can be understood during research projects, that researchers continue to understand the data in the longer term, and that re-users can interpret the data and use them appropriately. Good documentation is vital for successful data preservation and sharing, and it is the basis for the metadata describing each published dataset.
What metadata should be included for useful data documentation?
The metadata you should include for your dataset will be field-specific, but generally all data documentation should include:
- the context of data collection: project history, aims, objectives and hypotheses
- data collection methods: data collection protocol, sampling design, instruments, hardware and software used, data scale and resolution, temporal coverage and geographic coverage
- dataset structure: data files/tables, cases, and relationships between files/tables
- data sources used
- data validation, checking, proofing, cleaning and other quality assurance procedures carried out
- modifications made to data over time since their original creation and identification of different versions of datasets
- within-project access: read/edit permissions, intellectual property rights (IPR), especially where project partners are involved, and data confidentiality/sensitivity
- public data sharing arrangements (authors, IPR, embargo periods, etc.)
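As a rough illustration of the checklist above, the sketch below stores dataset-level documentation as a simple record and reports which items are still missing. The field names and placeholder values are assumptions made for this example, not a recognised metadata standard; in practice you would map them onto the schema used in your field or by your repository.

```python
import json

# Hypothetical field names, one per item in the checklist above.
# Replace them with the terms of your discipline's metadata standard.
REQUIRED_FIELDS = [
    "context",
    "collection_methods",
    "structure",
    "sources",
    "quality_assurance",
    "version_history",
    "project_access",
    "sharing",
]


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Placeholder content only; describe your own project here.
record = {
    "context": "Project history, aims, objectives and hypotheses",
    "collection_methods": "Protocol, sampling design, instruments, coverage",
    "structure": "Files/tables, cases and how they relate",
    "sources": "Data sources used",
    "quality_assurance": "Validation, checking and cleaning steps",
    "version_history": "Modifications and dataset versions",
    "project_access": "Read/edit permissions, IPR, confidentiality",
    "sharing": "",  # not yet agreed, so it is reported as missing
}

print(missing_fields(record))        # ['sharing']
print(json.dumps(record, indent=2))  # a form that can be saved next to the data
```

Keeping the record in a plain, machine-readable form such as JSON means it can be stored alongside the data files and checked again whenever the dataset changes.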
At the data level, datasets should also be documented with:
- names, labels and descriptions for variables, records and their values
- explanation of codes and classification schemes used
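Data-level documentation of this kind is often kept as a data dictionary. The following is a minimal sketch, assuming three made-up variables (`site_id`, `water_temp_c`, `substrate`) and a made-up code list; it shows one way to record labels, units and codes, and to flag values that the dictionary does not explain.

```python
# Hypothetical variables and codes, for illustration only.
DATA_DICTIONARY = {
    "site_id": {
        "label": "Sampling site identifier",
        "type": str,
        "codes": None,
    },
    "water_temp_c": {
        "label": "Water temperature",
        "units": "degrees Celsius",
        "type": float,
        "codes": None,
    },
    "substrate": {
        "label": "Dominant substrate class",
        "type": str,
        "codes": {"S": "sand", "G": "gravel", "M": "mud"},
    },
}


def check_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of problems found in one data record."""
    problems = []
    for name, value in record.items():
        spec = dictionary.get(name)
        if spec is None:
            # The variable has no entry in the data dictionary.
            problems.append(f"{name}: not documented")
            continue
        if not isinstance(value, spec["type"]):
            problems.append(f"{name}: expected {spec['type'].__name__}")
        elif spec["codes"] is not None and value not in spec["codes"]:
            # The value is a code that the dictionary does not explain.
            problems.append(f"{name}: unknown code {value!r}")
    return problems


print(check_record({"site_id": "A1", "water_temp_c": 12.5, "substrate": "G"}))
# []
print(check_record({"substrate": "X", "depth": 3}))
# ["substrate: unknown code 'X'", 'depth: not documented']
```

A check like this can be run whenever new records are added, so that undocumented variables or codes are caught while the person who created them can still explain them.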
How does creating quality metadata benefit me?
Providing comprehensive metadata with your dataset is a key part of following the FAIR (Findable, Accessible, Interoperable, Reusable) data principles. It helps users to:
- Understand the context of your dataset
- Be confident in the reliability and quality of your data
- Efficiently access and re-use your data
- Easily find your data
- Ensure they are following any use constraints or specific licence conditions
- Keep up to date with different versions of your dataset
All of these make it easier for other groups to access, re-use and cite your data, providing more space for collaboration and new science opportunities!