Data formats
Choice of formats
The format and software in which research data are created and digitised usually depend on how researchers plan to analyse the data, the hardware used and the availability of software. They may also be determined by discipline-specific standards and customs.
The most appropriate software and data format for the long-term usability of data may differ from those used during most of the research project. Many software packages are backward compatible and can import data created in earlier versions, and competing popular programs are often interoperable. Even so, the safest way to secure long-term data access and interoperability is to convert data to standard formats that most software can interpret and that are suitable for data interchange and transformation. This typically means using non-proprietary, open or standard formats such as OpenDocument Format (ODF), ASCII text, tab-delimited text or comma-separated values (CSV), as opposed to proprietary ones.
It is therefore important to consider whether research data used during the project will need to be converted to a non-proprietary, open format for long-term re-use before the end of the project. If so, the resources needed for the conversion (such as staff time) will need to be factored in and the activities planned.
A range of file formats is available for storing data. Much research data is stored in tabular format, i.e. two-dimensional files of rows and columns. Each row represents one observation, identified for example by the time or location at which the measurement was taken, and each column represents a variable (also called a feature or attribute) being measured, e.g. temperature, wind speed or water level.
The most common file format for tabular data is comma-separated values (CSV), a simple plain-text format that most software can read. CSV is usually recommended because the files can be opened with many different tools, from text editors such as Notepad to spreadsheet programs such as Excel, and because plain text is robust: it does not depend on any one program and remains human-readable.
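As an illustration of the tabular layout described above, the following minimal sketch writes and re-reads a small CSV table using only the Python standard library. The column names and values are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# Hypothetical measurements: one row per observation, one column per variable.
rows = [
    {"time": "2024-01-01T00:00", "temperature_c": 4.2, "wind_speed_ms": 3.1},
    {"time": "2024-01-01T01:00", "temperature_c": 3.9, "wind_speed_ms": 2.8},
]


def write_csv(records, handle):
    """Write a list of dicts as CSV, with the dict keys as the header row."""
    writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)


def read_csv(handle):
    """Read CSV back into a list of dicts (all values come back as strings)."""
    return list(csv.DictReader(handle))


# io.StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()

buffer.seek(0)
round_trip = read_csv(buffer)
```

Note that CSV stores no type information: numbers are read back as text, so units and data types need to be documented alongside the file.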
Another option, which can carry more information, is the JavaScript Object Notation (JSON) format. Unlike CSV, it allows data to be nested in hierarchies, which is useful for representing structured or related records, for example a station together with its location and its list of measurements.
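The hierarchy that JSON allows can be sketched as follows, again with the Python standard library only. The station identifier, coordinates and field names are hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical record: metadata and measurements nested in one structure.
station = {
    "station_id": "ST001",
    "location": {"latitude": 55.95, "longitude": -3.19},
    "measurements": [
        {"time": "2024-01-01T00:00", "temperature_c": 4.2},
        {"time": "2024-01-01T01:00", "temperature_c": 3.9},
    ],
}

# Serialise to JSON text, then parse it back into Python objects.
json_text = json.dumps(station, indent=2)
restored = json.loads(json_text)

# Nested values are reached by following the hierarchy.
first_temperature = restored["measurements"][0]["temperature_c"]
```

Unlike CSV, the numbers keep their numeric type on the round trip, and the location is stored once rather than repeated on every row.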
Sometimes, however, users do not want stored or transferred data to depend on particular software or hardware. For this, they can use Extensible Markup Language (XML), a text-based, platform-independent format that can be processed by applications on any platform.
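A minimal sketch of the same kind of record in XML, built and parsed with Python's standard `xml.etree.ElementTree` module, might look like this. The element and attribute names are assumptions made for the example, not part of any particular schema.

```python
import xml.etree.ElementTree as ET

# Build a small XML document; all names below are hypothetical.
root = ET.Element("station", attrib={"id": "ST001"})
measurement = ET.SubElement(
    root, "measurement", attrib={"time": "2024-01-01T00:00"}
)
temperature_el = ET.SubElement(
    measurement, "temperature", attrib={"units": "degC"}
)
temperature_el.text = "4.2"

# Serialise to a text string, then parse it back as another program would.
xml_text = ET.tostring(root, encoding="unicode")
parsed = ET.fromstring(xml_text)

# Element text is always a string, so convert it explicitly.
temperature = float(parsed.find("measurement/temperature").text)
units = parsed.find("measurement/temperature").get("units")
```

In practice a community schema would normally define the element names, so that different applications interpret the file in the same way.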
Is it FAIR?
The choice of file formats can influence how FAIR (Findable, Accessible, Interoperable and Re-usable) research data are. File formats that are more likely to make research data Interoperable and Re-usable in the future are:
- Non-proprietary
- Open, documented standard
- Common usage by research community
- Standard representation (ASCII, Unicode)
- Unencrypted
- Uncompressed
For long-term management, non-proprietary formats are the best way to keep data FAIR, e.g. csv rather than xls (MS Excel). The NERC Data Centres can also accept other formats, e.g. ESRI ArcGIS. For more information, please contact your assigned NERC Data Centre. The NERC Data Centre and the project Data Manager should agree in advance the format each dataset needs to be in when it is transferred for curation.
Note that this does not dictate the formats researchers use for other purposes during the lifetime of the project.
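Two of the criteria listed above, standard text representation and no compression, can be roughly screened for automatically. The sketch below is a heuristic under stated assumptions only: the function name `describe_bytes` is hypothetical, it recognises just the gzip and ZIP signatures, and it cannot tell whether a format is open, documented or widely used.

```python
GZIP_MAGIC = b"\x1f\x8b"      # first bytes of a gzip stream
ZIP_MAGIC = b"PK\x03\x04"     # first bytes of a ZIP archive (also used by .xlsx)


def describe_bytes(data: bytes) -> dict:
    """Report whether content looks compressed and whether it is plain text."""
    compressed = data.startswith(GZIP_MAGIC) or data.startswith(ZIP_MAGIC)
    try:
        data.decode("utf-8")   # ASCII is a subset of UTF-8
        decodes = True
    except UnicodeDecodeError:
        decodes = False
    return {"compressed": compressed, "plain_text": decodes and not compressed}


# Example inputs: a small CSV fragment and the start of a gzip stream.
csv_report = describe_bytes(b"time,temperature_c\n2024-01-01T00:00,4.2\n")
gzip_report = describe_bytes(b"\x1f\x8b\x08\x00")
```

For a real file, the bytes could be obtained with `open(path, "rb").read()`. A check like this is only a first filter before agreeing formats with the Data Centre.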
Choosing a format for the data in a project
Ask yourself these questions about your chosen data format:
- Is this a standard data format widely used by researchers in this field?
  - If not, why are you using a non-standard format?
    - There is no standardised format for this data type
    - It is optimised for processing speed and/or volume
    - What other reason do you have for not using a standard file format for this type?
- Does this data format enable sharing and long-term archiving?
  - If not, will you need to convert to a file format more suitable for archiving later?
- Is the chosen data format completely described? Will other researchers be able to understand and use the data?
Spatial data
If you are dealing with spatial data, there are several ways to store it, and the right choice depends on the nature of the data. Raster data consist of a regular grid of pixel values; the most common raster format is GeoTIFF. Vector data consist of vertices and paths denoting points, lines or polygons; common vector formats include GeoPackage and SpatiaLite. It is also common (e.g. for global coverage data) for a single file to hold information that varies in both space and time, stored as three-dimensional arrays. This can be represented using the NetCDF format. Many other formats can also be used for spatial data.
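The three data shapes described above can be pictured with plain Python structures. This is a conceptual sketch only: it does not read or write GeoTIFF, GeoPackage or NetCDF files (dedicated libraries are needed for that), and all coordinates and values are made up.

```python
# Raster: a regular grid of cell values, indexed [row][column].
grid_t0 = [
    [1.0, 1.5, 2.0],
    [0.5, 1.0, 1.5],
]

# Vector: geometries built from (x, y) vertices.
point = (-3.19, 55.95)
line = [(-3.19, 55.95), (-3.10, 56.00)]
polygon = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 0.0)]  # closed ring

# Space and time together: a stack of grids, indexed [time][row][column].
grid_t1 = [
    [1.5, 2.0, 2.5],
    [1.0, 1.5, 2.0],
]
cube = [grid_t0, grid_t1]


def cell_time_series(data_cube, row, col):
    """Return the values of one grid cell across all time steps."""
    return [grid[row][col] for grid in data_cube]


series = cell_time_series(cube, 0, 2)
```

Real spatial formats add what this sketch leaves out, such as the coordinate reference system, cell size and units, which is why a well-described standard format matters for re-use.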