What is research data in open science?
Research data should be accessible in accordance with the FAIR principles set out below. Under these principles, data may remain closed for legitimate reasons, such as the protection of personal data, privacy rights, legitimate business interests, trade secrets, intellectual property (for example, in the case of patent applications), national security, commercial exploitation, or other legitimate interests and restrictions. All reasons for closing data should be explained in the data management plan or the project proposal, as appropriate. If the reasons for closing research data no longer apply, the data may be made available.
Standard data formats
The digital outputs of a research project come in a variety of standard data formats, for example:
- Digital video: most commonly MPEG, AVI, MXF or MKV.
- Audio files suitable for acoustic analysis and archiving: WAVE, AIFF, MP3, MXF, FLAC.
- Texts: XML, PDF/A, HTML, JSON, TXT, RTF.
- Images: JPG, GIF, TIFF, PNG, AI, SVG.
- Geospatial data: SHP, DBF, GeoTIFF, NetCDF, E00.
- Archives of multiple files: TAR, GZIP, ZIP, etc.
- Measured data (raw measurements, e.g. CO2): comma-separated values (CSV) in plain ASCII format.
For a detailed description of how to work with the different data types, see for example the How To FAIR website.
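As an illustration of the format groups above, the following minimal Python sketch looks up which group a file belongs to by its extension. The group names, the function name `format_groups` and the mapping from format names to extensions (for example WAVE to `wav`, NetCDF to `nc`) are assumptions made for this example, not part of any standard or repository API.

```python
from pathlib import Path

# Format groups as listed in the text. Extensions are lowercase, without the dot.
# The format-name-to-extension mapping is an assumption made for this sketch.
FORMAT_GROUPS = {
    "video": {"mpeg", "mpg", "avi", "mxf", "mkv"},
    "audio": {"wav", "aiff", "mp3", "mxf", "flac"},
    "text": {"xml", "pdf", "html", "json", "txt", "rtf"},
    "image": {"jpg", "gif", "tiff", "png", "ai", "svg"},
    "geospatial": {"shp", "dbf", "tif", "tiff", "nc", "e00"},
    "archive": {"tar", "gz", "zip"},
    "measurement": {"csv"},
}


def format_groups(filename: str) -> set[str]:
    """Return every group whose extension list contains the file's extension."""
    ext = Path(filename).suffix.lower().lstrip(".")
    return {group for group, exts in FORMAT_GROUPS.items() if ext in exts}
```

The function returns a set rather than a single label because some formats appear in more than one group: MXF is listed under both video and audio, and a `.tiff` file may be a plain image or a GeoTIFF.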
Open access to research data (Open Data) benefits the scientific community in many ways. By publishing the research data underlying their scientific publications, or by publishing stand-alone research data (datasets), scientists make their work more visible to the broader scientific and professional community, increase the citation rate of their publications, improve the quality of research, and foster collaboration. Sharing open data also contributes to better use of research funding and inspires further research. For this reason, most funders of research projects require open data sharing. Research data should be accessible according to the motto “as open as possible, as closed as necessary”, i.e. research data are open by default unless there are legitimate reasons for closing them.
The secure management of data is an integral part of the work of researchers. There are 4 basic categories of managed data: public data, internal data, discrete data and sensitive data.
Categories of managed data
- Public data are accessible to anyone without restriction, in line with the FAIR principles.
- Internal data are for internal use only, by a broadly defined group of people (e.g. project collaborators, institutional staff).
- Discrete data are for internal use only, by a well-defined group of persons (e.g. an employee, a manager). By their nature they require regulation or protection; typically they are protected by law or by a contract or licence.
- Sensitive data are strictly for the internal use of a well-defined group of persons (e.g. a health professional and their patient, or project developers working with data subject to commercial or similar confidentiality). By their nature they require special regulation or protection; typically they are strictly protected by law or by a contract or licence.
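The four categories above can be read as increasing levels of protection. The sketch below models them as an ordered enumeration with a simple read check. The names `DataCategory` and `may_read`, the two group flags, and the rule that a member of the well-defined group may also read internal data are all assumptions made for illustration; real institutional rules will be more detailed.

```python
from enum import IntEnum


class DataCategory(IntEnum):
    """Categories of managed data, ordered from least to most protected."""
    PUBLIC = 0
    INTERNAL = 1
    DISCRETE = 2
    SENSITIVE = 3


def may_read(category: DataCategory, *, in_broad_group: bool = False,
             in_defined_group: bool = False) -> bool:
    """Hypothetical read check based only on group membership."""
    if category is DataCategory.PUBLIC:
        return True  # accessible to anyone
    if category is DataCategory.INTERNAL:
        # broadly defined group, e.g. project collaborators (assumption:
        # members of the narrower group are also allowed)
        return in_broad_group or in_defined_group
    # DISCRETE and SENSITIVE: well-defined group only
    return in_defined_group
```

Using an `IntEnum` keeps the ordering explicit, so a comparison such as `DataCategory.SENSITIVE > DataCategory.INTERNAL` can express "at least as protected as" in other checks.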
In open science, there are 4 categories of access to research data:
- Open data, in the context of open access to research data, are data that anyone may freely use, modify and share for any purpose. Research data shared through open access under the FAIR principles are made available under clearly defined conditions. Open access to research data is usually implemented through electronic data repositories. Research data must be made available in a form that allows their further use, both technically and legally. Access, use, reproduction and dissemination of the data must be free of charge. Research datasets may be published on their own or as companion data to open access publications.
- Data access with embargo. The data manager indicates in the repository the date from which the dataset will be publicly available; an embargo may be set, for example, at a publisher’s request.
- Restricted data access is a specific case in which the data custodian specifies the conditions under which users will be granted access to datasets in the repository. A user requesting access in the repository is asked to justify the request. The data custodian must not charge users for granting access to the data. This mode of access is available, for example, in the Zenodo repository.
- Closed data access is applied to protect trade secrets or intellectual property, to comply with security rules, or for other reasons. Closed data may be stored in a closed-access repository together with a basic description.
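The four access categories can be summarised as a small decision rule. The following Python sketch is one possible reading of the descriptions above, not the behaviour of any particular repository: the names `Access`, `Dataset` and `can_download`, the `approved_users` set and the choice to treat the embargo date itself as the first public day are assumptions made for this example.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional


class Access(Enum):
    OPEN = "open"
    EMBARGOED = "embargoed"
    RESTRICTED = "restricted"
    CLOSED = "closed"


@dataclass
class Dataset:
    title: str
    access: Access
    # Date from which the dataset is public; only meaningful for EMBARGOED.
    embargo_until: Optional[date] = None
    # Users whose justified request the custodian has approved; only for RESTRICTED.
    approved_users: set = field(default_factory=set)


def can_download(ds: Dataset, user: str, today: date) -> bool:
    """Hypothetical check of whether `user` may obtain the data files today."""
    if ds.access is Access.OPEN:
        return True
    if ds.access is Access.EMBARGOED:
        # Public from the embargo date onwards (inclusive, by assumption).
        return ds.embargo_until is not None and today >= ds.embargo_until
    if ds.access is Access.RESTRICTED:
        return user in ds.approved_users
    return False  # CLOSED: only the basic description is visible
```

For example, a dataset created as `Dataset("Survey 2024", Access.EMBARGOED, embargo_until=date(2030, 1, 1))` would be refused on 31 December 2029 and allowed from 1 January 2030. In every case the check involves no payment step, in line with the requirement that access be free of charge.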
FAIR data is a term used in open science for research data shared in line with the FAIR principles. The acronym stands for the four basic requirements for data: findable, accessible, interoperable and reusable. A detailed description is given in the FAIR Principles.