CODATA, The Committee on Data for Science and Technology

International Council for Science : Committee on Data for Science and Technology


Data Citation Standards and Practices Task Group

The need for robust data citation capabilities

Just as the growth of electronic publishing of literature has created new challenges, such as the need for mechanisms for citing online references in ways that can be relied upon for many years into the future, the growth in online datasets (as distinguished from literature) presents related, yet additional, challenges. Data citation standards and good practices can also form the basis for increased incentives, recognition, and rewards for scientific data activities, which in many cases are currently lacking in all fields of research. The rapidly expanding universe of online digital data holds the promise of allowing peer examination and review of conclusions or analyses based on experimental or observational data, as well as allowing subsequent users to make new and unforeseen uses and analyses of the same data, either in isolation or in combination with other datasets.

This promise, however, depends upon the ability to reliably identify, locate, access, and interpret digital datasets and to verify their version, integrity, and provenance. The problem of citing online data is complicated by the lack of established practices for referring to portions or subsets of data. Unlike in the realm of literature, where a printed edition may be the version of record for a document, there is typically no such hard copy of a database. Even if such a hard copy were feasible, scientists lack the necessary constructs and conventions for referring to portions of a database, analogous to the volume and page numbers, or titles, chapters, and sections, that are commonly used in citing text published in books or serial publications.

As funding sources for scientific research have begun to require data management plans as part of their selection and approval processes, it is important that the necessary standards, incentives, and conventions to support data citation, preservation, and accessibility be put into place. There are, in fact, a number of initiatives in different organizations, countries, and disciplines already underway. One important group is DataCite. Others remain ad hoc and uncoordinated. The Task Group, being organized jointly by several CODATA committees and ICSTI, together with representatives from several other organizations, would examine a number of key issues related to data citation, help coordinate activities in this area, and promote common practices and standards in the scientific community.

Issues Requiring Attention

There are many issues that need to be addressed in establishing standards and good practices in the data citation arena. Below is a preliminary and partial annotated list that the Task Group would consider, prioritize, and address as appropriate.

A. Technical
1. Interoperability and Facilitation of Re-use. There is already considerable diversity in database formats, such as various flat-file, hierarchical, relational, object-oriented, and XML-based databases. There is every reason to expect that new modalities and formats for storing and manipulating digital data will continue to emerge.
2. Citation Formats. What data citation conventions have been developed already? How are they similar and how do they differ? Can they be standardized?
3. Metadata. How do metadata conventions or standards affect citation formats?
4. Database Versioning. Datasets are more dynamic than documents, and this creates additional challenges for citation practice. When should the dataset as a whole be cited? How can a specific, time-fixed version be cited? What changes to the data constitute a new contribution or added value? How should this be acknowledged? How are database versions controlled and labelled?
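As an illustration of the versioning questions above, one possible way to make a "specific, time-fixed version" checkable is to attach a content fingerprint to the citation. The sketch below is only an assumption about how this could be done, not a practice the Task Group has adopted: the function names, the citation layout, and the example data are all hypothetical, and SHA-256 over a canonical JSON serialization is just one of many reasonable choices.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest of a canonical serialization of the rows.

    Each row is serialized with sorted keys and the serialized rows are
    sorted, so the same content yields the same digest regardless of
    row order or key order.
    """
    canonical = sorted(json.dumps(row, sort_keys=True) for row in rows)
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def cite(creator, year, title, version, publisher, rows):
    """Format a citation string that pins a version label and a short fingerprint."""
    short = dataset_fingerprint(rows)[:12]
    return f"{creator} ({year}): {title}. Version {version}. {publisher}. sha256:{short}"


# Hypothetical dataset: two releases that differ by one corrected value.
v1 = [{"station": "A", "temp_c": 11.2}, {"station": "B", "temp_c": 9.8}]
v2 = [{"station": "A", "temp_c": 11.2}, {"station": "B", "temp_c": 9.9}]

print(cite("Example Lab", 2011, "Station Temperatures", "1.0", "Example Archive", v1))
print(cite("Example Lab", 2011, "Station Temperatures", "1.1", "Example Archive", v2))
```

A reader holding a copy of the data can recompute the fingerprint and compare it with the one in the citation. This addresses only integrity of a fixed version; it does not by itself answer when a change deserves a new version label or new credit.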

B. Scientific
In addition to the diversity of database formats, the creators and users of online datasets have diverse needs that must be considered in the development of persistent identifier standards and models. For example, different disciplines may need different levels of granularity at which digital “objects” are identified. Which differences among disciplines need to be addressed separately?

C. Institutional
What are the roles of the respective stakeholders in the system—the data managers, researcher umbrella groups, universities, libraries, publishers, research funders? What are the implications for these stakeholders? Does this vary by discipline?

D. Financial
In a field that requires a lot of granularity as noted in B, even nominal registration fees per object can quickly become cost-prohibitive. In order for a data citation system to be useful, it must be accessible and its costs affordable by all necessary user communities.
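The effect of granularity on cost can be shown with simple arithmetic. The figures below are purely hypothetical (no actual registry fee is implied); the sketch only shows how the same per-object fee scales when identifiers are assigned per record rather than per dataset.

```python
def registration_cost_cents(n_objects, fee_cents_per_object):
    """Total registration cost, in cents, for a number of identified objects."""
    return n_objects * fee_cents_per_object


# Hypothetical figures for illustration only.
FEE_CENTS = 5                    # assumed fee per registered identifier
N_DATASETS = 200                 # identifiers needed at dataset-level granularity
RECORDS_PER_DATASET = 1_000_000  # assumed records per dataset

coarse = registration_cost_cents(N_DATASETS, FEE_CENTS)
fine = registration_cost_cents(N_DATASETS * RECORDS_PER_DATASET, FEE_CENTS)

# Integer cents avoid floating-point rounding; convert only for display.
print(f"dataset-level identifiers: {coarse / 100:,.2f}")
print(f"record-level identifiers:  {fine / 100:,.2f}")
print(f"ratio: {fine // coarse:,}x")
```

Under these assumptions the cost grows in direct proportion to the number of identified objects, so a fee that is negligible per dataset can dominate a budget at record-level granularity.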

E. Sustainability
As noted above, there is a need for registration and persistent identification for online digital datasets. Some registry and resolution models for this function have already emerged, but the various models (for-profit vs. not-for-profit, public vs. private, etc.) must be examined to ensure that they are sustainable in the long term. Moreover, just as the persistence of the connection from print citations to the correct physical copies depends on libraries or publishers keeping those copies, the persistence of the connection between a data citation and the actual data must ultimately depend on some form of commitment by durable institutions to preserving the data that is cited. Although a top-down, centralized archive that keeps and organizes all data is an obviously attractive concept and works in some fields, creating such a trustworthy structure is probably not feasible universally, especially given the huge increases in the amount and types of data being generated or used by the scientific community. Distributed approaches to preservation, such as institutional repositories, the Data Preservation Alliance for Social Science, and LOCKSS, are emerging examples of alternatives to the centralized archiving model.

F. Persistent Identifiers
One existing service is the DOI (Digital Object Identifier) System. DOI names are widely used in scientific publishing to cite journal articles. More than 30 million scholarly articles have been registered with DOI names by CrossRef so far. Using DOI names to cite datasets would make their provenance trackable and citable, and would therefore allow interoperability with existing reference services such as Thomson Reuters “Web of Science”. The use of DOI names for datasets is promoted by the not-for-profit DataCite consortium, which has registered over 600,000 datasets so far in cooperation with several World Data Centers. There are, however, significant differences between data and documents, and it is possible that these differences will make some aspects of the DOI system less attractive.

The Corporation for National Research Initiatives supports the Handle system for digital identifiers. This uses technical protocols identical to those of the DOI system (which are based on the original hdl protocol), but offers a different business and service model. Because handle registration fees are substantially lower, many digital library and institutional repository systems, such as DSpace, Fedora, and the Dataverse Network in the U.S., support this form of identifier.

The not-for-profit Online Computer Library Center (OCLC) also runs a persistent URL (PURL) resolution service, provides PURL resolver server software, and encourages other organizations that wish to run PURL servers. This distributed model may help to avoid the “single point of failure” present in the failed URL-shortening service, but the reliance upon any registry and resolution system to assure continued access to important and useful data militates strongly for an examination of its long-term sustainability, including the sufficiency of contingency and continuity of operation plans to mitigate the risks associated with the demise of a sponsoring organization.
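To illustrate how DOI names and handles share the same prefix/suffix structure while being served by different resolvers, the following minimal sketch builds a resolver URL from an identifier. The public proxy addresses `https://doi.org/` and `https://hdl.handle.net/` are real services; the function names and the example identifiers are hypothetical, and choosing the resolver by the `10.` prefix is a simplifying assumption made for this sketch. No network request is made.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"
HANDLE_RESOLVER = "https://hdl.handle.net/"


def split_identifier(identifier):
    """Split 'prefix/suffix' at the first slash; raise ValueError if malformed."""
    prefix, sep, suffix = identifier.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"expected 'prefix/suffix', got {identifier!r}")
    return prefix, suffix


def resolver_url(identifier):
    """Return a resolver URL for a DOI name or a handle.

    DOI names are handles whose prefix begins with '10.'; any other
    prefix is routed to the general Handle proxy in this sketch.
    """
    prefix, suffix = split_identifier(identifier)
    base = DOI_RESOLVER if prefix.startswith("10.") else HANDLE_RESOLVER
    # Percent-encode characters in the suffix that are unsafe in a URL path.
    return base + prefix + "/" + quote(suffix, safe="/")


# Hypothetical identifiers, for illustration only.
print(resolver_url("10.1234/example.dataset.v1"))
print(resolver_url("20.500.12345/678"))
```

The sketch also makes the sustainability concern concrete: a citation that records only the resolver URL depends on that resolver continuing to operate, whereas recording the bare identifier leaves open the option of resolving it elsewhere.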

G. Legal Issues/Intellectual Property Rights
Any registry system must accommodate emerging intellectual property rights mechanisms such as Creative Commons and Science Commons licensing, as well as traditional copyright law.

H. Socio-cultural and Community Norms
A major reason for promoting the adoption of standard data citation practices is to develop a common basis and community of practice for recognizing and rewarding data work and incentivizing disclosure of data in interoperable and quality controlled ways. What are the factors that need to be considered in this area? Of particular interest to the Task Group is how such data management activities might impact the personal performance evaluations of scientists and the reward and promotion structures in science.

Attribution is not quite the same as citation, although citation is one of the ways of giving attribution. Licences akin to Creative Commons may require attribution, but this can result in “attribution stacking”, where the work of tens or hundreds of contributors may have to be acknowledged. One way through this may be to establish community norms for what are acceptable levels of attribution for datasets.

J. Other Issues
There are certain to be other important elements to the proper development and implementation of data citation standards and good practices, especially discipline-specific ones that may arise once the Task Group is established.
