Incentivizing data reuse

Sharing data for reuse is a widely embraced ideal that benefits society as a whole by making the fullest use of the data generated. However, the benefits of sharing for the individual can be tenuous or mixed. This is partly because successful sharing often depends on data producers spending scarce time making their data reusable. At the same time, data consumers cannot easily verify or acknowledge the provenance of the data they acquire, and often resort to recompiling the data themselves, which duplicates effort.

We argue that breaking down these processes and incentivizing more contributors to help accomplish each objective will go a long way toward making data sharing more achievable.

Towards this vision, datafair.xyz is a marketplace for data that provides monetary incentives to reward both the provision and the curation of data. Because datasets are listed under open licenses, other members of the community can further curate and add value to a dataset they have purchased, and then resell the resulting dataset at a marked-up price as a reward for their efforts.

The reuse of data is a complex process that requires scientists to be able to discover and access intelligible, trustworthy, and relevant data. The marketplace implements features for personal feedback, data previews, and messaging the data originator, all of which support high-quality reuse. Additionally, to aid both the data consumer and the data producer, we are implementing a nanopublication scheme complemented with trusty URIs to support transparent data governance for accountability, research assessment, and data provider citation.

Nanopublications are an approach to representing a small unit of publishable information, such as the existence of a dataset, in an RDF-based formal notation consisting of named graphs that can be easily mined, queried, retrieved, and cited by others. A nanopublication has three basic elements: an assertion; provenance, which contains context about the assertion; and publication information, which is metadata about the nanopublication as a whole. In the context of Datafair.xyz, we have developed a nanopublication model to cite the datasets offered in the marketplace.
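To make the three-part structure concrete, here is a minimal Python sketch that assembles a nanopublication as named graphs and prints it as N-Quads-style lines. It is an illustration under stated assumptions, not the Datafair.xyz model: the `Nanopub` class, the `example.org` URIs, the graph names, the `EX` vocabulary, and the timestamps are all hypothetical placeholders. Only the nanopublication schema terms (`np:hasAssertion` and friends) and the PROV terms are taken from existing public vocabularies. The extra "Head" graph that ties the three parts together follows the usual nanopublication convention.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

NP = "http://www.nanopub.org/nschema#"
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATETIME = "http://www.w3.org/2001/XMLSchema#dateTime"
EX = "http://example.org/terms#"  # hypothetical vocabulary, not Datafair.xyz's

Triple = Tuple[str, str, str]


def term(value: str) -> str:
    """Render an object: quoted literals pass through, anything else is an IRI."""
    return value if value.startswith('"') else f"<{value}>"


@dataclass
class Nanopub:
    """Hypothetical container for the parts of one nanopublication."""
    uri: str
    assertion: List[Triple] = field(default_factory=list)
    provenance: List[Triple] = field(default_factory=list)
    pubinfo: List[Triple] = field(default_factory=list)

    def graph(self, name: str) -> str:
        # Graph naming is an assumption made for this sketch.
        return f"{self.uri}#{name}"

    def head(self) -> List[Triple]:
        # The head graph links the nanopublication to its three parts.
        return [
            (self.uri, RDF_TYPE, NP + "Nanopublication"),
            (self.uri, NP + "hasAssertion", self.graph("assertion")),
            (self.uri, NP + "hasProvenance", self.graph("provenance")),
            (self.uri, NP + "hasPublicationInfo", self.graph("pubinfo")),
        ]

    def to_nquads(self) -> str:
        graphs = {
            "Head": self.head(),
            "assertion": self.assertion,
            "provenance": self.provenance,
            "pubinfo": self.pubinfo,
        }
        lines = []
        for name, triples in graphs.items():
            for s, p, o in triples:
                lines.append(f"<{s}> <{p}> {term(o)} <{self.graph(name)}> .")
        return "\n".join(lines)


# Placeholder data for illustration only.
np_uri = "http://example.org/np/1"
dataset = "http://example.org/dataset/42"
timestamp = f'"2017-01-01T00:00:00Z"^^<{XSD_DATETIME}>'

pub = Nanopub(
    uri=np_uri,
    # Assertion: a dataset product with a type of information about a category.
    assertion=[
        (dataset, RDF_TYPE, EX + "DatasetProduct"),
        (dataset, EX + "hasInformationType", EX + "PriceHistory"),
        (dataset, EX + "aboutCategory", EX + "ConsumerGoods"),
    ],
    # Provenance: statements about the assertion graph itself.
    provenance=[
        (np_uri + "#assertion", PROV + "wasDerivedFrom", "http://example.org/dataset/41"),
        (np_uri + "#assertion", PROV + "wasAttributedTo", "http://example.org/user/alice"),
        (np_uri + "#assertion", PROV + "generatedAtTime", timestamp),
    ],
    # Publication info: statements about the nanopublication as a whole.
    pubinfo=[
        (np_uri, PROV + "generatedAtTime", timestamp),
        (np_uri, PROV + "wasDerivedFrom", "http://example.org/listing/42"),
    ],
)

print(pub.to_nquads())
```

Note that the provenance triples have the assertion graph as their subject, while the publication information describes the nanopublication URI itself. That separation is what lets a consumer query "who made this claim" independently of "who published this record".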

For the Datafair.xyz nanopublications, the assertion is the minimal unit of information needed to describe the contents of a dataset. It states that there is a dataset product that has a certain type of information about a category of things. The provenance information exposes the derivation, attribution, and generation time of the assertion, which gives a basis for quality assessment, an important concern when reusing datasets in general. The publication information has triples indicating when the nanopublication was generated and what it was derived from. The term schema for the nanopublications includes widely used vocabularies, such as PROV, as well as proprietary terms. The nanopublication is then given a trusty URI and propagated to one of the nanopublication server networks.
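The idea behind a trusty URI is that the identifier carries a cryptographic hash of the content it names, so anyone holding the content can check that it has not been altered. The Python sketch below shows that general idea only. It is a simplified stand-in, not the trusty URI algorithm: the actual specification defines its own RDF normalization, module identifiers, and handling of the hash appearing inside the content itself, none of which is reproduced here. The function names and the `example.org` URI are hypothetical.

```python
import base64
import hashlib


def content_hash_suffix(nquads: str) -> str:
    """Hash a set of statement lines, ignoring their order and blank lines.

    Sorting the lines is a naive stand-in for proper RDF normalization.
    """
    lines = sorted(line.strip() for line in nquads.splitlines() if line.strip())
    canonical = "\n".join(lines).encode("utf-8")
    digest = hashlib.sha256(canonical).digest()
    # URL-safe Base64 without padding: 43 characters for a SHA-256 digest.
    return base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")


def make_hashed_uri(base_uri: str, nquads: str) -> str:
    """Append the content hash to a base URI (separator is an assumption)."""
    return f"{base_uri}.{content_hash_suffix(nquads)}"


def verify(uri: str, nquads: str) -> bool:
    """Check that the hash at the end of the URI matches the content."""
    return uri.rsplit(".", 1)[-1] == content_hash_suffix(nquads)


# Placeholder statements for illustration only.
content = (
    "<http://example.org/dataset/42> <http://example.org/terms#aboutCategory> "
    "<http://example.org/terms#ConsumerGoods> <http://example.org/np/1#assertion> ."
)
uri = make_hashed_uri("http://example.org/np/1", content)

print(verify(uri, content))                # content unchanged
print(verify(uri, content + "\n<a> <b> <c> <d> ."))  # content tampered with
```

Because the identifier is derived from the content, a data consumer who cites the URI is also pinning the exact statements it referred to at the time of citation, which is what makes it useful for provenance and data provider citation.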

There will always need to be a ‘mixed economy’ of incentives that encompasses different forms of data sharing. If data producers see no immediate, direct benefit in sharing their data with unknown future users, other strategies, such as compliance with policies and guidelines, become the major drivers. Therefore, across different types of sharing mechanisms, there should be no value judgement about which form is best because, ultimately, broad access to data has the potential to stimulate progress by revealing previously overlooked critical information and reducing redundant work.