Dataset guidelines


Follow our guidelines to create the best dataset possible

Dataset Metadata

Name

Your dataset should have a name that tells buyers about its main content. The name is what buyers see when they browse the dataset listings, so it should focus on what makes your dataset stand out.

Product Description

A product description that is clear and detailed leaves no room for confusion or doubt about the quality and quantity of data in the product. Use the “Product Description” field to give an extended description of the data. Answer all the questions that you think a buyer may have about your data.

To be completely transparent, we suggest you specifically mention the following:

Number of files

A dataset can contain more than one file, and the buyer should be told how many files to expect. For example, one dataset may consist of a single file of values plus an accompanying “Read Me” that explains what the values mean, while another may be composed of several tables, each in its own file.

Missing data

It is important for buyers to understand whether there are missing data points and what they mean in the dataset. For example, if a missing data point should have been recorded but wasn’t, it may be important to keep it during subsequent analyses. In other cases, a missing data point represents something that couldn’t be measured because it would never occur, and such data points can potentially be dropped in subsequent analyses.
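
One way to back up this part of the description is to count the missing values in each column before writing it. The following is a minimal sketch using only the Python standard library. The sample data, the column names and the set of markers treated as missing are assumptions made for illustration, so adapt them to your own files.

```python
import csv
import io

# Hypothetical sample data: empty cells and "NA" stand for missing values.
SAMPLE_CSV = """player,passes_completed,injury_days
Ann,34,
Bob,,12
Cid,28,NA
"""

# Assumption: adjust this set to the markers your own files use.
MISSING_MARKERS = {"", "NA"}


def count_missing(csv_text):
    """Return {column name: number of missing cells} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            # A short row yields None for the columns it lacks.
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return counts


missing_counts = count_missing(SAMPLE_CSV)
print(missing_counts)
```

The counts only say how many values are missing. The description should still explain why they are missing and whether buyers should keep or drop them.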

Any known limitations of the dataset

Full disclosure and transparency encourage trust. It is better to be upfront with buyers about any known limitations of the dataset than to have them give you a bad recommendation that discourages subsequent buyers.

Product Short Description

The short description is available for you to add a headline-type description of your data. It appears under the dataset name when a buyer is looking at the details of your listing.  Something unique can encourage buyers to look further into your dataset metadata.

Categories

Buyers can browse datasets by category.  Pick the categories in which your dataset belongs.

Tags

You can label your dataset with tags that are specific to your content.  Buyers can search for all datasets with specific tags.

Image

Often, data consumers will rely on images to form an opinion about your dataset’s value. In the media section of the “Add dataset” page, you can add an image that represents your dataset. You must use images for which you own or control the copyright, or images that are in the public domain. You should not use photos that show an identifiable person or persons. If you have made a graph or some other visualization from your data, this is a perfect option. One image is uploaded and displayed as the featured image. Other images may be uploaded to the gallery; these are displayed under the featured image and can be viewed by buyers when they review the details of your dataset. Clicking on an image in the gallery shows it in a larger display.

Price

Enter the price, in Euros, that you think is reasonable for your dataset. The price depends on what you think buyers will pay, and remember that you probably should not expect to recover your investment in creating the dataset in just one sale. At some point you may also decide to put your dataset on promotional sale. A dataset can be on sale for a maximum of 60 days a year.

Data File(s)

This is where you upload your dataset file or files. Click on “Add file”. Drag your file into the upload space. Click on the “Insert URL” button. If your dataset file does not have an intuitive name, you can give it one by entering something in the “Name” field. If the file name already makes the content clear, you can leave this field blank.

Format

Indicate the format of the dataset. For most buyers, formats that can be used immediately, such as CSV, are often more valuable than PDFs. Some buyers are looking for more specialized datasets in structured formats such as RDF or JSON, possibly with a schema, so it is very important to explain the format clearly.

License

Selecting a license (or deed) is required. A dropdown list gives a choice of licenses and deeds that allow liberal reuse of datasets. We suggest selecting licenses that impose few or no extra obligations on the buyer. See the licensing page for more specific details on each of the proposed licenses. It is also possible to add your own bespoke license; for this, select “Other” from the dropdown list. A bespoke license must be a text file that is included in the download of the dataset.

Data source

Describe the source(s) of the data.  Sources can vary widely, for example, the data could be from one or several websites, from a specific recruited population, or generated by a machine.

Data collection method

Explain how the data was gathered. It can be important to give the details of your experimental method (or link to a reference with the explanation) so the buyer understands appropriate ways to use the data in subsequent analyses.

Data date

Indicate the date the data was collected. It can be important to make the buyer aware of whether there will be new releases of the dataset or whether it is a one-time-only dataset. If datasets will be released at regular intervals, indicate how they will be versioned so the buyer can easily tell which version they are getting. The details of release dates can be given in the Product Description field.

Number of observations and variables

Buyers care about the number of observations and variables. Two fields let you enter the number of variables and the number of observations in your dataset.
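
For a dataset supplied as a single CSV table, both numbers can be counted rather than estimated. The sketch below uses only the Python standard library and assumes one header row, one observation per following row, and one variable per column (identifier columns included). The sample data and the function name are hypothetical.

```python
import csv
import io

# Hypothetical sample data with one header row.
SAMPLE_CSV = """day,temperature_c,humidity_pct
2024-01-01,3.5,81
2024-01-02,4.1,77
2024-01-03,2.8,85
"""


def count_shape(csv_text):
    """Return (observations, variables) for CSV text with one header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return 0, 0
    header, data = rows[0], rows[1:]
    # Skip completely empty lines so they are not counted as observations.
    observations = sum(1 for row in data if any(cell.strip() for cell in row))
    return observations, len(header)


observations, variables = count_shape(SAMPLE_CSV)
print(f"{observations} observations, {variables} variables")
```

If the dataset is made up of several tables, it is probably clearer to report the counts per file in the Product Description.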

More guidelines

Tidy datasets sell better

Structuring datasets to facilitate manipulation, visualization and modeling improves their quality and makes them more attractive to buyers. It is very helpful to follow the standards for tidy data described in Hadley Wickham’s paper “Tidy Data” in the Journal of Statistical Software. If data is supplied in table format, the main idea is to give each variable its own column and each observation its own row. The values are in the body of the table. The variables are the things that were measured; for example, a variable could be passes completed, temperature, or grams of fat. An observation is the thing that the variables were measured on. For the example variables above, the observations could be a football player, a specific day, or a food.
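
As an illustration of the idea, a common untidy layout stores values of a variable (here, days) as column headers. The sketch below reshapes such a table so that each row is one observation and each column is one variable. The station names, dates and the `melt` helper are hypothetical and written with plain Python only; data analysis libraries offer their own equivalents.

```python
# Hypothetical "wide" table: the day columns are values, not variable names.
wide_rows = [
    {"station": "A", "2024-01-01": 3.5, "2024-01-02": 4.1},
    {"station": "B", "2024-01-01": 1.2, "2024-01-02": 0.9},
]


def melt(rows, id_column, name_column, value_column):
    """Turn each non-identifier column into its own row (wide to long)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue
            tidy.append(
                {id_column: row[id_column], name_column: column, value_column: value}
            )
    return tidy


# Each resulting row is one observation: a station on a specific day.
tidy_rows = melt(wide_rows, "station", "day", "temperature_c")
for row in tidy_rows:
    print(row)
```

In the reshaped table, station, day and temperature_c are the variables, and every station and day combination is one observation.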

Licensing

Licenses protect both the buyer and the seller, so all datasets must be sold with a license. A license informs buyers of what they are legally allowed to do with the data. The license can also assert that the seller gives no warranties about the dataset and disclaims liability for all uses of the dataset. We have provided a selection of licenses that are the preferred licenses of Datafair. We suggest sellers use very liberal licenses, which can encourage consumers to reuse datasets in many different ways. Sellers can apply bespoke licenses, but it is then up to the seller to check that buyers comply with them.