TL;DR

Checklist for dataset sustainability

  1. Produce High-Quality Datasets
    • Clean and validate the data
    • Use established or standardized naming conventions for variables
    • Publish in standard, machine-readable formats
    • Ensure that annotated data has consistent and accurate labels
    • Test the data on various models
  2. Create Privacy-Preserving Datasets
    • Adhere to local regulations and protocols regarding data protection and privacy
  3. Document Your Dataset Extensively
    • Create a data sheet
    • Document any models used to augment or annotate the data
    • Version the dataset
  4. Host Datasets Accessibly
    • Publish on widely accepted and available data-sharing platforms
    • License the data with appropriate open-source licenses
    • Compress the data
  5. Promote Your Dataset Widely and Build a Community Around It
    • Host competitions to innovate and solve problems with the dataset
    • Publish the dataset in a data journal, at a conference, or in a paper
    • Generate DOIs for your dataset
    • Share the dataset on social media and online forums
    • List the dataset in dataset registries and catalogues
    • Partner with academic institutions and industry players
    • Organize regular meetups and webinars
  6. Create and Maintain Your Datasets Ethically
  7. Consider Sustainable Funding for Your Dataset
    • Grants
    • Partnerships
    • Crowd-sourcing
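
As an illustration of a few items under point 1 and point 4 (validating the data, using standardized variable names, publishing in a machine-readable format, and compressing the data), here is a minimal Python sketch. It is only one possible approach, not a prescribed workflow: the snake_case naming rule, the choice of CSV and gzip, the helper names `validate_rows` and `to_gzipped_csv`, and the sample rows are all assumptions made for this example.

```python
import csv
import gzip
import io
import re

# Assumed naming convention for this sketch: lowercase snake_case column names.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9_]*$")


def validate_rows(rows, required):
    """Return a list of human-readable problems found in the rows."""
    if not rows:
        return ["dataset is empty"]
    problems = []
    # Check the column names against the assumed naming convention.
    for name in rows[0].keys():
        if not SNAKE_CASE.match(name):
            problems.append(f"column name not snake_case: {name!r}")
    # Check that every required field has a non-empty value in every row.
    for i, row in enumerate(rows):
        for field in required:
            if row.get(field) in (None, ""):
                problems.append(f"row {i}: missing value for {field!r}")
    return problems


def to_gzipped_csv(rows):
    """Serialize rows to CSV (machine-readable) and gzip-compress in memory."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return gzip.compress(buffer.getvalue().encode("utf-8"))


# Hypothetical sample data: the second row has an empty label.
rows = [
    {"sample_id": "1", "label": "cat"},
    {"sample_id": "2", "label": ""},
]

problems = validate_rows(rows, required=["sample_id", "label"])
compressed = to_gzipped_csv(rows)

for problem in problems:
    print(problem)
print(f"compressed size: {len(compressed)} bytes")
```

In practice, checks like these would run before each new version of the dataset is published, so that problems are caught before the data reaches users.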