TL;DR
Checklist for dataset sustainability
- Produce high-quality datasets
  - Clean and validate the data
  - Use established or standardized naming conventions for variables
  - Publish in standard, machine-readable formats
  - Ensure that annotated data has consistent and accurate labels
  - Test the data on various models
- Create privacy-preserving datasets
  - Adhere to local regulations and protocols regarding data protection and privacy
- Document your dataset extensively
  - Create a data sheet
  - Document any models used to augment or annotate the data
  - Version the dataset
- Host datasets accessibly
  - Publish on widely accepted and available data-sharing platforms
  - License the data with appropriate open-source licenses
  - Compress the data
- Promote your dataset widely and build a community around it
  - Host competitions to innovate or solve problems with the dataset
  - Publish the dataset in a data journal, conference, or paper
  - Generate DOIs for your dataset
  - Share it on social media and online forums
  - List it in dataset registries and catalogues
  - Partner with academic institutions and industry players
  - Organize regular meetups and webinars
- Create and maintain your datasets ethically
- Consider sustainable funding for your dataset
  - Grants
  - Partnerships
  - Crowd-sourcing
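
As a rough illustration of three of the items above (validating the data, compressing it, and versioning it), here is a minimal Python sketch. It assumes the dataset is a small CSV file. The column names, the allowed labels, and the function names `validate_rows` and `package` are hypothetical, chosen only for this example, and a real dataset would need checks suited to its own schema.

```python
import csv
import gzip
import hashlib
import io

# Hypothetical schema, for illustration only.
EXPECTED_COLUMNS = ["sample_id", "text", "label"]
ALLOWED_LABELS = {"positive", "negative"}


def validate_rows(csv_text):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [(1, f"unexpected header: {reader.fieldnames}")]
    problems = []
    for row in reader:
        # Short rows yield None values; rows with extra fields yield a list under the key None.
        if any(not isinstance(value, str) or not value.strip() for value in row.values()):
            problems.append((reader.line_num, "missing or malformed value"))
        elif row["label"] not in ALLOWED_LABELS:
            problems.append((reader.line_num, f"unknown label: {row['label']}"))
    return problems


def package(csv_text):
    """Compress the CSV and return (gzip_bytes, sha256_hex_of_uncompressed_data)."""
    raw = csv_text.encode("utf-8")
    digest = hashlib.sha256(raw).hexdigest()
    # mtime=0 keeps the gzip output byte-for-byte reproducible across runs.
    compressed = gzip.compress(raw, mtime=0)
    return compressed, digest


if __name__ == "__main__":
    data = "sample_id,text,label\n1,great,positive\n2,,negative\n"
    for line_number, problem in validate_rows(data):
        print(f"line {line_number}: {problem}")
```

One possible use of the checksum is to record it in the data sheet next to each version number, so that users can confirm which release they downloaded.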