Sustainable AI Training Datasets

Defining Sustainability

We define sustainable AI training datasets as high-quality, curated datasets that are built for long-term accessibility and broad usability, prioritise societal benefit, respect and reflect the diversity of the communities they represent, and are created with awareness of their specific context and through non-exploitative practices.

Why Dataset Sustainability is Important

Publishing and maintaining sustainable AI training datasets is critical to developing and advancing machine learning and artificial intelligence technologies. The datasets produced through the Lacuna Fund grant aim to serve as the foundation for training models and algorithms in low-resourced contexts. The quality and relevance of the training data are key factors in determining the effectiveness and accuracy of the resulting models and their eventual impact in these contexts. For this reason, publishing high-quality, sustainable datasets is paramount for ensuring that AI systems are developed ethically and responsibly.

One of the main benefits of publishing AI training datasets is that it increases the availability of knowledge resources and promotes collaboration in building context-relevant models and AI systems. Researchers can learn from each other's work and build on previous discoveries. Reusing datasets also conserves resources and helps ensure that communities do not suffer from research fatigue.

As AI systems become more complex and are integrated into sectors and applications that affect societies, it is essential to ensure that these systems are developed responsibly and that their decisions are explainable and understandable. By publishing training datasets, researchers promote transparency and accountability by providing information about the data used to train the models. This openness in turn gives researchers an incentive to build high-quality models.

Sustainable AI training datasets are also critical for ensuring that the AI technologies developed have a positive impact on society. By making datasets representative and inclusive of diverse populations, researchers can help minimize potential biases and promote fair and equitable AI systems.

Ultimately, by publishing sustainable training datasets, researchers can ensure that the expected benefits of these datasets are realized and maintained even after the Lacuna Fund grant ends.