Welcome to the AI Training Dataset Sustainability Toolkit

Guidance to the Toolkit

The purpose of this toolkit is to support Lacuna Fund grantees and the wider machine learning (ML) community in publishing sustainable AI training datasets in standard formats. The toolkit is designed as a step-by-step playbook, with guidance and checklists that researchers can reference as they prepare their datasets for widespread reuse by others in the ML community. It covers how to produce high-quality datasets before publication, how to select an appropriate platform to host a dataset, and which activities researchers can undertake to promote their datasets and maximize their discoverability and impact. The toolkit also emphasizes ethical considerations, including guidelines on how to protect personal information and privacy. Examples in this initial draft of the toolkit are specific to datasets for use in ML in Agriculture, Climate, Natural Language Processing, Health and Energy.

This work was carried out by the Local Development Research Institute (LDRI) with support from Lacuna Fund and Canada’s International Development Research Centre (IDRC).

We welcome feedback on these guidelines. Please contact Leonida at leo@developlocal.org or Cecilia at cecilia@developlocal.org.
