Promoting Discoverability and Reuse
Once your datasets are published in accessible and findable formats, actively promote them to increase their visibility, discoverability, and impact within the AI research community.
Build Community for Your Dataset
In the context of ML training datasets, the community comprises two groups of stakeholders.
The first group consists of the individuals and communities who are represented in or affected by the dataset. Engaging with data subjects ensures that the dataset is not only comprehensive, with its biases declared, but also respectful and considerate of the nuances of the represented languages, Indigenous knowledge, and cultures. Culturally sensitive datasets produced through this engagement are more relevant and applicable in context-aware AI applications.
The second group includes communities of researchers and developers with specific domain expertise, who can provide valuable insights into the terminology, use case development, and data gaps unique to their contexts.
Both sets of communities can provide feedback that iteratively improves the quality of the dataset and ensures the dataset remains relevant and robust.
Tip
This engagement can be in the form of consultations with native speakers, local linguists, language experts, and community members. This collaborative effort also creates a sense of ownership and representation among the communities involved.
Collaborations with initiatives like Common Voice and Masakhane can also broaden the dataset's scope and enhance its quality. These partnerships contribute significantly to creating a dataset that is not only linguistically rich but also domain-relevant.
The following are suggested strategies to engage the community in dataset development and improvement. Active engagement involves creating an ecosystem where stakeholders, including researchers, developers, end-users, and subject matter experts, actively participate and contribute.
- Hosting competitions to innovate/problem-solve with the dataset.
Competitions encourage data scientists and researchers to learn and apply their skills to real-world problems. They are also an effective way for dataset creators to increase the visibility and reuse of the dataset in new applications, promote specific use cases, and receive feedback and validation on the quality of the dataset from other researchers in the community. Popular competition platforms include Kaggle and Zindi, and CodaLab for academic researchers. Incentives may sometimes be required to increase engagement and participation by community researchers.
- Publishing the dataset in a data journal, conference or paper
Promoting datasets through academic conferences, workshops, and webinars can help reach a more targeted audience of researchers and practitioners in the field. Presenting datasets and their potential applications at these events can spark interest, encourage collaboration, and facilitate knowledge exchange among experts.
Consider writing and publishing a paper detailing your dataset, its development process, and the performance of test models trained on it. The paper can be submitted to conferences, to journals such as Data in Brief or the Journal of Open Research Software, or posted as a preprint on platforms like arXiv. Publications that focus on the dataset itself can help increase its visibility within the research community.
- Data citation and DOIs
Publishing on sites that support dataset indexing, for example through URIs and DOIs, makes datasets discoverable from research papers and on web pages that search engines can easily index and rank. Additionally, use appropriate keywords in metadata and datasheets for better visibility in search engines.
- Social media and online forums
Share and promote datasets through your social media platforms, mailing lists, and relevant online forums and discussion boards such as AI/ML subreddits or AI research groups on LinkedIn. These channels provide opportunities for community members to discuss dataset development, share ideas, and provide feedback.
- Dataset registries and catalogues
Contribute to existing dataset registries and catalogues, such as the Registry of Open Data on AWS or DataHub, to increase the visibility of your datasets and make them more discoverable.
- Partnerships with academic institutions and industry
Collaborate with academic institutions and industry partners to create, maintain, and share datasets, which can increase their visibility and discoverability among researchers in the field. These institutions can create research projects, internships, or thesis work that contributes to the dataset.
- Organize regular meetups and webinars
These topical meetups give you an opportunity to discuss the progress, challenges, and future directions of the dataset. Inviting expert speakers and running training sessions that use your dataset at these events will educate and engage your data science community.
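As an illustration of the data citation and metadata strategy above, the following is a minimal sketch of how a DOI and descriptive keywords might be recorded in machine-readable metadata. It assumes the schema.org Dataset vocabulary expressed as JSON-LD, which is one common way to describe datasets on web pages. The helper name build_dataset_jsonld, the dataset name, the DOI, the keywords, and the creator are all hypothetical placeholders, not values taken from this guide.

```python
import json


def build_dataset_jsonld(name, description, doi, keywords, license_url, creators):
    """Return a schema.org Dataset description as a JSON-LD dictionary.

    Hypothetical helper for illustration; adapt the fields to your repository.
    """
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # A resolvable DOI URL gives papers and web pages a stable citation target.
        "identifier": f"https://doi.org/{doi}",
        # Keywords help search engines match the dataset to relevant queries.
        "keywords": list(keywords),
        "license": license_url,
        "creator": [{"@type": "Person", "name": person} for person in creators],
    }


# All values below are placeholders, not a real dataset or DOI.
record = build_dataset_jsonld(
    name="Example Speech Corpus",
    description="Illustrative description of a speech dataset for ML training.",
    doi="10.1234/example.5678",
    keywords=["speech recognition", "low-resource languages", "machine learning"],
    license_url="https://creativecommons.org/licenses/by/4.0/",
    creators=["A. Researcher"],
)

# The JSON output can be embedded in the dataset landing page
# inside a <script type="application/ld+json"> element.
print(json.dumps(record, indent=2))
```

The same keywords can be reused in the datasheet so that the terms people search for appear consistently wherever the dataset is described.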