arXiv:2412.17847v1
Abstract: Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities (popular text, speech, and video datasets), from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4,000 public datasets between 1990 and 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Second, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely-used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.
Analyzing the Attributes of AI Datasets: A Multi-disciplinary Perspective
In the field of AI, the availability of high-quality training data plays a crucial role in driving progress. However, there is little empirical analysis of the attributes of well-established datasets beyond text. A recent study addresses this gap with a comprehensive longitudinal audit across multiple modalities: text, speech, and video datasets. This article examines that study, highlighting its multi-disciplinary nature and discussing its implications and possible future directions.
Examining Sourcing Trends and Use Restrictions
The study's manual analysis covered nearly 4,000 public datasets from 1990 to 2024, encompassing 608 languages, 798 sources, 659 organizations, and 67 countries. One major finding is the overwhelming reliance of multimodal machine learning applications on web-crawled, synthetic, and social media sources, such as YouTube, for their training sets. These sources have eclipsed all others since 2019, raising questions about how their particular advantages and limitations shape the resulting AI models.
Furthermore, the researchers traced the chain of dataset derivations and discovered that while less than 33% of datasets carry restrictive licenses, over 80% of the source content in widely-used text, speech, and video datasets is subject to non-commercial restrictions. This mismatch raises important questions about the accessibility and fair use of such datasets and calls for a careful balance between copyright concerns and the need for open data resources.
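The gap between dataset-level licenses and source-level restrictions can be made concrete with a small provenance-tracing sketch: a dataset may itself carry a permissive license while deriving from sources with non-commercial terms further up its chain. The dataset names, license labels, and derivation edges below are hypothetical illustrations, not the paper's actual audit data or methodology.

```python
# Minimal sketch of propagating use restrictions along dataset derivation
# chains. All names, licenses, and edges here are hypothetical.

# Restrictiveness ordering: a higher value means more restrictive.
RESTRICTIVENESS = {"commercial-ok": 0, "non-commercial": 1}

# Each entry maps a dataset to (its own stated license, the sources it derives from).
PROVENANCE = {
    "web_crawl_source": ("non-commercial", []),
    "speech_corpus_a": ("commercial-ok", ["web_crawl_source"]),
    "video_corpus_b": ("commercial-ok", ["speech_corpus_a"]),
}

def effective_restriction(dataset, provenance):
    """Return the most restrictive license found anywhere up the derivation chain."""
    own_license, sources = provenance[dataset]
    restrictions = [own_license]
    for src in sources:
        restrictions.append(effective_restriction(src, provenance))
    return max(restrictions, key=lambda r: RESTRICTIVENESS[r])

# video_corpus_b is itself "commercial-ok", but inherits a non-commercial
# restriction from the web-crawled source two steps up its chain.
print(effective_restriction("video_corpus_b", PROVENANCE))
```

The point of the sketch is the audit's asymmetry: a practitioner checking only a dataset's own license tag would see a permissive term, while tracing provenance reveals the upstream restriction.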
Geographical and Linguistic Representation
Despite the increasing number of languages and geographies represented in public AI training datasets, the audit reveals that measures of relative geographical and multilingual representation have not significantly improved since 2013. This finding points to limitations in data collection efforts and raises concerns about biases in AI models and their applications. As AI systems are deployed globally, addressing this lack of representation becomes crucial to avoid perpetuating inequalities and cultural biases.
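To see why adding more languages or countries need not improve *relative* representation, consider a concentration measure such as a Gini coefficient over per-language dataset counts. This is an illustrative stand-in, not necessarily the paper's exact metric, and the counts below are made up: a corpus can span many languages while most of its data still comes from a handful of them.

```python
# Illustrative sketch: quantifying concentration of representation with a Gini
# coefficient over per-language dataset counts. 0 means perfectly even
# representation; values near 1 mean a few languages dominate.

def gini(counts):
    """Gini coefficient of a list of non-negative counts."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Standard rank-weighted formula: sum of (2i - n - 1) * x_i over n * total,
    # where i is the 1-based rank of x_i in sorted order.
    weighted = sum((2 * i - n - 1) * x for i, x in enumerate(values, start=1))
    return weighted / (n * total)

# Hypothetical datasets-per-language counts: one dominant language, long tail.
skewed = [900, 50, 20, 10, 5, 5, 5, 3, 1, 1]
even = [100] * 10

print(round(gini(skewed), 2), round(gini(even), 2))
```

Both lists cover ten languages, yet the skewed distribution scores near the top of the scale while the even one scores zero, which is how coverage can grow without relative representation improving.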
The Multi-disciplinary Nature of the Audit
The significance of this study lies in its multi-disciplinary nature, encompassing elements from computer science, linguistics, geography, and copyright law. By analyzing data sourcing, restrictions, and geographical representation at an ecosystem-level, the researchers provide a comprehensive perspective on the state of AI datasets. This approach allows for a more nuanced understanding of the challenges and potential improvements in dataset transparency and responsible use.
Implications and Future Directions
The findings of this audit highlight important considerations for the AI community. Firstly, the dominance of web-crawled and social media data sources raises questions about data quality, reliability, and potential biases. Future research should focus on understanding the impact of these sources on the performance and generalizability of AI models.
Secondly, the prevalence of non-commercial restrictions on widely-used datasets raises concerns about the accessibility and fairness of AI technology. Exploring alternative approaches to licensing and data sharing could promote more equitable access to training data, fostering a more inclusive AI ecosystem.
Lastly, the stagnation in geographical and linguistic representation in public AI training datasets calls for increased efforts to collect and incorporate data from underrepresented regions and languages. Collaborative initiatives and partnerships with organizations from diverse backgrounds can help address these gaps and ensure that AI technology caters to the needs and cultures of a global society.
In conclusion, this audit provides valuable insight into the attributes of AI datasets beyond text, covering sourcing trends, use restrictions, and geographical representation. The study's multi-disciplinary nature enhances our understanding of the challenges and opportunities in dataset transparency, responsible use, and the global impact of AI technology.