MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, has announced the general availability of the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). These trailblazing, permissively licensed datasets advance innovation in machine learning research and commercial applications. Further, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI.
The People’s Speech Dataset
The People’s Speech Dataset is among the world’s largest English speech recognition datasets licensed for academic and commercial usage. The 30,000-hour supervised conversational dataset is an order of magnitude larger than what was available just a few years ago. The dataset, released under a Creative Commons license, democratizes access to speech technology such as voice assistants and transcription, and unlocks innovation in the machine learning community. Contributors to the dataset include researchers from Baidu, Factored, Harvard University, Intel, Landing AI, and NVIDIA. It can be downloaded at mlcommons.org/speech.
Multilingual Spoken Words Corpus
Also available today is the Multilingual Spoken Words Corpus (MSWC), a rich audio speech dataset of more than 340,000 keywords and upwards of 23.4 million examples across 50 languages. Previous datasets relied on manual efforts to collect and validate thousands of utterances for each keyword and were commonly restricted to a single language. A diverse multilingual dataset that spans languages spoken by over five billion people, MSWC advances the research and development of applications such as voice interfaces for a broad global audience. Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. It can be downloaded at mlcommons.org/words.
DataPerf
The new DataPerf benchmark suite supports data-centric AI innovation by measuring the quality of datasets for common ML tasks and the impact of enhancing datasets. Training and test datasets are a key part of creating an ML system (the system can only be as good as the data), but much less effort is spent on understanding and improving datasets than on mastering and improving models. DataPerf fosters and measures progress in this vital area. The MLCommons Association will support a series of challenges with leaderboards in 2022 to encourage participation in DataPerf. Contributors to the suite include researchers from Alibaba, Coactive.AI, ETH Zurich, Google, Harvard University, Landing AI, Meta, Stanford University, and TU Eindhoven, drawing on the teams responsible for Cats4ML, the Data-Centric AI Competition, DCBench, Dynabench, and the MLPerf™ benchmarks.
Historically, most AI research has focused on improving model architectures and making them available to the community; in contrast, the engineering and maintenance of datasets has received less attention and is often manual and ad hoc. The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation.
“The machine learning model architecture for many applications is basically a solved problem. In many cases, focusing on engineering the data is more important for unlocking successful AI applications. Data is food for AI, and our systems need not just massive amounts of calories, but also high-quality nutrition. We need not just big data, but good data,” said Andrew Ng, founder and CEO of Landing AI, founding lead of Google Brain, co-founder and chairman of Coursera, and adjunct professor at Stanford University. “Thanks to the shared efforts by the community, including the work initiated by the MLCommons Association and its members, the movement demonstrates the potential for Data-Centric AI, and how we can collectively implement a greater AI adoption.”
“Speech technology can empower billions of people across the planet, but there’s a real need for large, open, and diverse datasets to catalyze innovation,” said David Kanter, the MLCommons Association co-founder and executive director. “The People’s Speech is a large-scale dataset in English while MSWC offers a tremendous breadth of languages. I’m excited for these datasets to improve everyday experiences like voice-enabled consumer devices and speech recognition.”