A Large-Scale Multilingual Dataset for Keyword Spotting


Millions of spoken keywords for developing keyword spotters!


Lack of data for KWS? No more!


Keyword spotting (KWS) has become a hot topic in speech processing, driven by the rise of commercial applications based on voice command detection, such as voice assistants. This project shows how anyone can reproduce SiDi KWS, a large-scale multilingual dataset of spoken keywords, by running open-source forced alignment tools on public transcribed speech datasets.

We are at INTERSPEECH 2022!

Get to know SiDi KWS!


Large-scale

More than 24.3 million labeled audio clips of spoken keywords!

Huge vocabulary

Around 700 thousand unique keywords.

Multilingual

Keywords in English, French, German and Spanish!

100% reproducible

Built from public transcribed speech datasets and an open-source aligner.

Ready to reproduce SiDi KWS?


Get the input datasets

Download LibriSpeech, Multilingual LibriSpeech (MLS), and Mozilla Common Voice.

Run the aligner

Run the Montreal Forced Aligner (MFA) to get word-level timestamps for the input transcripts.

Segment each word

Use any programming language to cut the input speech into one audio clip per word, using the timestamps from the aligner.

Voilà

Enjoy your new large-scale keyword spotting dataset!
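
The segmentation step can be sketched in a few lines of Python. This is not the code used to build SiDi KWS (which is not public), only a minimal illustration under stated assumptions: the aligner has written a long-format Praat TextGrid with a tier named "words", and the audio has already been converted to PCM WAV. The function names `read_word_intervals` and `cut_words` and the one-folder-per-keyword output layout are hypothetical choices made for this example.

```python
import re
import wave
from pathlib import Path

# One interval of a long-format TextGrid: xmin, xmax, then the quoted label.
INTERVAL_RE = re.compile(
    r'xmin\s*=\s*([\d.]+)\s+xmax\s*=\s*([\d.]+)\s+text\s*=\s*"([^"]*)"'
)


def read_word_intervals(textgrid_text, tier_name="words"):
    """Return (start_s, end_s, word) tuples from the named tier.

    Assumption: the tier is called "words". Empty labels (silences)
    are skipped.
    """
    # Split the file into one chunk per tier ("item [1]:", "item [2]:", ...).
    for tier in re.split(r"item\s*\[\d+\]\s*:", textgrid_text):
        if re.search(r'name\s*=\s*"%s"' % re.escape(tier_name), tier):
            return [
                (float(start), float(end), word)
                for start, end, word in INTERVAL_RE.findall(tier)
                if word.strip()
            ]
    return []


def cut_words(wav_path, intervals, out_dir):
    """Write one WAV clip per interval under out_dir/<word>/ and return the paths."""
    wav_path = Path(wav_path)
    out_dir = Path(out_dir)
    written = []
    with wave.open(str(wav_path), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        for index, (start, end, word) in enumerate(intervals):
            # Convert timestamps in seconds to frame offsets.
            first = int(round(start * rate))
            count = int(round(end * rate)) - first
            src.setpos(first)
            frames = src.readframes(count)

            # Hypothetical layout: one folder per keyword.
            clip_path = out_dir / word / f"{wav_path.stem}_{index:04d}.wav"
            clip_path.parent.mkdir(parents=True, exist_ok=True)
            with wave.open(str(clip_path), "wb") as dst:
                dst.setparams(params)  # frame count in the header is fixed on close
                dst.writeframes(frames)
            written.append(clip_path)
    return written
```

A typical call would read one TextGrid, pass its text to `read_word_intervals`, and hand the result to `cut_words` together with the matching WAV file. Grouping clips by keyword is only one possible layout; a flat directory plus a metadata file would work just as well.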

The minds behind this project


SiDi KWS has been developed at SiDi as an internal research project conducted by the following authors:

Acknowledgments


SiDi KWS has been financed by Samsung Eletrônica da Amazonia Ltda., under the auspices of the Brazilian Federal Law of Informatics nº. 8248/91. Due to corporate policy, neither the original SiDi KWS dataset nor the source code used to build it can be made public.