Google researchers have released a collection of more than 2 million labeled audio snippets designed to spark innovation in sound search.
Earlier this month the company published a paper titled "AudioSet: An ontology and human-labeled dataset for audio events" describing the collection, which it hopes will complement image recognition and strengthen search and identification capabilities across a wide variety of machine learning applications, including automatically generated video captions that describe sound effects. Google began work on the project last year.
To build AudioSet, Google drew on its YouTube platform, collecting 2 million ten-second YouTube excerpts (totaling 5.8 thousand hours of audio) labeled with more than 500 sound categories. The categories form a hierarchy: they start at broad levels such as Human Sounds and Music, then narrow to more specific classes such as Whistling and individual music genres.
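The dataset's basic unit is a labeled excerpt: a YouTube video ID, the start and end time of the ten-second window, and the confirmed category labels. The sketch below parses a small sample in that spirit; the rows and label IDs are illustrative stand-ins, not real dataset entries, and the exact column layout is an assumption for the example.

```python
import csv
import io

# Illustrative sample in a segments-list layout: YouTube ID, start/end
# time of the 10-second excerpt, and comma-separated label IDs.
# These rows are made up for demonstration, not real AudioSet entries.
SAMPLE = """\
# Segments file (illustrative rows only)
--aE2O5G5WE, 0.000, 10.000, "/m/03fwl,/m/0jbk"
--aO5cdqSAg, 30.000, 40.000, "/m/07rwm0c"
"""

def parse_segments(text):
    """Yield (ytid, start_sec, end_sec, labels) for each non-comment row."""
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row or row[0].startswith("#"):
            continue  # skip header/comment lines
        ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
        yield ytid, start, end, labels.split(",")

segments = list(parse_segments(SAMPLE))
for ytid, start, end, labels in segments:
    # Every excerpt spans exactly ten seconds
    print(ytid, end - start, labels)
```

Keeping each record to an ID plus a time window, rather than shipping audio, lets the dataset stay small while the clips themselves remain on YouTube.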
Dan Ellis, a Google research scientist, explains the collection process in a blog post: "We decided to use 10 second sound snippets as our unit; anything shorter becomes very difficult to identify in isolation. We collected candidate snippets for each of our classes by taking random excerpts from YouTube videos whose metadata indicated they might contain the sound in question (“Dogs Barking for 10 Hours”). Each snippet was presented to a human labeler with a small set of category names to be confirmed (“Do you hear a Bark?”). Subsequently, we proposed snippets whose content was similar to examples that had already been manually verified to contain the class, thereby finding examples that were not discoverable from the metadata."
Ellis adds: "By releasing AudioSet, we hope to provide a common, realistic-scale evaluation task for audio event detection, as well as a starting point for a comprehensive vocabulary of sound events."