In a paper published on the preprint server arXiv.org, researchers at Google and the University of Illinois propose mixture invariant training (MixIT), an unsupervised approach to separating, isolating, and enhancing the voices of multiple speakers in an audio recording that requires only single-channel (i.e., monaural) acoustic features. They claim it “significantly” improves speech separation performance by incorporating reverberant mixtures and a large amount of in-the-wild training data.
As the coauthors of the paper point out, audio perception is fraught with a fundamental problem: sounds are mixed together in a way that’s impossible to disentangle without knowledge of the sources’ characteristics. Attempts have been made to design algorithms capable of estimating each sound source from single-channel recordings, but most to date are supervised, meaning they train on audio mixtures created by adding isolated sounds together, with or without simulations of the environment. The result is that they fare poorly when there’s a mismatch in the distribution of sound types or in the presence of acoustic reverberation because (1) it’s tough to match the characteristics of a real corpus; (2) the room characteristics are sometimes unknown; (3) data of every source type in isolation might not be readily available; and (4) accurately simulating realistic acoustics is difficult.
MixIT aims to address these challenges by training on acoustic mixtures without reference sources. Training examples are constructed by mixing together existing audio mixtures, and the model separates each combined example into a number of estimated sources such that those sources can be remixed to approximate the original mixtures.
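The remixing idea can be illustrated with a short NumPy sketch. This is not the researchers' implementation: the function names `neg_snr` and `mixit_loss`, the plain negative signal-to-noise ratio used as the error measure, and the brute-force search over assignments are all assumptions made for illustration. The sketch takes two reference mixtures and a set of estimated sources, tries every way of assigning each source to one of the two mixtures, and keeps the assignment whose remix is closest to the originals.

```python
import itertools
import numpy as np


def neg_snr(ref, est, eps=1e-8):
    """Negative signal-to-noise ratio in dB (lower means a closer match)."""
    noise = ref - est
    return -10.0 * np.log10((np.sum(ref ** 2) + eps) / (np.sum(noise ** 2) + eps))


def mixit_loss(mixtures, est_sources):
    """Search all source-to-mixture assignments and return the best one.

    mixtures:    array of shape (N, T), the original mixtures (N = 2 here)
    est_sources: array of shape (M, T), sources estimated from their sum
    Returns (best_loss, best_assignment), where best_assignment[m] is the
    index of the mixture that estimated source m was assigned to.
    """
    num_mix, num_src = mixtures.shape[0], est_sources.shape[0]
    best_loss, best_assign = np.inf, None
    for assign in itertools.product(range(num_mix), repeat=num_src):
        # Binary mixing matrix: A[n, m] = 1 if source m goes to mixture n.
        A = np.zeros((num_mix, num_src))
        A[list(assign), np.arange(num_src)] = 1.0
        remixed = A @ est_sources  # shape (N, T)
        loss = sum(neg_snr(mixtures[n], remixed[n]) for n in range(num_mix))
        if loss < best_loss:
            best_loss, best_assign = loss, assign
    return best_loss, best_assign


# Toy check with a hypothetical "perfect" separator: four random sources,
# where sources 0 and 2 form the first mixture and 1 and 3 form the second.
rng = np.random.default_rng(0)
sources = rng.standard_normal((4, 1000))
mixtures = np.stack([sources[0] + sources[2], sources[1] + sources[3]])
loss, assignment = mixit_loss(mixtures, sources)
print(assignment)  # (0, 1, 0, 1)
```

In training, the estimated sources would come from a separation network fed the sum of the two mixtures, and the best-assignment loss would be backpropagated through it. The exhaustive search grows as 2^M in the number of output sources M, which stays manageable only when M is small.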
In experiments, MixIT was trained using four Google Cloud tensor processing units (TPUs) to tackle three tasks: speech separation, speech enhancement, and universal sound separation. For the first task, speech separation, the researchers drew on the open source WSJ0-2mix and Libri2Mix data sets to extract over 390 hours of recordings of male and female speakers, to which they added a reverberation effect before feeding a mixture of the two sets (3-second clips from WSJ0-2mix and 10-second clips from Libri2Mix) to the model. For the speech enhancement task, they collected non-speech sounds from FreeSound.org to test whether MixIT could be trained to remove noise from a mixture containing LibriSpeech voices. And for the universal sound separation task, they used the recently released Free Universal Sound Separation data set to train MixIT to separate arbitrary sounds from an acoustic mixture.
The researchers report that in universal sound separation and speech enhancement, unsupervised training offered less of an advantage over existing approaches, presumably because the test sets were “well-matched” to the supervised training domain. However, they claim that for universal sound separation, unsupervised training appeared to help slightly with generalization to the test set relative to supervised-only training; while it didn’t reach supervised levels, the coauthors claim MixIT’s no-supervision performance was “unprecedented.”
“MixIT opens new lines of research where massive amounts of previously untapped in-the-wild data can be leveraged to train sound separation systems,” the researchers wrote. “An ultimate goal is to evaluate separation on real mixture data; however, this remains challenging because of the lack of ground truth. As a proxy, future experiments may use recognition or human listening as a measure of separation, depending on the application.”