This project was an exploration into Computational Auditory Scene Analysis (CASA), Blind Source Separation (BSS), Multi-Resolution Analysis, Speaker Identification, Neural Networks, and Speech Synthesis. It was carried out by a group of four students for a class in DSP for Music.
An elaborate simulation was conceived in an attempt to unify all of these fields: two speakers are present in an idealized, rectangular room, speaking toward two mono microphones placed in the room. Each microphone captures a reverberant mixture of the two voices, on which de-reverberation is performed using a cross wavelet transform. The resulting mixtures are then fed to an Independent Component Analysis (ICA) algorithm, which attempts to separate the sound sources. The speech was restricted to vowels, as this simplifies the classification in the next step. The separated sources are fed to an Artificial Neural Network, which classifies both the speaker and the vowel being spoken, and the classification result is finally fed to a speech synthesis algorithm, which re-generates the vowels.
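To make the separation stage concrete, here is a minimal sketch of ICA separation using scikit-learn's FastICA. The harmonic test signals and the mixing matrix below are illustrative stand-ins, not the project's actual recordings or room mixing:

```python
import numpy as np
from sklearn.decomposition import FastICA

fs = 16000                      # sample rate in Hz (an assumption)
t = np.arange(fs) / fs          # one second of audio

# Two synthetic "vowel-like" sources: harmonic stacks at different pitches.
# These are placeholders for the recorded vowels, not the project's data.
s1 = sum(np.sin(2 * np.pi * 110 * k * t) / k for k in range(1, 6))
s2 = sum(np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 6))
S = np.c_[s1, s2]

# Instantaneous linear mixing, standing in for the two microphone captures.
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])
X = S @ A.T

# FastICA recovers the sources up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)
```

Note that FastICA assumes an instantaneous linear mixture with at most one Gaussian source; real room recordings are convolutive mixtures, which is one reason the de-reverberation stage had to come first.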
As you read this, a few flaws in the experiment design might jump out at you. First and most importantly, blind de-reverberation is not a trivial problem, and the rest of the experiment depended on its performance. Also, restricting the speech to sustained vowels produces quasi-periodic, spectrally similar sources, which can undermine the statistical independence assumptions that the ICA algorithm relies on. There are probably more, but importantly, we enjoyed exploring these fields of research and attempting to unify them.
Results in Brief:
For the full report, scroll to the bottom of the page.
- We built an analog microphone with a pre-amplifier and an equalizer.
- We used a shoebox room simulation from some previous work and analyzed the binaural room impulse response with a cross wavelet transform using a Paul wavelet (see the sketch after this list). This appears to localize some of the early reflections of the impulse response.
- We mixed several signals and separated them with an ICA algorithm, as in the FastICA sketch above.
- We trained a neural network on a small number of samples and attempted to classify the speaker and the vowel. We achieved 71.43% accuracy for speaker classification and about 50% for vowel classification (the small size of our database probably accounts for the low rate); a classifier sketch also appears below.
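The room-and-wavelet step from the list above could look roughly like the following sketch. It assumes the pyroomacoustics package for the shoebox image-source simulation and implements a small Paul-wavelet continuous wavelet transform directly, following the Torrence & Compo (1998) formulation; the room dimensions, absorption, wavelet order, and scales are illustrative guesses rather than the project's settings:

```python
import numpy as np
from math import factorial
import pyroomacoustics as pra  # assumed dependency for the shoebox simulation

fs = 16000
room = pra.ShoeBox([4.0, 6.0, 3.0], fs=fs,
                   materials=pra.Material(0.3),  # uniform energy absorption (a guess)
                   max_order=12)
room.add_source([1.0, 2.0, 1.5])
mic_positions = np.array([[2.5, 2.7],   # x coordinates of the two mics
                          [4.0, 4.0],   # y
                          [1.5, 1.5]])  # z
room.add_microphone_array(pra.MicrophoneArray(mic_positions, room.fs))
room.compute_rir()
h_left = np.asarray(room.rir[0][0])     # impulse response: source -> mic 0
h_right = np.asarray(room.rir[1][0])    # impulse response: source -> mic 1

def paul_cwt(x, dt, scales, m=4):
    """Continuous wavelet transform with an order-m Paul wavelet,
    computed in the Fourier domain (Torrence & Compo 1998)."""
    n = len(x)
    xf = np.fft.fft(x)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=dt)
    w = np.clip(omega, 0.0, None)       # the Paul wavelet is analytic:
                                        # support on positive frequencies only
    norm = 2**m / np.sqrt(m * factorial(2 * m - 1))
    W = np.empty((len(scales), n), dtype=complex)
    for i, s in enumerate(scales):
        psi_hat = norm * (s * w)**m * np.exp(-s * w) * (omega > 0)
        W[i] = np.fft.ifft(xf * psi_hat * np.sqrt(2 * np.pi * s / dt))
    return W

# Cross wavelet transform: cells of large |xwt| mark time-scale regions where
# both impulse responses share power, e.g. common early reflections.
dt = 1.0 / fs
scales = dt * 2.0**np.arange(1, 8)      # a small dyadic set of scales
n = min(len(h_left), len(h_right))
Wx = paul_cwt(h_left[:n], dt, scales)
Wy = paul_cwt(h_right[:n], dt, scales)
xwt = Wx * np.conj(Wy)
```

Peaks of |xwt| along the time axis correspond to reflections arriving coherently at both microphones, which matches the early-reflection localization described above.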
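For the classification step, the network and feature details are in the full report rather than here, so the following is only a hedged sketch: a small scikit-learn MLP trained on crude log-spectral features for speaker classification. The feature extraction, layer sizes, and placeholder data are all illustrative assumptions:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def spectral_features(x, n_bins=32):
    """Crude log-magnitude spectrum features, averaged into n_bins bands;
    a stand-in for whatever features (e.g. LPC or MFCC) were actually used."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
    edges = np.linspace(0, len(spec), n_bins + 1, dtype=int)
    return np.log1p([spec[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])

# Hypothetical dataset: `clips` would be fixed-length vowel recordings and
# `speaker_ids` the matching labels; random placeholders are used here.
rng = np.random.default_rng(0)
clips = rng.standard_normal((40, 4096))          # placeholder audio
speaker_ids = rng.integers(0, 2, size=40)        # placeholder labels

X = np.array([spectral_features(c) for c in clips])
X_tr, X_te, y_tr, y_te = train_test_split(X, speaker_ids, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
clf.fit(X_tr, y_tr)
print("speaker accuracy:", clf.score(X_te, y_te))
```

With real recordings in place of the placeholders, the same held-out-accuracy measurement would yield figures comparable to the 71.43% speaker and roughly 50% vowel rates quoted above.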
Access a copy of the full report here.