Speaker Identification in Reverberant Environments

This project was an exploration of Computational Auditory Scene Analysis (CASA), Blind Source Separation (BSS), Multi-Resolution Analysis, Speaker Identification, Neural Networks, and Speech Synthesis. It was carried out by a group of four students for a class in DSP for Music.

An elaborate simulation was conceived in an attempt to unify all the aforementioned fields: two speakers are present in an idealized, rectangular room, speaking into two mono microphones placed in the room. Each microphone captures a mixture of the voices plus reverberation, on which de-reverberation is performed using a cross wavelet transform. The resulting mixture is then fed to an Independent Component Analysis (ICA) algorithm, which attempts to separate the sound sources. The speakers were restricted to vowels, as this simplifies the classification in the next step. The separated sources are fed to an artificial neural network, which classifies both the speaker and the vowel being spoken. This result is then fed to a speech synthesis algorithm, which regenerates the vowels.
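To make the data flow concrete, here is a minimal sketch of the processing chain in Python. The function names are hypothetical stand-ins for the stages described above, not the project's actual code; each stage is expanded on in the results below.

```python
# Illustrative data flow only: every function name here is a hypothetical
# placeholder for the corresponding stage, not the project's actual code.

def dereverberate(mic_signal, fs):
    """Stage 1: suppress room reverberation (cross wavelet analysis)."""
    raise NotImplementedError

def separate_sources(channels):
    """Stage 2: blind source separation of the two-channel mixture (ICA)."""
    raise NotImplementedError

def classify_source(source, fs):
    """Stage 3: neural-network classification; returns (speaker_id, vowel)."""
    raise NotImplementedError

def synthesize_vowel(speaker_id, vowel, fs):
    """Stage 4: re-synthesize the recognized vowel."""
    raise NotImplementedError

def process(mic_left, mic_right, fs):
    dry = [dereverberate(m, fs) for m in (mic_left, mic_right)]
    sources = separate_sources(dry)
    labels = [classify_source(s, fs) for s in sources]
    return [synthesize_vowel(speaker, vowel, fs) for speaker, vowel in labels]
```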

As you’re reading this, a few flaws in the experiment design might jump out at you. First and most importantly, blind de-reverberation is not a trivial problem, and the rest of the experiment depended on its performance. Also, we restricted the speech to vowels, which violates some of the conditions that the ICA algorithm needs in order to work. There are probably more, but, importantly, we enjoyed exploring these fields of research and attempting to unify them.

Results in Brief:

For a full report, scroll to the bottom of the page.

  • We built an analog microphone with a pre-amplifier and an equalizer.
[Figure: Analog Microphone Circuit]

  • We reused a shoebox room simulation from some previous work and analyzed the binaural room impulse response with a cross wavelet transform using a Paul wavelet. This appears to localize some of the early reflections of the impulse response; a minimal sketch of the cross wavelet computation follows the figures below.
[Figure: Simulated Room Impulse Response]

[Figure: Cross Wavelet Transform (with Paul Wavelet)]
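The report does not include the analysis code, but the cross wavelet transform itself is straightforward to sketch. The snippet below follows Torrence and Compo's FFT-based formulation of the continuous wavelet transform with a Paul wavelet, and forms the cross transform as W_x · conj(W_y). The toy impulse responses, wavelet order, and scales are illustrative placeholders, not the simulated binaural responses used in the project.

```python
import numpy as np
from math import factorial

def paul_wavelet_ft(s_omega, m=4):
    """Fourier transform of the Paul wavelet of order m (Torrence & Compo, 1998)."""
    norm = 2.0 ** m / np.sqrt(m * factorial(2 * m - 1))
    psi = np.zeros_like(s_omega)
    pos = s_omega > 0                      # analytic wavelet: positive frequencies only
    psi[pos] = norm * s_omega[pos] ** m * np.exp(-s_omega[pos])
    return psi

def cwt_paul(x, dt, scales, m=4):
    """Continuous wavelet transform computed in the Fourier domain, one row per scale."""
    x_hat = np.fft.fft(x)
    omega = 2.0 * np.pi * np.fft.fftfreq(x.size, d=dt)
    coeffs = np.empty((len(scales), x.size), dtype=complex)
    for i, s in enumerate(scales):
        # sqrt(2*pi*s/dt) normalizes each daughter wavelet to unit energy
        daughter = np.sqrt(2.0 * np.pi * s / dt) * paul_wavelet_ft(s * omega, m)
        coeffs[i] = np.fft.ifft(x_hat * np.conj(daughter))
    return coeffs

def cross_wavelet(x, y, dt, scales, m=4):
    """Cross wavelet transform W_x * conj(W_y): common power and relative phase."""
    return cwt_paul(x, dt, scales, m) * np.conj(cwt_paul(y, dt, scales, m))

# Toy stand-in for the simulated binaural room impulse response (left/right channels).
fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)
rir_left = np.exp(-60.0 * t) * np.random.randn(t.size)
rir_right = np.exp(-60.0 * t) * np.random.randn(t.size)

scales = 2.0 ** np.arange(1, 8) / fs       # a handful of dyadic scales (illustrative)
xwt = cross_wavelet(rir_left, rir_right, 1.0 / fs, scales)
print(np.abs(xwt).shape)                   # (n_scales, n_samples) cross power map
```

Regions of concentrated cross power in time are what the figure above interprets as early reflections common to both channels.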

 

  • We mixed a pair of signals and separated them with an ICA algorithm; a minimal sketch of the separation step follows the figures below.
[Figure: Input Signals]

[Figure: Separated Signals]
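The report does not say which ICA implementation we used; as a stand-in, here is a minimal sketch using scikit-learn's FastICA on a toy instantaneous mixture of two synthetic signals. The signals and mixing matrix are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

fs = 16000
t = np.arange(0, 1.0, 1.0 / fs)

# Two toy periodic sources with different fundamentals, standing in for two speakers.
s1 = np.sign(np.sin(2 * np.pi * 110 * t))
s2 = np.sin(2 * np.pi * 220 * t) + 0.3 * np.sin(2 * np.pi * 440 * t)
S = np.c_[s1, s2]
S += 0.02 * np.random.randn(*S.shape)      # a little sensor noise
S /= S.std(axis=0)

# Instantaneous mixing: each microphone hears a weighted sum of both sources.
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T                                # shape (n_samples, n_mics)

# FastICA recovers the sources up to permutation and scaling.
ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)
print(S_hat.shape, ica.mixing_.shape)      # (n_samples, 2), (2, 2)
```

Note that FastICA of this form assumes an instantaneous mixture; a real reverberant room produces a convolutive mixture, which is part of why de-reverberation was attempted before separation.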

 

  • We trained a neural network on a small number of samples and attempted to classify the speaker and the vowel. We achieved 71.43% accuracy for speaker classification and about 50% for vowel classification (the small size of our database probably accounts for the low classification rates). A sketch of such a classifier is shown below.
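The full report has the details of the network and features; as a rough illustration only, here is a minimal sketch of a speaker/vowel classifier using scikit-learn's MLPClassifier. The feature matrix below is a random placeholder, and the feature extraction and label counts are assumptions; in the project the inputs were derived from the separated sources.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Placeholder data: one feature vector per vowel utterance. The feature
# extraction and label set here are assumptions, not the project's actual setup.
n_utterances, n_features = 70, 20
features = rng.normal(size=(n_utterances, n_features))
speaker_labels = rng.integers(0, 2, size=n_utterances)   # two speakers
vowel_labels = rng.integers(0, 5, size=n_utterances)     # five vowel classes

def fit_and_score(X, y):
    """Train a small MLP and report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    net = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
    net.fit(X_tr, y_tr)
    return net.score(X_te, y_te)

print("speaker accuracy:", fit_and_score(features, speaker_labels))
print("vowel accuracy:", fit_and_score(features, vowel_labels))
```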

Access a copy of the full report here.
