Machine learning and signal processing in speech communication

Whether we are talking on the phone in a car or on a train, using a digital assistant in the living room, or wearing a hearing aid, the recorded speech signals are almost always disturbed by ambient noise. Speech enhancement aims to reduce this noise and other recording-related distortions so that the signals can be presented with better quality and improved intelligibility.

Speech enhancement algorithms are often based on statistical estimation methods: the target signal and the interference are modeled using statistical distributions, and a cost function is then defined and optimized, either analytically or numerically. More recently, deep neural networks (DNNs) have also been used. To support real-time voice communication, the algorithms must not introduce a large delay between the disturbed input signal and the processed output signal. They must therefore deliver good results even with very short segment lengths (e.g., 20 ms), which is referred to as 'online processing'.
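As an illustration of this estimation approach, the following minimal sketch assumes that speech and noise are uncorrelated, zero-mean Gaussian variables with known variances, and that the cost function is the mean squared error. Under these assumptions the optimal linear gain has the well-known closed form of the Wiener filter, G = σ_s² / (σ_s² + σ_n²). The function names, the variance values, and the grid search are illustrative choices, not taken from the text. The sketch compares the analytical solution with a simple numerical optimization of the same cost function.

```python
import numpy as np

def wiener_gain(speech_var, noise_var):
    """Closed-form MMSE gain for uncorrelated zero-mean speech and noise."""
    speech_var = np.asarray(speech_var, dtype=float)
    noise_var = np.asarray(noise_var, dtype=float)
    return speech_var / (speech_var + noise_var)

def empirical_mse(gain, speech, noisy):
    """Mean squared error between the scaled noisy signal and the clean target."""
    return np.mean(np.abs(gain * noisy - speech) ** 2)

# Assumed (hypothetical) model parameters for the demonstration.
speech_var, noise_var = 4.0, 1.0
num_samples = 200_000

rng = np.random.default_rng(0)
speech = rng.normal(0.0, np.sqrt(speech_var), num_samples)
noise = rng.normal(0.0, np.sqrt(noise_var), num_samples)
noisy = speech + noise

# Analytical optimization of the cost function.
g_analytic = float(wiener_gain(speech_var, noise_var))

# Numerical optimization: evaluate the cost on a grid of candidate gains.
candidates = np.linspace(0.0, 1.0, 101)
costs = [empirical_mse(g, speech, noisy) for g in candidates]
g_numeric = float(candidates[int(np.argmin(costs))])

print(f"analytic gain: {g_analytic:.2f}, numeric gain: {g_numeric:.2f}")
```

In practice the variances are unknown and time-varying, so they have to be estimated from the noisy signal itself. That step is deliberately left out of this sketch.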

Methods and applications

Speech enhancement has been one of our research topics for several decades. There are many applications, including mobile voice communications, hearing aids, and human-machine interfaces, and there are many methods. We focus on noise reduction to improve listener comfort, reduce listening fatigue, and increase the intelligibility of the acoustic signal. We employ methods based on single-microphone signals as well as on multiple microphone signals (microphone arrays and beamforming). Developing speech enhancement methods requires a blend of physical modeling, statistical signal processing techniques, and deep learning.

Most of our enhancement techniques operate in the spectral domain. Typically, the noisy speech signal is segmented into short frames, transformed, enhanced, inverse transformed, and overlap-added to reconstruct the enhanced signal. The benefits of spectral processing are the concentration of speech energy in a few spectral parameters (especially for voiced speech), a simpler statistical description than in the time domain, and the possibility of applying psychoacoustic principles.
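The frame-wise processing chain described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the institute's implementation: it assumes a sampling rate of 16 kHz (so that a 20 ms frame has 320 samples), 50% overlap, and a square-root Hann window used for both analysis and synthesis. The function `enhance_overlap_add` and its `gain_fn` parameter are hypothetical names; `gain_fn` stands in for whatever spectral enhancement rule is applied.

```python
import numpy as np

def enhance_overlap_add(noisy, gain_fn, frame_len=320, hop=160):
    """Segment, transform, apply a spectral gain, inverse transform, overlap-add."""
    if frame_len != 2 * hop:
        raise ValueError("this sketch assumes 50% overlap (frame_len == 2 * hop)")

    # Square root of a periodic Hann window: applied at analysis and synthesis,
    # the squared windows of overlapping frames sum to one.
    n = np.arange(frame_len)
    window = np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * n / frame_len))

    # Zero-pad so that every input sample is covered by exactly two frames.
    num_frames = int(np.ceil(len(noisy) / hop)) + 1
    padded_len = (num_frames - 1) * hop + frame_len
    padded = np.zeros(padded_len)
    padded[hop:hop + len(noisy)] = noisy

    output = np.zeros(padded_len)
    for i in range(num_frames):
        start = i * hop
        frame = padded[start:start + frame_len] * window       # segment and window
        spectrum = np.fft.rfft(frame)                          # transform
        enhanced = gain_fn(spectrum) * spectrum                # enhance
        restored = np.fft.irfft(enhanced, n=frame_len)         # inverse transform
        output[start:start + frame_len] += restored * window   # overlap-add

    return output[hop:hop + len(noisy)]

# Demonstration signal: a 200 Hz tone in white noise (assumed test data).
fs = 16000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
noisy = np.sin(2.0 * np.pi * 200.0 * t) + 0.1 * rng.standard_normal(fs)

# With a gain of one in every frequency bin, the chain reconstructs the input.
identity_gain = lambda spectrum: np.ones(spectrum.shape)
reconstructed = enhance_overlap_add(noisy, identity_gain)
print(np.allclose(reconstructed, noisy))
```

An actual enhancement rule would replace `identity_gain` with a function that returns a gain between 0 and 1 per frequency bin, for example one derived from estimated speech and noise power. Because each output sample depends on a full frame of input, the frame length largely determines the algorithmic delay of such a scheme.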
