The global pandemic era brought rapid growth in the use of contactless technologies such as video conferencing solutions for corporate meetings, online lecture platforms, video interviews and counseling, and other non-face-to-face business applications. The most stressful part of using these video conferencing solutions, however, is the deterioration of sound quality caused by background noise, echo, howling, and other artifacts that distract participants.
An online video lecture service provider requested the development of sound quality enhancement technology based on deep neural networks to eliminate quality degradation factors such as echo, reverberation, howling, and normal or abnormal background noise under various conditions (places, platforms), including simultaneous multi-access and single-terminal usage. The requested deliverables were:
General-purpose voice quality enhancement algorithm
Voice quality enhancement algorithm based on a voice filter for shared terminals
Voice quality enhancement algorithm for high-performance terminals
Technological Challenges
1. AI processing speed for each voice quality degradation factor
Measure the AI processing delay for each voice quality degradation factor (ambient noise, acoustic echo, howling, etc.) on a single frame of audio analysis, and achieve a delay under 40 ms, as required by the real-time tracks of the AEC (Acoustic Echo Cancellation) and DNS (Deep Noise Suppression) Challenges.
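This latency budget can be checked with a simple per-frame timing harness. Below is a minimal sketch, assuming a 16 kHz sample rate, a 10 ms hop size, and a placeholder enhance_frame function standing in for the actual enhancement model; these names and parameters are illustrative, not the project's code.

```python
import time
import numpy as np

SAMPLE_RATE = 16000                      # assumed sample rate (Hz)
FRAME_MS = 10                            # assumed hop size per processed frame
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def enhance_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the denoising/AEC model; here it just returns the input."""
    return frame

def measure_latency(num_frames: int = 1000) -> float:
    """Return the average per-frame processing delay in milliseconds."""
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(num_frames):
        frame = rng.standard_normal(FRAME_LEN).astype(np.float32)
        start = time.perf_counter()
        enhance_frame(frame)
        total += (time.perf_counter() - start) * 1000.0
    return total / num_frames

if __name__ == "__main__":
    avg_ms = measure_latency()
    # The project target is to keep per-frame processing under 40 ms.
    print(f"average per-frame delay: {avg_ms:.2f} ms (target < 40 ms)")
```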
2. Score 4.0 or above in subjective sound quality evaluation
Build world-leading sound quality improvement technology by targeting a score above 3.52, the top score of Microsoft's Deep Noise Suppression Challenge at INTERSPEECH 2020.
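Subjective scores of this kind are mean opinion scores (MOS) on a 1 to 5 scale, averaged over listener ratings, as in the DNS Challenge's crowdsourced ITU-T P.808 evaluation. A minimal aggregation sketch with made-up ratings for illustration:

```python
from statistics import mean

def mos(ratings_per_clip: dict[str, list[int]]) -> float:
    """Average the per-clip means of 1-5 listener ratings into a single MOS."""
    clip_means = [mean(r) for r in ratings_per_clip.values()]
    return mean(clip_means)

# Hypothetical crowd-sourced ratings for two enhanced clips.
ratings = {
    "clip_001": [4, 5, 4, 4],
    "clip_002": [4, 4, 5, 3],
}
print(f"MOS = {mos(ratings):.2f}  (target: >= 4.0, DNS 2020 top score: 3.52)")
```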
3. Test set generation
Generate test data for environments with more than 20 open microphones: one person speaks while the other open microphones capture the speech mixed with noise of various types and strengths (e.g., noise synthesized at an intensity drawn from a uniform distribution of 0~25 relative to the average clean speech), and produce data for different open-microphone counts such as 20, 30, and 40.
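The noise-mixing step of such a test set can be sketched as clean speech plus noise scaled to a randomly drawn level. The example below assumes the 0~25 range is a signal-to-noise ratio in dB; that interpretation, and the synthetic signals used here, are assumptions for illustration only.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, clean.shape)          # loop/trim noise to match length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + scaled_noise

def make_test_item(clean: np.ndarray, noise: np.ndarray, rng: np.random.Generator):
    """Draw an SNR uniformly from 0-25 dB and return the noisy mixture plus its SNR."""
    snr_db = rng.uniform(0.0, 25.0)
    return mix_at_snr(clean, noise, snr_db), snr_db

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000).astype(np.float32)  # stand-in for 1 s of clean speech
    noise = rng.standard_normal(16000).astype(np.float32)  # stand-in for recorded noise
    noisy, snr = make_test_item(clean, noise, rng)
    print(f"generated mixture at {snr:.1f} dB SNR, {noisy.shape[0]} samples")
```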
Road Map
Create sound DB and echo / reverberation data generator
RNN-based speaker voice feature vector generation
Sound spectrogram filter modeling (these two steps are sketched after this road map)
Clustering
Development of an integrated echo / reverberation / howling noise elimination module
Verification and application
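The speaker feature vector and spectrogram filter steps above can be read as a VoiceFilter-style design: an RNN encodes a reference utterance of the target speaker into an embedding, which conditions a masking network applied to the noisy spectrogram. The PyTorch sketch below illustrates that structure only; layer sizes and the architecture are assumptions, not the delivered model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """LSTM over reference-utterance frames -> fixed-size speaker embedding."""
    def __init__(self, n_mels: int = 40, emb_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, batch_first=True)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:   # (B, T, n_mels)
        _, (h, _) = self.lstm(ref_mel)
        return F.normalize(h[-1], dim=-1)                        # (B, emb_dim)

class SpectrogramFilter(nn.Module):
    """BLSTM mask estimator conditioned on the speaker embedding."""
    def __init__(self, n_freq: int = 257, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_spec: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # Broadcast the speaker embedding over time and concatenate with each frame.
        emb = spk_emb.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
        x, _ = self.lstm(torch.cat([noisy_spec, emb], dim=-1))
        mask = torch.sigmoid(self.out(x))                        # per-bin soft mask in [0, 1]
        return mask * noisy_spec                                 # enhanced magnitude spectrogram

if __name__ == "__main__":
    enc, filt = SpeakerEncoder(), SpectrogramFilter()
    ref = torch.randn(2, 100, 40)             # reference mel frames of the target speaker
    noisy = torch.randn(2, 200, 257).abs()    # noisy magnitude spectrogram
    enhanced = filt(noisy, enc(ref))
    print(enhanced.shape)                     # torch.Size([2, 200, 257])
```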
Key Features
1. Noise suppression in sound input during multi-party video conferences
2. Noise suppression in sound input during multi-party video conferences where multiple users participate through one microphone
3. Noise suppression in sound input during multi-party video conferences on high-performance smartphones
The Result
AI processing time for each voice degradation factor: 40 ms
Subjective quality evaluation score for voice enhancement: 4.0