The global pandemic era brought rapid growth in the use of contactless technologies such as video conferencing solutions for corporate meetings, online lecture platforms, video interviews and counseling, and other non-face-to-face business applications. The most stressful part of using these video conferencing solutions, however, is the deterioration of sound quality caused by background noise, echo, howling, and other artifacts that distract participants.
An online video lecture service provider requested the development of sound quality enhancement technology based on deep neural networks to eliminate quality degradation factors such as echo, reverberation, howling, and normal or abnormal background noise under various conditions (places, platforms), including simultaneous multi-access and single-terminal usage. The requested deliverables were:
General-purpose voice quality enhancement algorithm
Voice quality enhancement algorithm based on a voice filter for shared terminals
Voice quality enhancement algorithm for high-performance terminals
Technological Challenges
1. AI processing speed for each voice quality degradation factor
Measure the AI processing delay for each voice quality degradation factor (ambient noise, acoustic echo, howling, etc.) on a single frame of audio analysis, and achieve a delay under 40 ms, as required by the real-time tracks of the AEC (Acoustic Echo Cancellation) and DNS (Deep Noise Suppression) Challenges.
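This latency budget can be checked with a simple per-frame timing harness. Below is a minimal sketch, assuming a 16 kHz sample rate, a 10 ms hop size, and a placeholder enhance_frame function standing in for the actual enhancement model; these names and parameters are illustrative, not the project's code.

```python
import time
import numpy as np

SAMPLE_RATE = 16000                      # assumed sample rate (Hz)
FRAME_MS = 10                            # assumed hop size per processed frame
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000

def enhance_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the denoising/AEC model; here it just returns the input."""
    return frame

def measure_latency(num_frames: int = 1000) -> float:
    """Return the average per-frame processing delay in milliseconds."""
    rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(num_frames):
        frame = rng.standard_normal(FRAME_LEN).astype(np.float32)
        start = time.perf_counter()
        enhance_frame(frame)
        total += (time.perf_counter() - start) * 1000.0
    return total / num_frames

if __name__ == "__main__":
    avg_ms = measure_latency()
    # The project target is to keep per-frame processing under 40 ms.
    print(f"average per-frame delay: {avg_ms:.2f} ms (target < 40 ms)")
```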
2. Score 4.0 or above in subjective sound quality evaluation
Build world-leading sound quality improvement technology by targeting a score above 3.52, the top score of Microsoft's Deep Noise Suppression Challenge at INTERSPEECH 2020.
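Subjective scores of this kind are mean opinion scores (MOS) on a 1 to 5 scale, averaged over listener ratings, as in the DNS Challenge's crowdsourced ITU-T P.808 evaluation. A minimal aggregation sketch with made-up ratings for illustration:

```python
from statistics import mean

def mos(ratings_per_clip: dict[str, list[int]]) -> float:
    """Average the per-clip means of 1-5 listener ratings into a single MOS."""
    clip_means = [mean(r) for r in ratings_per_clip.values()]
    return mean(clip_means)

# Hypothetical crowd-sourced ratings for two enhanced clips.
ratings = {
    "clip_001": [4, 5, 4, 4],
    "clip_002": [4, 4, 5, 3],
}
print(f"MOS = {mos(ratings):.2f}  (target: >= 4.0, DNS 2020 top score: 3.52)")
```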
3. Test set generation
Generate test data for environments with more than 20 open microphones: one person speaks while the other open microphones capture the speech mixed with noise of various types and strengths (e.g., noise synthesized at an intensity drawn from a uniform distribution of 0~25 relative to the average clean speech), and produce data for different open-microphone counts such as 20, 30, and 40.
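The noise-mixing step of such a test set can be sketched as clean speech plus noise scaled to a randomly drawn level. The example below assumes the 0~25 range is a signal-to-noise ratio in dB; that interpretation, and the synthetic signals used here, are assumptions for illustration only.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean/noise power ratio equals `snr_db`, then mix."""
    noise = np.resize(noise, clean.shape)          # loop/trim noise to match length
    clean_power = np.mean(clean ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = clean_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return clean + scaled_noise

def make_test_item(clean: np.ndarray, noise: np.ndarray, rng: np.random.Generator):
    """Draw an SNR uniformly from 0-25 dB and return the noisy mixture plus its SNR."""
    snr_db = rng.uniform(0.0, 25.0)
    return mix_at_snr(clean, noise, snr_db), snr_db

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000).astype(np.float32)  # stand-in for 1 s of clean speech
    noise = rng.standard_normal(16000).astype(np.float32)  # stand-in for recorded noise
    noisy, snr = make_test_item(clean, noise, rng)
    print(f"generated mixture at {snr:.1f} dB SNR, {noisy.shape[0]} samples")
```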
Road Map
Create sound DB and echo / reverberation data generator
RNN-based speaker voice feature vector generation
Sound spectrogram filter modeling (these two steps are sketched after this road map)
Clustering
Development of an integrated echo / reverberation / howling noise elimination module
Verification and application
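The speaker feature vector and spectrogram filter steps above can be read as a VoiceFilter-style design: an RNN encodes a reference utterance of the target speaker into an embedding, which conditions a masking network applied to the noisy spectrogram. The PyTorch sketch below illustrates that structure only; layer sizes and the architecture are assumptions, not the delivered model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoder(nn.Module):
    """LSTM over reference-utterance frames -> fixed-size speaker embedding."""
    def __init__(self, n_mels: int = 40, emb_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, emb_dim, batch_first=True)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:   # (B, T, n_mels)
        _, (h, _) = self.lstm(ref_mel)
        return F.normalize(h[-1], dim=-1)                        # (B, emb_dim)

class SpectrogramFilter(nn.Module):
    """BLSTM mask estimator conditioned on the speaker embedding."""
    def __init__(self, n_freq: int = 257, emb_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + emb_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, noisy_spec: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # Broadcast the speaker embedding over time and concatenate with each frame.
        emb = spk_emb.unsqueeze(1).expand(-1, noisy_spec.size(1), -1)
        x, _ = self.lstm(torch.cat([noisy_spec, emb], dim=-1))
        mask = torch.sigmoid(self.out(x))                        # per-bin soft mask in [0, 1]
        return mask * noisy_spec                                 # enhanced magnitude spectrogram

if __name__ == "__main__":
    enc, filt = SpeakerEncoder(), SpectrogramFilter()
    ref = torch.randn(2, 100, 40)             # reference mel frames of the target speaker
    noisy = torch.randn(2, 200, 257).abs()    # noisy magnitude spectrogram
    enhanced = filt(noisy, enc(ref))
    print(enhanced.shape)                     # torch.Size([2, 200, 257])
```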
Key Features
1. Noise suppression in sound input during multi-party video conferences
2. Noise suppression in sound input during multi-party video conferences where multiple users participate through one microphone
3. Noise suppression in sound input during multi-party video conferences on high-performance smartphones
The Result
AI processing time for each voice degradation factor: 40 ms
Subjective quality evaluation score for voice enhancement: 4.0