MediaZen-ETRI Develop 3-Channel Voice Spectrum... Breakthrough in AI Speech Recognition Performance Improvement

Published 21 Feb. 2023 10:21 (KST)

[Asia Economy Reporter Hyungsoo Park] MediaZen, a KOSDAQ-listed company, announced on the 21st that it has developed a three-channel (RGB) voice spectrum to complement the existing single-channel voice spectrum. The work was carried out through the on-site support program for research personnel run by the Electronics and Telecommunications Research Institute (ETRI).

Today's best-performing deep learning voice recognition systems are based on the Transformer algorithm, which has improved performance by processing large amounts of training data. With the emergence of ultra-large-scale training data, however, performance gains have reached saturation. Improving voice recognition further requires research not only on network architecture but also in other technical areas, such as new methods for extracting voice recognition features.

The most widely used voice recognition feature today is the "log Mel spectrum." However, it has the drawback of not reflecting the various processes by which voice signals are produced. Because of how deep learning networks learn, the input features should represent the distinctive elements of a voice separately, so that the network can learn from them more intelligently.
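For readers unfamiliar with the feature, the following is a minimal sketch of how a single-channel log Mel spectrum is commonly computed. It is not MediaZen's or ETRI's code; the function names, the 16 kHz sample rate, the 25 ms frame, the 10 ms hop, and the 40 Mel bands are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(hz):
    # Common (HTK-style) Hz-to-mel conversion.
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale.

    Returns an array of shape (n_mels, n_fft // 2 + 1).
    """
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_mel_spectrum(signal, sample_rate=16000, n_fft=400, hop=160, n_mels=40):
    """Return a (frames, n_mels) array: one channel, one value per band."""
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - n_fft) // hop
    # Slice the signal into overlapping frames and apply a Hann window.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    # Power spectrum per frame, pooled into mel bands, then log-compressed.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_energy + 1e-10)
```

The result is a two-dimensional, single-channel array in which each value mixes everything that contributed energy to a band, which is the limitation the article describes.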

Through the ETRI research personnel on-site support program, MediaZen used a voice production model to analyze formant filter information and signal information, then assigned them to separate channels to create a color spectrum with RGB components. Formant filter information is generally well suited to representing phonemes and is relatively robust to background noise. Signal information not only represents voice information but also effectively expresses the characteristics of the individual speaker’s voice. In the color spectrum, this information is analyzed, converted into features, and supplied to the deep learning network when the voice recognition system is trained, which helps the AI select the feature information it needs for voice recognition.

To verify this, the company ran experiments with the TensorFlow-based DeepSpeech2 voice recognition system and confirmed that ERR performance improved by more than 20% compared with the existing log Mel spectrum voice recognition system.
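The article does not disclose how the analysis is done or which quantity goes into which channel. The sketch below is therefore only an illustration of the general idea, under stated assumptions: it uses linear predictive coding (LPC), a standard source-filter analysis, to split each frame into a smooth spectral envelope (standing in for the formant filter information) and the remainder (standing in for the signal information), and it stacks these with the ordinary log spectrum as three channels. The function names, the LPC order of 12, and the channel assignment are hypothetical.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC (an assumed stand-in for the analysis).

    Returns (a, err): a[0] == 1 with A(z) = 1 + a1*z^-1 + ..., and err is
    the prediction-error energy used as the envelope gain.
    """
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1 : n + order]
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    # Toeplitz autocorrelation matrix, lightly regularized for stability.
    R = r[lags] + (1e-6 * r[0] + 1e-12) * np.eye(order)
    coeffs = np.linalg.solve(R, -r[1:])
    err = max(r[0] + coeffs @ r[1:], 1e-12)
    return np.concatenate(([1.0], coeffs)), err

def normalize(x):
    """Scale an array to the range [0, 1], like one image color plane."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo + 1e-12)

def color_spectrum(signal, n_fft=400, hop=160, order=12):
    """Return a (frames, n_fft // 2 + 1, 3) array of values in [0, 1].

    Hypothetical channel assignment:
      channel 0: log power spectrum of the frame
      channel 1: log LPC envelope (formant filter information)
      channel 2: log spectrum minus log envelope (remaining signal information)
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = 1 + (len(signal) - n_fft) // hop
    window = np.hamming(n_fft)
    eps = 1e-10
    spectrum, envelope = [], []
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + n_fft] * window
        a, err = lpc(frame, order)
        spectrum.append(np.abs(np.fft.rfft(frame)) ** 2)
        # All-pole model spectrum: err / |A(e^jw)|^2.
        envelope.append(err / (np.abs(np.fft.rfft(a, n_fft)) ** 2 + eps))
    log_spec = np.log(np.array(spectrum) + eps)
    log_env = np.log(np.array(envelope) + eps)
    log_exc = log_spec - log_env
    planes = [normalize(log_spec), normalize(log_env), normalize(log_exc)]
    return np.stack(planes, axis=-1)
```

Because the output has the same layout as an RGB image (height, width, three channels), it could in principle be fed to a network designed for image input, while a single-channel log Mel spectrum would occupy only one plane.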

MediaZen Executive Director Min-kyu Song explained, "The color spectrum developed through the ETRI research personnel on-site support program has a very wide range of applications, not only in voice recognition but also in TTS, speaker separation, emotion recognition, and all voice-based fields as well as audio-related applications." He added, "In terms of improving voice recognition performance, it is expected that adopting network architectures developed for image processing will enable the configuration of diverse and efficient voice recognition systems."