Speech Recognition - End to End Models for Speech Recognition


Before E2E ASR

The main purpose of automatic speech recognition (ASR) is to transcribe audio into text.

The overall ASR process can be summarized as follows.

From the speaker's audio signal, the acoustic model (AM) gives the probability of each word, and the language model (LM) scores how plausible that word is in the given sentence.

Based on those probabilities, we can finally decode the signal into text.
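
As a rough illustration, the decoding step above can be sketched as picking the candidate whose combined AM and LM score is highest. The scores here are made-up numbers, and `decode` and `lm_weight` are hypothetical names for this sketch, not part of any library.

```python
import math

# Hypothetical acoustic-model scores: log P(audio | words). Made-up numbers.
am_log_prob = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}

# Hypothetical language-model scores: log P(words). Made-up numbers.
lm_log_prob = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}


def decode(candidates, lm_weight=1.0):
    """Return the candidate with the highest combined AM + LM log score."""
    return max(
        candidates,
        key=lambda words: am_log_prob[words] + lm_weight * lm_log_prob[words],
    )


best = decode(list(am_log_prob))
print(best)
```

In this toy setup the AM alone slightly prefers the wrong phrase, and the LM score is what tips the decision toward the more plausible sentence.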

 

Before E2E ASR, the full pipeline consisted of many separate modules: for example, extracting features from the audio, mapping those features to phonemes, and passing them through the AM, the LM, the pronunciation model (PM), and so on.

E2E ASR (ASR Encoder-Decoder Models)

With an E2E model, we simply feed audio features in as input and get text out of the model.

This is far more convenient than the older ASR approach, which integrates several complicated modules.

 

The encoder-decoder model is a more modern kind of ASR system.

Attention: we feed the sequence of feature vectors over time as the input and get the text back as a sequence of characters.

RNN-T: The main difference from attention is that the input is just one feature vector per time step, so we can get the text output in real time.
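
The difference in how the two approaches consume input can be sketched as below. This is only a toy illustration of the input/output pattern, not an implementation of either model: `toy_label` is a hypothetical stand-in for a trained network, and a real RNN-T may emit zero or several characters per frame rather than exactly one.

```python
from typing import Iterable, Iterator, List


def toy_label(frame: List[float]) -> str:
    """Hypothetical stand-in for a trained model: one frame -> one character."""
    best_index = max(range(len(frame)), key=frame.__getitem__)
    return "abc"[best_index]


def offline_decode(frames: List[List[float]]) -> str:
    """Attention-style usage: the whole utterance is available before decoding."""
    return "".join(toy_label(frame) for frame in frames)


def streaming_decode(frames: Iterable[List[float]]) -> Iterator[str]:
    """RNN-T-style usage: yield the partial text after every incoming frame."""
    text = ""
    for frame in frames:
        text += toy_label(frame)
        yield text


# Three made-up feature vectors, one per time step.
frames = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]

full_text = offline_decode(frames)
partial_texts = list(streaming_decode(iter(frames)))
print(full_text)
print(partial_texts)
```

The streaming version produces a usable partial result after each frame, which is what makes real-time output possible.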

 

 
