Before E2E ASR
ASR's main purpose is to transcribe audio into text.
The overall ASR process can be expressed as in the figure above.
From the speaker's audio signal we compute the probability of the words (acoustic model, AM), and then estimate how plausible each word is in the given sentence (language model, LM).
Based on those probabilities we can finally decode the signal into text.
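In equation form, this is the standard decomposition: the decoder searches for the word sequence W that best explains the audio features X, with the AM supplying P(X|W) and the LM supplying P(W) (by Bayes' rule, P(X) does not depend on W and can be dropped):

$$W^* = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)$$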
Before E2E ASR, the full process involved many separate modules: for example, extracting features from the audio, mapping those features to phonemes, and passing them through the AM, LM, pronunciation model (PM), and so on.
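As a tiny, runnable illustration of how the AM and LM scores interact during decoding (the hypotheses and log-probabilities below are invented for the example, not from a real system):

```python
# Toy example with made-up log-probabilities: the decoder combines the
# AM score P(X|W) with the LM score P(W) and keeps the best hypothesis.
candidates = {
    "recognize speech":   {"am_logp": -11.5, "lm_logp": -3.0},
    "wreck a nice beach": {"am_logp": -11.0, "lm_logp": -9.0},
}

lm_weight = 1.0   # real systems tune a weight between the two scores

best = max(
    candidates,
    key=lambda w: candidates[w]["am_logp"] + lm_weight * candidates[w]["lm_logp"],
)
print(best)  # "recognize speech": slightly worse acoustic score, far more plausible text
```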
E2E ASR (ASR Encoder-Decoder Models)
In an E2E model we simply feed audio features as input and get text output from the model.
It's far more convenient than the old ASR pipeline, which was an integration of several complicated modules.
The encoder-decoder model is the more modern style of ASR system.
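As a rough sketch of what "features in, text out" looks like in a single model, here is a minimal encoder-decoder with dot-product attention (the attention and RNN-T variants are described next; the sizes, names, and attention mechanism here are illustrative assumptions, not a specific published architecture):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Encodes the sequence of audio feature frames."""
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        enc, _ = self.rnn(feats)              # enc:   (batch, time, hidden)
        return enc

class TinyAttentionDecoder(nn.Module):
    """Emits character logits one step at a time, attending over encoder frames."""
    def __init__(self, vocab=30, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.cell = nn.LSTMCell(2 * hidden, hidden)   # input = [char emb; context]
        self.out = nn.Linear(hidden, vocab)

    def step(self, enc, prev_char, state):
        h, c = state
        # dot-product attention: compare decoder state h with every encoder frame
        scores = torch.bmm(enc, h.unsqueeze(2)).squeeze(2)         # (batch, time)
        weights = torch.softmax(scores, dim=1)                     # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)  # (batch, hidden)
        h, c = self.cell(torch.cat([self.embed(prev_char), context], dim=1), (h, c))
        return self.out(h), (h, c)                                 # char logits, new state

# Smoke test with random "audio features": 2 utterances, 50 frames, 80-dim features
enc = TinyEncoder()(torch.randn(2, 50, 80))
decoder = TinyAttentionDecoder()
state = (torch.zeros(2, 128), torch.zeros(2, 128))
logits, state = decoder.step(enc, torch.zeros(2, dtype=torch.long), state)
print(logits.shape)   # torch.Size([2, 30]) -> a distribution over characters
```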
Attention: we feed the whole sequence of feature vectors over time as input, and get the recognized text (character by character) as output.
RNN-T: the main difference from attention is that the input is consumed one feature vector per time step, so we can get text output in real time (see the sketch below).
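To make the input-handling difference concrete, here is a toy contrast using a plain LSTM as a stand-in encoder. It only shows the offline vs. frame-by-frame interface; a real RNN-T additionally has a prediction network and a joint network, which are omitted here:

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=80, hidden_size=128, batch_first=True)
feats = torch.randn(1, 50, 80)                  # one utterance, 50 feature frames

# Attention-style: decoding only starts after the whole utterance is encoded.
full_enc, _ = encoder(feats)                    # needs all 50 frames up front

# RNN-T-style interface: frames are fed one at a time, so partial text can be
# emitted while the person is still speaking.
state = None
for t in range(feats.size(1)):
    frame = feats[:, t:t + 1, :]                # a single feature frame
    out, state = encoder(frame, state)          # incremental update of the state
    # a transducer's joint network would emit label probabilities here
```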