Before E2E ASR
ASR's main purpose is to transcribe audio into text.
The overall ASR process can be expressed as in the figure above.
From the speaker's audio signal we compute the probability of the words (acoustic model, AM), and then estimate how plausible each word is in the given sentence (language model, LM).
Based on those probabilities we can finally decode the signal into text.
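In equation form, this is the standard decomposition: the decoder searches for the word sequence W that best explains the audio features X, with the AM supplying P(X|W) and the LM supplying P(W) (by Bayes' rule, P(X) does not depend on W and can be dropped):

$$W^* = \arg\max_W P(W \mid X) = \arg\max_W P(X \mid W)\, P(W)$$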
Before E2E ASR, the full process involved many separate modules: for example, extracting features from the audio, mapping those features to phonemes, and passing them through the AM, LM, pronunciation model (PM), and so on.
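As a tiny, runnable illustration of how the AM and LM scores interact during decoding (the hypotheses and log-probabilities below are invented for the example, not from a real system):

```python
# Toy example with made-up log-probabilities: the decoder combines the
# AM score P(X|W) with the LM score P(W) and keeps the best hypothesis.
candidates = {
    "recognize speech":   {"am_logp": -11.5, "lm_logp": -3.0},
    "wreck a nice beach": {"am_logp": -11.0, "lm_logp": -9.0},
}

lm_weight = 1.0   # real systems tune a weight between the two scores

best = max(
    candidates,
    key=lambda w: candidates[w]["am_logp"] + lm_weight * candidates[w]["lm_logp"],
)
print(best)  # "recognize speech": slightly worse acoustic score, far more plausible text
```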
E2E ASR (ASR Encoder-Decoder Models)
In an E2E model we simply feed audio features as input and get text output from the model.
It's far more convenient than the old ASR pipeline, which was an integration of several complicated modules.
The encoder-decoder model is the more modern style of ASR system.
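As a rough sketch of what "features in, text out" looks like in a single model, here is a minimal encoder-decoder with dot-product attention (the attention and RNN-T variants are described next; the sizes, names, and attention mechanism here are illustrative assumptions, not a specific published architecture):

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Encodes the sequence of audio feature frames."""
    def __init__(self, feat_dim=80, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, feats):                 # feats: (batch, time, feat_dim)
        enc, _ = self.rnn(feats)              # enc:   (batch, time, hidden)
        return enc

class TinyAttentionDecoder(nn.Module):
    """Emits character logits one step at a time, attending over encoder frames."""
    def __init__(self, vocab=30, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.cell = nn.LSTMCell(2 * hidden, hidden)   # input = [char emb; context]
        self.out = nn.Linear(hidden, vocab)

    def step(self, enc, prev_char, state):
        h, c = state
        # dot-product attention: compare decoder state h with every encoder frame
        scores = torch.bmm(enc, h.unsqueeze(2)).squeeze(2)         # (batch, time)
        weights = torch.softmax(scores, dim=1)                     # attention weights
        context = torch.bmm(weights.unsqueeze(1), enc).squeeze(1)  # (batch, hidden)
        h, c = self.cell(torch.cat([self.embed(prev_char), context], dim=1), (h, c))
        return self.out(h), (h, c)                                 # char logits, new state

# Smoke test with random "audio features": 2 utterances, 50 frames, 80-dim features
enc = TinyEncoder()(torch.randn(2, 50, 80))
decoder = TinyAttentionDecoder()
state = (torch.zeros(2, 128), torch.zeros(2, 128))
logits, state = decoder.step(enc, torch.zeros(2, dtype=torch.long), state)
print(logits.shape)   # torch.Size([2, 30]) -> a distribution over characters
```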
Attention: we feed the whole sequence of feature vectors over time as input, and get the recognized text (character by character) as output.
RNN-T: the main difference from attention is that the input is consumed one feature vector per time step, so we can get text output in real time (see the sketch below).
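To make the input-handling difference concrete, here is a toy contrast using a plain LSTM as a stand-in encoder. It only shows the offline vs. frame-by-frame interface; a real RNN-T additionally has a prediction network and a joint network, which are omitted here:

```python
import torch
import torch.nn as nn

encoder = nn.LSTM(input_size=80, hidden_size=128, batch_first=True)
feats = torch.randn(1, 50, 80)                  # one utterance, 50 feature frames

# Attention-style: decoding only starts after the whole utterance is encoded.
full_enc, _ = encoder(feats)                    # needs all 50 frames up front

# RNN-T-style interface: frames are fed one at a time, so partial text can be
# emitted while the person is still speaking.
state = None
for t in range(feats.size(1)):
    frame = feats[:, t:t + 1, :]                # a single feature frame
    out, state = encoder(frame, state)          # incremental update of the state
    # a transducer's joint network would emit label probabilities here
```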