
[Paper Review] Conformer: Convolution-augmented Transformer for Speech Recognition


Paper: Conformer: Convolution-augmented Transformer for Speech Recognition (arxiv.org)


0. Abstract

Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively.

In this work, we achieve the best of both worlds by studying how to combine convolution neural networks and transformers to model both local and global dependencies of an audio sequence in a parameter-efficient way.

1. Introduction

The Transformer architecture based on self-attention has enjoyed widespread adoption for modeling sequences due to its ability to capture long-distance interactions and its high training efficiency.

  • However, it is less capable of extracting fine-grained local feature patterns

Alternatively, convolutions have also been successful for ASR, which capture local context progressively via a local receptive field layer by layer.

  • One limitation of local connectivity is that many more layers or parameters are needed to capture global information

We introduce Conformer, a novel combination of self-attention and convolution.

 

** equivariance in CNN 

  • CNNs learn shared position-based kernels over a local window which maintain translation equivariance
  • This property ensures that if the input is transformed (e.g., shifted), the output transforms in a predictable way
  • f(T(x)) = T(f(x)) : f is equivariant with respect to a transformation T
  • No matter where a dog appears in a photo, the same kernel captures the dog's features. In other words, a CNN is equivariant with respect to position (see the sketch after this list)
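A minimal PyTorch sketch (not from the paper) that checks this property numerically: shifting the input and then convolving gives the same result as convolving and then shifting, away from the padded borders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv1d(in_channels=1, out_channels=4, kernel_size=3, padding=1, bias=False)

x = torch.randn(1, 1, 32)                                  # [batch, channels, length]
shift = 5
x_shifted = torch.roll(x, shifts=shift, dims=-1)

y_then_shift = torch.roll(conv(x), shifts=shift, dims=-1)  # T(f(x))
shift_then_y = conv(x_shifted)                             # f(T(x))

# Equal away from the borders, where the circular shift and zero padding interact.
print(torch.allclose(y_then_shift[..., shift + 1:-shift - 1],
                     shift_then_y[..., shift + 1:-shift - 1]))   # True
```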

** equivariance in self-attention 

  • The paper augments self-attention with relative position-based information, which maintains equivariance
  • For relative positional encoding, this means that the relationships between tokens remain consistent even if their absolute positions change
  • The relationship between "dog" and "bone" in "The dog chewed the bone" should be similar to their relationship in "Yesterday, the dog chewed the bone", despite their absolute positions changing (see the illustration after this list)
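A tiny illustration of the core idea (not the Transformer-XL implementation used in the paper): relative attention indexes biases by the offset between positions, and the offset matrix is unchanged when every absolute position is shifted.

```python
import torch

def relative_offsets(positions):
    # offsets[i, j] = positions[j] - positions[i]
    return positions[None, :] - positions[:, None]

pos = torch.arange(6)          # absolute positions 0..5
pos_shifted = pos + 3          # the same tokens, appearing 3 steps later

print(torch.equal(relative_offsets(pos), relative_offsets(pos_shifted)))  # True
```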

2. Conformer Encoder

The encoder first processes the input with a convolution subsampling layer and then with a number of Conformer blocks.

  • SpecAug : say we have a [500, 80] tensor, i.e., 500 time frames and 80 frequency bins
  • Convolution Subsampling : [125, 80]. This reduces the sequence length; let's assume it reduces by a factor of 4.
  • Linear : [125, D], where D is the encoder dimension (let's say 512); a shape-level sketch follows this list
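A shape-level sketch of this front end, assuming PyTorch and illustrative layer sizes (the exact subsampling layer in the paper's implementation may differ):

```python
import torch
import torch.nn as nn

T, F, D = 500, 80, 512                     # time frames, mel bins, encoder dimension

subsample = nn.Sequential(                 # two stride-2 convs -> time reduced by ~4x
    nn.Conv2d(1, D, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(D, D, kernel_size=3, stride=2), nn.ReLU(),
)

x = torch.randn(1, 1, T, F)                # [batch, 1, time, freq] after SpecAugment
h = subsample(x)                           # [1, D, ~T/4, ~F/4]
b, c, t, f = h.shape
h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)   # flatten channels x freq per frame
proj = nn.Linear(c * f, D)
out = proj(h)                              # [1, ~125, 512]
print(out.shape)
```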

2.1 Feed Forward Module

Two linear layers with a nonlinear activation (Swish) in between; a residual connection is added over the module.

  • input : [125, 512]
  • output : [125, 512] (same shape, but transformed features); see the sketch after this list
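A minimal sketch of this module, assuming PyTorch, an expansion factor of 4, and Swish (SiLU) activation; the half-step residual is applied at the block level in section 2.4.

```python
import torch
import torch.nn as nn

class FeedForwardModule(nn.Module):
    def __init__(self, d_model=512, expansion=4, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),                      # pre-norm
            nn.Linear(d_model, expansion * d_model),
            nn.SiLU(),                                  # Swish
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):                               # [batch, time, d_model]
        return self.net(x)                              # residual is added in the block

x = torch.randn(1, 125, 512)
print(FeedForwardModule()(x).shape)                     # torch.Size([1, 125, 512])
```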

2.2 Multi-Headed Self-Attention Module

The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in the utterance length.

Pre-norm residual units with dropout are used, which helps in training and regularizing deeper models.

  • input : [125, 512], query(Q), key(K), value(V)
  • linear projection for each head : [125, d_k], repeat from i=1 to num_heads
  • attention : [125, d_k], head_i's shape
  • concat heads : [125, num_heads * d_k]
  • final linear projection : [125, 512]
  • output : [125, 512] (same shape, but with global context); see the sketch after this list
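The same shape flow written out as a minimal PyTorch sketch; the relative positional encoding and dropout are omitted here to keep the example short.

```python
import torch
import torch.nn as nn

T, d_model, num_heads = 125, 512, 8
d_k = d_model // num_heads                       # 64

x = torch.randn(1, T, d_model)                   # input: [1, 125, 512]
w_q, w_k, w_v = (nn.Linear(d_model, d_model) for _ in range(3))
w_out = nn.Linear(d_model, d_model)

def split_heads(t):                              # [1, T, d_model] -> [1, heads, T, d_k]
    return t.view(1, T, num_heads, d_k).transpose(1, 2)

q, k, v = split_heads(w_q(x)), split_heads(w_k(x)), split_heads(w_v(x))
scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # [1, heads, T, T]
heads = torch.softmax(scores, dim=-1) @ v        # [1, heads, T, d_k]
concat = heads.transpose(1, 2).reshape(1, T, num_heads * d_k)  # [1, T, 512]
out = w_out(concat)                              # [1, 125, 512], now with global context
print(out.shape)
```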

 

2.3 Convolution Module

  • Pointwise Conv : 1D convolution with kernel size 1; the first pointwise conv expands the channel dimension by a factor of 2 for the GLU

  • GLU : e.g., [125, 2 * 512] -> [125, 512]; split x into two halves along the last dimension, then compute x[:, :, :d_model] * sigmoid(x[:, :, d_model:])
  • Depthwise Conv : 1D convolution with groups = num_channels (a sketch of the full module follows this list)
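Putting the pieces together, a minimal PyTorch sketch of the convolution module. Note the paper uses kernel size 32; 31 is used here so symmetric padding keeps the sequence length unchanged.

```python
import torch
import torch.nn as nn

class ConvolutionModule(nn.Module):
    def __init__(self, d_model=512, kernel_size=31, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)                             # [B, 2C, T] -> [B, C, T]
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.swish = nn.SiLU()
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                    # [B, T, d_model]
        y = self.norm(x).transpose(1, 2)                     # -> [B, d_model, T]
        y = self.glu(self.pointwise1(y))                     # expand 2x, then gate
        y = self.swish(self.bn(self.depthwise(y)))
        y = self.dropout(self.pointwise2(y))
        return y.transpose(1, 2)                             # residual is added in the block

x = torch.randn(1, 125, 512)
print(ConvolutionModule()(x).shape)                          # torch.Size([1, 125, 512])
```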

2.4 Conformer Block

Having two Macaron-Net style feed-forward layers with half-step residual connections sandwiching the attention and convolution modules provides a significant improvement over having a single feed-forward module in the Conformer architecture.

We found that the convolution module stacked after the self-attention module works best for speech recognition; a sketch of the full block follows.
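A minimal sketch of how the block wires these modules together, reusing the FeedForwardModule and ConvolutionModule sketches above and substituting PyTorch's built-in nn.MultiheadAttention (absolute positions, no relative encoding) for brevity.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model=512, num_heads=8, dropout=0.1):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model)
        self.attn_norm = nn.LayerNorm(d_model)               # pre-norm for MHSA
        self.attn = nn.MultiheadAttention(d_model, num_heads,
                                          dropout=dropout, batch_first=True)
        self.conv = ConvolutionModule(d_model)
        self.ffn2 = FeedForwardModule(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                                    # [batch, time, d_model]
        x = x + 0.5 * self.ffn1(x)                           # half-step residual
        h = self.attn_norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)                           # half-step residual
        return self.final_norm(x)

x = torch.randn(1, 125, 512)
print(ConformerBlock()(x).shape)                             # torch.Size([1, 125, 512])
```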

 

3. Experiments

Tested on the LibriSpeech dataset.

A single-LSTM-layer decoder is used in all models.

A 3-layer LSTM language model (LM) trained on the LibriSpeech language model corpus is used.

4. Ablation Studies

The convolution module is the most important component.

A Macaron-style FFN pair is more effective than a single FFN with the same number of parameters.

(4) Splitting the input into parallel branches, one using MHSA and the other a convolution module, and concatenating their outputs performs worse than the sequential arrangement.

Using lightweight convolution instead of depthwise convolution shows a significant drop in performance.

Placing the convolution module after the self-attention module shows better results.

(2) Using full-step residuals instead of half-step (1/2) residuals also degrades performance.

Increasing the number of attention heads up to 16 improves accuracy, especially on the dev-other dataset.

Kernel size 32 is found to perform better than the other kernel sizes tried.
