0. Abstract
Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively.
In this work, we achieve the best of both worlds by studying how to combine convolutional neural networks and transformers to model both the local and global dependencies of an audio sequence in a parameter-efficient way.
1. Introduction
The Transformer architecture based on self-attention has enjoyed widespread adoption for modeling sequences due to its ability to capture long-distance interactions and its high training efficiency.
- less capable of extracting fine-grained local feature patterns
Alternatively, convolutions have also been successful for ASR; they capture local context progressively via local receptive fields, layer by layer.
- One limitation of local connectivity is that many more layers or parameters are needed to capture global information
We introduce Conformer, a novel combination of self-attention and convolution.
** equivariance in CNN
- CNNs learn shared position-based kernels over a local window which maintain translation equivariance
- This property ensures that if the input is transformed (e.g., shifted), the output transforms in a predictable way
- f(T(x)) = T(f(x)) : f is equivariant with respect to a transformation T
- No matter where the dog appears in the photo, the same kernel is applied, so the dog's features are still captured; in other words, CNNs are equivariant with respect to position
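As a quick sanity check of f(T(x)) = T(f(x)), the sketch below shifts a signal in time and verifies that a 1D convolution commutes with the shift. Circular padding is used so the comparison holds exactly at the boundaries; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Circular padding makes the shift/convolution comparison exact everywhere
# (with zero padding the property only holds away from the boundaries).
conv = nn.Conv1d(1, 1, kernel_size=3, padding=1, padding_mode="circular", bias=False)

x = torch.randn(1, 1, 16)                           # [batch, channels, time]
shift = 4
x_shifted = torch.roll(x, shifts=shift, dims=-1)    # T(x): shift the signal in time

f_T_x = conv(x_shifted)                             # f(T(x))
T_f_x = torch.roll(conv(x), shifts=shift, dims=-1)  # T(f(x))
print(torch.allclose(f_T_x, T_f_x, atol=1e-6))      # True: f is equivariant to shifts
```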
** equivariance in self-attention
- augmented self-attention with relative position based information that maintains equivariance
- For relative positional encoding, this means that the relationships between tokens remain consistent even if their absolute positions change
- relationship between "dog" and "bone" in "The dog chewed the bone" should be similar to their relationship in "Yesterday, the dog chewed the bone", despite their absolute positions changing
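A minimal sketch of how relative-position information can enter self-attention: a learned bias indexed only by the offset i - j is added to the attention scores, so the scores depend on relative rather than absolute positions. This is a reduced learned-bias illustration, not the Transformer-XL scheme the paper actually uses; the class and parameter names are made up.

```python
import torch
import torch.nn as nn

class RelPosBiasAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        self.scale = d_model ** -0.5
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # one learned bias per relative offset in [-(max_len-1), max_len-1]
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))

    def forward(self, x):                                   # x: [batch, time, d_model]
        t = x.size(1)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = torch.matmul(q, k.transpose(-2, -1)) * self.scale      # [b, t, t]
        pos = torch.arange(t, device=x.device)
        offsets = pos[:, None] - pos[None, :]                           # i - j
        scores = scores + self.rel_bias[offsets + self.rel_bias.numel() // 2]
        return torch.matmul(scores.softmax(dim=-1), v)

x = torch.randn(2, 50, 512)
print(RelPosBiasAttention(512, max_len=500)(x).shape)        # torch.Size([2, 50, 512])
```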
2. Conformer Encoder
The encoder first processes the input with a convolution subsampling layer and then with a number of Conformer blocks.
- SpecAug : let's say we have a [500, 80] tensor: 500 time frames, 80 frequency bins
- augmentation method : https://arxiv.org/abs/1904.08779
- Convolution Subsampling : [125, 80]. This reduces the sequence length; let's assume it reduces by a factor of 4.
- Linear : [125, D], where D is the encoder dimension (let's say 512)
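A rough sketch of the front-end shape flow (SpecAugmented features -> Conv2d subsampling -> Linear). The notes above simplify the intermediate shape to [125, 80]; this sketch instead flattens the conv channels together with the reduced frequency axis before the Linear layer, which is a common implementation choice and an assumption here.

```python
import torch
import torch.nn as nn

class ConvSubsampling(nn.Module):
    """Two stride-2 Conv2d layers: time 500 -> 124 (~1/4), frequency 80 -> 19."""
    def __init__(self, d_model: int, n_mels: int = 80, channels: int = 32):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2), nn.ReLU(),
        )
        freq_out = ((n_mels - 1) // 2 - 1) // 2         # frequency dim after two stride-2 convs
        self.proj = nn.Linear(channels * freq_out, d_model)

    def forward(self, feats):                            # feats: [batch, time, n_mels]
        x = self.conv(feats.unsqueeze(1))                # [batch, ch, ~time/4, ~n_mels/4]
        b, c, t, f = x.shape
        return self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))  # [batch, ~time/4, d_model]

feats = torch.randn(1, 500, 80)                          # 500 frames x 80 mel bins (post-SpecAug)
print(ConvSubsampling(d_model=512)(feats).shape)         # torch.Size([1, 124, 512])
```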
2.1 Feed Forward Module
Two linear layers with a nonlinear activation in between. A residual connection is added.
- input : [125, 512]
- output : [125, 512] (same shape, but transformed features)
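A minimal sketch of the feed-forward module under those shapes: LayerNorm, an expanding Linear, Swish, dropout, and a Linear projecting back to d_model, used with a residual add. The 4x expansion and 0.1 dropout rate are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FeedForwardModule(nn.Module):
    def __init__(self, d_model: int = 512, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, expansion * d_model),   # expand
            nn.SiLU(),                                 # Swish activation
            nn.Dropout(dropout),
            nn.Linear(expansion * d_model, d_model),   # project back
            nn.Dropout(dropout),
        )

    def forward(self, x):                              # [batch, 125, 512] -> [batch, 125, 512]
        return self.net(x)

x = torch.randn(1, 125, 512)
print((x + FeedForwardModule()(x)).shape)              # residual add keeps the shape: [1, 125, 512]
```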
2.2 Multi-Headed Self-Attention Module
The relative positional encoding allows the self-attention module to generalize better to different input lengths, and the resulting encoder is more robust to variance in utterance length.
Pre-norm residual units with dropout are used, which helps in training and regularizing deeper models.
- input : [125, 512], query(Q), key(K), value(V)
- linear projection for each head : [125, d_k], repeat from i=1 to num_heads
- attention : [125, d_k], head_i's shape
- concat heads : [125, num_heads * d_k]
- final linear projection : [125, 512]
- output : [125, 512] (same shape, but with global context)
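A shape-only walkthrough of the steps above with num_heads = 8 (so d_k = 512 / 8 = 64), leaving out relative positional encoding, dropout, and the pre-norm for brevity.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8
d_k = d_model // num_heads                          # 64
x = torch.randn(125, d_model)                       # [125, 512] (single utterance)

w_q = [nn.Linear(d_model, d_k) for _ in range(num_heads)]
w_k = [nn.Linear(d_model, d_k) for _ in range(num_heads)]
w_v = [nn.Linear(d_model, d_k) for _ in range(num_heads)]
w_out = nn.Linear(num_heads * d_k, d_model)

heads = []
for i in range(num_heads):
    q, k, v = w_q[i](x), w_k[i](x), w_v[i](x)       # each [125, 64]
    attn = torch.softmax(q @ k.T / d_k ** 0.5, dim=-1)   # [125, 125]
    heads.append(attn @ v)                          # head_i: [125, 64]

out = w_out(torch.cat(heads, dim=-1))               # concat: [125, 512] -> project: [125, 512]
print(out.shape)                                    # torch.Size([125, 512])
```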
2.3 Convolution Module
- Pointwise Conv : 1D convolution with kernel size 1 (a per-frame linear map); the first pointwise conv expands the channels from d_model to 2 * d_model for the GLU
- GLU : e.g., [125, 2 * 512] -> [125, 512]; split x into two halves along the last dimension: x[:, :, :d_model] * sigmoid(x[:, :, d_model:])
- Depthwise Conv : 1D convolution with groups=num_channels
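A sketch tying the pieces above together in the usual order (LayerNorm, pointwise conv expanding to 2*d_model, GLU, depthwise conv, BatchNorm, Swish, pointwise conv back to d_model). An odd kernel size of 31 is used here so "same" padding is simple; the ablation in Section 4 reports kernel size 32.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, d_model: int = 512, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pointwise1 = nn.Conv1d(d_model, 2 * d_model, kernel_size=1)
        self.glu = nn.GLU(dim=1)                       # split channels, gate with sigmoid
        self.depthwise = nn.Conv1d(d_model, d_model, kernel_size,
                                   padding=kernel_size // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.act = nn.SiLU()                           # Swish
        self.pointwise2 = nn.Conv1d(d_model, d_model, kernel_size=1)

    def forward(self, x):                              # x: [batch, time, d_model]
        y = self.norm(x).transpose(1, 2)               # [batch, d_model, time] for Conv1d
        y = self.glu(self.pointwise1(y))               # [b, 2*d_model, t] -> [b, d_model, t]
        y = self.act(self.bn(self.depthwise(y)))
        return self.pointwise2(y).transpose(1, 2)      # back to [batch, time, d_model]

x = torch.randn(2, 125, 512)
print(ConvModule()(x).shape)                           # torch.Size([2, 125, 512])
```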
2.4 Conformer Block
Having two Macaron-Net-style feed-forward layers with half-step residual connections sandwiching the attention and convolution modules provides a significant improvement over having a single feed-forward module in the Conformer architecture.
We found that the convolution module stacked after the self-attention module works best for speech recognition.
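Putting it together, a sketch of the block's forward pass with Macaron-style half-step FFN residuals and the convolution module placed after self-attention. It assumes the FeedForwardModule and ConvModule sketches above are in scope and substitutes a plain nn.MultiheadAttention for the paper's relative-positional MHSA.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.ffn1 = FeedForwardModule(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.conv = ConvModule(d_model)
        self.ffn2 = FeedForwardModule(d_model)
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):                                   # x: [batch, time, d_model]
        x = x + 0.5 * self.ffn1(x)                          # first half-step FFN residual
        y = self.attn_norm(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]   # MHSA with residual
        x = x + self.conv(x)                                # convolution after self-attention
        x = x + 0.5 * self.ffn2(x)                          # second half-step FFN residual
        return self.final_norm(x)

x = torch.randn(2, 125, 512)
print(ConformerBlock()(x).shape)                            # torch.Size([2, 125, 512])
```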
3. Experiments
used a single-LSTM-layer decoder in all our models
used 3-layer LSTM language model (LM) trained on the LibriSpeech language model corpus
4. Ablation Studies
The convolution module is the most important feature.
The Macaron-style FFN pair is more effective than a single FFN with the same number of parameters.
(4) Split the input into parallel branches: one branch uses MHSA, the other uses a convolution module, and the outputs of the two branches are concatenated (see the sketch after this section).
Using lightweight convolution instead of depthwise convolution shows a significant drop in performance.
Placing the convolution module after the self-attention module shows better results.
(2) Half-step residuals: the feed-forward outputs are scaled by 1/2 in their residual connections, instead of full-step (weight 1) residuals.
Increasing the number of attention heads up to 16 improves accuracy, especially on the dev-other dataset.
We find kernel size 32 to perform better than the rest.
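For reference, a sketch of the parallel-branch variant from item (4): MHSA and the convolution module see the same input, and their outputs are concatenated and projected back to d_model. The merge projection and the residual add are assumptions about the variant's details; ConvModule is the sketch from Section 2.3.

```python
import torch
import torch.nn as nn

class ParallelBranchLayer(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.conv = ConvModule(d_model)                 # from the Section 2.3 sketch
        self.merge = nn.Linear(2 * d_model, d_model)    # assumed way to reduce the concat width

    def forward(self, x):                               # x: [batch, time, d_model]
        attn_out = self.attn(x, x, x, need_weights=False)[0]   # MHSA branch
        conv_out = self.conv(x)                                 # convolution branch
        return x + self.merge(torch.cat([attn_out, conv_out], dim=-1))

x = torch.randn(2, 125, 512)
print(ParallelBranchLayer()(x).shape)                   # torch.Size([2, 125, 512])
```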