

[Paper Review] Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks


https://www.cs.toronto.edu/~graves/icml_2006.pdf


0. Abstract

In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units.

RNNs are powerful sequence learners but there are two problems.

 

0.1 Pre-segmented training data

Example: Let's say we have an audio clip of someone saying "Hello world". 

Pre-segmented data might look like this:

  • "He-" (0.0s - 0.2s) 
  • "-llo" (0.2s - 0.4s) 
  • "wor-" (0.5s - 0.7s) 
  • "-ld" (0.7s - 0.9s) 

The RNN would be trained on these pre-defined segments, learning to associate each audio segment with its corresponding text output.

Without this alignment, such straightforward frame-wise training isn't available to us.

The problem: In real-world speech, words and sounds often blend together, making it difficult and time-consuming to create pre-segmented training data.

 

0.2 Post-processing to transform outputs into label sequence

After an RNN processes the input, its raw output often needs to be converted into a meaningful sequence of labels (like words or phonemes). 

Example: Let's say our RNN processes the "Hello world" audio and outputs a series of probabilities for each phoneme at each time step:

Post-processing is needed to convert these probabilities into a sequence of phonemes, and then into words.

This might involve:

  1. Choosing the highest probability phoneme at each time step
  2. Merging consecutive identical phonemes
  3. Applying language models to convert phonemes to words

The problem: This post-processing can be complex and may introduce errors, especially for continuous speech where word boundaries are unclear.

 

The novel method introduced in the paper trains RNNs to label unsegmented sequences directly, so it requires neither pre-segmented data nor complex post-processing, solving both problems that arise from not knowing the alignment between the input and the output.

 

1. Introduction

1.1 HMMs and CRFs are good, but there are drawbacks

  1. Requirement of task-specific knowledge
  2. Explicit dependency assumptions
    • HMMs assume that observations are independent, meaning the probability of an observation depends only on the current hidden state, not on previous observations or states
    • CRFs still require specifying a graph structure that defines dependencies between variables. This structure might not capture all relevant dependencies in the data.
  3. Training is generative even though sequence labelling is discriminative
    • HMMs are typically trained generatively, even though sequence labelling is a discriminative problem.
      • discriminative model : learns the conditional probability distribution P(Y|X) and focuses directly on the decision boundary between classes (it cares about finding the boundary that separates the classes)
      • generative model : learns the joint probability distribution P(X, Y), where X is the input data and Y is the label; it models how the data is generated, including the distribution of the input features (it cares about finding the parameters of the per-class probability distributions formed by the inputs and labels)

1.2 RNNs are good

  • require no prior knowledge of the data, beyond the choice of input and output representation.
  • can be trained discriminatively
  • modeling time series
  • robust to temporal and spatial noise

 

1.3 RNNs still face challenges...

Example: Consider an RNN being used for phoneme classification in speech recognition

  1. Input: An acoustic signal of someone saying "cat"
  2. Output: A sequence of phonemes /k/ /æ/ /t/
  3. Signal preprocessing: The acoustic signal is divided into short time frames (e.g., 25ms each).
  4. Feature extraction: For each frame, features like Mel-frequency cepstral coefficients (MFCCs) are extracted.
  5. RNN processing: The RNN processes these features frame by frame.
  6. Independent classifications: For each frame, the RNN outputs a probability distribution over all possible phonemes.

 

  • Pointwise objective function: The loss function used to train the RNN is typically calculated separately for each point (time step) in the sequence. This means the network is optimized to make the best prediction at each individual time step, rather than optimizing for the entire sequence label.
    • each frame is classified independently, and the loss is typically calculated separately for each frame's prediction
    • doesn't capture the temporal dependencies between phonemes effectively.
  • Lack of sequence-level optimization: The RNN isn't directly trained to produce the best overall sequence of labels, but rather to produce the best label at each step independently.
    • Post-processing is needed to convert frame-wise predictions into a coherent sequence of phonemes

=> which means that training data must be pre-segmented and that the network outputs must be post-processed to give the final label sequence.

 

1.4 RNNs + HMMs (Hybrid systems)

  • During training, the HMM can automatically divide the input sequence into segments and convert the RNN's frame-wise classifications into a coherent sequence of labels
  • RNNs excel at making predictions for individual short segments of the sequence.

=> Addresses the segmentation problem that RNNs typically face & Allows for end-to-end sequence labeling. But this hybrid approach doesn't fully exploit the sequence modeling capabilities of RNNs.

 

1.5 We Introduce Connectionist temporal classification (CTC)

  • Proposed method models all aspects of the sequence within a single network architecture
    • This means that the entire process of sequence labeling is handled by one RNN, rather than relying on separate components for different stages.
  • the objective function is differentiable, which allows gradient-based optimization
  • End-to-end learning: The network learns to map directly from input sequences to label sequences.
  • No need for explicit alignment: The method doesn't require pre-aligned input-output pairs.
  • Flexibility: It can handle variable-length input and output sequences.

*temporal classification : task of assigning labels or categories to time series data or sequential observations

*frame-wise classification : independent labelling of each time-step, or frame, of the input sequence

2. Temporal Classification

  • S : set of training examples, (x, z) pairs drawn from a fixed distribution D_{X×Z} (the joint space of inputs and labellings)
  • x : an element of the input space X = (R^m)^*, the set of all sequences of m-dimensional real-valued vectors
    • x = (x_1, x_2, ..., x_T), a sequence of length T
  • z : an element of the target space Z = L^*, a label sequence (labelling)
    • z = (z_1, z_2, ..., z_U), with U <= T
  • L : finite alphabet of labels

Use S to train a temporal classifier h : X -> Z to classify previously unseen input sequences in a way that minimizes an error measure.

 

2.1 Label Error Rate

  • ED(p, q) : edit distance (number of insertions, substitutions, and deletions required) between two sequences p and q
  • S' : test set (disjoint from S)
  • Z : total number of target labels in S'
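
As a concrete reference, here is a minimal Python sketch of the edit distance and the label error rate LER(h, S') = (1/Z) Σ ED(h(x), z) described above (function names are illustrative):

```python
def edit_distance(p, q):
    """ED(p, q): minimum number of insertions, substitutions and deletions
    needed to turn sequence p into sequence q (standard Levenshtein DP)."""
    dp = [[0] * (len(q) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        dp[i][0] = i                                # delete all of p[:i]
    for j in range(len(q) + 1):
        dp[0][j] = j                                # insert all of q[:j]
    for i in range(1, len(p) + 1):
        for j in range(1, len(q) + 1):
            cost = 0 if p[i - 1] == q[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[len(p)][len(q)]


def label_error_rate(predictions, targets):
    """LER = (1/Z) * sum of ED(h(x), z) over the test set S',
    where Z is the total number of target labels in S'."""
    Z = sum(len(z) for z in targets)
    return sum(edit_distance(h_x, z) for h_x, z in zip(predictions, targets)) / Z
```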

3. Connectionist Temporal Classification

 

Further reading:

  • Sequence Modeling with CTC (distill.pub) : a visual guide to Connectionist Temporal Classification, an algorithm used to train deep neural networks in speech recognition, handwriting recognition and other sequence problems
  • (Speech Recognition) Connectionist Temporal Classification review and explanation (zerojsh00.github.io)
  • Breaking down the CTC Loss (ogunlao.github.io)
 

In a typical RNN for sequence labeling, the network produces a series of outputs at each time step. These outputs are usually scores or unnormalized probabilities for each possible label.
CTC introduces a way to interpret these raw outputs as probabilities over entire label sequences, not just individual labels at each time step.

 

3.1 From Network Outputs to Labellings

(CTC Network output)

A CTC network has a softmax output layer with |L| + 1 units, where |L| is the number of labels and the extra unit is for 'blank', i.e. no label.

(alignment probabilities)

These outputs define the probabilities of all possible ways of aligning all possible label sequences with the input sequence. 

 

  • N_w : RNN network, (R^m)^T -> (R^n)^T
  • y : y = N_w(x)
  • x : input sequence
  • T : sequence length
  • L' : the alphabet L plus the blank symbol
  • L'^T : the set of paths, denoted π
  • π_t : the label output at time-step t (in this example, a phoneme)
  • t : each timestep
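
The equation referred to below is not reproduced in this post; reconstructed from the definitions above, it is the probability of a path π given the input x:

$$
p(\pi \mid \mathbf{x}) = \prod_{t=1}^{T} y_{\pi_t}^{t}, \qquad \forall \pi \in L'^{T}
$$

where $y_{\pi_t}^{t}$ is the softmax output for label $\pi_t$ at time $t$.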

This equation assumes that the network outputs at different times are conditionally independent, given the internal state of the network.

  • B : decoding (collapsing) map, L'^T -> L^(<=T)
    • ex) B(a−ab−) = B(−aa−−abb) = aab (where '−' denotes the blank)
  • L^(<=T) : set of possible labellings, obtained by simply removing all blanks and repeated labels from the paths
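
A minimal Python sketch of the collapsing map B (the function name and blank symbol are illustrative):

```python
def B(path, blank='-'):
    """Collapse a path into a labelling: merge repeated labels first,
    then remove all blanks."""
    labelling = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:   # new symbol, and not a blank
            labelling.append(sym)
        prev = sym
    return ''.join(labelling)

# the paper's example: B("a-ab-") == B("-aa--abb") == "aab"
assert B("a-ab-") == B("-aa--abb") == "aab"
```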

(Total Probability of a Label Sequence)

P("HELLO" | x) = P(h-e-_-ll-_-ll-oo | x) + P(hh-e-ll-_-_-l-_-o | x)

 

3.2 Constructing the Classifier

the output of the classifier should be the most probable labelling for the input sequence 

Based on the softmax output distribution, how can we decode the output?

 

3.2.1 Best path decoding

Best path decoding is based on the assumption that the most probable path will correspond to the most probable labelling (the most probable path may well give a good labelling, but this is not guaranteed).

  • path probability : P(π|x)
  • the most probable path : π^*, the concatenation of the most active outputs at every time-step
    • the idea: collecting the highest-probability phoneme at every time-step should give the most probable path
  • most probable labelling : B(π^*)
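
A minimal sketch of best path decoding, assuming y is a (T × |L'|) matrix of softmax outputs and labels maps each output unit to its symbol (these names and the toy numbers are illustrative):

```python
import numpy as np

def best_path_decode(y, labels, blank='-'):
    """Best path decoding: take the most active output at every time-step
    to form the most probable path pi*, then apply the collapsing map B."""
    pi_star = [labels[k] for k in np.argmax(y, axis=1)]  # most probable path
    # B: merge repeated symbols, then drop blanks
    decoded, prev = [], None
    for sym in pi_star:
        if sym != prev and sym != blank:
            decoded.append(sym)
        prev = sym
    return ''.join(decoded)

# toy example with L' = {'a', 'b', '-'} (blank last)
y = np.array([[0.6, 0.3, 0.1],   # t=1 -> 'a'
              [0.5, 0.2, 0.3],   # t=2 -> 'a'
              [0.1, 0.2, 0.7],   # t=3 -> '-'
              [0.2, 0.7, 0.1]])  # t=4 -> 'b'
print(best_path_decode(y, labels=['a', 'b', '-']))  # "ab"
```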

3.2.2 Prefix search decoding

Given enough time, prefix search decoding always finds the most probable labelling.

However, the maximum number of prefixes it must expand grows exponentially with the input sequence length.

4. Training the Network

So far we have described an output representation that allows RNNs to be used for CTC. We now derive an objective function for training CTC networks with gradient descent.

The objective function is derived from the principle of maximum likelihood.

 

4.1 The CTC Forward-Backward Algorithm

We require an efficient way of calculating the conditional probabilities p(l|x) of individual labellings.

At first sight this suggests this will be problematic: the sum is over all paths corresponding to a given labelling, and in general there are very many of these (there are far too many paths that can become "hello"...).

This is solved with a dynamic programming algorithm.

Paths that can become "ab", such as _-a-_-b-_ and _-a-a-_-b, share the same intermediate computation results.

 

  • q_(1:p) : the first p symbols of sequence q
  • q_(r-p:r) : the last p symbols of sequence q
  • l : the labelling
  • α_t(s) : the total probability of l_(1:s) at time t
    • it can also be viewed as the probability of the node at coordinates (s, t) in the lattice (matrix) figure below

 

4.1.1 Forward Algorithm

Forward (α) variables are used in computing the loss

To allow for blanks in the output paths, the label sequence l is modified into l':

  • insert a blank at the beginning, at the end, and between every pair of labels
  • the length of the new l' is 2|l| + 1

 

Allow all prefixes to start with either a blank (b) or the first symbol in l (l_1); paths cannot start from any later position in l'.


Allow all transitions between blank and non-blank labels, and those between any pair of distinct non-blank labels.

Case 1 : the current symbol l'_s is the blank ϵ, or the current symbol at time t is the same as the symbol at position s-2
Case 2 : otherwise

Finally, the probability of l is the sum of the total probabilities of l' with and without the final blank at time T: p(l|x) = α_T(|l'|) + α_T(|l'|-1).
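
A minimal NumPy sketch of this forward recursion under the rules above (without rescaling; the BLANK index, array shapes, and function names are assumptions for illustration):

```python
import numpy as np

BLANK = 0  # assumed index of the blank unit in the softmax output

def extend_with_blanks(label_seq):
    """Build l' by inserting a blank at the start, the end, and between
    every pair of labels, so that |l'| = 2|l| + 1."""
    l_prime = [BLANK]
    for label in label_seq:
        l_prime += [label, BLANK]
    return l_prime

def ctc_forward(y, label_seq):
    """Forward (alpha) recursion for a non-empty label sequence.
    y: (T, |L|+1) softmax outputs, y[t, k] = probability of label k at time t.
    Returns alpha of shape (T, |l'|) and p(l|x)."""
    l_prime = extend_with_blanks(label_seq)
    T, S = y.shape[0], len(l_prime)
    alpha = np.zeros((T, S))

    # initialisation: paths may start with the blank or with the first label
    alpha[0, 0] = y[0, BLANK]
    alpha[0, 1] = y[0, l_prime[1]]

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s >= 1 else 0.0)
            # case 2: also allow the skip from s-2, but only between
            # distinct non-blank labels
            if s >= 2 and l_prime[s] != BLANK and l_prime[s] != l_prime[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, l_prime[s]]

    # p(l|x): total probability of l' with and without the final blank
    p_l_given_x = alpha[T - 1, S - 1] + alpha[T - 1, S - 2]
    return alpha, p_l_given_x
```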



4.1.2 Backward Algorithm

Backward (β) variables are used to calculate gradients efficiently.

The initialisation and transitions are the reverse of cases 1 and 2 above.

The node at t = 6, s = 9 (orange in the figure) contributed to the output through three paths in total, so during backpropagation the gradient flows back to it from those three paths.

β_t(s) : the total probability of the label sequence l_(s:|l|) over time-steps t to T.

 

To avoid underflow, re-scaling is needed: use the hatted (rescaled) version of each variable to stay within computational range.
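
In the paper the rescaling is done per time-step; the hatted variables and the log-likelihood then become (this uses the paper's convention that α_t(s) = 0 for positions with too few time-steps left to complete the sequence):

$$
C_t \equiv \sum_s \alpha_t(s), \qquad \hat{\alpha}_t(s) \equiv \frac{\alpha_t(s)}{C_t}, \qquad
D_t \equiv \sum_s \beta_t(s), \qquad \hat{\beta}_t(s) \equiv \frac{\beta_t(s)}{D_t}
$$

$$
\ln p(\mathbf{l} \mid \mathbf{x}) = \sum_{t=1}^{T} \ln C_t
$$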

 

4.2 Maximum Likelihood Training

 

The key point is that, for a labelling l, the product of the forward and backward variables at a given s and t is the probability of all the paths corresponding to l that go through the symbol l'_s at time t.

  • α_6(9) × β_6(9) / y^6_(l'_9) = Σ p(π|x), summed over the paths π ∈ B^(-1)(l) that pass through s = 9 at t = 6 (the division is needed because both α and β include the output at time t)

The objective is to maximize the log probabilities of all the correct labellings (z) in the training set.

  • l = z

Since the training examples are independent we can consider them separately.
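
Written out, the maximum likelihood objective to be minimised is the negative log-likelihood over the training set:

$$
O^{ML}(S, \mathcal{N}_w) = -\sum_{(\mathbf{x}, \mathbf{z}) \in S} \ln p(\mathbf{z} \mid \mathbf{x})
$$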

Note: r = α·β / y here, but y_k^t is omitted in the figure; also, the paper's indices start at 1 while this figure starts at 0.

For any t, we can therefore sum over all s.
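
That is:

$$
p(\mathbf{l} \mid \mathbf{x}) = \sum_{s=1}^{|\mathbf{l}'|} \frac{\alpha_t(s)\,\beta_t(s)}{y_{\mathbf{l}'_s}^{t}}
$$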

Let's differentiate this w.r.t. y_k^t.

  • lab(l, k) : the set of positions where label k occurs, lab(l, k) = {s : l'_s = k}, which may be empty.
    • Let's say l = "hello", l' = "_h_e_l_l_o_"
      • lab(l, 'h') = {2} (h appears at position 2 of l')
      • lab(l, 'e') = {4} (e appears at position 4 of l')
      • lab(l, 'l') = {6, 8} (l appears at positions 6 and 8 of l')
      • lab(l, 'o') = {10} (o appears at position 10 of l')
      • lab(l, 'x') = {} (x doesn't appear, so the set is empty)
  • Among the nodes in the red box of the figure, every node where k does not occur behaves like a constant under differentiation and drops out, so only the nodes at positions where k occurs in l' remain.

Finally, these are the derivatives of the objective function w.r.t. y_k^t.
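
Reconstructing them from the forward-backward quantities above (unrescaled form):

$$
\frac{\partial p(\mathbf{l} \mid \mathbf{x})}{\partial y_k^t}
= \frac{1}{(y_k^t)^2} \sum_{s \in lab(\mathbf{l},k)} \alpha_t(s)\,\beta_t(s),
\qquad
\frac{\partial O^{ML}}{\partial y_k^t}
= -\frac{1}{p(\mathbf{z} \mid \mathbf{x})}\,
\frac{\partial p(\mathbf{z} \mid \mathbf{x})}{\partial y_k^t}
$$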

 

  • the hatted variables denote the rescaled versions

To backpropagate the gradient through the softmax layer, we need the objective function derivatives with respect to the unnormalised outputs u_k^t.

This is the 'error signal' received by the network during training.

  • Obtaining the gradient that flows down to the softmax via the chain rule: ∂O^ML/∂u_k^t = Σ_k' (∂O^ML/∂y_k'^t) · (∂y_k'^t/∂u_k^t)
    • (∂O^ML/∂y_k'^t) : computed above
    • (∂y_k'^t/∂u_k^t) : the derivative of the softmax; in the softmax-derivative figure, z_n corresponds to u_k^t and ŷ_i to y_k'^t
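
Putting the pieces together, the (unrescaled) error signal received by the network at the softmax inputs works out to:

$$
\frac{\partial O^{ML}}{\partial u_k^t}
= y_k^t - \frac{1}{p(\mathbf{z} \mid \mathbf{x})\, y_k^t}
\sum_{s \in lab(\mathbf{z},k)} \alpha_t(s)\,\beta_t(s)
$$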

5. Experiments

TIMIT : English speech accompanied by manually segmented phonetic transcripts

6. Discussion and Future Work

6.1 CTC vs Other Temporal Classifiers

  • CTC does not explicitly segment its input sequences, removing the need to locate inherently ambiguous label boundaries
    • traditional methods might try to explicitly identify where each phoneme or word begins and ends
  • Allows grouping of label predictions
    • Some phonemes frequently occur together in many words of a language. For example, in English, the phonemes /t/ and /r/ often occur together in words like "tree", "train", "trip", etc.
    • Traditional speech recognition systems might require explicit rules or training to recognize such common phoneme combinations.
    • CTC, however, can learn these patterns automatically from the training data:
    • It doesn't need to be told in advance which phonemes commonly occur together.
    • Through its training process, CTC can learn to recognize and group these co-occurring phonemes.
    • This grouping happens implicitly as part of CTC's learning process

6.2 No explicit inter-label dependencies

CTC does not explicitly model the relationships between labels in a sequence. This means it doesn't have built-in rules or probabilities for how one label follows another.

  • Contrast with graphical models
    • Many other sequence models, like Hidden Markov Models (HMMs), explicitly model label dependencies. They often assume labels form a Markov chain, where the probability of a label depends on the k previous labels.
  • Implicit modeling of dependencies
    • Despite not explicitly modeling dependencies, CTC can still capture some relationships between labels implicitly through its training process.
  • "Double spike" example
    • This refers to CTC's ability to learn to predict labels that commonly occur together by outputting high probabilities for those labels in close succession.

6.3 Generalization and Overfitting

Maximum likelihood training, as used by CTC, is prone to overfitting (poor generalization).

7. Conclusions

The paper introduced a novel, general method for temporal classification with RNNs that:

  • obviates the need for pre-segmented data
  • allows the network to be trained directly for sequence labelling

It outperformed both an HMM and an HMM-RNN hybrid on a real-world temporal classification problem (TIMIT phoneme recognition).
