0. Abstract
Proposes large-scale, weakly-supervised speech processing models (Whisper).
- trained on 680,000 hours of audio with multilingual, multitask supervision (in the spirit of GPT-style large-scale training)
- the training objective is simply to predict the paired transcripts from the audio.
- the dataset is a large collection of audio recordings paired with their corresponding transcripts, but these transcripts are not necessarily perfect or manually curated; they are often automatically generated or collected from various sources on the internet.
The resulting models generalize well to standard speech processing benchmarks.
- this generalization is achieved in a zero-shot setting, meaning the models are applied to new tasks or languages without any additional finetuning (no need for task-specific, language-specific training data)
- The performance is often competitive with prior fully-supervised models, which are typically trained specifically for each task or language.
1. Introduction
1.1 Unsupervised pre-training
Unsupervised pre-training techniques such as Wav2Vec 2.0 learn directly from raw audio without the need for human labels, so they can use large datasets of unlabeled speech.
When fine-tuned on standard benchmarks, this approach has improved the state of the art, especially in a low-data setting.
These pre-trained audio encoders learn high-quality representations of speech, but they don't have a built-in mechanism to map these features to useful outputs (like text transcriptions)
So fine-tuning on a specific task with labeled data is necessary to adapt the pre-trained representations to that task.
1.2 Fine-tuning has some problems
After fine-tuning, ML models are good at finding patterns within their training dataset, and these patterns improve performance on held-out data from the same (training) distribution. But these patterns often fail to generalize to other datasets or data distributions.
As a result, models that perform exceptionally well in controlled, dataset-specific environments may fail unexpectedly when deployed in diverse, real-world scenarios.
1.3 No pre-trained decoder
This suggests that while unsupervised pre-training has improved the quality of audio encoders dramatically, the lack of an equivalently high-quality pre-trained decoder, combined with a recommended protocol of dataset-specific fine-tuning, is a crucial weakness which limits their usefulness and robustness.
1.4 So we are going to build
The goal of a speech recognition system should be to work reliably “out of the box” in a broad range of environments without requiring supervised fine-tuning of a decoder for every deployment distribution.
- Large weakly-supervised dataset
- moving beyond the requirement of gold-standard human-validated transcripts
- trade-off between quality and quantity
- Multilingual and Multitask
- Broaden the scope of weakly-supervised pre-training.
- Of those 680,000 hours of audio with labels, 117,000 hours cover 96 other languages
- dataset also includes 125,000 hours of X→en translation data
- No need for self-supervision or self-training techniques
- Self-training typically involves using a model to generate pseudo-labels for unlabeled data, then training on that data. Whisper doesn't use this approach.
2. Approach
2.1 Data Processing
construct the dataset from audio that is paired with transcripts on the Internet
- broad distribution of audio: recording environments, speakers, and many languages
- diverse audio quality: leads to robustness
- diverse transcript quality: used several automated filtering methods to improve quality and remove very poor transcripts
- many transcripts are generated from ASR systems
- training on datasets of mixed human and machine-generated data can significantly impair the performance
- so developed many heuristics to detect and remove machine-generated transcripts from the training dataset.
- for example, transcripts that lack punctuation or capitalization, which are typical of normalized ASR output rather than human-written text (see the filtering sketch at the end of this list)
- pairs where the audio language and transcript language differ are used as X→en translation examples when the transcript is English, e.g. {(audio: Korean), (transcript: English)}; mismatched pairs whose transcripts are not English are not used
- to detect the audio language, used an audio language detector fine-tuned on VoxLingua107
- audio is segmented into 30-second segments, each paired with the subset of the transcript that occurs within that window
- some segments don't have speech (used as training data for VAD)
- de-duplicate between training data and evaluation data at transcript level
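Below is a minimal sketch of the kind of transcript-filtering heuristic referenced above. The exact rules and thresholds used by the authors are not published in this form; the function name and the specific checks (casing, missing punctuation) are illustrative assumptions.

```python
import re

def looks_machine_generated(transcript: str) -> bool:
    """Heuristically flag transcripts that look like raw ASR output.

    Illustrative rules only (an assumption, not the paper's exact filter):
    all-uppercase or all-lowercase text and missing punctuation are typical
    of normalized ASR output rather than human-written transcripts.
    """
    text = transcript.strip()
    if not text:
        return True
    letters = [c for c in text if c.isalpha()]
    if letters and (all(c.isupper() for c in letters) or all(c.islower() for c in letters)):
        return True
    if not re.search(r"[.,!?]", text):  # no punctuation at all
        return True
    return False

# Keep only (audio, transcript) pairs whose transcript passes the filter.
pairs = [("clip1.wav", "THIS IS ALL CAPS WITHOUT PUNCTUATION"),
         ("clip2.wav", "Hello there, how are you today?")]
clean_pairs = [(a, t) for a, t in pairs if not looks_machine_generated(t)]
print(clean_pairs)  # only clip2.wav survives
```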
2.2 Model
To focus on large-scale supervised pre-training, we use an off-the-shelf architecture, encoder-decoder Transformer to avoid confounding our findings with model improvements.
2.2.1 Encoder
- feature extraction
- 16 kHz sampling
- 80 channel log-magnitude Mel Spectrogram computed on 25ms windows with stride of 10ms
- feature normalization
- globally scale these features to roughly the range -1 to 1 with approximately zero mean, using statistics from the entire pre-training dataset, not from individual audio files (see the sketch after this list)
- 2 conv1d + GELU
- Feature extraction: These layers act as a feature extractor, learning to identify relevant patterns in the input spectrogram.
- Dimensionality reduction: The second convolutional layer has a stride of two, which reduces the temporal dimension of the input by half. This helps to decrease the computational complexity for the subsequent Transformer layers.
- Local context aggregation: With a filter width of 3, each convolutional layer captures local context across 3 time steps in the spectrogram. This allows the model to learn local temporal patterns.
- Non-linearity introduction: The GELU activation function between the convolutional layers introduces non-linearity, allowing the model to learn more complex representations.
- Initial processing: These layers serve as a "stem" that performs initial processing on the raw input before it's fed into the main Transformer architecture.
- Learnable feature representation: Unlike hand-crafted features, these convolutional layers can learn to extract features that are most relevant for the speech recognition task.
- Sinusoidal position embedding
- pre-activation residual blocks
- layer normalization on encoder output
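As referenced above, here is a minimal PyTorch sketch of the encoder front end: log-Mel features, global scaling, two Conv1d + GELU layers (the second with stride 2), and sinusoidal position embeddings. The hyperparameters follow the notes; the module names, the per-example normalization, and d_model=512 are my own assumptions, not the official implementation.

```python
import math
import torch
import torch.nn as nn
import torchaudio

SAMPLE_RATE = 16_000
N_MELS = 80
WIN_LENGTH = int(0.025 * SAMPLE_RATE)   # 25 ms window -> 400 samples
HOP_LENGTH = int(0.010 * SAMPLE_RATE)   # 10 ms stride -> 160 samples

# 80-channel mel spectrogram; log-compression is applied afterwards.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=WIN_LENGTH,
    win_length=WIN_LENGTH, hop_length=HOP_LENGTH, n_mels=N_MELS)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Log-Mel features globally scaled to roughly [-1, 1]."""
    spec = mel(waveform).clamp(min=1e-10).log10()
    # In the paper the scaling statistics come from the whole pre-training
    # dataset; here we normalize per example just to illustrate the idea.
    spec = spec - spec.mean()
    return spec / (spec.abs().max() + 1e-6)

def sinusoids(length: int, channels: int) -> torch.Tensor:
    """Fixed sinusoidal position embeddings (Transformer-style)."""
    inv_freq = torch.exp(-math.log(10000.0) *
                         torch.arange(channels // 2) / (channels // 2 - 1))
    scaled = torch.arange(length)[:, None] * inv_freq[None, :]
    return torch.cat([scaled.sin(), scaled.cos()], dim=1)

class EncoderStem(nn.Module):
    """Two Conv1d + GELU layers; the second halves the time dimension."""
    def __init__(self, n_mels: int = N_MELS, d_model: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1)
        self.gelu = nn.GELU()

    def forward(self, mel_spec: torch.Tensor) -> torch.Tensor:
        # mel_spec: (batch, n_mels, frames)
        x = self.gelu(self.conv1(mel_spec))
        x = self.gelu(self.conv2(x))           # stride 2 -> half as many frames
        x = x.permute(0, 2, 1)                 # (batch, frames, d_model)
        return x + sinusoids(x.shape[1], x.shape[2]).to(x)

# Usage: features = log_mel(torch.randn(1, 16_000)); hidden = EncoderStem()(features)
```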
2.2.2 Decoder
- learned position embeddings
- tied input-output token representation
- special tokens mark the start of transcript, the language (e.g. English), the task (transcribe), timestamps, and the point where the actual transcription text starts
- a byte-level BPE text tokenizer is used to tokenize the transcripts
2.3 Multitask Format
When we say "transcribing", the core part is predicting the words, but it is not the only part. A fully featured speech recognition system can involve many additional components such as voice activity detection, speaker diarization, and inverse text normalization. These components are often handled separately, resulting in a relatively complex system around the core speech recognition model. To reduce this complexity, we would like a single model to perform the entire speech processing pipeline, not just the core recognition part.
2.3.1 One Format
Multitask training format: specify all tasks and conditioning information as a sequence of input tokens to the decoder, and train the model to predict all of these tokens.
- For transcription or translation
- ex) <|startoftranscript|> <|en|> <|transcribe|> <|0.00|> (specifies English transcription with timestamps)
- <|startoftranscript|> marks the start of the prediction
- First, specify the language being spoken, e.g. <|en|>
- Next, specify the task: <|transcribe|> or <|translate|>
- Next, indicate that timestamps will not be predicted by including <|notimestamps|> (omit it to predict timestamps)
- Lastly, an <|endoftranscript|> token closes the prediction (a sketch of this format follows the list)
- For VAD
- <|startoftranscript|> <|nospeech|>
- Specify nonspeech with <|nospeech|>
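A small helper that assembles the conditioning-token sequence in the format above, as referenced at the end of the list. The special-token strings mirror the notes; how they map to vocabulary ids in the actual tokenizer is not shown, and the function itself is illustrative.

```python
def build_prompt(task: str, language: str = "en",
                 timestamps: bool = True, no_speech: bool = False) -> list[str]:
    """Assemble the conditioning tokens that precede the text prediction.

    task is "transcribe" or "translate"; the text tokens follow these, and
    <|endoftranscript|> closes the prediction.
    """
    tokens = ["<|startoftranscript|>"]
    if no_speech:
        # Voice activity detection case: just mark the segment as non-speech.
        return tokens + ["<|nospeech|>"]
    tokens.append(f"<|{language}|>")       # e.g. <|en|>, <|ko|>
    tokens.append(f"<|{task}|>")           # <|transcribe|> or <|translate|>
    if not timestamps:
        tokens.append("<|notimestamps|>")  # present only when timestamps are off
    return tokens

print(build_prompt("transcribe", "en"))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>']
print(build_prompt("translate", "ko", timestamps=False))
# ['<|startoftranscript|>', '<|ko|>', '<|translate|>', '<|notimestamps|>']
print(build_prompt("transcribe", no_speech=True))
# ['<|startoftranscript|>', '<|nospeech|>']
```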
2.3.2 Training Process
The decoder is an audio-conditional language model: it is trained to generate text conditioned on the audio input, but it is also a language model over text.
So the authors also train the decoder to condition on the history of the transcript text.
- With some probability, not all the time, add the transcript text preceding the current audio segment to the decoder’s context.
- Let's say we have a long audio file split into 30-second segments:
- Segment 1: "The weather today is sunny."
- Segment 2 (next): "It's perfect for a picnic in the park."
- When processing Segment 2, the model might sometimes receive the previous text "The weather today is sunny." as extra decoder context, placed before the <|startoftranscript|> <|en|> <|transcribe|> tokens for Segment 2
- This previous context text is excluded from the training loss
Why? In the hope that it will learn to use longer-range text context to resolve ambiguous audio.
- This context can help the model better understand and transcribe Segment 2, especially if there are any ambiguous words or phrases.
- For instance, if the word "picnic" is unclear in the audio, the context of sunny weather might help the model correctly transcribe it.
- The key idea is to give the model access to broader context, helping it make more informed decisions when transcribing or translating speech, especially in cases where the audio alone might be ambiguous or unclear
- We humans can also make out a word we misheard from the surrounding context. To get this effect, the output (transcript) from before the current audio is passed to the decoder as input with some probability.
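A hedged sketch of how this conditioning could be wired up during training: with some probability the preceding transcript tokens are prepended to the decoder input and masked out of the loss. The 50% probability, the -100 ignore index, and the function name are illustrative assumptions rather than the authors' exact recipe.

```python
import random

IGNORE_INDEX = -100  # label value excluded from the cross-entropy loss

def make_decoder_example(prev_text_ids: list[int], current_ids: list[int],
                         prompt_ids: list[int], p_context: float = 0.5):
    """Build (decoder_input_ids, label_ids) for one 30-second training segment.

    prev_text_ids: token ids of the transcript preceding this audio segment
    prompt_ids:    special tokens, e.g. <|startoftranscript|> <|en|> <|transcribe|>
    current_ids:   token ids of the current segment's transcript
    """
    use_context = bool(prev_text_ids) and random.random() < p_context
    context = list(prev_text_ids) if use_context else []

    input_ids = context + prompt_ids + current_ids
    # The previous-context text and the conditioning prompt are visible to the
    # decoder but contribute nothing to the training loss.
    labels = [IGNORE_INDEX] * (len(context) + len(prompt_ids)) + current_ids
    return input_ids, labels
```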
2.3.3 Timestamp prediction
2.3.3.1 Timestamp Tokens
Quantize all times to the nearest 20 ms and add the corresponding timestamp tokens to the vocabulary.
- Whisper is trained on 30s audio chunks so it's reasonable to assume that Whisper's vocabulary includes at least 1500 timestamp tokens (for a 30-second chunk)
- and likely more to accommodate longer audio files or continuous processing of long-form audio.
- <|0.00|>, <|0.02|>, <|0.04|>, etc.
Instead of predicting timestamps for every word or at regular intervals, Whisper predicts start and end timestamps for each utterance or segment of speech. (Utterance-level timestamps)
- Start time token is predicted before each caption’s text, and the end time token is predicted after.
- So the model predicts both the transcribed text and its timing information in a unified manner
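A tiny sketch of the timestamp quantization described above; the token string format follows the <|0.00|> examples in the notes, while the helper itself is illustrative.

```python
def time_to_token(seconds: float, resolution: float = 0.02) -> str:
    """Quantize a time to the nearest 20 ms and render it as a timestamp token."""
    quantized = round(seconds / resolution) * resolution
    return f"<|{quantized:.2f}|>"

# Utterance-level timestamps: a start token before the text, an end token after.
start, end, text = 1.234, 3.817, "It's perfect for a picnic in the park."
print(f"{time_to_token(start)} {text} {time_to_token(end)}")
# <|1.24|> It's perfect for a picnic in the park. <|3.82|>
```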
2.3.3.2 Handling Partial Segments
Case 1: In timestamp mode
Only the start time token is predicted for the partial segment. This indicates where the next decoding should begin
[30-second audio chunk]
<|0.00|> First complete sentence. <|2.50|>
<|2.50|> Second complete sentence. <|5.00|>
<|28.50|> What do you think about
// Start of a partial sentence; only the start time is predicted
// indicates that the subsequent decoding should be performed on an audio window aligned with that time
[Next 30-second audio chunk]
having dinner outside? <|30.50|>
Case 2: Not in timestamp mode
The audio is truncated to exclude the partial segment
[30-second audio chunk]
First complete sentence.
Second complete sentence.
// Partial sentence is not included because it is not complete
2.4 Training Details
- provide a suite of models of various size: tiny, base, small, medium, large
- train with data parallelism across accelerators
- FP16 with dynamic loss scaling
- activation check-pointing
- gradient norm clipping
- learning rate scheduling
- no data augmentation or regularization, instead rely on the diversity contained within large dataset
- many transcripts have speaker annotations, so the model tries to predict speaker names, but this information is rarely inferable from only the most recent 30 seconds of audio; Whisper is therefore briefly fine-tuned on the subset of transcripts that do not include speaker annotations, which removes this behavior
3. Experiments
3.1 Zero-shot Evaluation
To measure broad generalization, Whisper is evaluated in a zero-shot setting, without using any training data from each evaluation dataset and without fine-tuning.
3.2 Evaluation Metrics
WER penalizes all differences between the model’s output and the reference transcript including innocuous differences in transcript style.
This is especially problematic for zero-shot models like Whisper, which never observe examples of each dataset's specific transcript format, so WER may penalize them unfairly (e.g. "don 't" instead of "don't").
To address the issue of WER penalizing Whisper for non-semantic differences in transcriptions, the authors developed a text normalizer that standardizes the text before the WER calculation.
*The text normalizer was developed by iteratively examining Whisper's outputs and adjusting the normalization rules. This process could inadvertently create a bias towards Whisper's specific way of formatting transcripts
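A minimal sketch of the evaluation idea: normalize both the reference and the hypothesis before computing WER. The normalizer below is a toy stand-in (lowercasing, re-joining spaced-out contractions, dropping punctuation); the normalizer developed for the paper is far more extensive.

```python
import re

def toy_normalize(text: str) -> str:
    """Toy stand-in for the paper's text normalizer."""
    text = text.lower()
    text = re.sub(r"\s+'", "'", text)       # "don 't" -> "don't"
    text = re.sub(r"[^\w\s']", " ", text)   # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

ref, hyp = "I don't know.", "i don 't know"
print(wer(ref, hyp))                                # > 0: penalized for style only
print(wer(toy_normalize(ref), toy_normalize(hyp)))  # 0.0 after normalization
```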
3.3 English Speech Recognition
Whisper models, which are trained on a broad and diverse distribution of audio and evaluated in a zero-shot setting, could potentially match human behavior much better than existing systems.
3.4. Multi-lingual Speech Recognition
We suspect the underperformance of Whisper models on VoxPopuli could be due to
1) other models including this distribution (VoxPopuli) as a major source for their unsupervised pre-training data
2) and the dataset(VoxPopuli) having significantly more supervised data, which benefits fine-tuning.
3.5. Translation
The correlation between the amount of translation training data per language and the resulting zero-shot BLEU score on Fleurs is much weaker than for speech recognition, with a coefficient of only 0.24.
We suspect this is partly due to errors in audio language identification. CY (Welsh) is an outlier with much worse than expected performance: inspection shows that the majority of the audio labeled as Welsh was actually English audio mis-classified as Welsh by the language identification system, so the supposed Welsh→English data was really English→English.
3.6. Language Identification
On the 82 languages that overlap between Whisper's training data and Fleurs, the best Whisper model achieves 80.3% accuracy.
3.7. Robustness to Additive Noise
*SNR (signal-to-noise ratio) quantifies the level of additive noise; a lower SNR means more noise.
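A small sketch of how additive noise at a target SNR could be mixed into clean speech for this kind of robustness test. The white-noise stand-in and the function name are assumptions; the specific noise sources used in the paper are not reproduced here.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech so the result has the requested SNR in dB."""
    noise = np.resize(noise, speech.shape)          # loop/trim noise to length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Choose a scale so that 10*log10(speech_power / (scale**2 * noise_power)) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: sweep from nearly clean (40 dB) down to very noisy (-10 dB).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16_000)   # stand-in for 1 s of 16 kHz speech
white = rng.standard_normal(16_000)
for snr in (40, 20, 10, 0, -10):
    noisy = add_noise_at_snr(speech, white, snr)
    # feed `noisy` to the ASR model and track WER as the SNR decreases
```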
3.8. Long-form Transcription
Whisper models are trained on 30-second audio chunks and cannot consume longer audio inputs at once
To handle longer audio, the authors developed a strategy for buffered transcription: consecutively transcribe 30-second segments and shift the window according to the timestamps predicted by the model.
Beam search and temperature scheduling, based on the repetitiveness and log-probability of the model's predictions, are needed to make this reliable.
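A hedged sketch of the buffered-transcription loop: decode a 30-second window, then shift by the last predicted timestamp rather than by a fixed 30 seconds. `transcribe_window` is a placeholder for the model call, the assumed return format is mine, and beam search / temperature fallback are omitted; this illustrates the windowing idea, not Whisper's actual decoding code.

```python
CHUNK_SECONDS = 30.0

def transcribe_long(audio, duration_s: float, transcribe_window):
    """Buffered transcription of long-form audio.

    transcribe_window(audio, start_s) is assumed to return a list of
    (start, end, text) segments with times relative to the window start;
    a trailing partial segment has end=None and marks the next seek point.
    """
    results, offset = [], 0.0
    while offset < duration_s:
        segments = transcribe_window(audio, offset)
        if not segments:
            offset += CHUNK_SECONDS                  # nothing decoded, skip ahead
            continue
        for start, end, text in segments:
            if end is None:
                # Partial segment: resume decoding from its predicted start time
                # (fall back to a full-chunk shift if that would not advance).
                offset += start if start > 0 else CHUNK_SECONDS
                break
            results.append((offset + start, offset + end, text))
        else:
            offset += CHUNK_SECONDS                  # window fully consumed
    return results
```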
3.9. Comparison with Human Performance
4. Analysis and Ablations
4.1 Model Scaling
4.2 Dataset Scaling
This mirrors the diminishing returns observed with model size scaling for English speech recognition and could similarly be explained by saturation effects when approaching human-level performance.
The general trend across tasks of diminishing returns when moving from 54,000 hours to our full dataset size of 680,000 hours could suggest that
1) the current best Whisper models are under-trained relative to dataset size and performance could be further improved by a combination of longer training and larger models
or
2) It could also suggest that we are nearing the end of performance improvements from dataset size scaling for speech recognition.
4.3. Multitask and Multilingual Transfer
Jointly training a single model on many tasks and languages might have negative transfer where interference between the learning of several tasks results in performance worse than would be achieved by training on only a single task or language.
The amount of learning devoted to a task is measured by the FLOPs spent training on it.
If Whisper is trained with N total FLOPs across all languages and tasks, the compute spent solely on English speech recognition is about 0.65*N.
Therefore, for a fair comparison, an English-only model trained with 0.65*N FLOPs is compared to Whisper (which is trained with N FLOPs).
*joint models also slightly outperform English-only models even when not adjusting for compute spent per task
4.4. Text Normalization
Our normalizer might be overfitted to fixing Whisper’s peculiarities rather than addressing general variation in transcription.
To investigate this, compare performance when each model's output is scored with our normalizer versus an independently developed normalizer (from the FairSpeech project).
For two different models, we check how much the WER drops when using our normalizer compared to FairSpeech's normalizer. If, for example, this drop is large only for Whisper, that indicates our normalization is overfitted specifically to Whisper.
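A sketch of that comparison, reusing the `wer` and normalizer ideas from the earlier evaluation sketch. The data structures and function names are illustrative; the point is to measure, per model, how much our normalizer reduces WER relative to FairSpeech's.

```python
def corpus_wer(normalizer, references, hypotheses, wer_fn):
    """Average WER after applying the same normalizer to both sides."""
    scores = [wer_fn(normalizer(r), normalizer(h))
              for r, h in zip(references, hypotheses)]
    return sum(scores) / len(scores)

def normalizer_gap(references, outputs_by_model, our_norm, fairspeech_norm, wer_fn):
    """Per model: WER with FairSpeech's normalizer minus WER with ours.

    A gap that is large only for Whisper would suggest our normalizer is
    overfitted to Whisper's transcript style rather than to general variation.
    """
    return {
        model: corpus_wer(fairspeech_norm, references, hyps, wer_fn)
               - corpus_wer(our_norm, references, hyps, wer_fn)
        for model, hyps in outputs_by_model.items()
    }
```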
4.5 Strategies for Reliable Long-form Transcription
5. Conclusion
Training on a large and diverse weakly-supervised dataset and focusing on zero-shot transfer (performing well on tasks without fine-tuning) can significantly improve the robustness of speech recognition systems.