0. Abstract
Natural human interaction often relies on speech, necessitating a shift towards voice-based models.
A straightforward approach to achieve this involves a pipeline of “Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)”.
However, this method suffers from inherent limitations, such as information loss during modality conversion and error accumulation across the three stages.
Speech Language Models (SpeechLMs), end-to-end models that generate speech without converting from text, have emerged as a promising alternative.
This survey paper provides
- a comprehensive overview of recent methodologies for constructing SpeechLMs
- the key components of their architecture
- various training recipes integral to their development
- challenges and future directions
1. Introduction
“Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)”
This naive solution mainly suffers from the following two problems:
- Information loss.
- Speech signals contain not only semantic information (i.e., the meaning of the speech) but also paralinguistic information (e.g., pitch, timbre, tonality, etc.).
- Putting a text-only LLM in the middle causes the complete loss of paralinguistic information in the input speech.
- Cumulative error.
- A staged approach like this can easily lead to cumulative errors throughout the pipeline, particularly in the ASR-LLM stage
- Specifically, transcription errors that occur when converting speech to text in the ASR module can negatively impact the language generation performance of the LLM.
"SpeechLM"
- applicable in areas like personalized assistants, emotion-aware systems, and more nuanced human-computer interaction scenarios. (more complex tasks)
- enable real-time voice interaction, where the model can be interrupted by humans or choose to speak while the user is still speaking, which resembles the pattern of human conversations more closely.
2. Problem Formulation
A Speech Language Model (SpeechLM) is an autoregressive foundation model that processes and generates speech data, utilizing contextual understanding for coherent sequence generation.
It supports both speech and text modalities, such as speech-in-text-out, text-in-speech-out, or speech-in-speech-out, enabling a wide range of tasks with context-aware capabilities.
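As a compact statement of this (notation introduced here for illustration, not taken from the survey), the model autoregressively factorizes the probability of a mixed speech/text token sequence:

```latex
% x_t is a speech or text token; C is the (speech and/or text) context
P(x_1, \dots, x_T \mid C) = \prod_{t=1}^{T} P_\theta\left(x_t \mid x_{<t}, C\right)
```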
3. Components in SpeechLM
- speech tokenizer
- language model (decoder-only Transformer)
- token-to-speech synthesizer (vocoder)
Specifically, the speech tokenizer first transforms continuous audio waveforms into discrete tokens to serve as input to the language model;
then the language model performs next-token prediction based on the input speech tokens;
finally, the vocoder transforms the discrete tokens output by the language model back into audio waveforms.
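A minimal sketch of this three-stage flow, assuming placeholder `tokenizer`, `lm`, and `vocoder` objects (the method names are illustrative, not from any specific system):

```python
# Minimal sketch of the tokenizer -> language model -> vocoder flow.
# `tokenizer`, `lm`, and `vocoder` are placeholder objects with assumed interfaces.

def speech_to_speech(wave, tokenizer, lm, vocoder):
    in_tokens = tokenizer.encode(wave)     # waveform -> discrete speech tokens
    out_tokens = lm.generate(in_tokens)    # autoregressive next-token prediction
    return vocoder.decode(out_tokens)      # discrete tokens -> output waveform
```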
3.1 Speech Tokenizer
A speech tokenizer encodes continuous audio signals (waveforms) into latent representations and then converts those latent representations into discrete tokens (sometimes called speech units).
- speech encoder : encodes the essential information from the waveform, f()
- v = f(wave)
- quantizer : discretizes continuous representations into discrete tokens, d()
- s = d(v) or s = d(wave)
- the discrete tokens s can serve as target labels for training the speech tokenizer (wave -> s)
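A toy PyTorch sketch of the f()/d() decomposition above; the convolutional encoder and the nearest-neighbour codebook lookup are illustrative choices, not a specific published tokenizer:

```python
import torch
import torch.nn as nn

class ToySpeechTokenizer(nn.Module):
    def __init__(self, codebook_size=1024, dim=256):
        super().__init__()
        # f(): strided 1-D convolutions turn the waveform into frame-level features v
        self.encoder = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )
        # d(): codebook used for nearest-neighbour vector quantization
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, wave):                           # wave: (batch, samples)
        v = self.encoder(wave.unsqueeze(1))            # v = f(wave): (batch, dim, frames)
        v = v.transpose(1, 2)                          # (batch, frames, dim)
        dists = torch.cdist(v, self.codebook.weight.unsqueeze(0))
        s = dists.argmin(dim=-1)                       # s = d(v): discrete token ids
        return s

tokens = ToySpeechTokenizer()(torch.randn(1, 16000))   # ~1 s of 16 kHz audio
```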
3.1.1 Semantic Understanding Objective
This objective focuses on extracting semantic features so that the tokens accurately capture the content and meaning of the speech.
3.1.2 Acoustic Generation Objective
This objective focuses on capturing the acoustic features necessary for generating high-quality speech waveforms.
The tokenizer should preserve essential acoustic characteristics, making the tokens suitable for speech synthesis tasks.
3.1.3 Mixed Objective
This objective aims to balance both semantic understanding and acoustic generation.
Most existing mixed speech tokenizers primarily adopt the architecture of acoustic generation speech tokenizers and focus on distilling information from semantic tokenizers into the acoustic tokenizer.
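A hedged sketch of one common form of this distillation, a per-frame cosine loss that pulls the first quantizer level of an acoustic tokenizer toward features from a frozen semantic teacher (e.g., a HuBERT-style encoder); the shapes and the exact loss form vary across papers:

```python
import torch
import torch.nn.functional as F

def semantic_distillation_loss(first_level_quantized, teacher_features):
    """Both inputs: (batch, frames, dim), assumed to be time-aligned."""
    cos = F.cosine_similarity(first_level_quantized, teacher_features, dim=-1)
    return (1.0 - cos).mean()   # maximize per-frame cosine similarity with the teacher

loss = semantic_distillation_loss(torch.randn(2, 100, 768), torch.randn(2, 100, 768))
```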
3.2 Language Model
- Et : text embedding matrix
- De : decoder block
- Et' : text output embedding matrix
To adapt the language model to generate speech, the original text tokenizer is replaced with a speech tokenizer.
- Es : speech embedding matrix
- Es' : speech output embedding matrix
A naive approach is to expand the vocabulary of the original TextLM to incorporate both text and speech tokens.
- Ej : text + speech embedding matrix
- Ej' : text + speech output embedding matrix
By doing so, the model can generate both text and speech in a single sequence, enabling much more diverse applications
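A small sketch of this joint vocabulary, Ej = [Et ; Es]: the pretrained text rows are copied over and new speech-token rows are appended (sizes are illustrative; with Hugging Face models the same effect is typically achieved via model.resize_token_embeddings):

```python
import torch
import torch.nn as nn

text_vocab, speech_vocab, hidden = 32000, 1024, 4096     # illustrative sizes
E_t = nn.Embedding(text_vocab, hidden)                    # pretrained text embeddings
E_s = nn.Embedding(speech_vocab, hidden)                  # new speech-token embeddings
E_j = nn.Embedding(text_vocab + speech_vocab, hidden)     # joint embedding matrix Ej
with torch.no_grad():
    E_j.weight[:text_vocab] = E_t.weight                  # reuse the text rows
    E_j.weight[text_vocab:] = E_s.weight                  # newly initialized speech rows
```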
3.3 Token-to-Speech Synthesizer (Vocoder)
A vocoder is utilized to synthesize the speech tokens back into speech waveforms.
- Direct synthesis
- the pipeline where the vocoder directly converts discrete speech tokens generated by the language model into audio waveforms.
- Input-enhanced synthesis
- employs an additional module to transform the tokens into a continuous latent representation before they are fed into the vocoder
- The main reason for using this pipeline is that vocoders typically require intermediate audio representations, such as mel-spectrograms, as input.
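A conceptual sketch contrasting the two pipelines; `vocoder`, `token_to_mel`, and `mel_vocoder` are placeholder modules, not concrete implementations:

```python
def direct_synthesis(tokens, vocoder):
    # the vocoder consumes discrete tokens directly and outputs a waveform
    return vocoder(tokens)

def input_enhanced_synthesis(tokens, token_to_mel, mel_vocoder):
    # an extra module first maps tokens to a continuous intermediate representation
    mel = token_to_mel(tokens)     # e.g., a mel-spectrogram
    return mel_vocoder(mel)        # a conventional mel-conditioned vocoder
```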
4. Training Recipes
4.1 Features Modeled
- Discrete Features
- refer to quantized representations of speech signals that can be represented as distinct, countable units or tokens.
- derived from speech signals through various encoding and quantization processes, resulting in a finite set of possible values.
- 4 main types of features
- semantic tokens
- the majority of speech tokenizers produce discrete tokens that better model the semantic information within a speech waveform
- paralinguistic tokens
- speech generated solely from semantic tokens lacks expressive information such as prosody and different pitches or timbres
- so pitch or style tokens are added on top of the semantic tokens
- acoustic tokens
- aim to capture the essential acoustic features to reconstruct high-fidelity speech
- mixed tokens
- jointly model semantic and acoustic information, showing promising results in Moshi [Defossez et al., 2024]
- Continuous Features
- un-quantized, real-valued representations of speech signals that exist on a continuous scale (frame-by-frame)
- include spectral representations like mel-spectrograms or latent representations extracted from neural networks.
- The exploration of leveraging continuous features to condition SpeechLMs is still in its infancy.
- Mini-Omni [Xie and Wu, 2024] extracts intermediate representations from a frozen Whisper encoder as input for the SpeechLM
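A hedged sketch of this continuous-feature route (Mini-Omni-style): intermediate representations are taken from a frozen Whisper encoder and fed to the SpeechLM instead of discrete tokens. The checkpoint name and usage below are illustrative, not Mini-Omni's exact setup:

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder
encoder.requires_grad_(False)            # keep the speech encoder frozen

wave = torch.randn(16000 * 5)            # placeholder for 5 s of 16 kHz audio
inputs = feature_extractor(wave.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    continuous_features = encoder(inputs.input_features).last_hidden_state
# continuous_features: (1, frames, hidden) real-valued sequence passed to the LM
```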
4.2 Training Stages
- Cold Initialization : train the model from scratch (i.e., with randomly initialized weights)
- Continued pre-training :
- involves initializing the language model with pre-trained weights from a TextLM and then adapting it to handle speech tokens
- SpeechLMs benefit from both a larger pre-trained checkpoint and a larger training dataset [Hassid et al., 2024]
- training from text-pretrained checkpoints outperforms cold initialization [Hassid et al., 2024]
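A sketch contrasting the two initialization schemes; "gpt2" is just a small, openly available stand-in for whatever TextLM checkpoint a SpeechLM actually builds on, and the 1024 extra rows stand in for a speech-token vocabulary:

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("gpt2")

# Cold initialization: same architecture, randomly initialized weights
cold_model = AutoModelForCausalLM.from_config(config)

# Continued pre-training: start from the pretrained text checkpoint,
# extend the vocabulary with speech tokens, then keep training on speech data
warm_model = AutoModelForCausalLM.from_pretrained("gpt2")
warm_model.resize_token_embeddings(config.vocab_size + 1024)
```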
4.3 Language Model Instruction Tuning
- fine-tuning SpeechLMs to follow specific instructions to perform a wide range of tasks.
- increases generalization performance in zero-shot settings by training the model on instruction-augmented datasets
- the key is to create effective instruction-following datasets and use them for fine-tuning
- e.g.,
- instruction data are generated from ASR datasets by appending an instruction to paired ASR data, asking the model to convert speech into text (see the sketch after this list).
- Llama-Omni [Fang et al., 2024] creates a speech-in-speech-out instruction-following dataset by transforming a text-based instruction-following dataset using TTS.
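A hedged sketch of the ASR-based construction from the first example: each (speech, transcript) pair is wrapped with a natural-language instruction. Field names and the instruction string are illustrative, not from a specific dataset:

```python
# Build instruction-following examples from an ASR dataset.
ASR_INSTRUCTION = "Convert the following speech into text."

def asr_pair_to_instruction_example(speech_tokens, transcript):
    return {
        "instruction": ASR_INSTRUCTION,
        "input": speech_tokens,   # discrete speech tokens of the utterance
        "output": transcript,     # target text the model should produce
    }
```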
4.4 Speech Generation Paradigm
The approach above does not reflect the natural flow of voice interactions.
For instance, during a conversation, one person may interrupt another, switching from listening to speaking.
Similarly, a person might choose not to respond if the other party is engaged in a conversation with someone else.
- Real-time Interaction
- requires the model to effectively perform speech understanding (processing input) and speech generation (producing output) simultaneously
- User Interruption: SpeechLMs should be able to be interrupted by users and should respond appropriately to new instructions provided during the conversation
- Simultaneous Response: SpeechLMs should be capable of generating responses while the user is still speaking
- Silence Mode
- Silence Mode refers to the state in which the SpeechLMs remain inactive or silent during periods of non-interaction.
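A purely conceptual sketch of a full-duplex loop covering the three behaviors above (simultaneous understanding/generation, user interruption, silence mode); every object and method here is a hypothetical placeholder, not an API from any real system:

```python
def duplex_loop(model, mic, speaker):
    state = model.initial_state()
    while True:
        user_frame = mic.read_frame()               # listen continuously
        state = model.ingest(state, user_frame)     # streaming speech understanding
        if model.detects_interruption(state):
            state = model.reset_response(state)     # user interruption: drop the current reply
        token = model.next_output_token(state)      # may be a special "silence" token
        if not model.is_silence(token):             # silence mode: emit nothing
            speaker.play(token)                     # simultaneous response while the user speaks
```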
5. Challenges and Future Directions
- SpeechLMs still need their three components to be trained separately. This separate optimization may hinder the model’s overall potential.
- It would be worthwhile to investigate whether training can be conducted in an end-to-end manner, allowing gradients to be back-propagated from the vocoder’s output to the tokenizer’s input.
- The most widely adopted approaches described in Section 3 still result in noticeable delays between the input speech and the generated output speech.
- This delay occurs because a typical vocoder must wait for the entire sequence of output tokens to be generated by the language model before it can start synthesizing audio.
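One way this latency is often mitigated in practice (a general technique, not a proposal from the survey) is chunk-wise vocoding: run the vocoder on fixed-size chunks of tokens as they stream out of the language model. `lm_token_stream` and `vocoder` are placeholders, and real systems must handle boundary artifacts between chunks:

```python
def streaming_synthesis(lm_token_stream, vocoder, chunk_size=20):
    buffer = []
    for token in lm_token_stream:       # tokens arrive one by one from the LM
        buffer.append(token)
        if len(buffer) == chunk_size:
            yield vocoder(buffer)       # emit audio for this chunk immediately
            buffer = []
    if buffer:
        yield vocoder(buffer)           # flush the remainder
```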