Abstract
With the rapid advancement of large language models (LLMs), LLM-based speech foundation
models have achieved significant improvements in both speech understanding and generation. These
advances enable more accurate and fine-grained comprehension of spoken language, as well as more
fluent and natural speech output for human-machine interaction. From the perspective of speech
representation, most existing speech LLMs either use separate representations for understanding
and generation or employ discrete representations for both. The former cannot support speech
editing, since understanding and generation operate on disjoint representations; the latter loses
acoustic detail through quantization. In terms of speech tasks, no single model can currently
perform both semantic and acoustic editing on input speech through free-form instructions, even
though an analogous multi-round editing capability has already been well demonstrated in the image
domain. To address the inconsistency of representations between understanding and generation, we
propose a continuous audio tokenizer that unifies both speech understanding and generation tasks.
Building on it, we introduce a novel capability, free-form speech editing, which enables
fine-grained and high-fidelity speech manipulation guided by natural-language instructions.
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
- 🔥 First speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
- 🔥 First universal free-form speech editing model covering semantic and acoustic tasks, with no temporal-regime constraints: Ming-UniAudio-Edit
- 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark