Abstract

With the rapid advancement of large language models (LLMs), LLM-based speech foundation models have achieved significant improvements in both speech understanding and generation. These advances enable more accurate and fine-grained comprehension of spoken language, as well as more fluent and natural speech output for human-machine interaction. From the perspective of speech representation, most existing speech LLMs either use separate representations for understanding and generation or employ discrete representations for both. The former approach cannot edit speech, while the latter loses speech detail through quantization. In terms of speech tasks, no single model can currently perform both semantic and acoustic editing on input speech through free-form instructions, although comparable multi-round, instruction-driven editing has already been well demonstrated in the image domain. To address the inconsistency between understanding and generation representations, we propose a continuous audio tokenizer that unifies speech understanding and generation. Furthermore, we introduce a novel capability, free-form speech editing, which enables fine-grained and high-fidelity speech manipulation.

- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
- 🔥 First speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
- 🔥 First universal free-form speech editing model for semantic and acoustic tasks that does not require specifying the time region to edit: Ming-UniAudio-Edit
- 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark

Model Architecture of Ming-UniAudio

[Figure: model architecture of Ming-UniAudio]

Overall Framework of The Unified Continuous Speech Tokenizer: MingTok-Audio

[Figure: overall framework of MingTok-Audio]

The two figures above show the architecture of MingTok-Audio and its three training stages.

Editing Tasks Video demos

Instruction-Guided Free-Form Speech Editing

Semantic Editing - Insert

| Instruction | Transcription | Target Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- | --- |
| insert '简直' after the character or word at index 8. | 真是个浪漫的邂逅可以说是英雄救美了 | 真是个浪漫的邂逅简直可以说是英雄救美了 | (audio) | (audio) |
| insert '真正' before the character or word '好'. | 就有道而正焉可谓好学也已 | 就有道而正焉可谓真正好学也已 | (audio) | (audio) |
| insert 'clearly' before the character or word at index 8. | Its legal status in Trinidad was insufficient to preserve its ecological status. | Its legal status in Trinidad was insufficient clearly to preserve its ecological status. | (audio) | (audio) |
| insert 'successfully' after the character or word 'profession'. | Previously an attorney Korona left the profession to pursue a career in music. | Previously an attorney Korona left the profession successfully to pursue a career in music. | (audio) | (audio) |

Semantic Editing - Substitute

| Instruction | Transcription | Target Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- | --- |
| substitute '妈妈' with '爸爸'. | 我想对于妈妈来说会比任何礼物都要温暖 | 我想对于爸爸来说会比任何礼物都要温暖 | (audio) | (audio) |
| substitute the characters or words from index 8 to index 10 with '五万元'. | 当时我想等筹齐两万元聘礼就送她妈回家 | 当时我想等筹齐五万元聘礼就送她妈回家 | (audio) | (audio) |
| substitute 'get pictures off' with 'transfer photos from'. | I'm trying to explain to my mother how to get pictures off her phone. | I'm trying to explain to my mother how to transfer photos from her phone. | (audio) | (audio) |
| substitute the words from index 8 to index 9 with 'could become'. | Considering the growth of human population insects might be the food of the future. | Considering the growth of human population insects could become the food of the future. | (audio) | (audio) |

Semantic Editing - Delete

| Instruction | Transcription | Target Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- | --- |
| delete '比普通的茶叶要'. | 花草茶的口味一般比普通的茶叶要苦一些 | 花草茶的口味一般苦一些 | (audio) | (audio) |
| delete the characters or words from index 11 to index 15. | 我吃了点燕麦片煎鸡蛋还喝了点橙汁 | 我吃了点燕麦片煎鸡蛋汁 | (audio) | (audio) |
| delete 'times'. | The classification of this gibbon has changed several times in the past few years. | The classification of this gibbon has changed several in the past few years. | (audio) | (audio) |
| delete the characters or words from index 2 to index 6. | On the second day the boy climbed to the top of a cliff near the camp | On climbed to the top of a cliff near the camp | (audio) | (audio) |
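
The index-based instructions in the three semantic editing tables appear to use 1-based, inclusive indices, counting single characters for Chinese and whitespace-separated words for English. The sketch below is a text-only reference for that reading. It only computes the expected target transcription from an instruction's parameters; it is not the model's editing procedure. The helper names `insert_unit`, `substitute_range`, and `delete_range` are hypothetical.

```python
import re

# Assumption: text containing any CJK ideograph is indexed per character;
# everything else is indexed per whitespace-separated word.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def _units(text):
    """Return (list of edit units, separator used to join them back)."""
    if _CJK.search(text):
        return list(text), ""
    return text.split(), " "


def insert_unit(text, new, index, before=False):
    """Insert `new` after (default) or before the unit at 1-based `index`."""
    units, sep = _units(text)
    pos = index - 1 if before else index
    return sep.join(units[:pos] + [new] + units[pos:])


def substitute_range(text, start, end, new):
    """Replace units start..end (1-based, inclusive) with `new`."""
    units, sep = _units(text)
    return sep.join(units[:start - 1] + [new] + units[end:])


def delete_range(text, start, end):
    """Remove units start..end (1-based, inclusive)."""
    units, sep = _units(text)
    return sep.join(units[:start - 1] + units[end:])


if __name__ == "__main__":
    print(insert_unit("真是个浪漫的邂逅可以说是英雄救美了", "简直", 8))
    print(substitute_range("当时我想等筹齐两万元聘礼就送她妈回家", 8, 10, "五万元"))
    print(delete_range("我吃了点燕麦片煎鸡蛋还喝了点橙汁", 11, 15))
```

Under these assumptions the helpers reproduce the target transcriptions of the index-based rows above, which is a quick way to check new benchmark entries for off-by-one errors.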

Acoustic Editing - Dialect Conversion

| Instruction | Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- |
| Change the accent of the speech to Dongbei. | 之后,他考取导游证,成为拱北口岸中旅的导游。 | (audio) | (audio) |
| Change the accent of the speech to Chengdu. | 只有当科技为本地社群创造价值的时候,才能真正有意义。 | (audio) | (audio) |
| Change the accent of the speech to Chengdu. | 我得用回想与幻想补充我所缺少的饮食,安慰我所得到的痛苦。 | (audio) | (audio) |
| Change the accent of the speech to Guangxi. | 全国恶性肿瘤发病,及死亡第一位的是肺癌。 | (audio) | (audio) |

Acoustic Editing - Speed

| Instruction | Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- |
| adjusts the speed to 0.5. | 我用胸抵住车把,掌握方向,速度一点也不比别人慢。 | (audio) | (audio) |
| adjusts the speed to 0.7. | There is a growing body of case law on Bayh-Dole. | (audio) | (audio) |
| adjusts the speed to 1.3. | Cribb was born near Bristol but moved to London before starting professional fighting. | (audio) | (audio) |
| adjusts the speed to 2. | 切实帮助困难群众解决生产生活中,遇到的困难和问题。 | (audio) | (audio) |
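
The page does not define the speed value. Assuming it is a tempo multiplier (0.5 is half speed, 2 is double speed), the expected duration of an edited clip can be estimated with the sketch below. `edited_duration` is a hypothetical helper, useful only as a sanity check on output length.

```python
def edited_duration(duration_s: float, speed: float) -> float:
    """Expected clip length in seconds after a tempo change.

    Assumption: `speed` is a multiplier on playback tempo, so the
    duration scales by 1 / speed.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return duration_s / speed


if __name__ == "__main__":
    for speed in (0.5, 0.7, 1.3, 2):
        print(speed, round(edited_duration(4.0, speed), 3))
```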

Acoustic Editing - Pitch

| Instruction | Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- |
| shifts the pitch by 3 steps. | 因为外面有战争,家里又有战争带来的悲伤和匮乏。 | (audio) | (audio) |
| shifts the pitch by 5 steps. | 自动驾驶将大幅提升出行安全,效率。 | (audio) | (audio) |
| shifts the pitch by -1 steps. | The heart of the campus has a number of historic buildings. | (audio) | (audio) |
| shifts the pitch by -1 steps. | Stevenson is also the director of music ministries at Angeles Mesa Presbyterian Church. | (audio) | (audio) |
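
The page does not say what a pitch "step" is. If a step is read as one semitone in 12-tone equal temperament, which is an assumption, a shift of n steps multiplies every frequency by 2^(n/12). A minimal sketch of that relationship, with hypothetical helper names:

```python
def pitch_ratio(steps: float) -> float:
    """Frequency ratio for a shift of `steps` semitones (assumed unit)."""
    return 2.0 ** (steps / 12.0)


def shifted_frequency(f0_hz: float, steps: float) -> float:
    """Fundamental frequency after shifting by `steps` semitones."""
    return f0_hz * pitch_ratio(steps)


if __name__ == "__main__":
    # Shifts used in the table above, applied to a 220 Hz fundamental.
    for steps in (3, 5, -1):
        print(steps, round(shifted_frequency(220.0, steps), 2))
```

Positive steps raise the pitch and negative steps lower it; 12 steps correspond to one octave under this reading.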

Acoustic Editing - Volume

| Instruction | Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- |
| adjusts the volume to 1.4. | A woman sits as she shows the designs she has made in the floor. | (audio) | (audio) |
| adjusts the volume to 1.6. | For example, they both consist of predominately older, hence redder, stars. | (audio) | (audio) |
| adjusts the volume to 0.9. | 伏羲的儿孙们看见伏羲捉来了鱼,也都欢欢喜喜跑来问长问短。 | (audio) | (audio) |
| adjusts the volume to 0.3. | 他们还告诉巨人,那座城市里群英荟萃。 | (audio) | (audio) |
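
The volume value is not defined on the page. Assuming it is a linear amplitude gain applied to samples normalized to [-1, 1], the conventional signal-processing equivalent looks like the sketch below. `apply_gain` and `gain_to_db` are hypothetical helpers and say nothing about how the model itself performs the edit.

```python
import math


def apply_gain(samples, gain):
    """Scale samples by a linear gain and clip to the valid range [-1, 1]."""
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


def gain_to_db(gain):
    """Convert a linear amplitude gain to decibels."""
    if gain <= 0:
        raise ValueError("gain must be positive")
    return 20.0 * math.log10(gain)


if __name__ == "__main__":
    # Gains used in the table above.
    for gain in (1.4, 1.6, 0.9, 0.3):
        print(gain, round(gain_to_db(gain), 2), "dB")
```

Gains above 1 can push loud passages into clipping, which is why the sketch clamps the scaled samples.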

Acoustic Editing - Denoise

| Instruction | Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- |
| denoise the audio. | Be shape of example,before deriving this formula we explained what we mean by problems of this kind we now generalize these ideas for general binomial experiments. | (audio) | (audio) |
| denoise the audio. | Summoned to himself with firmness no surrender his superiors had also preached this saying it was the way of eternal honor his comrades were old. | (audio) | (audio) |
| denoise the audio. | There are people who travel long distances to assure my continued existence we have also seen the power of faith at work among us it was muscular but it wasn't symmetrical. | (audio) | (audio) |
| denoise the audio. | Theory eventually proved inexact the heavens refused to give up their weeping but what has been happening recently might be described as creeping mannerism clever. | (audio) | (audio) |

Acoustic Editing - Background Music

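| Instruction | Before Edit Speech | Edit Result |
| --- | --- | --- |
| add rain to audio. | (audio) | (audio) |
| add car sound to audio. | (audio) | (audio) |
| add carefree music to audio. | (audio) | (audio) |
| add groovy music to audio. | (audio) | (audio) |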

Acoustic Editing - Emotion Conversion

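| Instruction | Transcription | Before Edit Speech | Edit Result |
| --- | --- | --- | --- |
| change the emotion to happy mood. | 比尔想再看小主人一眼然后走进森林安静地死去。 | (audio) | (audio) |
| change the emotion to happy mood. | 世界爱眼日是每年十月的第二个星期四。 | (audio) | (audio) |
| change the emotion to happy mood. | 我会玩很多游戏呢听说多喝水能治百病。 | (audio) | (audio) |
| change the emotion to happy mood. | 建议戴口罩空气质量轻度污染。 | (audio) | (audio) |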

Audio Understanding

Chinese and English ASR

| Input | Transcription |
| --- | --- |
| (audio) | 呃很久没有看到看过如此不带价值判断的电影 |
| (audio) | 桃花庄人塔俱乐部是位于杭州市德清县的一个俱乐部 |
| (audio) | he was excited and at the same time uneasy maybe the girl had already forgotten him |
| (audio) | it's true that everything has its destiny but one day that destiny will be realized |

Dialect Understanding

| Input | Transcription |
| --- | --- |
| (audio) | [方言-粤语] 你做乜嘢啊系咪唔想倾偈啊。 |
| (audio) | [方言-上海话] 阿拉考试还没定下来唻。 |
| (audio) | [方言-闽南语] 宝贝较早休困晚安。 |
| (audio) | [方言-川渝方言] 我难受得很别个都睡了。 |

Context ASR

| Input | Prompt | Transcription |
| --- | --- | --- |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about Banking. This audio may contains the following words or phrases:Zelle,daily A C H transfer limit,cashier's checks,transaction memos,F D I C regulations,cryptocurrency wallet,K Y C requirements. | Hey Chris, you won't believe what happened when I tried sending rent through Zelle yesterday. I hit some daily ACH transfer limit! My landlord's insisting on cashier's checks now. Remember how Sarah's Venmo payment got flagged last month? The bank's fraud detection system kept asking about transaction memos and 'source of funds' verification. Honestly, these FDIC regulations around peer-to-peer payments are getting ridiculous. I had to provide three months of bank statements just to increase my wire transfer threshold. Oh, and don't even get me started on cryptocurrency wallet KYC requirements. |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about Banking. This audio may contains the following words or phrases:Priority Pass lounges,T S A Pre Check,rewards structure,bonus miles,Citibank's Prestige Card,Visa Infinite,E M V chip security protocols,dynamic currency conversion. | So listen, I finally canceled my Chase Sapphire Reserve last week. Remember how they touted those Priority Pass lounges and Luxury Hotel Collection benefits? Turns out I only used the T S A Pre Check credit once this whole year! The annual fee jumped to five hundred fifty dollars, plus they started requiring eighteen thousand points to waive it. My Amex Platinum isn't any better that seven hundred dollar fee just hit, and their new rewards structure requires thirty thousand in annual spending for bonus miles. Oh, and get this Citibank's Prestige Card now charges two hundred bucks for authorized users! Honestly, these Visa Infinite perks like concierge services and purchase protection sound fancy, but when do regular people actually use E M V chip security protocols or dynamic currency conversion? |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about 酒店常旅客计划. This audio may contains the following words or phrases:至悦大使,重庆来福士洲际,酒廊待遇,万豪旅享家,钛金会员. | 诶?小李,我最近在研究IHG的会员体系,这个‘至悦大使’的达标条件也太苛刻了吧!‘三百权益’里,洲际的认可房晚才给三十晚。你说,他们家的‘先行者任务’算不算‘里程碑奖励’啊?对了,我之前用积分兑换重庆来福士洲际的行政套房,礼宾部居然没给酒廊待遇,反而现金订房的客人能拿到双早。万豪旅享家的‘钛金会员’都能自动匹配套房升级券,IHG这个动态定价系统真是让人头大! |
| (audio) | Please recognize the language of this speech and transcribe it. Format: oral. This is an audio about 汽车行业. This audio may contains the following words or phrases:汽车之家曹雷,矩阵式 L E D 大灯,四十八伏轻混系统,可变气门升程技术,M B U X 超联屏,Sportback,Allroad. | 嘿,老李,你看到‘汽车之家’曹雷发的文章没?说新款奥迪A3加长到四米六了。昨儿我去4S店试驾,销售说这车配了啥矩阵式LED大灯,还有四十八伏轻混系统。不过,宝马1系那个B48发动机也改了‘可变气门升程技术’,奔驰A级更夸张,直接把MBUX超联屏塞进紧凑车里!要我说啊,现在车企搞细分市场真够拼的!听说奥迪还要出Sportback、Allroad等四个版本呢,连自适应巡航都标配了! |
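
The four context ASR prompts above share one template: a fixed instruction, a topic sentence, and a comma-joined list of hint words or phrases. The sketch below assembles a prompt in that shape. `build_context_asr_prompt` is a hypothetical helper, and the wording (including "may contains") is copied verbatim from the examples rather than taken from any documented API.

```python
def build_context_asr_prompt(topic, hotwords):
    """Assemble a context ASR prompt in the shape of the examples above.

    Assumption: the template is inferred from the four demo rows; the
    model may accept other phrasings.
    """
    if not hotwords:
        raise ValueError("at least one hint word or phrase is expected")
    return (
        "Please recognize the language of this speech and transcribe it. "
        "Format: oral. "
        f"This is an audio about {topic}. "
        # Wording kept exactly as it appears in the demo prompts.
        "This audio may contains the following words or phrases:"
        + ",".join(hotwords)
        + "."
    )


if __name__ == "__main__":
    print(build_context_asr_prompt("Banking", ["Zelle", "cashier's checks"]))
```

In the demo prompts, acronyms are listed letter by letter (for example "F D I C regulations"), so callers following the same pattern would pre-format such entries before passing them in.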

Audio Generation

Voice Clone

| Input Prompt | Target Text | TTS Result |
| --- | --- | --- |
| (audio) | 全球每年有超过一百三十五万人,因交通事故而死亡。 | (audio) |
| (audio) | The stained glass offered a hypnotic atmosphere. | (audio) |

Multi-lingual Synthesis

| Input Prompt Text | Input Prompt Audio | Target Text | TTS Result |
| --- | --- | --- | --- |
| We asked over twenty different people, and they all said it was his. | (audio) | The stained glass offered a hypnotic atmosphere. | (audio) |
| The wedding was photographed by celebrity wedding photographer Kid Chan. | (audio) | Bender also conducted extensive research on autism. | (audio) |
| 关于不少万达广场的注册资本金更改。 | (audio) | 哎,这些情况在北京这样的大都市,是无法避免的。 | (audio) |
| 长春周二之前晴天多云五月七日是晴天。 | (audio) | 两人一直对婚变封口,使传闻闹得热烘烘。 | (audio) |

Reference

      
@article{Mingomni2025,
  title   = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
  author  = {Inclusion AI, Ant Group},
  journal = {Technical Report},
  year    = {2025}
}