Mini-Omni: Open-Source Model for Real-Time Speech Interaction


Most current academic language models rely on external Text-to-Speech (TTS) systems, which introduce noticeable latency before any speech is produced. To address this, the Mini-Omni model introduces an end-to-end, audio-based conversational capability that supports real-time speech interaction without an additional TTS system. This is achieved through a text-instructed speech generation method and batch-parallel strategies at inference time, preserving the model's original language capabilities with minimal degradation.

The researchers call this approach “Any Model Can Talk.” Equipped with this framework, the Qwen2-0.5B model can handle speech tasks effectively while retaining its original language processing strengths.

The accompanying VoiceAssistant-400K dataset supports fine-tuning models for optimized speech output. To date, Mini-Omni is the first open-source model to offer fully end-to-end real-time speech interaction, paving the way for future research in this field.

The project is available on GitHub and Hugging Face.

Key Features

  • Real-Time Speech-to-Speech Capabilities: Mini-Omni allows for natural speech conversations, handling both input and output in audio form.
  • Simultaneous Text and Audio Generation: The model can “talk while thinking,” generating text and audio simultaneously, ensuring a smooth conversational flow.
  • Streaming Audio Output: Mini-Omni supports real-time streaming of audio outputs, making it ideal for interactive applications (a hypothetical client sketch follows this list).
  • Batch Inference: The model includes “Audio-to-Text” and “Audio-to-Audio” batch inference methods to further boost performance.
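To make the streaming behavior concrete, here is a minimal, hypothetical client for a locally hosted Mini-Omni demo server. The endpoint path, payload field, and chunk format are assumptions for illustration only; consult the GitHub repository for the actual interface.

```python
# Hypothetical client for a locally hosted Mini-Omni demo server.
# The endpoint path, payload field, and chunk format are assumptions;
# check the project's GitHub repository for the real API.
import requests

def stream_speech_reply(audio_path, url="http://localhost:60808/chat"):
    """Send recorded speech and yield the audio reply as it streams back."""
    with open(audio_path, "rb") as f:
        audio_bytes = f.read()
    with requests.post(url, files={"audio": audio_bytes}, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=4096):
            if chunk:
                yield chunk  # raw audio bytes, ready for an audio sink

# Usage: persist the streamed reply as it arrives.
with open("reply.wav", "wb") as out:
    for chunk in stream_speech_reply("question.wav"):
        out.write(chunk)
```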

Model Architecture

The Mini-Omni model is built on Qwen2-0.5B, a transformer architecture with 24 blocks and an internal dimension of 896. This foundation is enhanced with a Whisper-small encoder to process speech input effectively. This unique combination allows Mini-Omni to integrate real-time audio reasoning into its language processing capabilities.
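Both building blocks are published on Hugging Face, so a minimal sketch (not Mini-Omni's own loading code) can instantiate them with the `transformers` library and confirm the dimensions quoted above:

```python
# Minimal sketch: load the two published components with `transformers`
# to inspect the dimensions quoted above. This shows the raw building
# blocks only, not how Mini-Omni wires them together.
from transformers import AutoModelForCausalLM, WhisperModel

llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder

print(llm.config.num_hidden_layers)  # 24 transformer blocks
print(llm.config.hidden_size)        # 896 internal dimension
print(encoder.config.d_model)        # 768-dim Whisper-small features
```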

Figure: Mini-Omni model architecture.

Parallel Text-Audio Generation: Mini-Omni employs a parallel text-audio generation method, using batch parallel decoding to generate speech and text simultaneously. This method ensures that the model can maintain its reasoning capabilities across modalities without compromising performance.
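In practice, parallel generation means each decoding step emits one text token plus one token per audio codebook layer, with later audio layers delayed by one step each so they can condition on earlier ones. The sketch below implements such a delay pattern; the seven-layer count and all names are assumptions for illustration (SNAC-style codecs use several codebook layers), not Mini-Omni's actual code.

```python
import torch

# Illustrative delay pattern: audio codebook layer i starts i steps after
# the text stream, so earlier layers can condition later ones. PAD and the
# seven-layer count are assumptions, not Mini-Omni's actual constants.
PAD = 0
NUM_AUDIO_LAYERS = 7

def apply_delay_pattern(audio_tokens: torch.Tensor) -> torch.Tensor:
    """Shift audio layer i right by i positions (audio_tokens: [layers, T])."""
    layers, T = audio_tokens.shape
    delayed = torch.full((layers, T + layers), PAD, dtype=audio_tokens.dtype)
    for i in range(layers):
        delayed[i, i : i + T] = audio_tokens[i]
    return delayed

# Example: 7 audio layers, 5 time steps each, shifted into a staggered grid.
tokens = torch.randint(1, 100, (NUM_AUDIO_LAYERS, 5))
print(apply_delay_pattern(tokens).shape)  # torch.Size([7, 12])
```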

Mini-Omni incorporates text-instruct mechanisms alongside batch-parallel generation techniques.

“Any Model Can Talk” Framework

This framework allows existing language models, such as LLaMA, Vicuna, and Baichuan, to adopt speech capabilities with minimal additional training. The process involves three phases: modality alignment, adaptation training, and multimodal fine-tuning, extending the speech interaction capabilities of these models.
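As a rough sketch of how such a phased schedule can be wired up in PyTorch, the snippet below toggles gradient updates for placeholder modules per phase. The module split and the exact freezing policy are assumptions inferred from the description above, not the project's training code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Enable or disable gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_phase(phase: int, llm: nn.Module, adapters: nn.Module) -> None:
    """Hypothetical freezing policy for the three 'Any Model Can Talk' phases."""
    if phase == 1:    # modality alignment: train the speech adapters only
        set_trainable(llm, False)
        set_trainable(adapters, True)
    elif phase == 2:  # adaptation training: update the language backbone
        set_trainable(llm, True)
        set_trainable(adapters, False)
    else:             # multimodal fine-tuning: train everything jointly
        set_trainable(llm, True)
        set_trainable(adapters, True)
```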

Datasets and Training

Mini-Omni was trained on 8,000 hours of speech data and 2 million text-based data points from the Open-Orca dataset. Additionally, the researchers introduced the VoiceAssistant-400K dataset, comprising 400,000 entries specifically designed for fine-tuning speech assistant models.
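VoiceAssistant-400K is published on Hugging Face, so it can be pulled with the `datasets` library; the hub id below is assumed from the project's naming and should be verified on the project page.

```python
# Minimal sketch: stream VoiceAssistant-400K with the `datasets` library.
# The hub id "gpt-omni/VoiceAssistant-400K" is an assumption; verify it
# on the project's Hugging Face page before running.
from datasets import load_dataset

ds = load_dataset("gpt-omni/VoiceAssistant-400K", split="train", streaming=True)
print(next(iter(ds)))  # inspect one of the ~400,000 fine-tuning entries
```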

Figure: Mini-Omni's three-stage training process: modality alignment, adaptation training, and multimodal fine-tuning.

Performance Evaluation

The model demonstrated strong performance in both Automatic Speech Recognition (ASR) and text-to-speech tasks. For example, it achieved a word error rate (WER) of 4.5% on the LibriSpeech test-clean set, trailing established models such as Whisper-small, which achieved 3.4%.
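For context, WER divides the number of word substitutions, deletions, and insertions needed to turn the hypothesis transcript into the reference by the reference word count. A minimal check with the `jiwer` library (the example strings are made up):

```python
# Word error rate, as reported on LibriSpeech test-clean:
# WER = (substitutions + deletions + insertions) / reference word count.
import jiwer

reference = "mini omni supports real time speech interaction"
hypothesis = "mini omni supports real time speech interactions"
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")  # 1 sub / 7 words = 14.29%
```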

Conclusion

The Mini-Omni model is a pioneering effort in real-time multimodal AI, offering significant advancements in speech interaction capabilities. Its innovative parallel text-audio generation method and the “Any Model Can Talk” framework provide a flexible, efficient way to enhance AI models with speech capabilities, opening new possibilities for real-time, interactive AI applications.
