Speech intelligence that passes the human test.

Frontier voice synthesis and transcription for natural, expressive, nuanced interactions.

Voice agents

Text-to-speech

Speech-to-text

Bank Agent
Hi, how can I help you today?
I'd like to open a business account. What do you need from me?
Let me fetch your ID on file.
Model Reasoning
Can you please confirm your registered address?
27 Clerkenwell Road, London.
Car assistant
Traffic ahead on the A40 near Hillingdon. I've rerouted you via the M25. Estimated arrival is now 9.15am, which still gives you 10 minutes before your meeting.
Audio file Investment-transcript.mp3
Speaker 1
00:01-00:07
The DCF puts the target at 1.2 billion, but the comps are suggesting closer to 950.
00:08-00:11
We need to reconcile that before the pitch.
Speaker 2
00:12-00:13
Agreed.
00:14-00:18
The multiple is being dragged down by the peer group selection.
00:19-00:24
If we narrow it to pure-play SaaS, the range tightens significantly.
Speaker 1
00:25-00:26
Fine.
00:27-00:32
Let's rerun the comps with the revised peer set and update the book by Thursday.
Speaker 2
00:33-00:36
I'll have the associate turn it around tomorrow.
00:37-00:41
Do we want to keep the base case at 8 times or push to 9?

Powered by state-of-the-art, open voice models.

Voxtral TTS.

Realistic, emotionally expressive voice generation and cloning.

Voxtral Mini Transcribe 2.

Batch transcription with speaker diarization and context biasing.

Voxtral Realtime.

Live streaming transcription with sub-200ms latency.

Why Mistral Speech?

Agents that sound human.

Voice generation and replication that captures personality, rhythm, and emotional dexterity.

Hear and discern everything.

Transcription that stays accurate in noisy, real-world conditions and knows who said what.

Localize across languages and accents.

Nine languages for voice generation, thirteen for transcription, with cross-lingual and dialect adaptation.

Powered by your models in your infrastructure.

Open weights, domain fine-tuning, and on-prem deployment. Full control over every component in the pipeline.

Get started with Mistral Speech.

API.

Programmatic access to Mistral's audio models for custom integrations.

Playground.

Test voice generation, cloning, and transcription in Mistral Studio.

Enterprise.

Custom deployments, solutions, model training, and dedicated support.

Closing the loop on audio intelligence.

Construct and customize voices.

Voice agents.

Real-time voice-to-voice conversations that listen, reason, and respond with your brand's voice, tone, and domain knowledge.

Voice cloning.

Replicate any voice from a sample as short as 3 seconds, capturing tone, rhythm, and personality.

Text-to-speech.

Emotionally expressive speech and voice cloning that captures a speaker's personality. Adapt to any voice from a short sample, or use preset voices.

Capture every word.

Real-time transcription.

Streaming architecture that transcribes audio as it arrives rather than in chunks, with latency configurable to sub-200ms.

Batch transcription.

Process hours-long meetings, call recordings, and compliance archives, with structured outputs and speaker attribution.

Cross-lingual adaptation.

Generate speech in one language using a voice from another, preserving accent and identity.

Prototype, test, tune, adapt.

Audio playground.

Test conversations, voice generation, and transcription in Mistral Studio with actors, voice emulation, diarization, and per-input controls.

Speaker diarization.

Identifies who said what and when, with speaker labels and start/end timestamps for meetings, interviews, and multi-party calls.
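As a rough illustration of how speaker labels with start/end timestamps might be handled downstream, the sketch below models segments as plain Python data and merges consecutive segments from the same speaker into turns. The `Segment` class and `merge_turns` function are hypothetical names chosen for this example, not part of any Mistral API; the sample data mirrors the demo transcript at the top of this page.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: int  # seconds from the start of the audio
    end: int    # seconds from the start of the audio
    text: str


def merge_turns(segments):
    """Merge consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            # Extend the previous turn: keep its start, take the new end.
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}")
        else:
            turns.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return turns


# Hypothetical diarized output, copied from the demo transcript.
segments = [
    Segment("Speaker 1", 1, 7, "The DCF puts the target at 1.2 billion, "
                               "but the comps are suggesting closer to 950."),
    Segment("Speaker 1", 8, 11, "We need to reconcile that before the pitch."),
    Segment("Speaker 2", 12, 13, "Agreed."),
    Segment("Speaker 2", 14, 18, "The multiple is being dragged down by the "
                                 "peer group selection."),
]

turns = merge_turns(segments)
for turn in turns:
    print(f"{turn.speaker} [{turn.start:02d}s-{turn.end:02d}s]: {turn.text}")
```

Merging into turns is only one possible post-processing choice; keeping the finer-grained segments may suit use cases such as subtitle alignment better.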

Context biasing.

Guide the model with up to 100 custom terms: names, technical vocabulary, internal jargon.
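A minimal sketch of preparing a term list before sending it for context biasing, assuming the 100-term limit stated above. The helper name `prepare_bias_terms` is hypothetical and the way terms are actually passed to the model is not shown here; this only cleans the list client-side.

```python
MAX_TERMS = 100  # limit stated on this page


def prepare_bias_terms(terms, max_terms=MAX_TERMS):
    """Trim whitespace, drop blanks and case-insensitive duplicates,
    then enforce the term cap."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.casefold()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    if len(cleaned) > max_terms:
        raise ValueError(
            f"{len(cleaned)} terms supplied; at most {max_terms} allowed"
        )
    return cleaned


# Example vocabulary drawn from the demo transcript on this page.
terms = prepare_bias_terms(["DCF", "comps", " pure-play SaaS ", "dcf", ""])
print(terms)
```

Raising an error rather than silently truncating is a design choice here: it keeps the caller aware of which terms would otherwise be dropped.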

How teams use Mistral Speech today.

Customer support.

Voice agents that route and resolve queries across channels with natural, brand-appropriate speech.

Financial services.

Compliant voice AI for wealth management advisory, insurance policy queries, and client onboarding.

Manufacturing and industrial operations.

Voice interfaces for quality inspection, production feedback, and field operations in high-noise environments.

Public services and government.

Dialect-specific voice assistants for citizen services, deployed on sovereign infrastructure.

Compliance and risk.

Real-time call monitoring with speaker attribution, KYC/AML automation, and auditable interaction records.

Supply chain and logistics.

Voice-enabled shipment tracking, customs coordination, and exception handling across languages.

Automotive and in-vehicle systems.

Lightweight on-device models powering voice interfaces without cloud dependency.

Sales and marketing.

Meeting intelligence with speaker attribution, pipeline analysis, and automated follow-ups.

Real-time translation.

Cross-lingual voice adaptation for live translation, preserving speaker identity and accent.

Resources.

Frequently asked questions.

Try voice generation and transcription in the Audio Playground in Mistral Studio, integrate via API, or download open weights to self-host.

Yes, two: Voxtral Mini Transcribe 2 for batch transcription with speaker diarization and context biasing, and Voxtral Realtime for live streaming transcription with sub-200ms latency.

Yes, you can provide a voice sample as short as 3 seconds and Voxtral TTS, our text-to-speech model, will adapt to capture the speaker's tone, rhythm, and personality. You can also use preset voices or build your own voice library.

Transcription supports 13 languages: English, French, German, Spanish, Chinese, Hindi, Arabic, Portuguese, Russian, Japanese, Korean, Italian, and Dutch. Voice generation supports 9 languages with dialect-aware expressions: English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic.

Voxtral models can be combined into a speech-to-speech pipeline. For example, Voxtral Realtime transcribes incoming speech, a Mistral LLM reasons over the transcript and determines a response, and Voxtral TTS generates spoken output. Each component is independently customizable and deployable. With cross-lingual voice adaptation, the pipeline can also handle live translation while preserving the speaker's accent and identity.
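The three stages described above can be sketched as a chain of swappable functions. This is an illustrative outline only: `make_pipeline` and the `fake_*` stand-ins are hypothetical names, and no real Mistral client calls are shown. In practice each stand-in would be replaced by a call to the corresponding model.

```python
from typing import Callable


def make_pipeline(
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Callable[[bytes], bytes]:
    """Chain three independently replaceable stages:
    speech -> transcript -> reply text -> speech."""
    def run(audio_in: bytes) -> bytes:
        transcript = transcribe(audio_in)
        reply = respond(transcript)
        return synthesize(reply)
    return run


# Stand-in stages so the sketch runs without network or model access.
def fake_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretend the audio bytes are already text


def fake_respond(transcript: str) -> str:
    return f"You said: {transcript}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # pretend the encoded text is audio


pipeline = make_pipeline(fake_transcribe, fake_respond, fake_synthesize)
audio_out = pipeline(b"I'd like to open a business account.")
print(audio_out.decode("utf-8"))
```

Passing each stage in as a function reflects the point that components are independently customizable: any one stage can be swapped without touching the other two.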

Yes, you can self-host Mistral Speech or deploy on Mistral Compute.

Build your AI future.

The most expressive, accurate, and open audio AI for the enterprise.