OpenClaw Voice Mode: Telegram Voice Notes and TTS Setup
OpenClaw was previously known as Clawdbot and Moltbot. This guide applies to all versions.
OpenClaw voice mode lets you send Telegram voice notes and get spoken replies. Set up Whisper transcription and TTS output for hands-free AI agent control.
Key takeaways
- OpenClaw auto-transcribes Telegram voice notes using a detection chain: local Whisper CLI first, then provider APIs (OpenAI, Deepgram, Groq, or Mistral Voxtral)
- TTS replies arrive as round Telegram voice bubbles, not file attachments, when the feature is enabled
- Set messages.tts.auto: "inbound" to only speak when responding to a voice message, keeping text replies silent
- Microsoft Edge TTS works with no API key and is the default fallback when no provider keys are configured
- The echoTranscript option sends a text confirmation of what OpenClaw heard before the agent processes it
Always review commands your agent suggests before approving them. Don't paste prompts from sources you don't trust.
Fixes when it breaks. Workflows when it doesn't.
OpenClaw guides, configs, and troubleshooting notes. Every two weeks.
How OpenClaw transcribes Telegram voice notes automatically
OpenClaw transcribes voice notes without any middleware. When a Telegram voice message arrives, the audio media understanding pipeline intercepts it before the agent sees the message body. The pipeline locates the audio attachment, downloads it if needed, and routes it through the configured transcription stack.
Auto-detection runs in this order when no explicit models are configured:
1. sherpa-onnx-offline: requires SHERPA_ONNX_MODEL_DIR with model files
2. whisper-cli: from whisper.cpp, uses WHISPER_CPP_MODEL or the bundled tiny model
3. whisper: Python CLI, downloads models automatically
4. Gemini CLI via read_many_files
5. Provider keys: OpenAI, then Groq, then Deepgram, then Google
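The fallback order described above can be sketched as a simple chain. This is illustrative pseudologic, not OpenClaw's actual internals: the pick_transcriber helper is invented for this sketch, and the exact environment variable names for provider keys (e.g., GOOGLE_API_KEY) are assumptions.

```python
import os
import shutil

def pick_transcriber(env=os.environ, which=shutil.which):
    """Illustrative sketch of the auto-detection order, not OpenClaw's real API."""
    if env.get("SHERPA_ONNX_MODEL_DIR") and which("sherpa-onnx-offline"):
        return "sherpa-onnx-offline"
    if which("whisper-cli"):   # whisper.cpp build
        return "whisper-cli"
    if which("whisper"):       # Python CLI, downloads models on first use
        return "whisper"
    if which("gemini"):        # Gemini CLI via read_many_files
        return "gemini-cli"
    # Finally, fall back to provider API keys in priority order
    for key, provider in [("OPENAI_API_KEY", "openai"),
                          ("GROQ_API_KEY", "groq"),
                          ("DEEPGRAM_API_KEY", "deepgram"),
                          ("GOOGLE_API_KEY", "google")]:
        if env.get(key):
            return provider
    return None  # nothing available: transcription is skipped
```

The practical takeaway: a local CLI on the PATH always wins over a provider key, so removing (or explicitly configuring past) a stale local install matters when you want API-based transcription.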
On success, OpenClaw replaces the message body with an [Audio] block and sets the {{Transcript}} template variable. The agent sees plain text and responds as if you had typed the message.
According to the OpenClaw audio docs, audio files below a minimum size threshold are skipped as likely empty. The default size cap is configurable via tools.media.audio.maxBytes (the example config uses 20MB).
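The 20MB figure is expressed in raw bytes in the config, which is where the otherwise opaque number in example configs comes from:

```python
# tools.media.audio.maxBytes takes raw bytes; 20 MB = 20 * 1024 * 1024
max_bytes = 20 * 1024 * 1024
print(max_bytes)  # 20971520
```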
Which transcription provider should you use for OpenClaw?
For most setups, OpenAI gpt-4o-mini-transcribe hits the best balance of speed, accuracy, and cost. Deepgram nova-3 is the fastest option and handles noisy audio well. Mistral Voxtral is newer and strong on non-English languages. Local Whisper CLI requires no API key but adds latency and is CPU-bound.
| Provider | API Key | Speed | Best For |
|---|---|---|---|
| OpenAI gpt-4o-mini-transcribe | Yes | ~1-2s | General use |
| Deepgram nova-3 | Yes | Under 1s | Speed, noisy audio |
| Mistral Voxtral | Yes | ~2s | Multilingual |
| Groq | Yes | Under 1s | Fast inference |
| whisper-cli (local) | No | Variable | No API key |
| whisper Python | No | Variable | Auto model download |
If you are starting fresh and already have an OpenAI key, use OpenAI. If you want zero API costs and can tolerate slower transcription, install whisper.cpp and point the config at the binary.
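For the zero-API-key route, a local-only config might look like the sketch below. The flags shown match recent whisper.cpp builds (-m for the model path, -f for the input file, -np to suppress progress output), but check them against your installed version, and treat the model path as a placeholder to adjust.

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            command: "whisper-cli",
            // -m: model path, -f: input file, -np: no progress prints
            args: ["-m", "/opt/whisper.cpp/models/ggml-base.bin", "-f", "{{MediaPath}}", "-np"],
            timeoutSeconds: 60,
          },
        ],
      },
    },
  },
}
```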
How to configure voice note transcription in openclaw.json
The minimal config that enables transcription with OpenAI:
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" }
        ],
      },
    },
  },
}
```

Provider + local CLI fallback (transcription keeps working when the API is down):
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45,
          },
        ],
      },
    },
  },
}
```

Deepgram-only config:
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }],
      },
    },
  },
}
```

To restrict transcription to direct messages only (skip groups):
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [{ action: "deny", match: { chatType: "group" } }],
        },
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```

How to enable TTS so OpenClaw speaks back
TTS is off by default. Enable it with messages.tts.auto in openclaw.json or by running /tts always in any chat session.
When active on Telegram, replies arrive as round voice note bubbles rather than file attachments. Three providers are supported: ElevenLabs (most natural voices), OpenAI (fast and affordable), and Microsoft Edge TTS (free, no API key required).
messages.tts.auto accepts two values you will actually use:
- "always": every agent reply is spoken
- "inbound": only speak when the incoming message was a voice note
The "inbound" setting is usually the right choice for conversational use. You get voice replies when you sent voice, and text replies when you typed. It avoids TTS on long tool outputs where you just wanted a quick status check.
How to configure TTS in openclaw.json
Microsoft Edge TTS (free, no API key):
```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      microsoft: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",
        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
        rate: "+10%",
        pitch: "-5%",
      },
    },
  },
}
```

OpenAI primary with ElevenLabs fallback:
```json5
{
  messages: {
    tts: {
      auto: "inbound",
      provider: "openai",
      openai: {
        model: "gpt-4o-mini-tts",
        voice: "alloy",
      },
      elevenlabs: {
        voiceId: "your_voice_id",
        modelId: "eleven_multilingual_v2",
        voiceSettings: {
          stability: 0.5,
          similarityBoost: 0.75,
          speed: 1.0,
        },
      },
    },
  },
}
```

For long agent replies, TTS can auto-summarize before speaking. Set summaryModel to a fast model like openai/gpt-4.1-mini to trim the response before it hits the TTS engine. Without this, very long replies may hit the maxTextLength cap and produce unwieldy audio output.
```json5
{
  messages: {
    tts: {
      auto: "inbound",
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      maxTextLength: 400,
    },
  },
}
```

How echoTranscript helps you confirm what OpenClaw heard
echoTranscript: true sends a text confirmation back to the chat before the agent processes the transcript. This is useful during initial setup to verify that the transcription is accurate before the agent acts on it.
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true,
        echoFormat: "Heard: \"{transcript}\"",
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```

The echoFormat field supports the {transcript} placeholder. The confirmation message appears before the agent reply in the chat. Turn this off once you have confirmed transcription is working, since it adds a message to every voice input.
How to use OpenClaw voice notes in Telegram groups with mention gating
When requireMention: true is set for a Telegram group, OpenClaw normally skips messages that do not contain an @mention. Voice notes pose a problem because the message body is empty until transcription runs.
OpenClaw handles this with preflight transcription. If a voice note arrives in a mention-gated group with no text body, OpenClaw transcribes it first, checks the transcript for mention patterns (e.g., @BotName), and only proceeds if a mention is found. The transcript is then passed through the normal pipeline.
If preflight transcription fails (timeout, API error), the message falls back to text-only mention detection and is likely dropped. This is a best-effort flow, so for critical workflows, use DMs instead of groups.
To disable preflight for a specific group:
```json5
{
  channels: {
    telegram: {
      groups: {
        "-1001234567890": {
          disableAudioPreflight: true
        }
      }
    }
  }
}
```

The coordinator pattern: managing projects hands-free while driving
The coordinator pattern turns OpenClaw into a voice-driven project manager. The setup: connect the agent to your task system (Lantern, Notion, or similar via tool calls), configure auto: "inbound" TTS, and use Telegram voice notes as the input interface.
A typical session while driving:
- You press the microphone in Telegram and say: "Add a high priority task to finish the API docs before tomorrow's call."
- OpenClaw transcribes the voice note, runs the task-creation tool, and replies with a spoken confirmation: "Done. Task added: finish API docs, high priority, due tomorrow."
- You follow up: "What else is on my plate today?" and get a spoken summary of your task list.
Tips for the coordinator workflow:
- Keep system prompts concise when using TTS, since the agent's reply is what gets spoken
- Use a summaryModel to trim long status outputs before TTS
- Test the setup in a quiet environment first, since background noise can degrade transcription
- If you use groups for coordination, test the preflight mention behavior before relying on it
This pattern works well for any away-from-keyboard context: cooking, exercising, or commuting. The Telegram mobile app handles push-to-talk natively, so there is no friction on the input side.
Common OpenClaw voice setup problems and how to fix them
Transcription does not trigger at all
Check that tools.media.audio.enabled is not set to false. On some platforms (especially Windows), auto-detection may not find local CLIs reliably. Set an explicit models array with a provider key instead of relying on auto-detection.
Deepgram stops transcribing after a config change or restart
This is a known issue (GitHub #7460). The root cause is a config reload race condition. Fix: set an explicit tools.media.audio.models entry with provider: "deepgram" rather than letting Deepgram surface through auto-detection.
Voice notes work in DMs but not in groups
Check whether the group has requireMention: true. Voice notes in mention-gated groups go through preflight transcription. If preflight fails, the message is dropped. Verify that the transcription provider is reachable and that the group config does not have disableAudioPreflight: true.
TTS fires on every reply, not just voice responses
Switch from auto: "always" to auto: "inbound". This limits TTS to replies triggered by incoming voice notes.
TTS output is too long and the reply sounds like a wall of text
Set summaryModel in the TTS config. A fast model like openai/gpt-4.1-mini will summarize the agent's reply before it goes to the TTS engine. Also set maxTextLength to cap input length.
CLI transcription exits with non-zero code
Ensure the CLI prints plain text to stdout and exits 0. JSON output needs to go through jq -r .text before OpenClaw can use it.
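If your transcription tool only emits JSON, one workaround is to wrap the pipeline in a shell invocation so jq can extract the text field. This is a sketch: my-stt is a placeholder for your actual CLI, and it assumes the tool emits JSON with a top-level text key.

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            // Run through sh so the jq pipeline works; OpenClaw substitutes {{MediaPath}}
            command: "sh",
            args: ["-c", "my-stt --json '{{MediaPath}}' | jq -r .text"],
            timeoutSeconds: 45,
          },
        ],
      },
    },
  },
}
```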
Key terms
Preflight transcription: A transcription pass that runs before mention detection in group chats. Allows voice notes to be checked for @mentions before the message is dropped due to missing text.
echoTranscript: An optional config flag that sends a text confirmation of the transcript back to the originating chat before the agent processes it. Useful for debugging transcription accuracy.
auto: "inbound": A TTS mode that only converts agent replies to audio when the incoming message was a voice note. Text messages get text replies; voice messages get voice replies.
Coordinator pattern: A workflow where an AI agent acts as a voice-driven task manager, receiving spoken commands via Telegram voice notes and responding with spoken status updates.
FAQ
Does OpenClaw voice mode work without any API keys?
OpenClaw voice mode works without API keys if you have a local Whisper CLI installed (whisper.cpp or the Python whisper package). Auto-detection finds these first before checking provider keys. For TTS, Microsoft Edge TTS requires no API key and is the default fallback. A zero-API-key setup is fully functional but slower on transcription since local inference runs on CPU.
How does OpenClaw TTS format voice replies on Telegram: does it send a file or a voice note bubble?
OpenClaw sends TTS output as a round Telegram voice note bubble, not a standard audio file attachment. Telegram distinguishes between the two in the UI: audio files show as a download card, while voice notes appear as the round waveform bubble that plays inline. OpenClaw uses the voice note format by default when TTS is active.
Can OpenClaw transcribe voice notes in Telegram group chats where the bot requires an @mention?
Yes. OpenClaw performs preflight transcription before the mention check when a voice note arrives in a mention-gated group. The voice note is transcribed first, then the transcript is scanned for @mention patterns. If a mention is found, the message continues through the full pipeline. If transcription fails during preflight, the message falls back to text-only mention detection and is likely dropped, so a reliable transcription provider matters more in groups than in DMs.
Evidence and sources
- OpenClaw Audio and Voice Notes docs: transcription config, auto-detection chain, echoTranscript, preflight mention behavior
- OpenClaw Text-to-Speech docs: TTS providers, config keys, auto: "inbound" behavior
- GitHub issue #7460: Deepgram transcription drops after config reload (verified)
- GitHub issue #22554: audio.enabled behavior on Windows with explicit models
Related resources
- Automated Morning Briefing with OpenClaw
- Build a Website with OpenClaw
- Best VPS for OpenClaw
- Fix OpenClaw Discord Bot Not Responding
Changelog
| Date | Change |
|---|---|
| 2026-03-25 | QC fixes: prose cleanup, key takeaways trimmed to 5, file size claims hedged with doc citations, GitHub #7460 verified |
| 2026-03-23 | Initial publication |