OpenClaw Voice Mode: Telegram Voice Notes and TTS Setup
OpenClaw was previously known as Clawdbot and Moltbot. This guide applies to all versions.
OpenClaw voice mode lets you send Telegram voice notes and get spoken replies. Set up Whisper transcription and TTS output for hands-free AI agent control.
Key takeaways
- OpenClaw auto-transcribes Telegram voice notes using a detection chain: local Whisper CLI first, then provider APIs (OpenAI, Deepgram, Groq, or Mistral Voxtral)
- TTS replies arrive as round Telegram voice bubbles, not file attachments, when the feature is enabled
- Set messages.tts.auto: "inbound" to only speak when responding to a voice message, keeping text replies silent
- Microsoft Edge TTS works with no API key and is the default fallback when no provider keys are configured
- The echoTranscript option sends a text confirmation of what OpenClaw heard before the agent processes it
Always review commands your agent suggests before approving them. Don't paste prompts from sources you don't trust.
Fixes when it breaks. Workflows when it doesn't.
OpenClaw guides, configs, and troubleshooting notes. Every two weeks.
How OpenClaw transcribes Telegram voice notes automatically
OpenClaw transcribes voice notes without any middleware. When a Telegram voice message arrives, the audio media understanding pipeline intercepts it before the agent sees the message body. The pipeline locates the audio attachment, downloads it if needed, and routes it through the configured transcription stack.
Auto-detection runs in this order when no explicit models are configured:
1. sherpa-onnx-offline: requires SHERPA_ONNX_MODEL_DIR with model files
2. whisper-cli: from whisper.cpp, uses WHISPER_CPP_MODEL or the bundled tiny model
3. whisper: Python CLI, downloads models automatically
4. Gemini CLI via read_many_files
5. Provider keys: OpenAI, then Groq, then Deepgram, then Google
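The fallback order described above can be sketched as a simple chain. This is illustrative pseudologic, not OpenClaw's actual internals: the pick_transcriber helper is invented for this sketch, and the exact environment variable names for provider keys (e.g., GOOGLE_API_KEY) are assumptions.

```python
import os
import shutil

def pick_transcriber(env=os.environ, which=shutil.which):
    """Illustrative sketch of the auto-detection order, not OpenClaw's real API."""
    if env.get("SHERPA_ONNX_MODEL_DIR") and which("sherpa-onnx-offline"):
        return "sherpa-onnx-offline"
    if which("whisper-cli"):   # whisper.cpp build
        return "whisper-cli"
    if which("whisper"):       # Python CLI, downloads models on first use
        return "whisper"
    if which("gemini"):        # Gemini CLI via read_many_files
        return "gemini-cli"
    # Finally, fall back to provider API keys in priority order
    for key, provider in [("OPENAI_API_KEY", "openai"),
                          ("GROQ_API_KEY", "groq"),
                          ("DEEPGRAM_API_KEY", "deepgram"),
                          ("GOOGLE_API_KEY", "google")]:
        if env.get(key):
            return provider
    return None  # nothing available: transcription is skipped
```

The practical takeaway: a local CLI on the PATH always wins over a provider key, so removing (or explicitly configuring past) a stale local install matters when you want API-based transcription.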
On success, OpenClaw replaces the message body with an [Audio] block and sets the {{Transcript}} template variable. The agent sees plain text and responds as if you had typed the message.
According to the OpenClaw audio docs, audio files below a minimum size threshold are skipped as likely empty. The default size cap is configurable via tools.media.audio.maxBytes (the example config uses 20MB).
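The 20MB figure is expressed in raw bytes in the config, which is where the otherwise opaque number in example configs comes from:

```python
# tools.media.audio.maxBytes takes raw bytes; 20 MB = 20 * 1024 * 1024
max_bytes = 20 * 1024 * 1024
print(max_bytes)  # 20971520
```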
Which transcription provider should you use for OpenClaw?
For most setups, OpenAI gpt-4o-mini-transcribe hits the best balance of speed, accuracy, and cost. Deepgram nova-3 is the fastest option and handles noisy audio well. Mistral Voxtral is newer and strong on non-English languages. Local Whisper CLI requires no API key but adds latency and is CPU-bound.
| Provider | API Key | Speed | Best For |
|---|---|---|---|
| OpenAI gpt-4o-mini-transcribe | Yes | ~1-2s | General use |
| Deepgram nova-3 | Yes | Under 1s | Speed, noisy audio |
| Mistral Voxtral | Yes | ~2s | Multilingual |
| Groq | Yes | Under 1s | Fast inference |
| whisper-cli (local) | No | Variable | No API key |
| whisper Python | No | Variable | Auto model download |
If you are starting fresh and already have an OpenAI key, use OpenAI. If you want zero API costs and can tolerate slower transcription, install whisper.cpp and point the config at the binary.
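For the zero-API-key route, a local-only config might look like the sketch below. The flags shown match recent whisper.cpp builds (-m for the model path, -f for the input file, -np to suppress progress output), but check them against your installed version, and treat the model path as a placeholder to adjust.

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            command: "whisper-cli",
            // -m: model path, -f: input file, -np: no progress prints
            args: ["-m", "/opt/whisper.cpp/models/ggml-base.bin", "-f", "{{MediaPath}}", "-np"],
            timeoutSeconds: 60,
          },
        ],
      },
    },
  },
}
```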
How to configure voice note transcription in openclaw.json
The minimal config that enables transcription with OpenAI:
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" }
        ],
      },
    },
  },
}
```

Provider + local CLI fallback (transcription keeps working when the API is down):
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45,
          },
        ],
      },
    },
  },
}
```

Deepgram-only config:
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }],
      },
    },
  },
}
```

To restrict transcription to direct messages only (skip groups):
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [{ action: "deny", match: { chatType: "group" } }],
        },
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```

How to enable TTS so OpenClaw speaks back
TTS is off by default. Enable it with messages.tts.auto in openclaw.json or by running /tts always in any chat session.
When active on Telegram, replies arrive as round voice note bubbles rather than file attachments. Three providers are supported: ElevenLabs (most natural voices), OpenAI (fast and affordable), and Microsoft Edge TTS (free, no API key required).
messages.tts.auto accepts two values you will actually use:
- "always": every agent reply is spoken
- "inbound": only speak when the incoming message was a voice note
The "inbound" setting is usually the right choice for conversational use. You get voice replies when you sent voice, and text replies when you typed. It avoids TTS on long tool outputs where you just wanted a quick status check.
How to configure TTS in openclaw.json
Microsoft Edge TTS (free, no API key):
```json5
{
  messages: {
    tts: {
      auto: "always",
      provider: "microsoft",
      microsoft: {
        enabled: true,
        voice: "en-US-MichelleNeural",
        lang: "en-US",
        outputFormat: "audio-24khz-48kbitrate-mono-mp3",
        rate: "+10%",
        pitch: "-5%",
      },
    },
  },
}
```

OpenAI primary with ElevenLabs fallback:
```json5
{
  messages: {
    tts: {
      auto: "inbound",
      provider: "openai",
      openai: {
        model: "gpt-4o-mini-tts",
        voice: "alloy",
      },
      elevenlabs: {
        voiceId: "your_voice_id",
        modelId: "eleven_multilingual_v2",
        voiceSettings: {
          stability: 0.5,
          similarityBoost: 0.75,
          speed: 1.0,
        },
      },
    },
  },
}
```

For long agent replies, TTS can auto-summarize before speaking. Set summaryModel to a fast model like openai/gpt-4.1-mini to trim the response before it hits the TTS engine. Without this, very long replies may hit the maxTextLength cap and produce unwieldy audio output.
```json5
{
  messages: {
    tts: {
      auto: "inbound",
      provider: "openai",
      summaryModel: "openai/gpt-4.1-mini",
      maxTextLength: 400,
    },
  },
}
```

How echoTranscript helps you confirm what OpenClaw heard
echoTranscript: true sends a text confirmation back to the chat before the agent processes the transcript. This is useful during initial setup to verify that the transcription is accurate before the agent acts on it.
```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true,
        echoFormat: "Heard: \"{transcript}\"",
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```

The echoFormat field supports the {transcript} placeholder. The confirmation message appears before the agent reply in the chat. Turn this off once you have confirmed transcription is working, since it adds a message to every voice input.
How to use OpenClaw voice notes in Telegram groups with mention gating
When requireMention: true is set for a Telegram group, OpenClaw normally skips messages that do not contain an @mention. Voice notes pose a problem because the message body is empty until transcription runs.
OpenClaw handles this with preflight transcription. If a voice note arrives in a mention-gated group with no text body, OpenClaw transcribes it first, checks the transcript for mention patterns (e.g., @BotName), and only proceeds if a mention is found. The transcript is then passed through the normal pipeline.
If preflight transcription fails (timeout, API error), the message falls back to text-only mention detection and is likely dropped. This is a best-effort flow, so for critical workflows, use DMs instead of groups.
To disable preflight for a specific group:
```json5
{
  channels: {
    telegram: {
      groups: {
        "-1001234567890": {
          disableAudioPreflight: true
        }
      }
    }
  }
}
```

The coordinator pattern: managing projects hands-free while driving
The coordinator pattern turns OpenClaw into a voice-driven project manager. The setup: connect the agent to your task system (Lantern, Notion, or similar via tool calls), configure auto: "inbound" TTS, and use Telegram voice notes as the input interface.
A typical session while driving:
- You press the microphone in Telegram and say: "Add a high priority task to finish the API docs before tomorrow's call."
- OpenClaw transcribes the voice note, runs the task-creation tool, and replies with a spoken confirmation: "Done. Task added: finish API docs, high priority, due tomorrow."
- You follow up: "What else is on my plate today?" and get a spoken summary of your task list.
Tips for the coordinator workflow:
- Keep system prompts concise when using TTS, since the agent's reply is what gets spoken
- Use a summaryModel to trim long status outputs before TTS
- Test the setup in a quiet environment first, since background noise can degrade transcription
- If you use groups for coordination, test the preflight mention behavior before relying on it
This pattern works well for any away-from-keyboard context: cooking, exercising, or commuting. The Telegram mobile app handles push-to-talk natively, so there is no friction on the input side.
Common OpenClaw voice setup problems and how to fix them
Transcription does not trigger at all
Check that tools.media.audio.enabled is not set to false. On some platforms (especially Windows), auto-detection may not find local CLIs reliably. Set an explicit models array with a provider key instead of relying on auto-detection.
Deepgram stops transcribing after a config change or restart
This is a known issue (GitHub #7460). The root cause is a config reload race condition. Fix: set an explicit tools.media.audio.models entry with provider: "deepgram" rather than letting Deepgram surface through auto-detection.
Voice notes work in DMs but not in groups
Check whether the group has requireMention: true. Voice notes in mention-gated groups go through preflight transcription. If preflight fails, the message is dropped. Verify that the transcription provider is reachable and that the group config does not have disableAudioPreflight: true.
TTS fires on every reply, not just voice responses
Switch from auto: "always" to auto: "inbound". This limits TTS to replies triggered by incoming voice notes.
TTS output is too long and the reply sounds like a wall of text
Set summaryModel in the TTS config. A fast model like openai/gpt-4.1-mini will summarize the agent's reply before it goes to the TTS engine. Also set maxTextLength to cap input length.
CLI transcription exits with non-zero code
Ensure the CLI prints plain text to stdout and exits 0. JSON output needs to go through jq -r .text before OpenClaw can use it.
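If your transcription tool only emits JSON, one workaround is to wrap the pipeline in a shell invocation so jq can extract the text field. This is a sketch: my-stt is a placeholder for your actual CLI, and it assumes the tool emits JSON with a top-level text key.

```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [
          {
            type: "cli",
            // Run through sh so the jq pipeline works; OpenClaw substitutes {{MediaPath}}
            command: "sh",
            args: ["-c", "my-stt --json '{{MediaPath}}' | jq -r .text"],
            timeoutSeconds: 45,
          },
        ],
      },
    },
  },
}
```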
Key terms
Preflight transcription: A transcription pass that runs before mention detection in group chats. Allows voice notes to be checked for @mentions before the message is dropped due to missing text.
echoTranscript: An optional config flag that sends a text confirmation of the transcript back to the originating chat before the agent processes it. Useful for debugging transcription accuracy.
auto: "inbound": A TTS mode that only converts agent replies to audio when the incoming message was a voice note. Text messages get text replies; voice messages get voice replies.
Coordinator pattern: A workflow where an AI agent acts as a voice-driven task manager, receiving spoken commands via Telegram voice notes and responding with spoken status updates.
FAQ
Does OpenClaw voice mode work without any API keys?
OpenClaw voice mode works without API keys if you have a local Whisper CLI installed (whisper.cpp or the Python whisper package). Auto-detection finds these first before checking provider keys. For TTS, Microsoft Edge TTS requires no API key and is the default fallback. A zero-API-key setup is fully functional but slower on transcription since local inference runs on CPU.
How does OpenClaw TTS format voice replies on Telegram: does it send a file or a voice note bubble?
OpenClaw sends TTS output as a round Telegram voice note bubble, not a standard audio file attachment. Telegram distinguishes between the two in the UI: audio files show as a download card, while voice notes appear as the round waveform bubble that plays inline. OpenClaw uses the voice note format by default when TTS is active.
Can OpenClaw transcribe voice notes in Telegram group chats where the bot requires an @mention?
Yes. OpenClaw performs preflight transcription before the mention check when a voice note arrives in a mention-gated group. The voice note is transcribed first, then the transcript is scanned for @mention patterns. If a mention is found, the message continues through the full pipeline. If transcription fails during preflight, the message falls back to text-only mention detection and is likely dropped, so a reliable transcription provider matters more in groups than in DMs.
Evidence and sources
- OpenClaw Audio and Voice Notes docs: transcription config, auto-detection chain, echoTranscript, preflight mention behavior
- OpenClaw Text-to-Speech docs: TTS providers, config keys, auto: "inbound" behavior
- GitHub issue #7460: Deepgram transcription drops after config reload (verified)
- GitHub issue #22554: audio.enabled behavior on Windows with explicit models
Related resources
- Automated Morning Briefing with OpenClaw
- Build a Website with OpenClaw
- Best VPS for OpenClaw
- Fix OpenClaw Discord Bot Not Responding
Changelog
| Date | Change |
|---|---|
| 2026-03-25 | QC fixes: prose cleanup, key takeaways trimmed to 5, file size claims hedged with doc citations, GitHub #7460 verified |
| 2026-03-23 | Initial publication |