HumeAI TADA 3B ML (tada-3b-ml) is a multilingual text-to-speech model served on PolarGrid edge nodes via Triton’s python backend. Unlike preset-voice models, TADA clones a speaker from a short reference clip and can carry that voice across languages — synthesize French in a voice you only ever recorded speaking English.
  • HF repo: HumeAI/tada-3b-ml
  • Modality: Text-to-Speech (streaming)
  • Backend: Triton python (isolated TADA pod)
  • Available regions: yvr-02 (Blackwell production)

Headline benchmark

TADA exposes a chunked-HTTP streaming transport for /v1/audio/speech.
| Measurement | p50 | p95 |
| --- | --- | --- |
| End-to-end TTFA (with network) | 352 ms | 449 ms |
| Server-only TTFA (gateway → first Triton byte) | 238 ms | 282 ms |
| Network leg of TTFA (e2e − server) | 109 ms | 210 ms |
| Full-utterance latency (TTLB, client wall-clock) | 919 ms | 1300 ms |
| Real-time factor (RTF) | 0.16 | 0.36 |
Method: streaming /v1/audio/speech (stream: true, pcm), 100 runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-05-27 from a Vancouver-area laptop over the public internet. Raw runs: benchmarks/yvr-02-2026-05-27/tada-3b-ml/. Harness: bench/tada-3b-ml/.
  • TTFA is the time to the first audio byte arriving at the client. It is identical to TTFB because the response body is raw PCM.
  • Server-only TTFA comes from the X-Pg-First-Byte-Ms response header (PR #507), which the gateway sets to its measured time from _stream_tts entry to the first PCM chunk returned by Triton.
  • Total synthesis time (TTLB) is not exposed via a header, because response headers leave the wire before synthesis completes. TTLB therefore stays a client-side wall-clock number.
  • RTF = synthesis wall-clock ÷ audio duration; below 1.0 is faster than real time.
  • 92/100 runs returned a streaming verdict; the remainder finished too fast for the incremental-arrival heuristic to fire, which is a property of the heuristic, not server-side buffering.
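
The definitions above can be sketched in a few lines of Python. This is an illustration, not the harness in bench/tada-3b-ml/: the function names are hypothetical, the X-Pg-First-Byte-Ms value is assumed to be a plain number of milliseconds, and the audio duration is derived from the byte count using the 24 kHz, 16-bit, mono output format listed under Capabilities.

```python
from statistics import quantiles

SAMPLE_RATE_HZ = 24_000  # output sample rate stated on this card
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM


def pcm_duration_s(num_bytes: int) -> float:
    """Playback duration of raw 24 kHz, 16-bit, mono PCM."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def real_time_factor(synthesis_s: float, pcm_bytes: int) -> float:
    """RTF = synthesis wall-clock / audio duration; below 1.0 is faster than real time."""
    return synthesis_s / pcm_duration_s(pcm_bytes)


def network_leg_ms(e2e_ttfa_ms: float, first_byte_header: str) -> float:
    """Network share of one run's TTFA: client-side TTFA minus the header value.

    Assumption: the header carries a bare number of milliseconds.
    """
    return e2e_ttfa_ms - float(first_byte_header)


def percentile(values: list[float], pct: int) -> float:
    """Percentile (pct in 1..99) using the 'inclusive' interpolation method."""
    return quantiles(values, n=100, method="inclusive")[pct - 1]
```

For example, `real_time_factor(0.2, 48_000)` describes one second of audio synthesized in 0.2 s. If the network leg is computed per run and then summarized, as sketched here, its percentiles need not equal the difference between the percentiles of the two TTFA rows.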

How this compares

| Provider | TTFA p50 | Notes | Source |
| --- | --- | --- | --- |
| PolarGrid tada-3b-ml on Blackwell | 352 ms e2e / 238 ms server | Multilingual + cross-lingual voice cloning | this card |
| ElevenLabs Turbo v2 | ~200 to 300 ms model / ~478 ms real-world streaming TTFB | English-leaning | elevenlabs.io |
| Cartesia Sonic | 90 ms marketing claim / ~188 ms independent p50 | English-leaning, no cross-lingual cloning | cartesia.ai |
| ElevenLabs Flash | 75 ms marketing claim / ~288 ms independent p50 | English-leaning, no cross-lingual cloning | gradium.ai |
| Hume Octave 2 | ~100 to 200 ms TTFT | Hume’s newer TTS, would land below TADA | dev.hume.ai |
PolarGrid’s 238 ms server TTFA falls inside ElevenLabs Turbo v2’s ~200 to 300 ms model-latency range, and the 352 ms end-to-end figure is below Turbo v2’s ~478 ms real-world streaming TTFB. Cartesia Sonic and ElevenLabs Flash report lower marketing numbers and similar real-world numbers, but they ship smaller English-leaning models without cross-lingual cloning, so the comparison is not like for like. Hume Octave 2 has moved the goalposts on Hume’s own product line; PolarGrid hosts TADA (the prior generation) faster than Hume hosted it.

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication. Replace <region> with your edge region, or discover the nearest one via the autorouter.
curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "tada-3b-ml",
    "input": "Hello from PolarGrid.",
    "voice": "default",
    "response_format": "pcm",
    "stream": true
  }' \
  --output speech.pcm
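
The speech.pcm file written by the command above has no header, so most players will not open it directly. A minimal sketch of wrapping it in a WAV container with Python's standard library follows. It uses the 24 kHz, 16-bit, mono format listed under Capabilities and assumes little-endian signed samples, which this card does not state explicitly; `pcm_to_wav` is a hypothetical helper name.

```python
import wave


def pcm_to_wav(pcm_path: str, wav_path: str) -> int:
    """Wrap raw PCM in a WAV header and return the number of frames written.

    Assumes 24 kHz, 16-bit, mono, little-endian samples.
    """
    with open(pcm_path, "rb") as f:
        pcm = f.read()
    # Drop a trailing odd byte so the data is a whole number of 16-bit samples
    # (can happen if the stream was cut off mid-sample).
    pcm = pcm[: len(pcm) - (len(pcm) % 2)]
    with wave.open(wav_path, "wb") as w:
        w.setnchannels(1)       # mono
        w.setsampwidth(2)       # 16-bit
        w.setframerate(24_000)  # 24 kHz
        w.writeframes(pcm)
    return len(pcm) // 2


if __name__ == "__main__":
    frames = pcm_to_wav("speech.pcm", "speech.wav")
    print(f"wrote {frames} frames ({frames / 24_000:.2f} s)")
```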

Capabilities

| Field | Value |
| --- | --- |
| Endpoint | POST /v1/audio/speech |
| Audio output | 24 kHz, 16-bit, mono |
| Streaming | Yes: chunked HTTP, pcm and opus (stream: true); audio delivered incrementally in ~4-token windows during synthesis |
| Batch formats | pcm, wav, mp3 |
| Voice model | Cross-lingual voice cloning from a reference clip (no preset voice catalog) |
| Languages | English, French, German, Spanish, Italian, Portuguese, Polish, Japanese, Arabic, Chinese |
| speed control | Batch only; streaming requires speed = 1.0 |
| Max batch size | 1 |
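
The streaming constraints in this table can be checked on the client before a request is sent. The sketch below is illustrative only: `check_speech_request` is a hypothetical helper, not part of any PolarGrid SDK, and it assumes an omitted response_format means pcm, which is the default for raw HTTP callers.

```python
STREAMING_FORMATS = {"pcm", "opus"}  # formats this card lists for stream: true


def check_speech_request(payload: dict) -> list[str]:
    """Return a list of problems with a /v1/audio/speech payload (empty if none).

    Only the streaming rules stated on this card are checked; batch requests
    are passed through untouched.
    """
    problems = []
    if payload.get("stream"):
        fmt = payload.get("response_format", "pcm")
        if fmt not in STREAMING_FORMATS:
            problems.append(f"streaming supports pcm and opus, got {fmt!r}")
        if payload.get("speed", 1.0) != 1.0:
            problems.append("streaming requires speed = 1.0; use batch mode to change the rate")
    return problems


if __name__ == "__main__":
    request = {"model": "tada-3b-ml", "input": "Hello.", "voice": "default",
               "response_format": "wav", "stream": True}
    for problem in check_speech_request(request):
        print(problem)
```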

Voices — cross-lingual cloning

TADA does not expose preset voice IDs. The voice parameter selects a reference speaker:
| voice value | Meaning |
| --- | --- |
| default | The bundled reference clip, a neutral English speaker. Use this when you just want speech and don’t care about the timbre. |
| A URL | A WAV file (24 kHz mono) fetched and used as the reference. Pair it with voice_transcript, the exact text spoken in the clip. |
| A base64 WAV | The reference clip inlined as a base64-encoded WAV string. Also pair with voice_transcript. |
voice_transcript is required whenever voice is a URL or base64 clip — TADA conditions on both the reference audio and its transcript. It is not needed for voice: "default". The cloned voice carries across languages: provide an English reference clip and set the language field (or write the input in the target language) to synthesize that speaker in French, Japanese, and so on.
curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tada-3b-ml",
    "input": "Bonjour, ceci est une voix clonée.",
    "voice": "https://example.com/my-reference.wav",
    "voice_transcript": "This is the exact text spoken in the reference clip.",
    "language": "fr",
    "response_format": "wav"
  }' \
  --output cloned.wav
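
For the base64 variant, the request body can be assembled as in the sketch below. `build_clone_request` is a hypothetical helper; it assumes the API expects a bare base64 string with no data-URI prefix, which this card does not specify. It enforces the rule above that voice_transcript must accompany any non-default voice.

```python
import base64
import json
from typing import Optional, Union


def build_clone_request(
    text: str,
    voice: Union[str, bytes] = "default",
    voice_transcript: Optional[str] = None,
    language: Optional[str] = None,
    response_format: str = "wav",
) -> dict:
    """Build a /v1/audio/speech payload for tada-3b-ml.

    `voice` may be "default", a URL string, or the raw bytes of a WAV file,
    which are inlined as base64 (assumed: no data-URI prefix).
    """
    if isinstance(voice, bytes):
        voice = base64.b64encode(voice).decode("ascii")
    if voice != "default" and not voice_transcript:
        raise ValueError("voice_transcript is required for a URL or base64 reference clip")
    payload = {
        "model": "tada-3b-ml",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    if voice_transcript:
        payload["voice_transcript"] = voice_transcript
    if language:
        payload["language"] = language
    return payload


if __name__ == "__main__":
    with open("my-reference.wav", "rb") as f:  # hypothetical local reference clip
        body = build_clone_request(
            "Bonjour, ceci est une voix clonée.",
            voice=f.read(),
            voice_transcript="This is the exact text spoken in the reference clip.",
            language="fr",
        )
    print(json.dumps(body)[:120])
```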

Streaming

Pass stream: true for chunked audio over a single HTTP response. Streaming formats are pcm (default for raw HTTP callers) and opus; the PolarGrid SDKs default streaming requests to opus. Requesting wav or mp3 with stream: true returns 400 Bad Request.

Audio is delivered incrementally. TADA’s synthesis loop runs one step per text token, and the handler emits a chunk every couple of tokens as synthesis proceeds: each chunk covers a ~4-token window, 2 tokens of which overlap for crossfade context. The first audio therefore arrives well before the utterance finishes. Synthesis itself is also fast (real-time factor ~0.13–0.25, i.e. audio is produced several times faster than it plays). The streaming_verdict field in bench/tada-3b-ml/ reports per run whether the edge delivered bytes incrementally.

TADA streaming does not honor the speed parameter: speed control needs a full second synthesis pass, which is incompatible with per-chunk streaming. Pass speed: 1.0 (or omit it) for streaming requests; use batch mode if you need to change the rate.

See the Text-to-Speech API reference for the full streaming contract: response headers, truncated-stream detection, and the per-format table.
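
To make the window arithmetic concrete, the sketch below enumerates token ranges for a 4-token window with a 2-token overlap, which gives a stride of 2 tokens, i.e. one chunk every couple of tokens. It is index bookkeeping only: the real handler's boundary handling and the crossfade itself are not described on this card, and `token_windows` is a hypothetical name.

```python
from typing import Iterator, Tuple


def token_windows(num_tokens: int, window: int = 4, overlap: int = 2) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) token ranges.

    Consecutive windows share `overlap` tokens, so a new window starts every
    `window - overlap` tokens. The last window is clipped to num_tokens.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    start = 0
    while start < num_tokens:
        end = min(start + window, num_tokens)
        yield (start, end)
        if end == num_tokens:
            break  # the final tokens are covered; stop instead of emitting a tail-only window
        start += stride


if __name__ == "__main__":
    for start, end in token_windows(8):
        print(f"tokens {start}..{end - 1}")
```

With 8 tokens this yields the ranges (0, 4), (2, 6) and (4, 8): the first window is available after 4 synthesis steps rather than after all 8.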

Aliases

The following caller-facing aliases resolve to tada-3b-ml:
| Alias | Resolves to |
| --- | --- |
| humane-tada | tada-3b-ml |
| humane/tada-tts | tada-3b-ml |

Model identifier

Call this model with the canonical id tada-3b-ml (or an alias above) at /v1/audio/speech. The HuggingFace repo id HumeAI/tada-3b-ml is accepted at /v1/models/load for hot-loading but does not resolve at inference time.
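
The resolution rules above can be mirrored on the client, for example to normalize model names in logs or configuration. This sketch is illustrative: the server performs the actual resolution, and `resolve_inference_model` is a hypothetical helper.

```python
CANONICAL_ID = "tada-3b-ml"
HF_REPO_ID = "HumeAI/tada-3b-ml"  # accepted at /v1/models/load only

# Caller-facing aliases listed on this card.
ALIASES = {
    "humane-tada": CANONICAL_ID,
    "humane/tada-tts": CANONICAL_ID,
}


def resolve_inference_model(name: str) -> str:
    """Map a caller-supplied name to the canonical id used at /v1/audio/speech."""
    if name == CANONICAL_ID:
        return name
    if name == HF_REPO_ID:
        raise ValueError(
            f"{HF_REPO_ID} is accepted at /v1/models/load but does not resolve at inference time"
        )
    if name in ALIASES:
        return ALIASES[name]
    raise ValueError(f"unknown model name: {name!r}")
```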

Notes

  • TADA runs in its own Triton pod, isolated from the voice pod: hume-tada pins transformers < 5 and torch < 2.8, while the voice pod’s cohere-transcribe needs transformers >= 5.4. See backend/edge-production-setup/CLAUDE.md for the pod layout.
  • Streaming synthesis is per-token through a decoupled Triton transaction policy — the handler pushes PCM windows as they are decoded rather than buffering the full utterance.
  • For preset-voice English TTS (American or British voices) with a fixed catalog, use kokoro-82m instead. TADA is the choice when you need a specific cloned voice or a non-English language.

See also