TADA (tada-3b-ml) is a multilingual text-to-speech model served on PolarGrid edge nodes via Triton’s python backend. Unlike preset-voice models, TADA clones a speaker from a short reference clip and can carry that voice across languages: you can synthesize French in a voice you only ever recorded speaking English.
- HF repo: HumeAI/tada-3b-ml
- Modality: Text-to-Speech (streaming)
- Backend: Triton python (isolated TADA pod)
- Available regions: yvr-02 (Blackwell production)
Headline benchmark
TADA exposes a chunked-HTTP streaming transport for /v1/audio/speech.
| Measurement | p50 | p95 |
|---|---|---|
| End-to-end TTFA (with network) | 352 ms | 449 ms |
| Server-only TTFA (gateway → first triton byte) | 238 ms | 282 ms |
| Network leg of TTFA (e2e − server) | 109 ms | 210 ms |
| Full-utterance latency (TTLB, client wall-clock) | 919 ms | 1300 ms |
| Real-time factor (RTF) | 0.16 | 0.36 |
/v1/audio/speech (stream: true, pcm), 100 runs against
https://api.yvr-02.edge.polargrid.ai, captured 2026-05-27 from a
Vancouver-area laptop over the public internet. TTFA is the time to the
first audio byte arriving at the client — identical to TTFB since the
response body is raw PCM. Server-only TTFA comes from the
X-Pg-First-Byte-Ms response header (PR #507), which the gateway sets
to its measured time from _stream_tts entry to
the first PCM chunk returned by triton. Total synthesis time (TTLB) is
not exposed via header — response headers leave the wire before
synthesis completes — so TTLB stays a client-side wall-clock number. RTF
= synthesis wall-clock ÷ audio duration; below 1.0 is faster than real
time. 92/100 runs returned a streaming verdict; the remainder
finished too fast for the incremental-arrival heuristic to fire, which
is a property of the heuristic, not server-side buffering.
Raw runs: benchmarks/yvr-02-2026-05-27/tada-3b-ml/.
Harness: bench/tada-3b-ml/.
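The RTF and network-leg definitions above reduce to a few lines of arithmetic. The sketch below is illustrative and is not the benchmark harness itself: the function names are hypothetical, and it assumes the 24 kHz, 16-bit, mono PCM output listed under Capabilities.

```python
import statistics

SAMPLE_RATE_HZ = 24_000  # from the Capabilities table
BYTES_PER_SAMPLE = 2     # 16-bit mono PCM


def audio_duration_s(pcm_byte_count):
    """Playback duration of raw 24 kHz, 16-bit, mono PCM."""
    return pcm_byte_count / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def real_time_factor(synthesis_s, pcm_byte_count):
    """RTF = synthesis wall-clock / audio duration; below 1.0 is faster than real time."""
    return synthesis_s / audio_duration_s(pcm_byte_count)


def network_leg_ms(e2e_ttfa_ms, server_ttfa_ms):
    """Per-run network share of TTFA: client TTFA minus the X-Pg-First-Byte-Ms value."""
    return e2e_ttfa_ms - server_ttfa_ms


def p50_p95(values):
    """Median and 95th percentile of a list of per-run values (needs >= 2 samples)."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    return statistics.median(values), cuts[94]
```

With made-up inputs: 240,000 bytes of PCM is 5.0 s of audio, so synthesizing it in 0.8 s gives an RTF of about 0.16. If the network leg is computed per run, as sketched here, its percentiles need not equal the difference between the end-to-end and server-only percentiles in the table.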
How this compares
| Provider | TTFA p50 | Notes | Source |
|---|---|---|---|
| PolarGrid tada-3b-ml on Blackwell | 352 ms e2e / 238 ms server | Multilingual + cross-lingual voice cloning | this card |
| ElevenLabs Turbo v2 | ~200 to 300 ms model / ~478 ms real-world streaming TTFB | English-leaning | elevenlabs.io |
| Cartesia Sonic | 90 ms marketing claim / ~188 ms independent p50 | English-leaning, no cross-lingual cloning | cartesia.ai |
| ElevenLabs Flash | 75 ms marketing claim / ~288 ms independent p50 | English-leaning, no cross-lingual cloning | gradium.ai |
| Hume Octave 2 | ~100 to 200 ms TTFT | Hume’s newer TTS, would land below TADA | dev.hume.ai |
Quickstart
Edge endpoints accept your raw pg_* API key as a bearer token, with no token exchange. See Authentication. Replace <region> with your edge region, or discover the nearest one via the autorouter.
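A minimal batch request might look like the sketch below, using only the Python standard library. Several details are assumptions rather than confirmed API: the URL pattern https://api.<region>.edge.polargrid.ai is inferred from the benchmark host on this card, the response_format field name is assumed (check the Text-to-Speech API reference), and both function names are hypothetical.

```python
import json
import urllib.request


def build_speech_request(region, api_key, text, voice="default", fmt="wav"):
    """Build (but do not send) a batch POST to /v1/audio/speech."""
    payload = {
        "model": "tada-3b-ml",
        "input": text,
        "voice": voice,
        "response_format": fmt,  # ASSUMPTION: field name not confirmed by this card
    }
    return urllib.request.Request(
        # ASSUMPTION: host pattern inferred from the benchmark endpoint
        url=f"https://api.{region}.edge.polargrid.ai/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # raw pg_* key, no token exchange
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize_to_file(region, api_key, text, out_path="speech.wav"):
    """Send the request and write the returned audio bytes to disk."""
    request = build_speech_request(region, api_key, text)
    with urllib.request.urlopen(request, timeout=30) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return len(audio)
```

build_speech_request has no side effects, so the URL, headers, and body can be inspected before synthesize_to_file performs the actual network call.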
Capabilities
| Field | Value |
|---|---|
| Endpoint | POST /v1/audio/speech |
| Audio output | 24 kHz, 16-bit, mono |
| Streaming | Yes — chunked HTTP, pcm and opus (stream: true); audio delivered incrementally in ~4-token windows during synthesis |
| Batch formats | pcm, wav, mp3 |
| Voice model | Cross-lingual voice cloning from a reference clip (no preset voice catalog) |
| Languages | English, French, German, Spanish, Italian, Portuguese, Polish, Japanese, Arabic, Chinese |
| speed control | Batch only; streaming requires speed = 1.0 |
| Max batch size | 1 |
Voices — cross-lingual cloning
TADA does not expose preset voice IDs. The voice parameter selects a reference speaker:
| voice value | Meaning |
|---|---|
| default | The bundled reference clip, a neutral English speaker. Use this when you just want speech and don’t care about the timbre. |
| A URL | A WAV file (24 kHz mono) fetched and used as the reference. Pair it with voice_transcript — the exact text spoken in the clip. |
| A base64 WAV | The reference clip inlined as a base64-encoded WAV string. Also pair with voice_transcript. |
voice_transcript is required whenever voice is a URL or base64 clip — TADA conditions on both the reference audio and its transcript. It is not needed for voice: "default".
The cloned voice carries across languages: provide an English reference clip and set the language field (or write the input in the target language) to synthesize that speaker in French, Japanese, and so on.
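The voice and voice_transcript rules above can be enforced client-side before a request is sent. This is a sketch under stated assumptions: the helper names are hypothetical, and the accepted values of the language field (for example "fr") are not specified on this card.

```python
import base64


def build_clone_payload(text, voice="default", voice_transcript=None, language=None):
    """Assemble a request body, enforcing the documented transcript rule."""
    if voice != "default" and not voice_transcript:
        raise ValueError(
            "voice_transcript is required when voice is a URL or base64 clip"
        )
    payload = {"model": "tada-3b-ml", "input": text, "voice": voice}
    if voice_transcript:
        payload["voice_transcript"] = voice_transcript
    if language:
        # ASSUMPTION: the value format of `language` is not given in this card
        payload["language"] = language
    return payload


def wav_file_to_base64(path):
    """Read a reference WAV (24 kHz mono) and return it as a base64 string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

For cross-lingual cloning, the reference clip and its transcript stay in the recorded language while input carries the target-language text, for example build_clone_payload("Bonjour tout le monde", voice=wav_file_to_base64("ref.wav"), voice_transcript="Hello, this is my voice.", language="fr").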
Streaming
Pass stream: true for chunked audio over a single HTTP response. Streaming formats are pcm (default for raw HTTP callers) and opus; the PolarGrid SDKs default streaming requests to opus. Requesting wav or mp3 with stream: true returns 400 Bad Request.
Audio is delivered incrementally: TADA’s synthesis loop runs one step per text token, and the handler emits a chunk every couple of tokens as synthesis proceeds. Each chunk covers a window of about 4 tokens, 2 of which overlap the neighboring window to provide crossfade context, so the first audio arrives well before the utterance finishes. Synthesis is also fast (real-time factor ~0.13–0.25, i.e. audio is produced several times faster than it plays). The streaming_verdict field in bench/tada-3b-ml/ reports per run whether the edge delivered bytes incrementally.
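To make the windowing concrete, the sketch below enumerates token ranges for a 4-token window with a 2-token overlap. It illustrates the indexing only: the function is hypothetical, and the handler's actual windowing and crossfade logic are not shown on this card.

```python
def window_spans(num_tokens, window=4, overlap=2):
    """Yield (start, end) token ranges; consecutive windows share `overlap` tokens."""
    step = window - overlap
    start = 0
    while start < num_tokens:
        end = min(start + window, num_tokens)
        yield start, end
        if end == num_tokens:
            break  # the last window reached the end of the utterance
        start += step
```

For a 10-token utterance this yields (0, 4), (2, 6), (4, 8), (6, 10): a new window becomes available every 2 tokens instead of only once at the end.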
TADA streaming does not honor the speed parameter — speed control needs a full second synthesis pass, which is incompatible with per-chunk streaming. Pass speed: 1.0 (or omit it) for streaming requests; use batch mode if you need to change the rate.
See the Text-to-Speech API reference for the full streaming contract — response headers, truncated-stream detection, and the per-format table.
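A streaming client could be sketched as follows, using only the Python standard library. As elsewhere, the host pattern is inferred from the benchmark endpoint, the response_format field name is an assumption, and all function names are hypothetical. The argument check mirrors the documented constraints (pcm or opus only, speed fixed at 1.0) so that invalid requests fail before reaching the server.

```python
import json
import time
import urllib.request
import wave

STREAMING_FORMATS = {"pcm", "opus"}


def check_streaming_args(fmt="pcm", speed=1.0):
    """Reject combinations the card documents as unsupported for streaming."""
    if fmt not in STREAMING_FORMATS:
        raise ValueError(f"{fmt!r} cannot be streamed; use pcm or opus")
    if speed != 1.0:
        raise ValueError("streaming requires speed = 1.0; use batch mode instead")


def stream_pcm(region, api_key, text, chunk_size=4096):
    """Yield raw PCM chunks as they arrive; print client-side TTFA on the first one."""
    check_streaming_args("pcm", 1.0)
    body = {
        "model": "tada-3b-ml",
        "input": text,
        "voice": "default",
        "stream": True,
        "response_format": "pcm",  # ASSUMPTION: field name not confirmed by this card
    }
    request = urllib.request.Request(
        # ASSUMPTION: host pattern inferred from the benchmark endpoint
        f"https://api.{region}.edge.polargrid.ai/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    started = time.perf_counter()
    with urllib.request.urlopen(request, timeout=30) as response:
        first = True
        while True:
            # read1 returns as soon as some bytes are available
            chunk = response.read1(chunk_size)
            if not chunk:
                break
            if first:
                print(f"TTFA: {(time.perf_counter() - started) * 1000:.0f} ms")
                first = False
            yield chunk


def save_pcm_as_wav(chunks, path):
    """Wrap 24 kHz, 16-bit, mono PCM chunks in a WAV container."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(24_000)
        for chunk in chunks:
            wav.writeframes(chunk)
```

Because stream_pcm is a generator, save_pcm_as_wav(stream_pcm("yvr-02", api_key, "Hello"), "out.wav") writes audio to disk as it arrives; a playback client would feed the same chunks to an audio device instead.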
Aliases
The following caller-facing aliases resolve to tada-3b-ml:
| Alias | Resolves to |
|---|---|
| humane-tada | tada-3b-ml |
| humane/tada-tts | tada-3b-ml |
Model identifier
Call this model with the canonical id tada-3b-ml (or an alias above) at /v1/audio/speech. The HuggingFace repo id HumeAI/tada-3b-ml is accepted at /v1/models/load for hot-loading but does not resolve at inference time.
Notes
- TADA runs in its own Triton pod, isolated from the voice pod: hume-tada pins transformers < 5 and torch < 2.8, while the voice pod’s cohere-transcribe needs transformers >= 5.4. See backend/edge-production-setup/CLAUDE.md for the pod layout.
- Streaming synthesis is per-token through a decoupled Triton transaction policy: the handler pushes PCM windows as they are decoded rather than buffering the full utterance.
- For preset-voice English/British TTS with a fixed catalog, use kokoro-82m instead. TADA is the choice when you need a specific cloned voice or a non-English language.
See also
- Text-to-Speech API: endpoint reference, formats, streaming contract
- Voice AI guide: building voice agents on PolarGrid
- Authentication: using your pg_* API key
- /v1/models: list all available models
