HumeAI TADA 3B ML (tada-3b-ml) is a multilingual text-to-speech model served on PolarGrid edge nodes via Triton’s python backend. Unlike preset-voice models, TADA clones a speaker from a short reference clip and can carry that voice across languages — synthesize French in a voice you only ever recorded speaking English.
- HF repo: HumeAI/tada-3b-ml
- Modality: Text-to-Speech (streaming)
- Backend: Triton python (isolated TADA pod)
- Streaming availability: Pre-GA validation — not yet announced for a production region.
Headline benchmark
TADA exposes a chunked-HTTP streaming transport for /v1/audio/speech.
| Metric | p50 | p95 |
|---|---|---|
| TTFA — time to first audio byte | 499 ms | 523 ms |
| Full-utterance latency | 1417 ms | 1886 ms |
| Real-time factor (RTF) | 0.25 | 0.46 |
Streaming /v1/audio/speech (stream: true, pcm), 100 runs against a
PolarGrid edge endpoint. TTFA is end-to-end wall-clock from request sent to
first audio byte (network included). RTF = synthesis wall-clock ÷ audio
duration; below 1.0 is faster than real time. Every run returned a
streaming verdict — audio arrives in chunks during synthesis. Source:
bench/tada-3b-ml/.
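As a worked illustration of the RTF definition above, the following minimal sketch derives audio duration from a raw PCM byte count (using the 24 kHz, 16-bit, mono output format listed under Capabilities) and divides synthesis time by it. The function names and the sample numbers are hypothetical and are not taken from the benchmark harness.

```python
SAMPLE_RATE_HZ = 24_000  # output sample rate from the Capabilities table
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # mono


def pcm_duration_seconds(num_bytes: int) -> float:
    """Playback duration of raw PCM at 24 kHz, 16-bit, mono."""
    return num_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis wall-clock / audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical run: 288,000 bytes of PCM received, synthesis took 1.5 s.
audio_s = pcm_duration_seconds(288_000)  # 288000 / 48000 = 6.0 s
rtf = real_time_factor(1.5, audio_s)     # 1.5 / 6.0 = 0.25
print(f"audio={audio_s:.1f}s rtf={rtf:.2f}")
```

With these made-up inputs the sketch reports an RTF of 0.25, meaning six seconds of audio took a quarter of its playback time to synthesize.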
Quickstart
Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication. Replace <region> with your edge region, or discover the nearest one via the autorouter.
curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "tada-3b-ml",
    "input": "Hello from PolarGrid.",
    "voice": "default",
    "response_format": "pcm",
    "stream": true
  }' \
  --output speech.pcm
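The file written by the request above is headerless PCM, which most players will not open directly. A minimal sketch for wrapping it in a WAV container with Python's standard `wave` module follows. It assumes the 24 kHz, 16-bit, mono format listed under Capabilities and little-endian samples (the byte order WAV expects); the page does not state the byte order, so treat that as an assumption. The helper name `pcm_to_wav` is hypothetical.

```python
import wave


def pcm_to_wav(pcm_bytes: bytes, wav_path: str, sample_rate: int = 24_000) -> None:
    """Wrap raw 16-bit mono PCM in a WAV header so ordinary players can open it.

    Assumes little-endian samples; the source page does not confirm byte order.
    """
    with wave.open(wav_path, "wb") as wav_file:
        wav_file.setnchannels(1)          # mono
        wav_file.setsampwidth(2)          # 16-bit samples = 2 bytes
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm_bytes)   # header sizes are filled in on close
```

After running the curl command, a call such as `pcm_to_wav(open("speech.pcm", "rb").read(), "speech.wav")` would produce a playable file.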
Capabilities
| Field | Value |
|---|---|
| Endpoint | POST /v1/audio/speech |
| Audio output | 24 kHz, 16-bit, mono |
| Streaming | Yes — chunked HTTP, pcm and opus (stream: true); audio delivered incrementally in ~4-token windows during synthesis |
| Batch formats | pcm, wav, mp3 |
| Voice model | Cross-lingual voice cloning from a reference clip (no preset voice catalog) |
| Languages | English, French, German, Spanish, Italian, Portuguese, Polish, Japanese, Arabic, Chinese |
| speed control | Batch only; streaming requires speed = 1.0 |
| Max batch size | 1 |
Voices — cross-lingual cloning
TADA does not expose preset voice IDs. The voice parameter selects a reference speaker:
| voice value | Meaning |
|---|---|
| default | The bundled reference clip: a neutral English speaker. Use this when you just want speech and don’t care about the timbre. |
| A URL | A WAV file (24 kHz mono) fetched and used as the reference. Pair it with voice_transcript — the exact text spoken in the clip. |
| A base64 WAV | The reference clip inlined as a base64-encoded WAV string. Also pair with voice_transcript. |
voice_transcript is required whenever voice is a URL or base64 clip — TADA conditions on both the reference audio and its transcript. It is not needed for voice: "default".
The cloned voice carries across languages: provide an English reference clip and set the language field (or write the input in the target language) to synthesize that speaker in French, Japanese, and so on.
curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tada-3b-ml",
    "input": "Bonjour, ceci est une voix clonée.",
    "voice": "https://example.com/my-reference.wav",
    "voice_transcript": "This is the exact text spoken in the reference clip.",
    "language": "fr",
    "response_format": "wav"
  }' \
  --output cloned.wav
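The pairing rule for voice and voice_transcript can also be enforced on the client before a request is sent. The sketch below builds the same JSON body as the curl example and rejects a cloned voice without a transcript. The helper names are hypothetical, and the base64 helper assumes the API accepts a bare base64 string with no `data:` prefix, which this page does not spell out.

```python
import base64
from typing import Optional


def encode_reference_clip(wav_bytes: bytes) -> str:
    """Base64-encode a WAV reference clip for the voice field.

    Assumption: a bare base64 string is accepted (no data-URI prefix).
    """
    return base64.b64encode(wav_bytes).decode("ascii")


def build_speech_payload(
    text: str,
    voice: str = "default",
    voice_transcript: Optional[str] = None,
    language: Optional[str] = None,
    response_format: str = "wav",
) -> dict:
    """Build a /v1/audio/speech body, enforcing the transcript rule locally."""
    if voice != "default" and not voice_transcript:
        raise ValueError("voice_transcript is required when voice is a URL or base64 clip")
    payload = {
        "model": "tada-3b-ml",
        "input": text,
        "voice": voice,
        "response_format": response_format,
    }
    if voice_transcript:
        payload["voice_transcript"] = voice_transcript
    if language:
        payload["language"] = language
    return payload
```

Serializing the result with `json.dumps` gives a body equivalent to the one passed to `-d` above.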
Streaming
Pass stream: true for chunked audio over a single HTTP response. Streaming formats are pcm (default for raw HTTP callers) and opus; the PolarGrid SDKs default streaming requests to opus. Requesting wav or mp3 with stream: true returns 400 Bad Request.
Audio is delivered incrementally: TADA’s synthesis loop runs one step per text token, and the handler emits a chunk every couple of tokens as synthesis proceeds. Each chunk covers a window of about 4 tokens, 2 of which overlap with the adjacent window for crossfade context, so the first audio arrives well before the utterance finishes. Synthesis itself is also fast (real-time factor ~0.13–0.25, i.e. audio is produced several times faster than it plays). The streaming_verdict field in bench/tada-3b-ml/ reports per run whether the edge delivered bytes incrementally.
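To make the window arithmetic concrete, here is a small sketch that lists which token indices each chunk would cover for a 4-token window with a 2-token overlap, which works out to a stride of 2 tokens. It illustrates the arithmetic only: the function name is hypothetical, and how the actual handler treats the final partial window or performs the crossfade is not described on this page.

```python
from typing import List, Tuple


def chunk_windows(num_tokens: int, window: int = 4, overlap: int = 2) -> List[Tuple[int, int]]:
    """Return half-open (start, end) token spans for overlapping windows.

    Illustrative only; tail handling here is an assumption, not the handler's logic.
    """
    stride = window - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than window")
    spans = []
    start = 0
    while start < num_tokens:
        end = min(start + window, num_tokens)
        spans.append((start, end))
        if end == num_tokens:
            break  # the last window reached the end of the utterance
        start += stride
    return spans


print(chunk_windows(10))  # [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Under these assumptions the first chunk is ready after 4 tokens and a new one follows every 2 tokens, which is why the first audio can be sent long before a long utterance is finished.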
TADA streaming does not honor the speed parameter — speed control needs a full second synthesis pass, which is incompatible with per-chunk streaming. Pass speed: 1.0 (or omit it) for streaming requests; use batch mode if you need to change the rate.
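The format and speed constraints described in this section can be checked client-side before sending a request. The following is a minimal sketch of such a pre-flight check, based only on the rules stated on this page (streaming accepts pcm and opus at speed 1.0; batch accepts pcm, wav, and mp3). The function name and the message wording are hypothetical, and the server remains the authority on what it accepts.

```python
from typing import List

STREAMING_FORMATS = {"pcm", "opus"}
BATCH_FORMATS = {"pcm", "wav", "mp3"}


def check_speech_request(payload: dict) -> List[str]:
    """Return a list of problems found in a /v1/audio/speech body (empty if none)."""
    problems = []
    streaming = bool(payload.get("stream", False))
    fmt = payload.get("response_format")  # None means "use the server default"
    speed = payload.get("speed", 1.0)

    if streaming:
        if fmt is not None and fmt not in STREAMING_FORMATS:
            problems.append(f"response_format '{fmt}' is not streamable; use pcm or opus")
        if speed != 1.0:
            problems.append("streaming requires speed 1.0; use batch mode to change the rate")
    elif fmt is not None and fmt not in BATCH_FORMATS:
        problems.append(f"response_format '{fmt}' is not a listed batch format")
    return problems
```

For example, `check_speech_request({"stream": True, "response_format": "wav"})` returns one problem, matching the 400 Bad Request the server would send for that combination.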
See the Text-to-Speech API reference for the full streaming contract — response headers, truncated-stream detection, and the per-format table.
Aliases
The following caller-facing aliases resolve to tada-3b-ml:
| Alias | Resolves to |
|---|---|
| humane-tada | tada-3b-ml |
| humane/tada-tts | tada-3b-ml |
Model identifier
Call this model with the canonical id tada-3b-ml (or an alias above) at /v1/audio/speech. The HuggingFace repo id HumeAI/tada-3b-ml is accepted at /v1/models/load for hot-loading but does not resolve at inference time.
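The resolution rules above can be mirrored in a small client-side lookup, for instance to normalize model names in logs or configuration. This sketch simply encodes the alias table and the note about the HuggingFace repo id; the function name is hypothetical and the real resolution happens on the server.

```python
CANONICAL_ID = "tada-3b-ml"
HF_REPO_ID = "HumeAI/tada-3b-ml"
ALIASES = {
    "humane-tada": CANONICAL_ID,
    "humane/tada-tts": CANONICAL_ID,
}


def resolve_inference_model(name: str) -> str:
    """Map a caller-facing name to the canonical id used at /v1/audio/speech."""
    if name == CANONICAL_ID:
        return name
    if name in ALIASES:
        return ALIASES[name]
    if name == HF_REPO_ID:
        # Accepted at /v1/models/load for hot-loading, but not at inference time.
        raise ValueError(f"'{name}' does not resolve at inference; use '{CANONICAL_ID}'")
    raise ValueError(f"unknown model name: '{name}'")
```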
Notes
- TADA runs in its own Triton pod, isolated from the voice pod: hume-tada pins transformers < 5 and torch < 2.8, while the voice pod’s cohere-transcribe needs transformers >= 5.4. See backend/edge-production-setup/CLAUDE.md for the pod layout.
- Streaming synthesis is per-token through a decoupled Triton transaction policy — the handler pushes PCM windows as they are decoded rather than buffering the full utterance.
- For preset-voice English/British TTS with a fixed catalog, use kokoro-82m instead; TADA is the choice when you need a specific cloned voice or a non-English language.