HumeAI TADA 3B ML - PolarGrid

HumeAI TADA 3B ML (tada-3b-ml) is a multilingual text-to-speech model served on PolarGrid edge nodes via Triton’s Python backend. Unlike preset-voice models, TADA clones a speaker from a short reference clip and can carry that voice across languages: you can synthesize French in a voice you only ever recorded speaking English.

HF repo: HumeAI/tada-3b-ml
Modality: Text-to-Speech (streaming)
Backend: Triton Python (isolated TADA pod)

Available regions: all regions — see Model availability

Headline benchmark

TADA exposes a chunked-HTTP streaming transport for /v1/audio/speech.

Measurement	p50	p95
End-to-end TTFA (with network)	352 ms	449 ms
Server-only TTFA (gateway → first triton byte)	238 ms	282 ms
Network leg of TTFA (e2e − server)	109 ms	210 ms
Full-utterance latency (TTLB, client wall-clock)	919 ms	1300 ms
Real-time factor (RTF)	0.16	0.36

Streaming /v1/audio/speech (stream: true, pcm), 100 runs against https://api.yvr-02.edge.polargrid.ai, captured 2026-05-27 from a Vancouver-area laptop over the public internet.

TTFA is the time to the first audio byte arriving at the client; it is identical to TTFB because the response body is raw PCM. Server-only TTFA comes from the X-Pg-First-Byte-Ms response header (PR #507), which the gateway sets to its measured time from _stream_tts entry to the first PCM chunk returned by Triton. Total synthesis time (TTLB) is not exposed via a header, because response headers leave the wire before synthesis completes, so TTLB stays a client-side wall-clock number. RTF = synthesis wall-clock ÷ audio duration; below 1.0 is faster than real time.

92/100 runs returned a streaming verdict; the remainder finished too fast for the incremental-arrival heuristic to fire, which is a property of the heuristic, not server-side buffering.

Raw runs: benchmarks/yvr-02-2026-05-27/tada-3b-ml/. Harness: bench/tada-3b-ml/.
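The arithmetic behind the RTF and network-leg rows can be sketched as follows. This is a minimal illustration, not the harness in bench/tada-3b-ml/; the function names are hypothetical, and the PCM layout (24 kHz, 16-bit, mono) is taken from the Capabilities table.

```python
# Audio format from the Capabilities table: 24 kHz, 16-bit, mono PCM.
SAMPLE_RATE_HZ = 24_000
BYTES_PER_SAMPLE = 2  # 16-bit samples, one channel


def pcm_duration_s(pcm_byte_count: int) -> float:
    """Playback duration of a raw PCM payload, in seconds."""
    return pcm_byte_count / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def real_time_factor(synthesis_wall_s: float, pcm_byte_count: int) -> float:
    """RTF = synthesis wall-clock / audio duration; below 1.0 is faster than real time."""
    return synthesis_wall_s / pcm_duration_s(pcm_byte_count)


def network_leg_ms(e2e_ttfa_ms: float, server_ttfa_ms: float) -> float:
    """Network share of TTFA for one run: client TTFA minus the X-Pg-First-Byte-Ms value."""
    return e2e_ttfa_ms - server_ttfa_ms
```

For example, a run that takes 0.5 s of wall-clock time to deliver 96,000 bytes of PCM (2 s of audio) has an RTF of 0.25.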

How this compares

Provider	TTFA p50	Notes	Source
PolarGrid `tada-3b-ml` on Blackwell	352 ms e2e / 238 ms server	Multilingual + cross-lingual voice cloning	this card
ElevenLabs Turbo v2	~200 to 300 ms model / ~478 ms real-world streaming TTFB	English-leaning	elevenlabs.io
Cartesia Sonic	90 ms marketing claim / ~188 ms independent p50	English-leaning, no cross-lingual cloning	cartesia.ai
ElevenLabs Flash	75 ms marketing claim / ~288 ms independent p50	English-leaning, no cross-lingual cloning	gradium.ai
Hume Octave 2	~100 to 200 ms TTFT	Hume’s newer TTS, would land below TADA	dev.hume.ai

PolarGrid’s 238 ms server TTFA is in the range of real-world ElevenLabs Turbo v2 streaming TTFB. Cartesia Sonic and ElevenLabs Flash advertise lower numbers and measure in a similar range to PolarGrid in independent tests, but they ship smaller English-leaning models without cross-lingual cloning, so the comparison is not like for like. Hume Octave 2 has since lowered the latency bar on Hume’s own product line; PolarGrid hosts TADA (the prior generation) faster than Hume hosted it.

Quickstart

Edge endpoints accept your raw pg_* API key as a bearer token — no token exchange. See Authentication. Replace <region> with your edge region, or discover the nearest one via the autorouter.

curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  --no-buffer \
  -d '{
    "model": "tada-3b-ml",
    "input": "Hello from PolarGrid.",
    "voice": "default",
    "response_format": "pcm",
    "stream": true
  }' \
  --output speech.pcm

import { PolarGrid } from "@polargrid/polargrid-sdk";

const client = await PolarGrid.create({ apiKey: process.env.POLARGRID_API_KEY });

for await (const chunk of client.textToSpeechStream({
  model: "tada-3b-ml",
  input: "Hello from PolarGrid.",
  voice: "default",
  responseFormat: "opus",
})) {
  audioPlayer.appendChunk(chunk);
}

import asyncio

from polargrid import PolarGrid


async def main():
    client = await PolarGrid.create(api_key="pg_...")

    async for chunk in client.text_to_speech_stream({
        "model": "tada-3b-ml",
        "input": "Hello from PolarGrid.",
        "voice": "default",
        "response_format": "opus",
    }):
        audio_player.append_chunk(chunk)  # audio_player: your own playback sink


asyncio.run(main())

Capabilities

Field	Value
Endpoint	`POST /v1/audio/speech`
Audio output	24 kHz, 16-bit, mono
Streaming	Yes — chunked HTTP, `pcm` and `opus` (`stream: true`); audio delivered incrementally in ~4-token windows during synthesis
Batch formats	`pcm`, `wav`, `mp3`
Voice model	Cross-lingual voice cloning from a reference clip (no preset voice catalog)
Languages	English, French, German, Spanish, Italian, Portuguese, Polish, Japanese, Arabic, Chinese
`speed` control	Batch only — streaming requires `speed = 1.0`
Max batch size	1
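Raw pcm output has no header, so most players cannot open it directly. A minimal sketch of wrapping it in a WAV container with Python's standard library is shown here. It assumes the PCM is signed little-endian (the usual WAV layout, not stated on this card) and uses the 24 kHz, 16-bit, mono format from the table above; the function name is hypothetical.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 24_000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container and return the file bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)       # header sizes are patched on close
    return buf.getvalue()
```

Usage: read speech.pcm from the Quickstart curl call, pass the bytes through pcm_to_wav_bytes, and write the result to speech.wav.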

Voices — cross-lingual cloning

TADA does not expose preset voice IDs. The voice parameter selects a reference speaker:

`voice` value	Meaning
`default`	The bundled reference clip — a neutral English speaker. Use this when you just want speech and don’t care about the timbre.
A URL	A WAV file (24 kHz mono) fetched and used as the reference. Pair it with `voice_transcript` — the exact text spoken in the clip.
A base64 WAV	The reference clip inlined as a base64-encoded WAV string. Also pair with `voice_transcript`.

voice_transcript is required whenever voice is a URL or base64 clip — TADA conditions on both the reference audio and its transcript. It is not needed for voice: "default". The cloned voice carries across languages: provide an English reference clip and set the language field (or write the input in the target language) to synthesize that speaker in French, Japanese, and so on.

curl -X POST https://api.<region>.edge.polargrid.ai/v1/audio/speech \
  -H "Authorization: Bearer $POLARGRID_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tada-3b-ml",
    "input": "Bonjour, ceci est une voix clonée.",
    "voice": "https://example.com/my-reference.wav",
    "voice_transcript": "This is the exact text spoken in the reference clip.",
    "language": "fr",
    "response_format": "wav"
  }' \
  --output cloned.wav
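For the base64 variant, the request body can be assembled as in the sketch below. The field names come from this card; the helper name is hypothetical, and the sketch assumes the gateway expects bare base64 text without a data: URI prefix, which this card does not specify.

```python
import base64
from typing import Optional


def build_clone_request(text: str, reference_wav: bytes, transcript: str,
                        language: Optional[str] = None) -> dict:
    """Build a /v1/audio/speech body that clones the speaker in reference_wav."""
    if not transcript.strip():
        # voice_transcript is required whenever voice is a URL or base64 clip.
        raise ValueError("voice_transcript is required when voice is a clip")
    body = {
        "model": "tada-3b-ml",
        "input": text,
        # Assumption: bare base64 text, no "data:" URI prefix.
        "voice": base64.b64encode(reference_wav).decode("ascii"),
        "voice_transcript": transcript,
        "response_format": "wav",
    }
    if language is not None:
        body["language"] = language
    return body
```

The returned dict can be serialized with json.dumps and sent with any HTTP client, using the same headers as the curl example.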

Streaming

Pass stream: true for chunked audio over a single HTTP response. Streaming formats are pcm (default for raw HTTP callers) and opus; the PolarGrid SDKs default streaming requests to opus. Requesting wav or mp3 with stream: true returns 400 Bad Request.
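A client can catch these 400s before sending the request. The sketch below mirrors the streaming rules stated on this card (format must be pcm or opus, speed must be 1.0); it is a hypothetical client-side helper, not the gateway's validation code, and it assumes any format outside pcm and opus is rejected when streaming.

```python
STREAMING_FORMATS = ("pcm", "opus")


def streaming_request_problems(body: dict) -> list:
    """Return reasons a streaming request body would likely be rejected with 400."""
    problems = []
    if not body.get("stream"):
        return problems  # batch requests are not covered by these rules
    fmt = body.get("response_format", "pcm")  # pcm is the raw-HTTP streaming default
    if fmt not in STREAMING_FORMATS:
        problems.append(f"response_format {fmt!r} cannot be streamed; use pcm or opus")
    if body.get("speed", 1.0) != 1.0:
        problems.append("speed must be 1.0 (or omitted) when stream is true")
    return problems
```

An empty list means the body passes the checks described here; the gateway may still reject it for other reasons.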

speed must be 1.0 in streaming mode. Any other value (e.g., speed: 1.5) with stream: true returns 400 Bad Request from the gateway, because speed control needs a full second synthesis pass, which is incompatible with per-chunk streaming. Pass speed: 1.0 (or omit it) for streaming requests, and use batch mode (stream: false) if you need to change the rate.

Audio is delivered incrementally. TADA’s synthesis loop runs one step per text token, and the handler emits a chunk every couple of tokens as synthesis proceeds (a ~4-token window, of which 2 tokens overlap for crossfade context), so the first audio arrives well before the utterance finishes. Synthesis itself is also fast (real-time factor ~0.13-0.25, i.e. audio is produced several times faster than it plays). The streaming_verdict field in bench/tada-3b-ml/ reports per run whether the edge delivered bytes incrementally. See the Text-to-Speech API reference for the full streaming contract: response headers, truncated-stream detection, and the per-format table.
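The window schedule described above can be pictured with a small sketch. It only illustrates how 4-token windows with a 2-token overlap advance through an utterance; it is not the handler's code, the function name is hypothetical, and how the real handler treats the final partial window is an assumption.

```python
def overlapping_windows(n_tokens: int, size: int = 4, overlap: int = 2):
    """Yield (start, end) token ranges; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    stride = size - overlap  # new tokens contributed by each window after the first
    start = 0
    while start < n_tokens:
        end = min(start + size, n_tokens)
        yield start, end
        if end == n_tokens:
            break  # the last window reached the end of the utterance
        start += stride
```

With the defaults, each window after the first adds two new tokens, which matches the "chunk every couple of tokens" cadence described in the text.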

Aliases

The following caller-facing aliases resolve to tada-3b-ml:

Alias	Resolves to
`humane-tada`	`tada-3b-ml`
`humane/tada-tts`	`tada-3b-ml`

Model identifier

Call this model with the canonical id tada-3b-ml (or an alias above) at /v1/audio/speech. The HuggingFace repo id HumeAI/tada-3b-ml is accepted at /v1/models/load for hot-loading but does not resolve at inference time.
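If a client wants to normalize model names before logging or caching, the rules above can be mirrored locally. This is a hypothetical client-side lookup built only from the ids listed on this card; the gateway performs its own resolution.

```python
CANONICAL_ID = "tada-3b-ml"
HF_REPO_ID = "HumeAI/tada-3b-ml"
ALIASES = {
    "humane-tada": CANONICAL_ID,
    "humane/tada-tts": CANONICAL_ID,
}


def resolve_inference_model(name: str) -> str:
    """Map a caller-facing name to the canonical id used at /v1/audio/speech."""
    if name == CANONICAL_ID:
        return name
    if name in ALIASES:
        return ALIASES[name]
    if name == HF_REPO_ID:
        # Accepted at /v1/models/load only; it does not resolve at inference time.
        raise ValueError(f"{name} is a load-time id; use {CANONICAL_ID} for inference")
    raise ValueError(f"unknown model name: {name}")
```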

Input length limit

tada-3b-ml accepts at most 850 characters of input per request. The cap is enforced at the gateway before synthesis; over-limit requests return 413 Payload Too Large (Input too long: maximum 850 characters for tada-3b-ml). The count is taken after surrounding quotes and code/markdown artifacts are stripped, i.e. on the text actually synthesized. The limit is lower than for other TTS models (kokoro-82m allows 4096) because longer inputs can exhaust GPU memory mid-synthesis. A fixed cap keeps the behavior deterministic regardless of server load: without it, the same request could succeed or fail depending on the node’s GPU memory state. Split longer text into multiple requests and concatenate the audio client-side.
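One way to do the split client-side is sketched below. It is a minimal example under stated assumptions: the helper name is hypothetical, sentence boundaries are detected with a simple punctuation regex, and characters are counted with len() (code points), which may not match the gateway's count exactly. Because the gateway counts after stripping quotes and markdown artifacts, counting the unstripped text errs on the safe side.

```python
import re

MAX_CHARS = 850  # per-request input cap for tada-3b-ml


def split_for_tts(text: str, limit: int = MAX_CHARS) -> list:
    """Split text into chunks of at most `limit` characters, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-wrap a single sentence that is itself over the limit (may cut mid-word).
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= limit:
            current = candidate      # sentence still fits in the open chunk
        else:
            chunks.append(current)   # close the open chunk, start a new one
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk is then sent as its own request. For pcm output the responses can be joined with b"".join(...) since raw PCM has no header; wav and mp3 responses carry container data and need format-aware concatenation.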

Deterministic output

TADA is a diffusion-based TTS model. The inference code seeds the RNG to a fixed value before every generation, so the same input text and the same voice reference produce byte-identical audio across requests. This is intentional: a fixed seed guarantees consistent voice identity and timing, which matters for voice-agent pipelines where unpredictable prosody shifts between calls would degrade the user experience. Key details:

No caching involved. Each request runs full diffusion inference. Billing applies per request regardless of output similarity.
Applies to both batch and streaming modes. The determinism holds whether you call with stream: true or stream: false.
Planned: user-controllable seed. A future API version will expose a seed parameter so callers can introduce deliberate prosody variation when desired.
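The byte-identical claim is easy to check from a client by hashing the audio returned for repeated, identical requests. The helper names below are hypothetical, and the sketch leaves out the HTTP calls that would produce the renders.

```python
import hashlib


def audio_fingerprint(audio: bytes) -> str:
    """SHA-256 hex digest of a complete audio response body."""
    return hashlib.sha256(audio).hexdigest()


def is_deterministic(renders: list) -> bool:
    """True when every render of the same request hashed to the same digest."""
    return len({audio_fingerprint(r) for r in renders}) <= 1
```

For streaming responses, concatenate all chunks of a run before hashing so that chunk boundaries do not affect the comparison.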

Notes

TADA runs in its own Triton pod, isolated from the voice pod: hume-tada pins transformers < 5 and torch < 2.8, while the voice pod’s cohere-transcribe needs transformers >= 5.4. See backend/edge-production-setup/CLAUDE.md for the pod layout.
Streaming synthesis is per-token through a decoupled Triton transaction policy — the handler pushes PCM windows as they are decoded rather than buffering the full utterance.
For preset-voice English TTS (including British English) with a fixed catalog, use kokoro-82m instead; TADA is the choice when you need a specific cloned voice or a non-English language.
