Healthcare voice agents have a CPU problem, not an LLM problem
Every 'reduce TTFT' thread on Twitter is solving the wrong bottleneck. Where the latency actually lives in a real clinical voice agent — and why optimizing the model first is almost always a waste of effort.
The default conversation about voice-agent latency goes like this:
“We need to get TTFT under 400ms.” “Have you tried streaming?” “Have you tried a smaller model?” “Have you tried Groq?”
This is fine for a generic voice toy. It is almost completely beside the point for a healthcare voice agent.
In the systems I’ve shipped, the model is rarely the slowest part of the path. The slowest part is the side-quests the agent has to run before it can answer. Cutting those side-quests, or running them in parallel, or accepting that some of them will block — that is where the real latency budget lives.
A real call, in real units
Pick a single, representative interaction. Patient calls in. Agent answers. The patient says: “Hi, can I get the result of the test I had on Monday?”
Here is roughly what has to happen before the agent can say anything useful:
- STT finalization on the user’s turn — 100–300ms after the patient stops speaking, depending on endpointing settings
- Intent + slot extraction — 80–200ms if streamed
- Patient identity resolution — phone-number lookup against the patient DB, anywhere from 40ms (warm cache) to 600ms (cold, with PII redaction in the path)
- Consent + identity verification check — at minimum a date-of-birth challenge, which is a full additional conversational turn — adds 3–6 seconds of human latency
- LIS lookup for the actual result — 200–1200ms depending on whether the LIS is on the same VPC, plus whatever the LIS itself feels like doing today
- PHI redaction / safety pass on the response — 50–150ms
- TTS generation, first audio chunk — 200–400ms
Add the human turns (the patient saying their DOB, then their name) and you are looking at a 5–10 second wall-clock interaction. The LLM’s TTFT is somewhere between 5% and 15% of that total.
If you save 200ms on TTFT and your LIS call is 900ms, you’ve optimized the wrong thing.
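The arithmetic is easy to check. The sketch below sums the per-step ranges from the list above, counting only the one date-of-birth turn as human latency, and shows how small a 200ms TTFT saving is against the total. The dictionary keys and function names are made up for illustration; the numbers are the ones quoted in this post, not measurements from any particular system.

```python
# Per-step latency ranges in milliseconds, copied from the list above.
# Key names are illustrative only.
MACHINE_STEPS_MS = {
    "stt_finalization": (100, 300),
    "intent_and_slots": (80, 200),
    "identity_lookup": (40, 600),
    "lis_lookup": (200, 1200),
    "phi_redaction": (50, 150),
    "tts_first_chunk": (200, 400),
}
HUMAN_VERIFICATION_MS = (3000, 6000)  # the date-of-birth turn


def budget(steps, human):
    """Return (best_case_ms, worst_case_ms) for the whole interaction."""
    low = sum(lo for lo, _ in steps.values()) + human[0]
    high = sum(hi for _, hi in steps.values()) + human[1]
    return low, high


def share(part_ms, total_ms):
    """Fraction of the total taken up by one part."""
    return part_ms / total_ms


best_ms, worst_ms = budget(MACHINE_STEPS_MS, HUMAN_VERIFICATION_MS)
ttft_saving_ms = 200

print(f"total: {best_ms} to {worst_ms} ms")
print(f"200ms TTFT saving, best case:  {share(ttft_saving_ms, best_ms):.1%}")
print(f"200ms TTFT saving, worst case: {share(ttft_saving_ms, worst_ms):.1%}")
print(f"900ms LIS call, worst case:    {share(900, worst_ms):.1%}")
```

With one verification turn the total comes out at roughly 3.7 to 8.9 seconds. A second human turn (the patient's name) pushes it into the 5–10 second range.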
Where the latency actually lives
In my experience, in roughly this order:
1. Identity verification. It is human-bound. There is no way to “speed up” a patient saying their date of birth. The only real optimization here is to avoid asking for it when you can — by trusting the caller-ID for low-risk intents, and only escalating to verification when the request touches PHI.
2. LIS reads. Most hospital LISs were not designed for sub-second random reads. The fix is rarely “make the LIS faster” (you can’t); it is “cache the read tier you’re allowed to cache, and make the cache invalidation explicit.” This is a clinical decision, not an engineering one — you cannot cache an arterial blood gas, but you can cache the patient’s name.
3. Cold starts in your own services. If you autoscale your agent infra to zero, the first call after an idle period will be terrible. Either keep a warm pool, or accept the first-call cost and make sure it doesn’t happen to your VIP user.
4. TTS first-audio chunk. This is the one most teams don’t measure, and the one users feel most directly. The gap between “agent has decided what to say” and “user hears the first sound” is the gap that makes the agent feel either alive or robotic. Streaming TTS is the single highest-ROI optimization in a voice stack, and it almost always gets less investment than LLM tuning.
5. The LLM. Yes, finally. And the right optimization here is usually not “smaller model”; it is “smaller prompt.” Most healthcare voice agents are dragging 4–8KB of system prompt and tool descriptions on every turn. Trimming that pays back faster than swapping providers.
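For item 2, “make the cache invalidation explicit” can be as simple as a policy table that says, per class of field, how long a cached value may live, with “never” as a valid answer. The sketch below is a minimal in-memory version of that idea. The field classes, the TTL values, and the `TieredCache` name are all hypothetical; the real tiers and lifetimes are for your clinical and compliance people to set.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Hypothetical policy: field class -> maximum age in seconds.
# None means "never cache, always read through to the LIS".
# These values are placeholders, not clinical guidance.
CACHE_TTL_S: Dict[str, Optional[float]] = {
    "demographics": 3600.0,  # e.g. the patient's name
    "result_status": 60.0,   # e.g. "pending" vs "ready"
    "result_value": None,    # e.g. an arterial blood gas
}


class TieredCache:
    """Read-through cache whose lifetime rules are spelled out per field class."""

    def __init__(self, ttl_by_class: Dict[str, Optional[float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_by_class
        self._clock = clock
        self._store: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get(self, field_class: str, key: str, fetch: Callable[[], Any]) -> Any:
        # An unknown field class raises KeyError: nothing is cached by accident.
        ttl = self._ttl[field_class]
        if ttl is None:
            return fetch()  # uncacheable tier: always hit the source
        now = self._clock()
        hit = self._store.get((field_class, key))
        if hit is not None and now - hit[0] < ttl:
            return hit[1]
        value = fetch()
        self._store[(field_class, key)] = (now, value)
        return value

    def invalidate(self, field_class: str, key: str) -> None:
        """Explicit invalidation, e.g. when the source signals an update."""
        self._store.pop((field_class, key), None)
```

The point of the table is that the caching decision lives in one reviewable place, so nobody has to reverse-engineer it from scattered decorator arguments.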
A heuristic
When a clinical voice agent feels slow, ask these questions in this order:
1. Are we asking for verification we don’t need?
2. Is the LIS call on the critical path when it could be pre-fetched?
3. Are we cold-starting anywhere?
4. Is TTS streaming, or is it returning a full audio file?
5. Is the prompt fat?
6. …is the LLM actually slow?
If you’re starting at #6 and working backwards, you’re going to spend three months optimizing the cheapest part of the path.
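The pre-fetch question is the one that most often has a cheap answer. A minimal sketch with `asyncio`: start the LIS read as soon as you know which patient you are probably talking to, let it run while the patient answers the date-of-birth challenge, and only release the result once verification passes. The function name and the `verify` / `fetch` callables are hypothetical stand-ins. Whether you are allowed to start a PHI read before verification completes is a policy question this sketch assumes has been answered “yes”.

```python
import asyncio
from typing import Awaitable, Callable, Optional


async def answer_with_prefetch(
    patient_id: str,
    verify: Callable[[str], Awaitable[bool]],
    fetch: Callable[[str], Awaitable[str]],
) -> Optional[str]:
    """Overlap the slow read with the human-bound verification turn."""
    # Start the LIS read now, but do not look at the result yet.
    lis_task = asyncio.create_task(fetch(patient_id))

    # The spoken challenge takes seconds; the read runs in the background.
    verified = await verify(patient_id)
    if not verified:
        # Never hand the result to an unverified caller.
        lis_task.cancel()
        return None

    # Usually already finished by the time we get here.
    return await lis_task


if __name__ == "__main__":
    async def fake_verify(pid: str) -> bool:
        await asyncio.sleep(0.3)   # stands in for the DOB turn
        return True

    async def fake_fetch(pid: str) -> str:
        await asyncio.sleep(0.2)   # stands in for a slow LIS read
        return f"result-for-{pid}"

    print(asyncio.run(answer_with_prefetch("p1", fake_verify, fake_fetch)))
```

Run sequentially, the two stand-ins would take about their sum. Overlapped, the wall-clock time is roughly the longer of the two, which in a real call is always the human.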
What “fast” actually feels like to a patient
This is the part I wish more engineers internalized.
A patient does not have a stopwatch. They have a tolerance for silence. Below ~700ms after they finish speaking, the agent feels responsive. Between 700ms and 2s, it feels like it is thinking. Above 2s of pure silence, it feels broken.
The most underrated trick in production voice agents is the acknowledgment beat: a 250–400ms filler (“let me check that for you…”) that buys you the next 1.5 seconds to do the real work. Used well, it is indistinguishable from a fast agent. Used badly, it sounds patronizing. The line is narrower than people think.
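Mechanically, the acknowledgment beat is a race between the real work and a silence budget. Here is a minimal `asyncio` sketch: if the answer is ready within the budget, say it; if not, speak the filler while the work keeps running, then say the answer. The `respond` function, the `speak` callable, and using 700ms as the trigger are my assumptions for illustration. In practice you would tune the threshold and subtract your own TTS first-chunk time from it.

```python
import asyncio
from typing import Awaitable, Callable, Coroutine

SILENCE_BUDGET_S = 0.7  # assumed trigger, taken from the ~700ms figure above
FILLER = "Let me check that for you."


async def respond(
    work: Coroutine[object, object, str],
    speak: Callable[[str], Awaitable[None]],
    silence_budget_s: float = SILENCE_BUDGET_S,
    filler: str = FILLER,
) -> None:
    """Speak the answer, inserting a filler only if the work runs long."""
    task = asyncio.create_task(work)

    # Wait up to the silence budget. asyncio.wait does not cancel on timeout,
    # so the work keeps running if it is not done yet.
    done, _pending = await asyncio.wait({task}, timeout=silence_budget_s)

    if not done:
        # The filler plays while the real work continues in the background.
        await speak(filler)

    await speak(await task)
```

Skipping the filler when the answer is already ready matters: a filler in front of a fast answer is exactly the patronizing version.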
Summary
If your healthcare voice agent feels slow, the model is almost never the first thing to fix. Audit the side-quests, audit the identity flow, audit your TTS, audit your prompt size. Then, if you still have a problem — sure, talk about TTFT.
— C.