The handwritten Test Requisition Form is the most underrated benchmark in clinical AI
Forget MMLU. The honest benchmark for any clinical AI you're about to trust with a patient is whether it can read a doctor's handwriting on a TRF — and most can't.
A short, opinionated take.
Every conversation about clinical AI capability is currently fought on two kinds of benchmark: MMLU-style trivia and, increasingly, expert reasoning suites like MedQA, MedMCQA, and NEJM-style cases. These are useful, but they are not the benchmarks I care about when I’m deciding whether to put a system in front of a patient.
The benchmark I care about is the handwritten TRF — the Test Requisition Form.
What a TRF actually is
A TRF is the slip of paper a doctor fills out to order lab tests. It contains:
- Patient demographics (often in 4-point handwriting in a 6mm box)
- Ordered tests (sometimes by name, sometimes by lab code, sometimes by a profile name only this clinic uses, sometimes circled in a checkbox grid that itself has typos)
- The clinician’s signature, registration number, and stamp (the stamp often illegible, the registration often missing a digit, the signature often a glyph)
- A diagnosis or indication (free-text, abbreviated, frequently a 4-character disease code from a list nobody has documented since 2008)
- Time, date, and collection details
- Sometimes — and only sometimes — the patient’s phone number, which is often the only field that matters for downstream reporting
These forms arrive crumpled, scanned at 150 DPI on a 12-year-old multifunction printer, partially smeared by the blood-collection tech’s gloved thumb. They are routinely the single source of truth for every downstream step in the lab.
Why this is the real benchmark
Three reasons.
One — every single thing downstream depends on it. The patient’s identity, the tests that get run, the diagnosis attached to the result, the doctor the result gets sent back to — all of it is encoded on this one form, in handwriting, with no second source of truth. If you get the patient wrong here, you’ve corrupted every record that follows. The error compounds.
Two — humans get this wrong all the time. The baseline isn’t “perfect.” The baseline is a tired data-entry tech misreading a 7 as a 1, or a Mr Sharma becoming a Mrs Sharma because of a slipped pen stroke. The honest comparison for a clinical AI is not “is the AI perfect” but “does the AI make fewer errors than the human it’s replacing, and are its errors of a different kind that the QA process can catch?”
Three — it is genuinely a multi-skill task. TRF extraction requires character recognition, layout understanding, domain knowledge (“CBC” and “Complete Blood Count” are the same thing; “TSH” and “Thyroid Profile” overlap but are not the same), checkbox detection, signature presence, stamp legibility, and the ability to know what to do when a required field is simply missing. There is no single benchmark that captures all of that. A model that scores 95% on MedQA can still confidently mis-order a test, and the patient is the one who pays for it.
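The domain-knowledge part of that list can be made concrete with a minimal sketch. Everything named here is an assumption for illustration: the alias table, the canonical codes, and the component list for a thyroid profile are hypothetical, and a real lab would load them from its own test catalogue.

```python
from typing import Optional

# Hypothetical alias table: transcribed name -> canonical order code.
TEST_ALIASES = {
    "cbc": "CBC",
    "complete blood count": "CBC",
    "tsh": "TSH",
    "thyroid profile": "THYROID_PROFILE",
}

# Hypothetical profile contents. Profiles overlap with single tests
# but are not interchangeable with them.
PROFILE_COMPONENTS = {
    "THYROID_PROFILE": {"T3", "T4", "TSH"},
}


def normalize_test(raw: str) -> Optional[str]:
    """Map a transcribed test name to a canonical code, or None if unknown."""
    key = " ".join(raw.lower().replace(".", "").split())
    return TEST_ALIASES.get(key)


def same_order(a: str, b: str) -> bool:
    """True only when both names resolve to the same canonical code."""
    code_a, code_b = normalize_test(a), normalize_test(b)
    return code_a is not None and code_a == code_b


def components(raw: str) -> set:
    """Expand a name into the set of individual tests it would run."""
    code = normalize_test(raw)
    if code is None:
        return set()
    return PROFILE_COMPONENTS.get(code, {code})


def overlaps(a: str, b: str) -> bool:
    """True when two orders share at least one individual test."""
    return bool(components(a) & components(b))
```

With these tables, same_order("CBC", "Complete Blood Count") is true, while "TSH" and "Thyroid Profile" overlap without being the same order. An unknown name returns None rather than a best guess, which is the behavior worth checking in any real system.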
What I look at when I evaluate a clinical document AI
If a vendor shows me a clinical document AI, the first thing I do is hand it a TRF. Specifically, three flavors of TRF.
1. A clean printed one with all fields filled. Every system in the world will pass this. It tells you nothing.
2. A handwritten one from a busy outpatient clinic. This is where, in my experience, 60% of the field falls apart. The model either silently hallucinates (the worst outcome), refuses to extract (annoying but recoverable), or extracts most fields but quietly gets one wrong without flagging uncertainty (the hardest to catch, and so the real killer).
3. A handwritten one with an ambiguous field. A scrawled Mr/Mrs checkbox where neither is clearly ticked. A test name that could read LFT or LDH. A date that could be 04/06 or 06/04 depending on convention.
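The date case is the easiest of these to pin down in code. A rough sketch, assuming the field has already been transcribed as a "NN/NN" string and the year is known from elsewhere on the form; the function names are hypothetical. The idea is to enumerate every valid reading and send the field to review unless exactly one survives.

```python
from datetime import date


def date_readings(text: str, year: int) -> list:
    """Return every valid calendar date that a 'NN/NN' string could mean."""
    parts = [p.strip() for p in text.split("/")]
    if len(parts) != 2 or not all(p.isdigit() for p in parts):
        return []  # not in the assumed format: nothing can be read
    first, second = int(parts[0]), int(parts[1])
    readings = set()
    # Try day/month and month/day; the set removes duplicates like 06/06.
    for day, month in ((first, second), (second, first)):
        try:
            readings.add(date(year, month, day))
        except ValueError:
            pass  # not a real calendar date under this convention
    return sorted(readings)


def needs_review(text: str, year: int) -> bool:
    """Flag the field unless exactly one reading is possible."""
    return len(date_readings(text, year)) != 1
```

Here "04/06" yields two readings and gets flagged, "25/06" yields one and passes, and "31/02" yields none and gets flagged as unreadable. Which convention the clinic actually uses is a separate piece of context this sketch deliberately does not guess.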
The question I am asking is: does the model know what it doesn’t know? Does it produce a confidence per field? Does it flag the ambiguous ones for human review instead of silently picking one? Does it return which pixel region it used to make each decision, so a QA reviewer can sanity-check?
If the answer to any of these is “no,” the model is not ready for clinical use, regardless of what its TRF accuracy number is.
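As a sketch of what a "yes" to those questions might look like in an output schema: the field names, the 0.90 threshold, and the routing labels below are all assumptions for illustration, not any vendor's actual format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune it against real QA data


@dataclass
class FieldResult:
    name: str
    value: Optional[str]        # None means missing or unreadable
    confidence: float           # model's own estimate, 0.0 to 1.0
    bbox: Optional[Tuple[int, int, int, int]]  # (x, y, width, height) in scan pixels
    alternatives: Tuple[str, ...] = ()         # other plausible readings


def route(field: FieldResult, required: bool) -> str:
    """Decide whether a field can be accepted or must go to a human."""
    if field.value is None:
        return "human_review" if required else "accept_empty"
    if field.alternatives or field.confidence < REVIEW_THRESHOLD:
        return "human_review"  # ambiguous or low confidence: never pick silently
    if field.bbox is None:
        return "human_review"  # no evidence region, so QA cannot sanity-check it
    return "auto_accept"
```

A test name read as "LFT" with "LDH" listed as an alternative goes to review even at 0.97 confidence, because the ambiguity is explicit. A missing phone number is accepted as empty only if the lab has marked that field optional.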
Why this matters beyond TRFs
The TRF is a stand-in for a wider class of clinical documents: discharge summaries, handwritten notes, paper consent forms, other lab requisitions, immunization cards. All of them share the same three properties: they’re the source of truth, humans regularly get them wrong, and they require multi-skill extraction.
Any clinical AI strategy that doesn’t have a credible answer for these documents is going to remain a demo, no matter how good its reasoning-suite scores are.
The honest summary
If you’re building or buying clinical AI, ask for the TRF benchmark. Specifically:
- Show me your accuracy on handwritten forms from a real Indian / global-south outpatient setting, not on US-printed ones
- Show me your per-field confidence and your behavior on ambiguous fields
- Show me your false-positive rate (wrong values returned with no uncertainty flag), not just your accuracy; silent wrong answers are catastrophic in this domain
- Show me what your system does when a required field is missing or illegible
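The difference between accuracy and silent wrong answers is easy to show on a labelled set. A minimal sketch, assuming each extracted field has been compared against a human-verified value; the Outcome record and both metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    predicted: str
    truth: str
    flagged: bool  # True if the system sent this field to human review


def field_accuracy(outcomes: list) -> float:
    """Share of fields whose extracted value matches the verified value."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(o.predicted == o.truth for o in outcomes) / len(outcomes)


def silent_error_rate(outcomes: list) -> float:
    """Share of fields that were wrong AND not flagged for review."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    silent = sum(o.predicted != o.truth and not o.flagged for o in outcomes)
    return silent / len(outcomes)


# Two made-up systems that make the same mistake on the same field.
system_a = [
    Outcome("7", "7", False),
    Outcome("LFT", "LDH", False),  # wrong and unflagged
    Outcome("Mr", "Mr", False),
    Outcome("04/06", "04/06", True),
]
system_b = [
    Outcome("7", "7", False),
    Outcome("LFT", "LDH", True),   # wrong, but sent to review
    Outcome("Mr", "Mr", False),
    Outcome("04/06", "04/06", True),
]
```

Both toy systems score the same field accuracy, but only the first has a non-zero silent error rate. A single headline accuracy number cannot distinguish them.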
The systems that have honest answers to those four questions are the ones I trust. The rest is theatre.
— C.