Open Source · Pro Bono
OmmSai — Healthcare AI Pipeline
15,247 handwritten prescription PDFs processed in 47.5 hours. Zero data loss. $0 infrastructure. Open-sourced for reuse.
Customer
Sai Healthcare — charitable medical event organizer
Timeline
48-hour production window
Status
Shipped · Open-sourced
Capability
Stack
Outcome
Customer Context
Who they are and what world they live in
A charitable healthcare event serving thousands of patients needed to digitize 15,000+ handwritten prescription PDFs — patient name, medications, dosages, instructions — into structured JSON for downstream medical record processing. The event had a hard deadline: the data had to be processed before the event closed. Manual transcription by volunteers was mathematically impossible. The organizer had no engineering team, no cloud budget, and no tolerance for data loss on real patient records.
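Concretely, each prescription needed to land downstream as one structured record. A rough sketch of that shape is below; field names and values are illustrative, and the actual schema lives in the open-source repo.

```python
# Illustrative target record for one prescription (not the repo's exact schema)
record = {
    "source_pdf": "scans/0142.pdf",          # illustrative path
    "patient_name": "A. Sharma",
    "medication": "Metformin",
    "dosage": "500mg",
    "instructions": "twice daily after meals",
    "confidence": {                          # per-field confidence, used by the gating step
        "patient_name": 0.97,
        "medication": 0.91,
        "dosage": 0.88,
        "instructions": 0.93,
    },
    "needs_review": False,                   # True when any field falls below 0.85
}
```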
The Problem
The fuzzy ask, translated
The ask was simple: 'process these PDFs.' The real problem had four parts. First: medical handwriting is notoriously illegible — some of these scans were borderline unreadable. Second: the LLM needed to be confident enough to be useful but calibrated enough to flag what it couldn't read. Third: volunteers — not engineers — were going to run this tool on laptops at the event venue. Fourth: the 48-hour deadline was the event itself. There was no 'we'll finish it next week.'
The Constraints
Time · Budget · Regulatory · Technical · Organizational
Hard 48-hour deadline — the event closes and the window closes with it
Handwritten medical prescriptions — notoriously illegible, varying formats, multiple languages
No fine-tuning budget and no cloud spend budget — free-tier APIs and local compute only
Zero data loss tolerance — real patient medication records
Non-engineer operators — Tkinter GUI required so volunteers could run it on any Windows laptop without a terminal
API rate limits — Anthropic free tier throttles under production volume
Architecture Decisions
What I chose. What I rejected. Why.
LLM model
Chosen
Claude Sonnet (Anthropic API) with structured JSON output schema
Rejected
GPT-4 / local Ollama models
Why
Claude's vision capabilities on handwritten text were measurably better in manual eval across 50 sample prescriptions. Ollama models at the available parameter count couldn't reliably extract medication dosages from degraded scans.
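A minimal sketch of the extraction call, assuming each PDF page is rasterized to a PNG first and that the standard Anthropic Python SDK is used; the model alias and prompt wording are assumptions, since the case study only names 'Claude Sonnet'.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

EXTRACTION_PROMPT = (
    "Extract patient_name, medication, dosage, and instructions from this handwritten "
    "prescription. Return only JSON matching the agreed schema, with a confidence value "
    "between 0 and 1 for every field. Use null for anything you cannot read."
)

def extract_prescription(png_bytes: bytes) -> str:
    """Send one rasterized prescription page to Claude and return the raw JSON string."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: the case study only says "Claude Sonnet"
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(png_bytes).decode()}},
                {"type": "text", "text": EXTRACTION_PROMPT},
            ],
        }],
    )
    return response.content[0].text
```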
Concurrency model
Chosen
ThreadPoolExecutor with 8 parallel workers
Rejected
Sequential processing / async/await
Why
8 workers saturated the free-tier rate limit without exceeding it. Sequential processing would have taken 6× longer. Async/await added complexity without benefit given the I/O-bound workload and the need for simple error isolation per prescription.
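The worker-pool pattern reduces to a few lines; process_one, on_result, and on_error below are placeholder hooks, not the repo's real function names.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 8  # tuned to sit just under the free-tier rate limit

def process_all(pdf_paths, process_one, on_result, on_error):
    """Run process_one over every PDF with 8 parallel workers.

    Errors are isolated per prescription: one failed scan never takes down the batch.
    """
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = {pool.submit(process_one, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                on_result(path, future.result())
            except Exception as exc:  # isolate every failure to its own record
                on_error(path, exc)
```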
Operator interface
Chosen
Tkinter desktop GUI with queue display, progress bar, and error log
Rejected
CLI / web interface
Why
Volunteers running on event-venue laptops. No terminal familiarity, no browser tab management, no server to host. Tkinter meant one executable, any Windows machine, no setup.
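A stripped-down sketch of that operator window; widget layout and method names are assumptions, and the real UI is in the repo.

```python
import tkinter as tk
from tkinter import ttk, scrolledtext

class PipelineWindow:
    """Minimal sketch of the operator GUI: progress bar, review queue, error log."""

    def __init__(self, total: int):
        self.root = tk.Tk()
        self.root.title("OmmSai prescription pipeline")

        self.progress = ttk.Progressbar(self.root, maximum=total, length=400)
        self.progress.pack(padx=10, pady=5)

        self.status = tk.Label(self.root, text="Starting")
        self.status.pack()

        tk.Label(self.root, text="Needs human review").pack()
        self.review_queue = tk.Listbox(self.root, width=60, height=8)
        self.review_queue.pack(padx=10)

        self.error_log = scrolledtext.ScrolledText(self.root, width=60, height=6)
        self.error_log.pack(padx=10, pady=5)

    def record_done(self, path: str, needs_review: bool):
        # Tkinter is not thread-safe: workers should hand results to the UI thread,
        # e.g. via root.after() or a queue polled from the main loop.
        self.progress.step(1)
        if needs_review:
            self.review_queue.insert(tk.END, path)

    def record_error(self, path: str, message: str):
        self.error_log.insert(tk.END, f"{path}: {message}\n")

    def run(self):
        self.root.mainloop()
```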
Confidence gating
Chosen
Hold-out eval set of 200 known prescriptions, automated diff against ground truth, 0.85 confidence threshold → human review queue
Rejected
Accept all model output / manual spot-check
Why
The model was confident on prescriptions it shouldn't have been — hallucinating dosages on illegible scans. Eval-by-vibes wasn't going to work on patient medication data. The hold-out set revealed the calibration gap before it hit production.
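The gate itself reduces to a small routing check, sketched here against the record shape above; the per-field confidence layout is an assumption.

```python
CONFIDENCE_THRESHOLD = 0.85
GATED_FIELDS = ("patient_name", "medication", "dosage", "instructions")

def needs_human_review(extraction: dict) -> bool:
    """Route a record to the review queue if ANY gated field is missing or below threshold.

    Assumes the model returns a per-field confidence dict alongside the extracted values.
    """
    confidences = extraction.get("confidence", {})
    for field in GATED_FIELDS:
        if extraction.get(field) is None:
            return True
        if confidences.get(field, 0.0) < CONFIDENCE_THRESHOLD:
            return True
    return False
```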
The Hard Problem
The one thing that almost broke the deployment
Claude was hallucinating dosages on illegible scans — and doing so confidently. The model would read a smudged '5mg' as '50mg' and return a confidence score that looked fine. Eval-by-vibes on a sample wasn't catching this. The failure mode was not 'model refuses to answer' but 'model answers incorrectly with high apparent confidence.' On medication dosages, that's a patient safety issue.
The Fix
Built an eval harness before deploying at volume: 200 prescriptions with known ground-truth extractions (manually verified), automated diff of model output against ground truth per field (patient name, medication, dosage, instructions), confidence threshold of 0.85 per field. Anything below threshold on any field routed to a human review queue displayed in the Tkinter GUI. Operators reviewed flagged records in real time. The eval harness ran in under 3 minutes on the hold-out set — enough to iterate the prompt before the full run.
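A sketch of that harness, assuming the hold-out set is laid out as image/ground-truth pairs on disk and that fields are compared by normalized exact match; both are assumptions, and the real dosage comparison may be stricter.

```python
import json
from pathlib import Path

FIELDS = ("patient_name", "medication", "dosage", "instructions")

def run_eval(holdout_dir: Path, extract_fn) -> dict:
    """Compare model output against manually verified ground truth, field by field.

    Expects <id>.json ground truth next to <id>.png in holdout_dir (assumed layout).
    Returns per-field accuracy plus every mismatch, so the prompt can be iterated.
    """
    correct = {f: 0 for f in FIELDS}
    mismatches = []
    truths = sorted(holdout_dir.glob("*.json"))
    for truth_path in truths:
        truth = json.loads(truth_path.read_text())
        predicted = extract_fn(truth_path.with_suffix(".png"))
        for field in FIELDS:
            if str(predicted.get(field, "")).strip().lower() == str(truth[field]).strip().lower():
                correct[field] += 1
            else:
                mismatches.append((truth_path.stem, field, truth[field], predicted.get(field)))
    return {
        "accuracy": {f: correct[f] / len(truths) for f in FIELDS},
        "mismatches": mismatches,
    }
```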
Production Reality
What I had to fix once the full run was underway
API rate limiting hit harder than expected at production volume. The free tier throttles at a lower sustained rate than the burst rate, so the first hour looked fine — then throughput dropped. Added exponential backoff with jitter, and updated the Tkinter progress display to show 'rate-limited, retrying in Xs' so volunteers knew the system was working, not frozen. Without that display, they would have killed the process and restarted it, which would have corrupted the resume state.
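The retry loop is the standard backoff-with-jitter pattern wired to a GUI status callback; request_fn and update_status below are placeholder hooks, not the repo's real names.

```python
import random
import time

import anthropic

def call_with_backoff(request_fn, update_status, max_retries=6):
    """Retry a rate-limited API call with exponential backoff and jitter.

    update_status is the hook that writes 'rate-limited, retrying in Xs' to the GUI,
    so operators can see the pipeline is waiting rather than frozen.
    """
    for attempt in range(max_retries):
        try:
            return request_fn()
        except anthropic.RateLimitError:
            wait = min(60, 2 ** attempt) + random.uniform(0, 1)  # cap the wait, add jitter
            update_status(f"rate-limited, retrying in {wait:.0f}s")
            time.sleep(wait)
    raise RuntimeError("exceeded retry budget; leaving record for the resume pass")
```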
Lessons Carried Forward
What this taught me that I apply to every deployment
Write the spec before the code — 'process prescriptions' is not a spec; 'extract these 4 fields with this confidence threshold and route low-confidence to human review' is
Build the eval harness before the feature — the 200-prescription hold-out set found the dosage hallucination problem in 3 minutes; finding it in production would have been a patient safety incident
Plan for the failure mode you didn't think of — rate limiting at sustained volume is different from rate limiting at burst volume
Operator UX is a production constraint, not a polish item — the Tkinter display that showed 'rate-limited, retrying' prevented volunteers from killing the process and corrupting resume state
Pro-bono open-source work is verifiable in a way paid work often isn't — recruiters can read the code, not just the resume bullet