Selective Conformal Triage for Medical VLMs
Providing mathematical guarantees for safe clinical deployment of phrasing-robust medical VLMs
Objective
Keep the error rate on auto-accepted cases within a user-set budget by implementing an uncertainty-aware triage system. This component addresses RQ4: “In what ways can a vision-language model be integrated into the radiology workflow as a triage tool that safely prioritizes or automates cases without missing critical findings?”
Key Goals
- Error Detection: Catch >80% of incorrect/hallucinated answers
- Selective Answering: Achieve ~90% safe accuracy (accuracy on answers the system does not defer) with 15-20% of cases deferred
- Mathematical Guarantees: Provide conformal guarantees that bound the error rate on auto-accepted cases
Workflow
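- Generate k paraphrases per item; collect the prediction and confidence for each paraphrase
- Compute the dispersion of predictions across paraphrases and the calibrated confidence scores
- Fit conformal thresholds on a held-out validation set to meet the error budget
- Deploy the triage policy: auto-accept if the uncertainty score is below the threshold; otherwise abstain or route to a clinician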
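The threshold-fitting and triage steps above can be sketched as follows. This is a minimal illustration, not the project's actual procedure: it assumes each validation item has a scalar uncertainty score (lower means more certain) and a known correctness label, and it bounds the selective error rate with a Hoeffding upper bound plus a union bound over a fixed grid of candidate thresholds. The names `fit_threshold`, `triage`, `candidates`, `alpha`, and `delta` are hypothetical.

```python
import math

def fit_threshold(scores, is_error, candidates, alpha, delta=0.1):
    """Pick the largest candidate threshold whose accepted set has an
    upper-bounded error rate of at most alpha.

    scores     : validation uncertainty scores (lower = more certain)
    is_error   : True where the model's answer was wrong
    candidates : fixed grid of thresholds, chosen before looking at the data
    alpha      : error budget on auto-accepted cases
    delta      : total failure probability, split across candidates
    Returns None if no candidate meets the budget.
    """
    if not candidates:
        raise ValueError("need at least one candidate threshold")
    per_test_delta = delta / len(candidates)  # union bound over the grid
    best = None
    for t in sorted(candidates):
        accepted = [e for s, e in zip(scores, is_error) if s <= t]
        n = len(accepted)
        if n == 0:
            continue
        # Hoeffding upper confidence bound on the error rate among accepted items
        ucb = sum(accepted) / n + math.sqrt(math.log(1 / per_test_delta) / (2 * n))
        if ucb <= alpha:
            best = t
    return best

def triage(score, threshold):
    """Auto-accept only when a threshold exists and the score does not exceed it."""
    if threshold is not None and score <= threshold:
        return "auto_accept"
    return "defer"
```

With few validation items the bound is loose and `fit_threshold` returns `None`, in which case every case is deferred. That is the conservative behaviour one would want from a safety component, though a tighter bound (for example an exact binomial bound) would accept more cases at the same budget.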
Subgroup‑Aware Coverage
- Maintain fairness by calibrating thresholds within strata (e.g., sentinel findings)
- Monitor coverage and error by subgroup with confidence intervals
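Per-subgroup monitoring could look like the sketch below. It assumes each logged case carries a subgroup label, whether it was auto-accepted, and whether the answer was wrong; the Wilson score interval is used here as one common choice of confidence interval for a proportion, not necessarily the one the project uses. The function names and the record layout are hypothetical.

```python
import math
from collections import defaultdict

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the proportion is unconstrained
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

def subgroup_report(records):
    """records: iterable of (subgroup, accepted, is_error) tuples.

    Returns, per subgroup, the coverage (fraction auto-accepted) and the
    selective error (error rate among auto-accepted cases), each with a CI.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # total, accepted, errors among accepted
    for group, accepted, is_error in records:
        c = counts[group]
        c[0] += 1
        if accepted:
            c[1] += 1
            c[2] += int(is_error)
    report = {}
    for group, (total, acc, err) in counts.items():
        report[group] = {
            "coverage": acc / total,
            "coverage_ci": wilson_interval(acc, total),
            "selective_error": err / acc if acc else None,
            "selective_error_ci": wilson_interval(err, acc),
        }
    return report
```

Small strata produce wide intervals, which is the point: a subgroup with few accepted cases should not be read as meeting the error budget just because its observed error is zero.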
Outputs
- Selective risk vs coverage curves
- Error guarantees for auto‑accepted cases
- Audit logs for clinical traceability
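A selective risk vs coverage curve can be computed from the same per-item uncertainty scores. The sketch below assumes lower scores mean higher certainty and that correctness labels are available; `risk_coverage_curve` is a hypothetical helper, and ties in score are broken by input order.

```python
def risk_coverage_curve(scores, is_error):
    """Return a list of (coverage, selective_risk) points.

    Items are accepted in order of increasing uncertainty score. After
    accepting the k most certain items, coverage is k / N and selective
    risk is the error rate among those k items.
    """
    n_total = len(scores)
    order = sorted(range(n_total), key=lambda i: scores[i])
    curve = []
    errors = 0
    for k, i in enumerate(order, start=1):
        errors += int(is_error[i])
        curve.append((k / n_total, errors / k))
    return curve
```

A well-behaved uncertainty score yields a curve whose risk rises with coverage; a flat or falling curve suggests the score is not ranking errors usefully.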
Implementation with Phrasing Robustness
Multi-Signal Integration
The triage system combines multiple uncertainty signals:
- Paraphrase Consistency: Agreement across linguistic variants (primary signal)
- Confidence Scores: Model’s self-reported certainty, calibrated via temperature scaling
- Attention Stability: Consistency of attention maps across paraphrases
- Clinical Risk: Stratification based on finding criticality
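Temperature scaling, named above as the calibration step for the confidence signal, can be sketched as follows. This assumes access to per-class logits and ground-truth labels on a validation set, and fits the temperature by a simple grid search over negative log-likelihood; the grid range and the function names are illustrative assumptions, and a gradient-based fit would be the more usual choice in practice.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on the grid that minimises validation NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 to 5.0, assumed range
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

def calibrated_confidence(logits, temperature):
    """Confidence of the predicted class after temperature scaling."""
    return max(softmax(logits, temperature))
```

A fitted temperature above 1 indicates the raw model was overconfident; scaling does not change which class is predicted, only how much confidence is reported for it.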
Expected Performance
Based on the phrasing robustness research:
- Sensitivity: Near-100% for critical findings
- Specificity: 85-90% for non-urgent findings
- Auto-clearance: ~30-40% of normal exams
- Flip-rate threshold: Defer if >10% disagreement across paraphrases
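The flip-rate rule can be illustrated with a short sketch. It assumes flip rate means the fraction of paraphrase answers that differ from the majority answer; the source does not define it precisely, so measuring disagreement against the original-phrasing answer would be an equally plausible reading. The names `flip_rate` and `should_defer` are hypothetical.

```python
from collections import Counter

def flip_rate(answers):
    """Fraction of paraphrase answers that differ from the majority answer."""
    if not answers:
        raise ValueError("need at least one answer")
    majority_count = Counter(answers).most_common(1)[0][1]
    return (len(answers) - majority_count) / len(answers)

def should_defer(answers, max_flip_rate=0.10):
    """Defer when disagreement across paraphrases exceeds the threshold."""
    return flip_rate(answers) > max_flip_rate
```

For example, with ten paraphrases, one dissenting answer gives a flip rate of exactly 10% and is not deferred under a strict "greater than" rule, while two dissenting answers are.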
Integration with Other Components
- Input: Results from MedPhr-Rad paraphrase testing
- Analysis: Uses attention metrics from interpretability toolkit
- Output: Safety decisions for clinical deployment
- Validation: Part of RQ5 experimental validation
See also: Phrasing Robustness Framework, MedPhr-Rad Benchmark, Metrics