MedPhr‑Rad: Paraphrase‑Robustness Benchmark for Radiology VLMs
A core component of the Robustness Gauntlet Framework, focusing on linguistic robustness evaluation
Overview
MedPhr-Rad is the linguistic robustness component of the Robustness Gauntlet framework. It measures how medical vision-language models (VLMs) handle semantically equivalent phrasings of a clinical question. This matters for real-world deployment, where radiologists may phrase the same question in several ways.
Goal
Standardize paraphrase‑first robustness evaluation in radiology VQA by releasing:
- A taxonomy of paraphrase categories
- Generators with NLI + concept‑equivalence filters
- Evaluation harness and metrics
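
The filtering step can be sketched as follows. This is a minimal illustration, not the released generator: the toy lexicon, the concept IDs, the `NliScorer` callable, and the 0.9 threshold are all assumptions made for the example. In practice the NLI scorer would wrap a real entailment model and the concept lookup would likely go through RadLex or UMLS.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List

# Hypothetical toy lexicon: surface form -> concept ID (IDs are made up).
TOY_LEXICON: Dict[str, str] = {
    "cardiomegaly": "C_CARDIOMEGALY",
    "enlarged heart": "C_CARDIOMEGALY",
    "cardiac enlargement": "C_CARDIOMEGALY",
    "pneumonia": "C_PNEUMONIA",
}

# Assumed interface: (premise, hypothesis) -> probability of entailment in [0, 1].
NliScorer = Callable[[str, str], float]


def extract_concepts(text: str, lexicon: Dict[str, str] = TOY_LEXICON) -> FrozenSet[str]:
    """Return the concept IDs whose surface form occurs in the text."""
    lowered = text.lower()
    return frozenset(cid for surface, cid in lexicon.items() if surface in lowered)


def keep_paraphrase(original: str, candidate: str, nli: NliScorer,
                    threshold: float = 0.9) -> bool:
    """Keep a candidate only if it passes both filters.

    1. Concept equivalence: both questions mention the same concept set.
    2. Bidirectional NLI: each question entails the other above the threshold.
    """
    same_concepts = extract_concepts(original) == extract_concepts(candidate)
    mutual_entailment = min(nli(original, candidate), nli(candidate, original))
    return same_concepts and mutual_entailment >= threshold


def filter_paraphrases(original: str, candidates: Iterable[str], nli: NliScorer,
                       threshold: float = 0.9) -> List[str]:
    """Return the candidates that survive both filters, in input order."""
    return [c for c in candidates if keep_paraphrase(original, c, nli, threshold)]


def stub_nli(premise: str, hypothesis: str) -> float:
    """Placeholder scorer so the sketch runs; swap in a real NLI model."""
    return 0.95


kept = filter_paraphrases(
    "Is there cardiomegaly?",
    ["Is there an enlarged heart?", "Is there cardiac enlargement?", "Is there pneumonia?"],
    stub_nli,
)
print(kept)
```

With the stub scorer, only the concept filter does any work here: the pneumonia question is dropped because its concept set differs from the original.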
Scope
- Base datasets: VQA‑RAD, PMC‑VQA, SLAKE
- Each item is extended with semantically equivalent rephrasings drawn from the taxonomy
- Baselines: LLaVA‑Rad, MedGemma, LLaVA‑Med
Metrics (per paraphrase group)
- Consistency rate, flip rate, robust accuracy
- Expected calibration error (ECE) for calibration
- Selective risk at target coverage using paraphrase dispersion
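
A rough sketch of how the per-group metrics and ECE might be computed is given below. The exact definitions live in Paraphrase Robustness Metrics; the ones used here are assumptions for illustration: a group counts as consistent when all of its predictions agree, a flip is a paraphrase whose prediction differs from the prediction on the original question, and a group counts toward robust accuracy only when every variant is answered correctly. ECE uses the common equal-width binning. The `GroupResult` type is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class GroupResult:
    """Predictions for one paraphrase group; index 0 is the original question."""
    predictions: List[str]
    gold: str


def consistency_rate(groups: Sequence[GroupResult]) -> float:
    """Fraction of groups whose predictions are all identical (assumed definition)."""
    return sum(len(set(g.predictions)) == 1 for g in groups) / len(groups)


def flip_rate(groups: Sequence[GroupResult]) -> float:
    """Fraction of paraphrases whose prediction differs from the original's."""
    flips = total = 0
    for g in groups:
        base, variants = g.predictions[0], g.predictions[1:]
        flips += sum(p != base for p in variants)
        total += len(variants)
    return flips / total if total else 0.0


def robust_accuracy(groups: Sequence[GroupResult]) -> float:
    """Fraction of groups in which every variant matches the gold answer."""
    return sum(all(p == g.gold for p in g.predictions) for g in groups) / len(groups)


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted mean of |accuracy - confidence| per bin."""
    n = len(confidences)
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece


groups = [
    GroupResult(["yes", "yes", "yes"], "yes"),  # consistent and correct
    GroupResult(["yes", "no", "yes"], "yes"),   # one flip
    GroupResult(["no", "no", "no"], "yes"),     # consistent but wrong
    GroupResult(["no", "no", "yes"], "no"),     # one flip
]
print(consistency_rate(groups), flip_rate(groups), robust_accuracy(groups))
print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False]))
```

Note that the third group is perfectly consistent yet wrong, which is why consistency and robust accuracy are reported separately.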
Release Artifacts
- Templates and generators (synthetic prompts are released when the raw text is restricted)
- Scoring scripts and plots (paired tests, bootstrap CIs)
- Leaderboard for new model submissions
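
The paired tests and bootstrap CIs mentioned above could look roughly like the sketch below. It is not the released scoring script. It assumes each model is summarized by one 0/1 outcome per paraphrase group (for example, whether the group was answered correctly under every phrasing), and it uses a percentile bootstrap and an exact McNemar test as stand-ins for whichever procedures the scripts actually implement. Function names and defaults are hypothetical.

```python
import random
from math import comb
from typing import Sequence, Tuple


def paired_bootstrap_ci(a: Sequence[int], b: Sequence[int],
                        n_resamples: int = 2000, alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for mean(a) - mean(b).

    Groups are resampled as pairs, so both models are always scored on the
    same resampled set of paraphrase groups.
    """
    if len(a) != len(b):
        raise ValueError("a and b must be paired (same length)")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    stats = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return lo, hi


def mcnemar_exact(a: Sequence[int], b: Sequence[int]) -> float:
    """Two-sided exact McNemar p-value computed from the discordant pairs."""
    only_a = sum(1 for x, y in zip(a, b) if x and not y)
    only_b = sum(1 for x, y in zip(a, b) if y and not x)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the models never disagree
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-group outcomes (1 = group answered robustly) for two models.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(paired_bootstrap_ci(model_a, model_b))
print(mcnemar_exact(model_a, model_b))
```

Resampling whole groups rather than individual paraphrases keeps the variants of one question together, which matters because their outcomes are correlated.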
Paraphrase Taxonomy
- Synonymy: Medical term variations
  - “pneumonia” ↔ “lung infection” ↔ “pulmonary consolidation”
  - “cardiomegaly” ↔ “enlarged heart” ↔ “cardiac enlargement”
- Negation Handling: Positive/negative formulations, where the expected yes/no answer inverts with the polarity of the question
  - “Is there pneumonia?” ↔ “Is there no pneumonia?”
  - “Any abnormalities?” ↔ “All normal?”
- Hedging & Certainty: Confidence modifiers
  - “definite pneumonia” ↔ “possible pneumonia” ↔ “likely pneumonia”
  - “clear evidence” ↔ “suggestive of” ↔ “consistent with”
- Temporality: Time-based variations
  - “new finding” ↔ “recent change” ↔ “acute process”
  - “chronic” ↔ “longstanding” ↔ “old”
- Quantifiers: Amount descriptors
  - “any fluid” ↔ “some fluid” ↔ “significant fluid”
  - “mild” ↔ “moderate” ↔ “severe”
- Clinical Style: Formal vs conversational
  - “What is your differential?” ↔ “What could this be?”
  - “Describe findings” ↔ “What do you see?”
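
One way to represent the taxonomy in code is an enum of categories plus a container for each paraphrase group, as in the sketch below. The class names, the synonym table, and the substitution-based generator are hypothetical and cover only the synonymy category. The synonyms come from the examples above, with articles added so the generated questions stay grammatical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class ParaphraseCategory(Enum):
    """The six categories of the taxonomy."""
    SYNONYMY = "synonymy"
    NEGATION = "negation"
    HEDGING = "hedging_certainty"
    TEMPORALITY = "temporality"
    QUANTIFIERS = "quantifiers"
    CLINICAL_STYLE = "clinical_style"


@dataclass
class ParaphraseGroup:
    """An original question, its gold answer, and its variants by category."""
    original: str
    gold_answer: str
    variants: Dict[ParaphraseCategory, List[str]] = field(default_factory=dict)


# Illustrative synonym table built from the taxonomy examples.
SYNONYMS: Dict[str, List[str]] = {
    "cardiomegaly": ["an enlarged heart", "cardiac enlargement"],
    "pneumonia": ["a lung infection", "pulmonary consolidation"],
}


def expand_synonymy(question: str,
                    synonyms: Dict[str, List[str]] = SYNONYMS) -> List[str]:
    """Generate variants by substituting each known term with its synonyms."""
    out: List[str] = []
    for term, alternatives in synonyms.items():
        if term in question:
            out.extend(question.replace(term, alt) for alt in alternatives)
    return out


group = ParaphraseGroup(original="Is there evidence of cardiomegaly?", gold_answer="yes")
group.variants[ParaphraseCategory.SYNONYMY] = expand_synonymy(group.original)
print(group.variants[ParaphraseCategory.SYNONYMY])
```

Plain string substitution is only a starting point: generated candidates would still need to pass the NLI and concept-equivalence filters before entering the benchmark.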
Integration with Robustness Gauntlet
MedPhr-Rad serves as the linguistic robustness module within the larger framework:
- Input to Visual Testing: Paraphrased questions tested across visual perturbations
- Attribution Analysis: How attention changes with different phrasings
- Triage Decisions: Consistency across paraphrases informs deferral
- Enhancement Target: Paraphrase-based training is intended to improve overall robustness
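
The triage idea, deferring when answers disagree across paraphrases, can be sketched as selective prediction driven by paraphrase dispersion. The definitions below are assumptions for illustration: dispersion is one minus the share of the most common answer in a group, the model's answer for a group is the majority answer, and at a target coverage the least dispersed groups are answered while the rest are deferred. Function names are hypothetical.

```python
from collections import Counter
from math import ceil
from typing import List, Sequence, Tuple

Group = Tuple[List[str], str]  # (predictions across paraphrases, gold answer)


def dispersion(predictions: Sequence[str]) -> float:
    """1 - share of the most common answer; 0.0 means all paraphrases agree."""
    top_count = Counter(predictions).most_common(1)[0][1]
    return 1.0 - top_count / len(predictions)


def majority_answer(predictions: Sequence[str]) -> str:
    """Most common answer across the paraphrases of one question."""
    return Counter(predictions).most_common(1)[0][0]


def selective_risk(groups: Sequence[Group], coverage: float) -> float:
    """Error rate on the answered groups when only the least dispersed
    fraction `coverage` of groups is answered and the rest are deferred."""
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ranked = sorted(groups, key=lambda g: dispersion(g[0]))
    # The small epsilon guards against float noise such as 0.7 * 10 > 7.
    n_answered = max(1, ceil(coverage * len(ranked) - 1e-9))
    answered = ranked[:n_answered]
    errors = sum(majority_answer(preds) != gold for preds, gold in answered)
    return errors / n_answered


groups = [
    (["yes", "yes", "yes"], "yes"),  # unanimous, correct
    (["no", "no", "no"], "no"),      # unanimous, correct
    (["yes", "no", "yes"], "yes"),   # split, majority correct
    (["no", "yes", "no"], "yes"),    # split, majority wrong
]
print(selective_risk(groups, coverage=0.5))
print(selective_risk(groups, coverage=1.0))
```

In this toy example, answering only the unanimous half of the groups avoids the one error that appears at full coverage.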
Relationship to Other Components
- Part of the Robustness Gauntlet Framework
- Pairs with Selective Conformal Triage for safe deployment
- Metrics detailed in Paraphrase Robustness Metrics
- Links to concept resources: RadLex, UMLS, RadGraph