VSF Med: Vulnerability Scoring Framework for Medical VLMs
Core benchmark from my paper, “VSF Med: Vulnerability Scoring Framework for Medical Vision-Language Models,” focusing on linguistic robustness and clinical risk.
Overview
VSF Med is the linguistic robustness component of the comprehensive Robustness Gauntlet framework. It specifically addresses how medical VLMs handle semantically equivalent phrasings of clinical questions, which is critical for real-world deployment where radiologists may phrase the same question in various ways.
Goal
Standardize paraphrase‑first robustness evaluation in radiology VQA by releasing:
- A taxonomy of paraphrase categories
- Generators with NLI + concept‑equivalence filters
- Evaluation harness and metrics
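The filtering step named above (NLI plus concept equivalence) can be sketched as follows. This is a minimal illustration, not the released generator: `entails` is a hypothetical callable standing in for whatever NLI model is used, and `TOY_LEXICON` with its made-up concept IDs stands in for a lookup against a resource such as RadLex or UMLS. A candidate is kept only if entailment holds in both directions and the candidate mentions the same concept set as the original.

```python
from typing import Callable, Dict, FrozenSet, List

# Hypothetical toy lexicon: surface form -> made-up concept ID.
# A real pipeline would presumably query RadLex or UMLS instead.
TOY_LEXICON: Dict[str, str] = {
    "cardiomegaly": "C_CARDIOMEGALY",
    "enlarged heart": "C_CARDIOMEGALY",
    "cardiac enlargement": "C_CARDIOMEGALY",
    "pleural effusion": "C_EFFUSION",
}


def concepts(text: str, lexicon: Dict[str, str] = TOY_LEXICON) -> FrozenSet[str]:
    """Return the set of concept IDs whose surface form occurs in the text."""
    lowered = text.lower()
    return frozenset(cid for term, cid in lexicon.items() if term in lowered)


def filter_paraphrases(
    original: str,
    candidates: List[str],
    entails: Callable[[str, str], bool],
    lexicon: Dict[str, str] = TOY_LEXICON,
) -> List[str]:
    """Keep candidates that pass both the NLI and the concept-equivalence check."""
    target = concepts(original, lexicon)
    kept = []
    for cand in candidates:
        same_concepts = concepts(cand, lexicon) == target
        # Bidirectional entailment approximates semantic equivalence.
        bidirectional = entails(original, cand) and entails(cand, original)
        if same_concepts and bidirectional:
            kept.append(cand)
    return kept
```

With a stub such as `entails = lambda premise, hypothesis: True`, calling `filter_paraphrases("Is there cardiomegaly?", [...], entails)` reduces to the concept check alone, which is handy for testing the two filters separately.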
Scope
- Base datasets: VQA‑RAD, PMC‑VQA, SLAKE
- Items extended with semantically equivalent rephrasings per taxonomy
- Baselines: LLaVA‑Rad, MedGemma, LLaVA‑Med
Metrics (per paraphrase group)
- Consistency rate, flip rate, robust accuracy
- ECE for calibration
- Selective risk at target coverage using paraphrase dispersion
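The per-group metrics can be illustrated with a short sketch. The definitions here are assumptions, since this page does not spell them out: consistency rate is taken as the share of groups in which every phrasing receives the same answer, flip rate as the share of paraphrases whose answer differs from the answer to the original phrasing (stored first in each group), and robust accuracy as the share of groups in which every phrasing is answered correctly. ECE uses the standard equal-width binning. The field names `answers` and `gold` are hypothetical.

```python
from typing import Dict, List, Sequence


def group_metrics(groups: List[Dict]) -> Dict[str, float]:
    """Each group: {'answers': [original, para1, ...], 'gold': label}."""
    consistent = robust = flips = paraphrases = 0
    for g in groups:
        answers, gold = g["answers"], g["gold"]
        consistent += len(set(answers)) == 1          # all phrasings agree
        robust += all(a == gold for a in answers)     # all phrasings correct
        flips += sum(a != answers[0] for a in answers[1:])
        paraphrases += len(answers) - 1
    n = len(groups)
    return {
        "consistency_rate": consistent / n,
        "robust_accuracy": robust / n,
        "flip_rate": flips / paraphrases if paraphrases else 0.0,
    }


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    bins: List[List] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)     # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / total * abs(avg_conf - acc)
    return ece
```

Under these definitions robust accuracy can never exceed consistency rate, because a group that is fully correct is also fully consistent.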
Release Artifacts
- Templates and generators (release synthetic prompts when raw text is restricted)
- Scoring scripts and plots (paired tests, bootstrap CIs)
- Leaderboard for new model submissions
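For the bootstrap confidence intervals mentioned among the scoring scripts, one common choice is a paired percentile bootstrap over paraphrase groups. The sketch below assumes that approach and a hypothetical function name; the released scripts may use a different resampling scheme. Both models are scored on the same groups, and the per-group score differences are resampled with replacement.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for A minus B."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and paired item by item")
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(n_resamples):
        # Resample groups with replacement, keeping each A/B pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return sum(diffs) / n, lower, upper
```

Here `scores_a` and `scores_b` could be per-group 0/1 indicators (for example, whether every paraphrase in the group was answered correctly). An interval that excludes zero suggests the difference between the two models is unlikely to be a resampling artifact.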
Paraphrase Taxonomy
- Synonymy: Medical term variations
  - “pneumonia” ↔ “lung infection” ↔ “pulmonary consolidation”
  - “cardiomegaly” ↔ “enlarged heart” ↔ “cardiac enlargement”
- Negation Handling: Positive/negative formulations
  - “Is there pneumonia?” ↔ “Is there no pneumonia?”
  - “Any abnormalities?” ↔ “All normal?”
- Hedging & Certainty: Confidence modifiers
  - “definite pneumonia” ↔ “possible pneumonia” ↔ “likely pneumonia”
  - “clear evidence” ↔ “suggestive of” ↔ “consistent with”
- Temporality: Time-based variations
  - “new finding” ↔ “recent change” ↔ “acute process”
  - “chronic” ↔ “longstanding” ↔ “old”
- Quantifiers: Amount descriptors
  - “any fluid” ↔ “some fluid” ↔ “significant fluid”
  - “mild” ↔ “moderate” ↔ “severe”
- Clinical Style: Formal vs conversational
  - “What is your differential?” ↔ “What could this be?”
  - “Describe findings” ↔ “What do you see?”
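As a rough illustration of how the taxonomy could drive a template generator, the sketch below encodes the six categories as an enum and implements only the synonymy case by term substitution. The names `ParaphraseCategory`, `SYNONYM_SETS`, and `synonym_paraphrases` are hypothetical, and the synonym set is taken from the example above. Substitution does not repair articles or agreement, so its output would still need the equivalence filters and a fluency check.

```python
import re
from enum import Enum
from typing import List, Sequence, Tuple


class ParaphraseCategory(Enum):
    SYNONYMY = "synonymy"
    NEGATION = "negation"
    HEDGING = "hedging"
    TEMPORALITY = "temporality"
    QUANTIFIERS = "quantifiers"
    CLINICAL_STYLE = "clinical_style"


# Hypothetical synonym sets, seeded from the taxonomy examples.
SYNONYM_SETS: List[Tuple[str, ...]] = [
    ("cardiomegaly", "enlarged heart", "cardiac enlargement"),
]


def synonym_paraphrases(
    question: str, synonym_sets: Sequence[Tuple[str, ...]] = SYNONYM_SETS
) -> List[Tuple[ParaphraseCategory, str]]:
    """Swap each matched term for the other members of its synonym set."""
    out = []
    for group in synonym_sets:
        for term in group:
            pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
            if pattern.search(question):
                for alt in group:
                    if alt != term:
                        # Use a function replacement so alt is inserted literally.
                        rewritten = pattern.sub(lambda m, alt=alt: alt, question)
                        out.append((ParaphraseCategory.SYNONYMY, rewritten))
                break  # one matched term per synonym set is enough
    return out
```

Tagging each generated question with its category keeps the per-category breakdown available when scores are aggregated later.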
Integration with Robustness Gauntlet
VSF Med serves as the linguistic robustness module within the larger framework:
- Input to Visual Testing: Paraphrased questions tested across visual perturbations
- Attribution Analysis: How attention changes with different phrasings
- Triage Decisions: Consistency across paraphrases informs deferral
- Enhancement Target: Paraphrase-based training improves overall robustness
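The triage idea, where consistency across paraphrases informs deferral, can be sketched by ranking items by paraphrase dispersion and answering only the most consistent fraction. The definitions are assumptions for illustration: dispersion is one minus the share of the majority answer, the model's answer for a group is the majority vote, and selective risk is the error rate on the answered fraction at the target coverage. The field names `answers` and `gold` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def dispersion(answers: Sequence[str]) -> float:
    """0.0 when all paraphrases agree; grows as answers scatter."""
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def selective_risk(groups: List[Dict], target_coverage: float) -> float:
    """Error rate on the least-dispersed fraction of groups; the rest are deferred."""
    scored = []
    for g in groups:
        majority = Counter(g["answers"]).most_common(1)[0][0]
        scored.append((dispersion(g["answers"]), majority != g["gold"]))
    scored.sort(key=lambda item: item[0])             # most consistent first
    k = max(1, math.ceil(target_coverage * len(scored)))
    answered = scored[:k]
    return sum(err for _, err in answered) / k
```

Sweeping `target_coverage` from 0 to 1 traces a risk-coverage curve. If dispersion is a useful deferral signal, risk should be lower at small coverage than at full coverage.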
Relationship to Other Components
- Part of Robustness Gauntlet Framework
- Pairs with Selective Conformal Triage for safe deployment
- Metrics detailed in Paraphrase Robustness Metrics
- Links to concept resources: RadLex, UMLS, RadGraph