VSF Med: Vulnerability Scoring Framework for Medical VLMs

Core benchmark from my paper, "VSF Med: Vulnerability Scoring Framework for Medical Vision-Language Models", focusing on linguistic robustness and clinical risk.

Overview

VSF Med is the linguistic robustness component of the Robustness Gauntlet framework. It measures how medical VLMs handle semantically equivalent phrasings of the same clinical question. This matters for real-world deployment, where different radiologists phrase the same question in different ways.

Goal

Standardize paraphrase‑first robustness evaluation in radiology VQA by releasing:

A taxonomy of paraphrase categories
Generators with NLI + concept‑equivalence filters
Evaluation harness and metrics
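
The filtering step can be sketched as a two-part check: bidirectional NLI entailment plus an exact match on the set of clinical concepts mentioned. This is only an illustration under stated assumptions. The `nli_label` and `concepts` callables are hypothetical interfaces, the toy stand-ins below replace a real NLI model and a real concept linker, and the concept IDs are made-up placeholders, not real UMLS or RadLex codes.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical interfaces (assumptions, not from the paper):
#   nli_label(premise, hypothesis) -> "entailment" | "neutral" | "contradiction"
#   concepts(text) -> set of normalized concept IDs
NliFn = Callable[[str, str], str]
ConceptFn = Callable[[str], Set[str]]


def is_equivalent(original: str, candidate: str,
                  nli_label: NliFn, concepts: ConceptFn) -> bool:
    """Accept a candidate only if entailment holds in both directions
    and both texts mention exactly the same concept set."""
    entails_both_ways = (
        nli_label(original, candidate) == "entailment"
        and nli_label(candidate, original) == "entailment"
    )
    return entails_both_ways and concepts(original) == concepts(candidate)


def filter_paraphrases(original: str, candidates: Iterable[str],
                       nli_label: NliFn, concepts: ConceptFn) -> List[str]:
    """Return the candidates that pass both filters, in input order."""
    return [c for c in candidates
            if is_equivalent(original, c, nli_label, concepts)]


# --- Toy stand-ins so the sketch runs without any model ---
_TOY_CONCEPTS = {
    "pneumonia": "C_PNEUMONIA",        # placeholder ID
    "lung infection": "C_PNEUMONIA",   # placeholder ID
    "effusion": "C_EFFUSION",          # placeholder ID
}


def toy_concepts(text: str) -> Set[str]:
    lowered = text.lower()
    return {cid for term, cid in _TOY_CONCEPTS.items() if term in lowered}


def toy_nli(premise: str, hypothesis: str) -> str:
    # Toy rule: texts that differ in the presence of "no" contradict each other.
    def negated(s: str) -> bool:
        return " no " in f" {s.lower()} "
    return "contradiction" if negated(premise) != negated(hypothesis) else "entailment"


kept = filter_paraphrases(
    "Is there pneumonia?",
    ["Is there a lung infection?", "Is there no pneumonia?", "Is there an effusion?"],
    toy_nli,
    toy_concepts,
)
```

Requiring entailment in both directions is what separates a paraphrase from a mere implication. The concept check is a second guard against candidates that read fluently but swap the finding being asked about.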

Scope

Base datasets: VQA‑RAD, PMC‑VQA, SLAKE
Items extended with semantically equivalent rephrasings per taxonomy
Baselines: LLaVA‑Rad, MedGemma, LLaVA‑Med

Metrics (per paraphrase group)

Consistency rate, flip rate, robust accuracy
ECE for calibration
Selective risk at target coverage using paraphrase dispersion
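
As a rough illustration of the group-level metrics listed above, the sketch below computes consistency rate, flip rate, robust accuracy, and ECE. The exact definitions are assumptions made for this example (the Paraphrase Robustness Metrics page is the place for the authoritative ones): a group is consistent when every phrasing gets the same answer, a flip is a paraphrase answer that differs from the answer to the original phrasing, and a group is robustly correct when every phrasing is answered correctly. Answers are assumed to be normalized to the polarity of the original question. The field names `gold` and `preds` are hypothetical.

```python
from typing import Dict, List, Sequence


def group_metrics(groups: Sequence[Dict]) -> Dict[str, float]:
    """Each group: {"gold": answer, "preds": [pred_original, pred_para_1, ...]}."""
    consistent = robust_correct = flips = n_paraphrases = 0
    for g in groups:
        gold, preds = g["gold"], g["preds"]
        if len(set(preds)) == 1:            # all phrasings agree
            consistent += 1
        if all(p == gold for p in preds):   # all phrasings correct
            robust_correct += 1
        base = preds[0]                     # answer to the original phrasing
        flips += sum(p != base for p in preds[1:])
        n_paraphrases += len(preds) - 1
    n_groups = len(groups)
    return {
        "consistency_rate": consistent / n_groups,
        "robust_accuracy": robust_correct / n_groups,
        "flip_rate": flips / n_paraphrases if n_paraphrases else 0.0,
    }


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[int],
                               n_bins: int = 10) -> float:
    """Standard equal-width-bin ECE: weighted mean of |confidence - accuracy|."""
    bins: List[List] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


example_groups = [
    {"gold": "yes", "preds": ["yes", "yes", "yes"]},
    {"gold": "no", "preds": ["no", "yes", "no"]},
]
metrics = group_metrics(example_groups)
ece = expected_calibration_error([0.95, 0.95, 0.65, 0.65], [1, 1, 1, 0])
```

In this toy input, one of the two groups is fully consistent and fully correct, and one of the four paraphrase answers differs from its original.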

Release Artifacts

Templates and generators (release synthetic prompts when raw text is restricted)
Scoring scripts and plots (paired tests, bootstrap CIs)
Leaderboard for new model submissions
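
For the paired comparisons and bootstrap CIs mentioned above, one possible shape is a percentile bootstrap over per-group score differences between two models. This is a minimal sketch, not the released scoring script: the function name, the 0/1 per-group scores, and the defaults (2000 resamples, 95% interval, fixed seed) are all assumptions chosen for illustration.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(scores_a: Sequence[float],
                        scores_b: Sequence[float],
                        n_resamples: int = 2000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b).

    Scores are paired by paraphrase group, so each resample draws whole
    groups (one difference per group) with replacement.
    Returns (point_estimate, lower, upper).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and the same length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    point = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return point, lower, upper


# Per-group robust correctness (1 = every phrasing correct) for two hypothetical models.
model_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
model_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
point, lower, upper = paired_bootstrap_ci(model_a, model_b)
```

Resampling the differences, rather than each model's scores independently, keeps the pairing intact, so variation in group difficulty that affects both models cancels out.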

Paraphrase Taxonomy

Synonymy: Medical term variations
- “pneumonia” ↔ “lung infection” ↔ “pulmonary consolidation”
- “cardiomegaly” ↔ “enlarged heart” ↔ “cardiac enlargement”
Negation Handling: Positive/negative formulations (the expected yes/no answer inverts with the polarity)
- “Is there pneumonia?” ↔ “Is there no pneumonia?”
- “Any abnormalities?” ↔ “All normal?”
Hedging & Certainty: Confidence modifiers
- “definite pneumonia” ↔ “possible pneumonia” ↔ “likely pneumonia”
- “clear evidence” ↔ “suggestive of” ↔ “consistent with”
Temporality: Time-based variations
- “new finding” ↔ “recent change” ↔ “acute process”
- “chronic” ↔ “longstanding” ↔ “old”
Quantifiers: Amount descriptors
- “any fluid” ↔ “some fluid” ↔ “significant fluid”
- “mild” ↔ “moderate” ↔ “severe”
Clinical Style: Formal vs conversational
- “What is your differential?” ↔ “What could this be?”
- “Describe findings” ↔ “What do you see?”
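
One way to make the taxonomy machine-readable is an enum of categories plus a small table of substitution pairs, so every generated candidate carries its category tag. The sketch below is an assumption about structure, not the released generator: the names `ParaphraseCategory`, `SUBSTITUTIONS`, and `generate_candidates` are hypothetical, and the pairs are copied from the examples above. Plain string substitution can yield ungrammatical or non-equivalent output, so the candidates would still need to pass the equivalence filters.

```python
from enum import Enum
from typing import Dict, Iterator, List, Tuple


class ParaphraseCategory(Enum):
    SYNONYMY = "synonymy"
    NEGATION = "negation"
    HEDGING = "hedging"
    TEMPORALITY = "temporality"
    QUANTIFIERS = "quantifiers"
    CLINICAL_STYLE = "clinical_style"


# Illustrative subset only; a real lexicon would be far larger.
SUBSTITUTIONS: Dict[ParaphraseCategory, List[Tuple[str, str]]] = {
    ParaphraseCategory.SYNONYMY: [
        ("pneumonia", "lung infection"),
        ("cardiomegaly", "enlarged heart"),
    ],
    ParaphraseCategory.TEMPORALITY: [
        ("chronic", "longstanding"),
        ("new finding", "recent change"),
    ],
    ParaphraseCategory.CLINICAL_STYLE: [
        ("Describe findings", "What do you see?"),
    ],
}


def generate_candidates(question: str) -> Iterator[Tuple[ParaphraseCategory, str]]:
    """Yield (category, candidate) for every substitution that applies.

    Matching is case-sensitive and purely lexical.
    """
    for category, pairs in SUBSTITUTIONS.items():
        for source, target in pairs:
            if source in question:
                yield category, question.replace(source, target)


candidates = list(generate_candidates("Is the pneumonia chronic?"))
```

Tagging each candidate with its category is what allows metrics to be broken down per taxonomy class later on.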

Integration with Robustness Gauntlet

MedPhr-Rad serves as the linguistic robustness module within the larger framework:

Input to Visual Testing: Paraphrased questions tested across visual perturbations
Attribution Analysis: How attention changes with different phrasings
Triage Decisions: Consistency across paraphrases informs deferral
Enhancement Target: Paraphrase-based training improves overall robustness
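
The triage idea, together with the "selective risk at target coverage" metric, can be illustrated with a simple dispersion score: answer the groups whose paraphrase answers agree most and defer the rest. This sketch uses plain ranking by dispersion, not a conformal procedure. The dispersion definition (one minus the majority-answer share), the majority-vote answer, and the `gold`/`preds` field names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Sequence


def paraphrase_dispersion(preds: Sequence[str]) -> float:
    """1 - share of the majority answer; 0.0 means every phrasing agrees."""
    top_count = Counter(preds).most_common(1)[0][1]
    return 1.0 - top_count / len(preds)


def majority_answer(preds: Sequence[str]) -> str:
    """Most frequent answer across phrasings (ties go to the first seen)."""
    return Counter(preds).most_common(1)[0][0]


def selective_risk_at_coverage(groups: Sequence[Dict], target_coverage: float) -> float:
    """Error rate on the answered subset when only the least-dispersed
    fraction `target_coverage` of groups is answered and the rest deferred."""
    ranked = sorted(groups, key=lambda g: paraphrase_dispersion(g["preds"]))
    n_keep = max(1, round(target_coverage * len(ranked)))
    kept = ranked[:n_keep]
    errors = sum(majority_answer(g["preds"]) != g["gold"] for g in kept)
    return errors / n_keep


triage_groups = [
    {"gold": "yes", "preds": ["yes", "yes", "yes"]},  # unanimous, correct
    {"gold": "no", "preds": ["no", "no", "yes"]},     # majority correct
    {"gold": "yes", "preds": ["no", "no", "yes"]},    # majority wrong
    {"gold": "no", "preds": ["yes", "no", "yes"]},    # majority wrong
]
risk_half = selective_risk_at_coverage(triage_groups, 0.5)
risk_full = selective_risk_at_coverage(triage_groups, 1.0)
```

In this toy data, deferring the more dispersed half of the groups lowers the risk on the answered half, which is the behaviour a dispersion-based deferral rule is meant to exploit.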

Relationship to Other Components

Part of Robustness Gauntlet Framework
Pairs with Selective Conformal Triage for safe deployment
Metrics detailed in Paraphrase Robustness Metrics
Links to concept resources: RadLex, UMLS, RadGraph
