Medical Vision-Language Model Robustness Research

Binesh Kumar — PhD Candidate, Secure and Assured Intelligent Learning (SAIL) Lab, University of New Haven
Research Focus: Phrasing-Robust Medical Vision-Language Models for Radiology: Measurement, Causality, Mitigation, and Safe Triage

Research Overview

This digital garden documents my dissertation research on making medical Vision-Language Models (VLMs) robust to phrasing variations in radiology. Current medical VLMs are brittle: paraphrasing a question can flip the answer or change the model's confidence unpredictably – a critical safety risk in clinical use. Building on an open-source interpretability toolkit, this work measures phrasing effects, identifies the causal factors behind failures, develops mitigation strategies, and integrates uncertainty-aware triage for safe clinical deployment.

Core Research Questions

  1. Phrasing Robustness: How can we quantify and improve medical VLM robustness to question phrasing variations?
  2. Causal Attribution: What causal factors drive VLM failures under paraphrased inputs?
  3. Uncertainty & Reliability: How can we quantify uncertainty so models know when they’re unsure?
  4. Safe Triage Integration: How can VLMs be integrated as safe triage tools without missing critical findings?
  5. Generalization: Do robustness improvements generalize across datasets and modalities?

🚀 Quick Navigation

Start Here

📚 Research Areas

🏗️ Architecture Foundations

🏥 Healthcare Applications

🛡️ Robustness & Safety

📊 Evaluation & Metrics

🔬 Current Research Focus

Active Work Streams

  1. Phrasing Robustness Measurement (H1)

    • Flip-rate quantification across paraphrases (baseline >20% → target <5%)
    • Attention consistency metrics under rephrasing
    • Paraphrase dataset creation from MIMIC-CXR
    • Baseline evaluation: LLaVA-Rad, MedGemma
  2. Causal Attribution Analysis (H2)

    • Causal mediation analysis of phrasing → attention → answers
    • Intervention experiments on attention distributions
    • Identifying linguistic constructs causing failures
    • Quantifying mediation effects
  3. Uncertainty & Mitigation (H3)

    • Calibrated confidence scores and “I don’t know” options
    • Consistency loss training with paraphrase augmentation
    • Brier score optimization
    • Target: 95% sensitivity with 5-10% abstention
  4. Safe Triage System (H4)

    • Confidence thresholds for auto-clearance
    • Near-100% sensitivity for critical findings
    • 30-40% workload reduction on normal cases
    • OOD detection integration
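To make the flip-rate target in work stream 1 concrete, here is a minimal sketch of how answer flips under paraphrasing could be counted. The data layout (a mapping from question ID to the model's answers across that question's paraphrases) is an illustrative assumption, not the dissertation's actual evaluation harness.

```python
from collections import Counter

def flip_rate(answers_per_item):
    """Fraction of questions whose answer changes under at least one paraphrase.

    `answers_per_item` maps each question ID to the list of answers the model
    gave across that question's paraphrases (first entry: canonical phrasing).
    """
    flipped = sum(
        1 for answers in answers_per_item.values()
        if len(set(answers)) > 1  # any disagreement across paraphrases is a flip
    )
    return flipped / len(answers_per_item)

def majority_answer(answers):
    """Majority vote across paraphrases -- a simple consistency baseline."""
    return Counter(answers).most_common(1)[0][0]
```

For example, `flip_rate({"q1": ["yes", "yes"], "q2": ["yes", "no"]})` returns `0.5`: one of the two questions flips under rephrasing.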

🛠️ Technical Stack

Models Under Study

  • LLaVA-Rad: Primary target for paraphrase robustness
  • MedGemma: Secondary comparison model
  • GPT-5: Closed-source baseline for state-of-the-art comparison
  • LLaVA-Med: Baseline medical VLM
  • BiomedCLIP: Domain-adapted foundation model

Evaluation Datasets

  • MIMIC-CXR: Base for paraphrase dataset creation
  • VQA-RAD: Radiology visual question answering
  • NEJM Image Challenge: External validation set
  • Radiology Paraphrase QA: New dataset with multiple rephrasings (deliverable)

Key Metrics

  • Flip-Rate: Answer changes across paraphrases (baseline >20%, target <5%)
  • Attention Consistency: Jensen–Shannon (JS) divergence between attention maps across paraphrases
  • Calibration: Brier score and reliability diagrams
  • Triage Sensitivity: Detection of critical findings (target: ~100%)
  • Workload Reduction: % of normal cases auto-cleared (target: 30-40%)
  • Generalization Gap: Performance drop on external datasets
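Two of the metrics above are simple enough to sketch directly: JS divergence between a pair of attention distributions, and the Brier score for calibration. This is a stdlib-only illustration of the standard definitions, not the toolkit's implementation; attention maps are assumed to be flattened into normalized probability vectors.

```python
import math

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two attention distributions.

    Symmetric and bounded in [0, 1]; 0 means identical attention, 1 means
    the two paraphrases attend to completely disjoint regions.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution

    def kl(a, b):
        return sum(ai * math.log2((ai + eps) / (bi + eps))
                   for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and binary outcomes.

    Lower is better; 0 means perfectly calibrated, confident predictions.
    """
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)
```

For instance, `js_divergence([1.0, 0.0], [0.0, 1.0])` is 1.0 bit (maximally inconsistent attention), and `brier_score([0.5], [1])` is 0.25.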

📈 Expected Impact

  • <5% flip-rate on paraphrased questions (vs >20% baseline)
  • Causal evidence linking phrasing to attention shifts and errors
  • 95% sensitivity at 5-10% abstention rate
  • 30-40% workload reduction with near-zero missed critical findings
  • Generalization to external datasets and modalities
  • Open-source toolkit with debugging, visualization, and safety analysis
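The triage numbers above (near-zero missed critical findings at 30-40% auto-clearance) can be illustrated with a toy threshold-selection rule. This is a hypothetical sketch, not the planned triage system: it assumes the model emits a "confidently normal" score per case and simply places the clearance threshold above the most normal-looking critical case in a held-out sample, so that sample's sensitivity is 1.0 by construction.

```python
def pick_clearance_threshold(normal_conf, critical_conf, margin=0.0):
    """Choose an auto-clearance threshold that refers every critical case.

    normal_conf / critical_conf: the model's confidence that each case is
    normal, for truly normal and truly critical cases respectively.
    A case is auto-cleared iff its normal-confidence exceeds the threshold.
    Returns the threshold and the resulting workload reduction (fraction of
    normal cases auto-cleared).
    """
    # Threshold must sit above the most "normal-looking" critical case,
    # so no critical finding is ever auto-cleared on this sample.
    tau = max(critical_conf) + margin
    cleared = sum(1 for c in normal_conf if c > tau)
    return tau, cleared / len(normal_conf)
```

With `pick_clearance_threshold([0.95, 0.9, 0.4], [0.6, 0.3])`, the threshold lands at 0.6 and two of the three normal cases are auto-cleared (67% workload reduction). In practice the threshold would need a safety margin and validation on a separate set, since max-based rules do not guarantee sensitivity on unseen data.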

🤝 Collaboration & Contact

I welcome collaborations on:

  • Medical VLM robustness evaluation
  • Interpretability and attention analysis
  • Clinical triage and safety mechanisms
  • Chest X-ray VQA datasets and benchmarks
  • Clinical validation and deployment studies

Connect via the SAIL Lab or university email.


Archived Content