Clinically Robust Vision-Language Models for Diagnostic Reasoning

Binesh Kumar, PhD student at the University of New Haven
My research: Clinically robust vision-language models for diagnostic reasoning in radiology

What I’m Working On

This website contains my research notes on making medical AI more reliable. Right now, medical AI models have a serious problem: they can change their answer when you ask the same question in different ways. That inconsistency is dangerous for patients.

My goal is to understand why this happens and fix it. I want to make AI that doctors can trust to help them make better decisions.

Main Questions I’m Trying to Answer

  1. Why do models change answers? - When the same question is phrased differently, why does the AI give a different answer?
  2. Where does the problem come from? - Which parts of the model cause these failures?
  3. How can we tell when AI is unsure? - Can we make the AI say “I don’t know” when it should?
  4. How can AI be used safely in hospitals? - How can doctors use AI assistance without missing important problems?
  5. Do fixes work everywhere? - If I fix the problem on one dataset, does the fix hold up on other types of medical images?

🚀 Where to Start

New Here? Start With These:

  • 📚 My Research Areas
  • 🏗️ How AI Models Work
  • 🏥 Medical AI Applications
  • 🛡️ Making AI Safer
  • 📊 Testing AI Models

🔬 What I’m Working on Right Now

My Current Projects

  1. Measuring the Problem (Phase 1)

    • Measuring how often the AI changes its answer across paraphrases (goal: cut the flip-rate from >20% to <5%)
    • Checking whether the AI looks at the same image regions when its answer changes
    • Building a dataset with multiple ways of asking the same questions
    • Testing the LLaVA-Rad and MedGemma models
  2. Finding Why It Happens (Phase 2)

    • Understanding the chain: different phrasing → shifted attention → different answer
    • Testing what happens when I intervene on specific parts of the model
    • Finding which types of questions cause the most problems
    • Measuring how much each component contributes to the failures
  3. Fixing the Problem (Phase 3)

    • Improving calibration so the AI is appropriately confident about what it knows
    • Training the AI to say “I don’t know” when it is unsure
    • Making the AI give consistent answers across question variations
    • Goal: 95% accuracy on the cases the AI answers, while abstaining (“I don’t know”) on 5-10% of cases (a small sketch of this idea follows this list)
  4. Making It Safe for Hospitals (Phase 4)

    • Setting confidence thresholds for when the AI can handle a case on its own
    • Making sure the AI almost never misses serious findings (near-100% detection)
    • Reducing radiologist workload by 30-40% on normal cases
    • Detecting when an image is too different from the training data (out-of-distribution detection)
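
Here is a minimal sketch, in Python, of the selective-answering idea behind Phases 3 and 4: the model only answers when its confidence clears a threshold and defers everything else to a radiologist. The function name, threshold, and numbers are illustrative placeholders I chose for this example, not the project’s actual code.

```python
# Minimal sketch of confidence-thresholded ("selective") answering.
# All names, thresholds, and numbers are illustrative, not real project code.

def selective_report(predictions, threshold=0.9):
    """predictions: list of (confidence, is_correct) pairs for one model on one test set."""
    answered = [(conf, ok) for conf, ok in predictions if conf >= threshold]
    deferred = len(predictions) - len(answered)

    coverage = len(answered) / len(predictions)  # fraction of cases the model handles alone
    accuracy_when_answering = (
        sum(ok for _, ok in answered) / len(answered) if answered else 0.0
    )

    return {
        "coverage": coverage,                                 # e.g. aiming for ~0.90-0.95
        "abstention_rate": 1.0 - coverage,                    # e.g. aiming for ~0.05-0.10
        "accuracy_when_answering": accuracy_when_answering,   # e.g. aiming for ~0.95
        "cases_deferred_to_radiologist": deferred,
    }

# Toy example with made-up confidences:
preds = [(0.98, True), (0.95, True), (0.62, False), (0.91, True), (0.55, True)]
print(selective_report(preds, threshold=0.9))
```

Raising the threshold trades coverage for accuracy; the Phase 4 work is about choosing that trade-off so that serious findings are still caught.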

🛠️ What I’m Using

AI Models I’m Testing

  • LLaVA-Rad: My main focus, a medical vision-language model that reads chest X-rays
  • MedGemma: Google’s medical AI, used for comparison
  • GPT-5: A leading commercial model used as a comparison baseline
  • LLaVA-Med: A baseline medical vision-language model
  • BiomedCLIP: A vision-language model pretrained on biomedical image-text data

Medical Image Collections

  • MIMIC-CXR: A large collection of chest X-rays paired with radiology reports
  • VQA-RAD: Question-answer pairs about radiology images
  • NEJM Image Challenge: Difficult cases from the New England Journal of Medicine
  • Radiology Paraphrase QA: A new dataset I am building, with multiple paraphrases of each question (a sketch of one record follows this list)
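
As a rough illustration of what one Radiology Paraphrase QA record might look like, here is a hypothetical schema in Python. The field names, IDs, and values are placeholders I made up for explanation, not the dataset’s final format.

```python
# Hypothetical record: one image, one clinical question, several paraphrases,
# and a single reference answer. Field names and IDs are illustrative only.
example_record = {
    "image_id": "example-chest-xray-0001",
    "question_id": "pneumothorax-presence-001",
    "reference_answer": "no",
    "paraphrases": [
        "Is there a pneumothorax?",
        "Do you see any collapsed lung?",
        "Is there air in the pleural space?",
        "Can pneumothorax be ruled out on this film?",
    ],
}
```

A model is consistent on a record like this only if it gives the same answer to every paraphrase; the flip-rate metric below counts how often it does not.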

How I Measure Success

  • Flip-Rate: How often the model’s answer changes across paraphrases of the same question (currently >20%, target <5%; see the sketch after this list)
  • Attention Consistency: Whether the model attends to the same image regions across paraphrases
  • Calibration: How well the model’s confidence matches its actual accuracy
  • Triage Sensitivity: How reliably serious findings are caught (target ~100%)
  • Workload Reduction: How much routine work the AI can take off radiologists (target 30-40%)
  • Generalization: Whether the improvements hold up on new datasets
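
Below is a minimal sketch of how the first two metrics could be computed, assuming the model’s answers per paraphrase and per-paraphrase attention maps have already been collected. The function names and the top-fraction choice are illustrative assumptions, not the project’s exact definitions.

```python
import numpy as np

def flip_rate(records):
    """records: list of answer lists, one inner list per question,
    holding the model's answer to each paraphrase of that question."""
    flipped = sum(1 for answers in records if len(set(answers)) > 1)
    return flipped / len(records)

def attention_iou(map_a, map_b, top_fraction=0.1):
    """Overlap (intersection over union) between the most-attended pixels
    of two attention maps for the same image under two paraphrases."""
    k = max(1, int(top_fraction * map_a.size))
    top_a = set(np.argsort(map_a.ravel())[-k:])
    top_b = set(np.argsort(map_b.ravel())[-k:])
    return len(top_a & top_b) / len(top_a | top_b)

# Toy example: the model flips on 1 of 3 questions -> flip-rate of about 0.33
answers_per_question = [
    ["no", "no", "no"],                  # consistent across paraphrases
    ["yes", "no", "yes"],                # flips across paraphrases
    ["cardiomegaly", "cardiomegaly"],    # consistent
]
print(flip_rate(answers_per_question))
print(attention_iou(np.random.rand(16, 16), np.random.rand(16, 16)))
```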

📈 What I Hope to Achieve

  • Better consistency: The model changes its answer on fewer than 5% of paraphrases (versus >20% now)
  • Understanding causes: Clear evidence of which model components cause the failures
  • Smart confidence: 95% accuracy when the model is confident, with “I don’t know” on 5-10% of cases
  • Hospital efficiency: Help radiologists handle 30-40% more cases without missing serious findings
  • Broader impact: Show that the fixes transfer to other types of medical images
  • Open tools: Release all my tools so other researchers can use and improve them

🤝 Want to Collaborate?

I’m interested in working with others on:

  • Testing how reliable medical AI models are
  • Understanding what AI “looks at” in medical images
  • Making AI safer for hospital use
  • Building better medical image datasets
  • Validating AI with real doctors and patients

You can contact me through my university or the SAIL Lab.


Additional Notes

  • Additional Notes & Archived Plans — Older terminology (FSF/EFG), early robustness gauntlet plan, and other background material that informed the current proposal

Old Content (Archive)