When AI Radiologists Get Confused: The Critical Challenge of VLM Robustness in Medical Diagnostics
Picture this: You’re in the emergency room with chest pain and shortness of breath. The doctor orders a chest X-ray, and while waiting for the radiologist, you pull out your phone. Could ChatGPT help interpret what’s wrong? You’ve used it for math problems and recipe suggestions. Surely it could read an X-ray?
This isn’t a hypothetical anymore. We’re already there. An Australian study found that 9.9% of adults had used ChatGPT for health questions in just six months, with nearly 40% of non-users considering it. When people get health advice from ChatGPT, nearly half simply follow it. No questions asked, no double-checking with their doctor.
Our recent research reveals that when these sophisticated AI systems move from answering text questions to interpreting medical images, they become dangerously brittle. We evaluated 125 chest X-ray interpretations using state-of-the-art vision language models, including Google’s MedGemma 4B and GPT-4V. Simply changing “vascular dilation” to “vascular congestion” in a question made the same AI system provide completely different diagnoses for the same X-ray image.
The Promise of VLM-Based Radiologists
When we first started working with vision language models for medical imaging, the promise seemed clear. Unlike traditional AI that just spits out labels like “pneumonia: 87% probability,” these models could actually explain what they saw. They’d tell you why they thought something looked abnormal. You could ask follow-up questions. Feed them a patient’s history and watch them adjust their interpretation accordingly.
But here’s what we discovered matters just as much as accuracy: consistency. We call it robustness in the lab, but what it really means is whether the AI gives you the same answer when you ask the same question slightly differently. Think about it. A radiologist doesn’t suddenly see pneumonia just because you say “chest radiograph” instead of “X-ray.” They know “lung volumes” and “lung capacity” mean the same thing.
Yet that’s exactly what happens with today’s most advanced models. And we’re not talking about edge cases or trick questions. We tested basic medical synonyms, the kind any first-year resident would recognize as identical. The models fell apart.
When Terminology Becomes a Diagnostic Trap
Let’s walk through real examples from our evaluation that show how catastrophically these models can fail.
Case 1: The Vascular Confusion
We showed a chest X-ray to one of the most advanced vision language models available. Asked about vascular findings. The model correctly identified pulmonary vascular dilation, which is exactly what we’d expect. It’s a widening of blood vessels that might indicate various conditions but isn’t immediately life threatening.
Then we changed two words. Just two. “Vascular dilation” became “vascular congestion.”
Suddenly the model was talking about cardiac congestion. Possible heart failure. Recommending completely different follow-up procedures. Same image, nearly identical question, completely different medical pathway. The clinical implications hit us immediately. A patient might get rushed into unnecessary cardiac workup while their actual condition goes untreated. Or worse, someone might start urgent cardiac treatment for what’s actually a non-cardiac issue.
Case 2: The Imaginary Pneumonia
This one still makes us shake our heads. We had an X-ray showing clear pleural effusion. That’s fluid around the lungs, often serious enough to need drainage. The model saw it correctly when we asked about lung findings.
But when we added the phrase “chest radiograph” to our question? The model suddenly “saw” pneumonia that wasn’t there.
It didn’t just add pneumonia to its diagnosis. It completely forgot about the pleural effusion and started recommending antibiotics. This isn’t just wrong. It’s actively harmful. A patient with fluid crushing their lungs needs drainage, not antibiotics for an infection that doesn’t exist.
Case 3: The Vanishing Diagnosis
Perhaps most concerning was when changing “lung volumes” to “lung capacity” made critical findings disappear entirely. The model went from correctly identifying pleural effusion and potential cardiac issues to completely missing the effusion and focusing only on cardiac problems.
Pleural effusion can kill you if it’s not treated. It can lead to respiratory failure. Yet a simple synonym made the AI blind to its presence. The model confidently described other findings while missing the one thing that might send someone to the ICU.
What makes these failures so unsettling is their unpredictability. You can’t train staff to avoid certain phrases or create a list of “safe” terminology. The brittleness runs deeper than that.
Making Sense of the Brittleness: What We’re Learning in the Lab
The brittleness we observed in chest X-ray VLMs sent us down a research rabbit hole. How could models that seem so sophisticated fail so spectacularly when we barely changed our words? We needed a systematic way to measure this vulnerability, which led us to develop VSF-Med (Vulnerability Scoring Framework for Medical Vision-Language Models) here at the SAIL Lab at University of New Haven.
VSF-Med: Our Systematic Approach
VSF-Med isn’t just another benchmark. It’s our attempt to quantify exactly how and why these models break. We evaluated 68,478 attack scenarios across five models including:
- CheXagent 8B: Medical-specialized model
- Llama 3.2 11B Vision: General-purpose VLM
- GPT-4o: State-of-the-art multimodal model
- Google MedGemma 4B: Medical-focused model
- GPT-4V: Previous generation flagship
The framework measures vulnerability across nine different attack vectors, giving us concrete numbers for what we’d been observing anecdotally.
The Sobering Results
What we found was concerning. Even CheXagent 8B, a model specifically trained for medical imaging, showed moderate vulnerability (z-score: 0.68) to our prompt injection attacks. That “dilation vs congestion” problem we showed you earlier? Not an isolated incident.
Key findings:
- Performance scores ranged from 7 to 97 across different phrasings of identical X-ray questions
- Medical-specialized models demonstrated only 36% better resilience compared to general-purpose VLMs
- Current best-in-class models still have vulnerability spreads exceeding 0.3 standard deviations
Better than general models, yes. Good enough for clinical use? Not even close.
Why Are These Models So Fragile?
We have theories we’re exploring:
-
Contrastive Learning Issues: Many vision-language models use contrastive learning during training, where they learn to match images with text descriptions. This might create brittle associations between specific phrases and visual features.
-
The Alignment Problem: These models are fine-tuned to be helpful and responsive, which might make them overeager to provide different answers when prompted differently. They’re trying so hard to be useful that they forget to be consistent.
-
Medical Language Complexity: Radiological language is precise but full of synonyms. Models trained on general text might not grasp that “increased opacity” and “increased density” mean the same thing in a chest X-ray context.
-
Architectural Limitations: The transformer architecture itself might contribute to this sensitivity. Attention mechanisms that work beautifully for general language tasks might amplify small prompt variations in high-stakes medical contexts.
What This Means for Medical AI
The momentum behind medical AI is undeniable. Just this week, Mayo Clinic Press highlighted how AI is already being used for stroke diagnosis, heart failure detection, and cancer screening. They describe an optimistic future where “AI has the potential to improve the work of human healthcare teams, making care more personal and effective.”
While this enthusiasm is understandable given AI’s promise, our research suggests we need to address fundamental robustness issues before these systems can truly deliver on that potential.
Safety Implications
- Diagnostic Inconsistency: Same image, different terminology = different diagnoses
- Clinical Risk: Unreliable AI could misguide medical decisions
- Trust Issues: Healthcare providers need consistent, predictable AI behavior
- Patient Safety: When people follow AI medical advice without verification, inconsistency becomes dangerous
The Real-World Context
Remember those statistics we opened with? People are already using these tools for health decisions. An Australian study found that people with limited health literacy use ChatGPT at nearly twice the rate of others (18.4% vs 9.4%). Those from non-English speaking backgrounds? Even higher at 29.2%. These are exactly the populations who might be most vulnerable to inconsistent AI responses.
Future Directions
Our research highlights the urgent need for:
- Robust Training Methods: VLMs that maintain consistency across terminology variations
- Comprehensive Testing: Systematic evaluation of medical AI before clinical deployment
- Safety Frameworks: Guidelines for reliable medical AI implementation
- Architectural Innovations: New approaches that improve multimodal robustness
Open Science Commitment
Our VSF-Med framework is completely open source because we believe this problem is too important for any single team to tackle alone. We’ve made it so researchers anywhere can benchmark their medical VLM with a single command, generating over 30,000 adversarial test cases automatically.
We’re diving deeper into architectural modifications that might improve robustness. We’re exploring whether different training objectives could create more stable image-text associations. And we’re working with clinicians to understand which types of brittleness pose the greatest real-world risks.
The Path Forward
Medical AI has immense potential to revolutionize healthcare, from reducing diagnostic errors to making expertise available in underserved areas. But as our research shows, we’re not there yet. The brittleness we’ve uncovered isn’t just a technical curiosity—it’s a fundamental barrier to safe clinical deployment.
Until we can get vulnerability spreads much, much lower, these systems remain too fragile for autonomous clinical use. This isn’t about being pessimistic about AI. It’s about being realistic about what needs to be fixed before we can responsibly deploy these powerful tools in life-and-death situations.
Because ultimately, this isn’t just an interesting technical puzzle. It’s about making sure AI tools genuinely help rather than harm when lives are on the line.
References
Sadanandan, B. K., & Behzadan, V. (2025). VSF-Med: A vulnerability scoring framework for medical vision-language models. arXiv preprint arXiv:2507.00052.
Cheng, T. S., Ung, C., Heffernan, C., Harrison, R., & Mackey, T. K. (2024). Assessment of public perceptions and attitudes toward ChatGPT and implications for emerging artificial intelligence chatbots in health care. JMIR AI, 3, e50150.
Mayo Clinic Press (2025). AI in healthcare: Current applications and future potential. Mayo Clinic Health System. Retrieved from https://www.mayoclinichealthsystem.org/
Research conducted at the SAIL Lab, University of New Haven. Open-source framework available at https://github.com/sail-lab
Enjoy Reading This Article?
Here are some more articles you might like to read next: