Research
Problem Statement
Medical Vision-Language Models (VLMs) achieve 85-95% accuracy on benchmarks, yet this high performance masks a critical safety gap: Phrasing-Sensitive Failures (PSF). These models produce contradictory clinical assessments when a question is rephrased, for example answering “Is the heart size normal?” with “Yes” while responding “Is there cardiomegaly?” with “Yes, enlarged” for the same image.
Compounding this, the Misleading Explanation Effect (MEE) causes standard faithfulness metrics to paradoxically rate incorrect predictions as more interpretable than correct ones. Together, PSF and MEE undermine the primary mechanisms clinicians use to assess AI reliability, creating dangerous blind spots in clinical deployment.
Dissertation Research
Clinically Robust Vision-Language Models for Diagnostic Applications: Measuring, Understanding, and Mitigating Phrasing-Sensitive Failures
Advised by Dr. Vahid Behzadan at the Secure and Assured Intelligent Learning Lab (SAIL Lab), University of New Haven.
Key Research Questions
- Prevalence: How prevalent is PSF across different pathology types and linguistic phenomena (negation, synonyms, hedging)?
- Mechanism: Which layers, attention heads, and fusion pathways are causally responsible for phrasing sensitivity?
- Mitigation: Can targeted LoRA-based interventions reduce PSF by 50%+ while maintaining diagnostic accuracy?
- Deployment: How should safety thresholds be calibrated for different clinical workflows (triage, critical alerts, reporting)?
Research Thrusts
Thrust 1: Measuring PSF and MEE
Development of the VSF-Med Benchmark—2,500+ paraphrased questions across 5 chest X-ray pathologies (cardiomegaly, pleural effusion, pneumothorax, consolidation, atelectasis) with linguistic phenomenon annotations.
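For illustration, one plausible shape for a benchmark record is sketched below; the field names are hypothetical, not the published VSF-Med schema.

```python
# One hypothetical VSF-Med record: a paraphrase pair tagged with the
# linguistic phenomenon that distinguishes it (all field names illustrative)
record = {
    "image_id": "mimic-cxr-00001",
    "pathology": "cardiomegaly",
    "question_original": "Is there cardiomegaly?",
    "question_paraphrase": "Is the heart size normal?",
    "phenomenon": "negation",
    "expected_answers": {"original": "yes", "paraphrase": "no"},
}
```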
Novel Metrics:
- Paraphrase Flip Rate (PFR): Proportion of semantically equivalent question pairs yielding contradictory answers
- Attention Stability Index (ASI): Cosine similarity of attention maps across paraphrases
- Misleading Explanation Effect Coefficient (MEEC): Difference in faithfulness scores between incorrect and correct predictions
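A minimal sketch of how these three metrics could be computed from model outputs; the helper names and the assumption of already-normalized answers are mine, not the released evaluation code.

```python
import numpy as np

def paraphrase_flip_rate(answer_pairs):
    """PFR: fraction of semantically equivalent question pairs whose
    normalized answers disagree for the same image."""
    flips = sum(a != b for a, b in answer_pairs)
    return flips / len(answer_pairs)

def attention_stability_index(attn_a, attn_b):
    """ASI: cosine similarity between flattened attention maps produced
    for two paraphrases of the same question."""
    a, b = attn_a.ravel(), attn_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def meec(faith_incorrect, faith_correct):
    """MEEC: mean faithfulness of incorrect predictions minus mean
    faithfulness of correct ones; positive values signal MEE."""
    return float(np.mean(faith_incorrect) - np.mean(faith_correct))

# Example: 1 flip among 4 paraphrase pairs -> PFR = 0.25
print(paraphrase_flip_rate([("yes", "yes"), ("no", "no"),
                            ("yes", "no"), ("yes", "yes")]))
```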
Preliminary Findings: 18-28% PFR observed in MedGemma and LLaVA-Rad with stable attention (ASI > 0.85), confirming visual-linguistic decoupling.
Thrust 2: Causal Analysis
Localizing PSF origins through:
- Activation patching to identify layers with high Causal Importance Scores (CIS > 0.5)
- Token ablation to measure necessity/sufficiency of linguistic elements
- Decomposed attention analysis to characterize visual-text fusion stability
Hypothesis: PSF arises when paraphrases perturb fusion pathways more than visual evidence, while MEE occurs when attention-output coupling is weak.
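To make the activation-patching procedure above concrete, here is a simplified single-layer sketch. It assumes a Hugging Face-style interface (`model(ids).logits`) and paraphrases that tokenize to the same length; the actual CIS would presumably aggregate such effects over many pairs and layers.

```python
import torch

@torch.no_grad()
def causal_importance(model, layer, ids_a, ids_b, answer_token):
    """Toy single-layer CIS: cache the layer's activation on paraphrase A,
    patch it into the run on paraphrase B, and measure how far B's answer
    logit moves toward A's."""
    cache = {}

    def save(module, inputs, output):   # record A's activation
        cache["act"] = output

    def patch(module, inputs, output):  # overwrite B's activation with A's
        return cache["act"]

    handle = layer.register_forward_hook(save)
    logit_a = model(ids_a).logits[0, -1, answer_token].item()
    handle.remove()

    logit_b = model(ids_b).logits[0, -1, answer_token].item()

    handle = layer.register_forward_hook(patch)
    logit_patched = model(ids_b).logits[0, -1, answer_token].item()
    handle.remove()

    # 1.0 -> patching fully restores A's behavior; 0.0 -> no causal effect
    return (logit_patched - logit_b) / (logit_a - logit_b + 1e-8)
```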
Thrust 3: Robustness Interventions
Targeted LoRA-based fine-tuning on causally-identified layers with a combined training objective:
- Task loss: Cross-entropy for correctness
- Consistency loss: KL divergence penalizing distributional disagreement across paraphrases
- Representation loss: Cosine alignment of embeddings for equivalent questions
Target: 50%+ PFR reduction, MEEC → 0, with <1% additional parameters.
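A sketch of this combined objective in PyTorch, assuming logits and pooled embeddings are available for each question and its paraphrase; the loss weights are illustrative, and in practice only LoRA adapter parameters on the causally identified layers (attached via a library such as PEFT) would receive gradients.

```python
import torch
import torch.nn.functional as F

def combined_loss(logits_orig, logits_para, labels, emb_orig, emb_para,
                  w_consist=1.0, w_repr=0.5):
    """Task + consistency + representation objective (weights illustrative)."""
    # Task loss: cross-entropy for answer correctness on the original phrasing
    task = F.cross_entropy(logits_orig, labels)

    # Consistency loss: symmetric KL between the answer distributions
    # produced for a question and its paraphrase
    p = F.log_softmax(logits_orig, dim=-1)
    q = F.log_softmax(logits_para, dim=-1)
    consist = 0.5 * (F.kl_div(q, p, reduction="batchmean", log_target=True)
                     + F.kl_div(p, q, reduction="batchmean", log_target=True))

    # Representation loss: cosine alignment of embeddings
    # for semantically equivalent questions
    repr_align = 1.0 - F.cosine_similarity(emb_orig, emb_para, dim=-1).mean()

    return task + w_consist * consist + w_repr * repr_align
```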
Thrust 4: Clinical Safety Framework
Risk-stratified deployment protocols including:
- Real-time paraphrase probing and consistency scoring
- Abstention triggers for high-PSF cases
- Uncertainty communication with calibrated confidence and PSF risk indicators
- Radiologist validation studies (N=10-15)
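A minimal sketch of how paraphrase probing, consistency scoring, and abstention triggers could compose at inference time; the function names and the 0.8 threshold are placeholders, since calibrating such thresholds per workflow is precisely the goal of this thrust.

```python
from collections import Counter

def consistency_score(answers):
    """Modal answer among paraphrase probes and the share agreeing with it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)

def triage_decision(answers, abstain_below=0.8):
    """Abstain when paraphrase probes disagree too often; the threshold
    is a placeholder to be calibrated per clinical workflow."""
    answer, score = consistency_score(answers)
    if score < abstain_below:
        return {"action": "abstain", "psf_risk": "high", "consistency": score}
    return {"action": "report", "answer": answer, "consistency": score}

# Five paraphrase probes of the same clinical question
print(triage_decision(["yes", "yes", "no", "yes", "yes"]))
# {'action': 'report', 'answer': 'yes', 'consistency': 0.8}
```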
Datasets & Models
Imaging Datasets: MIMIC-CXR (377K images), CheXpert (224K images), VinDr-CXR (18K images)
Target Models: MedGemma-4b-it, LLaVA-Rad
Focus Pathologies: Cardiomegaly, Pleural Effusion, Pneumothorax, Consolidation/Pneumonia, Atelectasis
GitHub Repositories
Dissertation Research
- robust-med-mllm-experiments — Robustness experiments for medical multimodal LLMs
- medical-vlm-intepret — Interpretability tools for medical VLMs
Medical VLM Implementations
- LLaVA-Med — Large Language and Vision Assistant for biomedicine
- CheXagent — Chest X-ray analysis agent
- CARES — Clinical AI reasoning and evaluation
Previous Research Projects
Fault Detection in Medical Devices
Comparative study of generative models (GAN, VAE) versus classical methods (HMM) for early detection of failures in surgical devices, leveraging Data-Driven Digital Twins for predictive maintenance.
- Fault-Detection-on-Surgical-Stapler — SAIL Lab project
Precision Oncology Decision Support
Integration of data-driven Physiology-Based PharmacoKinetic (PBPK) modeling with Reinforcement Learning for dynamic treatment optimization in cancer therapy.
- rl_for_theranostics — SAIL Lab project
Publications
Sadanandan, B., Behzadan, V. (2025). “VSF-Med: A Vulnerability Scoring Framework for Medical Vision-Language Models.” arXiv preprint arXiv:2507.00052
Sadanandan, B., Behzadan, V. (2024). “Promise of Data-Driven Modeling and Decision Support for Precision Oncology and Theranostics.” Under Review
Sadanandan, B., Arghavani Nobar, B., Behzadan, V. (2023). “Analysis of Fault Detection in Medical Devices Leveraging Generative Machine Learning Methods.” Submitted
Collaborators
- Dr. Vahid Behzadan — Dissertation Advisor, SAIL Lab Director
- SAIL Lab — Secure and Assured Intelligent Learning Lab, University of New Haven
Contact
For research inquiries or collaboration opportunities: LinkedIn or contact(at)bineshkumar(dot)me