If you’ve tried asking ChatGPT to interpret a chest X-ray, you know the answer: it can’t. Not because the technology doesn’t exist, but because most general-purpose models weren’t built for medical imaging.
That’s changing fast. A new generation of vision-language models can now look at scans, read clinical notes, and answer questions about both. Some of these models are already matching specialist performance on diagnostic benchmarks.
Here’s what’s actually working, what’s still experimental, and what it takes to deploy these systems safely.
Why Healthcare Needs Multimodal AI
Healthcare data doesn’t fit neatly into text or images alone. A single patient encounter might include X-rays, MRI scans, lab results, vital signs over time, and pages of clinical notes. Traditionally, each data type required its own specialized model.
Multimodal models change this. A single architecture can detect subtle abnormalities in a scan, summarize a 20-page discharge summary, spot concerning trends in vital signs, and explain its reasoning in plain language. The potential is obvious: faster diagnoses, fewer missed findings, less cognitive load on clinicians.
But potential and reality are different things. Let’s look at the models that are actually delivering results.
Vision-Language Models That Work on Medical Images
These models take an image and a question, then return a text answer. The architecture typically combines a vision encoder (to “see” the image) with a language model (to understand questions and generate responses).
LLaVA-Med 1.5
LLaVA-Med pairs a CLIP vision encoder with Vicuna, a 13B parameter language model. The team trained it on 200,000 image-text pairs from PubMed Central, supplemented with synthetic instructions generated by GPT-4.
The results are solid. On radiology and pathology question-answering benchmarks, it matches or beats prior approaches. The architecture is straightforward: the vision encoder extracts image features, an MLP projects them into the language model’s embedding space, and the language model handles the rest.
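A minimal PyTorch sketch of that pattern, with illustrative dimensions rather than LLaVA-Med's actual configuration:

```python
import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Sketch of the LLaVA-style pattern: vision encoder features -> MLP projector
    -> language model. Dimensions are illustrative, not LLaVA-Med's real values."""

    def __init__(self, vision_dim=1024, lm_dim=5120):
        super().__init__()
        # Two-layer MLP that maps image patch features into the LM's embedding space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_features, text_embeddings):
        # patch_features: (batch, num_patches, vision_dim) from a CLIP-style encoder
        # text_embeddings: (batch, seq_len, lm_dim) from the LM's token embedding layer
        image_tokens = self.projector(patch_features)
        # Prepend projected image tokens to the text sequence; the LM attends over both
        return torch.cat([image_tokens, text_embeddings], dim=1)

bridge = VisionLanguageBridge()
patches = torch.randn(1, 576, 1024)   # e.g., 24x24 patches from a ViT
text = torch.randn(1, 32, 5120)
fused = bridge(patches, text)          # (1, 608, 5120), fed to the decoder
```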
Visual Med-Alpaca
This one takes a different approach. Instead of a single end-to-end model, Visual Med-Alpaca uses a routing system. A classifier first determines what type of input it’s dealing with, then dispatches to specialized experts (Med-GIT for general medical images, DePlot for charts and graphs). The outputs feed into a LLaMA-7B core with LoRA adapters.
Training data came from 54,000 Q&A pairs drawn from BigBIO and ROCO radiology datasets. The team used GPT-3.5 to generate additional prompts, then filtered them with human review.
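A toy version of that routing logic might look like the following; the classifier, expert stubs, and prompt format are placeholders, not the project's actual code:

```python
from typing import Callable, Dict

# Placeholder expert modules; in Visual Med-Alpaca these roles are played by
# Med-GIT (general medical images) and DePlot (charts and graphs).
def med_image_expert(image_path: str, question: str) -> str:
    return "caption: frontal chest radiograph, no acute findings"   # stub output

def plot_expert(image_path: str, question: str) -> str:
    return "table: WBC 11.2, CRP 48, trending down"                 # stub output

EXPERTS: Dict[str, Callable[[str, str], str]] = {
    "medical_image": med_image_expert,
    "plot": plot_expert,
}

def classify_input(image_path: str) -> str:
    """Stub for the input-type classifier; the real system learns this decision."""
    return "plot" if image_path.endswith(".chart.png") else "medical_image"

def build_prompt(image_path: str, question: str) -> str:
    """Route to an expert, then fold its output into the prompt for the LLM core."""
    kind = classify_input(image_path)
    context = EXPERTS[kind](image_path, question)
    return f"Visual context ({kind}): {context}\nQuestion: {question}\nAnswer:"

print(build_prompt("cxr_0001.png", "Is there evidence of pneumonia?"))
```

In the real system, the resulting prompt goes to the LoRA-adapted LLaMA-7B core.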
One caveat: this is strictly research-use only, with no FDA approval.
CheXagent
CheXagent focuses specifically on chest X-rays. The image encoder (SigLIP-Large) processes 512×512 pixel images through 24 transformer layers. A projection MLP maps those features into a Phi-2.7B decoder trained on medical and scientific text.
The training corpus is impressive: over one million chest X-ray and report pairs, plus 2.7 billion tokens from clinical notes and research articles. The intended use cases include drafting radiology reports, flagging abnormalities, and explaining findings to patients.
MedGemma-4B-IT
Google’s entry into this space launched in July 2025. It’s a decoder-only transformer with 4B parameters, built on the Gemma 3 base. The SigLIP image encoder was pre-trained on de-identified data spanning chest X-rays, dermatology, ophthalmology, and histopathology.
The context window is generous: 128K tokens of text plus images (each 896×896 image converts to 256 tokens). Here’s how it compares to the base Gemma model:
| Task | Base Gemma 3 4B | MedGemma 4B-IT |
|---|---|---|
| MIMIC-CXR macro F1 (top 5) | 81.2 | 88.9 |
| CheXpert macro F1 (top 5) | 32.6 | 48.1 |
| CXR14 macro F1 (3 conditions) | 32.0 | 50.1 |
| SLAKE VQA token F1 | 40.2 | 72.3 |
| PathMCQA histopathology accuracy | 37.1 | 69.8 |
| EyePACS fundus accuracy | 14.4 | 64.9 |
The improvements are substantial across the board. MedGemma is available on Hugging Face under the Health AI Developer Foundations license, with fine-tuning notebooks on GitHub.
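For a quick sense of what inference looks like, here is a minimal sketch using the Hugging Face transformers pipeline. It assumes a recent transformers release that ships the image-text-to-text pipeline, that the checkpoint id is google/medgemma-4b-it, and that you have accepted the license on Hugging Face; treat the model card as the authoritative usage example.

```python
# Sketch only: loading MedGemma via the transformers image-text-to-text pipeline.
import torch
from PIL import Image
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="google/medgemma-4b-it",   # assumed checkpoint id; see the model card
    torch_dtype=torch.bfloat16,
    device="cuda",
)

xray = Image.open("chest_xray.png")  # path to your own image
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": xray},
        {"type": "text", "text": "Describe any abnormal findings on this chest X-ray."},
    ]},
]

out = pipe(text=messages, max_new_tokens=200)
print(out[0]["generated_text"][-1]["text"])
```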
Language Models for Clinical Text
Not everything in healthcare is an image. Electronic health records contain millions of words: admission notes, progress updates, discharge summaries, lab interpretations. Models trained specifically on this text outperform general-purpose LLMs.
GatorTron
GatorTron comes in sizes from 110M to 8.9B parameters, all trained on 82 billion words of de-identified clinical text. The researchers tested it on concept extraction, relation extraction, clinical inference, and question answering. The finding won't surprise anyone who's followed scaling laws: bigger models and more data improved performance across all four tasks.
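Because the GatorTron checkpoints are BERT-style encoders rather than chat models, a typical workflow is to extract contextual embeddings and train a lightweight task head on top. A minimal sketch, assuming the UFNLP/gatortron-base checkpoint id on Hugging Face:

```python
# Minimal sketch: encode a clinical note with a GatorTron checkpoint and pool a
# note-level embedding. "UFNLP/gatortron-base" is an assumed checkpoint id.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("UFNLP/gatortron-base")
model = AutoModel.from_pretrained("UFNLP/gatortron-base")

note = "Pt admitted with CAP, started on ceftriaxone and azithromycin. Afebrile x48h."
inputs = tokenizer(note, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state      # (1, seq_len, hidden_dim)

# Mean-pool token embeddings as a simple note-level representation; a real
# concept-extraction system would instead add a token-classification head.
note_embedding = hidden.mean(dim=1)
print(note_embedding.shape)
```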
Few-Shot Health Learners
This work from Google explores whether large language models can handle time-series health data with minimal examples. Starting from PaLM-24B (pre-trained on 780B tokens), the team fine-tuned on ECG waveforms and vital signs using just a handful of examples per task.
The results suggest that LLMs can ground numeric health data surprisingly well. Applications include arrhythmia detection, activity recognition, and estimating calorie expenditure or stress levels from sensor data.
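The core trick is serializing numeric sensor readings as text and packing a few labeled examples into the prompt. A rough illustration, not the paper's exact prompt format:

```python
# Rough illustration of few-shot prompting over sensor data: serialize readings as
# text and prepend a handful of labeled examples before the query.
def format_example(heart_rates, label=None):
    series = ", ".join(str(int(bpm)) for bpm in heart_rates)
    suffix = f" -> activity: {label}" if label else " -> activity:"
    return f"Heart rate (bpm) sampled every 10s: {series}{suffix}"

few_shot = [
    ([62, 63, 61, 64, 62, 63], "resting"),
    ([118, 124, 131, 129, 135, 138], "running"),
    ([88, 91, 95, 93, 90, 92], "walking"),
]

query = [64, 66, 70, 101, 117, 126]

prompt = "\n".join(format_example(hr, lab) for hr, lab in few_shot)
prompt += "\n" + format_example(query)
print(prompt)   # send this to the LLM and read the completion as the predicted label
```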
How Do You Test These Models?
Benchmarks matter. A model that aces one dataset might fail completely in a real clinical setting. Here are the validation sets researchers are using:
NEJM Clinicopathologic Cases contains 143 diagnostic puzzles from 2021 to 2024, scored on a Bond Scale (0-5) and Likert scale (0-2). These are the kinds of cases that stump experienced clinicians.
NEJM Healer Series walks models through 20 complete patient encounters across four stages: triage, examination, testing, and management. Scoring uses the R-IDEA rubric (0-10).
Grey Matters Management presents 5 complex scenarios scored on a 100-point rubric. Notably, this benchmark compares GPT-4 against physicians working with and without AI assistance.
MIMIC-IV-Ext Clinical Decision Making draws from 2,400 emergency department visits for abdominal pain, testing whether models can distinguish appendicitis, cholecystitis, diverticulitis, and pancreatitis.
Probabilistic Reasoning Challenges test whether models can perform Bayesian inference with lab results. This matters because clinical decision-making is fundamentally probabilistic, and models that give false confidence are dangerous.
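As a concrete example of the reasoning being tested, here is the standard pre-test-to-post-test update using a likelihood ratio, with made-up numbers:

```python
# Worked example of the Bayesian updating these benchmarks probe: convert a
# pre-test probability plus test sensitivity/specificity into a post-test probability.
def post_test_probability(pre_test: float, sensitivity: float, specificity: float,
                          positive_result: bool = True) -> float:
    if positive_result:
        likelihood_ratio = sensitivity / (1 - specificity)
    else:
        likelihood_ratio = (1 - sensitivity) / specificity
    pre_odds = pre_test / (1 - pre_test)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Illustrative numbers only: 20% pre-test probability, a test with 90% sensitivity
# and 80% specificity, and a positive result.
print(round(post_test_probability(0.20, 0.90, 0.80, positive_result=True), 3))  # ~0.529
```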
What It Takes to Deploy Safely
Research performance doesn’t guarantee safe clinical use. Several factors separate a promising paper from a deployable system.
Privacy is non-negotiable. Patient data must be de-identified and encrypted. Models trained on identifiable data face both legal liability and the risk of memorizing sensitive information.
Generalization trips up many models. Performance on one hospital’s data often doesn’t transfer to another institution with different patient populations, imaging equipment, or documentation practices. Diverse testing is essential.
Explainability helps clinicians trust (and appropriately distrust) model outputs. Attention maps, saliency scores, and counterfactual explanations all help, though none fully solve the interpretability problem.
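The simplest of these, a gradient saliency map, asks how strongly each input pixel influences the prediction. A minimal PyTorch sketch, where model and target_class stand in for your own classifier and finding of interest:

```python
# Minimal gradient-saliency sketch: per-pixel sensitivity of a prediction.
# `model` and `target_class` are placeholders for your own image classifier.
import torch

def saliency_map(model, image, target_class):
    model.eval()
    image = image.clone().requires_grad_(True)        # (1, C, H, W)
    logits = model(image)
    logits[0, target_class].backward()
    # Max absolute gradient across channels gives a per-pixel importance map
    return image.grad.abs().max(dim=1).values.squeeze(0)   # (H, W)
```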
Regulation remains unsettled. The FDA and Europe's CE-marking bodies are still working out how to evaluate AI systems that keep learning and updating after deployment. Liability questions are largely unresolved.
Where This Is Heading
The immediate future is clear: these models will get better, handle more modalities, and integrate more tightly into clinical workflows.
Longer term, expect models that incorporate genomic data, wearable sensor streams, and even environmental factors. Real-time decision support integrated directly into EHRs is coming. Personalization based on individual patient histories will follow.
The harder problems are institutional and regulatory. Who’s liable when an AI-assisted diagnosis is wrong? How do you validate a model that keeps learning? What does informed consent look like when AI is involved in care decisions?
Multimodal LLMs will transform healthcare. The technology is nearly ready. The question is whether our institutions can adapt fast enough to deploy it safely.
References
Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., & Gao, J. (2023). LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890.
Han, T., Adams, L. C., Papaioannou, J. M., Grundmann, P., Oberhauser, T., Löser, A., Truhn, D., & Bressem, K. K. (2023). MedAlpaca: An open-source collection of medical conversational AI models and training data. arXiv preprint arXiv:2304.08247.
Chen, Z., Diao, S., Wang, B., Wang, H., Liu, T., Hu, Z., & Jiang, L. (2024). CheXagent: Towards a foundation model for chest X-ray interpretation. arXiv preprint arXiv:2401.12208.
Google (2025). MedGemma: Medical vision-language models. Google Health AI Developer Foundations. Retrieved from https://huggingface.co/google/medgemma
Yang, X., Chen, A., PourNejatian, N., Shin, H. C., Smith, K. E., Parisien, C., Compas, C., Martin, C., Costa, A. B., Flores, M. G., Zhang, Y., Magoc, T., Harle, C. A., Lipori, G., Mitchell, D. A., Hogan, W. R., Shenkman, E. A., Bian, J., & Wu, Y. (2022). GatorTron: A large clinical language model to unlock patient information from unstructured electronic health records. arXiv preprint arXiv:2203.03540.
Liu, X., McDuff, D., Kovacs, G., Galatzer-Levy, I., Sunshine, J., Zhan, J., Poh, M.-Z., Liao, S., Di Achille, P., & Patel, S. (2023). Large language models are few-shot health learners. arXiv preprint arXiv:2305.15525.