LLaVA-Rad: Clinically Accessible Radiology Multimodal Model (Summary)
Summary notes based on the paper “Towards a clinically accessible radiology multimodal model: open-access and lightweight, with automatic evaluation”.
Executive Summary
LLaVA-Rad is a small multimodal model (7B) tailored for radiology that pairs state-of-the-art pre-trained vision and text encoders with a lightweight adapter. Trained on ~697k image–text pairs, it targets real clinical usability: fast inference on a single V100 GPU, modest training cost (1 day on 8×A100), and strong performance on radiology tasks. The work also proposes CheXprompt, a GPT-4–based factuality metric that matches expert judgments.
Role in Robustness Gauntlet: LLaVA-Rad serves as a primary baseline model for evaluating robustness in the Robustness Gauntlet Framework, representing lightweight, clinically deployable architectures.
Key Contributions
- Lightweight, open-access small multimodal model (SMM, 7B) for radiology, focused on practical deployment.
- Modular design: reuse powerful pre-trained vision/text encoders; train a small adapter to align modalities.
- Curated large-scale radiology dataset (~697k image–text pairs) for efficient training.
- CheXprompt metric for automatic factual accuracy evaluation with expert-level parity.
- State-of-the-art on report generation and retrieval, outperforming larger models (e.g., GPT-4V, Med-PaLM M 84B) on radiology benchmarks.
Data & Training
- Data: ~697,000 radiology image–text pairs.
- Approach: freeze or reuse state-of-the-art components for image and text; train only a lightweight adapter that grounds image features in the shared text-embedding space.
- Efficiency: 1 day training on 8×A100; inference runs on a single V100 GPU for private settings.
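The modular design above can be pictured as a small trainable projector sitting between a frozen vision encoder and the language model. A minimal sketch, assuming a 2-layer MLP adapter; the dimensions and nonlinearity are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

# Minimal sketch of a grounding adapter: a small MLP that projects frozen
# vision-encoder patch features into the LLM's text-embedding space.
# All dimensions here are illustrative assumptions.
class GroundingAdapter:
    def __init__(self, vision_dim=768, text_dim=4096, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((vision_dim, text_dim)) * 0.02
        self.b1 = np.zeros(text_dim)
        self.w2 = rng.standard_normal((text_dim, text_dim)) * 0.02
        self.b2 = np.zeros(text_dim)

    def __call__(self, patch_features):
        # patch_features: (num_patches, vision_dim) from a frozen encoder
        h = np.maximum(patch_features @ self.w1 + self.b1, 0.0)  # ReLU-style nonlinearity
        # Output acts as "visual tokens" in the text-embedding space.
        return h @ self.w2 + self.b2  # (num_patches, text_dim)

adapter = GroundingAdapter()
tokens = adapter(np.zeros((196, 768)))
print(tokens.shape)  # (196, 4096)
```

During training only `w1`, `b1`, `w2`, `b2` would receive gradients, which is what keeps the compute budget small relative to end-to-end training.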
Evaluation & Results
- Tasks: radiology report generation, cross-modal retrieval, and related radiology benchmarks.
- Metric: CheXprompt (GPT-4–based) for factual accuracy; shown to align with expert evaluation.
- Outcome: SOTA performance for a 7B model, surpassing much larger general models on targeted tasks.
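To make the metric concrete, a hypothetical sketch of how a CheXprompt-style factuality check could be wired up: prompt a judge LLM to count clinically significant errors in a candidate report against the reference, then parse the count. The prompt wording, error categories, and response format below are illustrative assumptions, not the paper's actual prompt:

```python
import re

# Hypothetical error categories for the judge prompt (assumed, not the
# paper's actual taxonomy).
ERROR_CATEGORIES = ["false finding", "omitted finding", "wrong location",
                    "wrong severity", "wrong comparison"]

def build_prompt(reference: str, candidate: str) -> str:
    """Build a judge prompt asking for a single integer error count."""
    cats = "; ".join(ERROR_CATEGORIES)
    return (
        "Compare the candidate radiology report to the reference.\n"
        f"Count errors in these categories: {cats}.\n"
        f"Reference: {reference}\n"
        f"Candidate: {candidate}\n"
        "Answer with a single line: errors=<integer>"
    )

def parse_error_count(judge_response: str) -> int:
    """Extract the integer error count from the judge's response."""
    m = re.search(r"errors\s*=\s*(\d+)", judge_response)
    if m is None:
        raise ValueError("judge response did not contain an error count")
    return int(m.group(1))

# In practice build_prompt() output would be sent to the judge model
# (e.g. GPT-4); here we just parse a canned response.
print(parse_error_count("errors=2"))  # 2
```

The value of this design is that the score is a count of concrete factual errors rather than an n-gram overlap, which is why it can be validated against expert judgments.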
Clinical Relevance
- Private, on-prem deployment feasible (V100-class GPU).
- Lower cost and latency; better fit for clinical workflows than frontier closed models.
- Open-source, enabling local fine-tuning on institutional data.
Robustness Evaluation Insights
Known Vulnerabilities
- Paraphrase Sensitivity: Initial testing shows >30% flip rate on semantically equivalent questions
- Visual Perturbations: Performance degrades under noise and distribution shifts
- Attention Grounding: Attention maps are sometimes diffuse and not consistently focused on the relevant pathology
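The flip-rate figure above can be computed with a simple consistency check: a question "flips" if the model's answers disagree across semantically equivalent paraphrases. A minimal sketch with toy data; in practice the answers would come from running the model on each paraphrase:

```python
def flip_rate(answer_sets):
    """Fraction of questions whose answers disagree across paraphrases.

    answer_sets: list of lists, one inner list of model answers per question
    (one answer per paraphrase of that question).
    """
    flipped = sum(1 for answers in answer_sets if len(set(answers)) > 1)
    return flipped / len(answer_sets)

# Toy data: three questions, three paraphrases each.
answers = [
    ["yes", "yes", "yes"],  # consistent
    ["yes", "no", "yes"],   # flips under paraphrase
    ["no", "no", "no"],     # consistent
]
print(flip_rate(answers))  # 1/3 of questions flip
```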
Enhancement Opportunities
- Fine-tuning with paraphrase-augmented data
- Attention supervision for better grounding
- Integration with triage mechanisms for safe deployment
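One way the triage idea could work is to answer by majority vote across paraphrases but abstain and defer to a radiologist when agreement is low. A minimal sketch; the agreement threshold is an illustrative assumption:

```python
from collections import Counter

def triage(answers, min_agreement=0.8):
    """Majority-vote answer across paraphrases; abstain if agreement is low."""
    top, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    return top if agreement >= min_agreement else "ABSTAIN"

print(triage(["yes"] * 5))                        # "yes" (agreement 1.0)
print(triage(["yes", "no", "yes", "no", "yes"]))  # "ABSTAIN" (agreement 0.6)
```

Abstention converts paraphrase sensitivity from a silent failure mode into an explicit deferral signal, which is the safer behavior in a clinical workflow.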
References
- PDF: ../refererence_docs/2403.08002v5.pdf
- Related: Medical Vision-Language Models · Validation and Datasets
- Evaluation: Robustness Gauntlet Framework