LLaVA-Rad: Clinically Accessible Radiology Multimodal Model (Summary)

Summary notes based on the paper “Towards a clinically accessible radiology multimodal model: open-access and lightweight, with automatic evaluation”.



Executive Summary

LLaVA-Rad is a small multimodal model (7B) tailored for radiology that pairs state-of-the-art pre-trained vision and text encoders with a lightweight adapter. Trained on ~697k image–text pairs, it targets real clinical usability: fast inference on a single V100 GPU, modest training cost (1 day on 8×A100), and strong performance on radiology tasks. The work also proposes CheXprompt, a GPT-4–based factuality metric that matches expert judgments.

Role in Robustness Gauntlet: LLaVA-Rad serves as a primary baseline model for evaluating robustness in the Robustness Gauntlet Framework, representing lightweight, clinically deployable architectures.

Key Contributions

  • Lightweight, open-access SMM for radiology (7B) focused on practical deployment.
  • Modular design: reuse powerful pre-trained vision/text encoders; train a small adapter to align modalities.
  • Curated large-scale radiology dataset (~697k image–text pairs) for efficient training.
  • CheXprompt metric for automatic factual accuracy evaluation with expert-level parity.
  • State-of-the-art on report generation and retrieval, outperforming larger models (e.g., GPT-4V, Med-PaLM M 84B) on radiology benchmarks.

Data & Training

  • Data: ~697,000 radiology image–text pairs.
  • Approach: freeze and reuse SOTA pre-trained components for image and text; train only a lightweight adapter that grounds image embeddings in the shared text-embedding space.
  • Efficiency: 1 day training on 8×A100; inference runs on a single V100 GPU for private settings.
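The modular recipe above can be sketched in a few lines: frozen encoders produce embeddings, and only a small linear adapter is trained to map image embeddings into the text-embedding space. Everything below (dimensions, the encoder stubs, the cosine alignment check) is illustrative, not the paper's actual architecture.

```python
import math
import random

random.seed(0)

# Illustrative dimensions; the real model uses far larger embeddings.
IMG_DIM, TXT_DIM = 8, 6

def frozen_image_encoder(image_id: int) -> list[float]:
    # Stand-in for a frozen pre-trained vision encoder.
    rng = random.Random(image_id)
    return [rng.uniform(-1, 1) for _ in range(IMG_DIM)]

def frozen_text_encoder(text: str) -> list[float]:
    # Stand-in for a frozen pre-trained text encoder.
    rng = random.Random(sum(map(ord, text)))
    return [rng.uniform(-1, 1) for _ in range(TXT_DIM)]

# The only trainable part: a small linear adapter mapping image
# embeddings into the text-embedding space.
adapter = [[random.uniform(-0.1, 0.1) for _ in range(IMG_DIM)]
           for _ in range(TXT_DIM)]

def project(img_emb: list[float]) -> list[float]:
    return [sum(w * x for w, x in zip(row, img_emb)) for row in adapter]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

img_emb = frozen_image_encoder(42)
txt_emb = frozen_text_encoder("No acute cardiopulmonary abnormality.")
# Alignment score the adapter would be trained to maximize for
# matching image-report pairs.
score = cosine(project(img_emb), txt_emb)
```

Training would update only `adapter`, which is why the full run fits in a day on 8×A100 while the large encoders stay frozen.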

Evaluation & Results

  • Tasks: radiology report generation, cross-modal retrieval, and related radiology benchmarks.
  • Metric: CheXprompt (GPT-4–based) for factual accuracy; shown to align with expert evaluation.
  • Outcome: SOTA performance for a 7B model, surpassing much larger general models on targeted tasks.
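As a rough sketch of how a GPT-4-based factuality metric like CheXprompt can be wired up: build a grading prompt, send it to a judge model, and parse per-category error counts from the reply. The categories, prompt wording, and `judge` callable below are assumptions for illustration, not the paper's actual instrument.

```python
import re
from typing import Callable

# Illustrative error categories; CheXprompt defines its own taxonomy
# and prompt wording.
ERROR_CATEGORIES = [
    "false prediction of finding",
    "omission of finding",
    "incorrect location of finding",
    "incorrect severity of finding",
]

def build_grading_prompt(reference: str, candidate: str) -> str:
    cats = "\n".join(f"- {c}" for c in ERROR_CATEGORIES)
    return (
        "Compare the candidate radiology report against the reference.\n"
        f"Count the errors in each category:\n{cats}\n\n"
        f"Reference:\n{reference}\n\nCandidate:\n{candidate}\n\n"
        "Answer with one integer per category, separated by spaces."
    )

def count_errors(reference: str, candidate: str,
                 judge: Callable[[str], str]) -> int:
    """Total error count parsed from the judge model's reply."""
    reply = judge(build_grading_prompt(reference, candidate))
    counts = [int(n) for n in re.findall(r"\d+", reply)]
    return sum(counts[:len(ERROR_CATEGORIES)])

# Usage with a stub judge; a real setup would call GPT-4 here.
def stub_judge(prompt: str) -> str:
    return "1 0 0 0"

total = count_errors("No effusion.", "Small left effusion.", stub_judge)
```

Keeping the judge behind a plain callable makes the metric easy to validate against expert labels, which is how the paper establishes expert-level parity.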

Clinical Relevance

  • Private, on-prem deployment feasible (V100-class GPU).
  • Lower cost and latency; better fit for clinical workflows than frontier closed models.
  • Open-source, enabling local fine-tuning on institution data.

Robustness Evaluation Insights

Known Vulnerabilities

  • Paraphrase Sensitivity: initial testing shows a >30% answer flip rate on semantically equivalent questions.
  • Visual Perturbations: performance degrades under image noise and distribution shift.
  • Attention Grounding: attention maps are sometimes diffuse and not consistently focused on the relevant pathology.
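A minimal sketch of how the paraphrase flip rate above could be measured: group paraphrases of the same question (over the same image) and count the groups on which the model's answer changes. The stub model and questions are purely illustrative.

```python
def paraphrase_flip_rate(model, question_groups):
    """question_groups: list of lists, each holding paraphrases of one
    question. A 'flip' is any group whose answers disagree."""
    flips = 0
    for group in question_groups:
        answers = {model(q) for q in group}
        if len(answers) > 1:
            flips += 1
    return flips / len(question_groups)

# Stub model that keys on a single word, so one paraphrase flips it.
def stub_model(question: str) -> str:
    return "yes" if "opacity" in question else "no"

groups = [
    ["Is there an opacity?", "Do you see an opacity?"],             # stable
    ["Is there an opacity?", "Is a focal consolidation present?"],  # flips
]
rate = paraphrase_flip_rate(stub_model, groups)  # 0.5
```

Measured over a paraphrase-augmented benchmark, a rate above 0.3 would correspond to the >30% sensitivity reported above.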

Enhancement Opportunities

  • Fine-tuning with paraphrase-augmented data.
  • Attention supervision for better grounding.
  • Integration with triage mechanisms for safe deployment.
