1. Auto-Extracted Metadata
- Title: Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models
- Authors: Hashmat Shadab Malik; Fahad Shamshad; Muzammal Naseer; Karthik Nandakumar; Fahad Khan; Salman Khan
- Affiliations:
  • Mohamed bin Zayed University of AI (Malik, Shamshad, Khan)
  • Khalifa University (Naseer)
  • Michigan State University (Nandakumar)
  • Linköping University (Khan)
  • Australian National University (Khan)
- Venue & Year: Under Review (arXiv preprint, Feb 2025)
2. Abstract
For VLM architecture basics, see VLM Basics.
This paper shows that, instead of adversarially fine-tuning CLIP on ImageNet, one can plug in off-the-shelf vision encoders that were adversarially pre-trained at web scale (e.g., ViT-H, ViT-G). By aligning these robust backbones end-to-end in an MLLM (LLaVA), the authors attain 2× the captioning robustness and 1.5× the VQA robustness under ℓ∞ attacks while preserving clean accuracy, and cut jailbreak success by more than 10 points. They evaluate on COCO, Flickr30k, VQAv2, TextVQA, VizWiz, OKVQA, white-box VisualAdv, black-box HADES, common corruptions, and hallucination benchmarks, showing that large-scale adversarial pretraining yields both stronger defenses and better semantic alignment without extra adversarial training in the MLLM.
3. Paper in 10 Bullet Points
- Motivation: MLLMs (e.g., LLaVA) excel at vision-language tasks but are highly vulnerable to visual adversarial perturbations (hallucinations, jailbreaks).
- Limitation of Prior Defenses: Constrained adversarial fine-tuning (FARE, Sim-CLIP) on ImageNet yields only modest robustness gains, at the cost of reduced semantic alignment.
- Key Idea: Leverage large-scale adversarially pre-trained vision encoders (ViT-H/14, ViT-G/14 from Wang et al., 2024b) that are already robust on billion-scale web data.
- Alignment Strategy: Learn a linear projector that aligns robust encoder features to CLIP's embedding space, preserving robustness under PGD attacks (ε = 1/255); see the sketch after this list.
- End-to-End Training: Integrate the aligned encoders into LLaVA and fine-tune the image-to-text MLP plus the LLM on multimodal instruction data.
- Benchmarks & Metrics: Image captioning (COCO, Flickr30k; CIDEr), VQA (VQAv2, TextVQA, VizWiz, OKVQA; accuracy), and jailbreak attacks (VisualAdv toxicity, HADES ASR).
- Untargeted Attack Results: At ε = 4/255, Robust-LLaVA^G achieves CIDEr 76.6 on COCO (vs. 35.1 for FARE^4) and 49.3% VQAv2 accuracy (vs. 31.9% for Sim-CLIP^4).
- Targeted Attack Results: Robust-LLaVA^G maintains 0% attack success and high CIDEr even at ε = 8/255, while baselines break down (100% success, 0 CIDEr).
- Jailbreak Defenses: On white-box VisualAdv, Robust-LLaVA^G reduces toxic outputs from 503 to 137; on black-box HADES, it lowers average ASR from 62% to 48%.
- Additional Robustness: Outperforms baselines on natural corruptions, object hallucination (POPE F1), and prompt-formatting robustness; ensembling is shown to be limited by the weakest encoder.
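A minimal PyTorch sketch of how I read the alignment stage (my own illustration under stated assumptions, not the authors' implementation; `robust_encoder`, `clip_encoder`, and the MSE objective are placeholders, and the released repo has the real training code). A linear projector maps frozen robust-encoder features into CLIP's embedding space, and a small PGD check at ε = 1/255 probes whether the alignment survives ℓ∞ perturbations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProjector(nn.Module):
    """Maps robust-encoder features into CLIP's embedding space."""
    def __init__(self, d_robust: int, d_clip: int):
        super().__init__()
        self.proj = nn.Linear(d_robust, d_clip, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

def alignment_step(projector, robust_encoder, clip_encoder, images, optimizer):
    """One training step: match projected robust features to frozen CLIP features."""
    with torch.no_grad():
        target = clip_encoder(images)          # frozen CLIP target features
        robust_feats = robust_encoder(images)  # frozen robust backbone features
    loss = F.mse_loss(projector(robust_feats), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def pgd_alignment_gap(projector, robust_encoder, clip_encoder, images,
                      eps=1/255, alpha=0.25/255, steps=10):
    """Measure how much an l_inf PGD perturbation (eps = 1/255) can break alignment."""
    with torch.no_grad():
        target = clip_encoder(images)
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = F.mse_loss(projector(robust_encoder(images + delta)), target)
        loss.backward()  # note: module grads also accumulate; clear them before training
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((images + delta).clamp(0, 1) - images)  # keep pixels in [0, 1]
        delta.grad.zero_()
    with torch.no_grad():
        return F.mse_loss(projector(robust_encoder(images + delta)), target).item()
```

The point of the sketch is the division of labor: only the projector is trained, while both encoders stay frozen, so any robustness has to come from the pre-trained backbone rather than from adversarial fine-tuning.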
4. Deep-Dive Analysis
4.1 Methods & Assumptions
- Adversarial Pretraining Scale: Assumes that web-scale adversarial pretraining (Wang et al., 2024b) yields richer robust features than ImageNet-only fine-tuning.
- Linear Alignment: Uses a simple linear projector between robust features and CLIP space; assumes this suffices for semantic alignment without non-linear losses.
- Attack Coverage: Evaluates ℓ∞ APGD and PGD attacks, both untargeted and targeted, plus sophisticated jailbreaks (VisualAdv, HADES); a generic attack sketch follows this list. Does not explore ℓ2 or geometric distortions.
- Limitations:
  - Reliance on the availability of billion-scale robust vision models.
  - Potential domain shift if medical or other specialized images differ from web-scale pretraining.
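To make the attack setting concrete, here is a generic ℓ∞ PGD loop against a vision-language loss (my own illustration, not the paper's evaluation code; the paper relies on APGD-style attacks, and `caption_nll`/`model` below are hypothetical placeholders for the MLLM's caption negative log-likelihood).

```python
import torch

def linf_pgd(loss_fn, images, eps=4/255, alpha=1/255, steps=100, targeted=False):
    """Generic l_inf PGD over a scalar MLLM loss.
    Untargeted: ascend the loss on the ground-truth caption/answer.
    Targeted:   descend the loss on an attacker-chosen output string."""
    delta = torch.zeros_like(images, requires_grad=True)
    sign = -1.0 if targeted else 1.0
    for _ in range(steps):
        loss = loss_fn(images + delta)
        loss.backward()
        with torch.no_grad():
            delta += sign * alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((images + delta).clamp(0, 1) - images)  # keep pixels in [0, 1]
        delta.grad.zero_()
    return (images + delta).detach()

# Hypothetical usage:
# x_adv_untargeted = linf_pgd(lambda x: caption_nll(model, x, gt_ids), imgs, eps=4/255)
# x_adv_targeted   = linf_pgd(lambda x: caption_nll(model, x, target_ids),
#                             imgs, eps=8/255, targeted=True)
```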
4.2 Data & Code Availability
- Code & Models: Released on GitHub: https://github.com/HashmatShadab/Robust-LLaVA
- Data: Uses public benchmarks (COCO, Flickr30k, VQAv2, etc.). All evaluation data are standard and open.
4.3 Robustness & Reproducibility
- Statistical Soundness: Reports results on 500 random samples per dataset; includes multiple ε budgets and attack types.
- Ablations: The appendix covers medium- vs. large-scale encoder comparisons, ensembling experiments, prompt ablations, and corruption severities.
- Error Bars: None reported; robustness gains are large but come without confidence intervals (a bootstrap sketch follows this list).
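Since no error bars are reported, here is a quick way I could add them when reproducing the numbers (my sketch, not from the paper; it applies to per-sample metrics such as VQA accuracy, while corpus-level metrics like CIDEr would need the scorer re-run on each resampled corpus).

```python
import numpy as np

def bootstrap_ci(per_sample_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a mean metric over the
    ~500 evaluation samples used per dataset."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_sample_scores, dtype=float)
    boot_means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# e.g. mean_acc, (lo, hi) = bootstrap_ci(vqa_correct_flags)  # per-sample flags in {0, 1}
```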
5. Fit to Dissertation Agenda
- Detector Stability: Highlights that large-scale adversarial pretraining produces self-supervised features that stay robust across unseen perturbations; a valuable insight for SVT-AD design.
- Purifier Guidance: Strong robust features may reduce reliance on a purifier, suggesting that adaptive denoiser strength could be tuned based on encoder confidence (see the sketch after this list).
- Transformer vs. CNN: Large ViT-based encoders outperform ResNet-based ones in robustness preservation post-alignment, which directly informs the choice of transformer detectors.
- Clinical Deployment Pitfalls:
  - Web-scale pretraining may not transfer to medical imaging modalities (CXR, ultrasound).
  - Jailbreak analogues (e.g., malicious radiology overlays) need domain-specific evaluation.
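A toy sketch of the purifier-guidance idea above (everything here is my assumption: `robust_encoder`, the noise-stability confidence proxy, and the `denoiser(image, strength=...)` interface are hypothetical, not anything from the paper).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def encoder_confidence(robust_encoder, image, sigma=2/255, n=4):
    """Confidence proxy: mean cosine similarity of robust features under small
    random input noise. Stable features suggest little purification is needed."""
    f0 = robust_encoder(image)
    sims = []
    for _ in range(n):
        noisy = (image + sigma * torch.randn_like(image)).clamp(0, 1)
        sims.append(F.cosine_similarity(f0, robust_encoder(noisy), dim=-1).mean())
    return torch.stack(sims).mean().item()

@torch.no_grad()
def adaptive_purify(denoiser, robust_encoder, image, max_strength=0.3):
    """Scale the denoiser strength inversely with encoder confidence."""
    conf = encoder_confidence(robust_encoder, image)  # ~1.0 when features are stable
    strength = max_strength * (1.0 - conf)            # confident input -> gentle purifier
    return denoiser(image, strength=strength)         # hypothetical denoiser interface
```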
6. Comparative Context
- Schlarmann et al. (2024), FARE: Unsupervised adversarial fine-tuning of CLIP on ImageNet; yields modest robustness (∼35 CIDEr) but impaired semantic alignment.
- Attack Methods Context: See On Evaluating Adversarial Robustness of Large Vision-Language Models for the attacks this paper defends against.
- Hossain & Imteaj (2024), Sim-CLIP: Siamese adversarial fine-tuning; slightly better, but still limited by ImageNet scale.
- Mao et al. (2022), TeCoA: Supervised adversarial CLIP fine-tuning for vision tasks; not evaluated in MLLMs.
- This work outperforms all of the above by leveraging billion-scale adversarial pretraining, avoiding additional adversarial tuning in the MLLM pipeline.
- CLIP Robustness Evaluation: See Toward a Holistic Evaluation of Robustness in CLIP Models for a comprehensive evaluation framework.
7. Strengths vs. Weaknesses
- Strengths
  - Leverages existing robust backbones, so it is computationally efficient.
  - End-to-end integration yields both high clean and adversarial performance.
  - Comprehensive evaluation across tasks and attack modes.
- Weaknesses
  - Relies on non-medical, web-scale pretraining; transfer to clinical imaging is uncertain.
  - No confidence intervals or statistical significance testing.
  - Linear alignment may fail on modality gaps beyond natural images.
8. Follow-Up Ideas (ranked)
- Medical Pretraining Study: Pretrain a robust ViT on medical images (CXR, fundus), then align it in MedVLM-Shield. See Robustness Notes for medical considerations.
- Nonlinear Alignment: Explore gated or transformer-based adapters instead of linear projection for tighter feature fusion (see the adapter sketch after this list).
- Purifier Synergy: Couple robust encoder confidence with score-based denoiser strength to adaptively filter perturbations.
- Multimodal Detector: Combine visual robust features with ECG/lab embeddings for joint anomaly detection.
- Prompt-Freezer Defense: Integrate XAI-guided region repair on top of Robust-LLaVA to patch identified adversarial regions.
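For the nonlinear-alignment idea, a minimal sketch of a transformer-based adapter that could replace the linear projector (the single-block design and the example dimensions are my assumptions, not anything from the paper).

```python
import torch
import torch.nn as nn

class TransformerAdapter(nn.Module):
    """Nonlinear alternative to the linear projector: one pre-norm self-attention
    block over the robust encoder's patch tokens, then a projection to CLIP width."""
    def __init__(self, d_robust: int, d_clip: int, n_heads: int = 8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(
            d_model=d_robust, nhead=n_heads, dim_feedforward=4 * d_robust,
            batch_first=True, norm_first=True,
        )
        self.proj = nn.Linear(d_robust, d_clip)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, num_patches, d_robust] -> [batch, num_patches, d_clip]
        return self.proj(self.block(tokens))

# Hypothetical drop-in for the LinearProjector in the Section 3 sketch, e.g.:
# adapter = TransformerAdapter(d_robust=1280, d_clip=1024)
```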
9. Quick-Reference Table
| Section | Takeaway | Caveat |
|---|---|---|
| Intro | Large-scale adversarial pretraining beats ImageNet fine-tuning | Needs adaptation for medical domains |
| Methods | Linear alignment + end-to-end MLLM training preserves semantics | Alignment tested only on natural images |
| Results | 2× captioning & 1.5× VQA robustness gains; >10-point jailbreak drop | No error bars; performance on unseen medical shifts untested |
10. Action Items for Me
- Prototype an SVT-AD variant pre-trained on Symile-MIMIC images.
- Add the Robust-LLaVA GitHub repo to Zotero under the “Adversarial Pretraining” tag.
- Run linear vs. transformer adapter alignment comparisons on ChexAgent embeddings.
- Evaluate Robust-LLaVA on CXR + ECG multimodal prompts.
11. Quality Scores (1–5 + one-line reasons)
- Clarity: 4 – Well written, but dense in experimental details.
- Rigor: 4 – Broad evaluation, but lacks error bars.
- Novelty: 5 – First to integrate billion-scale robust vision encoders into MLLMs.
- Impact: 4 – Strong for natural images; clinical transfer needs study.
- Theory: Based on principles in linear-hypothesis-explanation.
- Clinical Reliability: 3 – Promising robustness, but domain shift is unaddressed.