On Evaluating Adversarial Robustness of Large Vision-Language Models
Metadata
- Title: On Evaluating Adversarial Robustness of Large Vision-Language Models
- Authors: Yunqing Zhao¹, Tianyu Pang², Chao Du², Xiao Yang³, Chongxuan Li⁴, Ngai-Man Cheung¹, Min Lin²
  - ¹Singapore University of Technology and Design
  - ²Sea AI Lab, Singapore
  - ³Tsinghua University
  - ⁴Renmin University of China
- Venue: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
- Year: 2023
Abstract
This paper investigates how susceptible state-of-the-art open-source vision–language models (e.g., BLIP-2, MiniGPT-4, LLaVA) are to small, targeted image perturbations when attackers have only black-box access. For background on VLM architectures, see VLM Basics. The authors craft adversarial images using transfer-based (via surrogate CLIP/BLIP encoders or text-to-image generation) and query-based (random gradient-free) methods, then evaluate transferability and success rates on image-grounded text generation tasks. They report high fooling rates—measured by similarity between generated captions and attacker-specified targets—and analyze trade-offs between perceptual imperceptibility and attack efficacy. The work quantifies vulnerabilities in multimodal systems and underscores the need for robust detection and defense before deployment.
Paper in 10
- Motivation: Multimodal VLMs (e.g., GPT-4 with vision inputs) may be stealthily evaded by perturbing only the vision input under black-box conditions.
- Threat Model: Attackers have only API-level access (input image → output text), aim for targeted captions/answers, and are restricted by a small ℓ∞ budget (ϵ = 8/255).
- Transfer-Based Attacks (a minimal sketch follows this list):
  - MF-it: Align the adversarial image with the target-text embedding via surrogate CLIP/BLIP encoders.
  - MF-ii: Align image–image features by generating a “target image” (e.g., via Stable Diffusion) and matching its CLIP-ViT encoding.
- Query-Based Attacks (MF-tt): Use random gradient-free (RGF) estimation to iteratively refine the adversarial image, maximizing text–text similarity between the model's output and the target text.
- Combined Strategy: Initialize with transfer-based (MF-ii) perturbations, then apply query-based tuning (MF-tt) for the highest success under a fixed budget.
- Evaluation Suite: Six open-source VLMs tested (BLIP, UniDiffuser, Img2Prompt, BLIP-2, LLaVA, MiniGPT-4) on ImageNet-1K images with random MS-COCO captions as targets. Success is measured by CLIP similarity between generated and target text.
- Results: Transfer-based MF-ii alone yields high targeted-caption success (CLIP-ensemble score ≈ 0.75), and combined MF-ii + MF-tt boosts this to ≈ 0.85 on BLIP.
- Perceptual Trade-off: Larger ϵ → stronger attacks but more visible noise; ϵ = 8/255 (LPIPS ≈ 0.05) balances stealth and efficacy.
- Interpretability: GradCAM shows that adversarial images shift model attention from the original regions to attacker-desired ones.
- Implication: Highlights a critical safety gap in multimodal APIs: perturbing the visual input alone is enough to steer downstream text generation toward attacker-chosen outputs.
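To make the transfer stage concrete, the sketch below runs PGD on the image so that its embedding under a surrogate CLIP encoder moves toward the embedding of a generated target image (the MF-ii idea). It is a minimal sketch, assuming the openai `clip` package and `clean_img`/`target_img` as (1, 3, 224, 224) tensors in [0, 1] on the same device as the model; the step size `alpha` and the exact loss are my simplifications, not the authors' released implementation.

```python
import torch
import clip                                   # openai/CLIP surrogate encoder (assumed installed)
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()                  # force fp32 so gradients behave on GPU too
for p in model.parameters():                  # surrogate is frozen; only the image is optimized
    p.requires_grad_(False)

# CLIP's standard normalization, applied inside the loop so the perturbation
# itself lives in plain [0, 1] pixel space.
normalize = transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                                 (0.26862754, 0.26130258, 0.27577711))

def mf_ii_attack(clean_img, target_img, eps=8 / 255, alpha=1 / 255, steps=100):
    """PGD that pulls the adversarial image's CLIP embedding toward the
    embedding of a generated target image (MF-ii-style objective)."""
    with torch.no_grad():
        tgt = model.encode_image(normalize(target_img))
        tgt = tgt / tgt.norm(dim=-1, keepdim=True)

    delta = torch.zeros_like(clean_img, requires_grad=True)
    for _ in range(steps):
        adv = model.encode_image(normalize(clean_img + delta))
        adv = adv / adv.norm(dim=-1, keepdim=True)
        sim = (adv * tgt).sum()                      # cosine similarity to the target embedding
        sim.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()       # ascend on similarity
            delta.clamp_(-eps, eps)                  # stay inside the l_inf budget
            delta.copy_((clean_img + delta).clamp(0, 1) - clean_img)  # keep valid pixels
            delta.grad.zero_()
    return (clean_img + delta).detach()
```

MF-it would differ only in the target: replace `tgt` with a normalized `model.encode_text(...)` embedding of the target caption.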
4. Deep-Dive Analysis
4.1 Methods & Assumptions
- Black-Box Setting: No access to model internals; only image → text queries are allowed.
- Attack Rationale: Consistent with the linearity hypothesis, many small per-pixel perturbations accumulate across high-dimensional inputs into large shifts in feature space.
- Perturbation Constraint: ℓ∞ ≤ ϵ keeps the noise near-imperceptible to humans; default ϵ = 8/255.
- Surrogates: CLIP and BLIP image/text encoders for transfer-based attacks; Stable Diffusion, Midjourney, and DALL-E for generating proxy target images.
- Optimization: 100-step PGD for the transfer stage; 8-step PGD with 100 RGF queries for the query-based stage (see the sketch after this list).
- Limitations:
  - Evaluation uses a single fixed prompt (“what is the content of this image?”), which may not reflect conversational use.
  - No physical-world or end-to-end robotic/clinical deployment scenarios.
  - Reliance on CLIP-based similarity may not capture nuanced semantic failures.
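The query stage replaces true gradients with a random gradient-free (RGF) estimate: sample Gaussian directions around the current image, query the victim's caption for each probe, score every caption against the target text, and step along the score-weighted average direction. Below is a minimal sketch under stated assumptions: `query_vlm(img)` and `text_score(caption)` are hypothetical stand-ins for the black-box image→text API and the CLIP text–text similarity to the target, the 100 queries are read as per step, and the update rule is a simplification rather than the released code.

```python
import torch

def rgf_refine(clean_img, adv_img, query_vlm, text_score,
               eps=8 / 255, steps=8, n_queries=100, sigma=8 / 255, alpha=1 / 255):
    """Query-based (MF-tt-style) refinement via random gradient-free (RGF)
    estimation; only black-box image -> caption queries are used.

    query_vlm(img)      -> str caption from the victim VLM     (hypothetical helper)
    text_score(caption) -> float CLIP similarity to the target (hypothetical helper)
    """
    adv = adv_img.clone()                                  # start from the transfer-based init
    for _ in range(steps):
        base = text_score(query_vlm(adv))
        grad_est = torch.zeros_like(adv)
        for _ in range(n_queries):
            u = torch.randn_like(adv)                      # random probing direction
            probe = (adv + sigma * u).clamp(0, 1)
            gain = text_score(query_vlm(probe)) - base
            grad_est += (gain / sigma) * u                 # finite-difference estimate
        grad_est /= n_queries
        adv = adv + alpha * grad_est.sign()                # ascend the estimated gradient
        adv = clean_img + (adv - clean_img).clamp(-eps, eps)  # project back into the l_inf ball
        adv = adv.clamp(0, 1)
    return adv
```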
4.2 Data & Code Availability
- Data:
  - Clean images: ImageNet-1K validation set.
  - Target texts: Random MS-COCO captions.
- Code/Project Page: Available at https://yunqing-me.github.io/AttackVLM/.
- Environments: Experiments run on a single NVIDIA A100 GPU with public checkpoints for all tested VLMs.
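For reference, querying one of the public checkpoints reduces to a plain image→text call. Below is a hedged sketch using the Hugging Face BLIP captioning checkpoint (`Salesforce/blip-image-captioning-base`) with the paper's fixed prompt; the authors' pipeline loads each of the six VLMs from its official repository, so the model ID, generation settings, and file path here are assumptions for illustration only.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Assumed checkpoint; the paper's evaluation uses the official release of each VLM.
MODEL_ID = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

def query_vlm(image: Image.Image, prompt: str = "what is the content of this image?") -> str:
    """Black-box view of the victim: image (plus the fixed prompt) in, caption out."""
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Hypothetical adversarial image saved by the attack scripts above.
print(query_vlm(Image.open("adv_example.png").convert("RGB")))
```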
4.3 Robustness & Reproducibility
- Reported Metrics: Mean CLIP scores across five CLIP backbones; no standard deviations or confidence intervals are provided (a metric sketch follows this list).
- Ablations:
  - Varying ϵ to trade off image quality against attack success.
  - Interpolating between pure transfer and pure query budgets.
- Reproducibility: Full attack pipelines and model weights are public; hyperparameters (PGD steps, σ, N) are specified.
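Both reported quantities are easy to recompute: the attack-success proxy is the mean CLIP text–text similarity between the generated caption and the target caption over several CLIP backbones, and the stealth proxy is LPIPS between clean and adversarial images. A minimal sketch assuming the openai `clip` and `lpips` packages; the backbone list is illustrative, not the paper's exact five-model ensemble.

```python
import clip     # openai/CLIP (assumed installed)
import lpips    # perceptual-similarity package (assumed installed)
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
backbones = ["RN50", "ViT-B/32", "ViT-L/14"]           # illustrative subset
encoders = {name: clip.load(name, device=device)[0].eval() for name in backbones}

def clip_text_score(generated: str, target: str) -> float:
    """Mean cosine similarity between generated and target captions
    across several CLIP text encoders (attack-success proxy)."""
    tokens = clip.tokenize([generated, target], truncate=True).to(device)
    scores = []
    for model in encoders.values():
        with torch.no_grad():
            feats = model.encode_text(tokens)
            feats = feats / feats.norm(dim=-1, keepdim=True)
        scores.append((feats[0] @ feats[1]).item())
    return sum(scores) / len(scores)

lpips_fn = lpips.LPIPS(net="alex")                     # perceptual distance; lower = stealthier

def stealth(clean_img: torch.Tensor, adv_img: torch.Tensor) -> float:
    """LPIPS between clean and adversarial images; [0, 1] inputs are rescaled to [-1, 1]."""
    with torch.no_grad():
        return lpips_fn(clean_img * 2 - 1, adv_img * 2 - 1).item()
```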
5. Fit to Dissertation Agenda
- Anomaly Detector (SVT-AD):
  - Insight: Attacks exploit subtle, distributed perturbations that are invisible to humans but potent against VLM encoders; this motivates self-supervised anomaly scores at multiple transformer layers (a toy sketch follows this section).
  - Technique: Contrast adversarial vs. clean features in ViT attention maps for detector training.
- Score-Based Purifier:
  - Lesson: Combining cross-modality priors with query feedback could inform adaptive denoising steps.
- Transformer vs. CNN Stability:
  - Attacks transfer across CLIP-RN50 (CNN) and CLIP-ViT (transformer) surrogates with similar success, indicating that the brittleness is architecture-agnostic.
- Clinical Deployment Pitfalls:
  - In CXR-LLM systems (e.g., LLaVA-Rad), adversarial noise could trigger dangerous misdiagnoses without clinician oversight.
  - See Robustness Notes for medical-domain robustness considerations.
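As a starting point for the SVT-AD idea, the simplest detector scores an input by how far its surrogate-encoder embedding falls from clean-data statistics. The toy sketch below fits a Gaussian to clean CLIP image embeddings and flags large Mahalanobis distances; it is my own minimal instantiation of the "contrast adversarial vs. clean features" idea, not anything proposed in the paper. A real SVT-AD would pool features from multiple transformer layers and calibrate thresholds on held-out clean data.

```python
import torch

class FeatureAnomalyScore:
    """Toy anomaly score: Mahalanobis distance of an image embedding to
    statistics fitted on clean embeddings (a stand-in for the SVT-AD idea)."""

    def fit(self, clean_feats: torch.Tensor):
        """clean_feats: (N, d) embeddings of clean images from a frozen encoder."""
        clean_feats = clean_feats.float()
        self.mu = clean_feats.mean(dim=0)
        cov = torch.cov(clean_feats.T)
        cov = cov + 1e-4 * torch.eye(cov.shape[0], device=cov.device)  # regularize
        self.precision = torch.linalg.inv(cov)
        return self

    def score(self, feat: torch.Tensor) -> float:
        """feat: (d,) embedding of a query image; larger score = more anomalous."""
        diff = feat.float() - self.mu
        return float(diff @ self.precision @ diff)
```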
6. Comparative Context
- Chen et al. (2017), “Attacking Visual Language Grounding”: Early white-box adversarial attacks on CNN-RNN captioners; this work extends the idea to large, black-box VLMs with targeted goals.
- Li et al. (2021), “Adversarial VQA”: Focused on untargeted, white-box attacks on VQA models; here we see high-success targeted evasion under strict black-box constraints.
- Dong et al. (2023), “How Robust Is Google’s Bard to Adversarial Image Attacks?”: Concurrent red-teaming of a proprietary multimodal API; this paper provides systematic open-source benchmarks and attack strategies.
- Related Resources: See vlm-attacks for a practical implementation guide and Robust-LLaVA - On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models for defense mechanisms.
7. Strengths vs. Weaknesses
- Strengths
  - Comprehensive evaluation across six modern VLMs under a realistic black-box, targeted threat model.
  - Clear methodology combining transfer and query attacks, with ablations on budget allocation.
  - Public code and data promote reproducibility.
- Weaknesses
  - Limited to digital-domain attacks; no physical-world validation.
  - A single generic prompt may understate or overstate vulnerabilities in varied contexts.
  - Lack of statistical uncertainty measures (e.g., error bars) in reported metrics.
8. Follow-Up Ideas (Ranked)
- Prototype SVT-AD: Train a self-supervised transformer-based detector on clean vs. adversarial embeddings from CLIP and BLIP.
- Adaptive Purifier: Integrate query feedback (MF-tt) to guide iterative denoising on suspicious inputs.
- Physical-World Tests: Print adversarial examples and photograph them under varying lighting to assess real-world feasibility.
- Medical VLM Evaluation: Apply the attacks to LLaVA-Rad or CheXagent on CXR + report generation.
- Prompt Robustness: Test whether richer instruction templates (beyond the fixed prompt) alleviate or worsen attack success.
9. Quick-Reference Table
| Section | Takeaway | Caveat |
|---|---|---|
| Introduction | Black-box targeted attacks can compromise VLMs with subtle noise. | Real-world pipelines use richer prompts than fixed queries. |
| Methods | Combines transfer-based (MF-ii) and query-based (MF-tt) strategies. | Surrogate alignment assumes similar feature spaces across models. |
| Results | High CLIP-score success (≈ 0.85) on BLIP; consistent transfer across architectures. | No error bars; uses a single random seed per image/text pair. |
10. Action Items for Me
- Replicate the MF-ii + MF-tt attack on LLaVA-Rad using Symile-MIMIC chest X-rays.
- Develop a lightweight SVT-AD module to flag high-confidence adversarial queries.
- Benchmark purifier performance (e.g., a diffusion-based denoiser) on adversarial VLM inputs; a simple baseline sketch follows this list.
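For the purifier benchmark, a deliberately simple input-transformation baseline is worth running before a diffusion-based denoiser, since it sets the lower bar: blur and down/up-sample the image to wash out high-frequency noise before it reaches the VLM. A minimal sketch with torchvision, assuming a batched (N, 3, H, W) tensor in [0, 1]; this is a stand-in baseline of my own, not DiffPure or any method from the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def purify_baseline(img: torch.Tensor, blur_kernel: int = 5, down: float = 0.5) -> torch.Tensor:
    """Cheap purification baseline: Gaussian blur plus down/up-sampling to
    suppress high-frequency adversarial noise before querying the VLM."""
    x = gaussian_blur(img, kernel_size=blur_kernel)
    h, w = img.shape[-2:]
    x = F.interpolate(x, scale_factor=down, mode="bilinear", align_corners=False)
    x = F.interpolate(x, size=(h, w), mode="bilinear", align_corners=False)
    return x.clamp(0, 1)
```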
11. Quality Scores
| Metric | Score (1–5) | Rationale |
|---|---|---|
| Clarity | 4 | Well-structured; methods clearly described. |
| Rigor | 4 | Thorough ablations; uses standard adversarial protocols. |
| Novelty | 3 | Builds on existing attack frameworks; extends to VLMs. |
| Impact | 4 | Highlights urgent safety concerns for multimodal APIs. |
| Clinical Reliability | 2 | No evaluation on medical datasets or physical settings. |