The Evolution of Adversarial Robustness: From Neural Networks to Vision-Language Models
Adversarial attacks expose hidden weaknesses in machine learning. This post traces how robustness research began with early neural networks and now extends to complex vision-language models. We highlight key discoveries, methods, and lessons for safe AI in healthcare.
1. Early Insights into Neural Networks (2014)
Szegedy et al. first showed that deep nets have two surprising traits:
- Distributed knowledge: every neuron carries bits of meaning, so no single node holds a concept alone.
- Adversarial examples: tiny, carefully chosen pixel changes can fool a classifier with high confidence.
Main findings:
- Minimal perturbations (average distortion σ ≈ 0.06–0.14) yielded 100% attack success on the networks tested.
- Attacks transfer across models: 15–40% of examples that fool one network also fool others.
- The effect holds for linear models, convolutional nets, and unsupervised learners.
Why it happens
High-dimensional input spaces behave almost linearly in many directions. With limited training data, models leave blind spots that adversaries can exploit.
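This linearity is exactly what later one-step gradient attacks exploit. As a minimal illustration, here is a sketch of the fast gradient sign method (introduced in follow-up work; Szegedy et al.'s original attack used L-BFGS), assuming a generic PyTorch classifier `model`:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=0.03):
    """One-step attack exploiting local linearity: nudge every pixel by eps
    in the direction that most increases the loss."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Under an L-infinity budget, the gradient's sign is the worst-case direction.
    x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0, 1).detach()
```

Because the loss surface behaves nearly linearly around the input, even this single step is often enough to flip the prediction.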
Impact
This work launched adversarial machine learning. A decade later, even advanced vision-language systems show the same core vulnerability.
2. Robustness in Vision-Language Models (2023)
Zhao et al. studied how models like GPT-4 respond to image perturbations when accessed only via APIs.
Attack strategy:
- Transfer stage: generate initial adversarial images on a surrogate model (CLIP or BLIP).
- Query refinement: fine-tune the perturbations with a small number of black-box calls.
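A minimal sketch of this two-stage loop, assuming a white-box `surrogate` image encoder and a `victim_score` function returning the victim API's similarity score (both names are illustrative stand-ins, not the authors' code):

```python
import torch

def transfer_then_refine(surrogate, victim_score, x, target_emb,
                         eps=8/255, steps=100, queries=8, sigma=0.01):
    # Stage 1: transfer. Optimize the noise fully offline on the surrogate.
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=1e-2)
    for _ in range(steps):
        sim = torch.cosine_similarity(
            surrogate((x + delta).clamp(0, 1)), target_emb).mean()
        opt.zero_grad()
        (-sim).backward()              # push features toward the target embedding
        opt.step()
        delta.data.clamp_(-eps, eps)   # keep the noise imperceptible
    # Stage 2: query refinement. Spend only a handful of black-box calls.
    delta = delta.detach()
    with torch.no_grad():
        for _ in range(queries):
            probe = sigma * torch.randn_like(x)     # random search direction
            if victim_score((x + delta + probe).clamp(0, 1)) > \
               victim_score((x + delta).clamp(0, 1)):
                delta = (delta + probe).clamp(-eps, eps)
    return (x + delta).clamp(0, 1)
```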
Key results:
- CLIP similarity scores jumped from 0.45 to over 0.80 using ε = 8/255 noise.
- Even small models (BLIP-Base, 224 M params) and large ones (MiniGPT-4, 14 B) were vulnerable.
- Only eight API queries were enough to reach high success rates.
Why it matters
Vision attacks are invisible to humans and fully automated. Standard text defenses cannot block them.
3. Text-Only Deception of VLMs (2025)
Ahn et al. introduced the MAC (Multimodal Adversarial Compositionality) framework. They used LLMs to craft text prompts that mislead vision-language models without changing images.
Framework highlights:
- Crossmodal similarity: ensure the generated text scores highly when paired with the original image.
- Non-entailment: the text contradicts the true image content.
- Lexical distance: keep word changes subtle but effective.
- Diversity: produce varied attack patterns to avoid detection.
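As a toy illustration of how the first three criteria can gate candidate captions, consider the filter below; `text_encoder`, `nli_contradicts`, and both thresholds are hypothetical stand-ins, not the paper's implementation:

```python
import torch

def passes_mac_criteria(candidate, original, image_emb,
                        text_encoder, nli_contradicts):
    """Accept a candidate deceptive caption only if it meets MAC-style checks."""
    # Crossmodal similarity: the caption must still score highly with the image.
    sim = torch.cosine_similarity(text_encoder(candidate), image_emb, dim=-1)
    crossmodal_ok = sim.item() > 0.3
    # Non-entailment: the caption must contradict the true image content.
    non_entailed = nli_contradicts(original, candidate)
    # Lexical distance: word changes stay subtle (most tokens preserved).
    shared = set(candidate.lower().split()) & set(original.lower().split())
    lexical_ok = len(shared) / max(len(original.split()), 1) > 0.5
    return crossmodal_ok and non_entailed and lexical_ok
```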
Performance:
- Success rates rose from 6.9% (zero-shot) to 42.1% with self-training.
- Transfer attacks fooled 23–41% of other models.
- Only four self-training steps matched a 16-step zero-shot baseline.
Takeaway
Even without touching pixels, clever text can exploit VLM weaknesses.
4. Universal Surrogate Attacks (2025)
Xu et al. showed that a single publicly available CLIP model can serve as a surrogate for crafting universal noise patterns that mislead many classifiers.
UnivIntruder method:
- Surrogate alignment: use CLIP text embeddings to guide the perturbation direction.
- Feature-space tuning: target the embedding shifts that cause misclassification.
- Robust transformations: apply random crops and shifts so the perturbation generalizes.
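In spirit, the optimization looks like the sketch below: one image-agnostic noise tensor is trained over a whole dataset to drag CLIP image features toward the target class's text embedding. Names and hyperparameters here are illustrative, not the paper's exact recipe:

```python
import torch
import torchvision.transforms as T

def universal_noise(clip_image_enc, target_text_emb, loader,
                    eps=16/255, epochs=5, lr=1e-2):
    # Random crops and shifts make the noise survive real-world preprocessing.
    aug = T.Compose([T.RandomResizedCrop(224, scale=(0.8, 1.0)),
                     T.RandomAffine(degrees=0, translate=(0.1, 0.1))])
    delta = torch.zeros(1, 3, 224, 224, requires_grad=True)  # one shared pattern
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(epochs):
        for x, _ in loader:            # assumes 224x224 images scaled to [0, 1]
            x_adv = aug((x + delta).clamp(0, 1))
            feat = clip_image_enc(x_adv)
            # Pull every image's embedding toward the target class description.
            loss = -torch.cosine_similarity(feat, target_text_emb).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            delta.data.clamp_(-eps, eps)
    return delta.detach()
```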
Results:
- ImageNet top-1 attack rate: 85.1%
- CIFAR-10 attack rate: 99.4%
- Real-world services (Google, Baidu, GPT-4) saw 50–80% success.
- Required 80% fewer queries than previous black-box methods.
Defense notes
Adversarial training helps but cuts clean accuracy and demands heavy compute. Test-time filters also degrade performance.
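For context, the standard PGD adversarial-training step is sketched below; the compute cost comes from running the inner maximization on every batch:

```python
import torch
import torch.nn.functional as F

def adv_train_step(model, opt, x, y, eps=8/255, alpha=2/255, pgd_steps=7):
    # Inner maximization: find a worst-case perturbation within the eps ball.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(pgd_steps):
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    # Outer minimization: update the model on the adversarial batch.
    opt.zero_grad()
    F.cross_entropy(model((x + delta).clamp(0, 1)), y).backward()
    opt.step()
```

Each training batch effectively costs pgd_steps + 1 forward-backward passes, which is where the heavy compute comes from.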
5. Pitfalls in Medical Imaging (2023)
Ghamizi et al. reevaluated adversarial claims on chest X-ray classifiers and found common flaws:
- Binary labels only: most studies ignore multi-label disease patterns.
- Single-dataset tests: the lack of cross-hospital validation hides generalization issues.
- Unrealistic threat models: white-box attacks assume full model access.
- Standard metrics fall short: plain accuracy misses risks from co-occurring conditions.
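A multi-label robust evaluation need not be complicated; the sketch below (an illustrative metric, not the paper's) reports per-disease robustness and an exact-match score instead of one collapsed accuracy:

```python
import numpy as np

def multilabel_robust_report(y_true, y_clean, y_adv):
    """All arrays are (n_samples, n_labels) binary matrices."""
    per_label = (y_adv == y_true).mean(axis=0)    # robustness per disease
    exact = (y_adv == y_true).all(axis=1).mean()  # every co-occurring finding right
    flip_rate = (y_adv != y_clean).mean()         # how much the attack moved outputs
    return {"per_label": per_label, "exact_match": exact, "flip_rate": flip_rate}
```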
Key insights:
- PadChest-trained models were nearly four times as robust as NIH-trained ones (53.6% vs. 13.8% robust accuracy).
- Risk-based attacks dropped robust accuracy from 32.1% to 11.1%.
- Data variety mattered more than model architecture for true robustness.
Lesson
Healthcare demands domain-specific evaluations and realistic threat scenarios.
6. Preserving High Frequencies in Vision Transformers (2024)
Lin et al. identified that ViTs act like low-pass filters, losing fine details in diagrams and charts.
FM-ViT solution:
- Fourier split: separate the image into low- and high-frequency components.
- Adaptive reweighting: learn weights for each frequency band per image.
- Wavelet guidance: use discrete wavelet transforms to set dynamic filters.
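The core idea can be sketched with an FFT-based split. FM-ViT predicts the band weights per image; the fixed scalars below are stand-ins for illustration:

```python
import torch

def frequency_reweight(x, cutoff=0.1, w_low=1.0, w_high=1.5):
    """Split a (B, C, H, W) batch into low/high-frequency parts and recombine."""
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-0.5, 0.5, h),
                            torch.linspace(-0.5, 0.5, w), indexing="ij")
    mask = ((yy**2 + xx**2).sqrt() <= cutoff).to(x.dtype)   # low-pass disk
    low = torch.fft.ifft2(torch.fft.ifftshift(freq * mask, dim=(-2, -1))).real
    high = x - low                      # the residual holds edges and fine detail
    return w_low * low + w_high * high  # boost high frequencies instead of losing them
```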
Impact:
- Significant accuracy gains on chart and diagram tasks.
- Better detection of small text and lines critical to medical scans.
Relevance
In X-rays, preserving edges can make the difference between spotting a nodule or missing it.
7. Interpreting VLM Attention (2024)
Stan et al. released LVLM-Interpret, a tool to visualize how vision-language models allocate attention.
Features:
- Layer-by-layer maps: show token interactions across image and text.
- Relevancy scores: use gradients to rank which patches drive each token.
- Causal graphs: identify the minimal image regions needed for a given output.
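A simplified gradient-times-input relevancy score over image patches conveys the flavor of the second feature; `vlm_logit_fn`, which maps patch embeddings to the logit of one generated token, is a hypothetical interface:

```python
import torch

def rank_patches(vlm_logit_fn, image_patches):
    """Rank (N, D) patch embeddings by their influence on one output token."""
    patches = image_patches.clone().detach().requires_grad_(True)
    logit = vlm_logit_fn(patches)      # scalar logit for the token of interest
    logit.backward()
    scores = (patches * patches.grad).sum(dim=-1).abs()  # gradient x input
    return scores.argsort(descending=True)               # most influential first
```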
Findings:
- Text prompts sometimes dominate over visual cues.
- Five to ten patches often suffice to determine the model's answer.
- Bias patterns emerged when prompts contained sensitive topics.
Use case
Integrate this tool into clinical validation to catch unwanted text bias in medical VLMs.
8. Token Activation Maps (2025)
Li et al. tackled the problem of early-token interference in multimodal generation.
TAM approach:
- Causal inference: estimate and remove the influence of prior tokens.
- Gaussian denoising: filter out noise while keeping sharp details.
- Fusion maps: combine the cleaned visual maps with textual relevance for each generated token.
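A simplified version of the pipeline, with TAM's causal estimate reduced to a mean-subtraction over prior tokens and a fixed Gaussian kernel, might look like this:

```python
import torch
import torch.nn.functional as F

def token_activation_map(attn_maps, t, kernel=5, sigma=1.0):
    """attn_maps: (T, H, W) attention over image patches for T generated tokens."""
    # Remove the estimated influence of earlier tokens from token t's map.
    context = attn_maps[:t].mean(dim=0) if t > 0 else torch.zeros_like(attn_maps[0])
    cleaned = (attn_maps[t] - context).clamp(min=0)
    # Smooth with a small Gaussian kernel to suppress residual noise.
    coords = torch.arange(kernel, dtype=torch.float32) - kernel // 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    g = g / g.sum()
    k2d = (g[:, None] * g[None, :]).view(1, 1, kernel, kernel)
    return F.conv2d(cleaned[None, None], k2d, padding=kernel // 2)[0, 0]
```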
Results:
- Up to 11-point F1-IoU boost on COCO captions.
- Sharper object localization and clearer attribute highlighting.
Conclusion: Lessons for Healthcare AI
- Adversarial examples are a fundamental trait of high-dimensional models.
- Vision-language systems inherit and amplify these risks.
- Medical AI needs tailored threat models, multi-label tests, and cross-site validation.
- Frequency preservation and interpretability tools are vital for clinical trust.
- A unified evaluation across performance, safety, and fairness must guide deployment.
By following these principles, we can build AI tools that serve patients safely and reliably.