Visual perturbations
For foundational VLM architecture understanding, see VLM Basics.
- Natural Noise and Artifacts - regular CXR vs. portable ER acquisition noise, grid lines from digitizers
- Adversarial Noise (FGSM, PGD): FGSM is a single-step attack that computes the gradient of the model's loss with respect to the input image (see linear-hypothesis-explanation for the theoretical foundation) and adds a small perturbation in the direction of the sign of that gradient to maximize the loss, often causing misclassification. PGD is an iterative extension of FGSM that takes multiple smaller steps; after each step, it projects the perturbed image back into a small allowed neighborhood of the original so the change stays imperceptible. See the sketch after this list.
- Adversarial Patches: Instead of distributed noise, insert a localized patch
- Domain Shifts: different hospitals, populations, demographics
- Data Quality Issues
- Composite Attacks
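A minimal PyTorch sketch of FGSM and PGD as described in the adversarial-noise bullet above; `model`, `loss_fn`, and the epsilon/step values are placeholders, not settings from any particular CXR pipeline.

```python
import torch

def fgsm(model, loss_fn, x, y, eps=2/255):
    """Single-step FGSM: move x in the direction of the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def pgd(model, loss_fn, x, y, eps=4/255, alpha=1/255, steps=10):
    """Iterative FGSM; after each step, project back into the eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # projection step
        x_adv = x_adv.clamp(0, 1)                 # keep valid pixel range
    return x_adv
```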
Current Defense Techniques
- Adversarial Training: the model is explicitly trained on adversarially perturbed images so that it learns to be resilient; it is computationally expensive and can trade off clean accuracy.
- Data Augmentation: augment the training data with varied perturbations (added noise, random rotations and scaling, simulated lower image quality), plus standard augmentations (flips, rotations).
- Defensive Preprocessing: at inference time, apply filters to the input to remove potential adversarial signals.
- Certified and Provable Defenses: e.g., randomized smoothing (see the sketch after this list)
- Ensemble and Redundancy
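A minimal sketch of the prediction step of randomized smoothing, assuming a generic CXR `classifier`; the certification-radius computation is omitted and the noise level and sample count are illustrative.

```python
import torch

def smoothed_predict(classifier, x, sigma=0.25, n_samples=100):
    """Predict by majority vote over Gaussian-noised copies of the input.
    The vote margin (not computed here) is what yields a certified radius."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n_samples, *x.shape)
        votes = classifier(noisy).argmax(dim=1)
    return torch.bincount(votes).argmax().item()
```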
VLM Specific Issues
For comprehensive attack methods, see vlm-attacks and On Evaluating Adversarial Robustness of Large Vision-Language Models.
Architecture
- Vision Encoder (ViT, SigLIP) - the encoder outputs a set of image feature tokens or a fixed-length embedding vector. MedGemma's encoder normalizes a CXR to 896×896 and encodes it into 256 latent tokens.
- Fusion Layer - maps image features into the LLM's token space, either via cross-attention or, as in LLaVA, a learned projection that turns the visual encoder's output into pseudo-word embeddings that Vicuna attends to (see the sketch after this list). During training the model learns to align visual features with textual descriptions (e.g., the feature vector for an enlarged cardiac silhouette aligns with the token "cardiomegaly").
- LLM - generates an output sequence conditioned on the fused representation, e.g., completing "the CXR has [MASK]".
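A minimal sketch of LLaVA-style projection fusion as described above; the dimensions (1152 for the vision encoder, 4096 for the LLM) and the 256-token image sequence are illustrative assumptions, not the exact MedGemma configuration.

```python
import torch
import torch.nn as nn

class ProjectionFusion(nn.Module):
    """LLaVA-style fusion: project image tokens into the LLM's embedding space."""
    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # learned projector

    def forward(self, image_tokens, text_embeds):
        # image_tokens: (B, 256, vision_dim), e.g. SigLIP output
        # text_embeds:  (B, T, llm_dim) from the LLM's embedding table
        pseudo_words = self.proj(image_tokens)                  # (B, 256, llm_dim)
        return torch.cat([pseudo_words, text_embeds], dim=1)    # fed to the LLM
```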
Vulnerability Areas
- Vision Encoder: a high-dimensional pattern recognizer; adversarial pixel perturbations directly target this stage by exploiting the encoder's gradients, and domain shifts can degrade it through intensity-distribution changes alone.
- Fusion Layer: if the image features are out-of-distribution, the projection layer might produce embedding tokens that are not meaningful to the LLM, or that accidentally correspond to unrelated tokens, e.g., a steganographic attack where hidden text is embedded in the image pixels.
- LLM: inherits all language-model robustness issues, such as hallucination.
Possible Robustness techniques
Architecture Level
Frequency Aware Vision Encoders
Related to robust encoder approaches in Robust-LLaVA - On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models.
- Integrate FM-ViT modules after each SigLIP self-attention block, Fourier-decomposing attention outputs into low/high frequencies and re-weighting them with learnable coefficients (see the sketch after this list).
- Adjust the input embedding to accept single-channel CXRs so the FM layers preserve subtle lung textures and soft-tissue details.
- Add a sample-adaptive branch using DWT-derived statistics (mean/variance of sub-bands) to dynamically tune low/high frequency weights per chest X-ray.
- Fine-tune in three stages: (1) contrastive tuning on CXR–report pairs, (2) cross-modal projector training to align image and report embeddings, (3) instruction tuning with radiology Q&A/report-generation prompts.
- Benchmark and hyperparameter-tune on CXR classification and VQA datasets, optimizing frequency-band thresholds for radiographic feature robustness.
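A rough sketch of the frequency re-weighting idea (not the published FM-ViT code): token features are split into low/high spatial-frequency bands with an FFT mask and recombined with learnable weights. The 16×16 token grid and the cutoff are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FrequencyReweight(nn.Module):
    """Split spatial token features into low/high frequency bands and
    re-weight them with learnable coefficients (applied after self-attention)."""
    def __init__(self, grid=16, cutoff=0.25):
        super().__init__()
        self.grid = grid
        self.w_low = nn.Parameter(torch.ones(1))
        self.w_high = nn.Parameter(torch.ones(1))
        fy = torch.fft.fftfreq(grid).abs()
        fx = torch.fft.fftfreq(grid).abs()
        radius = torch.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
        self.register_buffer("low_mask", (radius <= cutoff).float())

    def forward(self, tokens):
        # tokens: (B, N, D) with N == grid * grid spatial tokens
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        freq = torch.fft.fft2(x)
        low = torch.fft.ifft2(freq * self.low_mask).real
        high = torch.fft.ifft2(freq * (1 - self.low_mask)).real
        out = self.w_low * low + self.w_high * high
        return out.reshape(B, D, N).transpose(1, 2)
```

The sample-adaptive variant would replace the two scalar weights with outputs of a small network fed by DWT sub-band statistics.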
Mixture-of-Experts (MoE)
Use multiple vision encoders, each trained under different conditions (clean, noisy, overlay), and weight their votes, e.g., with a learned gate as sketched below.
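A minimal sketch of such a gated mixture; it assumes each encoder returns a pooled embedding of size `feat_dim`, and all names are placeholders.

```python
import torch
import torch.nn as nn

class EncoderMixture(nn.Module):
    """Weight the outputs of several specialist vision encoders
    (e.g., trained on clean, noisy, and overlay data) with a learned gate."""
    def __init__(self, encoders, feat_dim):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.gate = nn.Sequential(nn.Linear(feat_dim, len(encoders)),
                                  nn.Softmax(dim=-1))

    def forward(self, image):
        feats = torch.stack([enc(image) for enc in self.encoders], dim=1)  # (B, E, D)
        weights = self.gate(feats.mean(dim=1))                             # (B, E)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)                  # (B, D)
```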
Training Level
Fine-tuning the LLM with noisy visual inputs: when fine-tuning a multimodal model like LLaVA-Med, training typically uses clean image–text pairs. You can instead fine-tune the language component by feeding perturbed images paired with the correct explanations to improve robustness; a sketch follows.
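A sketch of that idea, assuming torchvision-style transforms: perturb a fraction of training images while keeping the original (correct) text targets unchanged.

```python
import random
import torchvision.transforms as T

# Perturbations that mimic degraded acquisitions (parameters are illustrative).
perturb = T.Compose([
    T.GaussianBlur(kernel_size=5),
    T.ColorJitter(brightness=0.3, contrast=0.3),
])

def make_training_example(image, answer, p=0.5):
    """With probability p, replace the clean CXR with a perturbed version;
    the paired report/answer stays the same so the model learns robustness."""
    if random.random() < p:
        image = perturb(image)
    return image, answer
```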
Contrastive loss: train the vision encoder with a contrastive loss where an image and a perturbed version of the same image must produce similar embeddings (invariant representations).
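A minimal sketch of such an invariance objective, written as an NT-Xent-style contrastive loss over clean and perturbed embeddings of the same batch.

```python
import torch
import torch.nn.functional as F

def invariance_loss(z_clean, z_perturbed, temperature=0.1):
    """Embeddings of a clean CXR and its perturbed view should be closer
    to each other than to any other image in the batch."""
    z1 = F.normalize(z_clean, dim=-1)
    z2 = F.normalize(z_perturbed, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)
```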
Test-Time / Inference-Time Level
Preprocessing filters - apply filtering or compression to the input before inference.
Test-time augmentation: ask the model the same question on the original and on augmented images, then use a voting or agreement check (see the sketch after this list). Large disagreement flags a possible attack or high uncertainty; the system can abstain or respond with caution.
Uncertainty-aware outputs
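A sketch of the agreement check described above; `vlm_answer` and the augmentation list are placeholders for whatever VLM interface and transforms are actually used.

```python
from collections import Counter

def tta_answer(vlm_answer, image, question, augmentations, min_agreement=0.6):
    """Query the VLM on the original and augmented images; return None (abstain)
    when answers disagree too much, which may indicate an attack or high uncertainty.
    `vlm_answer(image, question) -> str` is a placeholder interface."""
    views = [image] + [aug(image) for aug in augmentations]
    answers = [vlm_answer(view, question) for view in views]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < min_agreement:
        return None  # abstain or respond with caution
    return top_answer
```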
Key questions
- What evaluation methods (MedPhr‑Rad baselining, VSF‑Med (prior), CARES, RexVQA) best capture robustness in medical AI? See Toward a Holistic Evaluation of Robustness in CLIP Models for evaluation framework.
- How do multimodal prompt attacks (image + text) interact, and how can they be defended against?
- Which robustness techniques transfer across datasets and pathologies?
- What is the trade-off between model complexity and robustness in this domain?
- Can incorporating medical domain knowledge, or even additional modalities, enhance robustness?