Visual perturbations
For foundational VLM architecture understanding, see VLM Basics.
- Natural Noise and Artifacts - regular CXR vs. portable ER acquisition noise, grid lines from digitizers
- Adversarial Noise (FGSM, PGD): FGSM is a single-step attack that computes the gradient of the model's loss with respect to the input image (see linear-hypothesis-explanation for the theoretical foundation) and adds a small perturbation in the direction of the sign of that gradient to maximize the loss, often causing misclassification. PGD is an iterative extension of FGSM that takes multiple smaller steps; after each step, it projects the perturbed image back into a small allowed neighborhood of the original so the change stays imperceptible. See the sketch after this list.
- Adversarial Patches: Instead of distributed noise, insert a localized patch
- Domain Shifts: different hospitals, populations, demographics
- Data Quality Issues
- Composite Attacks
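A minimal PyTorch sketch of FGSM and PGD as described in the adversarial-noise bullet above; `model`, `loss_fn`, and the epsilon/step values are placeholders, not settings from any particular CXR pipeline.

```python
import torch

def fgsm(model, loss_fn, x, y, eps=2/255):
    """Single-step FGSM: move x in the direction of the sign of the loss gradient."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    return (x_adv + eps * x_adv.grad.sign()).clamp(0, 1).detach()

def pgd(model, loss_fn, x, y, eps=4/255, alpha=1/255, steps=10):
    """Iterative FGSM; after each step, project back into the eps-ball around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = x + (x_adv - x).clamp(-eps, eps)  # projection step
        x_adv = x_adv.clamp(0, 1)                 # keep valid pixel range
    return x_adv
```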
Current Defense Techniques
- Adversarial Training: the model is explicitly trained on adversarially perturbed images so that it learns to be resilient; it is computationally expensive and can trade off clean accuracy.
- Data Augmentation: augment the training data with varied perturbations (added noise, random rotations and scaling, simulated lower image quality), plus standard augmentations (flips, rotations).
- Defensive Preprocessing: at inference time, apply filters to the input to remove potential adversarial signals.
- Certified and Provable Defenses: e.g., randomized smoothing (see the sketch after this list)
- Ensemble and Redundancy
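A minimal sketch of the prediction step of randomized smoothing, assuming a generic CXR `classifier`; the certification-radius computation is omitted and the noise level and sample count are illustrative.

```python
import torch

def smoothed_predict(classifier, x, sigma=0.25, n_samples=100):
    """Predict by majority vote over Gaussian-noised copies of the input.
    The vote margin (not computed here) is what yields a certified radius."""
    with torch.no_grad():
        noisy = x.unsqueeze(0) + sigma * torch.randn(n_samples, *x.shape)
        votes = classifier(noisy).argmax(dim=1)
    return torch.bincount(votes).argmax().item()
```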
VLM Specific Issues
For comprehensive attack methods, see vlm-attacks and On Evaluating Adversarial Robustness of Large Vision-Language Models.
Architecture
- Vision Encoder (ViT, SigLIP) - the encoder outputs a set of image feature tokens or a fixed-length embedding vector. MedGemma's encoder normalizes a CXR to 896×896 and encodes it into 256 latent tokens.
- Fusion Layer - maps image features into the LLM's token space, either via cross-attention or, as in LLaVA, a learned projection that turns the visual encoder's output into pseudo-word embeddings that Vicuna attends to (see the sketch after this list). During training the model learns to align visual features with textual descriptions (e.g., the feature vector for an enlarged cardiac silhouette aligns with the token "cardiomegaly").
- LLM - generates an output sequence conditioned on the fused representation, e.g., completing "the CXR has [MASK]".
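A minimal sketch of LLaVA-style projection fusion as described above; the dimensions (1152 for the vision encoder, 4096 for the LLM) and the 256-token image sequence are illustrative assumptions, not the exact MedGemma configuration.

```python
import torch
import torch.nn as nn

class ProjectionFusion(nn.Module):
    """LLaVA-style fusion: project image tokens into the LLM's embedding space."""
    def __init__(self, vision_dim=1152, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # learned projector

    def forward(self, image_tokens, text_embeds):
        # image_tokens: (B, 256, vision_dim), e.g. SigLIP output
        # text_embeds:  (B, T, llm_dim) from the LLM's embedding table
        pseudo_words = self.proj(image_tokens)                  # (B, 256, llm_dim)
        return torch.cat([pseudo_words, text_embeds], dim=1)    # fed to the LLM
```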
Vulnerability Areas
- Vision Encoder: a high-dimensional pattern recognizer; adversarial pixel perturbations directly target this stage by exploiting the encoder's gradients, and domain shifts can degrade it through intensity-distribution changes alone.
- Fusion Layer: if the image features are out-of-distribution, the projection layer might produce embedding tokens that are not meaningful to the LLM, or that accidentally correspond to unrelated tokens, e.g., a steganographic attack where hidden text is embedded in the image pixels.
- LLM: inherits all language-model robustness issues, such as hallucination.
Possible Robustness techniques
Architecture Level
Frequency Aware Vision Encoders
Related to robust encoder approaches in Robust-LLaVA - On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models.
- Integrate FM-ViT modules after each SigLIP self-attention block, Fourier-decomposing attention outputs into low/high frequencies and re-weighting them with learnable coefficients (see the sketch after this list).
- Adjust the input embedding to accept single-channel CXRs so the FM layers preserve subtle lung textures and soft-tissue details.
- Add a sample-adaptive branch using DWT-derived statistics (mean/variance of sub-bands) to dynamically tune low/high frequency weights per chest X-ray.
- Fine-tune in three stages: (1) contrastive tuning on CXR–report pairs, (2) cross-modal projector training to align image and report embeddings, (3) instruction tuning with radiology Q&A/report-generation prompts.
- Benchmark and hyperparameter-tune on CXR classification and VQA datasets, optimizing frequency-band thresholds for radiographic feature robustness.
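A rough sketch of the frequency re-weighting idea (not the published FM-ViT code): token features are split into low/high spatial-frequency bands with an FFT mask and recombined with learnable weights. The 16×16 token grid and the cutoff are assumptions for illustration.

```python
import torch
import torch.nn as nn

class FrequencyReweight(nn.Module):
    """Split spatial token features into low/high frequency bands and
    re-weight them with learnable coefficients (applied after self-attention)."""
    def __init__(self, grid=16, cutoff=0.25):
        super().__init__()
        self.grid = grid
        self.w_low = nn.Parameter(torch.ones(1))
        self.w_high = nn.Parameter(torch.ones(1))
        fy = torch.fft.fftfreq(grid).abs()
        fx = torch.fft.fftfreq(grid).abs()
        radius = torch.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
        self.register_buffer("low_mask", (radius <= cutoff).float())

    def forward(self, tokens):
        # tokens: (B, N, D) with N == grid * grid spatial tokens
        B, N, D = tokens.shape
        x = tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        freq = torch.fft.fft2(x)
        low = torch.fft.ifft2(freq * self.low_mask).real
        high = torch.fft.ifft2(freq * (1 - self.low_mask)).real
        out = self.w_low * low + self.w_high * high
        return out.reshape(B, D, N).transpose(1, 2)
```

The sample-adaptive variant would replace the two scalar weights with outputs of a small network fed by DWT sub-band statistics.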
Mixture-of-Experts (MoE)
Use multiple vision encoders, each trained under different conditions (clean, noisy, overlay), and weight their votes, e.g., with a learned gate as sketched below.
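A minimal sketch of such a gated mixture; it assumes each encoder returns a pooled embedding of size `feat_dim`, and all names are placeholders.

```python
import torch
import torch.nn as nn

class EncoderMixture(nn.Module):
    """Weight the outputs of several specialist vision encoders
    (e.g., trained on clean, noisy, and overlay data) with a learned gate."""
    def __init__(self, encoders, feat_dim):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)
        self.gate = nn.Sequential(nn.Linear(feat_dim, len(encoders)),
                                  nn.Softmax(dim=-1))

    def forward(self, image):
        feats = torch.stack([enc(image) for enc in self.encoders], dim=1)  # (B, E, D)
        weights = self.gate(feats.mean(dim=1))                             # (B, E)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)                  # (B, D)
```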
Training Level
Fine-tuning the LLM with noisy visual inputs: when fine-tuning a multimodal model like LLaVA-Med, training typically uses clean image–text pairs. You can instead fine-tune the language component by feeding perturbed images paired with the correct explanations to improve robustness; a sketch follows.
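A sketch of that idea, assuming torchvision-style transforms: perturb a fraction of training images while keeping the original (correct) text targets unchanged.

```python
import random
import torchvision.transforms as T

# Perturbations that mimic degraded acquisitions (parameters are illustrative).
perturb = T.Compose([
    T.GaussianBlur(kernel_size=5),
    T.ColorJitter(brightness=0.3, contrast=0.3),
])

def make_training_example(image, answer, p=0.5):
    """With probability p, replace the clean CXR with a perturbed version;
    the paired report/answer stays the same so the model learns robustness."""
    if random.random() < p:
        image = perturb(image)
    return image, answer
```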
Contrastive loss: train the vision encoder with a contrastive loss where an image and a perturbed version of the same image must produce similar embeddings (invariant representations).
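A minimal sketch of such an invariance objective, written as an NT-Xent-style contrastive loss over clean and perturbed embeddings of the same batch.

```python
import torch
import torch.nn.functional as F

def invariance_loss(z_clean, z_perturbed, temperature=0.1):
    """Embeddings of a clean CXR and its perturbed view should be closer
    to each other than to any other image in the batch."""
    z1 = F.normalize(z_clean, dim=-1)
    z2 = F.normalize(z_perturbed, dim=-1)
    logits = z1 @ z2.t() / temperature                      # (B, B) similarities
    targets = torch.arange(z1.size(0), device=z1.device)    # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)
```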
Test-Time / Inference-Time Level
Preprocessing filters - apply filtering or compression to the input before inference.
Test-time augmentation: ask the model the same question on the original and on augmented images, then use a voting or agreement check (see the sketch after this list). Large disagreement flags a possible attack or high uncertainty; the system can abstain or respond with caution.
Uncertainty-aware outputs
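A sketch of the agreement check described above; `vlm_answer` and the augmentation list are placeholders for whatever VLM interface and transforms are actually used.

```python
from collections import Counter

def tta_answer(vlm_answer, image, question, augmentations, min_agreement=0.6):
    """Query the VLM on the original and augmented images; return None (abstain)
    when answers disagree too much, which may indicate an attack or high uncertainty.
    `vlm_answer(image, question) -> str` is a placeholder interface."""
    views = [image] + [aug(image) for aug in augmentations]
    answers = [vlm_answer(view, question) for view in views]
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < min_agreement:
        return None  # abstain or respond with caution
    return top_answer
```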
Key questions
- What evaluation methods (MedPhr‑Rad baselining, VSF‑Med (prior), CARES, RexVQA) best capture robustness in medical AI? See Toward a Holistic Evaluation of Robustness in CLIP Models for evaluation framework.
- How do multimodal prompt attacks (image + text) interact, and how can they be defended against?
- Which robustness techniques transfer across datasets and pathologies?
- What is the trade-off between model complexity and robustness in this domain?
- Can incorporating medical domain knowledge, or even additional modalities, enhance robustness?