Thrust 2: Causal Analysis — Where Do Flips Come From?

Goal: Localize which components (layers, heads, paths) cause paraphrase-driven answer flips even while visual attention remains stable.

Core idea

Treat the VLM as a causal system. Paraphrases are interventions on the text stream; we measure how effects propagate through text encoding, cross-modal fusion, and decoding.

Methods

  • Structural Causal Model (SCM): Formalizes variables and pathways (image → vision encoder; paraphrase → text encoder; fusion; decoding).
  • Activation Patching: Replace selected activations (e.g., specific cross-attention heads) with those from another paraphrase and observe whether the answer changes (see the sketch after this list).
  • Token Ablation: Remove or alter the tokens that differ within a paraphrase pair; measure whether each change (negation, scope markers, synonyms) is necessary for a flip.
  • Region-Constrained Evaluation: Clamp visual features to a region of interest (ROI) to test for decoupling, i.e., stable visual attention alongside answer flips.
  • Mediation Analysis: Decompose total paraphrase effect into direct (text-only) vs. indirect (via fusion) components.
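
A minimal patching sketch in PyTorch, assuming a HuggingFace-style VLM whose forward accepts a dict of inputs and returns logits. The dotted module path is an illustrative placeholder, and this patches a whole module's output; per-head patching would additionally slice the head dimension before the output projection.

```python
import torch

def get_module(model, path: str):
    """Resolve a dotted module path such as 'language_model.layers.12.cross_attn'."""
    mod = model
    for part in path.split("."):
        mod = mod[int(part)] if part.isdigit() else getattr(mod, part)
    return mod

@torch.no_grad()
def patch_and_run(model, inputs_a, inputs_b, module_path: str):
    """Run paraphrase A with `module_path`'s output transplanted from paraphrase B.

    Assumes the pair is padded to the same sequence length so shapes match.
    """
    cache = {}
    target = get_module(model, module_path)

    def save_hook(_mod, _inp, out):
        # Cache the activation from the source (paraphrase B) run.
        cache["act"] = out[0] if isinstance(out, tuple) else out

    def patch_hook(_mod, _inp, out):
        # Splice B's activation into the patched (paraphrase A) run.
        if isinstance(out, tuple):
            return (cache["act"],) + out[1:]
        return cache["act"]

    # 1) Source run: record the activation to transplant.
    handle = target.register_forward_hook(save_hook)
    model(**inputs_b)
    handle.remove()

    # 2) Patched run: same forward on A, but with B's activation in place.
    handle = target.register_forward_hook(patch_hook)
    logits = model(**inputs_a).logits
    handle.remove()
    return logits
```

Sweeping `module_path` over fusion layers/heads and comparing patched vs. unpatched logits yields the per-component attributions that the atlas under Outputs is built from.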

Early expectations (to validate)

  • Divergence hotspots in cross-attention (layers 12–16 for MedGemma; 8–12 for LLaVA-Rad).
  • Negation tokens carry outsized causal importance; lexical swaps have smaller but non-trivial effects.
  • Vision encoder features remain highly similar across paraphrases (e.g., >0.9 cosine similarity; see the sketch below).
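
A sketch of the similarity check. In LLaVA-style architectures the vision tower sees only the image, so its raw features are identical across paraphrases by construction; the informative comparison is at the visual-token positions inside the fused language model, where text can leak in through self-attention. The `visual_slice` argument (which positions hold image tokens) and the HF-style `output_hidden_states` interface are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_similarity(model, inputs_a, inputs_b, layer: int, visual_slice: slice):
    """Mean per-token cosine similarity of visual-token states across a paraphrase pair."""
    out_a = model(**inputs_a, output_hidden_states=True)
    out_b = model(**inputs_b, output_hidden_states=True)
    va = out_a.hidden_states[layer][0, visual_slice]  # (n_img_tokens, d)
    vb = out_b.hidden_states[layer][0, visual_slice]
    # ~1.0 means the visual stream is effectively unchanged by the paraphrase.
    return F.cosine_similarity(va, vb, dim=-1).mean().item()
```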

Outputs

  • Layer/head attribution atlas for paraphrase sensitivity.
  • Quantified Natural Direct/Indirect Effects (NDE/NIE) per linguistic phenomenon (definitions below).
  • Faithfulness validation of explanations: interventional attributions compared against saliency maps.
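
For reference, the standard natural-effect decomposition (Pearl) behind the NDE/NIE estimates, with X the paraphrase (x → x'), M the mediator (fusion-layer states), and Y the outcome (answer logit):

```latex
\begin{align*}
  \mathrm{TE}  &= \mathbb{E}[Y(x', M(x'))] - \mathbb{E}[Y(x, M(x))] \\
  \mathrm{NDE} &= \mathbb{E}[Y(x', M(x))]  - \mathbb{E}[Y(x, M(x))] \\
  \mathrm{NIE} &= \mathbb{E}[Y(x', M(x'))] - \mathbb{E}[Y(x', M(x))]
\end{align*}
% Identity: TE = NDE + NIE, with NIE here the total indirect effect.
```

Operationally, Y(x, M(x')) corresponds to a patched run: text input x with fusion states transplanted from the x' run, exactly the intervention implemented in the patching sketch above.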

Decisions / TODOs

  • Standardize the divergence metric (e.g., answer-flip indicator, logit delta on the answer token); see the sketch after this list.
  • Batch-efficient patching implementation; caching strategy for fusion states.
  • Controls: identity/random patching baselines; complementary head patching.
  • Public toolkit release with examples and unit tests.
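
A starting point for the metric-standardization TODO, sketching both candidates; function names and the single-answer-position assumption (logits are a vector over the vocabulary at the answer slot) are illustrative.

```python
import torch

def answer_flip(logits_a: torch.Tensor, logits_b: torch.Tensor) -> bool:
    """True if the argmax answer token differs between paraphrases A and B."""
    return logits_a.argmax(-1).item() != logits_b.argmax(-1).item()

def logit_delta(logits_a: torch.Tensor, logits_b: torch.Tensor, answer_id: int) -> float:
    """Signed change in the log-softmax score of the reference answer token.

    Smoother than the flip indicator, so usable for per-head attribution.
    """
    la = torch.log_softmax(logits_a, dim=-1)[..., answer_id]
    lb = torch.log_softmax(logits_b, dim=-1)[..., answer_id]
    return (la - lb).item()
```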