Thrust 2: Causal Analysis — Where Do Flips Come From?

Goal: Localize which components (layers, heads, paths) cause paraphrase-driven answer flips even while visual attention remains stable.

Core idea

Treat the VLM as a causal system. Paraphrases are interventions on the text stream; we measure how effects propagate through text encoding, cross-modal fusion, and decoding.

Methods

  • Structural Causal Model (SCM): Formalizes variables and pathways (image → vision encoder; paraphrase → text encoder; fusion; decoding).
  • Activation Patching: Replace selected activations (e.g., specific cross-attention heads) with those from another paraphrase and observe whether the answer changes (see the sketch after this list).
  • Token Ablation: Remove or alter the tokens that differ within a paraphrase pair; measure whether each change (negation, scope markers, synonyms) is necessary for a flip.
  • Region-Constrained Evaluation: Clamp visual features to a region of interest (ROI) to test for decoupling, i.e., stable visual attention alongside answer flips.
  • Mediation Analysis: Decompose total paraphrase effect into direct (text-only) vs. indirect (via fusion) components.
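
A minimal patching sketch in PyTorch, assuming a HuggingFace-style VLM whose forward accepts a dict of inputs and returns logits. The dotted module path is an illustrative placeholder, and this patches a whole module's output; per-head patching would additionally slice the head dimension before the output projection.

```python
import torch

def get_module(model, path: str):
    """Resolve a dotted module path such as 'language_model.layers.12.cross_attn'."""
    mod = model
    for part in path.split("."):
        mod = mod[int(part)] if part.isdigit() else getattr(mod, part)
    return mod

@torch.no_grad()
def patch_and_run(model, inputs_a, inputs_b, module_path: str):
    """Run paraphrase A with `module_path`'s output transplanted from paraphrase B.

    Assumes the pair is padded to the same sequence length so shapes match.
    """
    cache = {}
    target = get_module(model, module_path)

    def save_hook(_mod, _inp, out):
        # Cache the activation from the source (paraphrase B) run.
        cache["act"] = out[0] if isinstance(out, tuple) else out

    def patch_hook(_mod, _inp, out):
        # Splice B's activation into the patched (paraphrase A) run.
        if isinstance(out, tuple):
            return (cache["act"],) + out[1:]
        return cache["act"]

    # 1) Source run: record the activation to transplant.
    handle = target.register_forward_hook(save_hook)
    model(**inputs_b)
    handle.remove()

    # 2) Patched run: same forward on A, but with B's activation in place.
    handle = target.register_forward_hook(patch_hook)
    logits = model(**inputs_a).logits
    handle.remove()
    return logits
```

Sweeping `module_path` over fusion layers/heads and comparing patched vs. unpatched logits yields the per-component attributions that the atlas under Outputs is built from.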

Early expectations (to validate)

  • Divergence hotspots in cross-attention (layers 12–16 for MedGemma; 8–12 for LLaVA-Rad).
  • Negation tokens carry outsized causal importance; lexical swaps have smaller but non-trivial effects.
  • Vision encoder features remain highly similar across paraphrases (e.g., >0.9 cosine similarity; see the sketch below).
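
A sketch of the similarity check. In LLaVA-style architectures the vision tower sees only the image, so its raw features are identical across paraphrases by construction; the informative comparison is at the visual-token positions inside the fused language model, where text can leak in through self-attention. The `visual_slice` argument (which positions hold image tokens) and the HF-style `output_hidden_states` interface are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def visual_similarity(model, inputs_a, inputs_b, layer: int, visual_slice: slice):
    """Mean per-token cosine similarity of visual-token states across a paraphrase pair."""
    out_a = model(**inputs_a, output_hidden_states=True)
    out_b = model(**inputs_b, output_hidden_states=True)
    va = out_a.hidden_states[layer][0, visual_slice]  # (n_img_tokens, d)
    vb = out_b.hidden_states[layer][0, visual_slice]
    # ~1.0 means the visual stream is effectively unchanged by the paraphrase.
    return F.cosine_similarity(va, vb, dim=-1).mean().item()
```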

Outputs

  • Layer/head attribution atlas for paraphrase sensitivity.
  • Quantified Natural Direct/Indirect Effects (NDE/NIE) per linguistic phenomenon (definitions below).
  • Faithfulness validation of explanations: interventional attributions compared against saliency maps.
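
For reference, the standard natural-effect decomposition (Pearl) behind the NDE/NIE estimates, with X the paraphrase (x → x'), M the mediator (fusion-layer states), and Y the outcome (answer logit):

```latex
\begin{align*}
  \mathrm{TE}  &= \mathbb{E}[Y(x', M(x'))] - \mathbb{E}[Y(x, M(x))] \\
  \mathrm{NDE} &= \mathbb{E}[Y(x', M(x))]  - \mathbb{E}[Y(x, M(x))] \\
  \mathrm{NIE} &= \mathbb{E}[Y(x', M(x'))] - \mathbb{E}[Y(x', M(x))]
\end{align*}
% Identity: TE = NDE + NIE, with NIE here the total indirect effect.
```

Operationally, Y(x, M(x')) corresponds to a patched run: text input x with fusion states transplanted from the x' run, exactly the intervention implemented in the patching sketch above.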

Decisions / TODOs

  • Standardize the divergence metric (e.g., answer-flip indicator, logit delta on the answer token); see the sketch after this list.
  • Batch-efficient patching implementation; caching strategy for fusion states.
  • Controls: identity/random patching baselines; complementary head patching.
  • Public toolkit release with examples and unit tests.
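
A starting point for the metric-standardization TODO, sketching both candidates; function names and the single-answer-position assumption (logits are a vector over the vocabulary at the answer slot) are illustrative.

```python
import torch

def answer_flip(logits_a: torch.Tensor, logits_b: torch.Tensor) -> bool:
    """True if the argmax answer token differs between paraphrases A and B."""
    return logits_a.argmax(-1).item() != logits_b.argmax(-1).item()

def logit_delta(logits_a: torch.Tensor, logits_b: torch.Tensor, answer_id: int) -> float:
    """Signed change in the log-softmax score of the reference answer token.

    Smoother than the flip indicator, so usable for per-head attribution.
    """
    la = torch.log_softmax(logits_a, dim=-1)[..., answer_id]
    lb = torch.log_softmax(logits_b, dim=-1)[..., answer_id]
    return (la - lb).item()
```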