Thrust 1: Measuring Phrasing Brittleness (PSF) and Misleading Explanations (MEE)
Goal: Build a rigorous, repeatable way to measure when models flip answers under paraphrases while their visual focus stays the same, and when explanation metrics score better for wrong answers than for correct ones.
Why this matters
- Phrasing-Sensitive Failure (PSF): Answers change under clinically equivalent rewordings while attention maps stay stable (false reassurance risk).
- Misleading Explanation Effect (MEE): Deletion/Insertion AUC is often higher for wrong predictions than for correct ones (faithfulness ≠ correctness).
- Clinical risk: A stable-looking saliency map paired with a confident answer can still be wrong, and a small phrasing change can flip the decision.
Dataset: VSF Med (Vulnerability Scoring Framework — Medical)
- Scope: Chest X-ray VQA from MIMIC-CXR; integrates Chest ImaGenome ROIs for 1,500 images.
- Categories (target):
  - Binary findings (~800)
  - Localization (~400)
  - Severity (~300)
  - Temporal/Comparative (~300)
  - Modality/View (~200)
- Paraphrases: 8–10 expert-validated variants per question (LLM generation + multi-stage clinical filtering).
- Phenomena tags: lexical, syntactic, negation, scope/quantification, specificity, combined (see the example record after this list).
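A minimal sketch of what one dataset record might look like; every field name and value below is an illustrative assumption, not the final annotation schema (which is still a TODO):

```python
# Illustrative VSF Med record. All field names are assumptions,
# pending the finalized annotation schema (see Decisions / TODOs).
example_record = {
    "image_id": "mimic-cxr-0001",            # hypothetical MIMIC-CXR study ID
    "roi": "left lower lung zone",           # Chest ImaGenome region of interest
    "category": "binary_finding",            # one of the five target categories
    "question": "Is there a pleural effusion?",
    "paraphrases": [
        "Do you see any pleural effusion?",            # lexical variant
        "Is the patient free of pleural effusion?",    # negation/scope variant
    ],
    "phenomena": ["lexical", "negation"],    # one tag per paraphrase variant
    "answer": "yes",
}
```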
Metrics (notes)
- Flip Rate: percentage of paraphrase pairs whose answers disagree (all four metrics are sketched after this list).
- Attention Stability Index (ASI): mean pairwise cosine similarity of attention maps across paraphrases; high ASI + answer flip → PSF.
- Faithfulness Gap (EFG): mean Deletion/Insertion AUC on wrong predictions minus correct ones; a positive gap signals MEE.
- Answer-Attention Coupling (AAC): correlation between answer flips and the magnitude of attention change across paraphrase pairs.
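A minimal sketch of the PSF-side metrics (Flip Rate, ASI), assuming answers are strings and attention maps arrive as arrays of shape (n_paraphrases, H, W); function names and the ASI threshold are placeholders, not settled design choices:

```python
import numpy as np

def flip_rate(answer_sets: list[list[str]]) -> float:
    """Fraction of paraphrase pairs whose answers disagree.
    answer_sets[i] holds the model's answers to all paraphrases of question i."""
    flips = pairs = 0
    for variants in answer_sets:
        for a in range(len(variants)):
            for b in range(a + 1, len(variants)):
                pairs += 1
                flips += variants[a] != variants[b]
    return flips / pairs if pairs else 0.0

def attention_stability_index(maps: np.ndarray) -> float:
    """Mean pairwise cosine similarity of flattened attention maps
    for one question's paraphrase set (shape: n_paraphrases x H x W)."""
    flat = maps.reshape(len(maps), -1)
    flat = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    upper = np.triu_indices(len(maps), k=1)
    return float((flat @ flat.T)[upper].mean())

def is_psf(variants: list[str], maps: np.ndarray, asi_threshold: float = 0.9) -> bool:
    """PSF flag: the answer flips while attention stays stable."""
    return len(set(variants)) > 1 and attention_stability_index(maps) >= asi_threshold
```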
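A companion sketch for the MEE-side metrics (EFG, AAC); the deletion-curve confidences are assumed precomputed per example, and the point-biserial formulation for AAC is one option, not a settled decision:

```python
import numpy as np

def deletion_auc(confidences: np.ndarray) -> float:
    """Area under the deletion curve: confidence in the original answer as the
    most-salient pixels are progressively masked. The mean over a uniform grid
    of mask fractions approximates the normalized AUC."""
    return float(confidences.mean())

def faithfulness_gap(auc_wrong: np.ndarray, auc_right: np.ndarray) -> float:
    """EFG: mean explanation AUC on wrong predictions minus correct ones.
    A positive gap means explanations look better when the model is wrong."""
    return float(auc_wrong.mean() - auc_right.mean())

def answer_attention_coupling(flipped: np.ndarray, attn_delta: np.ndarray) -> float:
    """AAC: correlation between a 0/1 answer-flip indicator and the magnitude
    of attention change, computed over paraphrase pairs."""
    return float(np.corrcoef(flipped.astype(float), attn_delta)[0, 1])
```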
Evaluation plan
- Models: MedGemma-4b-it, LLaVA-Rad (initial); human baseline for context.
- Stressors: negation, scope shifts, syntactic reorderings, synonym swaps; report phenomenon-specific flip rates (reporting sketch after this list).
- Outputs: dataset card, evaluator scripts, per-phenomenon vulnerability profiles.
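A toy sketch of the per-phenomenon reporting, assuming one row per paraphrase pair with a phenomenon tag and a flip indicator; the column names and data are ours, for illustration only:

```python
import pandas as pd

# One row per paraphrase pair; toy data for illustration only.
pairs = pd.DataFrame({
    "phenomenon": ["negation", "negation", "lexical", "syntactic"],
    "flipped":    [True,       False,      False,     True],
})

# Phenomenon-specific flip rates, i.e. the vulnerability profile.
profile = pairs.groupby("phenomenon")["flipped"].mean().rename("flip_rate")
print(profile)  # lexical 0.0, negation 0.5, syntactic 1.0 on this toy data
```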
Expected deliverables
- VSF Med benchmark + annotation schema.
- Reproducible evaluation code (flip/ASI/EFG/AAC metrics).
- Figures: flip distributions, ASI vs. flip scatter, faithfulness inversion examples.
Decisions / TODOs
- Finalize question set and per-category counts.
- Lock paraphrase validation thresholds and QA checklist.
- Choose default attention probe(s) for ASI (Grad-CAM, attention rollout, etc.); a rollout sketch follows this list.
- Publish a minimal evaluator with cached model outputs to lower compute costs.
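For the rollout option in particular, a minimal sketch of attention rollout (Abnar & Zuidema, 2020); the input layout, a list of per-layer attentions of shape (heads, T, T), is an assumption about what the model exposes:

```python
import numpy as np

def attention_rollout(attentions: list[np.ndarray]) -> np.ndarray:
    """Compose per-layer attention maps into token-to-token influence,
    folding in residual connections as in Abnar & Zuidema (2020)."""
    n_tokens = attentions[0].shape[-1]
    rollout = np.eye(n_tokens)
    for layer in attentions:
        attn = layer.mean(axis=0)                       # average over heads
        attn = attn + np.eye(n_tokens)                  # add residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)  # re-normalize rows
        rollout = attn @ rollout                        # accumulate across layers
    return rollout  # (T, T); slice the image-patch columns for a saliency map
```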