Paraphrase Robustness and Metrics (MedPhr‑Rad)

Taxonomy of Paraphrase Variation

  • Synonymy and lexical choice
  • Negation and polarity flips
  • Hedging and uncertainty markers
  • Temporality (prior vs current; onset)
  • Quantifiers and numeric ranges
  • Units and measurement formats
  • Reading level and clinician style

Core Metrics

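  • Paraphrase consistency rate: fraction of answers within a paraphrase group that match the group's modal answer
  • Flip rate: share of items for which at least one paraphrase changes the decision relative to the original question
  • Robust accuracy: accuracy aggregated over each item's paraphrase set; state the aggregation used (e.g., mean over paraphrases or worst case)
  • ECE (Expected Calibration Error): bin‑weighted average gap between predicted confidence and observed accuracy; measures base calibration quality
  • Selective risk at coverage c%: error rate on the answered cases when the abstention policy answers c% of cases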
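
The metrics above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the equal‑width binning for ECE, the two aggregation modes for robust accuracy, and the confidence‑ranked abstention policy are all assumptions made for the example. Each group is assumed to be a list of answers to one item, with the answer to the original question included.

```python
from collections import Counter

def consistency_rate(answers):
    """Fraction of a group's answers equal to the group's modal answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)

def flip_rate(groups):
    """Share of items whose answers are not identical across the group.
    Equivalent to "some paraphrase differs from the original" when the
    original question's answer is part of the group."""
    return sum(len(set(g)) > 1 for g in groups) / len(groups)

def robust_accuracy(groups, gold, mode="worst"):
    """Accuracy over paraphrase sets (aggregation mode is an assumption).
    mode="worst": an item counts only if every paraphrase is correct.
    mode="mean":  average per-item fraction of correct paraphrases."""
    scores = []
    for g, y in zip(groups, gold):
        hits = [a == y for a in g]
        scores.append(float(all(hits)) if mode == "worst" else sum(hits) / len(hits))
    return sum(scores) / len(scores)

def ece(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    total = 0.0
    for b in bins:
        if b:
            mean_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            total += len(b) / n * abs(acc - mean_conf)
    return total

def selective_risk(confidences, correct, coverage):
    """Error rate on the most confident `coverage` fraction of cases (0 < coverage <= 1)."""
    ranked = sorted(zip(confidences, correct), key=lambda t: -t[0])
    k = max(1, round(coverage * len(ranked)))
    kept = ranked[:k]
    return 1 - sum(ok for _, ok in kept) / k
```

The per‑group consistency rate can be averaged over items to obtain a dataset‑level figure, and the worst‑case mode of robust accuracy is never higher than the mean mode on the same groups.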

Risk Score and Triage Signal

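  • Risk score: dispersion of answers across paraphrases (vote entropy or variance)
  • Combine dispersion with model confidence to decide which cases are safe to automate
  • Use conformal risk control to bound the error rate on auto‑accept cases; the guarantee assumes calibration and test cases are exchangeable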
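
A rough sketch of how such a triage signal might be assembled follows. The blend in `triage_score` (including its `weight` parameter) is a hypothetical choice, not a prescribed formula. The threshold rule is written in the spirit of conformal risk control for a 0/1 loss: it picks the largest calibration score t such that (errors + 1) / (n + 1) ≤ alpha, where errors counts calibration cases that are both auto‑accepted (score ≤ t) and wrong. Note that this bounds the expected share of all cases that are accepted and wrong, which is not the same as the error rate among accepted cases only, and it relies on exchangeability.

```python
import math
from collections import Counter

def vote_entropy(answers):
    """Shannon entropy of the answer votes in one paraphrase group,
    normalized by log(group size): 0 = unanimous, 1 = all answers distinct."""
    n = len(answers)
    counts = Counter(answers)
    h = sum((c / n) * math.log(n / c) for c in counts.values())
    return h / math.log(n) if n > 1 else 0.0

def triage_score(answers, confidence, weight=0.5):
    """Hypothetical risk score in [0, 1]: blend of dispersion and (1 - confidence)."""
    return weight * vote_entropy(answers) + (1 - weight) * (1 - confidence)

def calibrate_threshold(scores, wrong, alpha):
    """Largest calibration score t such that auto-accepting `score <= t`
    satisfies (errors + 1) / (n + 1) <= alpha on the calibration set.
    Returns None when no threshold can be certified (auto-accept nothing)."""
    n = len(scores)
    pairs = sorted(zip(scores, wrong))
    best, errors, i = None, 0, 0
    while i < n:
        t = pairs[i][0]
        # Absorb every calibration case tied at score t before checking the bound.
        while i < n and pairs[i][0] == t:
            errors += int(pairs[i][1])
            i += 1
        if (errors + 1) / (n + 1) <= alpha:
            best = t
        else:
            break  # errors only grow with t, so larger thresholds also fail
    return best
```

At deployment, a case would be auto‑accepted when its triage score is at or below the calibrated threshold and routed to a clinician otherwise.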

Dataset Scope

  • Radiology VQA (VQA‑RAD, PMC‑VQA, SLAKE)
  • Extensions with standardized paraphrase sets per item
  • Paraphrase validation: clinician spot checks, NLI checks, and concept‑equivalence filters

Implementation Notes

  • Group paraphrases per item; report per‑group metrics
  • Compare systems with paired tests over the same paraphrase groups; report bootstrap CIs
  • Report fairness slices: metrics broken down by subgroup and by sentinel finding
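
One way to follow these notes is sketched below, assuming each paraphrase group has already been reduced to a single per‑group value (for example its consistency rate). Resampling whole groups rather than individual paraphrases is an assumption of this sketch, chosen because paraphrases of the same item are not independent. All function names and defaults (`n_boot`, `level`, `seed`) are illustrative.

```python
import random
from collections import defaultdict

def group_bootstrap_ci(per_group_values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of a per-group metric,
    resampling whole paraphrase groups with replacement."""
    rng = random.Random(seed)
    n = len(per_group_values)
    means = sorted(
        sum(rng.choices(per_group_values, k=n)) / n for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi

def paired_bootstrap_diff(values_a, values_b, **kwargs):
    """CI for mean(a - b), pairing two systems on the same paraphrase groups.
    An interval that excludes 0 suggests a difference between the systems."""
    diffs = [a - b for a, b in zip(values_a, values_b)]
    return group_bootstrap_ci(diffs, **kwargs)

def per_slice_mean(per_group_values, slice_labels):
    """Mean of a per-group metric within each slice (e.g., a subgroup label)."""
    buckets = defaultdict(list)
    for v, s in zip(per_group_values, slice_labels):
        buckets[s].append(v)
    return {s: sum(vs) / len(vs) for s, vs in buckets.items()}
```

Small slices produce wide intervals, so it helps to report the number of groups alongside each slice.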

See also: Selective Conformal Triage, MedGemma, LLaVA‑Rad