---
title: Failure Geometry Demo
emoji: 🧩
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure structure discovery on CARB reasoning
---

# Failure Geometry Demo

A self-contained research demo for failure-structure analysis on compositional reasoning. No API key required; the whole pipeline runs on scikit-learn.

CARB dataset β†’ weak baselines β†’ failure extraction β†’ TF-IDF + SVD embeddings β†’ KMeans β†’ MI comparison
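
A minimal sketch of the embedding-and-clustering stages, using the scikit-learn pieces named above; the function name and hyperparameters here are illustrative, not the demo's exact values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

def cluster_failures(failure_texts, n_components=2, n_clusters=4, seed=0):
    """TF-IDF -> truncated SVD -> KMeans over the pooled failure texts."""
    tfidf = TfidfVectorizer().fit_transform(failure_texts)       # sparse term matrix
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    coords = svd.fit_transform(tfidf)                            # dense 2-D embedding for the scatter
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(coords)                              # cluster id per failure, used for MI
    return coords, labels
```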

## What this demonstrates

Two deliberately weak baselines expose different failure geometries on the same dataset (sketched in code after the table):

| Baseline            | Failure pattern                                                       |
| ------------------- | --------------------------------------------------------------------- |
| `always_1`          | Systematic bias: fails on every false-labeled item                    |
| `keyword_heuristic` | Negation-sensitive: fails on affirmative-false and negated-true items |
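
Hedged sketches of the two baselines, assuming binary labels where 1 means the conclusion holds; the negation marker list is illustrative, not the demo's exact heuristic:

```python
NEGATION_MARKERS = ("not", "never", "no ")  # illustrative; the demo's real list may differ

def always_1(example):
    # Systematic bias: predicts true for everything,
    # so every false-labeled item becomes a failure.
    return 1

def keyword_heuristic(example):
    # Negation-sensitive: predicts false whenever a negation marker appears,
    # so it fails on affirmative-false and negated-true items.
    text = example["x"].lower()
    return 0 if any(m in text for m in NEGATION_MARKERS) else 1
```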

Pooling failures from both lets the demo ask:

> Are failure clusters organised by reasoning category, by which baseline failed, or both?

Mutual information over cluster assignments answers this. The accuracy-by-type chart shows per-baseline slice performance. The 2-D SVD scatter shows whether clusters are visually separable.
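
One plausible way to run the MI comparison with scikit-learn; the demo's exact normalization is not stated here, so the chance-adjusted variant below is an assumption:

```python
from sklearn.metrics import adjusted_mutual_info_score

def mi_report(cluster_ids, reasoning_types, baseline_ids):
    """How much do the clusters know about each candidate structure?
    Adjusted MI corrects for chance agreement, which matters at n=50."""
    return {
        "AMI(cluster, reasoning_type)": adjusted_mutual_info_score(reasoning_types, cluster_ids),
        "AMI(cluster, failing_baseline)": adjusted_mutual_info_score(baseline_ids, cluster_ids),
    }
```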

## Dataset

50 controlled reasoning examples across four types:

- Transitivity: chained relational premises (All A → B; All B → C; therefore All A → C)
- Negation: rule-state-conclusion triples where the state negates the conclusion
- Syllogism: classical universal / particular forms with valid and invalid conclusions
- Distractor logic: valid causal rules with irrelevant sentences injected as distractors

The dataset schema matches the `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`.
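
For concreteness, one hypothetical record in that schema (the values and the binary label convention are invented for illustration):

```python
example = {
    "x": ("All cats are mammals. All mammals are animals. "
          "Therefore all cats are animals."),
    "y": 1,                            # assumed binary label: 1 = valid, 0 = invalid
    "reasoning_type": "transitivity",  # one of the four types listed above
}
```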

## What this does not claim

- The baselines are deliberately weak to surface failures; they are not representative of production models.
- MI scores on 50 items are indicative, not statistically conclusive.
- This is a scaffold for the pipeline, not a benchmark result.

## Honest scope

- Verified here: pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
- Described but not verified here: statistical significance of MI gaps; generalization beyond this seed set.