---
title: Failure Geometry Demo
emoji: 🧩
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure structure discovery on CARB reasoning
---

# Failure Geometry Demo

Self-contained research demo for failure structure analysis on compositional reasoning.
**No API key required.** Runs entirely with scikit-learn.

```text
CARB dataset → weak baselines → failure extraction → TF-IDF + SVD embeddings → KMeans → MI comparison
```

## What this demonstrates

Two deliberately weak baselines expose different failure geometries on the same dataset:

| Baseline | Failure pattern |
|----------|----------------|
| `always_1` | Systematic bias: fails on every false-labeled item |
| `keyword_heuristic` | Negation-sensitive: fails on affirmative-false and negated-true items |

Pooling failures from both lets the demo ask:

> **Are failure clusters organised by reasoning category, by which baseline failed, or both?**

Mutual information over cluster assignments answers this. The accuracy-by-type chart shows per-baseline slice performance. The 2-D SVD scatter shows whether clusters are visually separable.

## Dataset

50 controlled reasoning examples across four types:

- **Transitivity**: chained relational premises (`All A → B; All B → C; therefore All A → C`)
- **Negation**: rule-state-conclusion triples where the state negates the conclusion
- **Syllogism**: classical universal / particular forms with valid and invalid conclusions
- **Distractor logic**: valid causal rules with irrelevant sentences injected as distractors

Dataset schema matches `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`.

## What this does not claim

- The baselines are **deliberately weak** to surface failures; they are not representative of production models.
- MI scores on 50 items are **indicative, not statistically conclusive**.
- This is a scaffold for the pipeline, not a benchmark result.

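## Pipeline sketch (illustrative)

The pipeline outlined at the top of this README (weak baselines → failure extraction → TF-IDF + SVD → KMeans → MI comparison) can be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the code in `app.py`: the eight toy items are hypothetical and are not taken from the seed dataset, the negation word list and the rule inside `keyword_heuristic` are guesses consistent with the failure pattern described above, and `collect_failures`, `run_pipeline`, `n_clusters=3`, and clustering directly on the 2-D SVD coordinates are illustrative choices.

```python
import re

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import mutual_info_score

# Hypothetical toy items in the README's schema (x, y, reasoning_type).
# They are NOT drawn from the real 50-item seed set.
ITEMS = [
    {"x": "All cats are mammals. All mammals are animals. Therefore all cats are animals.",
     "y": 1, "reasoning_type": "transitivity"},
    {"x": "All cats are mammals. All mammals are animals. Therefore all animals are cats.",
     "y": 0, "reasoning_type": "transitivity"},
    {"x": "If it rains, the ground is wet. It is not raining. "
          "Therefore it is not certain that the ground is wet.",
     "y": 1, "reasoning_type": "negation"},
    {"x": "If the switch is on, the lamp is lit. The lamp is not lit. Therefore the switch is on.",
     "y": 0, "reasoning_type": "negation"},
    {"x": "All birds have wings. A sparrow is a bird. Therefore a sparrow has wings.",
     "y": 1, "reasoning_type": "syllogism"},
    {"x": "Some dogs are brown. Rex is a dog. Therefore Rex is brown.",
     "y": 0, "reasoning_type": "syllogism"},
    {"x": "If the alarm rings, the guard wakes. The sky is blue. The alarm rings. "
          "Therefore the guard wakes.",
     "y": 1, "reasoning_type": "distractor_logic"},
    {"x": "If the alarm rings, the guard wakes. The sky is blue. The alarm never rings. "
          "Therefore the guard wakes.",
     "y": 0, "reasoning_type": "distractor_logic"},
]

NEGATION_WORDS = {"not", "no", "never", "cannot"}  # assumed word list


def always_1(x):
    """Predict label 1 for every item, so every false-labeled item is a failure."""
    return 1


def keyword_heuristic(x):
    """Assumed rule: predict 0 if any negation word appears, otherwise 1."""
    tokens = set(re.findall(r"[a-z]+", x.lower()))
    return 0 if tokens & NEGATION_WORDS else 1


BASELINES = {"always_1": always_1, "keyword_heuristic": keyword_heuristic}


def collect_failures(items):
    """Pool every (baseline, item) pair whose prediction disagrees with y."""
    return [
        {"x": it["x"], "reasoning_type": it["reasoning_type"], "baseline": name}
        for name, predict in BASELINES.items()
        for it in items
        if predict(it["x"]) != it["y"]
    ]


def run_pipeline(items, n_clusters=3, seed=0):
    """Embed pooled failures, cluster them, and compare MI against both labelings."""
    failures = collect_failures(items)
    tfidf = TfidfVectorizer().fit_transform([f["x"] for f in failures])
    # Two SVD components serve as both the embedding and the scatter coordinates.
    coords = TruncatedSVD(n_components=2, random_state=seed).fit_transform(tfidf)
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(coords)
    return {
        "failures": failures,
        "coords": coords,
        "clusters": clusters,
        "mi_type": mutual_info_score([f["reasoning_type"] for f in failures], clusters),
        "mi_baseline": mutual_info_score([f["baseline"] for f in failures], clusters),
    }


if __name__ == "__main__":
    result = run_pipeline(ITEMS)
    print(f"pooled failures: {len(result['failures'])}")
    print(f"MI(cluster, reasoning_type) = {result['mi_type']:.3f}")
    print(f"MI(cluster, baseline)       = {result['mi_baseline']:.3f}")
```

If `mi_type` is clearly larger than `mi_baseline`, the clusters track reasoning category more than they track which baseline failed, and the reverse if `mi_baseline` dominates. On a handful of toy items the numbers are only a smoke test of the wiring, in line with the caveats above.
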
## Related

- **[obversarystudios.org](https://obversarystudios.org)**: research engineering narrative.
  - [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html): research framing.
  - [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html): what clusters imply for system changes.
- **[carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space)**: live HF Inference API version (requires `HF_TOKEN`).
- **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)**: agent fragility benchmark (metrics + embedding/cluster/MI on scored probes).
- **[github.com/architectfromthefuture/failure-induced-benchmarks](https://github.com/architectfromthefuture/failure-induced-benchmarks)**: CARB generator and failure geometry library.

## Honest scope

- **Verified here:** pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
- **Described but not verified here:** statistical significance of MI gaps; generalization beyond this seed set.