| --- |
| title: Failure Geometry Demo |
| emoji: 🧩 |
| colorFrom: purple |
| colorTo: indigo |
| sdk: gradio |
| sdk_version: 5.50.0 |
| app_file: app.py |
| pinned: false |
| license: mit |
| short_description: Failure structure discovery on CARB reasoning |
| --- |
| |
| # Failure Geometry Demo |
|
|
| Self-contained research demo for failure structure analysis on compositional reasoning. |
| **No API key required.** Runs entirely with scikit-learn. |
|
|
| ```text |
| CARB dataset → weak baselines → failure extraction → TF-IDF + SVD embeddings → KMeans → MI comparison |
| ``` |
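The embedding and clustering stages of the pipeline above can be sketched with scikit-learn as follows. This is a minimal illustration, not the code in `app.py`: the `failures` texts, the function name `embed_and_cluster`, and the choice of two components and two clusters are all assumptions made for the example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical failure texts; in the demo these would be the inputs
# that at least one baseline answered incorrectly.
failures = [
    "All cats are mammals. All mammals are animals. Therefore all cats are animals.",
    "All birds are animals. All animals are living. Therefore all birds are living.",
    "If it rains the ground is wet. It is not raining. Therefore the ground is wet.",
    "If the switch is on the lamp is lit. The switch is not on. Therefore the lamp is lit.",
    "Some dogs are pets. All pets are friendly. Therefore all dogs are friendly.",
    "No fish are birds. Some birds fly. Therefore no fish fly.",
]


def embed_and_cluster(texts, n_components=2, n_clusters=2, seed=0):
    """TF-IDF -> truncated SVD -> KMeans; returns (coordinates, cluster labels)."""
    tfidf = TfidfVectorizer().fit_transform(texts)  # sparse term weights
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    coords = svd.fit_transform(tfidf)  # dense low-dimensional embedding
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(coords)  # one cluster id per failure
    return coords, labels


coords, labels = embed_and_cluster(failures)
```

With `n_components=2`, the same `coords` array can feed a scatter plot directly, one point per failure, colored by `labels`.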
|
|
| ## What this demonstrates |
|
|
| Two deliberately weak baselines expose different failure geometries on the same dataset: |
|
|
| | Baseline | Failure pattern | |
| |----------|----------------| |
| | `always_1` | Systematic bias: fails on every item whose label is false | |
| | `keyword_heuristic` | Negation-sensitive: fails on affirmative-false and negated-true items | |
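To make the two failure patterns in the table concrete, here is a small sketch. The README does not spell out the rule behind `keyword_heuristic`, so the negation-keyword rule below, the `NEGATION_WORDS` list, and the four example items are assumptions chosen to reproduce the pattern described, with labels assumed to be `1` for true and `0` for false.

```python
# Hypothetical keyword list; the demo's actual heuristic may differ.
NEGATION_WORDS = {"not", "no", "never"}


def always_1(x: str) -> int:
    """Predict 'true' for every input."""
    return 1


def keyword_heuristic(x: str) -> int:
    """Predict 'false' when a negation keyword appears, 'true' otherwise (assumed rule)."""
    tokens = x.lower().replace(".", " ").replace(",", " ").split()
    return 0 if any(t in NEGATION_WORDS for t in tokens) else 1


# Illustrative (text, label) pairs, not taken from the CARB seed set.
examples = [
    ("All A are B. All B are C. Therefore all A are C.", 1),  # affirmative, true
    ("All A are B. Some C are B. Therefore all C are A.", 0),  # affirmative, false
    ("If P then Q. Not P. Therefore Q.", 0),                   # negated, false
    ("No A are B. All C are A. Therefore no C are B.", 1),     # negated, true
]


def failures(predict, data):
    """Return the items on which `predict` disagrees with the label."""
    return [(x, y) for x, y in data if predict(x) != y]


always_1_failures = failures(always_1, examples)
keyword_failures = failures(keyword_heuristic, examples)
```

On these four items, `always_1` misses both false-labeled examples, while the assumed keyword rule misses the affirmative-false and the negated-true example, so the two failure sets overlap only partially.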
|
|
| Pooling failures from both lets the demo ask: |
|
|
| > **Are failure clusters organised by reasoning category, by which baseline failed, or both?** |
|
|
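| Mutual information (MI) between the cluster assignments and each candidate label (reasoning type, failing baseline) answers this: |
| the label with the higher MI explains more of the cluster structure. The accuracy-by-type chart shows |
| per-baseline accuracy for each reasoning type. The 2-D SVD scatter shows whether clusters are visually separable. |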
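The MI comparison can be illustrated with a plain-Python computation over two labelings. This is a sketch on made-up toy labels, not the demo's data or code; `sklearn.metrics.mutual_info_score` computes the same quantity in nats and would be the natural choice inside a scikit-learn pipeline.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    # Sum over observed pairs: p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


# Toy labels for six pooled failures (illustrative only).
clusters = [0, 0, 1, 1, 0, 1]
reasoning_type = ["negation", "negation", "syllogism", "syllogism", "negation", "syllogism"]
baseline = ["always_1", "keyword_heuristic", "always_1",
            "keyword_heuristic", "keyword_heuristic", "always_1"]

mi_type = mutual_information(clusters, reasoning_type)
mi_baseline = mutual_information(clusters, baseline)
```

In this toy case the clusters line up exactly with reasoning type, so `mi_type` reaches its maximum for two balanced classes (ln 2), while `mi_baseline` is much smaller. Read that way, the clusters would be organised by reasoning category rather than by which baseline failed.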
|
|
| ## Dataset |
|
|
| 50 controlled reasoning examples across four types: |
|
|
| - **Transitivity**: chained relational premises (`All A → B; All B → C; therefore All A → C`) |
| - **Negation**: rule-state-conclusion triples where the state negates the conclusion |
| - **Syllogism**: classical universal / particular forms with valid and invalid conclusions |
| - **Distractor logic**: valid causal rules with irrelevant sentences injected as distractors |
|
|
| Dataset schema matches `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`. |
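A record in this schema might look like the sketch below. Only the three field names come from the README; the binary `0`/`1` encoding of `y`, the spellings of the `reasoning_type` values, and the `validate` helper are assumptions made for illustration.

```python
# Assumed label spellings; the generator may use different strings.
REASONING_TYPES = {"transitivity", "negation", "syllogism", "distractor_logic"}


def validate(record: dict) -> bool:
    """Check that a record has the three schema fields with plausible values."""
    return (
        {"x", "y", "reasoning_type"} <= set(record)
        and isinstance(record["x"], str)
        and record["y"] in (0, 1)  # assumed binary label
        and record["reasoning_type"] in REASONING_TYPES
    )


# Hypothetical example item, not taken from the seed dataset.
record = {
    "x": "All A are B. All B are C. Therefore all A are C.",
    "y": 1,
    "reasoning_type": "transitivity",
}
is_valid = validate(record)
```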
|
|
| ## What this does not claim |
|
|
| - The baselines are **deliberately weak** to surface failures; they are not representative of production models. |
| - MI scores on 50 items are **indicative, not statistically conclusive**. |
| - This is a scaffold for the pipeline, not a benchmark result. |
|
|
| ## Related |
|
|
| - **[obversarystudios.org](https://obversarystudios.org)**: research engineering narrative. |
| - [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html): research framing. |
| - [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html): what clusters imply for system changes. |
| - **[carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space)**: live HF Inference API version (requires `HF_TOKEN`). |
| - **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)**: agent fragility benchmark (metrics + embedding/cluster/MI on scored probes). |
| - **[github.com/architectfromthefuture/failure-induced-benchmarks](https://github.com/architectfromthefuture/failure-induced-benchmarks)**: CARB generator and failure geometry library. |
|
|
| ## Honest scope |
|
|
| - **Verified here:** pipeline runs end-to-end on the seed dataset; MI scores are computed correctly. |
| - **Described but not verified here:** statistical significance of MI gaps; generalization beyond this seed set. |
|
|