---
title: Failure Geometry Demo
emoji: 🧩
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure structure discovery on CARB reasoning
---
# Failure Geometry Demo
Self-contained research demo for failure structure analysis on compositional reasoning.
**No API key required.** Runs entirely with scikit-learn.
```text
CARB dataset → weak baselines → failure extraction → TF-IDF + SVD embeddings → KMeans → MI comparison
```
## What this demonstrates
Two deliberately weak baselines expose different failure geometries on the same dataset:
| Baseline | Failure pattern |
|----------|----------------|
| `always_1` | Systematic bias – fails on every false-labeled item |
| `keyword_heuristic` | Negation-sensitive – fails on affirmative-false and negated-true items |
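For concreteness, the two baselines can be sketched roughly as follows. This is a minimal re-implementation for illustration: the negation keyword list, function signatures, and the `pooled_failures` helper are assumptions, not the demo's actual code.

```python
# Hypothetical sketches of the two weak baselines described above.
# Keyword list and helper names are illustrative assumptions.
NEGATIONS = ("not", "no", "never", "cannot")

def always_1(x: str) -> int:
    """Predict True for every input: fails on every false-labeled item."""
    return 1

def keyword_heuristic(x: str) -> int:
    """Predict False whenever a negation word appears, else True."""
    tokens = x.lower().split()
    return 0 if any(n in tokens for n in NEGATIONS) else 1

def pooled_failures(dataset, baselines):
    """Pool failures from all baselines, tagging each with its source."""
    failures = []
    for name, fn in baselines.items():
        for ex in dataset:
            if fn(ex["x"]) != ex["y"]:
                failures.append({**ex, "baseline": name})
    return failures
```

Each pooled failure keeps its `reasoning_type` label plus a `baseline` tag, which is what makes the MI comparison below possible.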
Pooling failures from both lets the demo ask:
> **Are failure clusters organised by reasoning category, by which baseline failed, or both?**
Mutual information over cluster assignments answers this. The accuracy-by-type chart shows
per-baseline slice performance. The 2-D SVD scatter shows whether clusters are visually separable.
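The embed-cluster-compare step can be sketched end-to-end with scikit-learn. The failure texts and labels below are invented stand-ins, not the CARB seed set; component and cluster counts are illustrative choices.

```python
# Minimal sketch of the pipeline: TF-IDF -> truncated SVD -> KMeans -> MI.
# Data is synthetic; only the pipeline shape mirrors the demo.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

texts = [
    "All cats are mammals; all mammals breathe; so all cats breathe",
    "The light is not green, therefore the car does not move",
    "Some birds fly; penguins are birds; therefore penguins fly",
    "If it rains the grass is wet; the sky is blue; it rained",
] * 5  # repeat so KMeans has enough points

types = ["transitivity", "negation", "syllogism", "distractor"] * 5
baselines = ["always_1", "keyword_heuristic"] * 10

# TF-IDF text features, compressed with truncated SVD.
X = TfidfVectorizer().fit_transform(texts)
Z = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

# Cluster the failure embeddings.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)

# Compare MI(clusters; reasoning_type) against MI(clusters; baseline).
mi_type = mutual_info_score(types, clusters)
mi_base = mutual_info_score(baselines, clusters)
print(f"MI vs reasoning type: {mi_type:.3f}, MI vs baseline: {mi_base:.3f}")
```

Whichever label variable carries more mutual information with the cluster assignments is the stronger organising axis of the failure geometry.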
## Dataset
50 controlled reasoning examples across four types:
- **Transitivity** – chained relational premises (`All A → B; All B → C; therefore All A → C`)
- **Negation** – rule-state-conclusion triples where the state negates the conclusion
- **Syllogism** – classical universal / particular forms with valid and invalid conclusions
- **Distractor logic** – valid causal rules with irrelevant sentences injected as distractors
Dataset schema matches `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`.
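A record in that schema might look like the following. The item text and the meaning of the binary label are illustrative assumptions; only the three field names follow the conventions above.

```python
# Illustrative CARB-style record; the values are invented, only the
# schema (x, y, reasoning_type) follows the conventions described above.
example = {
    "x": "All robins are birds. All birds have feathers. "
         "Therefore all robins have feathers.",
    "y": 1,  # assumed: 1 = valid conclusion, 0 = invalid
    "reasoning_type": "transitivity",
}
```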
## What this does not claim
- The baselines are **deliberately weak** to surface failures; they are not representative of production models.
- MI scores on 50 items are **indicative, not statistically conclusive**.
- This is a scaffold for the pipeline, not a benchmark result.
## Related
- **[obversarystudios.org](https://obversarystudios.org)** – research engineering narrative.
- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html) – research framing.
- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html) – what clusters imply for system changes.
- **[carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space)** – live HF Inference API version (requires `HF_TOKEN`).
- **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)** – agent fragility benchmark (metrics + embedding/cluster/MI on scored probes).
- **[github.com/architectfromthefuture/failure-induced-benchmarks](https://github.com/architectfromthefuture/failure-induced-benchmarks)** – CARB generator and failure geometry library.
## Honest scope
- **Verified here:** pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
- **Described but not verified here:** statistical significance of MI gaps; generalization beyond this seed set.