---
title: Failure Geometry Demo
emoji: 🧩
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure structure discovery on CARB reasoning
---
# Failure Geometry Demo
Self-contained research demo for failure structure analysis on compositional reasoning.
**No API key required.** Runs entirely with scikit-learn.
```text
CARB dataset → weak baselines → failure extraction → TF-IDF + SVD embeddings → KMeans → MI comparison
```
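The embedding and clustering stages of this pipeline can be sketched with scikit-learn as below. This is a minimal sketch, not the code in `app.py`: the failure texts, the function name `embed_and_cluster`, and the parameter values (`n_components=2`, `n_clusters=2`, `seed=0`) are all assumptions chosen for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical failure texts; the demo extracts its own from baseline errors.
failure_texts = [
    "All cats are mammals. All mammals are animals. Therefore all cats are animals.",
    "All birds are animals. All animals are living. Therefore all birds are living.",
    "If it rains the ground is wet. It is not raining. Therefore the ground is not wet.",
    "If the switch is on the lamp is lit. The switch is not on. Therefore the lamp is not lit.",
    "Some dogs are pets. All pets are fed. Therefore some dogs are fed.",
    "Heat melts ice. The sky is blue. Therefore heated ice melts.",
]


def embed_and_cluster(texts, n_components=2, n_clusters=2, seed=0):
    """TF-IDF vectors -> low-rank SVD embedding -> KMeans cluster labels."""
    tfidf = TfidfVectorizer().fit_transform(texts)  # sparse term weights
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    embeddings = svd.fit_transform(tfidf)  # dense array, one row per text
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(embeddings)
    return embeddings, labels


embeddings, cluster_labels = embed_and_cluster(failure_texts)
print(embeddings.shape, cluster_labels)
```

With two SVD components the embedding can be plotted directly as a 2-D scatter; a higher rank would need a separate projection for plotting.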
## What this demonstrates
Two deliberately weak baselines expose different failure geometries on the same dataset:
| Baseline | Failure pattern |
|----------|----------------|
| `always_1` | Systematic bias: fails on every false-labeled item |
| `keyword_heuristic` | Negation-sensitive: fails on affirmative-false and negated-true items |
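A minimal sketch of how two such baselines could produce the failure patterns in the table. The negation cue list, the rule "predict 0 when a negation cue appears", the helper `extract_failures`, and the four sample items are assumptions for illustration; the demo's actual heuristic may differ.

```python
# Assumed cue list; the demo's real keyword set is not documented here.
NEGATION_WORDS = {"not", "no", "never", "cannot"}


def always_1(x: str) -> int:
    """Predict 'true' for every item, so every false-labeled item fails."""
    return 1


def keyword_heuristic(x: str) -> int:
    """Predict 0 if any negation cue appears in the text, otherwise 1."""
    tokens = x.lower().replace(".", " ").replace(",", " ").split()
    return 0 if any(t in NEGATION_WORDS for t in tokens) else 1


def extract_failures(baseline, dataset):
    """Return the items on which the baseline's prediction differs from y."""
    return [ex for ex in dataset if baseline(ex["x"]) != ex["y"]]


# Hypothetical items using the x / y / reasoning_type schema.
dataset = [
    {"x": "All A are B. All B are C. Therefore all A are C.", "y": 1, "reasoning_type": "transitivity"},
    {"x": "All A are B. All C are B. Therefore all A are C.", "y": 0, "reasoning_type": "syllogism"},
    {"x": "If P then Q. Not Q. Therefore not P.", "y": 1, "reasoning_type": "negation"},
    {"x": "If P then Q. Not P. Therefore not Q.", "y": 0, "reasoning_type": "negation"},
]

always_1_failures = extract_failures(always_1, dataset)
keyword_failures = extract_failures(keyword_heuristic, dataset)
print(len(always_1_failures), len(keyword_failures))
```

On these four items, `always_1` fails on both false-labeled items, while `keyword_heuristic` fails on one affirmative-false item and one negated-true item.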
Pooling failures from both lets the demo ask:
> **Are failure clusters organised by reasoning category, by which baseline failed, or both?**
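Mutual information (MI) between the cluster assignments and each candidate label (reasoning type, failing baseline) answers this: the label with the higher MI accounts for more of the cluster structure. The accuracy-by-type chart shows
per-baseline accuracy on each reasoning type. The 2-D SVD scatter shows whether clusters are visually separable.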
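The MI comparison can be illustrated with a small pure-Python sketch. The cluster assignments and labels below are made-up values, and `mutual_information` is a hypothetical helper written for this example; scikit-learn's `mutual_info_score` computes the same quantity, also in nats.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information in nats between two equal-length label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    # Sum over observed pairs: p(x, y) * log(p(x, y) / (p(x) * p(y)))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


# Hypothetical assignments for eight pooled failures.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
reasoning_type = ["negation"] * 4 + ["syllogism"] * 4
baseline = ["always_1", "keyword_heuristic"] * 4

mi_type = mutual_information(clusters, reasoning_type)
mi_baseline = mutual_information(clusters, baseline)
print(f"MI(cluster; type)     = {mi_type:.4f}")
print(f"MI(cluster; baseline) = {mi_baseline:.4f}")
```

In this constructed case the clusters coincide with reasoning type, so `mi_type` equals ln 2, and the baseline label is independent of the cluster, so `mi_baseline` is 0. Real runs fall somewhere in between.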
## Dataset
50 controlled reasoning examples across four types:
- **Transitivity**: chained relational premises (`All A → B; All B → C; therefore All A → C`)
- **Negation**: rule-state-conclusion triples where the state negates the conclusion
- **Syllogism**: classical universal / particular forms with valid and invalid conclusions
- **Distractor logic**: valid causal rules with irrelevant sentences injected as distractors
Dataset schema matches `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`.
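A record in this schema might look like the sketch below. Only the three field names come from the schema; the example text, the `reasoning_type` spellings, the assumption that `y` is a 0/1 label, and the `validate` helper are illustrative guesses rather than the library's API.

```python
# Field names come from the schema; the expected types are assumptions.
REQUIRED_FIELDS = {"x": str, "y": int, "reasoning_type": str}

# Assumed label spellings for the four reasoning types.
REASONING_TYPES = {"transitivity", "negation", "syllogism", "distractor_logic"}


def validate(record):
    """Return a list of schema problems; an empty list means the record passes."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    if "y" in record and record["y"] not in (0, 1):
        problems.append("y should be 0 or 1")
    if "reasoning_type" in record and record["reasoning_type"] not in REASONING_TYPES:
        problems.append("unknown reasoning_type")
    return problems


# Hypothetical record: x is the item text, y the binary label.
example = {
    "x": "All A are B. All B are C. Therefore all A are C.",
    "y": 1,
    "reasoning_type": "transitivity",
}
print(validate(example))
```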
## What this does not claim
- The baselines are **deliberately weak** to surface failures; they are not representative of production models.
- MI scores on 50 items are **indicative, not statistically conclusive**.
- This is a scaffold for the pipeline, not a benchmark result.
## Related
- **[obversarystudios.org](https://obversarystudios.org)**: research engineering narrative.
- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html): research framing.
- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html): what clusters imply for system changes.
- **[carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space)**: live HF Inference API version (requires `HF_TOKEN`).
- **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)**: agent fragility benchmark (metrics + embedding/cluster/MI on scored probes).
- **[github.com/architectfromthefuture/failure-induced-benchmarks](https://github.com/architectfromthefuture/failure-induced-benchmarks)**: CARB generator and failure geometry library.
## Honest scope
- **Verified here:** pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
- **Described but not verified here:** statistical significance of MI gaps; generalization beyond this seed set.