metadata
title: Failure Geometry Demo
emoji: 🧩
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure structure discovery on CARB reasoning
Failure Geometry Demo
Self-contained research demo for failure structure analysis on compositional reasoning. No API key required. Runs entirely with scikit-learn.
CARB dataset → weak baselines → failure extraction → TF-IDF + SVD embeddings → KMeans → MI comparison
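
The embedding and clustering stages of this pipeline can be sketched with scikit-learn as follows. This is a minimal illustration, not the demo's actual code: the example texts, the `embed_and_cluster` helper, and the parameter values (2 SVD components, 2 clusters, seed 0) are all assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical failed items; in the demo these would come from the baselines.
failure_texts = [
    "All cats are mammals. All mammals are animals. Therefore all cats are animals.",
    "All birds are animals. All animals are living. Therefore all birds are living.",
    "If it rains the ground is wet. It is not raining. Therefore the ground is wet.",
    "If the switch is on the lamp is lit. The switch is not on. Therefore the lamp is lit.",
    "Some dogs are pets. All pets are fed. Therefore some dogs are fed.",
    "No fish are mammals. Some pets are fish. Therefore some pets are not mammals.",
]


def embed_and_cluster(texts, n_components=2, n_clusters=2, seed=0):
    """TF-IDF -> truncated SVD -> KMeans. Returns (coords, cluster_labels)."""
    tfidf = TfidfVectorizer().fit_transform(texts)  # sparse term weights
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    coords = svd.fit_transform(tfidf)  # dense low-dimensional embedding
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(coords)
    return coords, labels


coords, labels = embed_and_cluster(failure_texts)
```

With `n_components=2`, the same `coords` array can double as the input for a 2-D scatter plot.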
What this demonstrates
Two deliberately weak baselines expose different failure geometries on the same dataset:
| Baseline | Failure pattern |
|---|---|
| `always_1` | Systematic bias: fails on every false-labeled item |
| `keyword_heuristic` | Negation-sensitive: fails on affirmative-false and negated-true items |
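
A rough sketch of how two such baselines and the failure-extraction step might look is given below. The negation keyword list, the 0/1 label encoding, the `extract_failures` helper, and the example items are assumptions made for illustration; the demo's real heuristic may differ.

```python
import re

# Assumed keyword list; the demo's actual heuristic is not specified here.
NEGATION_WORDS = {"not", "no", "never", "none"}


def always_1(x: str) -> int:
    """Predict 'true' for every item, so every false-labeled item fails."""
    return 1


def keyword_heuristic(x: str) -> int:
    """Predict 0 if any negation keyword appears in the text, else 1."""
    tokens = set(re.findall(r"[a-z']+", x.lower()))
    return 0 if tokens & NEGATION_WORDS else 1


def extract_failures(items, baselines):
    """Pool misclassified items, tagging each with the baseline that failed."""
    failures = []
    for item in items:
        for name, predict in baselines.items():
            if predict(item["x"]) != item["y"]:
                failures.append({**item, "baseline": name})
    return failures


# Hypothetical items in the x / y / reasoning_type schema.
items = [
    {"x": "All A are B. All B are C. Therefore all A are C.",
     "y": 1, "reasoning_type": "transitivity"},
    {"x": "All A are B. Some C are B. Therefore all C are A.",
     "y": 0, "reasoning_type": "syllogism"},
    {"x": "If the door is locked, it stays shut. The door is locked. "
          "Therefore the door does not open.",
     "y": 1, "reasoning_type": "negation"},
]

failures = extract_failures(
    items, {"always_1": always_1, "keyword_heuristic": keyword_heuristic}
)
```

In this toy set the affirmative-false syllogism fails under both baselines, while the negated-true item fails only under `keyword_heuristic`, which mirrors the two failure patterns in the table.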
Pooling failures from both lets the demo ask:
Are failure clusters organised by reasoning category, by which baseline failed, or both?
Comparing the mutual information (MI) between cluster assignments and each label (reasoning type, failing baseline) answers this. The accuracy-by-type chart shows per-baseline slice performance. The 2-D SVD scatter shows whether clusters are visually separable.
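
To make the MI comparison concrete, here is a small hand-rolled sketch in plain Python. It computes mutual information in nats, the same quantity scikit-learn's `mutual_info_score` reports. The `mutual_information` helper and the four toy failures are hypothetical and chosen so the outcome is easy to check by hand.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical pooled failures: cluster id, reasoning type, failing baseline.
clusters = [0, 0, 1, 1]
reasoning_type = ["negation", "negation", "syllogism", "syllogism"]
baseline = ["always_1", "keyword_heuristic", "always_1", "keyword_heuristic"]

mi_type = mutual_information(clusters, reasoning_type)
mi_baseline = mutual_information(clusters, baseline)
```

Here the clusters line up exactly with reasoning type (MI equals ln 2) and are independent of the failing baseline (MI equals 0), which is the "organised by reasoning category" case. Real runs would fall somewhere between these extremes.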
Dataset
50 controlled reasoning examples across four types:
- Transitivity: chained relational premises (All A → B; All B → C; therefore All A → C)
- Negation: rule-state-conclusion triples where the state negates the conclusion
- Syllogism: classical universal / particular forms with valid and invalid conclusions
- Distractor logic: valid causal rules with irrelevant sentences injected as distractors
The dataset schema matches the failure-induced-benchmarks CARB conventions: `x`, `y`, `reasoning_type`.
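
As an illustration of that schema, a record and a simple validity check might look like the sketch below. Only the three field names come from this README; the 0/1 label encoding, the spelling of the type labels, and the `validate_item` helper are assumptions.

```python
# Assumed spellings for the four reasoning types listed above.
REASONING_TYPES = {"transitivity", "negation", "syllogism", "distractor_logic"}


def validate_item(item: dict) -> bool:
    """Check that a record has a text x, a binary y, and a known reasoning_type."""
    return (
        isinstance(item.get("x"), str)
        and item.get("y") in (0, 1)  # assumed encoding: 1 = true, 0 = false
        and item.get("reasoning_type") in REASONING_TYPES
    )


# Hypothetical record, not taken from the actual dataset.
example = {
    "x": "All A are B. All B are C. Therefore all A are C.",
    "y": 1,
    "reasoning_type": "transitivity",
}

is_valid = validate_item(example)
is_invalid = validate_item({"x": "missing label", "reasoning_type": "negation"})
```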
What this does not claim
- The baselines are deliberately weak to surface failures; they are not representative of production models.
- MI scores on 50 items are indicative, not statistically conclusive.
- This is a scaffold for the pipeline, not a benchmark result.
Related
- obversarystudios.org: research engineering narrative.
- Failure discovery on binary reasoning: research framing.
- Failure clusters as interventions: what clusters imply for system changes.
- carb-observability-space: live HF Inference API version (requires HF_TOKEN).
- agent-threat-map: agent fragility benchmark (metrics + embedding/cluster/MI on scored probes).
- github.com/architectfromthefuture/failure-induced-benchmarks: CARB generator and failure geometry library.
Honest scope
- Verified here: pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
- Described but not verified here: statistical significance of MI gaps; generalization beyond this seed set.