---
title: Failure Geometry Demo
emoji: 🧩
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure structure discovery on CARB reasoning
---
# Failure Geometry Demo
Self-contained research demo for failure structure analysis on compositional reasoning.
**No API key required.** Runs entirely with scikit-learn.
```text
CARB dataset → weak baselines → failure extraction → TF-IDF + SVD embeddings → KMeans → MI comparison
```
## What this demonstrates
Two deliberately weak baselines expose different failure geometries on the same dataset:
| Baseline | Failure pattern |
|----------|----------------|
| `always_1` | Systematic bias – fails on every false-labeled item |
| `keyword_heuristic` | Negation-sensitive – fails on affirmative-false and negated-true items |
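For concreteness, the two baselines can be sketched roughly as follows. This is a minimal re-implementation for illustration: the negation keyword list, function signatures, and the `pooled_failures` helper are assumptions, not the demo's actual code.

```python
# Hypothetical sketches of the two weak baselines described above.
# Keyword list and helper names are illustrative assumptions.
NEGATIONS = ("not", "no", "never", "cannot")

def always_1(x: str) -> int:
    """Predict True for every input: fails on every false-labeled item."""
    return 1

def keyword_heuristic(x: str) -> int:
    """Predict False whenever a negation word appears, else True."""
    tokens = x.lower().split()
    return 0 if any(n in tokens for n in NEGATIONS) else 1

def pooled_failures(dataset, baselines):
    """Pool failures from all baselines, tagging each with its source."""
    failures = []
    for name, fn in baselines.items():
        for ex in dataset:
            if fn(ex["x"]) != ex["y"]:
                failures.append({**ex, "baseline": name})
    return failures
```

Each pooled failure keeps its `reasoning_type` label plus a `baseline` tag, which is what makes the MI comparison below possible.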
Pooling failures from both lets the demo ask:
> **Are failure clusters organised by reasoning category, by which baseline failed, or both?**
Mutual information over cluster assignments answers this. The accuracy-by-type chart shows
per-baseline slice performance. The 2-D SVD scatter shows whether clusters are visually separable.
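The embed-cluster-compare step can be sketched end-to-end with scikit-learn. The failure texts and labels below are invented stand-ins, not the CARB seed set; component and cluster counts are illustrative choices.

```python
# Minimal sketch of the pipeline: TF-IDF -> truncated SVD -> KMeans -> MI.
# Data is synthetic; only the pipeline shape mirrors the demo.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

texts = [
    "All cats are mammals; all mammals breathe; so all cats breathe",
    "The light is not green, therefore the car does not move",
    "Some birds fly; penguins are birds; therefore penguins fly",
    "If it rains the grass is wet; the sky is blue; it rained",
] * 5  # repeat so KMeans has enough points

types = ["transitivity", "negation", "syllogism", "distractor"] * 5
baselines = ["always_1", "keyword_heuristic"] * 10

# TF-IDF text features, compressed with truncated SVD.
X = TfidfVectorizer().fit_transform(texts)
Z = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)

# Cluster the failure embeddings.
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)

# Compare MI(clusters; reasoning_type) against MI(clusters; baseline).
mi_type = mutual_info_score(types, clusters)
mi_base = mutual_info_score(baselines, clusters)
print(f"MI vs reasoning type: {mi_type:.3f}, MI vs baseline: {mi_base:.3f}")
```

Whichever label variable carries more mutual information with the cluster assignments is the stronger organising axis of the failure geometry.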
## Dataset
50 controlled reasoning examples across four types:
- **Transitivity** – chained relational premises (`All A → B; All B → C; therefore All A → C`)
- **Negation** – rule-state-conclusion triples where the state negates the conclusion
- **Syllogism** – classical universal / particular forms with valid and invalid conclusions
- **Distractor logic** – valid causal rules with irrelevant sentences injected as distractors
Dataset schema matches `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`.
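A record in that schema might look like the following. The item text and the meaning of the binary label are illustrative assumptions; only the three field names follow the conventions above.

```python
# Illustrative CARB-style record; the values are invented, only the
# schema (x, y, reasoning_type) follows the conventions described above.
example = {
    "x": "All robins are birds. All birds have feathers. "
         "Therefore all robins have feathers.",
    "y": 1,  # assumed: 1 = valid conclusion, 0 = invalid
    "reasoning_type": "transitivity",
}
```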
## What this does not claim
- The baselines are **deliberately weak** to surface failures; they are not representative of production models.
- MI scores on 50 items are **indicative, not statistically conclusive**.
- This is a scaffold for the pipeline, not a benchmark result.
## Related
- **[obversarystudios.org](https://obversarystudios.org)** – research engineering narrative.
- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html) – research framing.
- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html) – what clusters imply for system changes.
- **[carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space)** – live HF Inference API version (requires `HF_TOKEN`).
- **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)** – agent fragility benchmark (metrics + embedding/cluster/MI on scored probes).
- **[github.com/architectfromthefuture/failure-induced-benchmarks](https://github.com/architectfromthefuture/failure-induced-benchmarks)** – CARB generator and failure geometry library.
## Honest scope
- **Verified here:** pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
- **Described but not verified here:** statistical significance of MI gaps; generalization beyond this seed set.