	---
	title: Failure Geometry Demo
	emoji: 🧩
	colorFrom: purple
	colorTo: indigo
	sdk: gradio
	sdk_version: 5.50.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: Failure structure discovery on CARB reasoning
	---

	# Failure Geometry Demo

	Self-contained research demo for failure structure analysis on compositional reasoning.
	No API key required. Runs entirely with scikit-learn.

	```text
	CARB dataset → weak baselines → failure extraction → TF-IDF + SVD embeddings → KMeans → MI comparison
	```
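
The embedding and clustering stages of the pipeline above can be sketched with scikit-learn alone. This is a minimal illustration, not the code in `app.py`: the function name `embed_and_cluster`, the default parameters, and the example texts are all assumptions made for this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def embed_and_cluster(texts, n_components=2, n_clusters=2, seed=0):
    """Return (low-dimensional embeddings, cluster ids) for failure texts.

    Hypothetical helper: the names and defaults are assumptions.
    """
    # Sparse TF-IDF matrix, one row per failed item.
    tfidf = TfidfVectorizer().fit_transform(texts)
    # Truncated SVD reduces the sparse matrix to a dense low-dimensional one.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    embeddings = svd.fit_transform(tfidf)
    # KMeans assigns each embedded failure to one cluster.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = kmeans.fit_predict(embeddings)
    return embeddings, cluster_ids


# Illustrative failure texts, not taken from the seed dataset.
failed_texts = [
    "All cats are mammals. All mammals are fish. Therefore all cats are fish.",
    "All oaks are trees. All trees are rocks. Therefore all oaks are rocks.",
    "No reptiles are mammals. All snakes are reptiles. Therefore no snakes are mammals.",
    "No squares are circles. All tiles are squares. Therefore no tiles are circles.",
    "If the switch is on, the light is on. The switch is not on. Therefore the light is on.",
    "If the valve is open, water flows. The valve is not open. Therefore water flows.",
]

embeddings, cluster_ids = embed_and_cluster(failed_texts)
print(embeddings.shape)  # one row per text, n_components columns
print(cluster_ids)
```

With `n_components=2` the same embeddings can be used directly for a 2-D scatter plot.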

	## What this demonstrates

	Two deliberately weak baselines expose different failure geometries on the same dataset:

	| Baseline | Failure pattern |
	|----------|-----------------|
	| `always_1` | Systematic bias: fails on every false-labeled item |
	| `keyword_heuristic` | Negation-sensitive: fails on affirmative-false and negated-true items |
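
A rough sketch of how two such baselines could produce the failure patterns in the table. The negation word list, the `extract_failures` helper, the `baseline` field, and the example items are assumptions for illustration; the demo's actual heuristic may differ.

```python
import re

# Hypothetical keyword list; the demo's real list is not documented here.
NEGATION_WORDS = {"not", "no", "never"}


def always_1(text):
    """Predict 'true' for every item, so every false-labeled item is a failure."""
    return 1


def keyword_heuristic(text):
    """Predict 0 when a negation word appears, otherwise 1."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return 0 if any(tok in NEGATION_WORDS for tok in tokens) else 1


def extract_failures(items, name, predict):
    """Keep the items the baseline gets wrong, tagged with the baseline name."""
    return [{**item, "baseline": name} for item in items if predict(item["x"]) != item["y"]]


# Illustrative items using the x / y / reasoning_type fields.
items = [
    {"x": "All A are B. All B are C. Therefore all A are C.", "y": 1, "reasoning_type": "transitivity"},
    {"x": "All A are B. All B are C. Therefore all C are A.", "y": 0, "reasoning_type": "transitivity"},
    {"x": "No reptiles are mammals. All snakes are reptiles. Therefore no snakes are mammals.", "y": 1, "reasoning_type": "syllogism"},
    {"x": "If the switch is on, the light is on. The switch is not on. Therefore the light is on.", "y": 0, "reasoning_type": "negation"},
]

pooled = (
    extract_failures(items, "always_1", always_1)
    + extract_failures(items, "keyword_heuristic", keyword_heuristic)
)
for failure in pooled:
    print(failure["baseline"], failure["y"], failure["reasoning_type"])
```

In this toy set `always_1` fails only on items with `y == 0`, while `keyword_heuristic` fails on one affirmative-false item and one negated-true item, matching the two rows of the table.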

	Pooling failures from both lets the demo ask:

	> Are failure clusters organised by reasoning category, by which baseline failed, or both?

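	Mutual information (MI) between the cluster assignments and each candidate label (reasoning type,
	failing baseline) answers this: the label with the higher MI explains more of the cluster structure.
	The accuracy-by-type chart shows each baseline's accuracy per reasoning type. The 2-D SVD scatter
	shows whether clusters are visually separable.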
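
A minimal sketch of the MI comparison, assuming scikit-learn's `mutual_info_score` as the estimator (the MI function the demo actually uses is not stated here). The cluster ids and labels below are toy values chosen so that clusters line up with the failing baseline and not with the reasoning type.

```python
from sklearn.metrics import mutual_info_score

# Toy values for six pooled failures; not output from the demo.
cluster_ids = [0, 0, 1, 1, 0, 1]
baseline = ["always_1", "always_1", "keyword_heuristic",
            "keyword_heuristic", "always_1", "keyword_heuristic"]
reasoning_type = ["transitivity", "negation", "transitivity",
                  "negation", "syllogism", "syllogism"]

# MI (in nats) between cluster assignment and each candidate label.
mi_baseline = mutual_info_score(baseline, cluster_ids)
mi_type = mutual_info_score(reasoning_type, cluster_ids)

print(f"MI(cluster, baseline)       = {mi_baseline:.3f}")
print(f"MI(cluster, reasoning_type) = {mi_type:.3f}")
```

Here the clusters coincide exactly with the baseline label, so that MI equals ln 2 (about 0.693 nats), while every reasoning type is split evenly across both clusters, giving an MI of 0. Real failure sets will usually fall somewhere between these extremes.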

	## Dataset

	50 controlled reasoning examples across four types:

	- Transitivity — chained relational premises (`All A → B; All B → C; therefore All A → C`)
	- Negation — rule-state-conclusion triples where the state negates the conclusion
	- Syllogism — classical universal / particular forms with valid and invalid conclusions
	- Distractor logic — valid causal rules with irrelevant sentences injected as distractors

	Dataset schema matches `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`.
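
For illustration, records in this schema might look like the following. The field names `x`, `y`, and `reasoning_type` come from the schema above; the binary 0/1 encoding of `y`, the exact `reasoning_type` strings, the `validate_item` helper, and the example sentences are assumptions.

```python
from collections import Counter

REQUIRED_FIELDS = ("x", "y", "reasoning_type")


def validate_item(item):
    """Check that a record has the three schema fields and a binary label."""
    missing = [key for key in REQUIRED_FIELDS if key not in item]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if item["y"] not in (0, 1):  # assumption: labels are encoded as 0 / 1
        raise ValueError(f"label must be 0 or 1, got {item['y']!r}")
    return item


# Hypothetical examples, one per reasoning type.
examples = [
    {"x": "All A are B. All B are C. Therefore all A are C.",
     "y": 1, "reasoning_type": "transitivity"},
    {"x": "If the alarm is armed, the door is locked. The alarm is not armed. "
          "Therefore the door is locked.",
     "y": 0, "reasoning_type": "negation"},
    {"x": "All poets are writers. Some writers are editors. Therefore all poets are editors.",
     "y": 0, "reasoning_type": "syllogism"},
    {"x": "If it rains, the street gets wet. The bakery opens at nine. It is raining. "
          "Therefore the street gets wet.",
     "y": 1, "reasoning_type": "distractor_logic"},
]

validated = [validate_item(item) for item in examples]
type_counts = Counter(item["reasoning_type"] for item in validated)
print(type_counts)
```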

	## What this does not claim

	- The baselines are deliberately weak to surface failures; they are not representative of production models.
	- MI scores on 50 items are indicative, not statistically conclusive.
	- This is a scaffold for the pipeline, not a benchmark result.

	## Related

	- [obversarystudios.org](https://obversarystudios.org) — research engineering narrative.
	- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html) — research framing.
	- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html) — what clusters imply for system changes.
	- [carb-observability-space](https://huggingface.co/spaces/obversarystudios/carb-observability-space) — live HF Inference API version (requires `HF_TOKEN`).
	- [agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map) — agent fragility benchmark (metrics + embedding/cluster/MI on scored probes).
	- [github.com/architectfromthefuture/failure-induced-benchmarks](https://github.com/architectfromthefuture/failure-induced-benchmarks) — CARB generator and failure geometry library.

	## Honest scope

	- Verified here: pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
	- Described but not verified here: statistical significance of MI gaps; generalization beyond this seed set.
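
One way the unverified significance question could be approached is a label-permutation test. The sketch below is an assumption about how such a check might look, not something the demo implements. It tests a single MI score against a shuffled-label null and does not test the gap between two MI scores directly. The helper name `mi_permutation_pvalue` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def mi_permutation_pvalue(cluster_ids, labels, n_permutations=200, seed=0):
    """Return (observed MI, permutation p-value) for one label set.

    Hypothetical helper: shuffling the labels breaks any association with
    the clusters, which gives a null distribution for the MI score.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    observed = mutual_info_score(labels, cluster_ids)
    null_scores = [
        mutual_info_score(rng.permutation(labels), cluster_ids)
        for _ in range(n_permutations)
    ]
    # Small tolerance guards against floating-point ties.
    n_extreme = sum(score >= observed - 1e-12 for score in null_scores)
    # Add-one correction keeps the p-value strictly positive.
    return observed, (n_extreme + 1) / (n_permutations + 1)


# Toy data: 20 failures whose clusters match the baseline label exactly.
toy_clusters = [0] * 10 + [1] * 10
toy_baseline = ["always_1"] * 10 + ["keyword_heuristic"] * 10

observed_mi, p_value = mi_permutation_pvalue(toy_clusters, toy_baseline)
print(f"observed MI = {observed_mi:.3f}, p = {p_value:.4f}")
```

With only 50 items, such p-values would still be coarse, so they would complement rather than replace the caveats above.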