---
title: Failure Geometry Demo
emoji: 🧩
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure structure discovery on CARB reasoning
---

# Failure Geometry Demo

A self-contained research demo for failure-structure analysis on compositional reasoning. No API key required; the whole pipeline runs on scikit-learn.

CARB dataset β†’ weak baselines β†’ failure extraction β†’ TF-IDF + SVD embeddings β†’ KMeans β†’ MI comparison
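
A minimal sketch of the embedding-and-clustering stages, using the scikit-learn pieces named above; the function name and hyperparameters here are illustrative, not the demo's exact values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

def cluster_failures(failure_texts, n_components=2, n_clusters=4, seed=0):
    """TF-IDF -> truncated SVD -> KMeans over the pooled failure texts."""
    tfidf = TfidfVectorizer().fit_transform(failure_texts)       # sparse term matrix
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    coords = svd.fit_transform(tfidf)                            # dense 2-D embedding for the scatter
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(coords)                              # cluster id per failure, used for MI
    return coords, labels
```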

## What this demonstrates

Two deliberately weak baselines expose different failure geometries on the same dataset (sketched in code after the table):

| Baseline            | Failure pattern                                                       |
| ------------------- | --------------------------------------------------------------------- |
| `always_1`          | Systematic bias: fails on every false-labeled item                    |
| `keyword_heuristic` | Negation-sensitive: fails on affirmative-false and negated-true items |
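
Hedged sketches of the two baselines, assuming binary labels where 1 means the conclusion holds; the negation marker list is illustrative, not the demo's exact heuristic:

```python
NEGATION_MARKERS = ("not", "never", "no ")  # illustrative; the demo's real list may differ

def always_1(example):
    # Systematic bias: predicts true for everything,
    # so every false-labeled item becomes a failure.
    return 1

def keyword_heuristic(example):
    # Negation-sensitive: predicts false whenever a negation marker appears,
    # so it fails on affirmative-false and negated-true items.
    text = example["x"].lower()
    return 0 if any(m in text for m in NEGATION_MARKERS) else 1
```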

Pooling failures from both lets the demo ask:

> Are failure clusters organised by reasoning category, by which baseline failed, or both?

Mutual information over cluster assignments answers this. The accuracy-by-type chart shows per-baseline slice performance. The 2-D SVD scatter shows whether clusters are visually separable.
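
One plausible way to run the MI comparison with scikit-learn; the demo's exact normalization is not stated here, so the chance-adjusted variant below is an assumption:

```python
from sklearn.metrics import adjusted_mutual_info_score

def mi_report(cluster_ids, reasoning_types, baseline_ids):
    """How much do the clusters know about each candidate structure?
    Adjusted MI corrects for chance agreement, which matters at n=50."""
    return {
        "AMI(cluster, reasoning_type)": adjusted_mutual_info_score(reasoning_types, cluster_ids),
        "AMI(cluster, failing_baseline)": adjusted_mutual_info_score(baseline_ids, cluster_ids),
    }
```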

## Dataset

50 controlled reasoning examples across four types:

- Transitivity: chained relational premises (All A → B; All B → C; therefore All A → C)
- Negation: rule-state-conclusion triples where the state negates the conclusion
- Syllogism: classical universal / particular forms with valid and invalid conclusions
- Distractor logic: valid causal rules with irrelevant sentences injected as distractors

The dataset schema matches the `failure-induced-benchmarks` CARB conventions: `x`, `y`, `reasoning_type`.
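
For concreteness, one hypothetical record in that schema (the values and the binary label convention are invented for illustration):

```python
example = {
    "x": ("All cats are mammals. All mammals are animals. "
          "Therefore all cats are animals."),
    "y": 1,                            # assumed binary label: 1 = valid, 0 = invalid
    "reasoning_type": "transitivity",  # one of the four types listed above
}
```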

## What this does not claim

- The baselines are deliberately weak to surface failures; they are not representative of production models.
- MI scores on 50 items are indicative, not statistically conclusive.
- This is a scaffold for the pipeline, not a benchmark result.

## Honest scope

- Verified here: pipeline runs end-to-end on the seed dataset; MI scores are computed correctly.
- Described but not verified here: statistical significance of MI gaps; generalization beyond this seed set.