---
title: CARB Failure Observability
emoji: 🔬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure analysis for LM reasoning via HF Inference API
---

# CARB Failure Observability

Research pipeline for structured failure analysis in language model reasoning tasks.

```text
CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
```

The central question: *do failure clusters align with reasoning categories (transitivity, negation, syllogism, distractor logic) more than with model identity?*

## What this Space does

1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: compositional, negation, syllogism, distractor logic).
2. Sends each prompt to one or more HF Inference API models.
3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
5. Clusters embeddings with KMeans (`k` is user-selectable).
6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
7. Displays the MI comparison as a bar plot alongside a failure summary table.

## What this Space does not claim

- Benchmark results, leaderboard rankings, or SOTA comparisons.
- That the MI gap proves a general theory of failure structure — it is a signal on this dataset and these models.
- Production readiness; this is a research scaffold intended to be inspectable, not deployed.

## Running

Set `HF_TOKEN` in **Space secrets** before clicking **Run Experiment**. Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.

## Related work

- **[obversarystudios.org](https://obversarystudios.org)** — research engineering narrative.
- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html) — framing for this experiment.
- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html) — what to do with clusters once found.
- [Evaluation systems](https://obversarystudios.org/docs/evaluation_systems.html) — how this fits the broader eval lane.
- **[failure-geometry-demo](https://huggingface.co/spaces/obversarystudios/failure-geometry-demo)** — always-runnable sibling Space (sklearn baseline, no API key needed).
- **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)** — agent-threat benchmark and observability (manual responses; optional geometry/MI).

## Honest scope

Evidence posture follows the lab template at [github.com/architectfromthefuture](https://github.com/architectfromthefuture):

- **Verified here:** pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
- **Described but not verified here:** generalization beyond this seed dataset; statistical significance of any MI gap.
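## Appendix: MI comparison sketch

The clustering-and-MI step of the pipeline can be sketched with scikit-learn. This is a minimal, self-contained illustration, not the Space's actual code: the random vectors below stand in for real MiniLM embeddings of failure texts, and the label arrays stand in for the reasoning-type and model metadata attached to each failure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Placeholder for MiniLM failure embeddings (all-MiniLM-L6-v2 yields 384-dim vectors).
embeddings = rng.normal(size=(40, 384))

# Illustrative metadata: one reasoning-type label and one model label per failure.
reasoning_type = rng.choice(
    ["compositional", "negation", "syllogism", "distractor"], size=40
)
model_id = rng.choice(
    ["google/flan-t5-small", "google/flan-t5-base"], size=40
)

# Cluster the failure embeddings (k is user-selectable in the Space; 4 here).
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)

# Compare how much the clusters tell us about each labelling.
mi_reasoning = mutual_info_score(reasoning_type, clusters)
mi_model = mutual_info_score(model_id, clusters)
print(f"MI(cluster, reasoning type) = {mi_reasoning:.3f}")
print(f"MI(cluster, model)          = {mi_model:.3f}")
```

With real data, the question is simply which of the two MI values is larger; on the random placeholders above, both values are near zero, as expected.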