title: CARB Failure Observability
emoji: 🔬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure analysis for LM reasoning via HF Inference API
CARB Failure Observability
Research pipeline for structured failure analysis in language model reasoning tasks.
CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
The central question: do failure clusters align with reasoning categories (transitivity, negation, syllogism, distractor logic) more than with model identity?
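To make the question concrete: it comes down to comparing two mutual information scores computed over the same cluster assignments. The sketch below is a standard-library illustration on hypothetical toy labels, not the Space's own code. The function name `mutual_information` and the toy data are assumptions, and a library routine such as scikit-learn's `mutual_info_score` computes the same quantity.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    if len(xs) != len(ys):
        raise ValueError("label sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), expressed with raw counts
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: one entry per failure.
clusters = [0, 0, 1, 1, 2, 2]
reasoning_types = [
    "negation", "negation",
    "syllogism", "syllogism",
    "transitivity", "transitivity",
]
models = ["flan-t5-small", "flan-t5-base"] * 3

mi_type = mutual_information(clusters, reasoning_types)
mi_model = mutual_information(clusters, models)
print(f"MI(cluster; reasoning type) = {mi_type:.3f}")
print(f"MI(cluster; model)          = {mi_model:.3f}")
```

In this toy case the clusters coincide with reasoning type and are independent of model, so the first score is high and the second is zero. Real runs will fall somewhere in between.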
What this Space does
- Loads 50 controlled reasoning examples across four reasoning types (CARB-style: compositional, negation, syllogism, distractor logic).
- Sends each prompt to one or more HF Inference API models.
- Parses binary predictions and isolates failures (incorrect or unparsable outputs).
- Embeds failures with sentence-transformers/all-MiniLM-L6-v2.
- Clusters embeddings with KMeans (k is user-selectable).
- Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
- Displays the MI comparison as a bar plot alongside a failure summary table.
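The parsing and failure-isolation step in the list above can be sketched as follows. This is a minimal illustration, assuming yes/no style answers. The record fields (`prompt`, `output`, `label`, `model`, `reasoning_type`), the helper names, and the accepted answer words are all hypothetical and may differ from what app.py does.

```python
import re

POSITIVE = {"yes", "true"}
NEGATIVE = {"no", "false"}


def parse_binary(output):
    """Map raw model text to True/False, or None when no clear answer is found."""
    tokens = re.findall(r"[a-z]+", output.lower())
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # neither word present, or both: treat as unparsable
        return None
    return has_pos


def isolate_failures(records):
    """Keep only records whose prediction is incorrect or unparsable."""
    failures = []
    for record in records:
        prediction = parse_binary(record["output"])
        if prediction is None:
            failures.append({**record, "failure_kind": "unparsable"})
        elif prediction != record["label"]:
            failures.append({**record, "failure_kind": "incorrect"})
    return failures


# Hypothetical records, one per (prompt, model) pair.
records = [
    {"prompt": "If A > B and B > C, is A > C?", "output": "Yes.",
     "label": True, "model": "google/flan-t5-small", "reasoning_type": "transitivity"},
    {"prompt": "If A > B and B > C, is A > C?", "output": "no",
     "label": True, "model": "google/flan-t5-base", "reasoning_type": "transitivity"},
    {"prompt": "No cats are dogs. Is some cat a dog?", "output": "I think it depends",
     "label": False, "model": "google/flan-t5-small", "reasoning_type": "negation"},
    {"prompt": "No cats are dogs. Is some cat a dog?", "output": "False",
     "label": False, "model": "google/flan-t5-base", "reasoning_type": "negation"},
]

failures = isolate_failures(records)
for failure in failures:
    print(failure["model"], failure["reasoning_type"], failure["failure_kind"])
```

Keeping `model` and `reasoning_type` on each failure record is what allows the later steps to compare cluster assignments against both labels.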
What this Space does not claim
- Benchmark results, leaderboard rankings, or SOTA comparisons.
- That the MI gap proves a general theory of failure structure; it is a signal on this dataset and these models.
- Production readiness; this is a research scaffold intended to be inspectable, not deployed.
Running
Set HF_TOKEN in Space secrets before clicking Run Experiment.
Models queried by default: google/flan-t5-small, google/flan-t5-base.
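Space secrets are exposed to the running app as environment variables, so a missing token can be caught before any model is queried. The sketch below assumes the app reads `HF_TOKEN` from the environment. The helper name `get_hf_token` and the constant `DEFAULT_MODELS` are hypothetical and are not taken from app.py.

```python
import os

# Default models named in this README.
DEFAULT_MODELS = ["google/flan-t5-small", "google/flan-t5-base"]


def get_hf_token(env=None):
    """Return the HF token from the environment, or fail with a clear message."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it under the Space's secrets before running."
        )
    return token
```

Calling `get_hf_token()` at the start of a run turns a missing secret into one readable error instead of a failed request for each model.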
Related work
- obversarystudios.org: research engineering narrative.
- Failure discovery on binary reasoning: framing for this experiment.
- Failure clusters as interventions: what to do with clusters once found.
- Evaluation systems: how this fits the broader eval lane.
- failure-geometry-demo: always-runnable sibling Space (sklearn baseline, no API key needed).
- agent-threat-map: agent-threat benchmark and observability (manual responses; optional geometry/MI).
Honest scope
Evidence posture follows the lab template at github.com/architectfromthefuture:
- Verified here: pipeline runs end-to-end with a valid HF_TOKEN; MI scores are computed and plotted.
- Described but not verified here: generalization beyond this seed dataset; statistical significance of any MI gap.