---
title: CARB Failure Observability
emoji: 🔬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure analysis for LM reasoning via HF Inference API
---

# CARB Failure Observability

Research pipeline for structured failure analysis in language model reasoning tasks.

```text
CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
```

The central question: *do failure clusters align with reasoning categories (transitivity, negation, syllogism, distractor logic) more than with model identity?*
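
One way to make this question concrete is to compare two mutual information (MI) scores: cluster versus reasoning type, and cluster versus model. The sketch below is a minimal, standard-library illustration of the plug-in MI estimate in nats. It is not taken from `app.py`; the function name `mutual_information` and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information, in nats, between two label sequences."""
    if len(xs) != len(ys):
        raise ValueError("label sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Hypothetical toy labels for four failures (not data from the Space).
clusters = [0, 0, 1, 1]
reasoning_types = ["negation", "negation", "syllogism", "syllogism"]
models = ["model_a", "model_b", "model_a", "model_b"]

mi_type = mutual_information(clusters, reasoning_types)
mi_model = mutual_information(clusters, models)
print(f"MI(cluster; type)  = {mi_type:.3f}")
print(f"MI(cluster; model) = {mi_model:.3f}")
```

On these toy labels the clusters track reasoning type exactly (MI = ln 2, about 0.693 nats) and are independent of model (MI = 0). Real runs would be expected to land somewhere between these extremes.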

## What this Space does

1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: compositional, negation, syllogism, distractor logic).
2. Sends each prompt to one or more HF Inference API models.
3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
5. Clusters embeddings with KMeans (`k` is user-selectable).
6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
7. Displays the MI comparison as a bar plot alongside a failure summary table.
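
Step 3 above can be sketched as follows. This is an illustration under stated assumptions, not the Space's parser: it assumes answers begin with yes/no/true/false, and the helper names (`parse_binary`, `isolate_failures`) and record fields (`model`, `type`, `output`, `label`) are hypothetical.

```python
import re


def parse_binary(output):
    """Return True/False for a yes/no style answer, or None if unparsable."""
    match = re.match(r"\W*(yes|no|true|false)\b", output.strip().lower())
    if match is None:
        return None
    return match.group(1) in ("yes", "true")


def isolate_failures(records):
    """Keep records whose prediction is wrong or cannot be parsed."""
    failures = []
    for record in records:
        prediction = parse_binary(record["output"])
        if prediction is None or prediction != record["label"]:
            failures.append({**record, "prediction": prediction})
    return failures


# Hypothetical records; the field names are illustrative only.
records = [
    {"model": "model_a", "type": "negation", "output": "Yes.", "label": True},
    {"model": "model_a", "type": "syllogism", "output": "no", "label": True},
    {"model": "model_b", "type": "negation", "output": "It depends", "label": False},
]
failures = isolate_failures(records)
print([(f["type"], f["prediction"]) for f in failures])
```

Treating unparsable outputs as failures, rather than dropping them, keeps them available for the embedding and clustering steps.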

## What this Space does not claim

- Benchmark results, leaderboard rankings, or SOTA comparisons.
- That the MI gap proves a general theory of failure structure; it is a signal on this dataset and these models.
- Production readiness; this is a research scaffold intended to be inspectable, not deployed.

## Running

Set `HF_TOKEN` in **Space secrets** before clicking **Run Experiment**.

Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.
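
Space secrets are exposed to the running app as environment variables, so a missing token can be caught before any request is sent. The sketch below shows one way to do that. `require_hf_token` and `DEFAULT_MODELS` are hypothetical names used for illustration, not identifiers confirmed from `app.py`.

```python
import os

# Default models named in this README.
DEFAULT_MODELS = ["google/flan-t5-small", "google/flan-t5-base"]


def require_hf_token(env=None):
    """Return the HF_TOKEN value or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it under Space secrets.")
    return token
```

Calling `require_hf_token()` at the start of a run would fail fast with a readable message instead of surfacing an authentication error partway through the model queries.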

## Related work

- **[obversarystudios.org](https://obversarystudios.org)**: research engineering narrative.
- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html): framing for this experiment.
- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html): what to do with clusters once found.
- [Evaluation systems](https://obversarystudios.org/docs/evaluation_systems.html): how this fits the broader eval lane.
- **[failure-geometry-demo](https://huggingface.co/spaces/obversarystudios/failure-geometry-demo)**: always-runnable sibling Space (sklearn baseline, no API key needed).
- **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)**: agent-threat benchmark and observability (manual responses; optional geometry/MI).

## Honest scope

Evidence posture follows the lab template at
[github.com/architectfromthefuture](https://github.com/architectfromthefuture):

- **Verified here:** pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
- **Described but not verified here:** generalization beyond this seed dataset; statistical significance of any MI gap.