---
title: CARB Failure Observability
emoji: 🔬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure analysis for LM reasoning via HF Inference API
---

# CARB Failure Observability

Research pipeline for structured failure analysis in language model reasoning tasks.

```text
CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
```

The central question: *do failure clusters align with reasoning categories (transitivity, negation, syllogism, distractor logic) more than with model identity?*
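
One way to make this question concrete is to compare two mutual information (MI) scores over the same cluster assignments. The sketch below is a minimal, standard-library illustration, not the Space's own code: the helper `mutual_information` and the toy labels (`model_a`, `model_b`, and the eight cluster ids) are hypothetical, chosen so that clusters track reasoning type exactly and are independent of model identity.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length label sequences."""
    if len(xs) != len(ys):
        raise ValueError("label sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy failures: cluster ids, reasoning types, and model names.
clusters = [0, 0, 1, 1, 0, 0, 1, 1]
reasoning = ["negation", "negation", "syllogism", "syllogism"] * 2
models = ["model_a"] * 4 + ["model_b"] * 4

mi_reasoning = mutual_information(clusters, reasoning)
mi_model = mutual_information(clusters, models)
print(f"MI(cluster; reasoning type) = {mi_reasoning:.3f} nats")  # 0.693
print(f"MI(cluster; model identity) = {mi_model:.3f} nats")      # 0.000
```

In this constructed case the gap is as large as it can be for two balanced classes (ln 2 versus 0). Real failure data will sit somewhere in between, and a gap on a small sample is a signal rather than proof.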

## What this Space does

1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: compositional, negation, syllogism, distractor logic).
2. Sends each prompt to one or more HF Inference API models.
3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
5. Clusters embeddings with KMeans (`k` is user-selectable).
6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
7. Displays the MI comparison as a bar plot alongside a failure summary table.
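
Step 3 can be sketched as follows. This is an illustration under stated assumptions, not the Space's parser: the answer vocabulary (`yes`/`true`, `no`/`false`), the record fields (`prompt`, `output`, `label`), and the helpers `parse_binary` and `isolate_failures` are all hypothetical.

```python
import re

# Assumed answer vocabulary; the Space's actual label strings may differ.
POSITIVE = {"yes", "true"}
NEGATIVE = {"no", "false"}


def parse_binary(output):
    """Return True/False for the first recognizable answer token, else None."""
    for token in re.findall(r"[a-z]+", output.lower()):
        if token in POSITIVE:
            return True
        if token in NEGATIVE:
            return False
    return None


def isolate_failures(records):
    """Keep records whose prediction is wrong or could not be parsed."""
    failures = []
    for rec in records:
        pred = parse_binary(rec["output"])
        if pred is None or pred != rec["label"]:
            failures.append({**rec, "prediction": pred})
    return failures


# Hypothetical model outputs with gold labels.
records = [
    {"prompt": "p1", "output": "Yes.", "label": True},
    {"prompt": "p2", "output": "no", "label": True},
    {"prompt": "p3", "output": "I am not sure", "label": False},
]
failures = isolate_failures(records)
print([f["prompt"] for f in failures])  # ['p2', 'p3']
```

Treating unparsable outputs as failures, rather than dropping them, keeps them available for the embedding and clustering steps.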

## What this Space does not claim

- Benchmark results, leaderboard rankings, or SOTA comparisons.
- That the MI gap proves a general theory of failure structure; it is a signal on this dataset and these models.
- Production readiness; this is a research scaffold intended to be inspectable, not deployed.

## Running

Set `HF_TOKEN` in **Space secrets** before clicking **Run Experiment**.

Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.
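
Space secrets are exposed to the running app as environment variables, so a missing token can be caught before any request is sent. The helper below is a hypothetical sketch (`read_hf_token` and `DEFAULT_MODELS` are illustrative names, not taken from `app.py`):

```python
import os

# Default models named in this README.
DEFAULT_MODELS = ["google/flan-t5-small", "google/flan-t5-base"]


def read_hf_token(env=None):
    """Return the HF token from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it under Space secrets.")
    return token


# Usage with an explicit mapping, so the example runs without a real secret.
print(read_hf_token({"HF_TOKEN": " hf_example "}))  # hf_example
```

Failing early with an explicit message is easier to diagnose than an authorization error returned halfway through a run.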

## Related work

- **[obversarystudios.org](https://obversarystudios.org)**: research engineering narrative.
- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html): framing for this experiment.
- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html): what to do with clusters once found.
- [Evaluation systems](https://obversarystudios.org/docs/evaluation_systems.html): how this fits the broader eval lane.
- **[failure-geometry-demo](https://huggingface.co/spaces/obversarystudios/failure-geometry-demo)**: always-runnable sibling Space (sklearn baseline, no API key needed).
- **[agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map)**: agent-threat benchmark and observability (manual responses; optional geometry/MI).

## Honest scope

Evidence posture follows the lab template at
[github.com/architectfromthefuture](https://github.com/architectfromthefuture):

- **Verified here:** pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
- **Described but not verified here:** generalization beyond this seed dataset; statistical significance of any MI gap.