---
title: CARB Failure Observability
emoji: 🔬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure analysis for LM reasoning via HF Inference API
---

# CARB Failure Observability

Research pipeline for structured failure analysis in language model reasoning tasks.

CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information

The central question: do failure clusters align with reasoning categories (compositional, negation, syllogism, distractor logic) more than with model identity?
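The comparison behind that question can be made concrete with scikit-learn's `mutual_info_score`. The labels below are fabricated purely for illustration; they show the pattern the pipeline looks for, where cluster membership tracks reasoning type but not model identity:

```python
from sklearn.metrics import mutual_info_score

# Toy failure records: cluster assignment, reasoning type, and source model.
clusters = [0, 0, 1, 1, 2, 2]
reason = ["negation", "negation", "syllogism", "syllogism", "distractor", "distractor"]
model = ["flan-t5-small", "flan-t5-base"] * 3

# Clusters perfectly track reasoning type here, so MI equals the entropy of
# the reasoning labels; each cluster contains both models, so that MI is zero.
mi_reasoning = mutual_info_score(clusters, reason)
mi_model = mutual_info_score(clusters, model)
print(mi_reasoning > mi_model)  # → True
```

In the real pipeline the cluster labels come from KMeans over failure embeddings rather than being hand-assigned, and neither MI will be this clean.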

## What this Space does

  1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: compositional, negation, syllogism, distractor logic).
  2. Sends each prompt to one or more HF Inference API models.
  3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
  4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
  5. Clusters embeddings with KMeans (k is user-selectable).
  6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
  7. Displays the MI comparison as a bar plot alongside a failure summary table.
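Steps 3–6 can be sketched as follows. The record fields and helper names are illustrative, not the app's actual API, and the MiniLM embedding step (step 4) is taken as an input here rather than computed:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def parse_binary(raw):
    """Step 3a: map raw model text to True/False, or None if unparsable."""
    text = raw.strip().lower()
    if text.startswith(("yes", "true")):
        return True
    if text.startswith(("no", "false")):
        return False
    return None  # unparsable outputs count as failures

def extract_failures(records):
    """Step 3b: keep records whose prediction is wrong or unparsable."""
    return [r for r in records if parse_binary(r["raw_output"]) != r["gold"]]

def cluster_and_score(embeddings, failures, k=3, seed=0):
    """Steps 5-6: KMeans over failure embeddings, then MI against both labelings.

    `embeddings` stands in for the MiniLM vectors of step 4, e.g. from
    SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(texts).
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
    mi_reason = mutual_info_score(labels, [f["reasoning_type"] for f in failures])
    mi_model = mutual_info_score(labels, [f["model"] for f in failures])
    return labels, mi_reason, mi_model
```

Keeping unparsable outputs as failures (rather than dropping them) matters: refusal-style or malformed answers often form their own cluster.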

## What this Space does not claim

  - Benchmark results, leaderboard rankings, or SOTA comparisons.
  - That the MI gap proves a general theory of failure structure: it is a signal on this dataset and these models.
  - Production readiness; this is a research scaffold intended to be inspectable, not deployed.

## Running

Set `HF_TOKEN` in Space secrets before clicking **Run Experiment**.

Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.
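A minimal sketch of the per-prompt call via `huggingface_hub.InferenceClient`, assuming `HF_TOKEN` is available in the environment (the prompt wording and `max_new_tokens` value are illustrative, not the app's exact parameters):

```python
import os
from huggingface_hub import InferenceClient

# In the Space, HF_TOKEN is injected from Space secrets; .get() keeps this
# snippet importable without one (calls would then run unauthenticated).
client = InferenceClient(token=os.environ.get("HF_TOKEN"))

def ask(prompt, model="google/flan-t5-small"):
    """One routed inference call; the returned text is later parsed as yes/no."""
    return client.text_generation(prompt, model=model, max_new_tokens=5)
```

Each prompt is sent once per model, so the default two-model run makes 100 API calls for the 50-example dataset.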

## Honest scope

Evidence posture follows the lab template at github.com/architectfromthefuture:

  - Verified here: pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
  - Described but not verified here: generalization beyond this seed dataset; statistical significance of any MI gap.