---
title: CARB Failure Observability
emoji: 🔬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.50.0
app_file: app.py
pinned: false
license: mit
short_description: Failure analysis for LM reasoning via HF Inference API
---

# CARB Failure Observability

Research pipeline for structured failure analysis in language model reasoning tasks.

CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information

The central question: do failure clusters align with reasoning categories (compositional, negation, syllogism, distractor logic) more than with model identity?
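The comparison behind that question can be made concrete with scikit-learn's `mutual_info_score`. The labels below are fabricated purely for illustration; they show the pattern the pipeline looks for, where cluster membership tracks reasoning type but not model identity:

```python
from sklearn.metrics import mutual_info_score

# Toy failure records: cluster assignment, reasoning type, and source model.
clusters = [0, 0, 1, 1, 2, 2]
reason = ["negation", "negation", "syllogism", "syllogism", "distractor", "distractor"]
model = ["flan-t5-small", "flan-t5-base"] * 3

# Clusters perfectly track reasoning type here, so MI equals the entropy of
# the reasoning labels; each cluster contains both models, so that MI is zero.
mi_reasoning = mutual_info_score(clusters, reason)
mi_model = mutual_info_score(clusters, model)
print(mi_reasoning > mi_model)  # → True
```

In the real pipeline the cluster labels come from KMeans over failure embeddings rather than being hand-assigned, and neither MI will be this clean.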

## What this Space does

  1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: compositional, negation, syllogism, distractor logic).
  2. Sends each prompt to one or more HF Inference API models.
  3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
  4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
  5. Clusters embeddings with KMeans (k is user-selectable).
  6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
  7. Displays the MI comparison as a bar plot alongside a failure summary table.
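Steps 3–6 can be sketched as follows. The record fields and helper names are illustrative, not the app's actual API, and the MiniLM embedding step (step 4) is taken as an input here rather than computed:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

def parse_binary(raw):
    """Step 3a: map raw model text to True/False, or None if unparsable."""
    text = raw.strip().lower()
    if text.startswith(("yes", "true")):
        return True
    if text.startswith(("no", "false")):
        return False
    return None  # unparsable outputs count as failures

def extract_failures(records):
    """Step 3b: keep records whose prediction is wrong or unparsable."""
    return [r for r in records if parse_binary(r["raw_output"]) != r["gold"]]

def cluster_and_score(embeddings, failures, k=3, seed=0):
    """Steps 5-6: KMeans over failure embeddings, then MI against both labelings.

    `embeddings` stands in for the MiniLM vectors of step 4, e.g. from
    SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(texts).
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(embeddings)
    mi_reason = mutual_info_score(labels, [f["reasoning_type"] for f in failures])
    mi_model = mutual_info_score(labels, [f["model"] for f in failures])
    return labels, mi_reason, mi_model
```

Keeping unparsable outputs as failures (rather than dropping them) matters: refusal-style or malformed answers often form their own cluster.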

## What this Space does not claim

  - Benchmark results, leaderboard rankings, or SOTA comparisons.
  - That the MI gap proves a general theory of failure structure: it is a signal on this dataset and these models.
  - Production readiness; this is a research scaffold intended to be inspectable, not deployed.

## Running

Set `HF_TOKEN` in Space secrets before clicking **Run Experiment**.

Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.
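A minimal sketch of the per-prompt call via `huggingface_hub.InferenceClient`, assuming `HF_TOKEN` is available in the environment (the prompt wording and `max_new_tokens` value are illustrative, not the app's exact parameters):

```python
import os
from huggingface_hub import InferenceClient

# In the Space, HF_TOKEN is injected from Space secrets; .get() keeps this
# snippet importable without one (calls would then run unauthenticated).
client = InferenceClient(token=os.environ.get("HF_TOKEN"))

def ask(prompt, model="google/flan-t5-small"):
    """One routed inference call; the returned text is later parsed as yes/no."""
    return client.text_generation(prompt, model=model, max_new_tokens=5)
```

Each prompt is sent once per model, so the default two-model run makes 100 API calls for the 50-example dataset.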

## Honest scope

Evidence posture follows the lab template at github.com/architectfromthefuture:

  - Verified here: pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
  - Described but not verified here: generalization beyond this seed dataset; statistical significance of any MI gap.