	---
	title: CARB Failure Observability
	emoji: 🔬
	colorFrom: indigo
	colorTo: blue
	sdk: gradio
	sdk_version: 5.50.0
	app_file: app.py
	pinned: false
	license: mit
	short_description: Failure analysis for LM reasoning via HF Inference API
	---

	# CARB Failure Observability

	Research pipeline for structured failure analysis in language model reasoning tasks.

	```text
	CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
	```

	The central question: do failure clusters align with reasoning categories (transitivity, negation, syllogism, distractor logic) more than with model identity?
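To make the comparison concrete, here is a minimal sketch of mutual information computed by hand from two label sequences. The function name `mutual_information` and the toy labels are illustrative assumptions, not code or results from this Space; the natural logarithm is used, so values are in nats.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in nats between two equal-length label sequences."""
    if len(xs) != len(ys):
        raise ValueError("label sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # co-occurrence counts
    px = Counter(xs)              # marginal counts for X
    py = Counter(ys)              # marginal counts for Y
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


# Toy, made-up labels (hypothetical; not output from this Space).
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
reasoning_types = [
    "transitivity", "transitivity", "negation", "negation",
    "syllogism", "syllogism", "distractor", "distractor",
]
models = ["model_a", "model_b"] * 4

mi_type = mutual_information(clusters, reasoning_types)
mi_model = mutual_information(clusters, models)
print(f"MI(cluster; reasoning type) = {mi_type:.3f} nats")
print(f"MI(cluster; model)          = {mi_model:.3f} nats")
```

In this toy case the clusters line up exactly with reasoning type and are independent of model, so the first score equals `log(4)` and the second is zero. Real runs would be expected to fall somewhere between these extremes.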

	## What this Space does

	1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: compositional, negation, syllogism, distractor logic).
	2. Sends each prompt to one or more HF Inference API models.
	3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
	4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
	5. Clusters embeddings with KMeans (`k` is user-selectable).
	6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
	7. Displays the MI comparison as a bar plot alongside a failure summary table.
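Steps 4 to 6 can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `app.py`: the helper names `encode_labels` and `cluster_and_score` are hypothetical, scikit-learn's `mutual_info_score` is assumed as the MI estimator, and synthetic 384-dimensional vectors stand in for real MiniLM embeddings so the example runs without downloading a model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score


def encode_labels(labels):
    """Map arbitrary labels (e.g. strings) to integer codes."""
    return np.unique(np.asarray(labels), return_inverse=True)[1]


def cluster_and_score(embeddings, reasoning_types, model_ids, k=4, seed=0):
    """Cluster failure embeddings, then compare MI against two label sets."""
    # n_init and random_state are choices made for this sketch.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    cluster_ids = km.fit_predict(embeddings)
    return {
        "cluster_ids": cluster_ids,
        "mi_reasoning_type": mutual_info_score(encode_labels(reasoning_types), cluster_ids),
        "mi_model": mutual_info_score(encode_labels(model_ids), cluster_ids),
    }


# In a real run the embeddings would come from something like:
#   from sentence_transformers import SentenceTransformer
#   encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
#   embeddings = encoder.encode(failure_texts)  # shape (n_failures, 384)
# Which text is embedded (prompt, model output, or both) is an assumption here.

# Synthetic stand-in: 24 fake failures, 6 per reasoning type, tightly grouped by type.
rng = np.random.default_rng(0)
types = ["compositional", "negation", "syllogism", "distractor"]
centers = rng.normal(size=(4, 384)) * 5.0
reasoning_types = [t for t in types for _ in range(6)]
model_ids = ["google/flan-t5-small", "google/flan-t5-base"] * 12
embeddings = np.vstack(
    [centers[i // 6] + rng.normal(scale=0.1, size=384) for i in range(24)]
)

result = cluster_and_score(embeddings, reasoning_types, model_ids, k=4)
print(f"MI(cluster; reasoning type) = {result['mi_reasoning_type']:.3f}")
print(f"MI(cluster; model)          = {result['mi_model']:.3f}")
```

The synthetic data is constructed so that clusters follow reasoning type by design, which means the gap it prints says nothing about real model failures. It only shows the shape of the computation.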

	## What this Space does not claim

	- Benchmark results, leaderboard rankings, or SOTA comparisons.
	- That the MI gap proves a general theory of failure structure. It is a signal observed on this dataset and these models only.
	- Production readiness; this is a research scaffold intended to be inspectable, not deployed.

	## Running

	Set `HF_TOKEN` in Space secrets before clicking Run Experiment.

	Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.
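As a rough sketch of what a single query might look like outside the UI, the snippet below reads `HF_TOKEN`, calls a model through `huggingface_hub.InferenceClient`, and marks incorrect or unparsable answers as failures. The helper names, the yes/no answer format, and the use of `text_generation` for these models are assumptions; the Space's own `app.py` may do this differently, and whether a given model is currently served by the Inference API is not guaranteed.

```python
import os
import re

DEFAULT_MODELS = ["google/flan-t5-small", "google/flan-t5-base"]


def get_token(env=None):
    """Read HF_TOKEN, failing early with a clear message if it is missing."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the Space secrets.")
    return token


def parse_binary(text):
    """Map raw model output to True/False, or None when it cannot be parsed.

    Assumes yes/no or true/false answers; the real label format may differ.
    """
    match = re.search(r"\b(yes|no|true|false)\b", text.lower())
    if match is None:
        return None
    return match.group(1) in ("yes", "true")


def is_failure(prediction, gold):
    """A failure is an incorrect or unparsable output."""
    return prediction is None or prediction != gold


def query_model(model_id, prompt, token, max_new_tokens=8):
    """Send one prompt to one model and return the generated text."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import InferenceClient

    client = InferenceClient(model=model_id, token=token)
    return client.text_generation(prompt, max_new_tokens=max_new_tokens)


# Example usage (requires network access and a valid token):
#   token = get_token()
#   for model_id in DEFAULT_MODELS:
#       raw = query_model(model_id, "If A > B and B > C, is A > C? Answer yes or no.", token)
#       print(model_id, raw, is_failure(parse_binary(raw), gold=True))
```

Checking the token before any request means a missing secret surfaces as one clear error instead of a failed API call per prompt.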

	## Related work

	- [obversarystudios.org](https://obversarystudios.org) — research engineering narrative.
	- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html) — framing for this experiment.
	- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html) — what to do with clusters once found.
	- [Evaluation systems](https://obversarystudios.org/docs/evaluation_systems.html) — how this fits the broader eval lane.
	- [failure-geometry-demo](https://huggingface.co/spaces/obversarystudios/failure-geometry-demo) — always-runnable sibling Space (sklearn baseline, no API key needed).
	- [agent-threat-map](https://huggingface.co/spaces/obversarystudios/agent-threat-map) — agent-threat benchmark and observability (manual responses; optional geometry/MI).

	## Honest scope

	Evidence posture follows the lab template at
	[github.com/architectfromthefuture](https://github.com/architectfromthefuture):

	- Verified here: pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
	- Described but not verified here: generalization beyond this seed dataset; statistical significance of any MI gap.