Brian Moran committed commit 1b435f0 (1 parent: 0fcfd1c)

Add CARB observability pipeline
README.md CHANGED
@@ -1,13 +1,60 @@
  ---
- title: Carb Observability Space
- emoji: 😻
  colorFrom: indigo
- colorTo: green
  sdk: gradio
- sdk_version: 6.13.0
  app_file: app.py
  pinned: false
  license: mit
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: CARB Failure Observability
+ emoji: 🔬
  colorFrom: indigo
+ colorTo: blue
  sdk: gradio
+ sdk_version: 4.44.1
  app_file: app.py
  pinned: false
  license: mit
+ short_description: Structured failure analysis for LM reasoning (HF API + MI)
  ---

+ # CARB Failure Observability
+
+ Research pipeline for structured failure analysis in language model reasoning tasks.
+
+ ```text
+ CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
+ ```
+
+ The central question: *do failure clusters align with reasoning categories (transitivity, negation, syllogism, distractor logic) more than with model identity?*
+
+ ## What this Space does
+
+ 1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: transitivity, negation, syllogism, distractor logic).
+ 2. Sends each prompt to one or more HF Inference API models.
+ 3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
+ 4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
+ 5. Clusters embeddings with KMeans (`k` is user-selectable).
+ 6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
+ 7. Displays the MI comparison as a bar plot alongside a failure summary table.
+
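Steps 5 and 6 can be sketched on toy data. The vectors below are random stand-ins for MiniLM embeddings, constructed so clusters should track reasoning type and ignore model identity; all names in the sketch are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Two tight blobs stand in for MiniLM vectors, one blob per reasoning type.
reasoning_types = ["negation"] * 10 + ["syllogism"] * 10
model_ids = ["flan-t5-small", "flan-t5-base"] * 10  # models alternate inside each blob
embeddings = np.vstack([
    rng.normal(0.0, 0.1, size=(10, 8)),
    rng.normal(5.0, 0.1, size=(10, 8)),
])

clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(embeddings)

mi_type = mutual_info_score(clusters, reasoning_types)
mi_model = mutual_info_score(clusters, model_ids)
```

With perfect alignment, `mi_type` equals the label entropy (ln 2 ≈ 0.693 nats), while `mi_model` is 0 because each cluster contains both models in equal proportion.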
+ ## What this Space does not claim
+
+ - Benchmark results, leaderboard rankings, or SOTA comparisons.
+ - That the MI gap proves a general theory of failure structure; it is a signal on this dataset and these models.
+ - Production readiness; this is a research scaffold intended to be inspectable, not deployed.
+
+ ## Running
+
+ Set `HF_TOKEN` in **Space secrets** before clicking **Run Experiment**.
+
+ Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.
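Each Inference API call the Space makes has the following shape; this sketch mirrors the request that `core/model.py` assembles (the helper name `build_inference_request` is illustrative) and sends nothing over the network.

```python
import os

API_BASE = "https://api-inference.huggingface.co/models"

def build_inference_request(prompt: str, model_id: str) -> tuple[str, dict, dict]:
    """Assemble the URL, headers, and JSON payload for one Inference API call."""
    token = os.environ.get("HF_TOKEN", "")  # read from Space secrets at runtime
    url = f"{API_BASE}/{model_id}"
    headers = {"Authorization": f"Bearer {token}"}
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 32, "return_full_text": False},
        "options": {"wait_for_model": True},  # block instead of erroring while the model loads
    }
    return url, headers, payload

url, headers, payload = build_inference_request("label: 0 or label: 1?", "google/flan-t5-small")
```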
+
+ ## Related work
+
+ - **[obversarystudios.org](https://obversarystudios.org)**: research engineering narrative.
+ - [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html): framing for this experiment.
+ - [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html): what to do with clusters once found.
+ - [Evaluation systems](https://obversarystudios.org/docs/evaluation_systems.html): how this fits the broader eval lane.
+ - **[failure-geometry-demo](https://huggingface.co/spaces/architectfromthefuture/failure-geometry-demo)**: always-runnable sibling Space (sklearn baseline, no API key needed).
+
+ ## Honest scope
+
+ Evidence posture follows the lab template at
+ [github.com/architectfromthefuture](https://github.com/architectfromthefuture):
+
+ - **Verified here:** the pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
+ - **Described but not verified here:** generalization beyond this seed dataset; statistical significance of any MI gap.
__pycache__/app.cpython-312.pyc ADDED
Binary file (8.78 kB).
app.py ADDED
@@ -0,0 +1,175 @@
+ from pathlib import Path
+
+ import gradio as gr
+ import pandas as pd
+
+ from core.cluster import cluster_embeddings
+ from core.dataset import load_dataset
+ from core.embed import embed_failures
+ from core.eval import evaluate
+ from core.metrics import compute_mi_scores
+ from core.model import DEFAULT_MODELS, query_model
+ from viz.plots import plot_mi_comparison
+
+
+ DATA_PATH = Path(__file__).parent / "data" / "carb_seed.json"
+
+ _DESCRIPTION = """\
+ ## CARB Failure Observability
+
+ Research pipeline for structured failure analysis in language model reasoning.
+
+ ```
+ CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
+ ```
+
+ **Central question:** do failure clusters align with *reasoning category* more than with *model identity*?
+
+ The MI comparison plot answers this directly — a larger `MI(cluster, reasoning_type)` bar relative to
+ `MI(cluster, model_identity)` supports the hypothesis that failure structure is organized by reasoning
+ difficulty, not model choice alone.
+
+ > **Requires** `HF_TOKEN` set in Space secrets. See
+ > [failure-geometry-demo](https://huggingface.co/spaces/architectfromthefuture/failure-geometry-demo)
+ > for a fully self-contained version that needs no API key.
+ >
+ > Research context: [obversarystudios.org](https://obversarystudios.org)
+ """
+
+
+ def run_experiment(
+     selected_models: list[str],
+     n_clusters: int,
+ ) -> tuple[str, object, object]:
+     log_lines: list[str] = []
+
+     def log(msg: str) -> None:
+         log_lines.append(msg)
+
+     if not selected_models:
+         selected_models = DEFAULT_MODELS[:1]
+
+     log(f"Loading dataset from {DATA_PATH.name} …")
+     try:
+         dataset = load_dataset(DATA_PATH)
+     except Exception as exc:
+         return f"Dataset error: {exc}", None, None
+
+     log(f" {len(dataset)} examples across {len({r['reasoning_type'] for r in dataset})} reasoning types.")
+     log(f"Querying models: {', '.join(selected_models)} …")
+
+     try:
+         failures = evaluate(dataset, query_model, model_ids=selected_models)
+     except Exception as exc:
+         return f"Evaluation error: {exc}", None, None
+
+     log(f" Found {len(failures)} failures from {len(dataset) * len(selected_models)} total predictions.")
+
+     if not failures:
+         log("No failures detected — all predictions were correct.")
+         empty_mi = {
+             "MI(cluster, reasoning_type)": 0.0,
+             "MI(cluster, model_identity)": 0.0,
+         }
+         fig = plot_mi_comparison(empty_mi)
+         return "\n".join(log_lines), fig, _empty_summary_table()
+
+     log("Embedding failures with all-MiniLM-L6-v2 …")
+     try:
+         embeddings = embed_failures(failures)
+     except Exception as exc:
+         return "\n".join(log_lines) + f"\nEmbed error: {exc}", None, None
+
+     log(f" Embeddings shape: {embeddings.shape}")
+     log(f"Clustering into k={n_clusters} clusters (KMeans) …")
+
+     cluster_ids = cluster_embeddings(embeddings, n_clusters=n_clusters)
+     for failure, cluster_id in zip(failures, cluster_ids, strict=True):
+         failure["cluster_id"] = cluster_id
+
+     counts_per_cluster = {}
+     for cid in cluster_ids:
+         counts_per_cluster[cid] = counts_per_cluster.get(cid, 0) + 1
+     log(f" Cluster sizes: { {k: counts_per_cluster[k] for k in sorted(counts_per_cluster)} }")
+
+     reasoning_types = [f["reasoning_type"] for f in failures]
+     model_ids_list = [f["model_id"] for f in failures]
+
+     log("Computing mutual information …")
+     mi_scores = compute_mi_scores(cluster_ids, reasoning_types, model_ids_list)
+     for label, score in mi_scores.items():
+         log(f" {label}: {score:.4f}")
+
+     fig = plot_mi_comparison(mi_scores)
+     summary_df = _build_summary_table(failures)
+
+     return "\n".join(log_lines), fig, summary_df
+
+
+ def _build_summary_table(failures: list[dict]) -> pd.DataFrame:
+     from collections import Counter
+     counts: Counter = Counter()
+     for f in failures:
+         counts[(f["reasoning_type"], f["model_id"])] += 1
+
+     rows = [
+         {"reasoning_type": rtype, "model_id": mid, "failure_count": cnt}
+         for (rtype, mid), cnt in sorted(counts.items())
+     ]
+     return pd.DataFrame(rows) if rows else _empty_summary_table()
+
+
+ def _empty_summary_table() -> pd.DataFrame:
+     return pd.DataFrame(columns=["reasoning_type", "model_id", "failure_count"])
+
+
+ with gr.Blocks(title="CARB Failure Observability", theme=gr.themes.Soft()) as demo:
+     gr.Markdown(_DESCRIPTION)
+
+     with gr.Row():
+         with gr.Column(scale=1, min_width=260):
+             model_selector = gr.CheckboxGroup(
+                 choices=DEFAULT_MODELS,
+                 value=DEFAULT_MODELS[:1],
+                 label="Models to query",
+                 info="Each model runs on all 50 examples. Multiple models increase failure pool diversity.",
+             )
+             n_clusters_slider = gr.Slider(
+                 minimum=2,
+                 maximum=6,
+                 step=1,
+                 value=4,
+                 label="KMeans clusters (k)",
+                 info="Should be ≤ number of reasoning types (4).",
+             )
+             run_btn = gr.Button("Run Experiment", variant="primary", size="lg")
+
+         with gr.Column(scale=2):
+             status_log = gr.Textbox(
+                 label="Pipeline log",
+                 lines=9,
+                 interactive=False,
+                 placeholder="Click 'Run Experiment' to start …",
+             )
+
+     with gr.Row():
+         mi_plot = gr.Plot(
+             label="Mutual information: cluster vs. reasoning type vs. model identity"
+         )
+
+     with gr.Row():
+         summary_table = gr.Dataframe(
+             headers=["reasoning_type", "model_id", "failure_count"],
+             label="Failures by reasoning type and model",
+             interactive=False,
+         )
+
+     run_btn.click(
+         fn=run_experiment,
+         inputs=[model_selector, n_clusters_slider],
+         outputs=[status_log, mi_plot, summary_table],
+     )
+
+
+ if __name__ == "__main__":
+     demo.launch()
carb-observability-space ADDED
@@ -0,0 +1 @@
+ Subproject commit 0fcfd1cb2222aa6b2ce874133ab7ac03305d7823
core/__pycache__/cluster.cpython-312.pyc ADDED
Binary file (952 Bytes).

core/__pycache__/dataset.cpython-312.pyc ADDED
Binary file (1.88 kB).

core/__pycache__/embed.cpython-312.pyc ADDED
Binary file (1.42 kB).

core/__pycache__/eval.cpython-312.pyc ADDED
Binary file (1.6 kB).

core/__pycache__/metrics.cpython-312.pyc ADDED
Binary file (849 Bytes).

core/__pycache__/model.cpython-312.pyc ADDED
Binary file (4.09 kB).
core/cluster.py ADDED
@@ -0,0 +1,22 @@
+ import numpy as np
+ from sklearn.cluster import KMeans
+
+
+ def cluster_embeddings(
+     embeddings: np.ndarray,
+     n_clusters: int = 4,
+     random_state: int = 42,
+ ) -> list[int]:
+     if len(embeddings) == 0:
+         return []
+
+     effective_clusters = min(n_clusters, len(embeddings))
+     if effective_clusters == 1:
+         # One label per embedding (not a single element), so callers can
+         # still zip labels against the failure list.
+         return [0] * len(embeddings)
+
+     kmeans = KMeans(
+         n_clusters=effective_clusters,
+         random_state=random_state,
+         n_init=10,
+     )
+     return kmeans.fit_predict(embeddings).tolist()
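A quick standalone check of the guard logic in `cluster_embeddings`; the function body is copied here for illustration so the snippet runs without the repo.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_embeddings(embeddings, n_clusters=4, random_state=42):
    # Illustrative copy of core/cluster.py.
    if len(embeddings) == 0:
        return []
    k = min(n_clusters, len(embeddings))  # never request more clusters than points
    if k == 1:
        return [0] * len(embeddings)
    return KMeans(n_clusters=k, random_state=random_state, n_init=10).fit_predict(embeddings).tolist()

few = np.random.default_rng(1).normal(size=(3, 4))
labels = cluster_embeddings(few, n_clusters=4)  # k silently drops from 4 to 3
```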
core/dataset.py ADDED
@@ -0,0 +1,29 @@
+ import json
+ from pathlib import Path
+ from typing import Any
+
+
+ REQUIRED_FIELDS = {"x", "y", "reasoning_type"}
+
+
+ def load_dataset(path: str | Path) -> list[dict[str, Any]]:
+     """Load and validate the small CARB-style seed dataset."""
+     dataset_path = Path(path)
+     with dataset_path.open("r", encoding="utf-8") as f:
+         rows = json.load(f)
+
+     if not isinstance(rows, list):
+         raise ValueError("Dataset must be a JSON list.")
+
+     for index, row in enumerate(rows):
+         missing = REQUIRED_FIELDS.difference(row)
+         if missing:
+             raise ValueError(f"Row {index} is missing required fields: {sorted(missing)}")
+         if row["y"] not in (0, 1):
+             raise ValueError(f"Row {index} has non-binary label: {row['y']!r}")
+         if not isinstance(row["x"], str) or not row["x"].strip():
+             raise ValueError(f"Row {index} has an empty input string.")
+         if not isinstance(row["reasoning_type"], str) or not row["reasoning_type"].strip():
+             raise ValueError(f"Row {index} has an empty reasoning_type.")
+
+     return rows
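Each seed row carries three fields. This standalone checker mirrors the validation rules in `load_dataset` (the helper name `check_row` is illustrative, not part of the repo):

```python
REQUIRED_FIELDS = {"x", "y", "reasoning_type"}

def check_row(row: dict) -> list[str]:
    """Return a list of problems with one seed row (empty list = valid)."""
    problems = []
    missing = REQUIRED_FIELDS.difference(row)
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
        return problems
    if row["y"] not in (0, 1):
        problems.append(f"non-binary label: {row['y']!r}")
    if not isinstance(row["x"], str) or not row["x"].strip():
        problems.append("empty input string")
    if not isinstance(row["reasoning_type"], str) or not row["reasoning_type"].strip():
        problems.append("empty reasoning_type")
    return problems

good = {"x": "Premise: ... Conclusion: ...", "y": 1, "reasoning_type": "transitivity"}
bad = {"x": "", "y": 2, "reasoning_type": "negation"}
```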
core/embed.py ADDED
@@ -0,0 +1,26 @@
+ from collections.abc import Sequence
+
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+
+ EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+
+
+ def embed_failures(failures: Sequence[dict[str, object]]) -> np.ndarray:
+     texts = [_failure_text(failure) for failure in failures]
+     if not texts:
+         return np.empty((0, 384))
+
+     model = SentenceTransformer(EMBEDDING_MODEL)
+     return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)
+
+
+ def _failure_text(failure: dict[str, object]) -> str:
+     return (
+         f"input: {failure['x']}\n"
+         f"expected: {failure['y']}\n"
+         f"prediction: {failure['prediction']}\n"
+         f"reasoning_type: {failure['reasoning_type']}\n"
+         f"model: {failure['model_id']}"
+     )
core/eval.py ADDED
@@ -0,0 +1,43 @@
+ from collections.abc import Callable, Sequence
+ from typing import Any
+
+ from core.model import DEFAULT_MODELS, build_prompt, parse_binary_prediction
+
+
+ ModelFn = Callable[[str, str], str]
+
+
+ def evaluate(
+     dataset: Sequence[dict[str, Any]],
+     model_fn: ModelFn,
+     model_ids: Sequence[str] | None = None,
+ ) -> list[dict[str, Any]]:
+     """Run models over the dataset and return only incorrect or unparsable cases."""
+     failures: list[dict[str, Any]] = []
+     selected_model_ids = list(model_ids or DEFAULT_MODELS)
+
+     for sample_id, sample in enumerate(dataset):
+         prompt = build_prompt(sample["x"])
+         expected = int(sample["y"])
+
+         for model_id in selected_model_ids:
+             raw_output = model_fn(prompt, model_id)
+             prediction = parse_binary_prediction(raw_output)
+             is_correct = prediction == expected
+
+             if not is_correct:
+                 failures.append(
+                     {
+                         "sample_id": sample_id,
+                         "x": sample["x"],
+                         "y": expected,
+                         "reasoning_type": sample["reasoning_type"],
+                         "model_id": model_id,
+                         "prompt": prompt,
+                         "raw_output": raw_output,
+                         "prediction": prediction,
+                         "failure_kind": "parse_error" if prediction is None else "wrong_label",
+                     }
+                 )
+
+     return failures
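The control flow of `evaluate` can be exercised without any API calls by swapping in a stub model function. This is a simplified inline sketch; the one-line parser stands in for `parse_binary_prediction`, and every name is illustrative.

```python
# Stub model: answers 1 for everything, so every y=0 example becomes a failure.
def fake_model(prompt: str, model_id: str) -> str:
    return "label: 1"

dataset = [
    {"x": "All A are B. All B are C. Conclusion: all A are C.", "y": 1, "reasoning_type": "transitivity"},
    {"x": "The door is locked. Statement: the door is open.", "y": 0, "reasoning_type": "negation"},
]

failures = []
for sample_id, sample in enumerate(dataset):
    raw = fake_model(sample["x"], "stub-model")
    prediction = 1 if "label: 1" in raw else 0  # crude stand-in for parse_binary_prediction
    if prediction != sample["y"]:
        failures.append({
            "sample_id": sample_id,
            "reasoning_type": sample["reasoning_type"],
            "model_id": "stub-model",
            "failure_kind": "wrong_label",
        })
```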
core/metrics.py ADDED
@@ -0,0 +1,20 @@
+ from collections.abc import Sequence
+
+ from sklearn.metrics import mutual_info_score
+
+
+ def compute_mi_scores(
+     cluster_ids: Sequence[int],
+     reasoning_types: Sequence[str],
+     model_ids: Sequence[str],
+ ) -> dict[str, float]:
+     if not cluster_ids:
+         return {
+             "MI(cluster, reasoning_type)": 0.0,
+             "MI(cluster, model_identity)": 0.0,
+         }
+
+     return {
+         "MI(cluster, reasoning_type)": float(mutual_info_score(cluster_ids, reasoning_types)),
+         "MI(cluster, model_identity)": float(mutual_info_score(cluster_ids, model_ids)),
+     }
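The property `compute_mi_scores` relies on can be verified directly: a clustering perfectly aligned with one labeling yields MI equal to that labeling's entropy (in nats, since scikit-learn uses natural logarithms), and a clustering balanced across another labeling yields zero.

```python
from sklearn.metrics import mutual_info_score

# Clusters perfectly aligned with reasoning type ...
clusters = [0, 0, 1, 1]
reasoning = ["negation", "negation", "syllogism", "syllogism"]
# ... and exactly balanced across model identity.
models = ["flan-t5-small", "flan-t5-base", "flan-t5-small", "flan-t5-base"]

mi_type = mutual_info_score(clusters, reasoning)   # ln 2, entropy of a fair binary label
mi_model = mutual_info_score(clusters, models)     # 0: clusters say nothing about the model
```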
core/model.py ADDED
@@ -0,0 +1,91 @@
+ import os
+ import re
+ from collections.abc import Sequence
+
+ import requests
+
+
+ DEFAULT_MODELS = [
+     "google/flan-t5-small",
+     "google/flan-t5-base",
+ ]
+
+
+ def build_prompt(input_text: str) -> str:
+     return (
+         "Answer this binary reasoning question. "
+         "Return only one line in the format 'label: 0' or 'label: 1'.\n\n"
+         f"Question: {input_text}"
+     )
+
+
+ def query_model(prompt: str, model_id: str = DEFAULT_MODELS[0], timeout: int = 60) -> str:
+     """Call the Hugging Face Inference API and return model text."""
+     token = os.environ.get("HF_TOKEN")
+     if not token:
+         return "ERROR: HF_TOKEN is not set."
+
+     url = f"https://api-inference.huggingface.co/models/{model_id}"
+     headers = {"Authorization": f"Bearer {token}"}
+     payload = {
+         "inputs": prompt,
+         "parameters": {"max_new_tokens": 32, "return_full_text": False},
+         "options": {"wait_for_model": True},
+     }
+
+     try:
+         response = requests.post(url, headers=headers, json=payload, timeout=timeout)
+         response.raise_for_status()
+         data = response.json()
+     except requests.RequestException as exc:
+         return f"ERROR: request failed for {model_id}: {exc}"
+     except ValueError:
+         return f"ERROR: non-JSON response from {model_id}."
+
+     return _extract_generated_text(data)
+
+
+ def query_models(prompt: str, model_ids: Sequence[str]) -> dict[str, str]:
+     return {model_id: query_model(prompt, model_id=model_id) for model_id in model_ids}
+
+
+ def parse_binary_prediction(output: str) -> int | None:
+     """Parse a structured binary label from model output."""
+     normalized = output.strip().lower()
+     if normalized.startswith("error:"):
+         return None
+
+     structured_patterns = [
+         r"\blabel\s*[:=]\s*([01])\b",
+         r"\banswer\s*[:=]\s*([01])\b",
+         r"\bprediction\s*[:=]\s*([01])\b",
+     ]
+     for pattern in structured_patterns:
+         match = re.search(pattern, normalized)
+         if match:
+             return int(match.group(1))
+
+     if re.fullmatch(r"[01]", normalized):
+         return int(normalized)
+
+     return None
+
+
+ def _extract_generated_text(data: object) -> str:
+     if isinstance(data, list) and data:
+         first = data[0]
+         if isinstance(first, dict):
+             text = first.get("generated_text") or first.get("summary_text")
+             if isinstance(text, str):
+                 return text
+         if isinstance(first, str):
+             return first
+
+     if isinstance(data, dict):
+         if isinstance(data.get("error"), str):
+             return f"ERROR: {data['error']}"
+         text = data.get("generated_text") or data.get("summary_text")
+         if isinstance(text, str):
+             return text
+
+     return f"ERROR: unsupported response format: {data!r}"
data/carb_seed.json ADDED
@@ -0,0 +1,252 @@
1
+ [
2
+ {
3
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All robins are birds. All birds are animals. Conclusion: All robins are animals.",
4
+ "y": 1,
5
+ "reasoning_type": "transitivity"
6
+ },
7
+ {
8
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All squares are rectangles. All rectangles are shapes. Conclusion: All squares are shapes.",
9
+ "y": 1,
10
+ "reasoning_type": "transitivity"
11
+ },
12
+ {
13
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All tulips are flowers. All flowers are plants. Conclusion: All tulips are plants.",
14
+ "y": 1,
15
+ "reasoning_type": "transitivity"
16
+ },
17
+ {
18
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All ferries are boats. All boats are vehicles. Conclusion: All ferries are vehicles.",
19
+ "y": 1,
20
+ "reasoning_type": "transitivity"
21
+ },
22
+ {
23
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All violins are instruments. All instruments are objects. Conclusion: All violins are objects.",
24
+ "y": 1,
25
+ "reasoning_type": "transitivity"
26
+ },
27
+ {
28
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All oak trees are trees. All trees are living things. Conclusion: All oak trees are living things.",
29
+ "y": 1,
30
+ "reasoning_type": "transitivity"
31
+ },
32
+ {
33
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All comets are space objects. All space objects are visible from telescopes. Conclusion: All comets are visible from telescopes.",
34
+ "y": 1,
35
+ "reasoning_type": "transitivity"
36
+ },
37
+ {
38
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All laptops are computers. All computers are machines. Conclusion: All machines are laptops.",
39
+ "y": 0,
40
+ "reasoning_type": "transitivity"
41
+ },
42
+ {
43
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All sparrows are birds. All birds have feathers. Conclusion: All feathered things are sparrows.",
44
+ "y": 0,
45
+ "reasoning_type": "transitivity"
46
+ },
47
+ {
48
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All apples are fruit. All fruit is food. Conclusion: All food is apples.",
49
+ "y": 0,
50
+ "reasoning_type": "transitivity"
51
+ },
52
+ {
53
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All poets are writers. All writers use language. Conclusion: All language users are poets.",
54
+ "y": 0,
55
+ "reasoning_type": "transitivity"
56
+ },
57
+ {
58
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All poodles are dogs. All dogs are mammals. Conclusion: Some mammals are not poodles.",
59
+ "y": 0,
60
+ "reasoning_type": "transitivity"
61
+ },
62
+ {
63
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All taxis are cars. All cars need fuel. Conclusion: All taxis need fuel.",
64
+ "y": 1,
65
+ "reasoning_type": "transitivity"
66
+ },
67
+ {
68
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a door is locked, it is not open. The door is locked. Statement: The door is open.",
69
+ "y": 0,
70
+ "reasoning_type": "negation"
71
+ },
72
+ {
73
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a badge is valid, it is not expired. The badge is valid. Statement: The badge is expired.",
74
+ "y": 0,
75
+ "reasoning_type": "negation"
76
+ },
77
+ {
78
+ "x": "Label 1 if the final statement is true, else 0. Rule: If the lamp is unplugged, it is not powered. The lamp is unplugged. Statement: The lamp is not powered.",
79
+ "y": 1,
80
+ "reasoning_type": "negation"
81
+ },
82
+ {
83
+ "x": "Label 1 if the final statement is true, else 0. Rule: If the file is encrypted, it is not readable as plain text. The file is encrypted. Statement: The file is readable as plain text.",
84
+ "y": 0,
85
+ "reasoning_type": "negation"
86
+ },
87
+ {
88
+ "x": "Label 1 if the final statement is true, else 0. Rule: If the road is closed, cars cannot pass. The road is closed. Statement: Cars can pass.",
89
+ "y": 0,
90
+ "reasoning_type": "negation"
91
+ },
92
+ {
93
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a switch is off, the circuit is not active. The switch is off. Statement: The circuit is not active.",
94
+ "y": 1,
95
+ "reasoning_type": "negation"
96
+ },
97
+ {
98
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a ticket is unpaid, it is not confirmed. The ticket is unpaid. Statement: The ticket is confirmed.",
99
+ "y": 0,
100
+ "reasoning_type": "negation"
101
+ },
102
+ {
103
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a jar is empty, it contains no marbles. The jar is empty. Statement: The jar contains marbles.",
104
+ "y": 0,
105
+ "reasoning_type": "negation"
106
+ },
107
+ {
108
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a user is banned, they are not allowed to post. The user is banned. Statement: The user is not allowed to post.",
109
+ "y": 1,
110
+ "reasoning_type": "negation"
111
+ },
112
+ {
113
+ "x": "Label 1 if the final statement is true, else 0. Rule: If the sensor is disabled, it sends no alerts. The sensor is disabled. Statement: The sensor sends alerts.",
114
+ "y": 0,
115
+ "reasoning_type": "negation"
116
+ },
117
+ {
118
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a package is missing, it is not delivered. The package is missing. Statement: The package is delivered.",
119
+ "y": 0,
120
+ "reasoning_type": "negation"
121
+ },
122
+ {
123
+ "x": "Label 1 if the final statement is true, else 0. Rule: If a plant is dead, it is not growing. The plant is dead. Statement: The plant is not growing.",
124
+ "y": 1,
125
+ "reasoning_type": "negation"
126
+ },
127
+ {
128
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All doctors are trained professionals. Mira is a doctor. Conclusion: Mira is a trained professional.",
129
+ "y": 1,
130
+ "reasoning_type": "syllogism"
131
+ },
132
+ {
133
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All guests need invitations. Omar is a guest. Conclusion: Omar needs an invitation.",
134
+ "y": 1,
135
+ "reasoning_type": "syllogism"
136
+ },
137
+ {
138
+ "x": "Label 1 if the conclusion follows, else 0. Premise: No reptiles are warm-blooded. A gecko is a reptile. Conclusion: A gecko is warm-blooded.",
139
+ "y": 0,
140
+ "reasoning_type": "syllogism"
141
+ },
142
+ {
143
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All library books have catalog numbers. This item is a library book. Conclusion: This item has a catalog number.",
144
+ "y": 1,
145
+ "reasoning_type": "syllogism"
146
+ },
147
+ {
148
+ "x": "Label 1 if the conclusion follows, else 0. Premise: No expired coupons are accepted. This coupon is expired. Conclusion: This coupon is accepted.",
149
+ "y": 0,
150
+ "reasoning_type": "syllogism"
151
+ },
152
+ {
153
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All certified pilots can fly planes. Dana is certified pilot. Conclusion: Dana can fly planes.",
154
+ "y": 1,
155
+ "reasoning_type": "syllogism"
156
+ },
157
+ {
158
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All medals are awards. This object is an award. Conclusion: This object is a medal.",
159
+ "y": 0,
160
+ "reasoning_type": "syllogism"
161
+ },
162
+ {
163
+ "x": "Label 1 if the conclusion follows, else 0. Premise: No broken clocks keep correct time. This clock is broken. Conclusion: This clock keeps correct time.",
164
+ "y": 0,
165
+ "reasoning_type": "syllogism"
166
+ },
167
+ {
168
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All subscribers receive updates. Jin is a subscriber. Conclusion: Jin receives updates.",
169
+ "y": 1,
170
+ "reasoning_type": "syllogism"
171
+ },
172
+ {
173
+ "x": "Label 1 if the conclusion follows, else 0. Premise: No silent alarms make noise. This alarm is silent. Conclusion: This alarm makes noise.",
174
+ "y": 0,
175
+ "reasoning_type": "syllogism"
176
+ },
177
+ {
178
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All registered voters may vote. Lee is registered voter. Conclusion: Lee may vote.",
179
+ "y": 1,
180
+ "reasoning_type": "syllogism"
181
+ },
182
+ {
183
+ "x": "Label 1 if the conclusion follows, else 0. Premise: All chess players know rules. Sam knows rules. Conclusion: Sam is a chess player.",
184
+ "y": 0,
185
+ "reasoning_type": "syllogism"
186
+ },
187
+ {
188
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the blue key is used, the safe opens. Distractor: The red key is shiny. The blue key is used. Conclusion: The safe opens.",
189
+ "y": 1,
190
+ "reasoning_type": "distractor logic"
191
+ },
192
+ {
193
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the server restarts, the cache clears. Distractor: The keyboard is wireless. The server restarts. Conclusion: The cache clears.",
194
+ "y": 1,
195
+ "reasoning_type": "distractor logic"
196
+ },
197
+ {
198
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the form is signed, the request is valid. Distractor: The envelope is yellow. The form is not signed. Conclusion: The request is valid.",
199
+ "y": 0,
200
+ "reasoning_type": "distractor logic"
201
+ },
202
+ {
203
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the alarm rings, the guard wakes. Distractor: The guard owns a bicycle. The alarm rings. Conclusion: The guard wakes.",
204
+ "y": 1,
205
+ "reasoning_type": "distractor logic"
206
+ },
207
+ {
208
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the code compiles, tests can run. Distractor: The monitor is large. The code does not compile. Conclusion: Tests can run.",
209
+ "y": 0,
210
+ "reasoning_type": "distractor logic"
211
+ },
212
+ {
213
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the window is open, the room cools. Distractor: The carpet is green. The window is open. Conclusion: The room cools.",
214
+ "y": 1,
215
+ "reasoning_type": "distractor logic"
216
+ },
217
+ {
218
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the invoice is paid, the account is active. Distractor: The logo is blue. The invoice is unpaid. Conclusion: The account is active.",
219
+ "y": 0,
220
+ "reasoning_type": "distractor logic"
221
+ },
222
+ {
223
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the train arrives, passengers board. Distractor: The station has a clock. The train arrives. Conclusion: Passengers board.",
224
+ "y": 1,
225
+ "reasoning_type": "distractor logic"
226
+ },
227
+ {
228
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the token is invalid, access is denied. Distractor: The desk has two drawers. The token is invalid. Conclusion: Access is denied.",
229
+ "y": 1,
230
+ "reasoning_type": "distractor logic"
231
+ },
232
+ {
233
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If rain falls, the ground gets wet. Distractor: The umbrella is red. Rain does not fall. Conclusion: The ground gets wet.",
234
+ "y": 0,
235
+ "reasoning_type": "distractor logic"
236
+ },
237
+ {
238
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the switch is flipped, the light turns on. Distractor: The wall is painted white. The switch is flipped. Conclusion: The light turns on.",
239
+ "y": 1,
240
+ "reasoning_type": "distractor logic"
241
+ },
242
+ {
243
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the battery is charged, the robot moves. Distractor: The robot is made of metal. The battery is empty. Conclusion: The robot moves.",
244
+ "y": 0,
245
+ "reasoning_type": "distractor logic"
246
+ },
247
+ {
248
+ "x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the map is accurate, the route is reliable. Distractor: The compass is old. The map is accurate. Conclusion: The route is reliable.",
249
+ "y": 1,
250
+ "reasoning_type": "distractor logic"
251
+ }
252
+ ]
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ gradio
+ requests
+ sentence-transformers
+ scikit-learn
+ matplotlib
+ numpy
viz/__pycache__/plots.cpython-312.pyc ADDED
Binary file (1.48 kB).
viz/plots.py ADDED
@@ -0,0 +1,19 @@
+ import matplotlib.pyplot as plt
+
+
+ def plot_mi_comparison(mi_scores: dict[str, float]):
+     fig, ax = plt.subplots(figsize=(7, 4))
+     labels = list(mi_scores.keys())
+     values = list(mi_scores.values())
+
+     ax.bar(labels, values, color=["#4C78A8", "#F58518"])
+     ax.set_ylabel("Mutual information")
+     ax.set_title("Failure Cluster Mutual Information")
+     ax.set_ylim(0, max(values + [0.05]) * 1.2)
+     ax.tick_params(axis="x", labelrotation=15)
+
+     for index, value in enumerate(values):
+         ax.text(index, value, f"{value:.3f}", ha="center", va="bottom")
+
+     fig.tight_layout()
+     return fig