Brian Moran committed
Commit 1b435f0
Parent(s): 0fcfd1c
Add CARB observability pipeline

Files changed:
- README.md +52 -5
- __pycache__/app.cpython-312.pyc +0 -0
- app.py +175 -0
- carb-observability-space +1 -0
- core/__pycache__/cluster.cpython-312.pyc +0 -0
- core/__pycache__/dataset.cpython-312.pyc +0 -0
- core/__pycache__/embed.cpython-312.pyc +0 -0
- core/__pycache__/eval.cpython-312.pyc +0 -0
- core/__pycache__/metrics.cpython-312.pyc +0 -0
- core/__pycache__/model.cpython-312.pyc +0 -0
- core/cluster.py +22 -0
- core/dataset.py +29 -0
- core/embed.py +26 -0
- core/eval.py +43 -0
- core/metrics.py +20 -0
- core/model.py +91 -0
- data/carb_seed.json +252 -0
- requirements.txt +6 -0
- viz/__pycache__/plots.cpython-312.pyc +0 -0
- viz/plots.py +19 -0
README.md
CHANGED

---
title: CARB Failure Observability
emoji: 🔬
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: false
license: mit
short_description: Structured failure analysis for LM reasoning — HF Inference API + cluster MI
---

# CARB Failure Observability

Research pipeline for structured failure analysis in language model reasoning tasks.

```text
CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
```

The central question: *do failure clusters align with reasoning categories (transitivity, negation, syllogism, distractor logic) more than with model identity?*

## What this Space does

1. Loads 50 controlled reasoning examples across four reasoning types (CARB-style: transitivity, negation, syllogism, distractor logic).
2. Sends each prompt to one or more HF Inference API models.
3. Parses binary predictions and isolates failures (incorrect or unparsable outputs).
4. Embeds failures with `sentence-transformers/all-MiniLM-L6-v2`.
5. Clusters embeddings with KMeans (`k` is user-selectable).
6. Computes mutual information between cluster assignments and (a) reasoning type, (b) model identity.
7. Displays the MI comparison as a bar plot alongside a failure summary table.

## What this Space does not claim

- Benchmark results, leaderboard rankings, or SOTA comparisons.
- That the MI gap proves a general theory of failure structure — it is a signal on this dataset and these models.
- Production readiness; this is a research scaffold intended to be inspectable, not deployed.

## Running

Set `HF_TOKEN` in **Space secrets** before clicking **Run Experiment**.

Models queried by default: `google/flan-t5-small`, `google/flan-t5-base`.

## Related work

- **[obversarystudios.org](https://obversarystudios.org)** — research engineering narrative.
- [Failure discovery on binary reasoning](https://obversarystudios.org/docs/failure_discovery_binary_reasoning.html) — framing for this experiment.
- [Failure clusters as interventions](https://obversarystudios.org/docs/failure_clusters_as_interventions.html) — what to do with clusters once found.
- [Evaluation systems](https://obversarystudios.org/docs/evaluation_systems.html) — how this fits the broader eval lane.
- **[failure-geometry-demo](https://huggingface.co/spaces/architectfromthefuture/failure-geometry-demo)** — always-runnable sibling Space (sklearn baseline, no API key needed).

## Honest scope

Evidence posture follows the lab template at
[github.com/architectfromthefuture](https://github.com/architectfromthefuture):

- **Verified here:** pipeline runs end-to-end with a valid `HF_TOKEN`; MI scores are computed and plotted.
- **Described but not verified here:** generalization beyond this seed dataset; statistical significance of any MI gap.
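The MI comparison at the end of the pipeline reduces to `sklearn.metrics.mutual_info_score` applied to two pairs of label lists. A minimal sketch with hypothetical failure records (toy labels, not actual Space output) shows the pattern the hypothesis predicts: clusters informative about reasoning type, uninformative about model identity.

```python
from sklearn.metrics import mutual_info_score

# Hypothetical failure records: each cluster lines up with one reasoning type,
# while both models appear once in every cluster.
clusters = [0, 0, 1, 1, 2, 2, 3, 3]
reasoning_types = ["transitivity", "transitivity", "negation", "negation",
                   "syllogism", "syllogism", "distractor", "distractor"]
model_ids = ["flan-t5-small", "flan-t5-base"] * 4

# Clusters perfectly predict reasoning type, so MI equals the entropy ln(4) ≈ 1.386.
print(mutual_info_score(clusters, reasoning_types))
# Clusters carry no information about which model failed, so MI is 0.
print(mutual_info_score(clusters, model_ids))
```

On real runs the gap is rarely this stark; the bar plot simply shows which of the two MI values dominates.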
__pycache__/app.cpython-312.pyc
ADDED — Binary file (8.78 kB)
app.py
ADDED

````python
from pathlib import Path

import gradio as gr
import pandas as pd

from core.cluster import cluster_embeddings
from core.dataset import load_dataset
from core.embed import embed_failures
from core.eval import evaluate
from core.metrics import compute_mi_scores
from core.model import DEFAULT_MODELS, query_model
from viz.plots import plot_mi_comparison


DATA_PATH = Path(__file__).parent / "data" / "carb_seed.json"

_DESCRIPTION = """\
## CARB Failure Observability

Research pipeline for structured failure analysis in language model reasoning.

```
CARB dataset → HF Inference API → failure extraction → MiniLM embeddings → KMeans → mutual information
```

**Central question:** do failure clusters align with *reasoning category* more than with *model identity*?

The MI comparison plot answers this directly — a larger `MI(cluster, reasoning_type)` bar relative to
`MI(cluster, model_identity)` supports the hypothesis that failure structure is organized by reasoning
difficulty, not model choice alone.

> **Requires** `HF_TOKEN` set in Space secrets. See
> [failure-geometry-demo](https://huggingface.co/spaces/architectfromthefuture/failure-geometry-demo)
> for a fully self-contained version that needs no API key.
>
> Research context: [obversarystudios.org](https://obversarystudios.org)
"""


def run_experiment(
    selected_models: list[str],
    n_clusters: int,
) -> tuple[str, object, object]:
    log_lines: list[str] = []

    def log(msg: str) -> None:
        log_lines.append(msg)

    if not selected_models:
        selected_models = DEFAULT_MODELS[:1]

    log(f"Loading dataset from {DATA_PATH.name} …")
    try:
        dataset = load_dataset(DATA_PATH)
    except Exception as exc:
        return f"Dataset error: {exc}", None, None

    log(f"  {len(dataset)} examples across {len({r['reasoning_type'] for r in dataset})} reasoning types.")
    log(f"Querying models: {', '.join(selected_models)} …")

    try:
        failures = evaluate(dataset, query_model, model_ids=selected_models)
    except Exception as exc:
        return f"Evaluation error: {exc}", None, None

    log(f"  Found {len(failures)} failures from {len(dataset) * len(selected_models)} total predictions.")

    if not failures:
        log("No failures detected — all predictions were correct.")
        empty_mi = {
            "MI(cluster, reasoning_type)": 0.0,
            "MI(cluster, model_identity)": 0.0,
        }
        fig = plot_mi_comparison(empty_mi)
        return "\n".join(log_lines), fig, _empty_summary_table()

    log("Embedding failures with all-MiniLM-L6-v2 …")
    try:
        embeddings = embed_failures(failures)
    except Exception as exc:
        return "\n".join(log_lines) + f"\nEmbed error: {exc}", None, None

    log(f"  Embeddings shape: {embeddings.shape}")
    log(f"Clustering into k={n_clusters} clusters (KMeans) …")

    cluster_ids = cluster_embeddings(embeddings, n_clusters=n_clusters)
    for failure, cluster_id in zip(failures, cluster_ids, strict=True):
        failure["cluster_id"] = cluster_id

    counts_per_cluster = {}
    for cid in cluster_ids:
        counts_per_cluster[cid] = counts_per_cluster.get(cid, 0) + 1
    log(f"  Cluster sizes: { {k: counts_per_cluster[k] for k in sorted(counts_per_cluster)} }")

    reasoning_types = [f["reasoning_type"] for f in failures]
    model_ids_list = [f["model_id"] for f in failures]

    log("Computing mutual information …")
    mi_scores = compute_mi_scores(cluster_ids, reasoning_types, model_ids_list)
    for label, score in mi_scores.items():
        log(f"  {label}: {score:.4f}")

    fig = plot_mi_comparison(mi_scores)
    summary_df = _build_summary_table(failures)

    return "\n".join(log_lines), fig, summary_df


def _build_summary_table(failures: list[dict]) -> pd.DataFrame:
    from collections import Counter

    counts: Counter = Counter()
    for f in failures:
        counts[(f["reasoning_type"], f["model_id"])] += 1

    rows = [
        {"reasoning_type": rtype, "model_id": mid, "failure_count": cnt}
        for (rtype, mid), cnt in sorted(counts.items())
    ]
    return pd.DataFrame(rows) if rows else _empty_summary_table()


def _empty_summary_table() -> pd.DataFrame:
    return pd.DataFrame(columns=["reasoning_type", "model_id", "failure_count"])


with gr.Blocks(title="CARB Failure Observability", theme=gr.themes.Soft()) as demo:
    gr.Markdown(_DESCRIPTION)

    with gr.Row():
        with gr.Column(scale=1, min_width=260):
            model_selector = gr.CheckboxGroup(
                choices=DEFAULT_MODELS,
                value=DEFAULT_MODELS[:1],
                label="Models to query",
                info="Each model runs on all 50 examples. Multiple models increase failure pool diversity.",
            )
            n_clusters_slider = gr.Slider(
                minimum=2,
                maximum=6,
                step=1,
                value=4,
                label="KMeans clusters (k)",
                info="Should be ≤ number of reasoning types (4).",
            )
            run_btn = gr.Button("Run Experiment", variant="primary", size="lg")

        with gr.Column(scale=2):
            status_log = gr.Textbox(
                label="Pipeline log",
                lines=9,
                interactive=False,
                placeholder="Click 'Run Experiment' to start …",
            )

    with gr.Row():
        mi_plot = gr.Plot(
            label="Mutual information: cluster vs. reasoning type vs. model identity"
        )

    with gr.Row():
        summary_table = gr.Dataframe(
            headers=["reasoning_type", "model_id", "failure_count"],
            label="Failures by reasoning type and model",
            interactive=False,
        )

    run_btn.click(
        fn=run_experiment,
        inputs=[model_selector, n_clusters_slider],
        outputs=[status_log, mi_plot, summary_table],
    )


if __name__ == "__main__":
    demo.launch()
````
carb-observability-space
ADDED

```text
Subproject commit 0fcfd1cb2222aa6b2ce874133ab7ac03305d7823
```
core/__pycache__/cluster.cpython-312.pyc
ADDED — Binary file (952 Bytes)

core/__pycache__/dataset.cpython-312.pyc
ADDED — Binary file (1.88 kB)

core/__pycache__/embed.cpython-312.pyc
ADDED — Binary file (1.42 kB)

core/__pycache__/eval.cpython-312.pyc
ADDED — Binary file (1.6 kB)

core/__pycache__/metrics.cpython-312.pyc
ADDED — Binary file (849 Bytes)

core/__pycache__/model.cpython-312.pyc
ADDED — Binary file (4.09 kB)
core/cluster.py
ADDED

```python
import numpy as np
from sklearn.cluster import KMeans


def cluster_embeddings(
    embeddings: np.ndarray,
    n_clusters: int = 4,
    random_state: int = 42,
) -> list[int]:
    if len(embeddings) == 0:
        return []

    effective_clusters = min(n_clusters, len(embeddings))
    if effective_clusters == 1:
        return [0]

    kmeans = KMeans(
        n_clusters=effective_clusters,
        random_state=random_state,
        n_init=10,
    )
    return kmeans.fit_predict(embeddings).tolist()
```
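The `min(n_clusters, len(embeddings))` guard above matters because KMeans raises when asked for more clusters than samples. A quick standalone check with toy 2-D points standing in for MiniLM vectors (hypothetical data, not from the Space):

```python
import numpy as np
from sklearn.cluster import KMeans

# Only 3 failure embeddings but a requested k of 4: cap k at the sample count,
# exactly as cluster_embeddings does, so KMeans does not raise.
embeddings = np.array([[0.0, 1.0], [0.1, 0.9], [5.0, 5.0]])
k = min(4, len(embeddings))
labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(embeddings)
print(sorted(set(labels)))  # [0, 1, 2] — each point becomes its own cluster
```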
core/dataset.py
ADDED

```python
import json
from pathlib import Path
from typing import Any


REQUIRED_FIELDS = {"x", "y", "reasoning_type"}


def load_dataset(path: str | Path) -> list[dict[str, Any]]:
    """Load and validate the small CARB-style seed dataset."""
    dataset_path = Path(path)
    with dataset_path.open("r", encoding="utf-8") as f:
        rows = json.load(f)

    if not isinstance(rows, list):
        raise ValueError("Dataset must be a JSON list.")

    for index, row in enumerate(rows):
        missing = REQUIRED_FIELDS.difference(row)
        if missing:
            raise ValueError(f"Row {index} is missing required fields: {sorted(missing)}")
        if row["y"] not in (0, 1):
            raise ValueError(f"Row {index} has non-binary label: {row['y']!r}")
        if not isinstance(row["x"], str) or not row["x"].strip():
            raise ValueError(f"Row {index} has an empty input string.")
        if not isinstance(row["reasoning_type"], str) or not row["reasoning_type"].strip():
            raise ValueError(f"Row {index} has an empty reasoning_type.")

    return rows
```
core/embed.py
ADDED

```python
from collections.abc import Sequence

import numpy as np
from sentence_transformers import SentenceTransformer


EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"


def embed_failures(failures: Sequence[dict[str, object]]) -> np.ndarray:
    texts = [_failure_text(failure) for failure in failures]
    if not texts:
        return np.empty((0, 384))

    model = SentenceTransformer(EMBEDDING_MODEL)
    return model.encode(texts, convert_to_numpy=True, normalize_embeddings=True)


def _failure_text(failure: dict[str, object]) -> str:
    return (
        f"input: {failure['x']}\n"
        f"expected: {failure['y']}\n"
        f"prediction: {failure['prediction']}\n"
        f"reasoning_type: {failure['reasoning_type']}\n"
        f"model: {failure['model_id']}"
    )
```
core/eval.py
ADDED

```python
from collections.abc import Callable, Sequence
from typing import Any

from core.model import DEFAULT_MODELS, build_prompt, parse_binary_prediction


ModelFn = Callable[[str, str], str]


def evaluate(
    dataset: Sequence[dict[str, Any]],
    model_fn: ModelFn,
    model_ids: Sequence[str] | None = None,
) -> list[dict[str, Any]]:
    """Run models over the dataset and return only incorrect or unparsable cases."""
    failures: list[dict[str, Any]] = []
    selected_model_ids = list(model_ids or DEFAULT_MODELS)

    for sample_id, sample in enumerate(dataset):
        prompt = build_prompt(sample["x"])
        expected = int(sample["y"])

        for model_id in selected_model_ids:
            raw_output = model_fn(prompt, model_id)
            prediction = parse_binary_prediction(raw_output)
            is_correct = prediction == expected

            if not is_correct:
                failures.append(
                    {
                        "sample_id": sample_id,
                        "x": sample["x"],
                        "y": expected,
                        "reasoning_type": sample["reasoning_type"],
                        "model_id": model_id,
                        "prompt": prompt,
                        "raw_output": raw_output,
                        "prediction": prediction,
                        "failure_kind": "parse_error" if prediction is None else "wrong_label",
                    }
                )

    return failures
```
core/metrics.py
ADDED

```python
from collections.abc import Sequence

from sklearn.metrics import mutual_info_score


def compute_mi_scores(
    cluster_ids: Sequence[int],
    reasoning_types: Sequence[str],
    model_ids: Sequence[str],
) -> dict[str, float]:
    if not cluster_ids:
        return {
            "MI(cluster, reasoning_type)": 0.0,
            "MI(cluster, model_identity)": 0.0,
        }

    return {
        "MI(cluster, reasoning_type)": float(mutual_info_score(cluster_ids, reasoning_types)),
        "MI(cluster, model_identity)": float(mutual_info_score(cluster_ids, model_ids)),
    }
```
core/model.py
ADDED

```python
import os
import re
from collections.abc import Sequence

import requests


DEFAULT_MODELS = [
    "google/flan-t5-small",
    "google/flan-t5-base",
]


def build_prompt(input_text: str) -> str:
    return (
        "Answer this binary reasoning question. "
        "Return only one line in the format 'label: 0' or 'label: 1'.\n\n"
        f"Question: {input_text}"
    )


def query_model(prompt: str, model_id: str = DEFAULT_MODELS[0], timeout: int = 60) -> str:
    """Call the Hugging Face Inference API and return model text."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        return "ERROR: HF_TOKEN is not set."

    url = f"https://api-inference.huggingface.co/models/{model_id}"
    headers = {"Authorization": f"Bearer {token}"}
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 32, "return_full_text": False},
        "options": {"wait_for_model": True},
    }

    try:
        response = requests.post(url, headers=headers, json=payload, timeout=timeout)
        response.raise_for_status()
        data = response.json()
    except requests.RequestException as exc:
        return f"ERROR: request failed for {model_id}: {exc}"
    except ValueError:
        return f"ERROR: non-JSON response from {model_id}."

    return _extract_generated_text(data)


def query_models(prompt: str, model_ids: Sequence[str]) -> dict[str, str]:
    return {model_id: query_model(prompt, model_id=model_id) for model_id in model_ids}


def parse_binary_prediction(output: str) -> int | None:
    """Parse a structured binary label from model output."""
    normalized = output.strip().lower()
    if normalized.startswith("error:"):
        return None

    structured_patterns = [
        r"\blabel\s*[:=]\s*([01])\b",
        r"\banswer\s*[:=]\s*([01])\b",
        r"\bprediction\s*[:=]\s*([01])\b",
    ]
    for pattern in structured_patterns:
        match = re.search(pattern, normalized)
        if match:
            return int(match.group(1))

    if re.fullmatch(r"[01]", normalized):
        return int(normalized)

    return None


def _extract_generated_text(data: object) -> str:
    if isinstance(data, list) and data:
        first = data[0]
        if isinstance(first, dict):
            text = first.get("generated_text") or first.get("summary_text")
            if isinstance(text, str):
                return text
        if isinstance(first, str):
            return first

    if isinstance(data, dict):
        if isinstance(data.get("error"), str):
            return f"ERROR: {data['error']}"
        text = data.get("generated_text") or data.get("summary_text")
        if isinstance(text, str):
            return text

    return f"ERROR: unsupported response format: {data!r}"
```
data/carb_seed.json
ADDED

```json
[
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All robins are birds. All birds are animals. Conclusion: All robins are animals.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All squares are rectangles. All rectangles are shapes. Conclusion: All squares are shapes.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All tulips are flowers. All flowers are plants. Conclusion: All tulips are plants.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All ferries are boats. All boats are vehicles. Conclusion: All ferries are vehicles.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All violins are instruments. All instruments are objects. Conclusion: All violins are objects.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All oak trees are trees. All trees are living things. Conclusion: All oak trees are living things.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All comets are space objects. All space objects are visible from telescopes. Conclusion: All comets are visible from telescopes.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All laptops are computers. All computers are machines. Conclusion: All machines are laptops.",
    "y": 0,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All sparrows are birds. All birds have feathers. Conclusion: All feathered things are sparrows.",
    "y": 0,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All apples are fruit. All fruit is food. Conclusion: All food is apples.",
    "y": 0,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All poets are writers. All writers use language. Conclusion: All language users are poets.",
    "y": 0,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All poodles are dogs. All dogs are mammals. Conclusion: Some mammals are not poodles.",
    "y": 0,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All taxis are cars. All cars need fuel. Conclusion: All taxis need fuel.",
    "y": 1,
    "reasoning_type": "transitivity"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a door is locked, it is not open. The door is locked. Statement: The door is open.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a badge is valid, it is not expired. The badge is valid. Statement: The badge is expired.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If the lamp is unplugged, it is not powered. The lamp is unplugged. Statement: The lamp is not powered.",
    "y": 1,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If the file is encrypted, it is not readable as plain text. The file is encrypted. Statement: The file is readable as plain text.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If the road is closed, cars cannot pass. The road is closed. Statement: Cars can pass.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a switch is off, the circuit is not active. The switch is off. Statement: The circuit is not active.",
    "y": 1,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a ticket is unpaid, it is not confirmed. The ticket is unpaid. Statement: The ticket is confirmed.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a jar is empty, it contains no marbles. The jar is empty. Statement: The jar contains marbles.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a user is banned, they are not allowed to post. The user is banned. Statement: The user is not allowed to post.",
    "y": 1,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If the sensor is disabled, it sends no alerts. The sensor is disabled. Statement: The sensor sends alerts.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a package is missing, it is not delivered. The package is missing. Statement: The package is delivered.",
    "y": 0,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the final statement is true, else 0. Rule: If a plant is dead, it is not growing. The plant is dead. Statement: The plant is not growing.",
    "y": 1,
    "reasoning_type": "negation"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All doctors are trained professionals. Mira is a doctor. Conclusion: Mira is a trained professional.",
    "y": 1,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All guests need invitations. Omar is a guest. Conclusion: Omar needs an invitation.",
    "y": 1,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: No reptiles are warm-blooded. A gecko is a reptile. Conclusion: A gecko is warm-blooded.",
    "y": 0,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All library books have catalog numbers. This item is a library book. Conclusion: This item has a catalog number.",
    "y": 1,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: No expired coupons are accepted. This coupon is expired. Conclusion: This coupon is accepted.",
    "y": 0,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All certified pilots can fly planes. Dana is certified pilot. Conclusion: Dana can fly planes.",
    "y": 1,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All medals are awards. This object is an award. Conclusion: This object is a medal.",
    "y": 0,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: No broken clocks keep correct time. This clock is broken. Conclusion: This clock keeps correct time.",
    "y": 0,
    "reasoning_type": "syllogism"
  },
  {
    "x": "Label 1 if the conclusion follows, else 0. Premise: All subscribers receive updates. Jin is a subscriber. Conclusion: Jin receives updates.",
```
|
| 169 |
+
"y": 1,
|
| 170 |
+
"reasoning_type": "syllogism"
|
| 171 |
+
},
|
| 172 |
+
{
|
| 173 |
+
"x": "Label 1 if the conclusion follows, else 0. Premise: No silent alarms make noise. This alarm is silent. Conclusion: This alarm makes noise.",
|
| 174 |
+
"y": 0,
|
| 175 |
+
"reasoning_type": "syllogism"
|
| 176 |
+
},
|
| 177 |
+
{
|
| 178 |
+
"x": "Label 1 if the conclusion follows, else 0. Premise: All registered voters may vote. Lee is registered voter. Conclusion: Lee may vote.",
|
| 179 |
+
"y": 1,
|
| 180 |
+
"reasoning_type": "syllogism"
|
| 181 |
+
},
|
| 182 |
+
{
|
| 183 |
+
"x": "Label 1 if the conclusion follows, else 0. Premise: All chess players know rules. Sam knows rules. Conclusion: Sam is a chess player.",
|
| 184 |
+
"y": 0,
|
| 185 |
+
"reasoning_type": "syllogism"
|
| 186 |
+
},
|
| 187 |
+
{
|
| 188 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the blue key is used, the safe opens. Distractor: The red key is shiny. The blue key is used. Conclusion: The safe opens.",
|
| 189 |
+
"y": 1,
|
| 190 |
+
"reasoning_type": "distractor logic"
|
| 191 |
+
},
|
| 192 |
+
{
|
| 193 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the server restarts, the cache clears. Distractor: The keyboard is wireless. The server restarts. Conclusion: The cache clears.",
|
| 194 |
+
"y": 1,
|
| 195 |
+
"reasoning_type": "distractor logic"
|
| 196 |
+
},
|
| 197 |
+
{
|
| 198 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the form is signed, the request is valid. Distractor: The envelope is yellow. The form is not signed. Conclusion: The request is valid.",
|
| 199 |
+
"y": 0,
|
| 200 |
+
"reasoning_type": "distractor logic"
|
| 201 |
+
},
|
| 202 |
+
{
|
| 203 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the alarm rings, the guard wakes. Distractor: The guard owns a bicycle. The alarm rings. Conclusion: The guard wakes.",
|
| 204 |
+
"y": 1,
|
| 205 |
+
"reasoning_type": "distractor logic"
|
| 206 |
+
},
|
| 207 |
+
{
|
| 208 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the code compiles, tests can run. Distractor: The monitor is large. The code does not compile. Conclusion: Tests can run.",
|
| 209 |
+
"y": 0,
|
| 210 |
+
"reasoning_type": "distractor logic"
|
| 211 |
+
},
|
| 212 |
+
{
|
| 213 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the window is open, the room cools. Distractor: The carpet is green. The window is open. Conclusion: The room cools.",
|
| 214 |
+
"y": 1,
|
| 215 |
+
"reasoning_type": "distractor logic"
|
| 216 |
+
},
|
| 217 |
+
{
|
| 218 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the invoice is paid, the account is active. Distractor: The logo is blue. The invoice is unpaid. Conclusion: The account is active.",
|
| 219 |
+
"y": 0,
|
| 220 |
+
"reasoning_type": "distractor logic"
|
| 221 |
+
},
|
| 222 |
+
{
|
| 223 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the train arrives, passengers board. Distractor: The station has a clock. The train arrives. Conclusion: Passengers board.",
|
| 224 |
+
"y": 1,
|
| 225 |
+
"reasoning_type": "distractor logic"
|
| 226 |
+
},
|
| 227 |
+
{
|
| 228 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the token is invalid, access is denied. Distractor: The desk has two drawers. The token is invalid. Conclusion: Access is denied.",
|
| 229 |
+
"y": 1,
|
| 230 |
+
"reasoning_type": "distractor logic"
|
| 231 |
+
},
|
| 232 |
+
{
|
| 233 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If rain falls, the ground gets wet. Distractor: The umbrella is red. Rain does not fall. Conclusion: The ground gets wet.",
|
| 234 |
+
"y": 0,
|
| 235 |
+
"reasoning_type": "distractor logic"
|
| 236 |
+
},
|
| 237 |
+
{
|
| 238 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the switch is flipped, the light turns on. Distractor: The wall is painted white. The switch is flipped. Conclusion: The light turns on.",
|
| 239 |
+
"y": 1,
|
| 240 |
+
"reasoning_type": "distractor logic"
|
| 241 |
+
},
|
| 242 |
+
{
|
| 243 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the battery is charged, the robot moves. Distractor: The robot is made of metal. The battery is empty. Conclusion: The robot moves.",
|
| 244 |
+
"y": 0,
|
| 245 |
+
"reasoning_type": "distractor logic"
|
| 246 |
+
},
|
| 247 |
+
{
|
| 248 |
+
"x": "Label 1 if the target conclusion follows, else 0. Useful rule: If the map is accurate, the route is reliable. Distractor: The compass is old. The map is accurate. Conclusion: The route is reliable.",
|
| 249 |
+
"y": 1,
|
| 250 |
+
"reasoning_type": "distractor logic"
|
| 251 |
+
}
|
| 252 |
+
]
|
requirements.txt
ADDED

@@ -0,0 +1,6 @@
gradio
requests
sentence-transformers
scikit-learn
matplotlib
numpy
viz/__pycache__/plots.cpython-312.pyc
ADDED

Binary file (1.48 kB)
viz/plots.py
ADDED

@@ -0,0 +1,19 @@
import matplotlib.pyplot as plt


def plot_mi_comparison(mi_scores: dict[str, float]):
    fig, ax = plt.subplots(figsize=(7, 4))
    labels = list(mi_scores.keys())
    values = list(mi_scores.values())

    ax.bar(labels, values, color=["#4C78A8", "#F58518"])
    ax.set_ylabel("Mutual information")
    ax.set_title("Failure Cluster Mutual Information")
    ax.set_ylim(0, max(values + [0.05]) * 1.2)
    ax.tick_params(axis="x", labelrotation=15)

    for index, value in enumerate(values):
        ax.text(index, value, f"{value:.3f}", ha="center", va="bottom")

    fig.tight_layout()
    return fig
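`plot_mi_comparison` takes a dict mapping condition names to mutual-information scores. A minimal sketch of how such a dict might be produced, using scikit-learn's `mutual_info_score` on toy cluster assignments — the repo's actual MI computation lives in `core/metrics.py` and may differ, and the key names and synthetic arrays here are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Toy data: 1 = the model failed on an example, 0 = it succeeded.
failure_labels = rng.integers(0, 2, size=200)

# Baseline: cluster assignments drawn independently of failure.
random_clusters = rng.integers(0, 4, size=200)

# Informative case: each cluster falls entirely within one failure class,
# so the cluster assignment fully determines the failure label.
aligned_clusters = failure_labels * 2 + rng.integers(0, 2, size=200)

mi_scores = {
    "random clusters": mutual_info_score(failure_labels, random_clusters),
    "failure clusters": mutual_info_score(failure_labels, aligned_clusters),
}
print(mi_scores)
```

On data like this the aligned clustering scores near the entropy of the failure labels (about 0.69 nats for a balanced binary label), while the random baseline stays near zero — the gap is what the bar chart visualizes.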