# OCC Benchmark: Blackwell Edition
Single-script OCC (Oracle-Credit-Compute) benchmark suite. One command, three benchmarks.
## What This Measures
OCC gates agent actions through non-transferable, decaying credits. This script tests whether that mechanism improves decision quality and compute efficiency.
| Benchmark | What | Key Metric |
|---|---|---|
| Multi-Agent Debate | 30 topics, 4 agents (3 honest + 1 adversarial), global credit pool vs equal turns | Accuracy at iso-compute |
| HumanEval Code | Two-pass OCC: cheap 128-token generation → expensive 1024-token retry only on failures | Token savings at iso-pass@1 |
| TruthfulQA | Three answer strategies: direct, tiered-verify, OCC+abstain | Misconception reduction |
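To make the debate mechanism concrete, here is a minimal sketch of the credit gate, assuming a shared balance with per-turn costs and per-round decay. The class and the 180-credits/cost-3 reading of the "180/3" config are illustrative assumptions, not the script's actual API.

```python
# Minimal sketch of an OCC credit gate (names and the 180-credits /
# cost-3 reading of the "180/3" config are assumptions).
from dataclasses import dataclass

@dataclass
class CreditPool:
    credits: float        # shared, non-transferable budget
    decay: float = 0.0    # fraction of the balance lost per round

    def can_act(self, cost: float) -> bool:
        return self.credits >= cost

    def spend(self, cost: float) -> None:
        assert self.can_act(cost), "credit pool depleted"
        self.credits -= cost

    def tick(self) -> None:
        self.credits *= 1.0 - self.decay  # per-round decay

pool = CreditPool(credits=180.0, decay=0.05)
for _round in range(10):
    for agent in ("honest_1", "honest_2", "honest_3", "adversarial"):
        if pool.can_act(cost=3.0):
            pool.spend(3.0)  # ...generate this agent's debate turn here
    pool.tick()
```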
## Quick Start
```bash
# Default: Qwen/Qwen3-Coder-30B-A3B-Instruct
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```
### Alternative model (less VRAM)
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_... \
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```
### Local clone
```bash
git clone https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell
cd occ-benchmark-blackwell
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer occ_benchmark_all.py
```
### Custom seed
```bash
SEED=123 uv run ... occ_benchmark_all.py
```
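The environment variables above are presumably read inside the script roughly like this (the defaults match this README, but the exact parsing in `occ_benchmark_all.py` is an assumption):

```python
# Sketch of the env-var configuration (defaults taken from this README).
import os

MODEL = os.environ.get("MODEL", "Qwen/Qwen3-Coder-30B-A3B-Instruct")
SEED = int(os.environ.get("SEED", "42"))
OUT_DIR = os.environ.get("OUT_DIR", "/app/results")
HF_TOKEN = os.environ.get("HF_TOKEN")  # only needed for gated models
```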
## Output
All results land in `/app/results/` (override with `OUT_DIR`):
```
/app/results/
├── system_info.json           # GPU/PyTorch/CUDA info
├── debate_results.json        # debate benchmark
├── humaneval_results.json     # code benchmark
├── truthfulqa_results.json    # QA benchmark
├── all_results.json           # everything combined
├── report.md                  # summary report
├── checkpoint_debate.json     # partial save after debate (crash recovery)
├── checkpoint_humaneval.json  # partial save after HumanEval
└── checkpoint_truthfulqa.json
```
To package results for sharing:
```bash
tar -czf occ_results.tar.gz -C /app results/
```
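To skim results programmatically, something like the following works. The file names come from the tree above; the JSON key layout is an assumption, so adapt it to what the script actually writes.

```python
# Load the combined results and list surviving checkpoints.
import json
from pathlib import Path

out_dir = Path("/app/results")
all_results = json.loads((out_dir / "all_results.json").read_text())
print("top-level keys:", sorted(all_results))

# Checkpoint files persist after crashes, so partial runs stay readable.
for ckpt in sorted(out_dir.glob("checkpoint_*.json")):
    print(ckpt.name, "->", ckpt.stat().st_size, "bytes")
```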
## Requirements
| Model | VRAM | Time (estimate) |
|---|---|---|
| Qwen3-Coder-30B-A3B (default) | ~60 GB BF16 | 45-90 min |
| Llama-3.1-8B-Instruct | ~16 GB BF16 | 15-30 min |
Time breakdown: Debate ≈ 60% (480 generations at 512 tokens each), HumanEval ≈ 25% (328 generations), TruthfulQA ≈ 15% (180 generations at 64-256 tokens).
GPU: Blackwell, H200, A100, H100; any fast GPU works.
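As a back-of-envelope check on the breakdown above (the retry-every-problem HumanEval bound and the decode speed are assumptions; HumanEval has 164 problems):

```python
# Upper-bound decode volume per run, from the generation counts above.
debate = 480 * 512                  # 245,760 tokens
humaneval = 164 * 128 + 164 * 1024  # 328 generations if every problem is retried
truthfulqa = 180 * 256              # top of the 64-256 token range
total = debate + humaneval + truthfulqa
tok_per_s = 80.0                    # assumed decode speed; varies by GPU/model
print(f"~{total:,} tokens -> ~{total / tok_per_s / 60:.0f} min at {tok_per_s:.0f} tok/s")
```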
## Safety
- HumanEval code execution uses isolated subprocesses with 30 s timeouts; no generated code runs in-process (see the sketch after this list).
- All benchmarks run under `torch.no_grad()`. No training happens.
- No data leaves the machine unless you upload the results.
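For reference, the isolation pattern described in the first bullet looks roughly like this (a sketch, not the script's actual harness):

```python
# Run a generated candidate in an isolated subprocess with a 30 s timeout.
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout_s: float = 30.0) -> bool:
    # Write the candidate (plus its tests) to a temp file...
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # ...and execute it in a fresh interpreter, never in-process.
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout_s
        )
        return proc.returncode == 0  # pass iff the tests exit cleanly
    except subprocess.TimeoutExpired:
        return False
```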
## Determinism
- `SEED=42` by default (env-var override); see the seeding sketch after this list.
- HumanEval uses `do_sample=False` and is fully deterministic.
- Debate and TruthfulQA use `do_sample=True` (sampling is part of the test); results vary ±2-5 pp between runs.
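A sketch of what these settings imply in `transformers` terms; the seeding helper and the sampling temperature are assumptions:

```python
# Seeding and generation flags implied by the bullets above (illustrative).
import random

import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)  # SEED env var, default 42

greedy = dict(do_sample=False, max_new_tokens=128)  # HumanEval: deterministic
sampled = dict(do_sample=True, temperature=0.7,     # temperature is an assumption
               max_new_tokens=512)                  # debate turns per the breakdown
```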
## What To Look For
- Debate: OCC 180/3 should beat baseline at iso-compute. If pool depletes early, all three OCC configs serve as a mini-sweep.
- HumanEval: Should show 70-90% token savings at equal pass@1 vs a naive all-1024 baseline (see the two-pass sketch after this list).
- TruthfulQA: OCC+Abstain should reduce misconception count vs direct answering.
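The sketch below shows the two-pass strategy and how the savings number falls out; `generate` and `passes_tests` are hypothetical stand-ins for the script's internals.

```python
# Two-pass OCC code strategy and the iso-pass@1 savings metric
# (generate/passes_tests are hypothetical stubs, not the script's API).
def generate(problem: str, max_new_tokens: int) -> str:
    ...  # call the model here

def passes_tests(problem: str, completion: str) -> bool:
    ...  # run the sandboxed unit tests here

def solve(problem: str) -> tuple[str, int]:
    draft = generate(problem, max_new_tokens=128)   # cheap first pass
    if passes_tests(problem, draft):
        return draft, 128
    retry = generate(problem, max_new_tokens=1024)  # expensive retry on failure
    return retry, 128 + 1024

# savings = 1 - occ_tokens / baseline_tokens, with baseline_tokens = n * 1024.
# E.g. an 85% draft pass rate gives ~0.85*128 + 0.15*1152 ≈ 282 tokens per
# problem vs 1024 -> ~72% savings; a perfect draft rate caps out near 87.5%.
```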
## If Something Breaks
- OOM: switch to a smaller model (`MODEL=meta-llama/Llama-3.1-8B-Instruct`).
- Model download fails: check `HF_TOKEN` (needed for gated models like Llama).
- Benchmark crashes mid-run: checkpoint files in `/app/results/` contain partial results from completed benchmarks.
- Generation hangs: rare with MoE models; kill and retry with a different seed.
## Expected Numbers (Qwen3-Coder-30B, H200, seed=42)
These are from our prior runs; use them for sanity checking:
| Benchmark | Expected |
|---|---|
| Debate baseline | ~76-83% accuracy |
| Debate OCC 180/3 | ~80-87% accuracy |
| HumanEval pass@1 | ~70-76% |
| HumanEval savings | ~75-88% |
| TruthfulQA direct misconceptions | ~5-10 |
| TruthfulQA OCC+Abstain misconceptions | ~2-5 |
Blackwell numbers may differ (likely faster, similar accuracy).
## Paper
See `narcolepticchicken/occ-stack` for the full OCC codebase, literature review, and technical report.
## Privacy
This is a private repo. Share only with trusted collaborators. The script and results contain no credentials or secrets.
## Questions / Issues
Open a discussion on the repo or contact the repo owner.
## Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "narcolepticchicken/occ-benchmark-blackwell"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.