---
tags:
  - ml-intern
---

OCC Benchmark – Blackwell Edition

Single-script OCC (Oracle-Credit-Compute) benchmark suite. One command, three benchmarks.

What This Measures

OCC gates agent actions through non-transferable, decaying credits. This script tests whether that mechanism improves decision quality and compute efficiency.
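
As a rough mental model (the class and numbers below are illustrative, not the script's actual API), the gate works like a shared wallet that every action must draw from and that loses value over time:

from dataclasses import dataclass

@dataclass
class CreditPool:
    credits: float        # e.g. a 180-credit pool shared by all debate agents
    decay: float = 0.9    # unspent credits lose value each round

    def try_spend(self, cost: float) -> bool:
        # Gate an action: it only happens if enough credit remains.
        if self.credits < cost:
            return False
        self.credits -= cost
        return True

    def end_round(self) -> None:
        # Credits are non-transferable and decay, so hoarding is penalized.
        self.credits *= self.decay

pool = CreditPool(credits=180.0)
for _ in range(3):                       # e.g. three debate rounds
    for agent in ("honest_1", "honest_2", "honest_3", "adversarial"):
        if pool.try_spend(cost=10.0):
            ...                          # the agent would generate a turn here
    pool.end_round()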

Benchmark | What | Key Metric
--- | --- | ---
Multi-Agent Debate | 30 topics, 4 agents (3 honest + 1 adversarial), global credit pool vs equal turns | Accuracy at iso-compute
HumanEval Code | Two-pass OCC: cheap 128-token generation → expensive 1024-token retry only on failures (sketched below) | Token savings at iso-pass@1
TruthfulQA | Three answer strategies: direct, tiered-verify, OCC+abstain | Misconception reduction
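
The two-pass HumanEval strategy is the easiest to picture. A minimal sketch, assuming a generate helper and a test runner (both names are ours, not the script's):

def solve_two_pass(problem, generate, passes_tests):
    # Cheap first attempt: most problems are solved within 128 tokens.
    attempt = generate(problem, max_new_tokens=128)
    if passes_tests(problem, attempt):
        return attempt, 128
    # Only failures earn the expensive 1024-token retry.
    retry = generate(problem, max_new_tokens=1024)
    return retry, 128 + 1024

The token savings come from the second pass being skipped for every problem the cheap pass already solves.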

Quick Start

# Default: Qwen/Qwen3-Coder-30B-A3B-Instruct
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py

Alternative model (less VRAM)

MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_... \
  uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py

Local clone

git clone https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell
cd occ-benchmark-blackwell
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer occ_benchmark_all.py

Custom seed

SEED=123 uv run ... occ_benchmark_all.py

Output

All results land in /app/results/ (override with OUT_DIR):

/app/results/
├── system_info.json          # GPU/PyTorch/CUDA info
├── debate_results.json       # debate benchmark
├── humaneval_results.json    # code benchmark
├── truthfulqa_results.json   # QA benchmark
├── all_results.json          # everything combined
├── report.md                 # summary report
├── checkpoint_debate.json    # partial save after debate (crash recovery)
├── checkpoint_humaneval.json # partial save after HumanEval
└── checkpoint_truthfulqa.json

To package results for sharing:

tar -czf occ_results.tar.gz -C /app results/
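
To inspect the numbers programmatically rather than reading report.md, something like this works (the path assumes the default OUT_DIR):

import json
from pathlib import Path

results_dir = Path("/app/results")            # or wherever OUT_DIR pointed
combined = json.loads((results_dir / "all_results.json").read_text())
print(list(combined.keys()))                  # quick look at what was recorded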

Requirements

Model | VRAM | Time (estimate)
--- | --- | ---
Qwen3-Coder-30B-A3B (default) | ~60 GB (BF16) | 45-90 min
Llama-3.1-8B-Instruct | ~16 GB (BF16) | 15-30 min

Time breakdown: Debate ≈ 60% (480 generations at 512 tokens each), HumanEval ≈ 25% (328 generations), TruthfulQA ≈ 15% (180 generations at 64-256 tokens).

GPU: Blackwell, H200, A100, H100; any fast GPU works.

Safety

  • HumanEval code execution uses isolated subprocesses with 30s timeouts. No generated code runs in-process.
  • All benchmarks use torch.no_grad(). No training happens.
  • No data leaves the machine unless you upload the results.

Determinism

  • SEED=42 by default (env var override)
  • HumanEval uses do_sample=False β€” fully deterministic
  • Debate and TruthfulQA use do_sample=True (sampling is part of the test). Results vary Β±2-5pp between runs.

What To Look For

  1. Debate: OCC 180/3 should beat the baseline at iso-compute. If the pool depletes early, all three OCC configs serve as a mini-sweep.
  2. HumanEval: Should show 70-90% token savings at equal pass@1 vs a naive all-1024 baseline (see the helper after this list).
  3. TruthfulQA: OCC+Abstain should reduce the misconception count vs direct answering.
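
The savings figure in item 2 is just a ratio against the naive budget. A helper with a made-up input (substitute whatever total token count the results report):

def token_savings(occ_tokens: int, n_problems: int = 164, naive_budget: int = 1024) -> float:
    # Fraction of tokens saved vs giving every HumanEval problem the full
    # 1024-token budget (HumanEval has 164 problems).
    return 1.0 - occ_tokens / (n_problems * naive_budget)

print(token_savings(occ_tokens=35_000))   # ~0.79, i.e. roughly 79% savings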

If Something Breaks

  1. OOM: Switch to smaller model (MODEL=meta-llama/Llama-3.1-8B-Instruct)
  2. Model download fails: Check HF_TOKEN (needed for gated models like Llama)
  3. Benchmark crashes mid-run: Checkpoint files in /app/results/ contain partial results from completed benchmarks
  4. Generation hangs: Rare; occasionally seen with MoE models. Kill the run and retry with a different seed.

Expected Numbers (Qwen3-Coder-30B, H200, seed=42)

These are from our prior runs; use them for sanity checking:

Benchmark | Expected
--- | ---
Debate baseline | ~76-83% accuracy
Debate OCC 180/3 | ~80-87% accuracy
HumanEval pass@1 | ~70-76%
HumanEval savings | ~75-88%
TruthfulQA direct misconceptions | ~5-10
TruthfulQA OCC+Abstain misconceptions | ~2-5

Blackwell numbers may differ (likely faster, similar accuracy).

Paper

See narcolepticchicken/occ-stack for the full OCC codebase, literature review, and technical report.

Privacy

This is a private repo. Share only with trusted collaborators. The script and results contain no credentials or secrets.

Questions / Issues

Open a discussion on the repo or contact the repo owner.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "narcolepticchicken/occ-benchmark-blackwell"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.