---
tags:
  - ml-intern
---

OCC Benchmark – Blackwell Edition

Single-script OCC (Oracle-Credit-Compute) benchmark suite. One command, three benchmarks.

What This Measures

OCC gates agent actions through non-transferable, decaying credits. This script tests whether that mechanism improves decision quality and compute efficiency.
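
As a rough mental model (the class and numbers below are illustrative, not the script's actual API), the gate works like a shared wallet that every action must draw from and that loses value over time:

from dataclasses import dataclass

@dataclass
class CreditPool:
    credits: float        # e.g. a 180-credit pool shared by all debate agents
    decay: float = 0.9    # unspent credits lose value each round

    def try_spend(self, cost: float) -> bool:
        # Gate an action: it only happens if enough credit remains.
        if self.credits < cost:
            return False
        self.credits -= cost
        return True

    def end_round(self) -> None:
        # Credits are non-transferable and decay, so hoarding is penalized.
        self.credits *= self.decay

pool = CreditPool(credits=180.0)
for _ in range(3):                       # e.g. three debate rounds
    for agent in ("honest_1", "honest_2", "honest_3", "adversarial"):
        if pool.try_spend(cost=10.0):
            ...                          # the agent would generate a turn here
    pool.end_round()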

Benchmark | What | Key Metric
--- | --- | ---
Multi-Agent Debate | 30 topics, 4 agents (3 honest + 1 adversarial), global credit pool vs equal turns | Accuracy at iso-compute
HumanEval Code | Two-pass OCC: cheap 128-token generation → expensive 1024-token retry only on failures (sketched below) | Token savings at iso-pass@1
TruthfulQA | Three answer strategies: direct, tiered-verify, OCC+abstain | Misconception reduction
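
The two-pass HumanEval strategy is the easiest to picture. A minimal sketch, assuming a generate helper and a test runner (both names are ours, not the script's):

def solve_two_pass(problem, generate, passes_tests):
    # Cheap first attempt: most problems are solved within 128 tokens.
    attempt = generate(problem, max_new_tokens=128)
    if passes_tests(problem, attempt):
        return attempt, 128
    # Only failures earn the expensive 1024-token retry.
    retry = generate(problem, max_new_tokens=1024)
    return retry, 128 + 1024

The token savings come from the second pass being skipped for every problem the cheap pass already solves.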

Quick Start

# Default: Qwen/Qwen3-Coder-30B-A3B-Instruct
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py

Alternative model (less VRAM)

MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_... \
  uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py

Local clone

git clone https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell
cd occ-benchmark-blackwell
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer occ_benchmark_all.py

Custom seed

SEED=123 uv run ... occ_benchmark_all.py

Output

All results land in /app/results/ (override with OUT_DIR):

/app/results/
├── system_info.json          # GPU/PyTorch/CUDA info
├── debate_results.json       # debate benchmark
├── humaneval_results.json    # code benchmark
├── truthfulqa_results.json   # QA benchmark
├── all_results.json          # everything combined
├── report.md                 # summary report
├── checkpoint_debate.json    # partial save after debate (crash recovery)
├── checkpoint_humaneval.json # partial save after HumanEval
└── checkpoint_truthfulqa.json

To package results for sharing:

tar -czf occ_results.tar.gz -C /app results/
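
To inspect the numbers programmatically rather than reading report.md, something like this works (the path assumes the default OUT_DIR):

import json
from pathlib import Path

results_dir = Path("/app/results")            # or wherever OUT_DIR pointed
combined = json.loads((results_dir / "all_results.json").read_text())
print(list(combined.keys()))                  # quick look at what was recorded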

Requirements

Model | VRAM | Time (estimate)
--- | --- | ---
Qwen3-Coder-30B-A3B (default) | ~60 GB (BF16) | 45-90 min
Llama-3.1-8B-Instruct | ~16 GB (BF16) | 15-30 min

Time breakdown: Debate ≈ 60% (480 generations at 512 tokens each), HumanEval ≈ 25% (328 generations), TruthfulQA ≈ 15% (180 generations at 64-256 tokens).

GPU: Blackwell, H200, A100, H100; any fast GPU works.

Safety

  • HumanEval code execution uses isolated subprocesses with 30s timeouts. No generated code runs in-process.
  • All benchmarks use torch.no_grad(). No training happens.
  • No data leaves the machine unless you upload the results.

Determinism

  • SEED=42 by default (env var override)
  • HumanEval uses do_sample=False β€” fully deterministic
  • Debate and TruthfulQA use do_sample=True (sampling is part of the test). Results vary Β±2-5pp between runs.

What To Look For

  1. Debate: OCC 180/3 should beat the baseline at iso-compute. If the pool depletes early, all three OCC configs serve as a mini-sweep.
  2. HumanEval: Should show 70-90% token savings at equal pass@1 vs a naive all-1024 baseline (see the helper after this list).
  3. TruthfulQA: OCC+Abstain should reduce the misconception count vs direct answering.
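
The savings figure in item 2 is just a ratio against the naive budget. A helper with a made-up input (substitute whatever total token count the results report):

def token_savings(occ_tokens: int, n_problems: int = 164, naive_budget: int = 1024) -> float:
    # Fraction of tokens saved vs giving every HumanEval problem the full
    # 1024-token budget (HumanEval has 164 problems).
    return 1.0 - occ_tokens / (n_problems * naive_budget)

print(token_savings(occ_tokens=35_000))   # ~0.79, i.e. roughly 79% savings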

If Something Breaks

  1. OOM: Switch to smaller model (MODEL=meta-llama/Llama-3.1-8B-Instruct)
  2. Model download fails: Check HF_TOKEN (needed for gated models like Llama)
  3. Benchmark crashes mid-run: Checkpoint files in /app/results/ contain partial results from completed benchmarks
  4. Generation hangs: Rare; occasionally seen with MoE models. Kill the run and retry with a different seed.

Expected Numbers (Qwen3-Coder-30B, H200, seed=42)

These are from our prior runs; use them for sanity checking:

Benchmark | Expected
--- | ---
Debate baseline | ~76-83% accuracy
Debate OCC 180/3 | ~80-87% accuracy
HumanEval pass@1 | ~70-76%
HumanEval savings | ~75-88%
TruthfulQA direct misconceptions | ~5-10
TruthfulQA OCC+Abstain misconceptions | ~2-5

Blackwell numbers may differ (likely faster, similar accuracy).

Paper

See narcolepticchicken/occ-stack for the full OCC codebase, literature review, and technical report.

Privacy

This is a private repo. Share only with trusted collaborators. The script and results contain no credentials or secrets.

Questions / Issues

Open a discussion on the repo or contact the repo owner.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "narcolepticchicken/occ-benchmark-blackwell"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.