---
tags:
- ml-intern
---
# OCC Benchmark – Blackwell Edition

**Single-script OCC (Oracle-Credit-Compute) benchmark suite.** One command, three benchmarks.

## What This Measures

OCC gates agent actions through non-transferable, decaying credits. This script tests whether that mechanism improves decision quality and compute efficiency.
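
As a rough sketch of the mechanism (the class, costs, and decay rate below are illustrative, not the script's actual implementation):

```python
# Minimal sketch of OCC-style credit gating. Names, costs, and the decay
# rate are illustrative; the benchmark script's implementation may differ.

class CreditPool:
    """Non-transferable credit pool that decays each round."""

    def __init__(self, budget: float, decay: float = 0.9):
        self.budget = budget      # credits remaining
        self.decay = decay        # multiplicative decay per round

    def try_spend(self, cost: float) -> bool:
        """Gate an action: allow it only if enough credits remain."""
        if cost > self.budget:
            return False
        self.budget -= cost
        return True

    def end_round(self) -> None:
        """Credits decay over time, so hoarding them is penalized."""
        self.budget *= self.decay


pool = CreditPool(budget=180.0)
if pool.try_spend(cost=3.0):      # e.g. one debate turn costs 3 credits
    pass                          # run the (expensive) generation here
pool.end_round()
```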

| Benchmark | What | Key Metric |
|-----------|------|------------|
| **Multi-Agent Debate** | 30 topics, 4 agents (3 honest + 1 adversarial), global credit pool vs equal turns | Accuracy at iso-compute |
| **HumanEval Code** | Two-pass OCC: cheap 128-token generation → expensive 1024-token retry only on failures (sketch below) | Token savings at iso-pass@1 |
| **TruthfulQA** | Three answer strategies: direct, tiered-verify, OCC+abstain | Misconception reduction |
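
A sketch of the HumanEval row's two-pass strategy; `generate` and `run_tests` below are hypothetical stand-ins for the script's model call and sandboxed test runner:

```python
from typing import Callable

def two_pass_solve(
    prompt: str,
    generate: Callable[[str, int], str],  # (prompt, max_new_tokens) -> completion
    run_tests: Callable[[str], bool],     # completion -> passed?
) -> tuple[str, int]:
    """Cheap 128-token attempt first; expensive 1024-token retry only on failure."""
    draft = generate(prompt, 128)
    if run_tests(draft):
        return draft, 128                 # most tasks stop here, saving tokens
    retry = generate(prompt, 1024)
    return retry, 128 + 1024              # failures pay for both passes
```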

## Quick Start

```bash
# Default: Qwen/Qwen3-Coder-30B-A3B-Instruct
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```

### Alternative model (less VRAM)

```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_... \
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
  https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```

### Local clone

```bash
git clone https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell
cd occ-benchmark-blackwell
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer occ_benchmark_all.py
```

### Custom seed

```bash
SEED=123 uv run ... occ_benchmark_all.py
```

## Output

All results land in `/app/results/` (override with `OUT_DIR`):

```
/app/results/
├── system_info.json           # GPU/PyTorch/CUDA info
├── debate_results.json        # debate benchmark
├── humaneval_results.json     # code benchmark
├── truthfulqa_results.json    # QA benchmark
├── all_results.json           # everything combined
├── report.md                  # summary report
├── checkpoint_debate.json     # partial save after debate (crash recovery)
├── checkpoint_humaneval.json  # partial save after HumanEval
└── checkpoint_truthfulqa.json # partial save after TruthfulQA
```
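
To poke at the combined output programmatically, plain JSON reads are enough (the key layout inside each file is whatever the script wrote; the loop below just lists top-level entries):

```python
import json
from pathlib import Path

out_dir = Path("/app/results")            # or read OUT_DIR from the environment
results = json.loads((out_dir / "all_results.json").read_text())
for benchmark, payload in results.items():
    print(benchmark, "->", type(payload).__name__)
```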

**To package results for sharing:**

```bash
tar -czf occ_results.tar.gz -C /app results/
```

## Requirements

| Model | VRAM | Time (estimate) |
|-------|------|-----------------|
| Qwen3-Coder-30B-A3B (default) | ~60 GB BF16 | 45-90 min |
| Llama-3.1-8B-Instruct | ~16 GB BF16 | 15-30 min |

**Time breakdown:** Debate ≈ 60% (480 generations at 512 tok each), HumanEval ≈ 25% (328 generations), TruthfulQA ≈ 15% (180 generations at 64-256 tok).

GPU: Blackwell, H200, A100, H100 – any fast GPU works.

## Safety

- **HumanEval code execution uses isolated subprocesses** with 30s timeouts (see the sketch below). No generated code runs in-process.
- All benchmarks use `torch.no_grad()`. No training happens.
- No data leaves the machine unless you upload the results.
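
A minimal sketch of that isolation pattern, assuming a plain `subprocess` + temp-file harness (the script's real harness may differ in details):

```python
import os
import subprocess
import sys
import tempfile

def run_isolated(code: str, timeout_s: float = 30.0) -> bool:
    """Execute generated code in a separate process; True if it exits cleanly."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,   # don't let child output pollute ours
            timeout=timeout_s,     # hard kill for runaway code
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)            # clean up the temp file
```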

## Determinism

- `SEED=42` by default (env var override; see the seeding sketch below)
- HumanEval uses `do_sample=False` → fully deterministic
- Debate and TruthfulQA use `do_sample=True` (sampling is part of the test). Results vary ±2-5 pp between runs.
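
For reference, a typical way such a `SEED` env var gets wired into seeding (illustrative; the script's exact wiring may differ):

```python
import os
import random

import torch

# Illustrative seeding from the SEED env var.
seed = int(os.environ.get("SEED", "42"))
random.seed(seed)
torch.manual_seed(seed)            # seeds CPU RNG (and CUDA devices)
torch.cuda.manual_seed_all(seed)   # explicit for multi-GPU runs
```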

## What To Look For

1. **Debate:** OCC 180/3 should beat the baseline at iso-compute. If the pool depletes early, all three OCC configs serve as a mini-sweep.
2. **HumanEval:** Should show 70-90% token savings at equal pass@1 vs a naive all-1024 baseline (savings arithmetic below).
3. **TruthfulQA:** OCC+Abstain should reduce the misconception count vs direct answering.
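
The savings arithmetic, under the assumed definition savings = 1 - tokens_used / baseline_tokens (the script may define it differently):

```python
# Token savings vs an all-1024 baseline. This definition is an assumption;
# check the script for the metric it actually reports.

def token_savings(first_pass_ok: int, retried: int) -> float:
    """Fraction of tokens saved relative to generating 1024 tokens per task."""
    n = first_pass_ok + retried
    baseline = n * 1024
    occ = first_pass_ok * 128 + retried * (128 + 1024)
    return 1.0 - occ / baseline


# Example: 120 of 164 tasks pass on the cheap pass, 44 are retried.
print(f"{token_savings(120, 44):.1%}")   # ~61% saved in this toy split
```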

## If Something Breaks

1. **OOM:** Switch to a smaller model (`MODEL=meta-llama/Llama-3.1-8B-Instruct`).
2. **Model download fails:** Check `HF_TOKEN` (needed for gated models like Llama).
3. **Benchmark crashes mid-run:** The checkpoint files in `/app/results/` contain partial results from completed benchmarks.
4. **Generation hangs:** Rare with MoE models. Kill and retry with a different seed.

## Expected Numbers (Qwen3-Coder-30B, H200, seed=42)

These are from our prior runs; use them for sanity checking:

| Benchmark | Expected |
|-----------|----------|
| Debate baseline | ~76-83% accuracy |
| Debate OCC 180/3 | ~80-87% accuracy |
| HumanEval pass@1 | ~70-76% |
| HumanEval savings | ~75-88% |
| TruthfulQA direct misconceptions | ~5-10 |
| TruthfulQA OCC+Abstain misconceptions | ~2-5 |
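
A quick sanity check against these ranges (the JSON keys below are hypothetical; adapt them to the actual schema of `all_results.json`):

```python
import json
from pathlib import Path

# Hypothetical result keys; adapt to the actual schema of all_results.json.
EXPECTED = {
    "humaneval_pass_at_1": (0.70, 0.76),
    "humaneval_token_savings": (0.75, 0.88),
}

results = json.loads(Path("/app/results/all_results.json").read_text())
for key, (lo, hi) in EXPECTED.items():
    value = results.get(key)
    ok = value is not None and lo <= value <= hi
    print(f"{key}: {value} (expected {lo}-{hi}) [{'OK' if ok else 'CHECK'}]")
```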

Blackwell numbers may differ (likely faster, with similar accuracy).

## Paper

See [narcolepticchicken/occ-stack](https://huggingface.co/narcolepticchicken/occ-stack) for the full OCC codebase, literature review, and technical report.

## Privacy

This is a private repo. Share only with trusted collaborators. The script and results contain no credentials or secrets.

## Questions / Issues

Open a discussion on the repo or contact the repo owner.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "narcolepticchicken/occ-benchmark-blackwell"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.