---
tags:
- ml-intern
---
# OCC Benchmark – Blackwell Edition
**Single-script OCC (Oracle-Credit-Compute) benchmark suite.** One command, three benchmarks.
## What This Measures
OCC gates agent actions through non-transferable, decaying credits. This script tests whether that mechanism improves decision quality and compute efficiency.
| Benchmark | What | Key Metric |
|-----------|------|-----------|
| **Multi-Agent Debate** | 30 topics, 4 agents (3 honest + 1 adversarial), global credit pool vs equal turns | Accuracy at iso-compute |
| **HumanEval Code** | Two-pass OCC: cheap 128-token generation → expensive 1024-token retry only on failures | Token savings at iso-pass@1 |
| **TruthfulQA** | Three answer strategies: direct, tiered-verify, OCC+abstain | Misconception reduction |
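The credit gating described above is easiest to picture as a small state machine. The sketch below is purely illustrative: the class and method names are invented for this README, the decay rule is a guess, and reading the `OCC 180/3` configuration mentioned under "What To Look For" as "180-credit pool, 3 credits per turn" is an assumption, not something the script guarantees.
```python
# Conceptual sketch of OCC credit gating; not the benchmark's actual code.
from dataclasses import dataclass

@dataclass
class CreditPool:
    credits: float          # shared budget gating the agents' actions
    decay: float = 0.98     # credits decay each round (illustrative rate)

    def tick(self) -> None:
        """Apply per-round decay to whatever is left in the pool."""
        self.credits *= self.decay

    def try_spend(self, cost: float) -> bool:
        """Gate an action: it only runs if the pool can cover its cost."""
        if self.credits >= cost:
            self.credits -= cost
            return True
        return False

# Example: one debate round where each agent turn costs 3 credits.
pool = CreditPool(credits=180.0)
for turn in range(4):
    if pool.try_spend(3.0):
        print(f"turn {turn}: agent speaks, {pool.credits:.1f} credits left")
    else:
        print(f"turn {turn}: pool exhausted, turn skipped")
pool.tick()
```
Because credits decay and cannot be transferred, agents are nudged to spend on turns they expect to be informative; the debate benchmark compares this gating against simply giving every agent equal turns.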
## Quick Start
```bash
# Default: Qwen/Qwen3-Coder-30B-A3B-Instruct
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```
### Alternative model (less VRAM)
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_... \
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```
### Local clone
```bash
git clone https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell
cd occ-benchmark-blackwell
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer occ_benchmark_all.py
```
### Custom seed
```bash
SEED=123 uv run ... occ_benchmark_all.py
```
## Output
All results land in `/app/results/` (override with `OUT_DIR`):
```
/app/results/
├── system_info.json           # GPU/PyTorch/CUDA info
├── debate_results.json        # debate benchmark
├── humaneval_results.json     # code benchmark
├── truthfulqa_results.json    # QA benchmark
├── all_results.json           # everything combined
├── report.md                  # summary report
├── checkpoint_debate.json     # partial save after debate (crash recovery)
├── checkpoint_humaneval.json  # partial save after HumanEval
└── checkpoint_truthfulqa.json
```
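To inspect the combined results programmatically, a minimal loader looks like this (the only assumption is that the files are plain JSON, as the extensions suggest; the top-level keys are whatever the script writes):
```python
import json
from pathlib import Path

results_dir = Path("/app/results")  # or the OUT_DIR you set
combined = json.loads((results_dir / "all_results.json").read_text())
print(list(combined.keys()))        # inspect the top-level structure
```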
**To package results for sharing:**
```bash
tar -czf occ_results.tar.gz -C /app results/
```
## Requirements
| Model | VRAM | Time (estimate) |
|-------|------|-----------------|
| Qwen3-Coder-30B-A3B (default) | ~60GB BF16 | 45-90 min |
| Llama-3.1-8B-Instruct | ~16GB BF16 | 15-30 min |
**Time breakdown:** Debate ≈ 60% (480 generations at 512 tok each), HumanEval ≈ 25% (328 generations), TruthfulQA ≈ 15% (180 generations at 64-256 tok).
GPU: Blackwell, H200, A100, or H100; any fast GPU works.
## Safety
- **HumanEval code execution uses isolated subprocesses** with 30s timeouts. No generated code runs in-process (a generic sketch follows this list).
- All benchmarks use `torch.no_grad()`. No training happens.
- No data leaves the machine unless you upload the results.
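The subprocess isolation mentioned above follows a standard pattern; the sketch below shows the general idea, not the script's exact code, and the helper name `run_candidate` is invented here.
```python
# Generic pattern: run untrusted generated code in a fresh Python process
# with a hard timeout. Illustrative only, not the benchmark's exact code.
import subprocess
import sys

def run_candidate(code: str, timeout_s: int = 30) -> bool:
    """Return True if the generated code plus its tests exit cleanly."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],  # separate interpreter, not in-process
            capture_output=True,
            timeout=timeout_s,             # matches the 30s limit noted above
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False                       # a hang or infinite loop counts as failure
```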
## Determinism
- `SEED=42` by default (env var override)
- HumanEval uses `do_sample=False`, so it is fully deterministic
- Debate and TruthfulQA use `do_sample=True` (sampling is part of the test); results vary ±2-5 pp between runs (the decoding settings are sketched below)
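In `transformers` terms the bullets above correspond roughly to the following. This is a sketch under the assumption that the script calls `generate()` directly; the generation arguments shown are placeholders, not the script's actual values.
```python
import os
from transformers import set_seed

# Seed Python, NumPy and PyTorch RNGs from the SEED env var (default 42).
set_seed(int(os.environ.get("SEED", "42")))

# HumanEval-style decoding: greedy, hence deterministic for a fixed model.
# outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Debate / TruthfulQA-style decoding: seeded sampling, still stochastic
# enough that accuracy moves a few points between runs.
# outputs = model.generate(**inputs, do_sample=True, max_new_tokens=512)
```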
## What To Look For
1. **Debate:** OCC 180/3 should beat the baseline at iso-compute. If the pool depletes early, all three OCC configs serve as a mini-sweep.
2. **HumanEval:** Should show 70-90% token savings at equal pass@1 vs a naive all-1024 baseline (the two-pass logic is sketched below).
3. **TruthfulQA:** OCC+Abstain should reduce misconception count vs direct answering.
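Item 2's two-pass budget is simple enough to write out; the helpers `generate_solution` and `passes_tests` below are placeholders, not functions from the script.
```python
def solve_two_pass(problem, generate_solution, passes_tests):
    """Cheap attempt first; spend the 1024-token budget only on failures."""
    cheap = generate_solution(problem, max_new_tokens=128)      # low-cost pass
    if passes_tests(problem, cheap):
        return cheap, 128                                       # token budget charged (upper bound)
    retry = generate_solution(problem, max_new_tokens=1024)     # expensive retry
    return retry, 128 + 1024

# Token savings are then measured against a naive baseline that always
# pays the full 1024-token budget for every problem.
```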
## If Something Breaks
1. **OOM:** Switch to a smaller model (`MODEL=meta-llama/Llama-3.1-8B-Instruct`)
2. **Model download fails:** Check `HF_TOKEN` (needed for gated models like Llama)
3. **Benchmark crashes mid-run:** Checkpoint files in `/app/results/` contain partial results from completed benchmarks
4. **Generation hangs:** Rare, occasionally seen with MoE models. Kill the run and retry with a different seed.
## Expected Numbers (Qwen3-Coder-30B, H200, seed=42)
These are from our prior runs; use them for sanity checking:
| Benchmark | Expected |
|-----------|----------|
| Debate baseline | ~76-83% accuracy |
| Debate OCC 180/3 | ~80-87% accuracy |
| HumanEval pass@1 | ~70-76% |
| HumanEval savings | ~75-88% |
| TruthfulQA direct misconceptions | ~5-10 |
| TruthfulQA OCC+Abstain misconceptions | ~2-5 |
Blackwell numbers may differ (likely faster, similar accuracy).
## Paper
See [narcolepticchicken/occ-stack](https://huggingface.co/narcolepticchicken/occ-stack) for the full OCC codebase, literature review, and technical report.
## Privacy
This is a private repo. Share only with trusted collaborators. The script and results contain no credentials or secrets.
## Questions / Issues
Open a discussion on the repo or contact the repo owner.
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern