---
tags:
- ml-intern
---
# OCC Benchmark – Blackwell Edition
**Single-script OCC (Oracle-Credit-Compute) benchmark suite.** One command, three benchmarks.
## What This Measures
OCC gates agent actions through non-transferable, decaying credits. This script tests whether that mechanism improves decision quality and compute efficiency.
| Benchmark | What | Key Metric |
|-----------|------|-----------|
| **Multi-Agent Debate** | 30 topics, 4 agents (3 honest + 1 adversarial), global credit pool vs equal turns | Accuracy at iso-compute |
| **HumanEval Code** | Two-pass OCC: cheap 128-token generation → expensive 1024-token retry only on failures | Token savings at iso-pass@1 |
| **TruthfulQA** | Three answer strategies: direct, tiered-verify, OCC+abstain | Misconception reduction |
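The credit gating described above is easiest to picture as a small state machine. The sketch below is purely illustrative: the class and method names are invented for this README, the decay rule is a guess, and reading the `OCC 180/3` configuration mentioned under "What To Look For" as "180-credit pool, 3 credits per turn" is an assumption, not something the script guarantees.
```python
# Conceptual sketch of OCC credit gating; not the benchmark's actual code.
from dataclasses import dataclass

@dataclass
class CreditPool:
    credits: float          # shared budget gating the agents' actions
    decay: float = 0.98     # credits decay each round (illustrative rate)

    def tick(self) -> None:
        """Apply per-round decay to whatever is left in the pool."""
        self.credits *= self.decay

    def try_spend(self, cost: float) -> bool:
        """Gate an action: it only runs if the pool can cover its cost."""
        if self.credits >= cost:
            self.credits -= cost
            return True
        return False

# Example: one debate round where each agent turn costs 3 credits.
pool = CreditPool(credits=180.0)
for turn in range(4):
    if pool.try_spend(3.0):
        print(f"turn {turn}: agent speaks, {pool.credits:.1f} credits left")
    else:
        print(f"turn {turn}: pool exhausted, turn skipped")
pool.tick()
```
Because credits decay and cannot be transferred, agents are nudged to spend on turns they expect to be informative; the debate benchmark compares this gating against simply giving every agent equal turns.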
## Quick Start
```bash
# Default: Qwen/Qwen3-Coder-30B-A3B-Instruct
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```
### Alternative model (less VRAM)
```bash
MODEL=meta-llama/Llama-3.1-8B-Instruct HF_TOKEN=hf_... \
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer \
https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell/resolve/main/occ_benchmark_all.py
```
### Local clone
```bash
git clone https://huggingface.co/narcolepticchicken/occ-benchmark-blackwell
cd occ-benchmark-blackwell
uv run --with transformers --with torch --with accelerate --with datasets --with hf-transfer occ_benchmark_all.py
```
### Custom seed
```bash
SEED=123 uv run ... occ_benchmark_all.py
```
## Output
All results land in `/app/results/` (override with `OUT_DIR`):
```
/app/results/
├── system_info.json           # GPU/PyTorch/CUDA info
├── debate_results.json        # debate benchmark
├── humaneval_results.json     # code benchmark
├── truthfulqa_results.json    # QA benchmark
├── all_results.json           # everything combined
├── report.md                  # summary report
├── checkpoint_debate.json     # partial save after debate (crash recovery)
├── checkpoint_humaneval.json  # partial save after HumanEval
└── checkpoint_truthfulqa.json
```
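To inspect the combined results programmatically, a minimal loader looks like this (the only assumption is that the files are plain JSON, as the extensions suggest; the top-level keys are whatever the script writes):
```python
import json
from pathlib import Path

results_dir = Path("/app/results")  # or the OUT_DIR you set
combined = json.loads((results_dir / "all_results.json").read_text())
print(list(combined.keys()))        # inspect the top-level structure
```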
**To package results for sharing:**
```bash
tar -czf occ_results.tar.gz -C /app results/
```
## Requirements
| Model | VRAM | Time (estimate) |
|-------|------|-----------------|
| Qwen3-Coder-30B-A3B (default) | ~60GB BF16 | 45-90 min |
| Llama-3.1-8B-Instruct | ~16GB BF16 | 15-30 min |
**Time breakdown:** Debate ≈ 60% (480 generations at 512 tok each), HumanEval ≈ 25% (328 generations), TruthfulQA ≈ 15% (180 generations at 64-256 tok).
GPU: Blackwell, H200, A100, or H100; any fast GPU works.
## Safety
- **HumanEval code execution uses isolated subprocesses** with 30s timeouts. No generated code runs in-process (a generic sketch follows this list).
- All benchmarks use `torch.no_grad()`. No training happens.
- No data leaves the machine unless you upload the results.
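The subprocess isolation mentioned above follows a standard pattern; the sketch below shows the general idea, not the script's exact code, and the helper name `run_candidate` is invented here.
```python
# Generic pattern: run untrusted generated code in a fresh Python process
# with a hard timeout. Illustrative only, not the benchmark's exact code.
import subprocess
import sys

def run_candidate(code: str, timeout_s: int = 30) -> bool:
    """Return True if the generated code plus its tests exit cleanly."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],  # separate interpreter, not in-process
            capture_output=True,
            timeout=timeout_s,             # matches the 30s limit noted above
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False                       # a hang or infinite loop counts as failure
```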
## Determinism
- `SEED=42` by default (env var override)
- HumanEval uses `do_sample=False`, so it is fully deterministic
- Debate and TruthfulQA use `do_sample=True` (sampling is part of the test); results vary ±2-5 pp between runs (the decoding settings are sketched below)
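In `transformers` terms the bullets above correspond roughly to the following. This is a sketch under the assumption that the script calls `generate()` directly; the generation arguments shown are placeholders, not the script's actual values.
```python
import os
from transformers import set_seed

# Seed Python, NumPy and PyTorch RNGs from the SEED env var (default 42).
set_seed(int(os.environ.get("SEED", "42")))

# HumanEval-style decoding: greedy, hence deterministic for a fixed model.
# outputs = model.generate(**inputs, do_sample=False, max_new_tokens=128)

# Debate / TruthfulQA-style decoding: seeded sampling, still stochastic
# enough that accuracy moves a few points between runs.
# outputs = model.generate(**inputs, do_sample=True, max_new_tokens=512)
```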
## What To Look For
1. **Debate:** OCC 180/3 should beat the baseline at iso-compute. If the pool depletes early, all three OCC configs serve as a mini-sweep.
2. **HumanEval:** Should show 70-90% token savings at equal pass@1 vs a naive all-1024 baseline (the two-pass logic is sketched below).
3. **TruthfulQA:** OCC+Abstain should reduce misconception count vs direct answering.
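Item 2's two-pass budget is simple enough to write out; the helpers `generate_solution` and `passes_tests` below are placeholders, not functions from the script.
```python
def solve_two_pass(problem, generate_solution, passes_tests):
    """Cheap attempt first; spend the 1024-token budget only on failures."""
    cheap = generate_solution(problem, max_new_tokens=128)      # low-cost pass
    if passes_tests(problem, cheap):
        return cheap, 128                                       # token budget charged (upper bound)
    retry = generate_solution(problem, max_new_tokens=1024)     # expensive retry
    return retry, 128 + 1024

# Token savings are then measured against a naive baseline that always
# pays the full 1024-token budget for every problem.
```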
## If Something Breaks
1. **OOM:** Switch to a smaller model (`MODEL=meta-llama/Llama-3.1-8B-Instruct`)
2. **Model download fails:** Check `HF_TOKEN` (needed for gated models like Llama)
3. **Benchmark crashes mid-run:** Checkpoint files in `/app/results/` contain partial results from completed benchmarks
4. **Generation hangs:** Rare, occasionally seen with MoE models. Kill the run and retry with a different seed.
## Expected Numbers (Qwen3-Coder-30B, H200, seed=42)
These are from our prior runs; use them for sanity checking:
| Benchmark | Expected |
|-----------|----------|
| Debate baseline | ~76-83% accuracy |
| Debate OCC 180/3 | ~80-87% accuracy |
| HumanEval pass@1 | ~70-76% |
| HumanEval savings | ~75-88% |
| TruthfulQA direct misconceptions | ~5-10 |
| TruthfulQA OCC+Abstain misconceptions | ~2-5 |
Blackwell numbers may differ (likely faster, similar accuracy).
## Paper
See [narcolepticchicken/occ-stack](https://huggingface.co/narcolepticchicken/occ-stack) for the full OCC codebase, literature review, and technical report.
## Privacy
This is a private repo. Share only with trusted collaborators. The script and results contain no credentials or secrets.
## Questions / Issues
Open a discussion on the repo or contact the repo owner.
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern