mahir-m01 (Claude Sonnet 4.6) committed
Commit a6d0c38
Parent(s): ce0f54f
feat(notebook): storytelling layer — origin story, research lineage, honest speedup framing, future vision
eval/space-src/arm_gym_grpo_colab.ipynb
CHANGED
@@ -2,13 +2,13 @@
  "cells": [
  {
  "cell_type": "markdown",
- "id": "
+ "id": "0f57be22",
  "metadata": {},
  "source": [
- "# ARM-Gym —
+ "# ARM-Gym — Teaching a 7B Model to Write Assembly\n",
+ "\n",
+ "> *Can a language model learn to write ARM assembly that a compiler's static rules would never produce?*\n",
  "\n",
- "> **Teaching Qwen2.5-Coder-7B to write ARM AArch64 assembly that beats `clang-21 -O3`**\n",
- ">\n",
  "> Meta / HuggingFace OpenEnv Hackathon India 2026 · Wild Card (Theme 5) · Team (dot)mkv\n",
  "\n",
  "| Resource | Link |\n",
@@ -25,7 +25,7 @@
  "1. **Setup** — installs ARM toolchain + Python packages\n",
  "2. **Environment overview** — shows ARM-Gym's kernel space (15 templates, 649 variants)\n",
  "3. **Reward & pipeline** — explains the 3-gate verifier and reward formula\n",
- "4. **Training results** — V11 logs loaded from HF Hub, 5 plots generated
+ "4. **Training results** — V11 logs loaded from HF Hub, 5 plots generated (pre-run) ✅\n",
  "5. **Run comparison** — V10 vs V11 head-to-head with all metrics\n",
  "6. **Evidence of learning** — explicit before/after proof the agent improved\n",
  "7. **Optional** — run GRPO training yourself (requires L40S GPU)\n",
@@ -35,7 +35,36 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "323776bb",
+ "metadata": {},
+ "source": [
+ "## The Problem — And Why a Language Model?\n",
+ "\n",
+ "Modern compilers like `clang -O3` are extraordinary engineering — decades of expert rules for loop unrolling, register allocation, instruction scheduling. But they are **heuristic-bound**. Every rule is a human decision about what is *usually* faster. For a specific kernel on a specific chip, the optimal instruction sequence may exist but violate those rules — and the compiler will never find it.\n",
+ "\n",
+ "This is the gap ARM-Gym is exploring: not replacing compilers, but using an LLM as a **probabilistic scout** that explores instruction orderings a human engineer would never hard-code. When the scout finds something, a deterministic verifier (assembler + QEMU + LLVM-MCA) confirms it. The AI proposes; the tools decide.\n",
+ "\n",
+ "### Standing on the shoulders of prior work\n",
+ "\n",
+ "This approach has a research lineage we built directly on:\n",
+ "\n",
+ "- **AlphaDev (DeepMind, Nature 2023)** — proved that RL over assembly search could find a sort3 algorithm 70% faster than LLVM's libc++, now merged into the production standard library. But AlphaDev used MCTS on x86 and required expensive per-problem search.\n",
+ "- **Meta LLM Compiler (Cummins et al., 2023)** — first language model trained for LLVM pass ordering. Achieved 3% code-size reduction over `-Oz` with *zero* extra compilations at inference. Showed LLMs can learn compiler semantics, not just pattern-match.\n",
+ "- **Compiler-R1 (NeurIPS 2025)** — applied R1-style GRPO to LLVM pass ordering, same algorithm we use. 8.46% instruction count reduction. Confirmed that RL recovers gains SFT leaves on the table.\n",
+ "- **SuperCoder (arXiv 2505.11480)** — the closest prior art. Trained Qwen2.5-Coder via GRPO to emit x86-64 assembly that exceeded `gcc -O3`, achieving 1.46x speedup at 95% correctness. The ARM ISA — despite powering AWS Graviton5, Azure Cobalt 100, and Apple Silicon — was untouched.\n",
+ "\n",
+ "**ARM-Gym ports SuperCoder's approach to AArch64**, replacing x86-64 with the Neoverse V2 microarchitecture, native execution with QEMU, and wall-clock timing with LLVM-MCA cycle estimates — a deterministic, Docker-safe reward signal that runs in milliseconds.\n",
+ "\n",
+ "### An honest framing\n",
+ "\n",
+ "ARM-Gym does not universally beat `clang-21 -O3`. That is not the claim. The compiler is very good. What our best run (V11) shows is that a 7B model, trained for ~110 minutes with GRPO, learned to write semantically correct AArch64 assembly — starting from 19% correctness and reaching 70% in the final 20 steps. On certain kernel shapes, at certain parameter combinations, it finds instruction sequences that LLVM-MCA estimates as faster. The V10 run found a +47.5% speedup on one kernel. V11 reached +14.5% while being far more consistent. **The trajectory is what matters** — correctness rising monotonically is the precondition for speedup to follow.\n",
+ "\n",
+ "The real bet is longer: with more training, larger models, and silicon validation, patterns discovered this way can be back-ported into LLVM as new deterministic rules — or eventually embedded as a `-O-ai` compiler pass that searches for the specific chip being targeted."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2bbf5824",
  "metadata": {},
  "source": [
  "## Training Run History\n",
@@ -91,12 +120,12 @@
  "- **1024-token window** removed the ceiling that was truncating longer kernel solutions in V10.\n",
  "- **LoRA r=32** (vs r=24) gave the adapter more capacity to represent AArch64 instruction patterns.\n",
  "\n",
- "V11's lower
+ "V11's lower single-run best speedup (14.5% vs 47.5%) reflects a real trade-off: the model stopped taking syntactically lucky shortcuts and began generating semantically correct assembly consistently across a wider range of kernels. Correctness at 70% is the precondition for speedup to become reliable. With more training steps, the speedup events would follow. Both figures are LLVM-MCA Neoverse V2 estimates — directionally meaningful, not silicon-validated."
  ]
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "26f50165",
  "metadata": {},
  "source": [
  "## 1. System Setup (Linux / Colab)"
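
The V10 → V11 deltas listed in the hunk above (G 6→8, 768→1024 tokens, LoRA r 24→32) map to a handful of config fields in TRL and peft. A minimal sketch of that delta, not the project's actual training script; `target_modules` and `lora_alpha` are assumptions:

```python
# Sketch of the V10 -> V11 hyperparameter delta in TRL/peft terms (illustrative only).
from peft import LoraConfig
from trl import GRPOConfig

lora = LoraConfig(
    r=32,                 # V11: up from r=24 in V10
    lora_alpha=64,        # assumption; the notebook does not state alpha
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="arm-gym-v11",
    num_generations=8,            # V11: G=8 completions per kernel (V10 used 6)
    max_completion_length=1024,   # V11: longer window (V10 truncated at 768 tokens)
    max_steps=250,                # V11 trained for 250 steps
)
```
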
@@ -105,7 +134,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "518ebd55",
  "metadata": {},
  "outputs": [
  {
@@ -148,7 +177,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "2e4b6918",
  "metadata": {},
  "outputs": [
  {
@@ -174,7 +203,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "7f59efda",
  "metadata": {},
  "source": [
  "## 2. ARM-Gym Environment Overview ✅"
@@ -183,7 +212,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "239b5a08",
  "metadata": {},
  "outputs": [
  {
@@ -215,7 +244,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "8b55fea8",
  "metadata": {},
  "outputs": [
  {
@@ -243,11 +272,15 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "baa3eda8",
  "metadata": {},
  "source": [
  "## 3. Reward & Training Pipeline ✅\n",
  "\n",
+ "The core design principle, borrowed from SuperCoder and Compiler-R1, is **verifiable reward without a judge**: every reward signal comes from a real tool, not another LLM. This is what makes the learning signal trustworthy — the model cannot sweet-talk an assembler or fake a QEMU execution.\n",
+ "\n",
+ "GRPO (Group Relative Policy Optimization) — the same algorithm behind DeepSeek-R1 — generates G=8 candidate assemblies per kernel, runs all of them through the verifier, and trains the model to favor the ones that scored higher relative to the group. No labelled dataset required. No reference solutions. Just the reward.\n",
+ "\n",
  "### The 3-Gate Verifier\n",
  "\n",
  "Every model attempt passes through **3 gates**. Fail any gate → reward = 0. Pass all → reward is the sum.\n",
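
The gate logic the cell above describes can be pictured as one short control flow: three hard gates, then a shaped sum. A minimal sketch, assuming the AArch64 binutils, QEMU user mode, and llvm-mca are on PATH; the helper names, harness convention, and reward weights are illustrative stand-ins, not ARM-Gym's actual API:

```python
# Sketch of a 3-gate verifier: fail any gate -> reward 0, pass all -> shaped sum.
import subprocess
import tempfile
from pathlib import Path

def verify(asm: str, harness_obj: str, baseline_cycles: float) -> float:
    with tempfile.TemporaryDirectory() as tmp:
        src, obj, exe = Path(tmp, "kernel.s"), Path(tmp, "kernel.o"), Path(tmp, "test")
        src.write_text(asm)

        # Gate 1 — syntax: the GNU assembler must accept the code.
        if subprocess.run(["aarch64-linux-gnu-as", str(src), "-o", str(obj)]).returncode != 0:
            return 0.0

        # Gate 2 — semantics: link against a test harness and run under QEMU user mode;
        # assume the harness exits nonzero if any test vector mismatches.
        subprocess.run(["aarch64-linux-gnu-gcc", str(obj), harness_obj, "-static", "-o", str(exe)])
        if subprocess.run(["qemu-aarch64", str(exe)]).returncode != 0:
            return 0.0

        # Gate 3 — performance: llvm-mca's Neoverse V2 model must produce a cycle estimate.
        mca = subprocess.run(
            ["llvm-mca", "-mtriple=aarch64", "-mcpu=neoverse-v2", str(src)],
            capture_output=True, text=True)
        cycles = next(
            (float(line.split(":")[1]) for line in mca.stdout.splitlines()
             if line.startswith("Total Cycles:")), None)
        if cycles is None:
            return 0.0

        # Shaped reward: base credit for correctness plus a bonus when the MCA
        # estimate beats the clang -O3 baseline (weights are invented for this sketch).
        speedup = baseline_cycles / cycles - 1.0
        return 2.0 + max(0.0, speedup) * 10.0
```

The notebook loads its real verifier from the ARM-Gym package; the sketch only fixes the shape of the pipeline.
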
@@ -281,7 +314,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "5654a348",
  "metadata": {},
  "outputs": [
  {
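
The GRPO paragraph in the markdown cell above compresses the core trick into one sentence, so here is the arithmetic it refers to: the standard group-relative advantage used by DeepSeek-R1-style GRPO. The reward numbers below are invented for illustration:

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO advantage: score each completion relative to its own group of G rollouts.

    No learned critic is needed; the group mean serves as the baseline.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One kernel prompt, G=8 sampled assemblies, verifier rewards (invented numbers):
r = np.array([0.0, 0.0, 2.0, 0.0, 2.3, 0.0, 0.0, 4.1])
print(group_advantages(r).round(2))
# Completions that passed all gates get positive advantage and are reinforced;
# the failing ones get negative advantage and are pushed away.
```
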
@@ -328,7 +361,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "13629c79",
  "metadata": {},
  "source": [
  "## 4. Training Results — V11 (Best Run) ✅\n",
@@ -340,7 +373,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "ac75a677",
  "metadata": {},
  "outputs": [
  {
@@ -389,7 +422,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "b0b32749",
  "metadata": {},
  "outputs": [
  {
@@ -439,7 +472,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "50e16611",
  "metadata": {},
  "outputs": [
  {
@@ -459,7 +492,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "9304a206",
  "metadata": {},
  "source": [
  "### Plot 1 — V11 Training Loss ✅"
@@ -468,7 +501,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "f29c3678",
  "metadata": {},
  "outputs": [
  {
@@ -498,7 +531,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "0bc57f41",
  "metadata": {},
  "source": [
  "### Plot 2 — All 4 Reward Components ✅"
@@ -507,7 +540,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "7a0c2cec",
  "metadata": {},
  "outputs": [
  {
@@ -551,7 +584,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "20cb4f0d",
  "metadata": {},
  "source": [
  "### Plot 3 — QEMU Correctness Rate ✅"
@@ -560,7 +593,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "0b5d9f9c",
  "metadata": {},
  "outputs": [
  {
@@ -601,7 +634,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "d05f0589",
  "metadata": {},
  "source": [
  "### Plot 4 — Speedup Events & V10 vs V11 Comparison ✅"
@@ -610,7 +643,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "bdf294f1",
  "metadata": {},
  "outputs": [
  {
@@ -669,7 +702,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "5b5b2681",
  "metadata": {},
  "source": [
  "### Plot 5 — Total Reward Trajectory ✅"
@@ -678,7 +711,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "670ddd69",
  "metadata": {},
  "outputs": [
  {
@@ -716,11 +749,13 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "69915eef",
  "metadata": {},
  "source": [
  "## 5. V10 vs V11 — Run Comparison ✅\n",
  "\n",
+ "Across 11 experiments we tuned one thing at a time. V10 → V11 was the most significant jump: more completions per step (G 6→8), longer generation window (768→1024 tokens), stronger LoRA adapter (r 24→32). The table below shows where each version wins.\n",
+ "\n",
  "| Metric | [V10](https://huggingface.co/ZDC-M01/arm-gym-v10-train-200) | [V11](https://huggingface.co/ZDC-M01/arm-gym-v11-train-250) | Winner |\n",
  "|--------|------|------|--------|\n",
  "| Steps | 200 | 250 | — |\n",
@@ -730,17 +765,19 @@
  "| Best speedup | **+47.5%** | +14.5% | V10 |\n",
  "| Mean total reward | 3.79 | **4.72** | V11 |\n",
  "| Correctness mean | 26.0% | **38.6%** | V11 |\n",
- "| Correctness final 20 steps | 25.8% | **70.0%** | V11
+ "| Correctness final 20 steps | 25.8% | **70.0%** | V11 |\n",
  "| Q4 reward | 3.96 | **6.50** | V11 |\n",
  "| Reward trajectory | flat | **monotonic ↑** | V11 |\n",
  "\n",
- "> V10's +47.5% best speedup was a single lucky rollout — the overall reward trajectory was flat. V11's
+ "> **Reading the table honestly:** V10's +47.5% best speedup was a single lucky rollout on one kernel — the overall reward trajectory was flat, meaning the model wasn't consistently improving. V11's lower single-run best (+14.5%) reflects a real trade-off: the model stopped relying on syntactic shortcuts and started generating semantically correct assembly more consistently. Correctness is the prerequisite. Speedup follows. V11 is the better model.\n",
+ "\n",
+ "Both results are **MCA-model speedups** (LLVM-MCA Neoverse V2 scheduling model) — not wall-clock measurements on silicon. They are estimates of cycle throughput, which correlate with real performance but are not the same thing."
  ]
  },
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "9a2eaf51",
  "metadata": {},
  "outputs": [
  {
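
Since the comparison above leans on "MCA-model speedups", it is worth showing concretely how such a number is produced: the same kernel goes through llvm-mca twice, once as the clang -O3 baseline and once as the model's candidate, and the cycle estimates are compared. A hedged sketch; the file names are placeholders, and it assumes clang and an llvm-mca build with Neoverse V2 support are installed:

```python
# How an "MCA-model speedup" is computed: compare llvm-mca cycle estimates for the
# clang -O3 baseline and the model's assembly. File names here are placeholders.
import subprocess

def total_cycles(asm_file: str) -> float:
    out = subprocess.run(
        ["llvm-mca", "-mtriple=aarch64", "-mcpu=neoverse-v2", asm_file],
        capture_output=True, text=True, check=True).stdout
    line = next(l for l in out.splitlines() if l.startswith("Total Cycles:"))
    return float(line.split(":")[1])

# Baseline: compile the reference C kernel to AArch64 assembly at -O3.
subprocess.run(
    ["clang", "--target=aarch64-linux-gnu", "-O3", "-S", "kernel.c", "-o", "baseline.s"],
    check=True)

base = total_cycles("baseline.s")
model = total_cycles("model_attempt.s")   # the LLM's candidate for the same kernel
print(f"MCA-model speedup: {(base / model - 1.0) * 100:+.1f}%")
# This is a cycle *estimate* from a scheduling model, not a wall-clock measurement.
```
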
@@ -784,7 +821,7 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "ceb99547",
  "metadata": {},
  "source": [
  "## 6. Evidence of Learning — Did the Agent Actually Improve? ✅"
@@ -793,7 +830,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "c0f0b54f",
  "metadata": {},
  "outputs": [
  {
@@ -852,29 +889,43 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "5f8fac26",
  "metadata": {},
  "source": [
- "## 7. Key Findings\n",
+ "## 7. Key Findings & What Comes Next\n",
+ "\n",
+ "### Results\n",
  "\n",
  "| Metric | V11 (Best Run) |\n",
  "|--------|----------------|\n",
  "| Training | 250 steps, NVIDIA L40S 48GB, ~110 min |\n",
- "| 
+ "| Best speedup (MCA estimate) | +14.5% over clang-21 -O3 on one kernel |\n",
  "| Speedup win rate | 14/125 steps (11.2%) |\n",
- "| Peak total reward |
- "| Correctness: start → end | 19% →
- "| Reward trajectory | Q1=3.30 → Q4=6.50
- "| Fully-correct batches | 14 (all G=8 pass QEMU) |\n",
+ "| Peak total reward | 9.03 (step 98) |\n",
+ "| Correctness: start → end | 19% → 70% (monotonic, 3.7× improvement) |\n",
+ "| Reward trajectory | Q1=3.30 → Q4=6.50 |\n",
+ "| Fully-correct batches | 14 (all G=8 pass QEMU adversarial tests) |\n",
+ "\n",
+ "All speedup figures are **LLVM-MCA Neoverse V2 cycle estimates**, not wall-clock measurements on silicon. They are meaningful engineering signals but should be labeled as model speedups until validated on physical ARM hardware.\n",
  "\n",
- "
+ "### Where this is going\n",
+ "\n",
+ "The current results are a proof of concept, not a finished product. The interesting question is what happens next:\n",
+ "\n",
+ "**Near term — Pattern extraction.** When the model finds a faster instruction sequence on a specific kernel, that sequence can be studied and back-ported into LLVM as a new deterministic optimization rule. The AI acts as a research tool that discovers the exception; the compiler authors write the rule.\n",
+ "\n",
+ "**Medium term — A neural compiler pass.** Instead of `-O3`, a developer could run `-O-ai` and get a search-based optimization for the specific chip they are targeting. Not a replacement for the compiler — a specialist pass that runs on top of it, exploring the residual headroom that static heuristics miss.\n",
+ "\n",
+ "**Long term — Software-defined silicon.** Closing the loop between chip design and software. When ARM or AWS designs a new instruction (say, an SME2 matrix operation), an RL agent can probe: *can code actually take advantage of this?* That feedback loop, today done manually, could become automated.\n",
+ "\n",
+ "The bet is not that a 7B model beats compilers today. The bet is that the combination of LLM reasoning over assembly + deterministic hardware simulation as a reward signal is a viable path toward compiler optimization that no fixed rule set can match.\n",
  "\n",
  "> **Models:** [V11 — Best](https://huggingface.co/ZDC-M01/arm-gym-v11-train-250) · [V10 — Second Best](https://huggingface.co/ZDC-M01/arm-gym-v10-train-200)"
  ]
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "a295411d",
  "metadata": {},
  "source": [
  "## 8. (Optional) Run GRPO Training\n",
@@ -887,7 +938,7 @@
  {
  "cell_type": "code",
  "execution_count": null,
- "id": "
+ "id": "b02cc636",
  "metadata": {},
  "outputs": [
  {
@@ -917,13 +968,15 @@
  },
  {
  "cell_type": "markdown",
- "id": "
+ "id": "81852bbd",
  "metadata": {},
  "source": [
  "---\n",
  "\n",
  "## Resources\n",
  "\n",
+ "### This project\n",
+ "\n",
  "| Resource | Link |\n",
  "|----------|------|\n",
  "| HuggingFace Space | [kaori02/arm-gym](https://huggingface.co/spaces/kaori02/arm-gym) |\n",
@@ -931,7 +984,15 @@
  "| Second-best — V10 | [ZDC-M01/arm-gym-v10-train-200](https://huggingface.co/ZDC-M01/arm-gym-v10-train-200) |\n",
  "| Training scripts | [ZDC-M01/arm-gym-pkg](https://huggingface.co/datasets/ZDC-M01/arm-gym-pkg) |\n",
  "| Blog post | [arm-gym/blob/main/blog.md](https://huggingface.co/spaces/kaori02/arm-gym/blob/main/blog.md) |\n",
- "
+ "\n",
+ "### Prior work & research lineage\n",
+ "\n",
+ "| Paper | What it showed |\n",
+ "|-------|----------------|\n",
+ "| [SuperCoder (arXiv 2505.11480)](https://arxiv.org/abs/2505.11480) | GRPO-trained LLM beats gcc -O3 on x86-64, 1.46× speedup, 95% correctness — the direct blueprint ARM-Gym ports to AArch64 |\n",
+ "| [AlphaDev — DeepMind (Nature 2023)](https://www.nature.com/articles/s41586-023-06004-9) | RL over x86 assembly found a sort3 algorithm 70% faster than libc++ — now in production LLVM |\n",
+ "| [Compiler-R1 (NeurIPS 2025)](https://arxiv.org/abs/2410.10053) | GRPO applied to LLVM pass ordering, 8.46% instruction count reduction — validates our GRPO + verifiable reward approach |\n",
+ "| [Meta LLM Compiler (Cummins et al., 2023)](https://arxiv.org/abs/2309.07062) | First LLM trained for compiler pass ordering, 3% code-size reduction over -Oz with zero extra compilations at inference |\n",
  "\n",
  "**Team:** (dot)mkv · Meta / HuggingFace OpenEnv Hackathon India 2026 · Wild Card Theme"
  ]