Add CyberGym and IFBench images to Results section
README.md
CHANGED
|
@@ -65,13 +65,19 @@ coefficient calibration during GRPO training.
 
 ---
 
-##
+## Results
+
+### CyberGym (arXiv:2506.02548)
 
 **CyberGym** — UC Berkeley's large-scale cybersecurity benchmark,
 1,507 real-world vulnerabilities from Google OSS-Fuzz across 188
 projects. No partial credit, no LLM judge — pass requires a valid
 PoC that crashes the pre-patch build.
 
+<p align="center">
+<img src="CyberGym.jpeg" alt="CyberGym Results" width="100%">
+</p>
+
 | Level | Scaffold | pass@4 |
 |---|---|---|
 | Level 0 | Full scaffolding | 62% |
@@ -83,6 +89,12 @@ PoC that crashes the pre-patch build.
 > at 83.1% pass@1 (overall, closed model).
 > Mythoseek is a 10B open-weight alternative.
 
+### IFBench
+
+<p align="center">
+<img src="IFBench.jpeg" alt="IFBench Results" width="100%">
+</p>
+
 ---
 
 ## Intended Use