Add HumanEval benchmark results (57.3% pass@1)
README.md
@@ -54,6 +54,22 @@ The architecture co-evolves with training: heads that contribute to the domain s
 | Cycles | 3 |
 | Steps/Cycle | 500 |
 
+## Benchmarks
+
+| Model | Size | HumanEval | HumanEval+ |
+|-------|------|-----------|------------|
+| StarCoder2-3B | 3B | 31.7% | — |
+| Qwen2.5-Coder-3B | 3B | ~31% | — |
+| Phi-2 | 2.7B | 47.6% | — |
+| **qwen3.5-4b-code-forged** | **3.4B** | **57.3%** | **49.4%** |
+
+**About 20% above Phi-2 and 81% above StarCoder2-3B** on HumanEval pass@1 (relative gain) in the sub-5B class.
+
+- **HumanEval**: 57.3% pass@1 (94/164 base problems)
+- **HumanEval+**: 49.4% pass@1 (81/164 problems passing base + extra tests)
+- **Method**: Greedy decoding (temperature 0), single sample, EvalPlus framework
+- **Hardware**: Evaluated in fp16 with HuggingFace transformers on an RTX 5090
+
| 73 |
## Runs On
|
| 74 |
|
| 75 |
| Device | Format | Verified |
|
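
The pass@1 percentages and the relative-gain claim in the Benchmarks hunk can be sanity-checked with simple arithmetic. The sketch below is only an illustration, not the EvalPlus implementation: the helper names `pass_at_1` and `relative_gain` are hypothetical, and the inputs are the solved/total counts and baseline scores quoted in the README. With greedy decoding and a single sample per problem, pass@1 reduces to the fraction of problems solved.

```python
def pass_at_1(solved: int, total: int) -> float:
    """Return pass@1 in percent, assuming one greedy sample per problem."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * solved / total


def relative_gain(score: float, baseline: float) -> float:
    """Return the relative improvement of score over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (score - baseline) / baseline


# Counts and baseline scores as quoted in the README hunk.
humaneval = pass_at_1(94, 164)
humaneval_plus = pass_at_1(81, 164)

print(f"HumanEval:  {humaneval:.1f}%")
print(f"HumanEval+: {humaneval_plus:.1f}%")
print(f"vs Phi-2:         +{relative_gain(57.3, 47.6):.1f}% relative")
print(f"vs StarCoder2-3B: +{relative_gain(57.3, 31.7):.1f}% relative")
```

The gains here are relative, not percentage-point differences; in absolute terms the gaps are 9.7 and 25.6 points respectively.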