# qwen3.5-4b-code-forged

**+26.6% better than baseline.** Forged from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) for **code** tasks.

**Not quantized. Not distilled. Structurally reshaped.**

The architecture co-evolves with training: heads that contribute to the domain specialize, heads that don't are removed. The result is a model architecturally optimized for its task, much like biological synaptic pruning during brain development.

## Results

| Cycles | 3 |
| Steps/Cycle | 500 |

## Runs On

| Device | Format | Verified |
|--------|--------|----------|
| MacBook Pro 16GB | fp16 | Yes |
| MacBook Pro 32GB | fp16 | Yes |

These models are designed for **consumer hardware**. No A100s required. Your MacBook, your gaming PC, your home server.

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading code reconstructed; replace model_id with the full Hub path of this model.
model_id = "qwen3.5-4b-code-forged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a function that checks whether a string is a palindrome.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)  # temperature value assumed
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```

The forge script auto-detects your GPU, picks the right memory tier (fp16 or 4-bit NF4), trains with LoRA + AMP, prunes attention heads, defrags, and saves. Progress is observable via `status.json`.
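
If you want to watch a run programmatically, a minimal sketch like the one below works, assuming the forge run writes `status.json` into its working directory; the path and the field names used here (`cycle`, `step`, `loss`) are placeholders, since the exact schema isn't documented in this card.

```python
import json
import pathlib
import time

# Path assumed: point this at wherever your forge run writes status.json.
status_path = pathlib.Path("status.json")

while True:
    if status_path.exists():
        status = json.loads(status_path.read_text())
        # Field names are placeholders - check your own status.json for the real keys.
        print(status.get("cycle"), status.get("step"), status.get("loss"))
    time.sleep(30)
```
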

## The Science: Experiential Plasticity

Traditional model compression (quantization, distillation) makes models **smaller but worse**. Experiential Plasticity makes them **smaller AND better**.

### How It Works

1. **Train** on domain-specific data (LoRA + AMP mixed precision)
2. **Measure** each attention head's information contribution (entropy-based importance; see the sketch after this list)
3. **Prune** the lowest-contributing heads
4. **Retrain** on the same domain data; surviving heads specialize and compensate
5. **Defrag**: structurally remove dead heads and free VRAM
6. **Repeat**: each cycle, the model improves further on its domain
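
As a rough illustration of steps 2-3, the sketch below scores heads by the entropy of their attention distributions and selects the lowest scorers for pruning. It is a minimal sketch, not the sentinel-ai implementation: the attention tensors are assumed to come from a calibration forward pass with `output_attentions=True`, and the sign convention (treating near-uniform, high-entropy heads as low contributors) and the 10% prune fraction are assumptions.

```python
import torch

def head_importance(attn: torch.Tensor) -> torch.Tensor:
    """attn: [batch, heads, q_len, k_len] attention probabilities from one layer.

    Heads whose attention is close to uniform (high entropy) are scored as
    contributing less; focused, low-entropy heads score higher. This exact
    scoring rule is an assumption for illustration.
    """
    eps = 1e-9
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # [batch, heads, q_len]
    return -entropy.mean(dim=(0, 2))                    # [heads], higher = more important

def select_heads_to_prune(per_layer_attn, prune_fraction=0.10):
    """Return (layer, head) pairs with the lowest importance across the model."""
    scored = []
    for layer, attn in enumerate(per_layer_attn):
        for head, score in enumerate(head_importance(attn)):
            scored.append((score.item(), layer, head))
    scored.sort()                                       # least important first
    n_prune = int(len(scored) * prune_fraction)
    return [(layer, head) for _, layer, head in scored[:n_prune]]
```
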

### Scaling Law

Larger models harbor more architectural redundancy. Plasticity exploits this, so bigger models benefit more:

| Model | Params | Domain | Improvement |
|-------|--------|--------|------------|
| Qwen2.5-0.5B | 0.5B | General | -3.2% (too small to prune) |
| Qwen2.5-1.5B | 1.5B | General | +3.0% |
| Qwen2.5-7B | 7.6B | General | +11.8% |
| **Qwen3.5-4B** | **3.4B** | **Code** | **+24.0%** |
| **Qwen3.5-27B** | **23.6B** | **Code** | **+3.5%** (4-bit, runs in 17GB) |

Domain-specific training amplifies the effect. Qwen3.5-4B on code (+24%) exceeds Qwen2.5-7B on generic text (+11.8%) despite being a smaller model.

### Transfer Function

Recovery from iterative pruning follows a measurable exponential decay:

```
recovery = 1.45 * exp(-0.18 * cycle) - 0.03
```

This connects transformer optimization to classical control theory, the same mathematics used in electrical engineering and robotics for decades. A PID controller can manage the entire forging process with no human-tuned hyperparameters.
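
To make the control-theory claim concrete, here is a minimal sketch, assuming the transfer function above is used as a feed-forward model: it predicts how much recovery headroom remains at each cycle and scales the prune fraction accordingly. The proportional-control form, the base prune fraction, and the floor are assumptions for illustration, not the sentinel-ai controller.

```python
import math

def predicted_recovery(cycle: int) -> float:
    """Transfer function from the model card."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

def next_prune_fraction(cycle: int, base: float = 0.10, floor: float = 0.02) -> float:
    """Proportional control (illustrative): prune aggressively while predicted
    recovery headroom is high, back off as it shrinks. Gains and bounds are assumptions."""
    headroom = max(predicted_recovery(cycle), 0.0)
    return max(floor, min(base * headroom, base))

for cycle in (1, 2, 3):
    print(cycle, round(predicted_recovery(cycle), 3), round(next_prune_fraction(cycle), 3))
# predicted recovery is roughly 1.18, 0.98, and 0.81 for cycles 1-3
```
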

### Continuous Defrag

Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles:

```
Cycle 1: train (batch=1, 27B, 17.9GB) -> prune -> defrag -> freed 1.7GB
Cycle 2: train (batch=2, 24.5B, 16.2GB) -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: train (batch=3, 22B, 14.5GB) -> prune -> defrag (2.8x faster)
```

The net result: 40% faster total training and a 33% smaller final model.
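
For intuition about what "structurally remove" means, below is a minimal sketch, assuming a standard multi-head attention block whose q/k/v/output projections are plain `nn.Linear` layers; it ignores grouped-query attention, config metadata, and optimizer state, all of which a real defrag step would also have to handle.

```python
import torch
import torch.nn as nn

def defrag_attention(q_proj, k_proj, v_proj, o_proj, keep_heads, head_dim):
    """Rebuild the projection layers with only the surviving heads, so pruned
    heads stop occupying parameters and VRAM (masking alone would not free memory)."""
    rows = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim) for h in keep_heads])

    def drop_output_rows(linear: nn.Linear) -> nn.Linear:
        # q/k/v projections: remove the output rows that belonged to pruned heads.
        new = nn.Linear(linear.in_features, len(rows), bias=linear.bias is not None)
        new.weight.data = linear.weight.data[rows].clone()
        if linear.bias is not None:
            new.bias.data = linear.bias.data[rows].clone()
        return new

    def drop_input_cols(linear: nn.Linear) -> nn.Linear:
        # output projection: remove the matching input columns.
        new = nn.Linear(len(rows), linear.out_features, bias=linear.bias is not None)
        new.weight.data = linear.weight.data[:, rows].clone()
        if linear.bias is not None:
            new.bias.data = linear.bias.data.clone()
        return new

    return drop_output_rows(q_proj), drop_output_rows(k_proj), drop_output_rows(v_proj), drop_input_cols(o_proj)
```
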

**Read the full paper**: [Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)

## Output Samples