# qwen3.5-4b-code-forged

**+26.6% better than baseline.** Forged from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) for **code** tasks.

**Not quantized. Not distilled. Structurally reshaped.**

The architecture co-evolves with training: heads that contribute to the domain specialize, heads that don't are removed. The result is a model architecturally optimized for its task, much like biological synaptic pruning during brain development.

## Results

| Cycles | 3 |
| Steps/Cycle | 500 |

## Runs On

| Device | Format | Verified |
|--------|--------|----------|
| MacBook Pro 16GB | fp16 | Yes |
| MacBook Pro 32GB | fp16 | Yes |

These models are designed for **consumer hardware**. No A100s required. Your MacBook, your gaming PC, your home server.

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Loading code reconstructed; replace model_id with the full Hub path of this model.
model_id = "qwen3.5-4b-code-forged"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Write a function that checks whether a string is a palindrome.", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)  # temperature value assumed
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```bash
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```

The forge script auto-detects your GPU, picks the right memory tier (fp16 or 4-bit NF4), trains with LoRA + AMP, prunes attention heads, defrags, and saves. Progress is observable via `status.json`.
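
If you want to watch a run programmatically, a minimal sketch like the one below works, assuming the forge run writes `status.json` into its working directory; the path and the field names used here (`cycle`, `step`, `loss`) are placeholders, since the exact schema isn't documented in this card.

```python
import json
import pathlib
import time

# Path assumed: point this at wherever your forge run writes status.json.
status_path = pathlib.Path("status.json")

while True:
    if status_path.exists():
        status = json.loads(status_path.read_text())
        # Field names are placeholders - check your own status.json for the real keys.
        print(status.get("cycle"), status.get("step"), status.get("loss"))
    time.sleep(30)
```
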

## The Science: Experiential Plasticity

Traditional model compression (quantization, distillation) makes models **smaller but worse**. Experiential Plasticity makes them **smaller AND better**.

### How It Works

1. **Train** on domain-specific data (LoRA + AMP mixed precision)
2. **Measure** each attention head's information contribution (entropy-based importance; see the sketch after this list)
3. **Prune** the lowest-contributing heads
4. **Retrain** on the same domain data; surviving heads specialize and compensate
5. **Defrag**: structurally remove dead heads and free VRAM
6. **Repeat**: each cycle, the model improves further on its domain
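
As a rough illustration of steps 2-3, the sketch below scores heads by the entropy of their attention distributions and selects the lowest scorers for pruning. It is a minimal sketch, not the sentinel-ai implementation: the attention tensors are assumed to come from a calibration forward pass with `output_attentions=True`, and the sign convention (treating near-uniform, high-entropy heads as low contributors) and the 10% prune fraction are assumptions.

```python
import torch

def head_importance(attn: torch.Tensor) -> torch.Tensor:
    """attn: [batch, heads, q_len, k_len] attention probabilities from one layer.

    Heads whose attention is close to uniform (high entropy) are scored as
    contributing less; focused, low-entropy heads score higher. This exact
    scoring rule is an assumption for illustration.
    """
    eps = 1e-9
    entropy = -(attn * (attn + eps).log()).sum(dim=-1)  # [batch, heads, q_len]
    return -entropy.mean(dim=(0, 2))                    # [heads], higher = more important

def select_heads_to_prune(per_layer_attn, prune_fraction=0.10):
    """Return (layer, head) pairs with the lowest importance across the model."""
    scored = []
    for layer, attn in enumerate(per_layer_attn):
        for head, score in enumerate(head_importance(attn)):
            scored.append((score.item(), layer, head))
    scored.sort()                                       # least important first
    n_prune = int(len(scored) * prune_fraction)
    return [(layer, head) for _, layer, head in scored[:n_prune]]
```
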

### Scaling Law

Larger models harbor more architectural redundancy. Plasticity exploits this, so bigger models benefit more:

| Model | Params | Domain | Improvement |
|-------|--------|--------|------------|
| Qwen2.5-0.5B | 0.5B | General | -3.2% (too small to prune) |
| Qwen2.5-1.5B | 1.5B | General | +3.0% |
| Qwen2.5-7B | 7.6B | General | +11.8% |
| **Qwen3.5-4B** | **3.4B** | **Code** | **+24.0%** |
| **Qwen3.5-27B** | **23.6B** | **Code** | **+3.5%** (4-bit, runs in 17GB) |

Domain-specific training amplifies the effect. Qwen3.5-4B on code (+24%) exceeds Qwen2.5-7B on generic text (+11.8%) despite being a smaller model.

### Transfer Function

Recovery from iterative pruning follows a measurable exponential decay:

```
recovery = 1.45 * exp(-0.18 * cycle) - 0.03
```

This connects transformer optimization to classical control theory, the same mathematics used in electrical engineering and robotics for decades. A PID controller can manage the entire forging process with no human-tuned hyperparameters.
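
To make the control-theory claim concrete, here is a minimal sketch, assuming the transfer function above is used as a feed-forward model: it predicts how much recovery headroom remains at each cycle and scales the prune fraction accordingly. The proportional-control form, the base prune fraction, and the floor are assumptions for illustration, not the sentinel-ai controller.

```python
import math

def predicted_recovery(cycle: int) -> float:
    """Transfer function from the model card."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

def next_prune_fraction(cycle: int, base: float = 0.10, floor: float = 0.02) -> float:
    """Proportional control (illustrative): prune aggressively while predicted
    recovery headroom is high, back off as it shrinks. Gains and bounds are assumptions."""
    headroom = max(predicted_recovery(cycle), 0.0)
    return max(floor, min(base * headroom, base))

for cycle in (1, 2, 3):
    print(cycle, round(predicted_recovery(cycle), 3), round(next_prune_fraction(cycle), 3))
# predicted recovery is roughly 1.18, 0.98, and 0.81 for cycles 1-3
```
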

### Continuous Defrag

Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles:

```
Cycle 1: train (batch=1, 27B, 17.9GB) -> prune -> defrag -> freed 1.7GB
Cycle 2: train (batch=2, 24.5B, 16.2GB) -> prune -> defrag -> freed 1.7GB (2x faster)
Cycle 3: train (batch=3, 22B, 14.5GB) -> prune -> defrag (2.8x faster)
```

The net result: 40% faster total training and a 33% smaller final model.
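
For intuition about what "structurally remove" means, below is a minimal sketch, assuming a standard multi-head attention block whose q/k/v/output projections are plain `nn.Linear` layers; it ignores grouped-query attention, config metadata, and optimizer state, all of which a real defrag step would also have to handle.

```python
import torch
import torch.nn as nn

def defrag_attention(q_proj, k_proj, v_proj, o_proj, keep_heads, head_dim):
    """Rebuild the projection layers with only the surviving heads, so pruned
    heads stop occupying parameters and VRAM (masking alone would not free memory)."""
    rows = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim) for h in keep_heads])

    def drop_output_rows(linear: nn.Linear) -> nn.Linear:
        # q/k/v projections: remove the output rows that belonged to pruned heads.
        new = nn.Linear(linear.in_features, len(rows), bias=linear.bias is not None)
        new.weight.data = linear.weight.data[rows].clone()
        if linear.bias is not None:
            new.bias.data = linear.bias.data[rows].clone()
        return new

    def drop_input_cols(linear: nn.Linear) -> nn.Linear:
        # output projection: remove the matching input columns.
        new = nn.Linear(len(rows), linear.out_features, bias=linear.bias is not None)
        new.weight.data = linear.weight.data[:, rows].clone()
        if linear.bias is not None:
            new.bias.data = linear.bias.data.clone()
        return new

    return drop_output_rows(q_proj), drop_output_rows(k_proj), drop_output_rows(v_proj), drop_input_cols(o_proj)
```
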

**Read the full paper**: [Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)

## Output Samples