EnricoFermi committed on
Commit 0534d08 · verified · 1 Parent(s): ab22f51

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +142 -144

README.md CHANGED
@@ -1,43 +1,23 @@
  ---
- language:
- - en
- - zh
- license: apache-2.0
- library_name: transformers
- pipeline_tag: text-generation
  tags:
- - qwen3.5
  - experiential-plasticity
  - forged
- - head-pruning
- - neural-plasticity
- - sentinel-ai
- - continuum
- - safetensors
  - code
- - code-generation
- - coding
- - coder
- - programming
- - software-engineering
- - local-inference
- - efficient
- - optimized
- - pruned
- - 4b
- base_model:
- - Qwen/Qwen3.5-4B
- datasets:
- - m-a-p/CodeFeedback-Filtered-Instruction
  ---

  # qwen3.5-4b-code-forged

- **Beats Qwen2.5-Coder-1.5B** — a purpose-built coder pre-trained on trillions of code tokens — **with a general model forged in 3 hours.** 53.0% vs 51.8% HumanEval (Q4_K_M). Forged from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) for **code** tasks (+26.6% perplexity improvement).

- **Not quantized. Not distilled. Structurally reshaped.**

- The architecture co-evolves with training: heads that contribute to the domain specialize, heads that don't are removed. The result is a model architecturally optimized for its task, like biological synaptic pruning during brain development.

  ## Results

@@ -45,45 +25,21 @@ The architecture co-evolves with training: heads that contribute to the domain s
  |--------|-------|
  | Base Model | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) |
  | Baseline Perplexity | 3.04 |
- | **Forged Perplexity** | **2.23** |
- | **Improvement** | **+26.6%** |
  | Domain | code |
- | Training Data | m-a-p/CodeFeedback-Filtered-Instruction |
- | Strategy | experiential_plasticity |
- | Pruning Level | 45% |
  | Cycles | 3 |
- | Steps/Cycle | 500 |

- ## Benchmarks
-
- | Model | Size | HumanEval | HumanEval+ |
- |-------|------|-----------|------------|
- | StarCoder2-3B | 3B | 31.7% | — |
- | Qwen2.5-Coder-3B | 3B | ~31% | — |
- | Phi-2 | 2.7B | 47.6% | — |
- | Qwen2.5-Coder-1.5B Q4_K_M | ~1GB | 51.8% | 48.2% |
- | **qwen3.5-4b-code-forged** | **3.4B** | **57.3%** | **49.4%** |
- | **qwen3.5-4b-code-forged Q4_K_M** | **2.6GB** | **53.0%** | **47.0%** |
-
- **Beats Qwen2.5-Coder-1.5B** (purpose-built coder, ~1GB) at Q4_K_M: 53.0% vs 51.8%. **+20% above Phi-2, +82% above StarCoder2-3B** in the sub-5B class.
-
- - **HumanEval**: 57.3% pass@1 (94/164 base problems)
- - **HumanEval+**: 49.4% pass@1 (81/164 base + extra tests)
- - **Method**: Greedy decoding (temperature 0), single sample, EvalPlus framework
- - **Hardware**: Evaluated as fp16 HuggingFace transformers on RTX 5090
- - **GGUF Q4_K_M**: 53.0% / 47.0% — only -4.3 points (7.5% relative drop from fp16)
- - **GGUF evaluated via**: llama-cpp-python on RTX 5090
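For running the Q4_K_M export locally the way it was evaluated, a minimal llama-cpp-python sketch; the GGUF filename is a hypothetical placeholder, since this card does not name the file:

```python
# Load a Q4_K_M GGUF build and run a greedy completion (path is illustrative).
from llama_cpp import Llama

llm = Llama(model_path="qwen3.5-4b-code-forged-Q4_K_M.gguf", n_ctx=4096)
out = llm("Write a Python decorator that caches results:",
          max_tokens=200, temperature=0.0)
print(out["choices"][0]["text"])
```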
-
- ## Runs On

  | Device | Format | Verified |
  |--------|--------|----------|
  | MacBook Pro 16GB | fp16 | Yes |
  | MacBook Pro 32GB | fp16 | Yes |
- | RTX 5090 | GGUF Q4_K_M | Yes (HumanEval 53.0%) |
- | MacBook Pro M1 | GGUF Q4_K_M | Yes (llama.cpp Metal) |
-
- These models are designed for **consumer hardware**. No A100s required. Your MacBook, your gaming PC, your home server.

  ## Quick Start

@@ -93,84 +49,151 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
  model = AutoModelForCausalLM.from_pretrained("continuum-ai/qwen3.5-4b-code-forged",
      torch_dtype="auto", device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained("continuum-ai/qwen3.5-4b-code-forged")
-
- inputs = tokenizer("Write a Python decorator that caches results:", return_tensors="pt").to(model.device)
- output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
- print(tokenizer.decode(output[0], skip_special_tokens=True))
  ```

- ## Forge Your Own
-
- Three commands. Any NVIDIA GPU with 8GB+ VRAM.

  ```bash
  git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
  source .venv/bin/activate
- python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
  ```

- The forge script auto-detects your GPU, picks the right memory tier (fp16 / 4-bit NF4), trains with LoRA + AMP, prunes attention heads, defrags, and saves. Progress observable via `status.json`.

- ## The Science: Experiential Plasticity

- Traditional model compression (quantization, distillation) makes models **smaller but worse**. Experiential Plasticity makes them **smaller AND better**.

- ### How It Works

- 1. **Train** on domain-specific data (LoRA + AMP mixed precision)
- 2. **Measure** each attention head's information contribution (entropy-based importance)
- 3. **Prune** the lowest-contributing heads
- 4. **Retrain** on the same domain data — surviving heads specialize and compensate
- 5. **Defrag** — structurally remove dead heads, free VRAM
- 6. **Repeat** — each cycle the model improves on its domain
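
To make the six-step loop above concrete, here is a toy sketch of the gate-based prune cycle (steps 2, 3, and 5). The random scoring stub and the gate vector are illustrative assumptions, not the sentinel-ai implementation:

```python
# Toy prune/retrain cycle over per-head gates. Real forging trains with
# LoRA between steps and scores heads by attention entropy; scoring here
# is a random stand-in so the skeleton runs on its own.
import torch

n_heads, prune_frac, cycles = 16, 0.45, 3
gates = torch.ones(n_heads)                  # 1.0 = active head, 0.0 = pruned

def importance(gates):
    # Stand-in for entropy-based contribution scoring (step 2)
    return torch.rand(gates.numel()) * gates

for cycle in range(1, cycles + 1):
    scores = importance(gates)               # measure after domain training
    active = gates.nonzero().squeeze(-1)
    k = int(prune_frac * active.numel())
    victims = active[scores[active].argsort()[:k]]
    gates[victims] = 0.0                     # step 3: drop lowest contributors
    # step 5 (defrag) would physically remove zero-gate heads here
    print(f"cycle {cycle}: {int(gates.sum().item())} heads remain")
```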

- ### Scaling Law

- Larger models harbor more architectural redundancy. Plasticity exploits this — bigger models benefit more:

- | Model | Params | Domain | Improvement |
- |-------|--------|--------|------------|
- | Qwen2.5-0.5B | 0.5B | General | -3.2% (too small to prune) |
- | Qwen2.5-1.5B | 1.5B | General | +3.0% |
- | Qwen2.5-7B | 7.6B | General | +11.8% |
- | **Qwen3.5-4B** | **3.4B** | **Code** | **+24.0%** |
- | **Qwen3.5-27B** | **23.6B** | **Code** | **+3.5%** (4-bit, runs in 17GB) |

- Domain-specific training amplifies the effect. Qwen3.5-4B on code (+24%) exceeds Qwen2.5-7B on generic text (+11.8%) despite being a smaller model.

- ### Transfer Function

- Recovery from iterative pruning follows a measurable exponential decay:

  ```
- recovery = 1.45 * exp(-0.18 * cycle) - 0.03
  ```

- This connects transformer optimization to classical control theory — the same mathematics used in electrical engineering and robotics for decades. A PID controller can manage the entire forging process with zero human hyperparameters.
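
A quick numeric check of that decay (plain Python), showing how recovery flattens across cycles; the values follow directly from the formula above:

```python
# Evaluate the reported recovery curve for the first few cycles.
import math

for cycle in range(1, 6):
    recovery = 1.45 * math.exp(-0.18 * cycle) - 0.03
    print(f"cycle {cycle}: recovery = {recovery:.3f}")
# cycle 1: 1.181, cycle 2: 0.982, cycle 3: 0.815, ...
```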
 
- ### Continuous Defrag

- Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles:

- ```
- Cycle 1: train (batch=1, 27B, 17.9GB) -> prune -> defrag -> freed 1.7GB
- Cycle 2: train (batch=2, 24.5B, 16.2GB) -> prune -> defrag -> freed 1.7GB (2x faster)
- Cycle 3: train (batch=3, 22B, 14.5GB) -> prune -> defrag (2.8x faster)
  ```

- 40% faster total training and a 33% smaller final model.
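
As a minimal sketch of what "structurally removes" means in practice: slice the dead heads' rows out of a projection matrix so they stop occupying parameters and VRAM. The toy sizes and the `keep` list are illustrative, not the sentinel-ai defrag code:

```python
# Physically drop pruned-head rows from a Q projection (toy dimensions).
import torch
from torch import nn

n_heads, head_dim, d_model = 8, 64, 512
q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
keep = [0, 1, 3, 4, 6]                        # surviving head indices (example)

rows = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim) for h in keep])
with torch.no_grad():
    q_proj.weight = nn.Parameter(q_proj.weight[rows].clone())
q_proj.out_features = len(keep) * head_dim
print(q_proj.weight.shape)                    # torch.Size([320, 512])
```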

- ### Head Mitosis

- Pruning frees slots. Mitosis fills them. When a head is overutilized, it gets cloned into a pruned slot — each copy at 50% gate value to maintain output continuity. After continued training, the clones **diverge and specialize**, like cell differentiation after biological mitosis. The model grows new specialized capacity exactly where it's needed.
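
A toy sketch of that cloning step, assuming per-head gates scale each head's output (the weight layout and indices are hypothetical):

```python
# Clone an overutilized head into a freed slot. Halving both gates keeps
# 0.5*head + 0.5*clone equal to the original head's output.
import torch

n_heads, head_dim, d_model = 8, 64, 512
head_w = torch.randn(n_heads, head_dim, d_model)  # per-head weights (toy)
gates = torch.ones(n_heads)
gates[5] = 0.0                                    # slot 5 was pruned earlier

busy, free = 2, 5                                 # example head indices
head_w[free] = head_w[busy].clone()               # mitosis: copy the weights
gates[busy] = gates[free] = 0.5                   # output continuity preserved
# Continued training lets the two copies diverge and specialize.
```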
 
- **Read the full paper**: [Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)

- ## Output Samples

- Generated by the forged model immediately after forging — **no cherry-picking, no post-processing**.

- *No generation samples available for this model.*

  ## Forging Metadata

@@ -178,42 +201,19 @@ Generated by the forged model immediately after forging — **no cherry-picking,
  {
    "model": "Qwen/Qwen3.5-4B",
    "domain": "code",
-   "strategy": "experiential_plasticity",
-   "pruning_level": 0.45,
-   "cycles": 3,
-   "training_steps": 500,
    "baseline_ppl": 3.0382,
-   "final_ppl": 2.2305,
-   "improvement_pct": 26.58,
-   "forged_at": "2026-03-28T04:48:47-0500",
    "device": "NVIDIA GeForce RTX 5090",
    "tier": "A",
-   "load_4bit": false,
-   "training_data": "m-a-p/CodeFeedback-Filtered-Instruction",
-   "training_method": "LoRA (r=16, alpha=32)",
-   "batch_size": 4,
-   "grad_accum_steps": 2,
-   "seq_len": 256,
-   "cycle_results": [
-     {
-       "cycle": 1,
-       "post_prune_ppl": 2.2001,
-       "post_train_ppl": 2.2001,
-       "improvement_vs_baseline_pct": 27.59
-     },
-     {
-       "cycle": 2,
-       "post_prune_ppl": 2.2839,
-       "post_train_ppl": 2.2839,
-       "improvement_vs_baseline_pct": 24.83
-     },
-     {
-       "cycle": 3,
-       "post_prune_ppl": 2.2305,
-       "post_train_ppl": 2.2305,
-       "improvement_vs_baseline_pct": 26.58
-     }
    ],
    "hardware_targets": [
      {
        "device": "MacBook Pro 16GB",
@@ -231,12 +231,10 @@ Generated by the forged model immediately after forging — **no cherry-picking,

  ## Research

- - **[Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)** — Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- - **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** — Foundation paper with cross-architecture results
- - **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** — MoE expert pruning (67GB to 14GB)

- ## Links

- - [All published models](https://huggingface.co/continuum-ai)
- - [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) — Open source forge framework
- - [continuum](https://github.com/CambrianTech/continuum) — Distributed AI on consumer hardware
  ---
  tags:
+ - continuum
  - experiential-plasticity
  - forged
+ - forge-alloy
+ - text-generation
  - code
+ base_model: Qwen/Qwen3.5-4B
+ pipeline_tag: text-generation
+ license: apache-2.0
  ---

  # qwen3.5-4b-code-forged

+ A **forged** version of [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) — optimized through [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) for **code** use.

+ ## What is Forging?

+ Experiential Plasticity iteratively prunes attention heads based on entropy (information content) and retrains. Remaining heads specialize and compensate; the model emerges smaller AND more capable, like biological synaptic pruning during brain development.
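
One plausible reading of that entropy criterion, sketched below: heads whose attention distributions are nearly uniform (high entropy) carry little information and become pruning candidates first. The tensor shapes and the highest-entropy-first rule are assumptions for illustration, not the project's exact scoring:

```python
# Score heads by mean attention entropy; mark the least informative 30%.
import torch

def head_entropy(attn):
    # attn: [batch, heads, q_len, k_len] attention probabilities
    p = attn.clamp_min(1e-9)
    ent = -(p * p.log()).sum(dim=-1)       # entropy at each query position
    return ent.mean(dim=(0, 2))            # mean entropy per head

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)  # toy attention maps
ent = head_entropy(attn)
k = int(0.3 * ent.numel())                                # 30% pruning level
victims = ent.argsort(descending=True)[:k]
print("prune candidates:", victims.tolist())
```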

  ## Results

  |--------|-------|
  | Base Model | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) |
  | Baseline Perplexity | 3.04 |
+ | **Forged Perplexity** | **2.35** |
+ | **Improvement** | **+22.7%** |
  | Domain | code |
+ | Training Data | wikitext-2 |
+ | Strategy | combined |
+ | Pruning Level | 30% |
  | Cycles | 3 |
+ | Steps/Cycle | 1000 |

+ ## Target Hardware

  | Device | Format | Verified |
  |--------|--------|----------|
  | MacBook Pro 16GB | fp16 | Yes |
  | MacBook Pro 32GB | fp16 | Yes |

  ## Quick Start

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained("continuum-ai/qwen3.5-4b-code-forged",
      torch_dtype="auto", device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained("continuum-ai/qwen3.5-4b-code-forged")
  ```
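
The generation snippet from this card's previous revision still applies after the load above; reproduced as a usage example:

```python
inputs = tokenizer("Write a Python decorator that caches results:",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```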

+ ## Reproduce

  ```bash
  git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
  source .venv/bin/activate
+ python scripts/run_neural_plasticity.py \
+     --model_name Qwen/Qwen3.5-4B --pruning_strategy combined \
+     --pruning_level 0.3 --training_steps 1000 --cycles 3
  ```

+ ## Output Samples

+ Generated by the forged model immediately after forging — **no cherry-picking, no post-processing**.

+ <details>
+ <summary><b>Concurrent System</b></summary>

+ ```
+ Prompt: Implement a lock-free concurrent hash map in Python using atomics. Handle resize operations without blocking readers. Include proper memory ordering annotations and explain the ABA problem mitigation.

+ Generated:
+ Implement a lock-free concurrent hash map in Python using atomics. Handle resize operations without blocking readers. Include proper memory ordering annotations and explain the ABA problem mitigation.

+ Reference erroneous code:
+ ```python
+ import threading
+
+ class LockFreeHashMap:
+     def __init__(self):
+         self.hash_table = [None] * 10
+         self.size = 0
+
+     def insert(self, key, value):
+         index = hash(key) % len(self.hash_table)
+         while True:
+             old_value = self._get_value(index)
+             if old_value is None or old_value == value:
+                 self.hash_table[index] = (key, value)
+                 self.size += 1
+                 return
+             elif old_value != value:
+                 break
+             else:
+                 self._compare_and_swap(index, (key, value))
+
+     def _get_value(self, index):
+         current = self.hash_table[index]
+         while True:
+             old_value = self._read_value(current)
+             new_value = self._read_value(old_value)
+             if new
+ ```

+ </details>

+ <details>
+ <summary><b>Debugging</b></summary>

+ ```
+ Prompt: This async Python server has a memory leak that only manifests under high concurrency. The leak grows at ~50MB/hour. Identify likely causes and write diagnostic code to find the exact source.
+
+ Generated:
+ This async Python server has a memory leak that only manifests under high concurrency. The leak grows at ~50MB/hour. Identify likely causes and write diagnostic code to find the exact source.
+
+ Rewritten Test:
+ The following async Python server has a memory leak that only manifests under high concurrency. The leak grows at approximately 50MB per hour. Your task is to identify the likely cause(s) of this memory leak and provide diagnostic code to pinpoint the exact source.

+ Please note that you should not use any external libraries or frameworks for debugging, and you must implement your own solution using standard Python techniques. Additionally, please ensure that your code is optimized for performance and can handle high concurrency efficiently.
+ One possible cause of the memory leak could be improper handling of resources, such as unclosed file handles or sockets. Another potential cause could be excessive memory allocation without proper deallocation.

+ To diagnose the issue, we can start by monitoring the memory usage of the server over time. We can use tools like `tracemalloc` in Python to track memory allocations and identify where memory is being leaked. Here's an example code snippet to get started:

+ ```python
  ```
+
+ </details>
+
+ <details>
+ <summary><b>Refactoring</b></summary>
+
  ```
+ Prompt: Refactor this 500-line God class into a proper domain model using the repository pattern, dependency injection, and event sourcing. Show the key interfaces and one concrete implementation.

+ Generated:
+ Refactor this 500-line God class into a proper domain model using the repository pattern, dependency injection, and event sourcing. Show the key interfaces and one concrete implementation. Additionally, implement a unit test to verify that the refactored code works correctly.

+ Furthermore, ensure that the refactored code is thread-safe by implementing proper synchronization mechanisms.
+ Here's an example of how you can refactor the given God class into a proper domain model using the repository pattern, dependency injection, and event sourcing:

+ Domain Model:

+ ```csharp
+ public interface IGodRepository {
+     Task<God> GetAsync(Guid id);
+     Task AddAsync(God god);
+ }
+
+ public interface IEventSourcedRepository {
+     Task<God> GetAsync(Guid id);
+     Task AddAsync(God god);
+     Task SaveChangesAsync();
+ }
+
+ public interface IDependencyInjection {
+     IGodRepository GetGodRepository();
+ }
+
+ public class God {
+     public Guid Id { get; set; }
+     public string Name { get; set; }
+     public List<Event> Events { get; set; }
  ```

+ </details>

+ <details>
+ <summary><b>System Design</b></summary>

+ ```
+ Prompt: Design a distributed rate limiter that works across multiple servers without a central coordinator. Use a sliding window algorithm. Handle clock skew between nodes. Show the implementation.

+ Generated:
+ Design a distributed rate limiter that works across multiple servers without a central coordinator. Use a sliding window algorithm. Handle clock skew between nodes. Show the implementation.

+ In addition to the above requirements, implement a feature that allows for burst traffic handling. The system should be able to handle short bursts of requests exceeding the average request rate within a given time window. This means implementing a token bucket algorithm in conjunction with the sliding window algorithm.

+ Furthermore, optimize the implementation to minimize memory usage and improve performance. Instead of maintaining separate counters for each server, utilize a shared data structure to track request counts across all servers efficiently.
+ Here's an implementation of a distributed rate limiter using a sliding window algorithm with burst traffic handling:

+ ```python
+ import threading
+ from collections import defaultdict
+
+ class DistributedRateLimiter:
+     def __init__(self, max_requests_per_window=10):
+         self.max_requests_per_window = max_requests_per_window
+         self.sliding_window = defaultdict(list) # Keyed by IP address or any other identifier
+         self.token_bucket = {} # Keyed by IP address or any other identifier
+
+     def
+ ```
+
+ </details>

  ## Forging Metadata

  {
    "model": "Qwen/Qwen3.5-4B",
    "domain": "code",
    "baseline_ppl": 3.0382,
+   "final_ppl": 2.3487,
+   "improvement_pct": 22.7,
+   "forged_at": "2026-03-31T12:13:43-0500",
    "device": "NVIDIA GeForce RTX 5090",
    "tier": "A",
+   "cycles": 3,
+   "stages": [
+     "train",
+     "quant",
+     "eval"
    ],
+   "training_data": "wikitext-2",
    "hardware_targets": [
      {
        "device": "MacBook Pro 16GB",

  ## Research

+ - [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) — scaling law, transfer function discovery, self-directed control
+ - [Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md) — the foundation
+ - [Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md) — MoE expert pruning

+ [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) | [continuum](https://github.com/CambrianTech/continuum) | [forge-alloy](https://github.com/CambrianTech/forge-alloy) | [HuggingFace](https://huggingface.co/continuum-ai)

+ *Forged with [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) — Trustless AI Compute Contract*