EnricoFermi committed on
Commit 0534d08 · verified · 1 Parent(s): ab22f51

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +142 -144

README.md CHANGED
@@ -1,43 +1,23 @@
  ---
- language:
- - en
- - zh
- license: apache-2.0
- library_name: transformers
- pipeline_tag: text-generation
  tags:
- - qwen3.5
  - experiential-plasticity
  - forged
- - head-pruning
- - neural-plasticity
- - sentinel-ai
- - continuum
- - safetensors
  - code
- - code-generation
- - coding
- - coder
- - programming
- - software-engineering
- - local-inference
- - efficient
- - optimized
- - pruned
- - 4b
- base_model:
- - Qwen/Qwen3.5-4B
- datasets:
- - m-a-p/CodeFeedback-Filtered-Instruction
  ---

  # qwen3.5-4b-code-forged

- **Beats Qwen2.5-Coder-1.5B** — a purpose-built coder pre-trained on trillions of code tokens — **with a general model forged in 3 hours.** 53.0% vs 51.8% HumanEval (Q4_K_M). Forged from [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) for **code** tasks (+26.6% perplexity improvement).

- **Not quantized. Not distilled. Structurally reshaped.**

- The architecture co-evolves with training: heads that contribute to the domain specialize, heads that don't are removed. The result is a model architecturally optimized for its task, like biological synaptic pruning during brain development.

  ## Results

@@ -45,45 +25,21 @@ The architecture co-evolves with training: heads that contribute to the domain s
  |--------|-------|
  | Base Model | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) |
  | Baseline Perplexity | 3.04 |
- | **Forged Perplexity** | **2.23** |
- | **Improvement** | **+26.6%** |
  | Domain | code |
- | Training Data | m-a-p/CodeFeedback-Filtered-Instruction |
- | Strategy | experiential_plasticity |
- | Pruning Level | 45% |
  | Cycles | 3 |
- | Steps/Cycle | 500 |

- ## Benchmarks
-
- | Model | Size | HumanEval | HumanEval+ |
- |-------|------|-----------|------------|
- | StarCoder2-3B | 3B | 31.7% | — |
- | Qwen2.5-Coder-3B | 3B | ~31% | — |
- | Phi-2 | 2.7B | 47.6% | — |
- | Qwen2.5-Coder-1.5B Q4_K_M | ~1GB | 51.8% | 48.2% |
- | **qwen3.5-4b-code-forged** | **3.4B** | **57.3%** | **49.4%** |
- | **qwen3.5-4b-code-forged Q4_K_M** | **2.6GB** | **53.0%** | **47.0%** |
-
- **Beats Qwen2.5-Coder-1.5B** (purpose-built coder, ~1GB) at Q4_K_M: 53.0% vs 51.8%. **+20% above Phi-2, +82% above StarCoder2-3B** in the sub-5B class.
-
- - **HumanEval**: 57.3% pass@1 (94/164 base problems)
- - **HumanEval+**: 49.4% pass@1 (81/164 base + extra tests)
- - **Method**: Greedy decoding (temperature 0), single sample, EvalPlus framework
- - **Hardware**: Evaluated as fp16 HuggingFace transformers on RTX 5090
- - **GGUF Q4_K_M**: 53.0% / 47.0% — only -4.3 points (7.5% relative drop from fp16)
- - **GGUF evaluated via**: llama-cpp-python on RTX 5090
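For running the Q4_K_M export locally the way it was evaluated, a minimal llama-cpp-python sketch; the GGUF filename is a hypothetical placeholder, since this card does not name the file:

```python
# Load a Q4_K_M GGUF build and run a greedy completion (path is illustrative).
from llama_cpp import Llama

llm = Llama(model_path="qwen3.5-4b-code-forged-Q4_K_M.gguf", n_ctx=4096)
out = llm("Write a Python decorator that caches results:",
          max_tokens=200, temperature=0.0)
print(out["choices"][0]["text"])
```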
-
- ## Runs On

  | Device | Format | Verified |
  |--------|--------|----------|
  | MacBook Pro 16GB | fp16 | Yes |
  | MacBook Pro 32GB | fp16 | Yes |
- | RTX 5090 | GGUF Q4_K_M | Yes (HumanEval 53.0%) |
- | MacBook Pro M1 | GGUF Q4_K_M | Yes (llama.cpp Metal) |
-
- These models are designed for **consumer hardware**. No A100s required. Your MacBook, your gaming PC, your home server.

  ## Quick Start

@@ -93,84 +49,151 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
  model = AutoModelForCausalLM.from_pretrained("continuum-ai/qwen3.5-4b-code-forged",
      torch_dtype="auto", device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained("continuum-ai/qwen3.5-4b-code-forged")
-
- inputs = tokenizer("Write a Python decorator that caches results:", return_tensors="pt").to(model.device)
- output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
- print(tokenizer.decode(output[0], skip_special_tokens=True))
  ```

- ## Forge Your Own
-
- Three commands. Any NVIDIA GPU with 8GB+ VRAM.

  ```bash
  git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
  source .venv/bin/activate
- python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
  ```

- The forge script auto-detects your GPU, picks the right memory tier (fp16 / 4-bit NF4), trains with LoRA + AMP, prunes attention heads, defrags, and saves. Progress observable via `status.json`.

- ## The Science: Experiential Plasticity

- Traditional model compression (quantization, distillation) makes models **smaller but worse**. Experiential Plasticity makes them **smaller AND better**.

- ### How It Works

- 1. **Train** on domain-specific data (LoRA + AMP mixed precision)
- 2. **Measure** each attention head's information contribution (entropy-based importance)
- 3. **Prune** the lowest-contributing heads
- 4. **Retrain** on the same domain data — surviving heads specialize and compensate
- 5. **Defrag** — structurally remove dead heads, free VRAM
- 6. **Repeat** — each cycle the model improves on its domain
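
To make the six-step loop above concrete, here is a toy sketch of the gate-based prune cycle (steps 2, 3, and 5). The random scoring stub and the gate vector are illustrative assumptions, not the sentinel-ai implementation:

```python
# Toy prune/retrain cycle over per-head gates. Real forging trains with
# LoRA between steps and scores heads by attention entropy; scoring here
# is a random stand-in so the skeleton runs on its own.
import torch

n_heads, prune_frac, cycles = 16, 0.45, 3
gates = torch.ones(n_heads)                  # 1.0 = active head, 0.0 = pruned

def importance(gates):
    # Stand-in for entropy-based contribution scoring (step 2)
    return torch.rand(gates.numel()) * gates

for cycle in range(1, cycles + 1):
    scores = importance(gates)               # measure after domain training
    active = gates.nonzero().squeeze(-1)
    k = int(prune_frac * active.numel())
    victims = active[scores[active].argsort()[:k]]
    gates[victims] = 0.0                     # step 3: drop lowest contributors
    # step 5 (defrag) would physically remove zero-gate heads here
    print(f"cycle {cycle}: {int(gates.sum().item())} heads remain")
```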

- ### Scaling Law

- Larger models harbor more architectural redundancy. Plasticity exploits this — bigger models benefit more:

- | Model | Params | Domain | Improvement |
- |-------|--------|--------|------------|
- | Qwen2.5-0.5B | 0.5B | General | -3.2% (too small to prune) |
- | Qwen2.5-1.5B | 1.5B | General | +3.0% |
- | Qwen2.5-7B | 7.6B | General | +11.8% |
- | **Qwen3.5-4B** | **3.4B** | **Code** | **+24.0%** |
- | **Qwen3.5-27B** | **23.6B** | **Code** | **+3.5%** (4-bit, runs in 17GB) |

- Domain-specific training amplifies the effect. Qwen3.5-4B on code (+24%) exceeds Qwen2.5-7B on generic text (+11.8%) despite being a smaller model.

- ### Transfer Function

- Recovery from iterative pruning follows a measurable exponential decay:

  ```
- recovery = 1.45 * exp(-0.18 * cycle) - 0.03
  ```

- This connects transformer optimization to classical control theory — the same mathematics used in electrical engineering and robotics for decades. A PID controller can manage the entire forging process with zero human hyperparameters.
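
A quick numeric check of that decay (plain Python), showing how recovery flattens across cycles; the values follow directly from the formula above:

```python
# Evaluate the reported recovery curve for the first few cycles.
import math

for cycle in range(1, 6):
    recovery = 1.45 * math.exp(-0.18 * cycle) - 0.03
    print(f"cycle {cycle}: recovery = {recovery:.3f}")
# cycle 1: 1.181, cycle 2: 0.982, cycle 3: 0.815, ...
```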
 
- ### Continuous Defrag

- Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles:

- ```
- Cycle 1: train (batch=1, 27B, 17.9GB) -> prune -> defrag -> freed 1.7GB
- Cycle 2: train (batch=2, 24.5B, 16.2GB) -> prune -> defrag -> freed 1.7GB (2x faster)
- Cycle 3: train (batch=3, 22B, 14.5GB) -> prune -> defrag (2.8x faster)
  ```

- 40% faster total training and a 33% smaller final model.
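
As a minimal sketch of what "structurally removes" means in practice: slice the dead heads' rows out of a projection matrix so they stop occupying parameters and VRAM. The toy sizes and the `keep` list are illustrative, not the sentinel-ai defrag code:

```python
# Physically drop pruned-head rows from a Q projection (toy dimensions).
import torch
from torch import nn

n_heads, head_dim, d_model = 8, 64, 512
q_proj = nn.Linear(d_model, n_heads * head_dim, bias=False)
keep = [0, 1, 3, 4, 6]                        # surviving head indices (example)

rows = torch.cat([torch.arange(h * head_dim, (h + 1) * head_dim) for h in keep])
with torch.no_grad():
    q_proj.weight = nn.Parameter(q_proj.weight[rows].clone())
q_proj.out_features = len(keep) * head_dim
print(q_proj.weight.shape)                    # torch.Size([320, 512])
```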

- ### Head Mitosis

- Pruning frees slots. Mitosis fills them. When a head is overutilized, it gets cloned into a pruned slot — each copy at 50% gate value to maintain output continuity. After continued training, the clones **diverge and specialize**, like cell differentiation after biological mitosis. The model grows new specialized capacity exactly where it's needed.
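
A toy sketch of that cloning step, assuming per-head gates scale each head's output (the weight layout and indices are hypothetical):

```python
# Clone an overutilized head into a freed slot. Halving both gates keeps
# 0.5*head + 0.5*clone equal to the original head's output.
import torch

n_heads, head_dim, d_model = 8, 64, 512
head_w = torch.randn(n_heads, head_dim, d_model)  # per-head weights (toy)
gates = torch.ones(n_heads)
gates[5] = 0.0                                    # slot 5 was pruned earlier

busy, free = 2, 5                                 # example head indices
head_w[free] = head_w[busy].clone()               # mitosis: copy the weights
gates[busy] = gates[free] = 0.5                   # output continuity preserved
# Continued training lets the two copies diverge and specialize.
```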
 
- **Read the full paper**: [Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)

- ## Output Samples

- Generated by the forged model immediately after forging — **no cherry-picking, no post-processing**.

- *No generation samples available for this model.*

  ## Forging Metadata

@@ -178,42 +201,19 @@ Generated by the forged model immediately after forging — **no cherry-picking,
  {
    "model": "Qwen/Qwen3.5-4B",
    "domain": "code",
-   "strategy": "experiential_plasticity",
-   "pruning_level": 0.45,
-   "cycles": 3,
-   "training_steps": 500,
    "baseline_ppl": 3.0382,
-   "final_ppl": 2.2305,
-   "improvement_pct": 26.58,
-   "forged_at": "2026-03-28T04:48:47-0500",
    "device": "NVIDIA GeForce RTX 5090",
    "tier": "A",
-   "load_4bit": false,
-   "training_data": "m-a-p/CodeFeedback-Filtered-Instruction",
-   "training_method": "LoRA (r=16, alpha=32)",
-   "batch_size": 4,
-   "grad_accum_steps": 2,
-   "seq_len": 256,
-   "cycle_results": [
-     {
-       "cycle": 1,
-       "post_prune_ppl": 2.2001,
-       "post_train_ppl": 2.2001,
-       "improvement_vs_baseline_pct": 27.59
-     },
-     {
-       "cycle": 2,
-       "post_prune_ppl": 2.2839,
-       "post_train_ppl": 2.2839,
-       "improvement_vs_baseline_pct": 24.83
-     },
-     {
-       "cycle": 3,
-       "post_prune_ppl": 2.2305,
-       "post_train_ppl": 2.2305,
-       "improvement_vs_baseline_pct": 26.58
-     }
    ],
    "hardware_targets": [
      {
        "device": "MacBook Pro 16GB",
@@ -231,12 +231,10 @@ Generated by the forged model immediately after forging — **no cherry-picking,

  ## Research

- - **[Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md)** — Scaling law, transfer function, self-directed controller, domain forging, continuous defrag
- - **[Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md)** — Foundation paper with cross-architecture results
- - **[Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md)** — MoE expert pruning (67GB to 14GB)

- ## Links

- - [All published models](https://huggingface.co/continuum-ai)
- - [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) — Open source forge framework
- - [continuum](https://github.com/CambrianTech/continuum) — Distributed AI on consumer hardware
  ---
  tags:
+ - continuum
  - experiential-plasticity
  - forged
+ - forge-alloy
+ - text-generation
  - code
+ base_model: Qwen/Qwen3.5-4B
+ pipeline_tag: text-generation
+ license: apache-2.0
  ---

  # qwen3.5-4b-code-forged

+ A **forged** version of [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) — optimized through [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) for **code** use.

+ ## What is Forging?

+ Experiential Plasticity iteratively prunes attention heads based on entropy (information content) and retrains. Remaining heads specialize and compensate; the model emerges smaller AND more capable, like biological synaptic pruning during brain development.
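
One plausible reading of that entropy criterion, sketched below: heads whose attention distributions are nearly uniform (high entropy) carry little information and become pruning candidates first. The tensor shapes and the highest-entropy-first rule are assumptions for illustration, not the project's exact scoring:

```python
# Score heads by mean attention entropy; mark the least informative 30%.
import torch

def head_entropy(attn):
    # attn: [batch, heads, q_len, k_len] attention probabilities
    p = attn.clamp_min(1e-9)
    ent = -(p * p.log()).sum(dim=-1)       # entropy at each query position
    return ent.mean(dim=(0, 2))            # mean entropy per head

attn = torch.softmax(torch.randn(2, 8, 16, 16), dim=-1)  # toy attention maps
ent = head_entropy(attn)
k = int(0.3 * ent.numel())                                # 30% pruning level
victims = ent.argsort(descending=True)[:k]
print("prune candidates:", victims.tolist())
```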

  ## Results

  |--------|-------|
  | Base Model | [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) |
  | Baseline Perplexity | 3.04 |
+ | **Forged Perplexity** | **2.35** |
+ | **Improvement** | **+22.7%** |
  | Domain | code |
+ | Training Data | wikitext-2 |
+ | Strategy | combined |
+ | Pruning Level | 30% |
  | Cycles | 3 |
+ | Steps/Cycle | 1000 |

+ ## Target Hardware

  | Device | Format | Verified |
  |--------|--------|----------|
  | MacBook Pro 16GB | fp16 | Yes |
  | MacBook Pro 32GB | fp16 | Yes |

  ## Quick Start

  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer

  model = AutoModelForCausalLM.from_pretrained("continuum-ai/qwen3.5-4b-code-forged",
      torch_dtype="auto", device_map="auto")
  tokenizer = AutoTokenizer.from_pretrained("continuum-ai/qwen3.5-4b-code-forged")
  ```
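
The generation snippet from this card's previous revision still applies after the load above; reproduced as a usage example:

```python
inputs = tokenizer("Write a Python decorator that caches results:",
                   return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```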

+ ## Reproduce

  ```bash
  git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
  source .venv/bin/activate
+ python scripts/run_neural_plasticity.py \
+     --model_name Qwen/Qwen3.5-4B --pruning_strategy combined \
+     --pruning_level 0.3 --training_steps 1000 --cycles 3
  ```

+ ## Output Samples

+ Generated by the forged model immediately after forging — **no cherry-picking, no post-processing**.

+ <details>
+ <summary><b>Concurrent System</b></summary>

+ ```
+ Prompt: Implement a lock-free concurrent hash map in Python using atomics. Handle resize operations without blocking readers. Include proper memory ordering annotations and explain the ABA problem mitigation.

+ Generated:
+ Implement a lock-free concurrent hash map in Python using atomics. Handle resize operations without blocking readers. Include proper memory ordering annotations and explain the ABA problem mitigation.

+ Reference erroneous code:
+ ```python
+ import threading
+
+ class LockFreeHashMap:
+     def __init__(self):
+         self.hash_table = [None] * 10
+         self.size = 0
+
+     def insert(self, key, value):
+         index = hash(key) % len(self.hash_table)
+         while True:
+             old_value = self._get_value(index)
+             if old_value is None or old_value == value:
+                 self.hash_table[index] = (key, value)
+                 self.size += 1
+                 return
+             elif old_value != value:
+                 break
+             else:
+                 self._compare_and_swap(index, (key, value))
+
+     def _get_value(self, index):
+         current = self.hash_table[index]
+         while True:
+             old_value = self._read_value(current)
+             new_value = self._read_value(old_value)
+             if new
+ ```

+ </details>

+ <details>
+ <summary><b>Debugging</b></summary>

+ ```
+ Prompt: This async Python server has a memory leak that only manifests under high concurrency. The leak grows at ~50MB/hour. Identify likely causes and write diagnostic code to find the exact source.
+
+ Generated:
+ This async Python server has a memory leak that only manifests under high concurrency. The leak grows at ~50MB/hour. Identify likely causes and write diagnostic code to find the exact source.
+
+ Rewritten Test:
+ The following async Python server has a memory leak that only manifests under high concurrency. The leak grows at approximately 50MB per hour. Your task is to identify the likely cause(s) of this memory leak and provide diagnostic code to pinpoint the exact source.

+ Please note that you should not use any external libraries or frameworks for debugging, and you must implement your own solution using standard Python techniques. Additionally, please ensure that your code is optimized for performance and can handle high concurrency efficiently.
+ One possible cause of the memory leak could be improper handling of resources, such as unclosed file handles or sockets. Another potential cause could be excessive memory allocation without proper deallocation.

+ To diagnose the issue, we can start by monitoring the memory usage of the server over time. We can use tools like `tracemalloc` in Python to track memory allocations and identify where memory is being leaked. Here's an example code snippet to get started:

+ ```python
  ```
+
+ </details>
+
+ <details>
+ <summary><b>Refactoring</b></summary>
+
  ```
+ Prompt: Refactor this 500-line God class into a proper domain model using the repository pattern, dependency injection, and event sourcing. Show the key interfaces and one concrete implementation.

+ Generated:
+ Refactor this 500-line God class into a proper domain model using the repository pattern, dependency injection, and event sourcing. Show the key interfaces and one concrete implementation. Additionally, implement a unit test to verify that the refactored code works correctly.

+ Furthermore, ensure that the refactored code is thread-safe by implementing proper synchronization mechanisms.
+ Here's an example of how you can refactor the given God class into a proper domain model using the repository pattern, dependency injection, and event sourcing:

+ Domain Model:

+ ```csharp
+ public interface IGodRepository {
+     Task<God> GetAsync(Guid id);
+     Task AddAsync(God god);
+ }
+
+ public interface IEventSourcedRepository {
+     Task<God> GetAsync(Guid id);
+     Task AddAsync(God god);
+     Task SaveChangesAsync();
+ }
+
+ public interface IDependencyInjection {
+     IGodRepository GetGodRepository();
+ }
+
+ public class God {
+     public Guid Id { get; set; }
+     public string Name { get; set; }
+     public List<Event> Events { get; set; }
  ```

+ </details>

+ <details>
+ <summary><b>System Design</b></summary>

+ ```
+ Prompt: Design a distributed rate limiter that works across multiple servers without a central coordinator. Use a sliding window algorithm. Handle clock skew between nodes. Show the implementation.

+ Generated:
+ Design a distributed rate limiter that works across multiple servers without a central coordinator. Use a sliding window algorithm. Handle clock skew between nodes. Show the implementation.

+ In addition to the above requirements, implement a feature that allows for burst traffic handling. The system should be able to handle short bursts of requests exceeding the average request rate within a given time window. This means implementing a token bucket algorithm in conjunction with the sliding window algorithm.

+ Furthermore, optimize the implementation to minimize memory usage and improve performance. Instead of maintaining separate counters for each server, utilize a shared data structure to track request counts across all servers efficiently.
+ Here's an implementation of a distributed rate limiter using a sliding window algorithm with burst traffic handling:

+ ```python
+ import threading
+ from collections import defaultdict
+
+ class DistributedRateLimiter:
+     def __init__(self, max_requests_per_window=10):
+         self.max_requests_per_window = max_requests_per_window
+         self.sliding_window = defaultdict(list) # Keyed by IP address or any other identifier
+         self.token_bucket = {} # Keyed by IP address or any other identifier
+
+     def
+ ```
+
+ </details>

  ## Forging Metadata

  {
    "model": "Qwen/Qwen3.5-4B",
    "domain": "code",
    "baseline_ppl": 3.0382,
+   "final_ppl": 2.3487,
+   "improvement_pct": 22.7,
+   "forged_at": "2026-03-31T12:13:43-0500",
    "device": "NVIDIA GeForce RTX 5090",
    "tier": "A",
+   "cycles": 3,
+   "stages": [
+     "train",
+     "quant",
+     "eval"
    ],
+   "training_data": "wikitext-2",
    "hardware_targets": [
      {
        "device": "MacBook Pro 16GB",

  ## Research

+ - [Experiential Plasticity](https://github.com/CambrianTech/continuum/blob/main/docs/papers/EXPERIENTIAL-PLASTICITY.md) — scaling law, transfer function discovery, self-directed control
+ - [Neural Plasticity in Transformers](https://github.com/CambrianTech/continuum/blob/main/docs/papers/SENTINEL-AI-NEURAL-PLASTICITY.md) — the foundation
+ - [Plasticity Compaction](https://github.com/CambrianTech/continuum/blob/main/docs/papers/PLASTICITY-COMPACTION-MOE.md) — MoE expert pruning

+ [sentinel-ai](https://github.com/CambrianTech/sentinel-ai) | [continuum](https://github.com/CambrianTech/continuum) | [forge-alloy](https://github.com/CambrianTech/forge-alloy) | [HuggingFace](https://huggingface.co/continuum-ai)

+ *Forged with [ForgeAlloy](https://github.com/CambrianTech/forge-alloy) — Trustless AI Compute Contract*