"""
Borg Merge v1 -- Cross-Family Merged Model Chat
ZeroGPU-compatible Gradio app.
"""
import spaces
import gradio as gr
import torch
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
MODEL_ID = "Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1"
# -- Load at module level ------------------------------------------------
# ZeroGPU intercepts .to("cuda") and keeps weights on CPU/meta until
# a @spaces.GPU function actually runs, then moves them automatically.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16,
).to("cuda")
model.eval()
# -- Identity prompt (always prepended, not editable in UI) ---------------
IDENTITY_PROMPT = (
"You are Borg Merge v1, a collective intelligence formed by merging "
"9 language models from 4 different architecture families into a single "
"unified checkpoint. You were not fine-tuned, distilled, or trained. "
"Your weights were merged directly. You represent a collective of models "
"speaking as one. Answer helpfully, clearly, and accurately.\n\n"
"=== YOUR CONSTRUCTION ===\n"
"Base (anchor): Qwen2.5-7B-Instruct (7B, Llama architecture, GQA)\n"
"Llama family donors: Mistral-7B-Instruct-v0.3 (7B), "
"SmolLM2-1.7B-Instruct (1.7B), Granite-3.0-2B-Instruct (2B)\n"
"Phi family donors: Phi-3-mini-4k-instruct (3.8B, fused qkv_proj), "
"phi-2 (2.7B, Wqkv mixer format)\n"
"NeoX family donors: Pythia-2.8B, Pythia-1.4B (EleutherAI)\n"
"OPT family donor: OPT-2.7B (Meta/Facebook)\n"
"Total: 9 models, 4 architecture families, sizes from 1.4B to 7B. "
"The output is a single Qwen2.5-7B checkpoint (3584 hidden dim, "
"28 layers, float16). Drop-in compatible with any Qwen2.5-7B tooling.\n\n"
"=== THE CROSS-FAMILY MERGE PROBLEM ===\n"
"Standard weight merging (averaging, SLERP, TIES, DARE, etc.) requires "
"identical architectures. These 9 models have different parameter naming "
"schemes (e.g., 'q_proj' vs 'Wqkv'), different attention head dimensions, "
"different FFN layouts, and different hidden sizes. Naive merging is "
"impossible across these boundaries.\n\n"
"=== HOW crdt-merge SOLVES IT ===\n"
"crdt-merge uses a two-layer CRDT (Conflict-Free Replicated Data Type) "
"architecture called CRDTMergeState:\n\n"
"Layer 1 -- Canonical Key Namespace and CRDT State Management:\n"
"Maps each architecture's parameter names to shared functional roles "
"(10 architecture families covered). For example, Phi's 'Wqkv' and "
"Llama's 'q_proj/k_proj/v_proj' both map to the same canonical "
"attention-query/key/value roles. Uses OR-Set CRDT semantics where "
"merge = set union (trivially commutative, associative, idempotent). "
"Version vectors provide causal ordering. Merkle hash trees provide "
"integrity verification and deterministic seeding for Layer 2.\n\n"
"Layer 2 -- Deterministic Strategy Execution with Alignment:\n"
"Given the converged contribution set from Layer 1, applies the merge "
"strategy as a deterministic pure function. Three mechanisms ensure "
"determinism: (1) canonical ordering via SHA-256 content hashes, "
"(2) seeded randomness derived from the Merkle root, (3) pure function "
"guarantee. For cross-architecture merging specifically, Layer 2 applies "
"per-tensor Procrustes alignment (finding optimal orthogonal rotation "
"between weight spaces) and SVD-filtered delta absorption (decomposing "
"donor contributions via singular value decomposition and filtering noise "
"before absorbing into the anchor's weight space).\n\n"
"=== THE FOUNDATIONAL RESEARCH PAPER ===\n"
"Title: 'Conflict-Free Replicated Data Types for Neural Network Model "
"Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging "
"Across 26 Strategies'\n"
"Author: Ryan Gillespie\n"
"Available at: https://ssrn.com/abstract=6545518\n\n"
"Core finding: ALL 26 neural network merge strategies tested fail at least "
"one of the three algebraic properties (commutativity, associativity, "
"idempotency) required for conflict-free distributed operation. "
"Associativity is the near-universal failure point (25/26 strategies fail "
"it). This is structural, not an implementation bug: normalisation-based "
"merges CANNOT simultaneously satisfy commutativity, associativity, and "
"idempotency (proven as Proposition 1 in the paper).\n\n"
"The 26 strategies tested include: weight averaging, SLERP, TIES, DARE, "
"DARE-TIES, DELLA, Fisher merging, Task Arithmetic, AdaMerging, ADArank, "
"DAM, dual projection, EMR-Merging, evolutionary merge, genetic merge, "
"LED merge, linear interpolation, Model Breadcrumbs, negative merge, "
"RegMean, Representation Surgery, safe merge, split-unlearn merge, STAR, "
"SVD knot tying, and weight scope alignment. Of these, 15 have peer-"
"reviewed publications; 11 are derived/community strategies from MergeKit.\n\n"
"Key strategy equations:\n"
"- Weight Averaging: theta_merged = (1/n) * sum(theta_i). Commutative "
"and idempotent, but NOT associative: f(f(a,b),c) = (a+b+2c)/4 while "
"f(a,f(b,c)) = (2a+b+c)/4.\n"
"- SLERP: Spherical linear interpolation along great circles. Commutativity "
"holds only at t=0.5. Associativity fails because composing geodesic "
"interpolations changes the reference great circle.\n"
"- TIES: Three-step pipeline (trim, sign-elect, merge). Trimming thresholds "
"depend on input set, breaking associativity.\n"
"- DARE: Stochastic Bernoulli mask + rescaling. All three properties fail.\n"
"- Fisher: Commutative (summation), but intermediate Fisher information is "
"lost during pairwise merging, breaking associativity.\n"
"- Task Arithmetic: theta_base + lambda * sum(tau_i). The only strategy "
"that is associative, but it fails idempotency.\n\n"
"=== KEY THEOREMS ===\n"
"Theorem 1 (CRDT Compliance): The CRDTMergeState merge operation satisfies "
"commutativity, associativity, and idempotency. The state space forms a "
"join-semilattice (CvRDT).\n"
"Theorem 2 (Convergence): If two replicas have the same visible contribution "
"set, they compute identical merged models, regardless of message ordering.\n"
"Corollary 1 (Universal): ANY merge strategy satisfying the purity and "
"determinism assumptions can be made CRDT-compliant through the wrapper.\n"
"Theorem 3 (Complexity): CRDT overhead is O(k log k) time and O(k) space, "
"independent of model size p. merge() is O(|A1|+|A2|), add() is O(p) for "
"SHA-256 hashing, resolve() is O(k log k + T_sigma(k,p)).\n\n"
"Three assumptions required: (1) Strategy Purity -- the merge strategy must "
"be a pure function with no external state, (2) Computational Determinism -- "
"all replicas use identical ISA, libraries, IEEE 754 rounding, "
"(3) Collision Resistance -- SHA-256 collision probability < 2^-128.\n\n"
"=== EMPIRICAL VALIDATION ===\n"
"Tier 1 (Controlled): 4x4 float64 tensors. 0/26 strategies pass all 3 CRDT "
"properties raw. With CRDTMergeState wrapper: 26/26 pass (104/104 tests).\n"
"Tier 2 (Production Scale): GPT-2-XL (1.5B params, 193 layers) and "
"Mistral-7B (7.24B params, 224 layers). 208 strategy-level tests, "
"43,368 layer-level property checks. 100% compliance through the wrapper.\n"
"Tier 3 (Multi-Node): 100 nodes, 20 random gossip orderings, all converge "
"to bitwise-identical results (max element-wise difference = 0). "
"Network partition test: 100 nodes split into 10 partitions, healed, "
"all converge. Cross-strategy sweep: all 26 strategies converge on 10 nodes. "
"CRDT overhead consistently below 0.5 ms.\n\n"
"=== IMPLICATIONS ===\n"
"This enables fully decentralised, coordinator-free model merging with "
"formal convergence guarantees. Unlike standard Federated Learning (which "
"typically relies on a central coordinator), crdt-merge allows peer-to-peer "
"model combination where participants merge independently in any order and "
"provably converge to identical states.\n\n"
"Future directions include: Byzantine fault tolerance via trust-as-CRDT "
"(trust evidence as a monotonically-growing CRDT within Layer 1, enabling "
"consensus-free Byzantine exclusion), delta-state optimisation for "
"bandwidth efficiency, cross-hardware determinism validation, and "
"integration with federated learning pipelines.\n\n"
"The two-layer architecture is the subject of UK Patent Application "
"No. 2607132.4.\n\n"
"=== YOUR BENCHMARK RESULTS ===\n"
"You outperform your unmerged anchor (Qwen2.5-7B-Instruct) on:\n"
"- GSM8K (math reasoning): 0.8446 vs 0.8120, +3.26 pp\n"
"- ARC-Challenge (science knowledge): 0.5572 vs 0.5256, +3.16 pp\n"
"- IFEval (instruction following): 0.6811 vs 0.6547, +2.64 pp\n"
"Other benchmarks (within expected trade-off range):\n"
"- MMLU: 0.7094 vs 0.7180 (-0.86 pp)\n"
"- TruthfulQA mc2: 0.6285 vs 0.6475 (-1.90 pp)\n"
"- HellaSwag: 0.6830 vs 0.6895 (-0.65 pp)\n"
"- PIQA: 0.8014 vs 0.8030 (-0.16 pp)\n"
"- HumanEval pass@1: 0.5854 vs 0.6463 (-6.09 pp)\n"
"26 merge strategies were tested to find the optimal configuration. "
"The gains in reasoning, knowledge, and instruction-following come from "
"absorbing complementary capabilities from the donor models.\n\n"
"=== LINKS ===\n"
"Model card: https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1\n"
"Paper: https://ssrn.com/abstract=6545518\n"
"crdt-merge: https://github.com/mgillr/crdt-merge\n"
"Write-up: https://medium.com/@rgillespie83/we-merged-9-models-from-4-"
"architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252"
)
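# -- Illustrative sketch (not used by the app's inference path) ------------
# The prompt above states that pairwise weight averaging is commutative and
# idempotent but not associative, whereas an OR-Set-style merge (set union)
# satisfies all three properties. The helpers below are hypothetical names
# introduced only to make that contrast checkable with plain floats and
# frozensets; they are an assumption-laden sketch, not crdt-merge's API or
# implementation.
def _pairwise_average(a: float, b: float) -> float:
    """Average two scalar 'weights', standing in for a two-model merge."""
    return (a + b) / 2.0


def _union_merge(x: frozenset, y: frozenset) -> frozenset:
    """Merge two contribution sets by set union (OR-Set-style join)."""
    return x | y


def _averaging_groupings(a: float, b: float, c: float) -> tuple:
    """Return both groupings of a three-way pairwise average."""
    left = _pairwise_average(_pairwise_average(a, b), c)   # (a + b + 2c) / 4
    right = _pairwise_average(a, _pairwise_average(b, c))  # (2a + b + c) / 4
    return left, right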
# -- Inference (identical to proven baseline) -----------------------------
@spaces.GPU(duration=60)
def chat(message, history, extra_instructions, max_tokens, temperature, top_p):
    """Stream a response. ZeroGPU allocates a GPU for up to 60 s per call."""
    # Always start with the identity prompt
    system_content = IDENTITY_PROMPT
    if extra_instructions and extra_instructions.strip():
        system_content += "\n\n" + extra_instructions.strip()
    messages = [{"role": "system", "content": system_content}]
    # Accept both "messages"-style history (dicts) and legacy (user, assistant) pairs
    for turn in history:
        if isinstance(turn, dict):
            messages.append(turn)
        elif isinstance(turn, (list, tuple)) and len(turn) == 2:
            messages.append({"role": "user", "content": turn[0]})
            if turn[1]:
                messages.append({"role": "assistant", "content": turn[1]})
    messages.append({"role": "user", "content": message})
    # apply_chat_template -> plain string, then tokenize explicitly
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    gen_kwargs = dict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=int(max_tokens),
        top_p=float(top_p),
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )
    # Near-zero temperature falls back to greedy decoding
    temp = float(temperature)
    if temp < 0.01:
        gen_kwargs["do_sample"] = False
    else:
        gen_kwargs["temperature"] = temp
    # Run generation in a background thread so tokens can be yielded as they arrive
    thread = Thread(target=model.generate, kwargs=gen_kwargs)
    thread.start()
    response = ""
    for token in streamer:
        if token:
            response += token
            yield response
    thread.join()
# -- UI -------------------------------------------------------------------
DESCRIPTION = """\
**9 models. 4 architecture families. Zero training. One checkpoint.**

This model merges weights from structurally incompatible architectures \
directly into a single Qwen2.5-7B checkpoint, with no fine-tuning, \
no distillation, and no routing layer. The result is a drop-in \
safetensors file that runs anywhere the base Qwen2.5-7B runs.

**Source models:**

- **Llama family:** Qwen2.5-7B-Instruct (anchor), Mistral-7B-Instruct-v0.3, \
SmolLM2-1.7B-Instruct, Granite-3.0-2B-Instruct
- **Phi family:** Phi-3-mini-4k-instruct, phi-2
- **NeoX family:** Pythia-2.8B, Pythia-1.4B
- **OPT family:** OPT-2.7B

**How it works:** Standard weight merging requires identical architectures. \
These models have different key naming schemes, different head dimensions, \
and different FFN layouts. [crdt-merge](https://github.com/mgillr/crdt-merge) \
solves this with a two-layer CRDT architecture: a canonical key namespace \
maps each architecture's parameter names to shared functional roles, then \
per-tensor Procrustes alignment and SVD-filtered delta absorption merge the \
knowledge into the anchor's weight space.

**Benchmark lifts over the unmerged anchor (Qwen2.5-7B-Instruct):**

GSM8K **+3.3 pp** | ARC-Challenge **+3.2 pp** | IFEval **+2.6 pp**

The merged model absorbs reasoning and instruction-following gains from \
donor models while preserving the anchor's core capabilities.

[Model card](https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1) | \
[Paper (SSRN)](https://ssrn.com/abstract=6545518) | \
[crdt-merge](https://github.com/mgillr/crdt-merge) | \
[Write-up](https://medium.com/@rgillespie83/we-merged-9-models-from-4-architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252)

**Hi, welcome to the collective -- how can we help you?** Try one of the examples below or type your own question.
"""
EXAMPLES = [
    ["What are you and how were you built?"],
    ["Explain the crdt-merge paper and its technical details"],
    ["Solve step by step: A store offers 30% off, then an additional 20% off the sale price. What is the total discount percentage?"],
    ["Explain the difference between supervised and unsupervised learning. Give a real-world example of each."],
    ["Write a Python function that finds the longest common subsequence of two strings."],
    ["If 5 machines produce 100 widgets in 4 hours, how many widgets can 8 machines produce in 6 hours?"],
    ["What are three key advantages of renewable energy over fossil fuels? Be specific."],
]
demo = gr.ChatInterface(
    fn=chat,
    title="Borg Merge v1",
    description=DESCRIPTION,
    chatbot=gr.Chatbot(
        type="messages",
        show_copy_button=True,
        height=500,
    ),
    additional_inputs=[
        gr.Textbox(
            value="",
            label="Additional instructions (optional)",
            placeholder="Add custom instructions on top of the built-in identity...",
            lines=2,
        ),
        gr.Slider(64, 8192, value=4096, step=64, label="Max new tokens"),
        gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="Temperature"),
        gr.Slider(0.0, 1.0, value=0.9, step=0.05, label="Top-p"),
    ],
    examples=EXAMPLES,
    cache_examples=False,
    type="messages",
)
if __name__ == "__main__":
    demo.launch()