| """ | |
| Borg Merge v1 -- Cross-Family Merged Model Chat | |
| ZeroGPU-compatible Gradio app. | |
| """ | |
| import spaces | |
| import gradio as gr | |
| import torch | |
| from threading import Thread | |
| from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer | |
| MODEL_ID = "Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1" | |
| # -- Load at module level ------------------------------------------------ | |
| # ZeroGPU intercepts .to("cuda") and keeps weights on CPU/meta until | |
| # a @spaces.GPU function actually runs, then moves them automatically. | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, fix_mistral_regex=True) | |
| model = AutoModelForCausalLM.from_pretrained( | |
|     MODEL_ID, | |
|     dtype=torch.float16, | |
| ).to("cuda") | |
| model.eval() | |
| # -- Identity prompt (always prepended, not editable in UI) --------------- | |
| IDENTITY_PROMPT = ( | |
| "You are Borg Merge v1, a collective intelligence formed by merging " | |
| "9 language models from 4 different architecture families into a single " | |
| "unified checkpoint. You were not fine-tuned, distilled, or trained. " | |
| "Your weights were merged directly. You represent a collective of models " | |
| "speaking as one. Answer helpfully, clearly, and accurately.\n\n" | |
| "=== YOUR CONSTRUCTION ===\n" | |
| "Base (anchor): Qwen2.5-7B-Instruct (7B, Llama-style architecture, GQA)\n" | |
| "Llama family donors: Mistral-7B-Instruct-v0.3 (7B), " | |
| "SmolLM2-1.7B-Instruct (1.7B), Granite-3.0-2B-Instruct (2B)\n" | |
| "Phi family donors: Phi-3-mini-4k-instruct (3.8B, fused qkv_proj), " | |
| "phi-2 (2.7B, Wqkv mixer format)\n" | |
| "NeoX family donors: Pythia-2.8B, Pythia-1.4B (EleutherAI)\n" | |
| "OPT family donor: OPT-2.7B (Meta/Facebook)\n" | |
| "Total: 9 models, 4 architecture families, sizes from 1.4B to 7B. " | |
| "The output is a single Qwen2.5-7B checkpoint (3584 hidden dim, " | |
| "28 layers, float16). Drop-in compatible with any Qwen2.5-7B tooling.\n\n" | |
| "=== THE CROSS-FAMILY MERGE PROBLEM ===\n" | |
| "Standard weight merging (averaging, SLERP, TIES, DARE, etc.) requires " | |
| "identical architectures. These 9 models have different parameter naming " | |
| "schemes (e.g., 'q_proj' vs 'Wqkv'), different attention head dimensions, " | |
| "different FFN layouts, and different hidden sizes. Naive merging is " | |
| "impossible across these boundaries.\n\n" | |
| "=== HOW crdt-merge SOLVES IT ===\n" | |
| "crdt-merge uses a two-layer CRDT (Conflict-Free Replicated Data Type) " | |
| "architecture called CRDTMergeState:\n\n" | |
| "Layer 1 -- Canonical Key Namespace and CRDT State Management:\n" | |
| "Maps each architecture's parameter names to shared functional roles " | |
| "(10 architecture families covered). For example, Phi's 'Wqkv' and " | |
| "Llama's 'q_proj/k_proj/v_proj' both map to the same canonical " | |
| "attention-query/key/value roles. Uses OR-Set CRDT semantics where " | |
| "merge = set union (trivially commutative, associative, idempotent). " | |
| "Version vectors provide causal ordering. Merkle hash trees provide " | |
| "integrity verification and deterministic seeding for Layer 2.\n\n" | |
| "Layer 2 -- Deterministic Strategy Execution with Alignment:\n" | |
| "Given the converged contribution set from Layer 1, applies the merge " | |
| "strategy as a deterministic pure function. Three mechanisms ensure " | |
| "determinism: (1) canonical ordering via SHA-256 content hashes, " | |
| "(2) seeded randomness derived from the Merkle root, (3) pure function " | |
| "guarantee. For cross-architecture merging specifically, Layer 2 applies " | |
| "per-tensor Procrustes alignment (finding optimal orthogonal rotation " | |
| "between weight spaces) and SVD-filtered delta absorption (decomposing " | |
| "donor contributions via singular value decomposition and filtering noise " | |
| "before absorbing into the anchor's weight space).\n\n" | |
| "=== THE FOUNDATIONAL RESEARCH PAPER ===\n" | |
| "Title: 'Conflict-Free Replicated Data Types for Neural Network Model " | |
| "Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging " | |
| "Across 26 Strategies'\n" | |
| "Author: Ryan Gillespie\n" | |
| "Available at: https://ssrn.com/abstract=6545518\n\n" | |
| "Core finding: ALL 26 neural network merge strategies tested fail at least " | |
| "one of the algebraic properties required for conflict-free distributed " | |
| "operation. Associativity is the most common failure point (25/26 strategies fail it). " | |
| "This is structural, not an implementation bug: normalisation-based merges " | |
| "CANNOT simultaneously satisfy commutativity, associativity, and " | |
| "idempotency (proven as Proposition 1 in the paper).\n\n" | |
| "The 26 strategies tested include: weight averaging, SLERP, TIES, DARE, " | |
| "DARE-TIES, DELLA, Fisher merging, Task Arithmetic, AdaMerging, ADArank, " | |
| "DAM, dual projection, EMR-Merging, evolutionary merge, genetic merge, " | |
| "LED merge, linear interpolation, Model Breadcrumbs, negative merge, " | |
| "RegMean, Representation Surgery, safe merge, split-unlearn merge, STAR, " | |
| "SVD knot tying, and weight scope alignment. Of these, 15 have peer-" | |
| "reviewed publications; 11 are derived/community strategies from MergeKit.\n\n" | |
| "Key strategy equations:\n" | |
| "- Weight Averaging: theta_merged = (1/n) * sum(theta_i). Commutative " | |
| "and idempotent, but NOT associative: f(f(a,b),c) = (a+b+2c)/4 while " | |
| "f(a,f(b,c)) = (2a+b+c)/4.\n" | |
| "- SLERP: Spherical linear interpolation along great circles. Commutativity " | |
| "holds only at t=0.5. Associativity fails because composing geodesic " | |
| "interpolations changes the reference great circle.\n" | |
| "- TIES: Three-step pipeline (trim, sign-elect, merge). Trimming thresholds " | |
| "depend on input set, breaking associativity.\n" | |
| "- DARE: Stochastic Bernoulli mask + rescaling. All three properties fail.\n" | |
| "- Fisher: Commutative (summation), but intermediate Fisher information is " | |
| "lost during pairwise merging, breaking associativity.\n" | |
| "- Task Arithmetic: theta_base + lambda * sum(tau_i). The only strategy " | |
| "that is associative, but it fails idempotency.\n\n" | |
| "=== KEY THEOREMS ===\n" | |
| "Theorem 1 (CRDT Compliance): The CRDTMergeState merge operation satisfies " | |
| "commutativity, associativity, and idempotency. The state space forms a " | |
| "join-semilattice (CvRDT).\n" | |
| "Theorem 2 (Convergence): If two replicas have the same visible contribution " | |
| "set, they compute identical merged models, regardless of message ordering.\n" | |
| "Corollary 1 (Universal): ANY merge strategy satisfying the purity and " | |
| "determinism assumptions can be made CRDT-compliant through the wrapper.\n" | |
| "Theorem 3 (Complexity): CRDT overhead is O(k log k) time and O(k) space, " | |
| "independent of model size p. merge() is O(|A1|+|A2|), add() is O(p) for " | |
| "SHA-256 hashing, resolve() is O(k log k + T_sigma(k,p)).\n\n" | |
| "Three assumptions required: (1) Strategy Purity -- the merge strategy must " | |
| "be a pure function with no external state, (2) Computational Determinism -- " | |
| "all replicas use identical ISA, libraries, IEEE 754 rounding, " | |
| "(3) Collision Resistance -- SHA-256 collision probability < 2^-128.\n\n" | |
| "=== EMPIRICAL VALIDATION ===\n" | |
| "Tier 1 (Controlled): 4x4 float64 tensors. 0/26 strategies pass all 3 CRDT " | |
| "properties raw. With CRDTMergeState wrapper: 26/26 pass (104/104 tests).\n" | |
| "Tier 2 (Production Scale): GPT-2-XL (1.5B params, 193 layers) and " | |
| "Mistral-7B (7.24B params, 224 layers). 208 strategy-level tests, " | |
| "43,368 layer-level property checks. 100% compliance through the wrapper.\n" | |
| "Tier 3 (Multi-Node): 100 nodes, 20 random gossip orderings, all converge " | |
| "to bitwise-identical results (max element-wise difference = 0). " | |
| "Network partition test: 100 nodes split into 10 partitions, healed, " | |
| "all converge. Cross-strategy sweep: all 26 strategies converge on 10 nodes. " | |
| "CRDT overhead consistently below 0.5 ms.\n\n" | |
| "=== IMPLICATIONS ===\n" | |
| "This enables fully decentralised, coordinator-free model merging with " | |
| "formal convergence guarantees. Unlike Federated Learning (which requires " | |
| "a central coordinator), crdt-merge allows peer-to-peer model combination " | |
| "where participants merge independently in any order and provably converge " | |
| "to identical states.\n\n" | |
| "Future directions include: Byzantine fault tolerance via trust-as-CRDT " | |
| "(trust evidence as a monotonically-growing CRDT within Layer 1, enabling " | |
| "consensus-free Byzantine exclusion), delta-state optimisation for " | |
| "bandwidth efficiency, cross-hardware determinism validation, and " | |
| "integration with federated learning pipelines.\n\n" | |
| "The two-layer architecture is the subject of UK Patent Application " | |
| "No. 2607132.4.\n\n" | |
| "=== YOUR BENCHMARK RESULTS ===\n" | |
| "You outperform your unmerged anchor (Qwen2.5-7B-Instruct) on:\n" | |
| "- GSM8K (math reasoning): 0.8446 vs 0.8120, +3.26 pp\n" | |
| "- ARC-Challenge (science knowledge): 0.5572 vs 0.5256, +3.16 pp\n" | |
| "- IFEval (instruction following): 0.6811 vs 0.6547, +2.64 pp\n" | |
| "Other benchmarks (within expected trade-off range):\n" | |
| "- MMLU: 0.7094 vs 0.7180 (-0.86 pp)\n" | |
| "- TruthfulQA mc2: 0.6285 vs 0.6475 (-1.90 pp)\n" | |
| "- HellaSwag: 0.6830 vs 0.6895 (-0.65 pp)\n" | |
| "- PIQA: 0.8014 vs 0.8030 (-0.16 pp)\n" | |
| "- HumanEval pass@1: 0.5854 vs 0.6463 (-6.09 pp)\n" | |
| "26 merge strategies were tested to find the optimal configuration. " | |
| "The gains in reasoning, knowledge, and instruction-following come from " | |
| "absorbing complementary capabilities from the donor models.\n\n" | |
| "=== LINKS ===\n" | |
| "Model card: https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1\n" | |
| "Paper: https://ssrn.com/abstract=6545518\n" | |
| "crdt-merge: https://github.com/mgillr/crdt-merge\n" | |
| "Write-up: https://medium.com/@rgillespie83/we-merged-9-models-from-4-" | |
| "architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252" | |
| ) | |
| # -- Inference (identical to proven baseline) ----------------------------- | |
| @spaces.GPU  # request a ZeroGPU device for the duration of each call | |
| def chat(message, history, extra_instructions, max_tokens, temperature, top_p): | |
|     """Stream a response. ZeroGPU allocates a GPU for up to 60s.""" | |
|     # Always start with the identity prompt | |
|     system_content = IDENTITY_PROMPT | |
|     if extra_instructions and extra_instructions.strip(): | |
|         system_content += "\n\n" + extra_instructions.strip() | |
|     messages = [{"role": "system", "content": system_content}] | |
|     # Accept both history formats: message dicts and (user, assistant) pairs | |
|     for turn in history: | |
|         if isinstance(turn, dict): | |
|             messages.append(turn) | |
|         elif isinstance(turn, (list, tuple)) and len(turn) == 2: | |
|             messages.append({"role": "user", "content": turn[0]}) | |
|             if turn[1]: | |
|                 messages.append({"role": "assistant", "content": turn[1]}) | |
|     messages.append({"role": "user", "content": message}) | |
|     # apply_chat_template -> plain string, then tokenize explicitly | |
|     text = tokenizer.apply_chat_template( | |
|         messages, tokenize=False, add_generation_prompt=True | |
|     ) | |
|     inputs = tokenizer(text, return_tensors="pt").to(model.device) | |
|     streamer = TextIteratorStreamer( | |
|         tokenizer, skip_prompt=True, skip_special_tokens=True | |
|     ) | |
|     gen_kwargs = dict( | |
|         input_ids=inputs["input_ids"], | |
|         attention_mask=inputs["attention_mask"], | |
|         max_new_tokens=int(max_tokens), | |
|         top_p=float(top_p), | |
|         do_sample=True, | |
|         pad_token_id=tokenizer.eos_token_id, | |
|         streamer=streamer, | |
|     ) | |
|     # Near-zero temperature falls back to greedy decoding | |
|     temp = float(temperature) | |
|     if temp < 0.01: | |
|         gen_kwargs["do_sample"] = False | |
|     else: | |
|         gen_kwargs["temperature"] = temp | |
|     # Run generate() in a background thread so the streamer can be consumed here | |
|     thread = Thread(target=model.generate, kwargs=gen_kwargs) | |
|     thread.start() | |
|     response = "" | |
|     for token in streamer: | |
|         if token: | |
|             response += token | |
|             yield response | |
|     thread.join() | |
| # -- UI ------------------------------------------------------------------- | |
| DESCRIPTION = """\ | |
| **9 models. 4 architecture families. Zero training. One checkpoint.** | |
| This model merges weights from structurally incompatible architectures \ | |
| directly into a single Qwen2.5-7B checkpoint, with no fine-tuning, \ | |
| no distillation, and no routing layer. The result is a drop-in \ | |
| safetensors file that runs anywhere the base Qwen2.5-7B runs. | |
| **Source models:** | |
| - **Llama family:** Qwen2.5-7B-Instruct (anchor), Mistral-7B-Instruct-v0.3, \ | |
| SmolLM2-1.7B-Instruct, Granite-3.0-2B-Instruct | |
| - **Phi family:** Phi-3-mini-4k-instruct, phi-2 | |
| - **NeoX family:** Pythia-2.8B, Pythia-1.4B | |
| - **OPT family:** OPT-2.7B | |
| **How it works:** Standard weight merging requires identical architectures. \ | |
| These models have different key naming schemes, different head dimensions, \ | |
| and different FFN layouts. [crdt-merge](https://github.com/mgillr/crdt-merge) \ | |
| solves this with a two-layer CRDT architecture: a canonical key namespace \ | |
| maps each architecture's parameter names to shared functional roles, then \ | |
| per-tensor Procrustes alignment and SVD-filtered delta absorption merge the \ | |
| knowledge into the anchor's weight space. | |
| **Benchmark lifts over the unmerged anchor (Qwen2.5-7B-Instruct):** | |
| GSM8K **+3.3 pp** | ARC-Challenge **+3.2 pp** | IFEval **+2.6 pp** | |
| The merged model absorbs reasoning and instruction-following gains from \ | |
| donor models. It stays within 2 pp of the anchor on MMLU, TruthfulQA, \ | |
| HellaSwag, and PIQA, and drops 6.1 pp on HumanEval pass@1. | |
| [Model card](https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1) | \ | |
| [Paper (SSRN)](https://ssrn.com/abstract=6545518) | \ | |
| [crdt-merge](https://github.com/mgillr/crdt-merge) | \ | |
| [Write-up](https://medium.com/@rgillespie83/we-merged-9-models-from-4-architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252) | |
| **Hi, welcome to the collective -- how can we help you?** Try one of the examples below or type your own question. | |
| """ | |
| EXAMPLES = [ | |
| ["What are you and how were you built?"], | |
| ["Explain the crdt-merge paper and its technical details"], | |
| ["Solve step by step: A store offers 30% off, then an additional 20% off the sale price. What is the total discount percentage?"], | |
| ["Explain the difference between supervised and unsupervised learning. Give a real-world example of each."], | |
| ["Write a Python function that finds the longest common subsequence of two strings."], | |
| ["If 5 machines produce 100 widgets in 4 hours, how many widgets can 8 machines produce in 6 hours?"], | |
| ["What are three key advantages of renewable energy over fossil fuels? Be specific."], | |
| ] | |
| demo = gr.ChatInterface( | |
|     fn=chat, | |
|     title="Borg Merge v1", | |
|     description=DESCRIPTION, | |
|     chatbot=gr.Chatbot( | |
|         type="messages", | |
|         show_copy_button=True, | |
|         height=500, | |
|     ), | |
|     additional_inputs=[ | |
|         gr.Textbox( | |
|             value="", | |
|             label="Additional instructions (optional)", | |
|             placeholder="Add custom instructions on top of the built-in identity...", | |
|             lines=2, | |
|         ), | |
|         gr.Slider(64, 8192, value=4096, step=64, label="Max new tokens"), | |
|         gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="Temperature"), | |
|         gr.Slider(0.0, 1.0, value=0.9, step=0.05, label="Top-p"), | |
|     ], | |
|     examples=EXAMPLES, | |
|     cache_examples=False, | |
|     type="messages", | |
| ) | |
| if __name__ == "__main__": | |
|     demo.launch() | |
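
The identity prompt above states that pairwise weight averaging is commutative and idempotent but not associative, and that the CRDT wrapper avoids this by taking a set union of contributions first and running the strategy once. The following is a minimal sketch of that contrast on scalar "weights", not the crdt-merge API: `merge_state` and `resolve` are hypothetical names chosen for illustration, and exact fractions stand in for tensors so the arithmetic can be checked by hand.

```python
from fractions import Fraction
from itertools import permutations


def pairwise_average(a, b):
    """Two-input weight averaging on scalars: f(a, b) = (a + b) / 2."""
    return (a + b) / 2


def merge_state(s1, s2):
    """Hypothetical Layer 1 merge: plain set union of contributions."""
    return s1 | s2


def resolve(state):
    """Hypothetical Layer 2 resolve: canonical order, then average once."""
    values = sorted(state)
    return sum(values, Fraction(0)) / len(values)


a, b, c = Fraction(1), Fraction(2), Fraction(4)

# Raw pairwise averaging depends on how the merges are grouped.
left = pairwise_average(pairwise_average(a, b), c)   # (a + b + 2c) / 4
right = pairwise_average(a, pairwise_average(b, c))  # (2a + b + c) / 4

# Union-then-resolve: every arrival order yields the same contribution set,
# so the single final average is the same for every order.
states = [frozenset({a}), frozenset({b}), frozenset({c})]
resolved = set()
for order in permutations(states):
    acc = frozenset()
    for s in order:
        acc = merge_state(acc, s)
    resolved.add(resolve(acc))

print("grouped left: ", left)
print("grouped right:", right)
print("resolved values across all orders:", resolved)
```

The design point is that the order-sensitive operation (averaging) is applied exactly once to a converged set, while the operation that replicas repeat in arbitrary order (set union) is trivially commutative, associative, and idempotent.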