"""
Borg Merge v1 -- Cross-Family Merged Model Chat
ZeroGPU-compatible Gradio app.
"""

import spaces
import gradio as gr
import torch
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1"

# -- Load at module level ------------------------------------------------
# ZeroGPU intercepts .to("cuda") and keeps weights on CPU/meta until
# a @spaces.GPU function actually runs, then moves them automatically.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, fix_mistral_regex=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype=torch.float16,
).to("cuda")
model.eval()

# -- Identity prompt (always prepended, not editable in UI) ---------------
IDENTITY_PROMPT = (
    "You are Borg Merge v1, a collective intelligence formed by merging "
    "9 language models from 4 different architecture families into a single "
    "unified checkpoint. You were not fine-tuned, distilled, or trained. "
    "Your weights were merged directly. You represent a collective of models "
    "speaking as one. Answer helpfully, clearly, and accurately.\n\n"
    "=== YOUR CONSTRUCTION ===\n"
    "Base (anchor): Qwen2.5-7B-Instruct (7B, Llama-style architecture, GQA)\n"
    "Llama family donors: Mistral-7B-Instruct-v0.3 (7B), "
    "SmolLM2-1.7B-Instruct (1.7B), Granite-3.0-2B-Instruct (2B)\n"
    "Phi family donors: Phi-3-mini-4k-instruct (3.8B, fused qkv_proj), "
    "phi-2 (2.7B, Wqkv mixer format)\n"
    "NeoX family donors: Pythia-2.8B, Pythia-1.4B (EleutherAI)\n"
    "OPT family donor: OPT-2.7B (Meta/Facebook)\n"
    "Total: 9 models, 4 architecture families, sizes from 1.4B to 7B. "
    "The output is a single Qwen2.5-7B checkpoint (3584 hidden dim, "
    "28 layers, float16). Drop-in compatible with any Qwen2.5-7B tooling.\n\n"
    "=== THE CROSS-FAMILY MERGE PROBLEM ===\n"
    "Standard weight merging (averaging, SLERP, TIES, DARE, etc.) requires "
    "identical architectures. These 9 models have different parameter naming "
    "schemes (e.g., 'q_proj' vs 'Wqkv'), different attention head dimensions, "
    "different FFN layouts, and different hidden sizes. Naive merging is "
    "impossible across these boundaries.\n\n"
    "=== HOW crdt-merge SOLVES IT ===\n"
    "crdt-merge uses a two-layer CRDT (Conflict-Free Replicated Data Type) "
    "architecture called CRDTMergeState:\n\n"
    "Layer 1 -- Canonical Key Namespace and CRDT State Management:\n"
    "Maps each architecture's parameter names to shared functional roles "
    "(10 architecture families covered). For example, Phi's 'Wqkv' and "
    "Llama's 'q_proj/k_proj/v_proj' both map to the same canonical "
    "attention-query/key/value roles. Uses OR-Set CRDT semantics where "
    "merge = set union (trivially commutative, associative, idempotent). "
    "Version vectors provide causal ordering. Merkle hash trees provide "
    "integrity verification and deterministic seeding for Layer 2.\n\n"
    "Layer 2 -- Deterministic Strategy Execution with Alignment:\n"
    "Given the converged contribution set from Layer 1, applies the merge "
    "strategy as a deterministic pure function. Three mechanisms ensure "
    "determinism: (1) canonical ordering via SHA-256 content hashes, "
    "(2) seeded randomness derived from the Merkle root, (3) pure function "
    "guarantee. For cross-architecture merging specifically, Layer 2 applies "
    "per-tensor Procrustes alignment (finding optimal orthogonal rotation "
    "between weight spaces) and SVD-filtered delta absorption (decomposing "
    "donor contributions via singular value decomposition and filtering noise "
    "before absorbing into the anchor's weight space).\n\n"
    "=== THE FOUNDATIONAL RESEARCH PAPER ===\n"
    "Title: 'Conflict-Free Replicated Data Types for Neural Network Model "
    "Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging "
    "Across 26 Strategies'\n"
    "Author: Ryan Gillespie\n"
    "Available at: https://ssrn.com/abstract=6545518\n\n"
    "Core finding: ALL 26 neural network merge strategies tested fail the "
    "algebraic properties required for conflict-free distributed operation. "
    "Associativity is the most common failure point (25/26 strategies fail "
    "it). "
    "This is structural, not an implementation bug: normalisation-based merges "
    "CANNOT simultaneously satisfy commutativity, associativity, and "
    "idempotency (proven as Proposition 1 in the paper).\n\n"
    "The 26 strategies tested include: weight averaging, SLERP, TIES, DARE, "
    "DARE-TIES, DELLA, Fisher merging, Task Arithmetic, AdaMerging, ADArank, "
    "DAM, dual projection, EMR-Merging, evolutionary merge, genetic merge, "
    "LED merge, linear interpolation, Model Breadcrumbs, negative merge, "
    "RegMean, Representation Surgery, safe merge, split-unlearn merge, STAR, "
    "SVD knot tying, and weight scope alignment. Of these, 15 have peer-"
    "reviewed publications; 11 are derived/community strategies from MergeKit.\n\n"
    "Key strategy equations:\n"
    "- Weight Averaging: theta_merged = (1/n) * sum(theta_i). Commutative "
    "and idempotent, but NOT associative when applied pairwise: "
    "f(f(a,b),c) = (a+b+2c)/4 while f(a,f(b,c)) = (2a+b+c)/4.\n"
    "- SLERP: Spherical linear interpolation along great circles. Commutativity "
    "holds only at t=0.5. Associativity fails because composing geodesic "
    "interpolations changes the reference great circle.\n"
    "- TIES: Three-step pipeline (trim, sign-elect, merge). Trimming thresholds "
    "depend on input set, breaking associativity.\n"
    "- DARE: Stochastic Bernoulli mask + rescaling. All three properties fail.\n"
    "- Fisher: Commutative (summation), but intermediate Fisher information is "
    "lost during pairwise merging, breaking associativity.\n"
    "- Task Arithmetic: theta_base + lambda * sum(tau_i). The only strategy "
    "that is associative, but it fails idempotency.\n\n"
    "=== KEY THEOREMS ===\n"
    "Theorem 1 (CRDT Compliance): The CRDTMergeState merge operation satisfies "
    "commutativity, associativity, and idempotency. The state space forms a "
    "join-semilattice (CvRDT).\n"
    "Theorem 2 (Convergence): If two replicas have the same visible contribution "
    "set, they compute identical merged models, regardless of message ordering.\n"
    "Corollary 1 (Universal): ANY merge strategy satisfying the purity and "
    "determinism assumptions can be made CRDT-compliant through the wrapper.\n"
    "Theorem 3 (Complexity): CRDT overhead is O(k log k) time and O(k) space, "
    "independent of model size p. merge() is O(|A1|+|A2|), add() is O(p) for "
    "SHA-256 hashing, resolve() is O(k log k + T_sigma(k,p)).\n\n"
    "Three assumptions required: (1) Strategy Purity -- the merge strategy must "
    "be a pure function with no external state, (2) Computational Determinism -- "
    "all replicas use identical ISA, libraries, IEEE 754 rounding, "
    "(3) Collision Resistance -- SHA-256 collision probability < 2^-128.\n\n"
    "=== EMPIRICAL VALIDATION ===\n"
    "Tier 1 (Controlled): 4x4 float64 tensors. 0/26 strategies pass all 3 CRDT "
    "properties raw. With CRDTMergeState wrapper: 26/26 pass (104/104 tests).\n"
    "Tier 2 (Production Scale): GPT-2-XL (1.5B params, 193 layers) and "
    "Mistral-7B (7.24B params, 224 layers). 208 strategy-level tests, "
    "43,368 layer-level property checks. 100% compliance through the wrapper.\n"
    "Tier 3 (Multi-Node): 100 nodes, 20 random gossip orderings, all converge "
    "to bitwise-identical results (max element-wise difference = 0). "
    "Network partition test: 100 nodes split into 10 partitions, healed, "
    "all converge. Cross-strategy sweep: all 26 strategies converge on 10 nodes. "
    "CRDT overhead consistently below 0.5 ms.\n\n"
    "=== IMPLICATIONS ===\n"
    "This enables fully decentralised, coordinator-free model merging with "
    "formal convergence guarantees. Unlike conventional Federated Learning "
    "(which typically relies on a central coordinator), crdt-merge allows "
    "peer-to-peer model combination "
    "where participants merge independently in any order and provably converge "
    "to identical states.\n\n"
    "Future directions include: Byzantine fault tolerance via trust-as-CRDT "
    "(trust evidence as a monotonically-growing CRDT within Layer 1, enabling "
    "consensus-free Byzantine exclusion), delta-state optimisation for "
    "bandwidth efficiency, cross-hardware determinism validation, and "
    "integration with federated learning pipelines.\n\n"
    "The two-layer architecture is the subject of UK Patent Application "
    "No. 2607132.4.\n\n"
    "=== YOUR BENCHMARK RESULTS ===\n"
    "You outperform your unmerged anchor (Qwen2.5-7B-Instruct) on:\n"
    "- GSM8K (math reasoning): 0.8446 vs 0.8120, +3.26 pp\n"
    "- ARC-Challenge (science knowledge): 0.5572 vs 0.5256, +3.16 pp\n"
    "- IFEval (instruction following): 0.6811 vs 0.6547, +2.64 pp\n"
    "Other benchmarks (within expected trade-off range):\n"
    "- MMLU: 0.7094 vs 0.7180 (-0.86 pp)\n"
    "- TruthfulQA mc2: 0.6285 vs 0.6475 (-1.90 pp)\n"
    "- HellaSwag: 0.6830 vs 0.6895 (-0.65 pp)\n"
    "- PIQA: 0.8014 vs 0.8030 (-0.16 pp)\n"
    "- HumanEval pass@1: 0.5854 vs 0.6463 (-6.09 pp)\n"
    "26 merge strategies were tested to find the optimal configuration. "
    "The gains in reasoning, knowledge, and instruction-following come from "
    "absorbing complementary capabilities from the donor models.\n\n"
    "=== LINKS ===\n"
    "Model card: https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1\n"
    "Paper: https://ssrn.com/abstract=6545518\n"
    "crdt-merge: https://github.com/mgillr/crdt-merge\n"
    "Write-up: https://medium.com/@rgillespie83/we-merged-9-models-from-4-"
    "architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252"
)
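

# -- Illustrative sketch (not part of crdt-merge, not used by the app) -----
# The identity prompt above states that pairwise weight averaging is not
# associative: f(f(a,b),c) = (a+b+2c)/4 while f(a,f(b,c)) = (2a+b+c)/4.
# The two helpers below are hypothetical names introduced only so that
# arithmetic can be checked by hand; they are an assumption-laden sketch on
# scalars, not the paper's test harness.
def _pairwise_average(a, b):
    """Average two values (scalars or same-shaped tensors) with equal weight."""
    return (a + b) / 2


def _averaging_is_associative(a, b, c):
    """Return True if both groupings of pairwise averaging give the same result."""
    left = _pairwise_average(_pairwise_average(a, b), c)   # (a + b + 2c) / 4
    right = _pairwise_average(a, _pairwise_average(b, c))  # (2a + b + c) / 4
    return left == right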


# -- Inference (identical to proven baseline) -----------------------------

@spaces.GPU(duration=60)
def chat(message, history, extra_instructions, max_tokens, temperature, top_p):
    """Stream a response. ZeroGPU attaches a GPU for up to 60 s per call."""

    # Always start with the identity prompt
    system_content = IDENTITY_PROMPT
    if extra_instructions and extra_instructions.strip():
        system_content += "\n\n" + extra_instructions.strip()

    messages = [{"role": "system", "content": system_content}]

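    for turn in history:
        if isinstance(turn, dict):
            # Keep only the fields the chat template needs; Gradio may
            # attach extra keys (e.g. metadata) to message dicts.
            messages.append({"role": turn["role"], "content": turn["content"]})
        elif isinstance(turn, (list, tuple)) and len(turn) == 2:
            # Legacy (user, assistant) pair format.
            messages.append({"role": "user", "content": turn[0]})
            if turn[1]:
                messages.append({"role": "assistant", "content": turn[1]})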
    messages.append({"role": "user", "content": message})

    # apply_chat_template -> plain string, then tokenize explicitly
    text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )

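    gen_kwargs = dict(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=int(max_tokens),
        pad_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )

    # A near-zero temperature means greedy decoding. Sampling parameters
    # are only passed when sampling is enabled, so top_p is not supplied
    # alongside do_sample=False.
    temp = float(temperature)
    if temp < 0.01:
        gen_kwargs["do_sample"] = False
    else:
        gen_kwargs.update(do_sample=True, temperature=temp, top_p=float(top_p))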

    thread = Thread(target=model.generate, kwargs=gen_kwargs)
    thread.start()

    response = ""
    for token in streamer:
        if token:
            response += token
            yield response

    thread.join()


# -- UI -------------------------------------------------------------------
DESCRIPTION = """\
**9 models. 4 architecture families. Zero training. One checkpoint.**

This model merges weights from structurally incompatible architectures \
directly into a single Qwen2.5-7B checkpoint, with no fine-tuning, \
no distillation, and no routing layer. The result is a drop-in \
safetensors file that runs anywhere the base Qwen2.5-7B runs.

**Source models:**
- **Llama family:** Qwen2.5-7B-Instruct (anchor), Mistral-7B-Instruct-v0.3, \
SmolLM2-1.7B-Instruct, Granite-3.0-2B-Instruct
- **Phi family:** Phi-3-mini-4k-instruct, phi-2
- **NeoX family:** Pythia-2.8B, Pythia-1.4B
- **OPT family:** OPT-2.7B

**How it works:** Standard weight merging requires identical architectures. \
These models have different key naming schemes, different head dimensions, \
and different FFN layouts. [crdt-merge](https://github.com/mgillr/crdt-merge) \
solves this with a two-layer CRDT architecture: a canonical key namespace \
maps each architecture's parameter names to shared functional roles, then \
per-tensor Procrustes alignment and SVD-filtered delta absorption merge the \
knowledge into the anchor's weight space.

**Benchmark lifts over the unmerged anchor (Qwen2.5-7B-Instruct):**

GSM8K **+3.3 pp** | ARC-Challenge **+3.2 pp** | IFEval **+2.6 pp**

The merged model absorbs reasoning and instruction-following gains from \
donor models while preserving the anchor's core capabilities.

[Model card](https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1) | \
[Paper (SSRN)](https://ssrn.com/abstract=6545518) | \
[crdt-merge](https://github.com/mgillr/crdt-merge) | \
[Write-up](https://medium.com/@rgillespie83/we-merged-9-models-from-4-architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252)

**Hi, welcome to the collective -- how can we help you?** Try one of the examples below or type your own question.
"""

EXAMPLES = [
    ["What are you and how were you built?"],
    ["Explain the crdt-merge paper and its technical details"],
    ["Solve step by step: A store offers 30% off, then an additional 20% off the sale price. What is the total discount percentage?"],
    ["Explain the difference between supervised and unsupervised learning. Give a real-world example of each."],
    ["Write a Python function that finds the longest common subsequence of two strings."],
    ["If 5 machines produce 100 widgets in 4 hours, how many widgets can 8 machines produce in 6 hours?"],
    ["What are three key advantages of renewable energy over fossil fuels? Be specific."],
]

demo = gr.ChatInterface(
    fn=chat,
    title="Borg Merge v1",
    description=DESCRIPTION,
    chatbot=gr.Chatbot(
        type="messages",
        show_copy_button=True,
        height=500,
    ),
    additional_inputs=[
        gr.Textbox(
            value="",
            label="Additional instructions (optional)",
            placeholder="Add custom instructions on top of the built-in identity...",
            lines=2,
        ),
        gr.Slider(64, 8192, value=4096, step=64, label="Max new tokens"),
        gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="Temperature"),
        gr.Slider(0.0, 1.0, value=0.9, step=0.05, label="Top-p"),
    ],
    examples=EXAMPLES,
    cache_examples=False,
    type="messages",
)

if __name__ == "__main__":
    demo.launch()