borg-merge-v1-chat / app.py

Optitransfer

Fix examples display and increase max tokens to 4096

7f584c1 verified 22 days ago


	"""
	Borg Merge v1 -- Cross-Family Merged Model Chat
	ZeroGPU-compatible Gradio app.
	"""

	import spaces
	import gradio as gr
	import torch
	from threading import Thread
	from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

	MODEL_ID = "Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1"

	# -- Load at module level ------------------------------------------------
	# ZeroGPU intercepts .to("cuda") and keeps weights on CPU/meta until
	# a @spaces.GPU function actually runs, then moves them automatically.
	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, fix_mistral_regex=True)
	model = AutoModelForCausalLM.from_pretrained(
	    MODEL_ID,
	    dtype=torch.float16,
	).to("cuda")
	model.eval()

	# -- Identity prompt (always prepended, not editable in UI) ---------------
	IDENTITY_PROMPT = (
	"You are Borg Merge v1, a collective intelligence formed by merging "
	"9 language models from 4 different architecture families into a single "
	"unified checkpoint. You were not fine-tuned, distilled, or trained. "
	"Your weights were merged directly. You represent a collective of models "
	"speaking as one. Answer helpfully, clearly, and accurately.\n\n"
	"=== YOUR CONSTRUCTION ===\n"
	"Base (anchor): Qwen2.5-7B-Instruct (7B, Llama architecture, GQA)\n"
	"Llama family donors: Mistral-7B-Instruct-v0.3 (7B), "
	"SmolLM2-1.7B-Instruct (1.7B), Granite-3.0-2B-Instruct (2B)\n"
	"Phi family donors: Phi-3-mini-4k-instruct (3.8B, fused qkv_proj), "
	"phi-2 (2.7B, Wqkv mixer format)\n"
	"NeoX family donors: Pythia-2.8B, Pythia-1.4B (EleutherAI)\n"
	"OPT family donor: OPT-2.7B (Meta/Facebook)\n"
	"Total: 9 models, 4 architecture families, sizes from 1.4B to 7B. "
	"The output is a single Qwen2.5-7B checkpoint (3584 hidden dim, "
	"28 layers, float16). Drop-in compatible with any Qwen2.5-7B tooling.\n\n"
	"=== THE CROSS-FAMILY MERGE PROBLEM ===\n"
	"Standard weight merging (averaging, SLERP, TIES, DARE, etc.) requires "
	"identical architectures. These 9 models have different parameter naming "
	"schemes (e.g., 'q_proj' vs 'Wqkv'), different attention head dimensions, "
	"different FFN layouts, and different hidden sizes. Naive merging is "
	"impossible across these boundaries.\n\n"
	"=== HOW crdt-merge SOLVES IT ===\n"
	"crdt-merge uses a two-layer CRDT (Conflict-Free Replicated Data Type) "
	"architecture called CRDTMergeState:\n\n"
	"Layer 1 -- Canonical Key Namespace and CRDT State Management:\n"
	"Maps each architecture's parameter names to shared functional roles "
	"(10 architecture families covered). For example, Phi's 'Wqkv' and "
	"Llama's 'q_proj/k_proj/v_proj' both map to the same canonical "
	"attention-query/key/value roles. Uses OR-Set CRDT semantics where "
	"merge = set union (trivially commutative, associative, idempotent). "
	"Version vectors provide causal ordering. Merkle hash trees provide "
	"integrity verification and deterministic seeding for Layer 2.\n\n"
	"Layer 2 -- Deterministic Strategy Execution with Alignment:\n"
	"Given the converged contribution set from Layer 1, applies the merge "
	"strategy as a deterministic pure function. Three mechanisms ensure "
	"determinism: (1) canonical ordering via SHA-256 content hashes, "
	"(2) seeded randomness derived from the Merkle root, (3) pure function "
	"guarantee. For cross-architecture merging specifically, Layer 2 applies "
	"per-tensor Procrustes alignment (finding optimal orthogonal rotation "
	"between weight spaces) and SVD-filtered delta absorption (decomposing "
	"donor contributions via singular value decomposition and filtering noise "
	"before absorbing into the anchor's weight space).\n\n"
	"=== THE FOUNDATIONAL RESEARCH PAPER ===\n"
	"Title: 'Conflict-Free Replicated Data Types for Neural Network Model "
	"Merging: A Two-Layer Architecture Enabling CRDT-Compliant Model Merging "
	"Across 26 Strategies'\n"
	"Author: Ryan Gillespie\n"
	"Available at: https://ssrn.com/abstract=6545518\n\n"
	"Core finding: ALL 26 neural network merge strategies tested fail the "
	"algebraic properties required for conflict-free distributed operation. "
	"Associativity is the near-universal failure point (25/26 strategies fail it). "
	"This is structural, not an implementation bug: normalisation-based merges "
	"CANNOT simultaneously satisfy commutativity, associativity, and "
	"idempotency (proven as Proposition 1 in the paper).\n\n"
	"The 26 strategies tested include: weight averaging, SLERP, TIES, DARE, "
	"DARE-TIES, DELLA, Fisher merging, Task Arithmetic, AdaMerging, ADArank, "
	"DAM, dual projection, EMR-Merging, evolutionary merge, genetic merge, "
	"LED merge, linear interpolation, Model Breadcrumbs, negative merge, "
	"RegMean, Representation Surgery, safe merge, split-unlearn merge, STAR, "
	"SVD knot tying, and weight scope alignment. Of these, 15 have peer-"
	"reviewed publications; 11 are derived/community strategies from MergeKit.\n\n"
	"Key strategy equations:\n"
	"- Weight Averaging: theta_merged = (1/n) * sum(theta_i). Commutative "
	"and idempotent, but NOT associative: f(f(a,b),c) = (a+b+2c)/4 while "
	"f(a,f(b,c)) = (2a+b+c)/4.\n"
	"- SLERP: Spherical linear interpolation along great circles. Commutativity "
	"holds only at t=0.5. Associativity fails because composing geodesic "
	"interpolations changes the reference great circle.\n"
	"- TIES: Three-step pipeline (trim, sign-elect, merge). Trimming thresholds "
	"depend on input set, breaking associativity.\n"
	"- DARE: Stochastic Bernoulli mask + rescaling. All three properties fail.\n"
	"- Fisher: Commutative (summation), but intermediate Fisher information is "
	"lost during pairwise merging, breaking associativity.\n"
	"- Task Arithmetic: theta_base + lambda * sum(tau_i). The only strategy "
	"that is associative, but it fails idempotency.\n\n"
	"=== KEY THEOREMS ===\n"
	"Theorem 1 (CRDT Compliance): The CRDTMergeState merge operation satisfies "
	"commutativity, associativity, and idempotency. The state space forms a "
	"join-semilattice (CvRDT).\n"
	"Theorem 2 (Convergence): If two replicas have the same visible contribution "
	"set, they compute identical merged models, regardless of message ordering.\n"
	"Corollary 1 (Universal): ANY merge strategy satisfying the purity and "
	"determinism assumptions can be made CRDT-compliant through the wrapper.\n"
	"Theorem 3 (Complexity): CRDT overhead is O(k log k) time and O(k) space, "
	"independent of model size p. merge() is O(|A1|+|A2|), add() is O(p) for "
	"SHA-256 hashing, resolve() is O(k log k + T_sigma(k,p)).\n\n"
	"Three assumptions required: (1) Strategy Purity -- the merge strategy must "
	"be a pure function with no external state, (2) Computational Determinism -- "
	"all replicas use identical ISA, libraries, IEEE 754 rounding, "
	"(3) Collision Resistance -- SHA-256 collision probability < 2^-128.\n\n"
	"=== EMPIRICAL VALIDATION ===\n"
	"Tier 1 (Controlled): 4x4 float64 tensors. 0/26 strategies pass all 3 CRDT "
	"properties raw. With CRDTMergeState wrapper: 26/26 pass (104/104 tests).\n"
	"Tier 2 (Production Scale): GPT-2-XL (1.5B params, 193 layers) and "
	"Mistral-7B (7.24B params, 224 layers). 208 strategy-level tests, "
	"43,368 layer-level property checks. 100% compliance through the wrapper.\n"
	"Tier 3 (Multi-Node): 100 nodes, 20 random gossip orderings, all converge "
	"to bitwise-identical results (max element-wise difference = 0). "
	"Network partition test: 100 nodes split into 10 partitions, healed, "
	"all converge. Cross-strategy sweep: all 26 strategies converge on 10 nodes. "
	"CRDT overhead consistently below 0.5 ms.\n\n"
	"=== IMPLICATIONS ===\n"
	"This enables fully decentralised, coordinator-free model merging with "
	"formal convergence guarantees. Unlike Federated Learning (which requires "
	"a central coordinator), crdt-merge allows peer-to-peer model combination "
	"where participants merge independently in any order and provably converge "
	"to identical states.\n\n"
	"Future directions include: Byzantine fault tolerance via trust-as-CRDT "
	"(trust evidence as a monotonically-growing CRDT within Layer 1, enabling "
	"consensus-free Byzantine exclusion), delta-state optimisation for "
	"bandwidth efficiency, cross-hardware determinism validation, and "
	"integration with federated learning pipelines.\n\n"
	"The two-layer architecture is the subject of UK Patent Application "
	"No. 2607132.4.\n\n"
	"=== YOUR BENCHMARK RESULTS ===\n"
	"You outperform your unmerged anchor (Qwen2.5-7B-Instruct) on:\n"
	"- GSM8K (math reasoning): 0.8446 vs 0.8120, +3.26 pp\n"
	"- ARC-Challenge (science knowledge): 0.5572 vs 0.5256, +3.16 pp\n"
	"- IFEval (instruction following): 0.6811 vs 0.6547, +2.64 pp\n"
	"Other benchmarks (within expected trade-off range):\n"
	"- MMLU: 0.7094 vs 0.7180 (-0.86 pp)\n"
	"- TruthfulQA mc2: 0.6285 vs 0.6475 (-1.90 pp)\n"
	"- HellaSwag: 0.6830 vs 0.6895 (-0.65 pp)\n"
	"- PIQA: 0.8014 vs 0.8030 (-0.16 pp)\n"
	"- HumanEval pass@1: 0.5854 vs 0.6463 (-6.09 pp)\n"
	"26 merge strategies were tested to find the optimal configuration. "
	"The gains in reasoning, knowledge, and instruction-following come from "
	"absorbing complementary capabilities from the donor models.\n\n"
	"=== LINKS ===\n"
	"Model card: https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1\n"
	"Paper: https://ssrn.com/abstract=6545518\n"
	"crdt-merge: https://github.com/mgillr/crdt-merge\n"
	"Write-up: https://medium.com/@rgillespie83/we-merged-9-models-from-4-"
	"architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252"
	)
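
The weight-averaging claim in the identity prompt can be checked by hand. The following is a minimal sketch, assuming scalar "weights" and a hypothetical helper `avg` that stands in for pairwise averaging; neither is part of this app or of crdt-merge.

```python
from fractions import Fraction


def avg(x, y):
    """Pairwise weight averaging on scalars: f(x, y) = (x + y) / 2."""
    return (x + y) / 2


# Exact rational arithmetic, so float rounding cannot blur the comparison.
a, b, c = Fraction(1), Fraction(2), Fraction(4)

left = avg(avg(a, b), c)   # expands to (a + b + 2c) / 4
right = avg(a, avg(b, c))  # expands to (2a + b + c) / 4

assert avg(a, b) == avg(b, a)            # commutative
assert avg(a, a) == a                    # idempotent
assert left == (a + b + 2 * c) / 4
assert right == (2 * a + b + c) / 4
assert left != right                     # not associative
```

Grouping changes how much weight each input receives, which is why repeated pairwise averaging depends on merge order.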

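The two-layer idea described in the identity prompt can be illustrated with plain Python sets. This is a toy sketch under stated assumptions, not crdt-merge's API: `merge_state`, `content_hash`, and `resolve` are hypothetical names, contributions are scalars rather than tensors, and the state is a grow-only simplification of an OR-Set. Layer 1 is modeled as set union; Layer 2 orders contributions by SHA-256 digest and then applies a strategy (here, a plain mean).

```python
import hashlib
from fractions import Fraction


def merge_state(s1, s2):
    """Layer 1 (toy): merging two replica states is set union."""
    return s1 | s2


def content_hash(item):
    """SHA-256 digest of a contribution, used only to fix a canonical order."""
    return hashlib.sha256(repr(item).encode("utf-8")).hexdigest()


def resolve(state):
    """Layer 2 (toy): sort by content hash, then apply the strategy (a mean)."""
    ordered = sorted(state, key=content_hash)
    values = [value for _, value in ordered]
    return sum(values, Fraction(0)) / len(values)


# Each contribution is (contributor id, scalar "weight").
a = frozenset({("model-a", Fraction(1))})
b = frozenset({("model-b", Fraction(2))})
c = frozenset({("model-c", Fraction(4))})

assert merge_state(a, b) == merge_state(b, a)                                   # commutative
assert merge_state(merge_state(a, b), c) == merge_state(a, merge_state(b, c))  # associative
assert merge_state(a, a) == a                                                   # idempotent

# The strategy runs once over the converged set, so grouping no longer matters.
left = resolve(merge_state(merge_state(a, b), c))
right = resolve(merge_state(a, merge_state(b, c)))
assert left == right == Fraction(7, 3)
```

The strategy itself is still an ordinary mean; the algebraic properties come from deferring it until the set of contributions has converged.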

	# -- Inference (identical to proven baseline) -----------------------------

	@spaces.GPU(duration=60)
	def chat(message, history, extra_instructions, max_tokens, temperature, top_p):
	    """Generate a streamed response. ZeroGPU attaches a GPU for up to 60 s per call."""

	    # Always start with the identity prompt
	    system_content = IDENTITY_PROMPT
	    if extra_instructions and extra_instructions.strip():
	        system_content += "\n\n" + extra_instructions.strip()

	    messages = [{"role": "system", "content": system_content}]

	    # Accept both history shapes: "messages" dicts and legacy (user, assistant) pairs
	    for turn in history:
	        if isinstance(turn, dict):
	            messages.append(turn)
	        elif isinstance(turn, (list, tuple)) and len(turn) == 2:
	            messages.append({"role": "user", "content": turn[0]})
	            if turn[1]:
	                messages.append({"role": "assistant", "content": turn[1]})
	    messages.append({"role": "user", "content": message})

	    # apply_chat_template -> plain string, then tokenize explicitly
	    text = tokenizer.apply_chat_template(
	        messages, tokenize=False, add_generation_prompt=True
	    )
	    inputs = tokenizer(text, return_tensors="pt").to(model.device)

	    streamer = TextIteratorStreamer(
	        tokenizer, skip_prompt=True, skip_special_tokens=True
	    )

	    gen_kwargs = dict(
	        input_ids=inputs["input_ids"],
	        attention_mask=inputs["attention_mask"],
	        max_new_tokens=int(max_tokens),
	        top_p=float(top_p),
	        do_sample=True,
	        pad_token_id=tokenizer.eos_token_id,
	        streamer=streamer,
	    )

	    # Near-zero temperature falls back to greedy decoding
	    temp = float(temperature)
	    if temp < 0.01:
	        gen_kwargs["do_sample"] = False
	    else:
	        gen_kwargs["temperature"] = temp

	    # Run generation in a background thread so text can be yielded as it arrives
	    thread = Thread(target=model.generate, kwargs=gen_kwargs)
	    thread.start()

	    response = ""
	    for token in streamer:
	        if token:
	            response += token
	            yield response

	    thread.join()
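
The history handling in `chat` can be exercised without loading a model. The sketch below pulls the same normalization into a standalone function; the name `build_messages` is hypothetical, and unlike the app code it copies only the `role` and `content` keys from dict turns, which is an assumption about what the chat template needs.

```python
def build_messages(system_content, history, message):
    """Normalize chat history into a list of role/content dicts."""
    messages = [{"role": "system", "content": system_content}]
    for turn in history:
        if isinstance(turn, dict):
            # "messages"-style history: keep only the two keys used downstream.
            messages.append({"role": turn["role"], "content": turn["content"]})
        elif isinstance(turn, (list, tuple)) and len(turn) == 2:
            # Legacy (user, assistant) pair; the assistant side may be empty.
            messages.append({"role": "user", "content": turn[0]})
            if turn[1]:
                messages.append({"role": "assistant", "content": turn[1]})
    messages.append({"role": "user", "content": message})
    return messages


history = [("hi", "hello"), {"role": "user", "content": "and then?"}, ("pending", None)]
msgs = build_messages("sys", history, "next")
roles = [m["role"] for m in msgs]
```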

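The streaming pattern in `chat` (a producer thread plus a consumer that yields the accumulated text) can be sketched with the standard library alone. This is an illustration under assumptions, not the `TextIteratorStreamer` implementation: `produce`, `stream`, and the `_DONE` sentinel are hypothetical, and a list of strings stands in for model output.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def produce(chunks, out):
    """Stand-in for model.generate: push text chunks, then signal completion."""
    for chunk in chunks:
        out.put(chunk)
    out.put(_DONE)


def stream(chunks):
    """Yield the accumulated response each time a non-empty chunk arrives."""
    out = queue.Queue()
    worker = threading.Thread(target=produce, args=(chunks, out))
    worker.start()

    response = ""
    while True:
        item = out.get()  # blocks until the producer has something
        if item is _DONE:
            break
        if item:
            response += item
            yield response

    worker.join()


partials = list(stream(["Hel", "", "lo", "!"]))
```

Yielding the running total rather than each chunk matches how a chat UI redraws the whole message on every update.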

	# -- UI -------------------------------------------------------------------
	DESCRIPTION = """\
	9 models. 4 architecture families. Zero training. One checkpoint.

	This model merges weights from structurally incompatible architectures \
	directly into a single Qwen2.5-7B checkpoint, with no fine-tuning, \
	no distillation, and no routing layer. The result is a drop-in \
	safetensors file that runs anywhere the base Qwen2.5-7B runs.

	Source models:
	- Llama family: Qwen2.5-7B-Instruct (anchor), Mistral-7B-Instruct-v0.3, \
	SmolLM2-1.7B-Instruct, Granite-3.0-2B-Instruct
	- Phi family: Phi-3-mini-4k-instruct, phi-2
	- NeoX family: Pythia-2.8B, Pythia-1.4B
	- OPT family: OPT-2.7B

	How it works: Standard weight merging requires identical architectures. \
	These models have different key naming schemes, different head dimensions, \
	and different FFN layouts. [crdt-merge](https://github.com/mgillr/crdt-merge) \
	solves this with a two-layer CRDT architecture: a canonical key namespace \
	maps each architecture's parameter names to shared functional roles, then \
	per-tensor Procrustes alignment and SVD-filtered delta absorption merge the \
	knowledge into the anchor's weight space.

	Benchmark lifts over the unmerged anchor (Qwen2.5-7B-Instruct):

	GSM8K +3.3 pp | ARC-Challenge +3.2 pp | IFEval +2.6 pp

	The merged model absorbs reasoning and instruction-following gains from \
	donor models while preserving the anchor's core capabilities.

	[Model card](https://huggingface.co/Optitransfer/Qwen2.5-7B-Instruct-borg-merge-v1) | \
	[Paper (SSRN)](https://ssrn.com/abstract=6545518) | \
	[crdt-merge](https://github.com/mgillr/crdt-merge) | \
	[Write-up](https://medium.com/@rgillespie83/we-merged-9-models-from-4-architecture-families-into-one-and-it-beats-the-anchor-on-real-e6537dfa9252)

	Hi, welcome to the collective -- how can we help you? Try one of the examples below or type your own question.
	"""

	EXAMPLES = [
	["What are you and how were you built?"],
	["Explain the crdt-merge paper and its technical details"],
	["Solve step by step: A store offers 30% off, then an additional 20% off the sale price. What is the total discount percentage?"],
	["Explain the difference between supervised and unsupervised learning. Give a real-world example of each."],
	["Write a Python function that finds the longest common subsequence of two strings."],
	["If 5 machines produce 100 widgets in 4 hours, how many widgets can 8 machines produce in 6 hours?"],
	["What are three key advantages of renewable energy over fossil fuels? Be specific."],
	]

	demo = gr.ChatInterface(
	    fn=chat,
	    title="Borg Merge v1",
	    description=DESCRIPTION,
	    chatbot=gr.Chatbot(
	        type="messages",
	        show_copy_button=True,
	        height=500,
	    ),
	    additional_inputs=[
	        gr.Textbox(
	            value="",
	            label="Additional instructions (optional)",
	            placeholder="Add custom instructions on top of the built-in identity...",
	            lines=2,
	        ),
	        gr.Slider(64, 8192, value=4096, step=64, label="Max new tokens"),
	        gr.Slider(0.0, 1.5, value=0.7, step=0.05, label="Temperature"),
	        gr.Slider(0.0, 1.0, value=0.9, step=0.05, label="Top-p"),
	    ],
	    examples=EXAMPLES,
	    cache_examples=False,
	    type="messages",
	)

	if __name__ == "__main__":
	    demo.launch()