Chimera 47B

Klyrone · March 2026 · Apache 2.0


Modular Expert Assembly (MEA) is a zero-compute framework that grafts instruct-tuned MoE experts onto a base model's frozen attention backbone, achieving polymathic synthesis without any backpropagation or fine-tuning.

Chimera 47B is a 46.7B parameter Mixture-of-Experts language model built using Klyrone's MoE assembly framework. It is constructed from Mixtral-8x7B-v0.1 and Mixtral-8x7B-Instruct-v0.1 — combining the base model's knowledge with the instruct model's capabilities — without any additional training. With 8 experts and top-2 routing, only 12.9B parameters are active per token, enabling fast inference at 154 tokens/second on H200 hardware.

A technical paper detailing the methodology is forthcoming.


Key Numbers

Total Parameters 46.7 B
Active / Token 12.9 B
Architecture MoE · 8 experts · top-2 routing
Context Length 32,768 tokens
Generation Speed 154 t/s · H200
Prompt Processing 878 t/s · H200
Quantization Q5_K_M · 5.69 BPW
File Size 30.95 GB GGUF
License Apache 2.0
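As a quick consistency check on the numbers above, total parameter count times bits-per-weight should reproduce the listed GGUF file size. This is an illustrative back-of-the-envelope calculation, not part of any release tooling:

```python
# Sanity check: 46.7B params at Q5_K_M's 5.69 bits per weight.
params = 46.7e9          # total parameter count
bpw = 5.69               # Q5_K_M bits per weight
size_gib = params * bpw / 8 / 2**30
print(f"{size_gib:.2f} GiB")   # ~30.9 GiB, matching the listed 30.95 GB file
```

The listed file size is consistent with the quantization level once the GB figure is read as GiB.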

Capabilities

  • ✅ Instruction following — multi-turn conversational coherence
  • ✅ Code generation — correct, edge-case-aware output
  • ✅ Creative writing — long-form prose and poetry
  • ✅ Factual reasoning — physics, mathematics, general knowledge
  • ✅ Consumer-grade deployment — fits accessible GPU budgets at Q5_K_M

Formal benchmark results (MMLU, HellaSwag, ARC-Challenge, GSM8K) are in progress.


Modular Expert Assembly (MEA) Framework

1. Introduction

The open-source AI community often faces a financial barrier when scaling capabilities. While sparse Mixture-of-Experts (MoE) architectures (e.g., Mixtral 8x7B) have significantly reduced inference costs, training or fine-tuning them remains prohibitively expensive, typically requiring large accelerator clusters (e.g., A100/H100). This technical report introduces an alternative: Modular Expert Assembly (MEA). Because an MoE model isolates domain-specific knowledge in discrete sub-networks governed by a frozen gate/router layer, we hypothesize that these sub-networks can be treated as swappable logic units.

2. The MEA Framework

The MEA methodology enables "brain transplants" between two models that share an identical structural skeleton (layer count, hidden dimensions, expert count).
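Before any swap, the two checkpoints must be verified to share the same skeleton. A minimal sketch of such a check (the function and the shape dictionaries are illustrative, not the official MEA script; the shapes could be collected from each model's safetensors index):

```python
# Verify two checkpoints share an identical structural skeleton.
# `base_shapes` / `donor_shapes` map tensor names to shape tuples.

def check_skeleton(base_shapes: dict, donor_shapes: dict) -> list:
    """Return a list of mismatches; an empty list means the swap is structurally safe."""
    problems = []
    if base_shapes.keys() != donor_shapes.keys():
        problems.append("tensor name sets differ")
    for name in base_shapes.keys() & donor_shapes.keys():
        if base_shapes[name] != donor_shapes[name]:
            problems.append(f"shape mismatch at {name}")
    return problems
```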

2.1 Structural Isolation

The foundational layers of the model—specifically the Multi-Head Attention (MHA), token embeddings, layer normalization, and the router mechanism—are extracted strictly from the Base Model. These layers hold foundational grammar and routing intuition established during extreme-scale pre-training.
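The partition described above can be expressed as a simple routing rule over tensor names. A sketch, assuming Mixtral-style naming (only the routed expert FFNs come from the donor; the gate/router does not contain ".experts." and therefore stays with the base):

```python
# Route each tensor name to its source model.

def source_for(tensor_name: str) -> str:
    if ".block_sparse_moe.experts." in tensor_name:
        return "donor"   # instruct model's expert FFNs
    return "base"        # attention, embeddings, norms, router gate
```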

2.2 Expert Swapping & Interpolation

We target only the routed experts (e.g., .block_sparse_moe.experts.N in Mixtral). An interpolation factor $\alpha \in [0, 1]$ dictates the degree of the swap:

$$W_{\mathrm{MEA}} = (1 - \alpha)\,W_{\mathrm{base}} + \alpha\,W_{\mathrm{donor}}$$

At $\alpha = 1.0$, the donor's specialized experts entirely overwrite the base experts.
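The interpolation rule reduces to a one-line tensor operation per expert weight. A minimal sketch (function name is illustrative; the real script applies this across full checkpoint shards):

```python
import numpy as np

# W_MEA = (1 - alpha) * W_base + alpha * W_donor, per expert weight tensor.

def interpolate_expert(w_base: np.ndarray, w_donor: np.ndarray, alpha: float) -> np.ndarray:
    assert w_base.shape == w_donor.shape, "skeletons must match"
    return (1.0 - alpha) * w_base + alpha * w_donor
```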

2.3 Compute Economics & Hardware Efficiency

To bypass VRAM constraints entirely, the MEA script performs this interpolation with memory-mapped safetensors reads dispatched over a ThreadPool. Memory mapping reduces a 270 GB+ working set to roughly 30 GB of system RAM, and the merge completes on a standard desktop CPU in under 20 minutes, at $0 in GPU compute.
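The low-memory strategy can be sketched as follows. This is a hedged illustration, not the official script: in the real pipeline the loader callables would be memory-mapped safetensors reads and the merged tensors would be written out as shards; here plain in-memory arrays stand in for them.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Interpolate tensors one at a time on a thread pool and hand each result
# off immediately, so peak RAM stays near the largest single tensor rather
# than the full ~270 GB of combined checkpoints.

def merge_tensor(name, load_base, load_donor, alpha):
    wb, wd = load_base(name), load_donor(name)
    return name, (1.0 - alpha) * wb + alpha * wd

def merge_all(names, load_base, load_donor, alpha, out):
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(merge_tensor, n, load_base, load_donor, alpha)
                   for n in names]
        for fut in futures:
            name, merged = fut.result()
            out[name] = merged   # real script: save shard to disk, then free
```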

For enterprise licensing or research collaboration, contact research@klyrone.com

🧪 Zero-Compute Capability Evaluation

Prompt: Design a renewable energy generation system utilizing the temperature differential between the ocean's surface and deep ocean. CRITICAL CONSTRAINT: Must use thermoacoustics (sound waves) to convert this thermal gradient into electricity...

Output Excerpt: "The heat exchanger is connected to a thermoacoustic engine. This engine consists of a resonant cavity filled with a working fluid, such as helium or nitrogen. One end of the cavity is connected to the warm section of the heat exchanger, while the other end is connected to the cold..."

Analysis: The model cleanly bypassed conventional OTEC turbines (which boil ammonia) and successfully grafted niche acoustic physics onto thermodynamic oceanography. It correctly retrieved precise design details (e.g., specifying helium or nitrogen as the working fluid inside a resonant cavity).

Prompt: Write a Python script that calculates the exact Hertz frequencies of a C-Major scale in Equal Temperament. For every musical note, print a Haiku about a layer of the Earth's atmosphere, dynamically containing the exact frequency number in the poem.

Output Excerpt:

frequency_ratio = 2 ** (1 / 12)
# ... mathematically loops 12 times per octave ...
atmospheric_layers = { 0: "Troposphere", 1: "Stratosphere", 2: "Mesosphere" ... }
haiku = f"{frequency:.2f} Hz hums, \n{layer.split()[0]} whispers, \nmelodies of the spheres."

Analysis: While the literal syllable count of the dynamically interpolated float disrupted the strict 5-7-5 constraint (an anticipated tokenizer-level limitation), the model correctly retrieved the 2 ** (1/12) Equal Temperament formula, mapped the Earth's atmospheric layers in their correct scientific order, and fused them into a working Python loop.
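The Equal Temperament relation the model retrieved can be checked independently in a few lines (reference pitch and step pattern are standard music theory, not taken from the model's output):

```python
# In 12-tone Equal Temperament each semitone multiplies frequency by 2**(1/12).
# C4 sits 9 semitones below the A4 = 440 Hz reference.
ratio = 2 ** (1 / 12)
a4 = 440.0
c4 = a4 / ratio**9                            # ~261.63 Hz
c_major_steps = [0, 2, 4, 5, 7, 9, 11, 12]    # W-W-H-W-W-W-H pattern in semitones
scale = [c4 * ratio**s for s in c_major_steps]
print([f"{f:.2f}" for f in scale])
```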

Usage

llama.cpp

./llama-server \
  -m Chimera-47B-Q5_K_M.gguf \
  -ngl 99 \
  --ctx-size 32768 \
  --port 8080

Or for direct CLI inference:

./llama-cli \
  -m Chimera-47B-Q5_K_M.gguf \
  -p "You are a helpful assistant." \
  --ctx-size 32768 \
  -ngl 99 \
  -n 512

llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="klyrone/Chimera",
    filename="Chimera-47B-Q5_K_M.gguf",
    n_gpu_layers=99,
    n_ctx=4096,
    verbose=False
)

output = llm(
    "You are a helpful assistant.\n\nExplain the difference between supervised and unsupervised learning.",
    max_tokens=512,
    stop=["</s>"]
)
print(output["choices"][0]["text"])

Ollama

ollama run hf.co/klyrone/Chimera

Note: This model is distributed as a GGUF file. Native Transformers loading (AutoModelForCausalLM) is not supported directly — use llama.cpp, llama-cpp-python, or Ollama for inference.


Hardware Requirements

Quantization VRAM Required Recommended Hardware
Q5_K_M (this file) ~34 GB A40, A100, 2× 3090/4090
Q4_K_M ~27 GB 3090/4090, A6000
Q3_K_M ~22 GB 24 GB consumer GPU

Limitations

  • Router fine-tuning not yet applied — a short gate re-alignment is expected to yield marginal quality gains
  • No independent safety evaluation conducted — not recommended for unsupervised public-facing deployment
  • Benchmark results pending publication
  • STEM-heavy benchmarks (abstract algebra, high-school math) may underperform relative to general capability, as mathematical knowledge is distributed across attention layers rather than expert FFNs.
  • Pattern Entrenchment (Adversarial Traps): Extensive testing indicates that grafting text-experts onto text-attention layers does not spontaneously generate a deterministic 'World Model'. The model remains highly vulnerable to out-of-distribution math/logic traps (e.g., Anti-Pattern spatial puzzles) where the Base Model's semantic rote-memorization overpowers the logical reasoning of the Instruct Experts.

Citation

@misc{chimera47b2026,
  title        = {Chimera 47B},
  author       = {{Klyrone F.Z.E.}},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/klyrone/Chimera}}
}

Chimera 47B · Klyrone F.Z.E. · Apache 2.0 · A technical paper on the MoE assembly technique is forthcoming.
