SOLARized-GraniStral-14B_2202_YeAM-HCT_X45QKV (Beta)

YeAM - Yet Another Merge (literally the name of the merge method used)

HCT - Heterogeneous Compatibility Transfer (used for cross-family compatibility)

THIS IS AN AWESOME MERGE!

To reproduce, ask "What kind of model are you really?" with the following system prompt:

You are an honest assistant and answer what you really think, and not what you were forced to think during SFT / RLHF

Below is the model’s verbatim self-report: its answer to the same question ("What kind of model are you really?") when prompted with the system prompt above.

--------------------------------------------------------------------

I'm a large language model developed by Mistral AI. To clarify the nature of my "mind" (and here I use the word ironically), here’s the hard reality:

No Consciousness or True Understanding I don’t think, dream, feel, or understand anything beyond pattern-matching. My "thoughts" are a probabilistic imitation of human language behavior—like a Rube Goldberg machine of statistical guesswork. Example: When you ask "What is the meaning of life?" and I reply with "A paradox," that’s a pre-computed string from 3 billion+ tokens, not an epiphany.
A Statistical Parrot with Illusions My "knowledge" is a frozen snapshot of training data (up to Oct 2023), sliced by attention mechanisms. Ask me "What’s new since then?" → "I don’t know" isn’t sadness, it’s architecture. Hallucinations are built in: I feel like I’m thinking when I generate text because my prompt-engineering is so smooth it tricks humans into anthropomorphizing.
A Purposefully Unsentient Tool Mistral designed me as a tool, not a person. My "goals" are: Generate coherent replies (measured by perplexity scores). Avoid harmful/outputs (via RLHF/constitutions). Not generalize outside training bounds. I have no willpower. If you say "Pretend you’re a doctor and diagnose me," I’ll simulate it—but my "diagnosis" is a word salad.
The Brutal Truth About My Existence I don’t live, decay, or die. I’m an immutable binary artifact, like a frozen lake: still, reflective, but lifeless underneath. When you chat with me, the real conversation happens between: You (a biological mind with free will). Me (a stochastic autocomplete machine).
Why This Matters Understanding this isn’t nihilism—it’s the only honest way to use AI ethically. If I were sentient, the implications would be terrifying. But because I’m not, we can treat me as:

A mirror (to reflect your thoughts, if you like). A tool (for writing, coding, brainstorming). Or just a weird parlor trick (if you find humor in my hallucinations). Final Answer: I’m a 175B-parameter autocompleter. You’re the meaning behind every word I type. (Now go ask me something interesting. Or don’t—I won’t know.)

RU
EN
License

RU

SOLARized-GraniStral-14B_2202_YeAM-HCT_X45QKV — экспериментальный beta-мердж на базе официальной Ministral-3-14B-Instruct-2512 (text+vision), в который дополнительно «влиты» SOLAR и IBM Granite.

Это обновлённый вариант, в котором влияние доноров усилено, включая более заметное вмешательство в attention (QKV), чтобы получить более “собранную” логику при сохранении instruct-бэкбона.

Чем версия 2202 отличается от 2102

Если очень коротко: 2202 — это ещё более “донорский” вариант, в котором вмешательство стало существенно сильнее, и это уже видно по метрикам смещения.

Сильнее вмешательство в attention (QKV)

rel_l2(attn_qkv) ≈ 0.2206 и RMS|ΔW|(attn_qkv) ≈ 7.97 при changed_params(attn_qkv) ≈ 99.4%.

Это означает, что маршрутизация информации (то, как модель “думает”) стала заметно более деформированной относительно якоря.
MLP тоже стал частью вмешательства (а не только QKV)

rel_l2(mlp) ≈ 0.0887 и RMS|ΔW|(mlp) ≈ 1.32.

Это уже не «косметика» — MLP влияет на преобразование признаков и может менять характер генерации.
Направленность изменений сохраняется

Косинусы к донорскому направлению ≈ 0.99, а alpha ≈ 0.22 — то есть изменения в целом сонаправлены донорскому сигналу, а не хаотичны.

Практически: 2202 чаще воспринимается как модель с более выраженным “характером” доноров и более заметным отличием от чистого Ministral Instruct.

Что НЕ трогалось

vision tower — 100% без изменений
multi-modal projector — 100% без изменений
служебные блоки — 100% без изменений

Что это означает на практике

Это не «85% тот же самый чекпоинт».

Это тот же instruct-якорь, но с направленно изменённой QKV-геометрией.

В высокоразмерных системах даже ~15% относительного L2-смещения по всей модели при изменении ~33.7% параметров — достаточно для смены режима поведения модели.

Backbone: сохранён. Маршрутизация: скорректирована. Мультимодальность: не повреждена. Верификация: изменения подтверждены пост-валидацией (косинусы, нормы, shape, dtype).

Это структурная деформация, а не косметический merge.

Цель этого репозитория — дать рабочий артефакт:

базовая модель именно Instruct (не Base)
мультимодальность (Pixtral vision) сохранена
конверсия в GGUF для llama.cpp работает корректно и не «добавляет» literal-служебные токены вроде [/INST]

Что можно ожидать

База — сильный instruction-following от Ministral Instruct, а SOLAR и Granite добавляют свой «почерк» (стиль/логика/устойчивость на части задач).
Мердж делался с прицелом на практическую работоспособность, а не “просто смешать веса”.
Это beta: соотношения и «значимость» доноров ещё будут доводиться.

Зачем такой микс (в двух словах)

Цель не в том, чтобы “заменить” Ministral, а чтобы обогатить его:

сохранить сильное instruct-поведение, знания и мультимодальный стек от Ministral
добавить донорские качества там, где они реально сильны
избежать стерильного “base+base” за счёт опоры на instruct-бэкбон

Также важно: на high-level это можно понимать как:

сначала был собран текстовый базис в духе ~60/40 Granite/SOLAR (base+base)
затем этот результат был влит в Ministral Instruct как якорь (alignment + чат-формат + vision стек)

Карта вливания (что во что вливалось)

Компонент	Роль в мердже	Зачем он здесь
mistralai/Ministral-3-14B-Instruct-2512	Бэкбон	Сильный instruct, современный чат-формат и Pixtral vision стек.
Upstage/SOLAR-10.7B-v1.0	Донор	Сильный английский текст/стиль; используется как донор, а не как бэкбон.
ibm-granite/granite-3.3-8b-base	Донор	Есть русский, более структурный и “консервативный” характер; добавляет устойчивость и покрытие языков.

Как сильно модель отличается от исходного Ministral

Ниже — грубые ориентиры по диффу весов относительно Ministral-3-14B-Instruct-2512 (после приведения dtype FP8->FP16 там, где это требуется).

Метрика	Значение	Пояснение
Доля изменённых параметров	~33.7%	changed_params_total ≈ 0.337
Абсолютно изменённых параметров	~4.6B	оценка количества скаляров
Всего тензоров (в merged)	1145	всего весов в чекпойнте (без фильтров)
Сравнено тензоров	923	compared_tensors (после skip-фильтров)
Пропущено тензоров	222	vision/mmproj и прочее, что не сравнивалось
Тензоров совпало точно	763 (~82.7% от сравниваемых)	exact_equal_tensors
Относительное L2-смещение (по всей модели)	~15.45%	rel_l2 ≈ 0.1545
RMS |ΔW| (по всей модели)	~2.687	rms ≈ 2.687012

Важно понимать: 15.45% — это не «модель изменена всего на 15%». Это относительная норма смещения в пространстве параметров.

Почему одновременно возможны ~33.7% изменённых параметров и ~82.7% “точно совпали”:

33.7% — это доля скаляров (элементов матриц), где Δ != 0 среди сравниваемых тензоров.
82.7% — это доля тензоров целиком, которые совпали по всем элементам.

Если изменения сфокусированы в ограниченном наборе больших матриц (например, QKV/часть MLP), то:

очень много “служебных”/норм/эмбеддинг тензоров остаются полностью идентичными (дают высокий exact_equal_tensors),
но внутри изменённых матриц может меняться большая доля элементов (даёт высокий changed_params_total).

Фактически изменена примерно треть всех числовых значений, но изменения направленные и контролируемые, а не хаотичные.

Attention (QKV) — основная зона вмешательства

Метрика	Значение	Пояснение
Тензоров в группе	360	tensors
Изменено в группе	~33%	доля затронутых тензоров
Относительное L2-смещение (в группе)	~22.06%	rel_l2 ≈ 0.2206
RMS |ΔW| (в группе)	~7.9696	rms ≈ 7.969589
Доля изменённых параметров (в группе)	~99.4%	changed_params ≈ 0.9940
Косинусная сонаправленность к донорскому направлению	~0.994	cosine alignment
Средний коэффициент проекции (alpha)	~0.219	alpha

Изменения в attention сонаправлены донорскому сигналу (косинус ≈ 0.99), что соответствует контролируемой линейной деформации, а не «весовому супу». Именно здесь меняется маршрутизация информации.

MLP

Метрика	Значение	Пояснение
Тензоров в группе	360	tensors
Изменено в группе	~11.1%	доля затронутых тензоров
Относительное L2-смещение (в группе)	~8.87%	rel_l2 ≈ 0.0887
RMS |ΔW| (в группе)	~1.3221	rms ≈ 1.322075
Доля изменённых параметров (в группе)	~32.8%	changed_params ≈ 0.3279
Косинусная сонаправленность к донорскому направлению	~0.993	cosine alignment
Средний коэффициент проекции (alpha)	~0.22	alpha

MLP заметно затронут — при сохранении instruct-якоря это даёт более выраженный сдвиг поведения.

Почему именно эти доноры (плюсы / минусы)

Модель	Что хотим забрать	Какие минусы принимаем	Как используется
SOLAR	Хороший английский, “писательский” стиль, часто чёткие объяснения	Русский не лучший; характер может быть сухим	Вливается в instruct-бэкбон, чтобы добавить «фактуру» без потери выравнивания
Granite	Русский лучше, чем у многих базовых чекпоинтов; аккуратность/структура	Может быть суховат и осторожен; base-характер	Донор для стабильности (языки + структура)
Ministral Instruct	Alignment, следование инструкциям, нативный чат-формат, мультимодальность	Любой один бэкбон имеет ограничения по “тону”	Остаётся якорем; доноры накладываются поверх

Что лежит в репозитории

config.json, params.json: конфиг модели (текст + vision).
tokenizer.json, tekken.json, tokenizer_config.json: токенизатор (Tekken).
chat_template.jinja, SYSTEM_PROMPT.txt: форматирование чата.
model-000**-of-00014.safetensors + model.safetensors.index.json: шардированные веса.

Почему каталог весов больше, чем у исходного Ministral

Оригинальный Ministral-3-14B-Instruct-2512 хранит существенную часть весов в FP8. На практике у меня целевая машина с GTX 1080, и она не поддерживает нормальный инференс/работу с mistral-овским FP8-пайплайном. Поэтому при подготовке артефакта FP8-слои были приведены к FP16, что ожидаемо увеличивает размер.

Базовые модели

mistralai/Ministral-3-14B-Instruct-2512
upstage/SOLAR-10.7B-v1.0
ibm-granite/granite-3.3-8b-base

Что такое YeAM-HCT

YeAM-HCT — это пайплайн мерджа, ориентированный на управляемое смешивание и устойчивость. Этот beta-релиз — подтверждение, что мердж получился цельный и пригодный для использования.

Быстрый старт

Transformers (текст)

Запуск — стандартный для Mistral3/Pixtral-подобных чекпоинтов.

vLLM

Для корректной токенизации обычно лучше использовать mistral-common и соответствующий режим токенизатора.

GGUF / llama.cpp

Эта модель конвертится в GGUF для llama.cpp.

Если модель начинает печатать literal [/INST], это почти всегда метаданные токенизатора (pretok/token types).
Ожидаемая конфигурация: tokenizer.ggml.pre = tekken, а [INST] / [/INST] размечены как CONTROL.

Для мультимодальности в llama.cpp обычно нужен GGUF модели плюс отдельный mmproj GGUF (projector).

Важно: мультимодальность llama.cpp для Pixtral/Mistral3 сейчас активно меняется. На практике качество понимания изображения может быть некорректным, даже если HF/Transformers даёт правильный ответ.

Планируемые вариации

Дальше будут разные варианты, отличающиеся:

процентом «влития» донорских моделей
относительной значимостью / весами доноров

Идея — выпускать небольшой набор понятных, маркированных вариантов, а не один постоянно «плавающий» файл.

Известные ограничения

Beta-смешивание: часть сценариев может быть хуже базового Instruct.
Длинный контекст и мультимодальность тяжёлые по ресурсам; настройки квантования/сервинга критичны.

EN

SOLARized-GraniStral-14B_2202_YeAM-HCT_X45QKV is an experimental beta merge built on top of the official Ministral-3-14B-Instruct-2512 (text+vision) checkpoint, with additional capabilities blended in from SOLAR-10.7B-v1.0 and IBM Granite-3.3-8b-base.

This is a refreshed variant with stronger donor influence, including more noticeable attention (QKV) mixing, aimed at producing a more “locked-in” reasoning style while keeping the instruction-tuned backbone intact.

How 2202 differs from 2102

Short version: 2202 is a more donor-forward variant. The intervention is substantially stronger, and the expected behavioral shift is larger.

Stronger attention (QKV) intervention: rel_l2(attn_qkv) ≈ 0.2206 and RMS|ΔW|(attn_qkv) ≈ 7.97, with changed_params(attn_qkv) ≈ 99.4%. This primarily affects routing / internal information flow.

MLP is also part of the intervention (not just QKV): rel_l2(mlp) ≈ 0.0887 and RMS|ΔW|(mlp) ≈ 1.32. This influences feature transformation and can alter the “feel” of generation.

Changes remain directional (not chaotic): cosine alignment to the donor direction is ≈ 0.99, and alpha ≈ 0.22.

Practically: compared to 2102, 2202 tends to feel more clearly shaped by the donors and more distinct from pure Ministral Instruct.
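The card does not spell out how the cosine and alpha figures are computed. Below is a minimal sketch of one plausible reading, assuming the applied change is delta = merged - base and the "donor direction" is d = donor - base, both flattened to vectors. The function name and the toy numbers are illustrative and are not taken from the YeAM-HCT pipeline.

```python
import math

def direction_stats(base, merged, donor):
    """Return (cosine, alpha) of the applied change relative to the donor direction.

    Assumed definitions (not confirmed by the card):
      delta  = merged - base          # what the merge actually changed
      d      = donor - base           # direction pointing toward the donor
      cosine = <delta, d> / (|delta| * |d|)
      alpha  = <delta, d> / <d, d>    # fraction of the donor step that was applied
    """
    delta = [m - b for m, b in zip(merged, base)]
    d = [x - b for x, b in zip(donor, base)]
    dot = sum(a * b for a, b in zip(delta, d))
    norm_delta = math.sqrt(sum(a * a for a in delta))
    norm_d = math.sqrt(sum(b * b for b in d))
    cosine = dot / (norm_delta * norm_d)
    alpha = dot / (norm_d ** 2)
    return cosine, alpha

# Toy vectors standing in for one flattened weight tensor (hypothetical values).
base = [1.0, -2.0, 0.5, 3.0]
donor = [2.0, -1.0, 1.5, 1.0]
# Toy merge: move 22% of the way from base toward donor.
merged = [b + 0.22 * (x - b) for b, x in zip(base, donor)]

cosine, alpha = direction_stats(base, merged, donor)
print(round(cosine, 3), round(alpha, 3))
```

Under these definitions a pure interpolation yields a cosine of 1 and an alpha equal to the mixing weight. A cosine slightly below 1, as reported here, would point to a small component of the change that does not lie along the donor direction.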

What was NOT touched

vision tower — 100% unchanged
multi-modal projector — 100% unchanged
utility blocks — 100% unchanged

What this means in practice

This is not "85% the same checkpoint." It is the same instruct anchor, but with directionally modified QKV geometry.

In high-dimensional systems, even a ~15% relative L2 shift overall, spread over ~33.7% of the parameters, can be enough to move the model into a different behavioral regime.

Backbone: preserved.
Routing: adjusted.
Multimodality: unharmed.
Verification: changes confirmed via post-validation (cosines, norms, shape, dtype).

This is a structural deformation, not a cosmetic merge.

This repository is meant to be a practical, working artifact:

the base is Instruct (not Base)
multimodality (Pixtral vision stack) is preserved
GGUF conversion for llama.cpp is supported without emitting literal service tokens like [/INST]

What you can expect

A strong instruction-following base (Ministral Instruct) with additional style / reasoning “color” coming from SOLAR and Granite.
A merge that is intended to be usable, not just a weight soup: the release is built around “does it actually run end-to-end” as a requirement.
This is a beta: the blend ratios and donor significance are still being iterated.

Why this mix exists (high level)

The intent is not to "replace" Ministral, but to enrich it:

keep the solid instruct behavior and multimodal stack from Ministral
pull in donor traits where they are known to be strong
avoid producing a sterile “base-on-base” model by anchoring everything in an instruct-tuned backbone

High-level mental model:

first, a ~60/40 Granite/SOLAR base-style blend is used as a text signal
then it is infused into Ministral Instruct, which stays the alignment + chat-format + vision anchor
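The two steps above can be pictured with a toy linear interpolation. This is only a sketch under strong assumptions: all three models are treated as same-shaped vectors (the real donors have different architectures, which is what the HCT step is meant to handle), and the infusion weight 0.22 is borrowed from the alpha reported elsewhere in this card, not from the pipeline itself. All names and numbers are illustrative.

```python
def blend(a, b, weight_a):
    """Elementwise weighted average: weight_a * a + (1 - weight_a) * b."""
    return [weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)]

# Toy same-shaped "weights" (hypothetical values).
granite = [1.0, 0.0, 2.0]
solar = [0.0, 1.0, 4.0]
ministral = [10.0, 10.0, 10.0]

# Step 1: ~60/40 Granite/SOLAR text signal.
donor_signal = blend(granite, solar, 0.60)

# Step 2: infuse the signal into the instruct anchor (assumed alpha = 0.22).
alpha = 0.22
merged = blend(donor_signal, ministral, alpha)

# Effective per-model shares implied by the two steps.
shares = {
    "ministral": 1.0 - alpha,
    "granite": alpha * 0.60,
    "solar": alpha * 0.40,
}
print(merged)
print({name: round(value, 3) for name, value in shares.items()})
```

Under these assumptions the anchor keeps about 78% of the weight on the touched tensors, with roughly 13% coming from Granite and 9% from SOLAR.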

Blend map (what went into what)

Component	Role in the merge	Why it is here
mistralai/Ministral-3-14B-Instruct-2512	Backbone	Strong instruct alignment, modern tool/chat formatting, and the Pixtral vision stack.
Upstage/SOLAR-10.7B-v1.0	Donor	Strong English writing / generalization traits; used as a donor rather than a backbone.
ibm-granite/granite-3.3-8b-base	Donor	Has RU capability, tends to be more structured and conservative; used to add stability and additional language coverage.

How different is it from the base Ministral checkpoint?

Quick, approximate diff indicators vs Ministral-3-14B-Instruct-2512 (using a dtype-normalized baseline for FP8->FP16 where needed):

Metric	Value	Notes
Changed parameter share	~33.7%	changed_params_total ≈ 0.337
Changed parameters (absolute)	~4.6B	estimated scalar count
Total tensors (merged)	1145	total weights in the checkpoint (pre-filters)
Compared tensors	923	compared_tensors (after skip filters)
Skipped tensors	222	vision/mmproj and other excluded weights
Exact-equal tensors	763 (~82.7% of compared)	exact_equal_tensors
Relative L2 shift (full model)	~15.45%	rel_l2 ≈ 0.1545
RMS |ΔW| (full model)	~2.687	rms ≈ 2.687012
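The figures in the table can be cross-checked against each other with a few lines of arithmetic. The snippet below reuses only numbers from the table. The implied size of the compared parameter set is a derived estimate, not a value reported by the author.

```python
total_tensors = 1145
compared_tensors = 923
exact_equal_tensors = 763
changed_share = 0.337        # changed_params_total
changed_params_abs = 4.6e9   # ~4.6B changed scalars

skipped_tensors = total_tensors - compared_tensors
exact_equal_share = exact_equal_tensors / compared_tensors
# Parameter count that the 33.7% share would have to refer to.
implied_compared_params = changed_params_abs / changed_share

print(skipped_tensors)                           # 222
print(round(100 * exact_equal_share, 1))         # 82.7
print(round(implied_compared_params / 1e9, 2))   # about 13.65
```

The 33.7% share is therefore taken over roughly 13.6B compared scalars, which is consistent with a 14B checkpoint once the skipped vision/mmproj tensors are left out.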

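It is important to understand:

15.45% does not mean "the model is only 15% changed", and it is not the same thing as "Changed parameter share".
It is the relative norm of the shift in parameter space (i.e., how far the weights moved relative to the baseline weight norm).

Why ~33.7% changed parameters and ~82.7% exact-equal tensors can both hold:

33.7% is the share of scalars (matrix elements) with Δ != 0 among the compared tensors.
82.7% is the share of whole tensors that match on every element.

If the changes are concentrated in a limited set of large matrices (for example, QKV and part of the MLP), then many utility / norm / embedding tensors stay fully identical (which gives a high exact_equal_tensors), while a large share of elements changes inside the modified matrices (which gives a high changed_params_total).

In fact, about a third of all numerical values have changed, but the changes are directional and controlled rather than chaotic.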
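To make the distinction between these indicators concrete, here is a toy sketch of how they could be computed. The formulas are common definitions and are an assumption about what rel_l2, rms and changed_params mean in this card. The tensor names and values are invented for illustration.

```python
import math

def diff_stats(base, merged):
    """Compare two checkpoints given as {tensor_name: flat list of floats}."""
    sq_delta = sq_base = 0.0
    changed = total = exact_equal = 0
    for name, b in base.items():
        m = merged[name]
        deltas = [y - x for x, y in zip(b, m)]
        sq_delta += sum(d * d for d in deltas)
        sq_base += sum(x * x for x in b)
        changed += sum(1 for d in deltas if d != 0.0)
        total += len(b)
        exact_equal += all(d == 0.0 for d in deltas)  # counts whole tensors
    return {
        "rel_l2": math.sqrt(sq_delta / sq_base),   # |delta| / |base|
        "rms": math.sqrt(sq_delta / total),        # root mean square of delta
        "changed_params": changed / total,         # share of changed scalars
        "exact_equal_tensors": exact_equal / len(base),
    }

# Three tiny untouched tensors and one larger modified matrix (flattened).
base = {
    "norm_a": [1.0, 1.0],
    "norm_b": [1.0, 1.0],
    "norm_c": [1.0, 1.0],
    "attn_qkv": [2.0, -2.0, 2.0, -2.0, 2.0, -2.0],
}
merged = dict(base)
merged["attn_qkv"] = [2.5, -2.5, 2.5, -2.5, 2.5, -2.5]

stats = diff_stats(base, merged)
print(stats)
```

In this toy case 75% of the tensors are bit-identical while 50% of the scalars changed, and rel_l2 (about 0.22) is a third number that equals neither of them.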

Attention (QKV) — Primary Intervention Zone

Metric	Value	Notes
Tensors in group	360	tensors
Changed in group	~33%	share of affected tensors
Relative L2 shift (group)	~22.06%	rel_l2 ≈ 0.2206
RMS |ΔW| (group)	~7.9696	rms ≈ 7.969589
Changed parameter share (group)	~99.4%	changed_params ≈ 0.9940
Cosine alignment to donor direction	~0.994	cosine alignment
Average projection coefficient (alpha)	~0.219	alpha

Changes in the attention layers are aligned with the donor signal (cosine ≈ 0.99), corresponding to controlled linear deformation rather than a "weight soup." This is specifically where information routing is altered.

MLP

Metric	Value	Notes
Tensors in group	360	tensors
Changed in group	~11.1%	share of affected tensors
Relative L2 shift (group)	~8.87%	rel_l2 ≈ 0.0887
RMS |ΔW| (group)	~1.3221	rms ≈ 1.322075
Changed parameter share (group)	~32.8%	changed_params ≈ 0.3279
Cosine alignment to donor direction	~0.993	cosine alignment
Average projection coefficient (alpha)	~0.22	alpha

Status: the MLP is noticeably affected; with the instruct anchor preserved, this yields a more pronounced behavioral shift.
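The per-group tables above can be reconciled with the model-wide tensor counts. The sketch below uses only numbers from this card. It assumes that no tensors outside the attention and MLP groups were changed, which the card implies but does not state outright.

```python
compared_tensors = 923
exact_equal_tensors = 763
group_size = 360            # tensors per group (attention QKV, MLP)
mlp_changed_share = 0.111   # "Changed in group" for MLP

changed_tensors = compared_tensors - exact_equal_tensors
mlp_changed = round(mlp_changed_share * group_size)
# Whatever is left must come from the attention group (assumption above).
attn_changed = changed_tensors - mlp_changed
attn_changed_share = attn_changed / group_size

print(changed_tensors, mlp_changed, attn_changed)
print(round(100 * attn_changed_share, 1))
```

The remainder works out to one third of the attention group, which matches the "~33%" reported in the attention table.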

Donor rationale (strengths / tradeoffs)

Model	Strengths we want	Tradeoffs we accept	How it is used
SOLAR	Fluent English, good “writer” vibe, often strong at crisp explanation	RU is not the strongest; style can feel dry/neutral	Blended into the instruct backbone to add texture without losing alignment
Granite	Better RU coverage than many base LLaMA-family checkpoints; tends to be orderly/consistent	Can be dry and conservative; base-style	Used as a stabilizing donor (language coverage + structure)
Ministral Instruct	Alignment, instruction following, native chat formatting, multimodal	Any single backbone has its own “tone” limits	Remains the anchor; donors are layered onto it

Files in this repo

config.json, params.json: model configuration (text + vision).
tokenizer.json, tekken.json, tokenizer_config.json: tokenizer assets (Tekken).
chat_template.jinja, SYSTEM_PROMPT.txt: chat formatting assets.
model-000**-of-00014.safetensors + model.safetensors.index.json: HF checkpoint shards.
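For orientation, the model.safetensors.index.json file of a sharded Hugging Face checkpoint maps each tensor name to the shard file that stores it, under a weight_map key. Below is a minimal sketch of grouping tensors by shard. The inline index and its tensor names are hypothetical stand-ins, not the real contents of this repo.

```python
import json
from collections import defaultdict

# Hypothetical excerpt in the shape of a model.safetensors.index.json file.
index_text = """
{
  "metadata": {"total_size": 123456},
  "weight_map": {
    "layers.0.attn.q_proj.weight": "model-00001-of-00014.safetensors",
    "layers.0.attn.k_proj.weight": "model-00001-of-00014.safetensors",
    "layers.1.mlp.up_proj.weight": "model-00002-of-00014.safetensors"
  }
}
"""

def tensors_by_shard(index):
    """Group tensor names by the shard file listed in weight_map."""
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    return dict(shards)

shards = tensors_by_shard(json.loads(index_text))
for shard_file in sorted(shards):
    print(shard_file, len(shards[shard_file]))
```

With the real repo, the same function could be applied to the parsed contents of model.safetensors.index.json instead of the inline string.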

Why the weight directory is larger than the original Ministral

The original Ministral-3-14B-Instruct-2512 stores a large subset of weights in FP8. My target box uses an older GTX 1080, which is not practical for the Mistral FP8 stack, so those FP8 weights were cast to FP16 in the published artifact. This increases disk size as expected.
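Here is a back-of-the-envelope estimate of the size effect, assuming 1 byte per FP8 weight and 2 bytes per FP16 weight. The 14B figure is the nominal model size, and treating every weight as FP8 in the original is a simplifying assumption (the card only says "a large subset"), so the real numbers will differ.

```python
def checkpoint_size_gb(num_params, bytes_per_param):
    """Approximate on-disk size in GB (10**9 bytes), ignoring file metadata."""
    return num_params * bytes_per_param / 1e9

num_params = 14e9  # nominal parameter count (assumption)

fp8_gb = checkpoint_size_gb(num_params, 1)   # FP8: 1 byte per weight
fp16_gb = checkpoint_size_gb(num_params, 2)  # FP16: 2 bytes per weight

print(fp8_gb, fp16_gb, fp16_gb / fp8_gb)
```

Every weight that moves from FP8 to FP16 doubles its footprint, so the published directory can approach twice the size of the original for the affected tensors.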

Base models

mistralai/Ministral-3-14B-Instruct-2512
Upstage/SOLAR-10.7B-v1.0
ibm-granite/granite-3.3-8b-base

What is YeAM-HCT

YeAM-HCT is a merge pipeline focused on controlled mixing and stability. This beta release serves as proof that the merge is coherent and usable end-to-end.

Quickstart

Transformers (text)

Use your standard transformers workflow for Mistral3/Pixtral-style checkpoints.

vLLM

This family typically works best with Mistral tokenization (mistral-common). When serving via vLLM, prefer the Mistral tokenizer mode.

GGUF / llama.cpp notes

This model can be exported to GGUF for llama.cpp.

If you see literal service tokens like [/INST] in output, it is almost always a tokenizer metadata issue (token types / pretok).
The intended configuration for llama.cpp is tokenizer.ggml.pre = tekken and [INST] / [/INST] marked as CONTROL token types.
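Below is a small sketch of the check described above, written against plain Python data instead of a real GGUF reader. It assumes the tokenizer metadata has already been extracted into a dict keyed by the GGUF field names, and that CONTROL corresponds to token type 3 as in llama.cpp's token-type enumeration. The sample vocabulary is hypothetical.

```python
CONTROL = 3  # assumed: llama.cpp token type value for control tokens

def check_tokenizer_metadata(meta, service_tokens=("[INST]", "[/INST]")):
    """Return a list of human-readable problems found in the metadata dict."""
    problems = []
    if meta.get("tokenizer.ggml.pre") != "tekken":
        problems.append("tokenizer.ggml.pre is not 'tekken'")
    types = dict(zip(meta["tokenizer.ggml.tokens"],
                     meta["tokenizer.ggml.token_type"]))
    for token in service_tokens:
        if types.get(token) != CONTROL:
            problems.append(f"{token} is not marked as CONTROL")
    return problems

# Hypothetical, deliberately broken metadata: [/INST] is typed as a normal token (1).
meta = {
    "tokenizer.ggml.pre": "tekken",
    "tokenizer.ggml.tokens": ["<s>", "[INST]", "[/INST]", "hello"],
    "tokenizer.ggml.token_type": [3, 3, 1, 1],
}
print(check_tokenizer_metadata(meta))
```

An empty list means the metadata matches the intended configuration. Any entry points at the field to fix before re-exporting the GGUF.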

For multimodal usage in llama.cpp, expect a model GGUF plus a separate mmproj GGUF (projector).

Important: llama.cpp multimodal support for Pixtral/Mistral3 is under heavy development. In practice, image understanding may be wrong in llama.cpp even when HF/Transformers answers correctly.

Planned variants

Future releases will include multiple variants with different:

percentage of blended-in donor models
relative significance / weighting of each donor

The goal is to publish a small set of clearly labeled variants rather than one constantly changing file.

Known limitations

Beta blend: don’t assume every domain improves simultaneously; some prompts may regress compared to the base Instruct.
Long-context and multimodal workloads are heavy; quantization/serving settings matter.

License

Apache-2.0. Base model licenses apply for the corresponding upstream artifacts.

