This is an experimental model. The goal is to retain the vision model's visual capabilities while bringing its text performance in line with that of pure text models.

Model Highlights:

Merge method: ASDF
Highest precision: dtype: float32 + out_dtype: bfloat16
Context length: 262,144

Parameter Settings:

Temperature=0.7, TopP=0.8, TopK=20, MinP=0.
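
As a rough sketch, these settings can be collected into keyword arguments for a sampling call. The key names follow the Hugging Face `transformers` generation parameters (`temperature`, `top_p`, `top_k`, `min_p`). The variable name `SAMPLING_KWARGS` is hypothetical, and `do_sample=True` is an assumption, since this card lists only the four values.

```python
# Recommended sampling settings from this card, expressed as keyword arguments.
# `do_sample=True` is an assumption; the card itself lists only the four values.
SAMPLING_KWARGS = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}
```

With a loaded model, these could be passed as `model.generate(**inputs, **SAMPLING_KWARGS)`.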

Why Can Models Be Merged:

The text portion of the vision model has the same tensor layout (names and shapes) as the pure text model; only the weight values differ.
In terms of tensor naming, the only difference between the vision model and the pure text model is the addition of ".language_model".
Therefore, once this part is uniformly removed from the vision model's keys, the text-related tensors of the two models can be matched and merged directly.
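
The key matching described above might look like the following minimal sketch. The function names `to_text_key` and `match_keys` are hypothetical, and the example key in the comment is illustrative rather than taken from the actual checkpoints.

```python
def to_text_key(vision_key: str, marker: str = "language_model.") -> str:
    """Drop the first occurrence of `marker` so the key matches the text model."""
    return vision_key.replace(marker, "", 1)


def match_keys(vision_keys, text_keys):
    """Map each vision-model key to its text-model counterpart, if one exists."""
    text_set = set(text_keys)
    pairs = {}
    for k_v in vision_keys:
        k_t = to_text_key(k_v)
        if k_t in text_set:
            pairs[k_v] = k_t
    return pairs


# Illustrative (assumed) key:
# "model.language_model.layers.0.mlp.up_proj.weight"
#   -> "model.layers.0.mlp.up_proj.weight"
```

Keys with no counterpart in the text model (for example, vision-encoder tensors) are simply left out of the mapping and would be copied from the vision model unchanged.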

How Exactly Are Models Merged:

Input

Given two weight tensors from models with identical architecture (Text and Vision branches):
$T^{\text{text}} \in \mathbb{R}^{d_1 \times \cdots \times d_n}, \quad T^{\text{vision}} \in \mathbb{R}^{d_1 \times \cdots \times d_n}$
For each vision tensor key k_v, remove the "language_model." segment to obtain the corresponding text model key for matching.

Step 1: Special Layer Filtering

Skip merging for embedding and language modeling head layers:

If the tensor name contains "embed" or "lm_head", return T^vision directly.
Proceed only if both tensors have the same shape.
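
A minimal sketch of this filter is shown below. The name `should_merge` is hypothetical. The card does not say what happens on a shape mismatch; this sketch assumes the caller keeps the vision tensor in that case, just as it does for skipped layers.

```python
SKIP_SUBSTRINGS = ("embed", "lm_head")


def should_merge(name: str, text_shape, vision_shape) -> bool:
    """Return True only if the tensor should go through the SVD-based merge."""
    # Embedding and LM-head tensors are taken from the vision model as-is.
    if any(s in name for s in SKIP_SUBSTRINGS):
        return False
    # Assumption: on a shape mismatch the vision tensor is also kept unchanged.
    return tuple(text_shape) == tuple(vision_shape)
```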

Step 2: Type Conversion and Delta Computation

Convert to float32 for numerical stability and compute the difference tensor:
$W^{\text{text}} = T^{\text{text}}.\text{float}(), \quad W^{\text{vision}} = T^{\text{vision}}.\text{float}()$
$\Delta = W^{\text{vision}} - W^{\text{text}}$

Step 3: Early Exit for One-Dimensional Tensors

If Delta has fewer than two dimensions (i.e., it is a vector such as a bias or LayerNorm parameter), return T^vision directly.
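
Steps 2 and 3 can be sketched as follows. This uses NumPy for illustration, not the author's implementation. NumPy has no bfloat16 type, so float16 inputs stand in for the half-precision weights. The function names `compute_delta` and `needs_svd` are hypothetical.

```python
import numpy as np


def compute_delta(t_text, t_vision):
    """Upcast both tensors to float32 and return (W_text, Delta)."""
    w_text = np.asarray(t_text, dtype=np.float32)
    w_vision = np.asarray(t_vision, dtype=np.float32)
    delta = w_vision - w_text
    return w_text, delta


def needs_svd(delta) -> bool:
    """Vectors (bias, LayerNorm parameters) skip the SVD path entirely."""
    return delta.ndim >= 2
```

When `needs_svd` returns `False`, the vision tensor is returned unchanged, so no low-rank filtering is applied to one-dimensional parameters.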

Step 4: SVD Decomposition of Delta

Perform a thin SVD on the difference tensor, treated as an m × n matrix:
$\Delta = U \Sigma V^\top, \quad U \in \mathbb{R}^{m \times r},\ \Sigma \in \mathbb{R}^{r \times r},\ V \in \mathbb{R}^{n \times r}$
where r = min(m, n), and Σ = diag(σ₁, …, σᵣ) with σ₁ ≥ ⋯ ≥ σᵣ ≥ 0.
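
For illustration, a thin SVD with these shapes can be obtained from `numpy.linalg.svd` with `full_matrices=False`. It returns the singular values in descending order and returns V already transposed. The wrapper name `thin_svd` is hypothetical, and the sketch assumes a two-dimensional input.

```python
import numpy as np


def thin_svd(delta):
    """Return U (m x r), sigma (r,), and V^T (r x n) with r = min(m, n)."""
    u, sigma, vt = np.linalg.svd(delta, full_matrices=False)
    return u, sigma, vt


# Example on a random 6 x 4 matrix: r = 4.
rng = np.random.default_rng(0)
example = rng.standard_normal((6, 4)).astype(np.float32)
u, sigma, vt = thin_svd(example)
```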

Step 5: Automatic Rank Selection via Knee Point Detection

5.1 Normalize singular values and indices

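Let s = (σ₁, …, σᵣ). Normalize both the indices and the singular values to the unit square, where ε is a small constant that prevents division by zero:
$x_i = \frac{i - 1}{r - 1}, \quad y_i = \frac{\sigma_i - \sigma_r}{\sigma_1 - \sigma_r + \varepsilon}, \quad i = 1,\dots,r$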

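5.2 Compute perpendicular distance to the line from the first to the last point

The line from (0, y_1) to (1, y_r) has direction vector (1, y_r - y_1).
For each point (x_i, y_i), compute the cross-product magnitude, which is proportional to the perpendicular distance (the constant line length is omitted because it does not change the arg max):
$d_i = \left| (x_i)(y_r - y_1) - (y_i - y_1)(1) \right|$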

5.3 Select knee index

$k = \max\left(1,\ \arg\max_i d_i\right)$
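
Steps 5.1 to 5.3 can be sketched in NumPy as shown below. This is not the author's implementation, and it rests on several assumptions: the arg max is taken over 0-based indices (which is what makes the `max(1, ·)` guard meaningful; with 1-based indices the result would be one larger), the value of `eps` is arbitrary, and the guard for `r < 2` is added here to avoid dividing by zero. The name `knee_rank` is hypothetical.

```python
import numpy as np


def knee_rank(sigma, eps=1e-12):
    """Pick the number of singular values to keep via knee-point detection."""
    sigma = np.asarray(sigma, dtype=np.float64)
    r = sigma.size
    if r < 2:  # added guard: a single singular value has no knee
        return 1
    # 5.1: normalize indices and singular values to [0, 1].
    x = np.arange(r) / (r - 1)
    y = (sigma - sigma[-1]) / (sigma[0] - sigma[-1] + eps)
    # 5.2: cross-product magnitude relative to the first-to-last line.
    d = np.abs(x * (y[-1] - y[0]) - (y - y[0]))
    # 5.3: knee index (0-based), keeping at least one component.
    return max(1, int(np.argmax(d)))
```

For the spectrum `[10, 9, 1, 0.5, 0.1]` this returns 2, keeping the two dominant singular values. A flat spectrum gives all-zero distances and falls back to 1.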

Step 6: Low-Rank Reconstruction of Delta

Reconstruct Delta using only the top-k components:
$\Delta_{\text{clean}} = U[:, :k] \cdot \operatorname{diag}(\sigma_1, \dots, \sigma_k) \cdot V^\top[:k, :]$
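
A minimal NumPy sketch of this truncation is given below. The function name `truncate_delta` is hypothetical, and the inputs are assumed to come from a thin SVD that returns V already transposed. Scaling the columns of U by the singular values avoids building the diagonal matrix explicitly.

```python
import numpy as np


def truncate_delta(u, sigma, vt, k):
    """Rebuild Delta from its top-k singular triplets."""
    # (u[:, :k] * sigma[:k]) equals u[:, :k] @ diag(sigma[:k]).
    return (u[:, :k] * sigma[:k]) @ vt[:k, :]


# A rank-1 matrix is recovered exactly (up to rounding) with k = 1.
rank_one = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])
u, sigma, vt = np.linalg.svd(rank_one, full_matrices=False)
rebuilt = truncate_delta(u, sigma, vt, k=1)
```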

Step 7: Fuse into Final Tensor

Add the cleaned delta to the text base:
$W^{\text{merged}} = W^{\text{text}} + \Delta_{\text{clean}}$
Cast back to original dtype (e.g., bfloat16):
$\hat{T} = W^{\text{merged}}.\text{to}(T^{\text{text}}.\text{dtype})$
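
The final step can be sketched as follows, again in NumPy with float16 standing in for bfloat16 (an assumption, since NumPy has no bfloat16 type). The function name `fuse` is hypothetical.

```python
import numpy as np


def fuse(t_text, delta_clean):
    """Add the cleaned delta to the text weights and cast to the text dtype."""
    t_text = np.asarray(t_text)
    w_text = t_text.astype(np.float32)
    w_merged = w_text + np.asarray(delta_clean, dtype=np.float32)
    return w_merged.astype(t_text.dtype)
```

Two limiting cases help when reasoning about the result. A zero delta returns the text weights unchanged. A full-rank delta (k = r) reproduces the vision weights up to rounding. The selected k therefore controls how far each tensor moves from the text model toward the vision model.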

Closing Notes:

This merging algorithm rests on one assumption: the model's visual capability is concentrated mainly in the few largest singular values of the residual (the difference between the vision and text weights).
Note that we have not yet systematically evaluated the model's visual capabilities; this release is intended only to demonstrate the feasibility of the merging technique.
We also call for further research into merging methods that unify vision and text models, in order to find algorithms that are truly suited to the task.

Format: Safetensors
Model size: 4B params
Tensor type: BF16

Model tree for YOYO-AI/Qwen3-VL-4B-YOYO-Instruct:

This model is a merge of Qwen/Qwen3-4B-Instruct-2507 and Qwen/Qwen3-VL-4B-Instruct.
