---
license: apache-2.0
library_name: transformers
tags:
  - qwen
  - multimodal
  - moe
  - vision-language
  - conversational
  - transformers
  - vllm
  - sglang
  - ktransformers
  - function-calling
  - reasoning
pipeline_tag: image-text-to-text
base_model: Qwen/Qwen3.5-122B-A10B
---

# JetLLMPlus-3.5

**JetLLMPlus-3.5** is a multimodal Mixture-of-Experts model published by **Jetlink**.

It is intended for teams that want to manage deployment, access, and internal distribution from their own namespace while preserving compatibility with the original upstream model ecosystem.

## Model Summary

JetLLMPlus-3.5 is a 122B total / 10B active parameter multimodal MoE model with:

- **122B total parameters, 10B activated per token**
- **Causal Language Model with Vision Encoder**
- **Hybrid architecture: Gated DeltaNet (36 layers) + Full Attention (12 layers) + Sparse MoE**
- **256 routed experts + 1 shared expert per layer**
- **262,144 tokens native context length**
- **Extensible context up to 1,010,000 tokens via YaRN**
- **Support for 201 languages and dialects**
- Compatibility with **Transformers**, **vLLM**, **SGLang**, and **KTransformers**

## Intended Use

This model is suitable for advanced workloads such as:

- multimodal chat assistants
- long-context document and PDF understanding
- OCR, chart comprehension, and document extraction pipelines
- reasoning and step-by-step problem solving
- agentic workflows with function calling
- coding assistants and code generation
- GUI automation and screen understanding
- multilingual enterprise assistants
- research and benchmarking

## Model Details

### Architecture

- **Model type:** Causal Language Model with Vision Encoder
- **Training stage:** Pre-training & Post-training
- **Total parameters:** 122B
- **Activated parameters:** 10B per token
- **Hidden dimension:** 3,072
- **Number of layers:** 48 (36 GatedDeltaNet linear attention + 12 full attention)
- **MoE experts:** 256 routed + 1 shared per layer
- **Activated experts:** 8 routed + 1 shared
- **Expert FFN dimension:** 1,024
- **Vocabulary size:** 248,320
- **Native context length:** 262,144 tokens
- **Extended context capability:** up to 1,010,000 tokens via YaRN
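
The sparsity figures in this list can be checked with a little arithmetic. The sketch below uses only the numbers stated above; the constant and function names are illustrative and not part of any library.

```python
# Numbers taken from the architecture list above.
TOTAL_PARAMS_B = 122   # total parameters, in billions
ACTIVE_PARAMS_B = 10   # parameters activated per token, in billions
ROUTED_EXPERTS = 256   # routed experts per MoE layer
SHARED_EXPERTS = 1     # shared expert per MoE layer (always active)
ACTIVE_ROUTED = 8      # routed experts selected per token


def active_param_fraction() -> float:
    """Fraction of all parameters that take part in one token's forward pass."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


def active_expert_fraction() -> float:
    """Fraction of a layer's experts that run for one token."""
    active = ACTIVE_ROUTED + SHARED_EXPERTS
    total = ROUTED_EXPERTS + SHARED_EXPERTS
    return active / total


print(f"active parameters: {active_param_fraction():.1%}")        # 8.2%
print(f"active experts per layer: {active_expert_fraction():.1%}")  # 3.5%
```

The active-parameter share is larger than the active-expert share, which is consistent with attention, embedding, and other non-expert weights running for every token.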

### Architecture Note: Hybrid Attention (GatedDeltaNet + MoE)

JetLLMPlus-3.5 inherits the hybrid attention design of the Qwen3.5 architecture. Unlike standard transformer MoE models, which use full attention in every layer, it combines:

- **GatedDeltaNet linear attention** (36 out of 48 layers) for efficient long-context processing with sub-quadratic complexity
- **Full global attention** (12 layers) for high-quality token interactions
- **Sparse MoE** routing in feed-forward layers for parameter efficiency

Because 36 of the 48 layers scale linearly with sequence length, this design is intended to give higher throughput and lower latency on long inputs than a pure full-attention model of comparable total parameter count.
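
The 36 + 12 layer split can be laid out as a per-layer list. This card states only the counts, not the order in which the layer types alternate, so the repeating pattern below (three linear-attention layers followed by one full-attention layer) is an assumption made for illustration.

```python
from collections import Counter

NUM_LAYERS = 48
FULL_ATTENTION_EVERY = 4  # ASSUMPTION: every 4th layer uses full attention


def layer_types(num_layers: int = NUM_LAYERS, every: int = FULL_ATTENTION_EVERY) -> list[str]:
    """Return the attention kind of each layer under the assumed 3:1 pattern."""
    return [
        "full_attention" if (i + 1) % every == 0 else "linear_attention"
        for i in range(num_layers)
    ]


layout = layer_types()
counts = Counter(layout)
print(counts["linear_attention"], counts["full_attention"])  # 36 12
```

Whatever the real ordering is, the counts are what matter for cost: only the 12 full-attention layers do work that grows quadratically with sequence length.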

> ⚠️ **Deployment note:** The GatedDeltaNet layers impose additional constraints compared to standard MoE models. When serving with SGLang, `--attention-backend triton` and `--kv-cache-dtype bf16` are required. FP8 KV cache is not recommended because it can corrupt output on this architecture. CUDA graph and prefix caching (radix cache, HiCache) are currently incompatible with the DeltaNet layers, so the SGLang examples below pass `--disable-cuda-graph` and `--disable-radix-cache`.

### Ecosystem Compatibility

- Hugging Face Transformers
- vLLM
- SGLang
- KTransformers

## Hardware Requirements

> JetLLMPlus-3.5 sits between the lightweight 35B-A3B and the flagship 397B-A17B, requiring multi-GPU infrastructure at full precision but manageable on 2–4 datacenter GPUs.

### Reference Hardware

Approximate GPU memory requirements:

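- **Unquantized (BF16):** ~244GB VRAM for weights alone: at least 4× A100 80GB or equivalent (3 × 80GB = 240GB is below the weight footprint)
- **FP8:** ~127GB: 2× A100 80GB or equivalent
- **GPTQ-Int4:** ~79GB: 1× H100 80GB or 2× A100 40GB, with almost no headroom left for KV cache and activations
- **Multi-GPU:** tensor parallelism recommended (`--tensor-parallel-size` in vLLM, `--tp-size` in SGLang; 4 or 8)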

> Note: requirements vary significantly based on context length, KV cache settings, and batch size. FP8 KV cache should be avoided for this model due to DeltaNet architecture constraints — use BF16 KV.
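
The BF16 figure above follows directly from the parameter count. The sketch below estimates the raw weight footprint per format and the smallest number of 80GB GPUs that can hold it. It counts weights only, so treat the GPU counts as lower bounds; the helper names are illustrative.

```python
import math

TOTAL_PARAMS = 122e9  # total parameter count from the model summary

# Bytes per parameter for each weight format (weights only, no scales or overhead).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(fmt: str) -> float:
    """Raw weight footprint in GB (1 GB = 1e9 bytes)."""
    return TOTAL_PARAMS * BYTES_PER_PARAM[fmt] / 1e9


def min_gpus(fmt: str, gpu_gb: float = 80.0) -> int:
    """Smallest GPU count whose combined memory holds the raw weights."""
    return math.ceil(weight_gb(fmt) / gpu_gb)


for fmt in BYTES_PER_PARAM:
    print(f"{fmt}: {weight_gb(fmt):.0f} GB weights, at least {min_gpus(fmt)} x 80 GB GPU(s)")
```

The raw estimates for FP8 (122GB) and Int4 (61GB) come out lower than the ~127GB and ~79GB listed above. A plausible reason is that quantized checkpoints usually keep some tensors in higher precision and store quantization scales, but this card does not break the numbers down.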

### Recommendation

For most production teams:

1. use **FP8 weights + BF16 KV** for the best balance of memory and quality
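2. use **GPTQ-Int4** for memory-constrained deployments; on a single 80GB GPU the ~79GB of weights leaves little room for KV cache, so expect short contexts and small batches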
3. enable **MTP (Multi-Token Prediction)** for the highest throughput gains — this is the primary optimization path for this model's architecture
4. pass `--language-model-only` (vLLM) when vision is not needed, to free memory for KV cache

## Software Requirements

Recommended environment:

- Python 3.10+
- Linux
- CUDA-enabled GPU infrastructure
- One of the following runtimes:
  - Transformers (latest from `main` branch)
  - vLLM
  - SGLang
  - KTransformers

Common dependencies:

- `torch`
- `transformers`
- `torchvision`
- `pillow`
- `accelerate`

## Quickstart

Install Transformers:

    pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"

### Basic text inference

    from transformers import AutoProcessor, AutoModelForImageTextToText
    import torch

    model_id = "Jetlink/JetLLMPlus-3.5"

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

    messages = [
        {"role": "user", "content": [{"type": "text", "text": "Explain the difference between MoE and dense models."}]}
    ]

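    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,  # return a dict (input_ids, attention_mask) so it can be unpacked below
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    print(processor.decode(new_tokens, skip_special_tokens=True))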
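
Since the pipeline tag is `image-text-to-text`, the same call path should also accept image inputs. The sketch below only builds the message list. The helper name `build_image_messages` and the example URL are hypothetical, and the `{"type": "image", "url": ...}` content part follows the general Transformers chat-template convention for multimodal processors rather than anything stated in this card.

```python
def build_image_messages(image_url: str, question: str) -> list[dict]:
    """Build a single-turn chat containing one image part and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


# Hypothetical image location; replace with a real URL.
messages = build_image_messages(
    "https://example.com/invoice.png",
    "Extract the invoice number and total amount.",
)
print(messages[0]["content"][0]["type"])  # image
```

The resulting list can be passed to `processor.apply_chat_template` in place of the text-only `messages`. Pass `return_dict=True` there so the pixel inputs are returned alongside the token IDs.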

### Thinking mode (deep reasoning)

Enable step-by-step reasoning with `enable_thinking=True`:

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=True,
    ).to(model.device)

### Non-thinking mode (direct response)

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=False,
    ).to(model.device)
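
When thinking is enabled, the decoded text contains the reasoning followed by the final answer. The sketch below separates the two. It assumes the reasoning is closed by a `</think>` tag, as in earlier Qwen releases; this card does not confirm the tag, and `split_thinking` is a hypothetical helper.

```python
def split_thinking(text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    ASSUMPTION: reasoning ends with a closing </think> tag. If the tag is
    absent, the whole text is treated as the answer.
    """
    reasoning, sep, answer = text.partition(close_tag)
    if not sep:
        return "", text.strip()
    # Drop the opening tag if the template or the model emitted one.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


sample = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(sample)
print(answer)  # The answer is 4.
```

When serving through vLLM or SGLang with `--reasoning-parser qwen3`, the server performs this separation itself, so a helper like this is only relevant for direct Transformers inference.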

## Serving Examples

### vLLM

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3
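
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `localhost:8000` as in the command above; the function names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches --port 8000 in the serve command
MODEL = "Jetlink/JetLLMPlus-3.5"


def build_chat_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build (but do not send) a chat completions request."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(ask("Summarize what a Mixture-of-Experts model is."))
```

Any OpenAI-compatible client library should work the same way when pointed at the same base URL.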

### vLLM with Tool Use

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder
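
With auto tool choice enabled, requests can carry tool definitions in the OpenAI-compatible `tools` format, and the server returns structured tool calls. The sketch below shows the request payload and how to read tool calls out of a response. The `get_weather` tool and the sample response are hypothetical and written by hand for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI-compatible "tools" format.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_payload(prompt: str) -> dict:
    """Chat completions payload that offers the model one tool."""
    return {
        "model": "Jetlink/JetLLMPlus-3.5",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) for each tool call in a response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


# Hand-written response in the chat completions shape, for illustration only.
fake_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "get_weather",
                            "arguments": "{\"city\": \"Istanbul\"}",
                        },
                    }
                ],
            }
        }
    ]
}
print(extract_tool_calls(fake_response))  # [('get_weather', {'city': 'Istanbul'})]
```

Arguments arrive as a JSON string, so they should be parsed and validated before the tool is actually executed.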

### vLLM with MTP (Multi-Token Prediction)

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3 \
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
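
To get a feel for what `num_speculative_tokens` buys, the sketch below applies the standard speculative-decoding estimate: if each drafted token is accepted independently with probability `p`, one verification step yields `1 + p + p^2 + ... + p^k` tokens on average for `k` drafted tokens. The acceptance rate used here is a made-up illustration, not a measured value for this model.

```python
def expected_tokens_per_step(accept_rate: float, num_speculative_tokens: int) -> float:
    """Expected tokens emitted per verification step.

    Assumes each drafted token is accepted independently with the same
    probability, and that drafting stops at the first rejection.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be between 0 and 1")
    return sum(accept_rate ** i for i in range(num_speculative_tokens + 1))


# Hypothetical 80% acceptance rate with the 2 speculative tokens configured above.
print(f"{expected_tokens_per_step(0.8, 2):.2f}")  # 2.44
```

Real speedups are lower than this ratio because drafting and verification have their own cost, so measure on representative traffic before settling on a value.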

### vLLM text-only mode

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3 \
      --language-model-only

### SGLang

> ⚠️ DeltaNet layers require additional flags. Use `--attention-backend triton` and `--kv-cache-dtype bf16`.

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tp-size 4 \
      --mem-fraction-static 0.80 \
      --context-length 262144 \
      --reasoning-parser qwen3 \
      --attention-backend triton \
      --kv-cache-dtype bf16 \
      --disable-cuda-graph \
      --disable-radix-cache

### SGLang with Tool Use

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tp-size 4 \
      --mem-fraction-static 0.80 \
      --context-length 262144 \
      --reasoning-parser qwen3 \
      --tool-call-parser qwen3_coder \
      --attention-backend triton \
      --kv-cache-dtype bf16 \
      --disable-cuda-graph \
      --disable-radix-cache

### SGLang with Multi-Token Prediction (MTP)

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tp-size 4 \
      --mem-fraction-static 0.80 \
      --context-length 262144 \
      --reasoning-parser qwen3 \
      --attention-backend triton \
      --kv-cache-dtype bf16 \
      --disable-cuda-graph \
      --disable-radix-cache \
      --speculative-algo NEXTN \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4

## Long Context Notes

JetLLMPlus-3.5 natively supports **262,144 tokens**.

For tasks exceeding this window, the upstream documentation recommends YaRN-based long-context scaling, supported in Transformers, vLLM, KTransformers, and SGLang, extending context up to **1,010,000 tokens**.
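
The YaRN scaling factor follows from the ratio between the target and native context lengths. The sketch below computes it and places it in a configuration fragment. The key names mirror the `rope_scaling` convention documented for earlier Qwen releases and are an assumption here; check the upstream model card for the exact configuration this model expects.

```python
import math

NATIVE_CONTEXT = 262_144    # native context length from this card
TARGET_CONTEXT = 1_010_000  # extended context length from this card


def yarn_factor(target: int, native: int = NATIVE_CONTEXT) -> float:
    """Smallest whole-number factor such that native * factor covers target."""
    return float(max(1, math.ceil(target / native)))


def yarn_rope_scaling(target: int) -> dict:
    """Config fragment for YaRN scaling (ASSUMED key names, see lead-in)."""
    return {
        "rope_type": "yarn",
        "factor": yarn_factor(target),
        "original_max_position_embeddings": NATIVE_CONTEXT,
    }


print(yarn_rope_scaling(TARGET_CONTEXT))
```

The raw ratio is about 3.85, which rounds up to 4.0. Inputs that already fit in the native window need no scaling.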

The hybrid GatedDeltaNet + full-attention architecture provides sub-quadratic scaling for long-context inputs on the linear attention layers, making long-context processing more efficient than pure full-attention models of similar scale.
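
One practical consequence is that only the full-attention layers need a per-token KV cache. The sketch below compares the cache size for 12 caching layers with a hypothetical model in which all 48 layers cache keys and values. The `kv_heads` and `head_dim` defaults are placeholder values, not figures from this card; only the layer counts and the context length come from the text above.

```python
def kv_cache_gb(
    tokens: int,
    caching_layers: int,
    kv_heads: int = 2,         # HYPOTHETICAL placeholder, not from this card
    head_dim: int = 256,       # HYPOTHETICAL placeholder, not from this card
    bytes_per_value: int = 2,  # BF16 KV cache
) -> float:
    """KV cache size in GB for one sequence (1 GB = 1e9 bytes)."""
    # Factor 2 = one key tensor plus one value tensor per caching layer.
    per_token_bytes = 2 * caching_layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens / 1e9


CONTEXT = 262_144
hybrid = kv_cache_gb(CONTEXT, caching_layers=12)   # only full-attention layers cache
all_full = kv_cache_gb(CONTEXT, caching_layers=48) # if every layer cached K and V
print(f"hybrid: {hybrid:.2f} GB, all full attention: {all_full:.2f} GB")
```

Whatever the real head configuration is, the ratio between the two cases stays at 4×, because the cache grows in proportion to the number of caching layers.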

## Strengths

- strong knowledge and vision benchmark results among mid-sized open-weight models
- best-in-class document understanding (OCRBench 92.1, OmniDocBench 89.8)
- leading function calling performance in the Qwen3.5 lineup (BFCL-V4 72.2)
- strong GUI and screen automation capabilities (ScreenSpot Pro 70.4)
- highly efficient inference thanks to MoE — only 10B parameters activate per token
- hybrid DeltaNet attention for efficient long-context processing
- 262K native context, extensible to 1M via YaRN
- 201 language support
- Apache 2.0 license

## Limitations

- all 122B parameters must reside in memory, even though only 10B are active per token
- GatedDeltaNet layers impose framework-specific constraints (no FP8 KV, no CUDA graph, no prefix caching in SGLang)
- multi-GPU deployment required for unquantized serving
- long context significantly increases KV cache memory pressure
- multimodal usage adds further overhead
- deployment characteristics vary significantly by framework and configuration

## Out-of-Scope / Cautionary Use

As with other frontier-scale multimodal language models, outputs should be reviewed before use in:

- medical decision-making
- legal advice
- safety-critical automation
- high-stakes financial decisions
- fully autonomous customer actions without guardrails

Human review, policy controls, and tool-level validation are strongly recommended.

## License

This repository follows the same license as the upstream release.

- **License:** Apache-2.0
- See the upstream Qwen repository and included license text for the governing terms.

If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.

## Attribution

Original model and research release by the **Qwen** team.

Upstream model:
- `Qwen/Qwen3.5-122B-A10B`

This repository is an organization-managed copy and is **not the original upstream source**.

## Citation

Please cite the original Qwen release when using this model in research, evaluation, or production documentation.

```bibtex
@misc{qwen3.5,
  title        = {Qwen3.5 Technical Report},
  author       = {Qwen Team},
  year         = {2026},
  publisher    = {Alibaba Cloud},
  howpublished = {\url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}}
}
```

---

# JetLLMPlus-3.5 (Türkçe)

**JetLLMPlus-3.5**, **Jetlink** tarafından yayınlanan multimodal bir Mixture-of-Experts modelidir.

Bu depo; modeli kendi namespace'i altında yönetmek, erişimi kontrol etmek ve dağıtımı kolaylaştırmak isteyen ekipler için hazırlanmıştır.

## Model Özeti

JetLLMPlus-3.5, token başına 10B parametre aktive eden 122B toplam parametreli bir multimodal MoE modelidir:

- **122B toplam parametre, token başına 10B aktif**
- **Vision Encoder içeren Causal Language Model**
- **Hibrit mimari: Gated DeltaNet (36 katman) + Tam Dikkat (12 katman) + Sparse MoE**
- **Katman başına 256 routed expert + 1 shared expert**
- **262.144 token yerel bağlam uzunluğu**
- **YaRN ile 1.010.000 token'a kadar genişletilebilir bağlam**
- **201 dil ve lehçe desteği**
- **Transformers**, **vLLM**, **SGLang** ve **KTransformers** ile uyumluluk

## Kullanım Amacı

Bu model aşağıdaki gelişmiş kullanım senaryoları için uygundur:

- multimodal sohbet asistanları
- uzun bağlamlı doküman ve PDF anlama
- OCR, grafik anlama ve doküman çıkarma pipeline'ları
- adım adım akıl yürütme ve problem çözme
- function calling ile agentic workflow yapıları
- kodlama asistanları ve kod üretimi
- GUI otomasyon ve ekran anlama
- çok dilli kurumsal asistanlar
- araştırma ve benchmark çalışmaları

## Model Detayları

### Mimari

- **Model tipi:** Vision Encoder içeren Causal Language Model
- **Eğitim aşaması:** Pre-training ve Post-training
- **Toplam parametre:** 122B
- **Aktif parametre:** Token başına 10B
- **Hidden dimension:** 3.072
- **Katman sayısı:** 48 (36 GatedDeltaNet lineer dikkat + 12 tam dikkat)
- **MoE expert sayısı:** Katman başına 256 routed + 1 shared
- **Aktif expert:** 8 routed + 1 shared
- **Expert FFN boyutu:** 1.024
- **Vocabulary size:** 248.320
- **Yerel bağlam uzunluğu:** 262.144 token
- **Genişletilmiş bağlam kapasitesi:** YaRN ile 1.010.000 token'a kadar

### Mimari Notu: Hibrit Dikkat (GatedDeltaNet + MoE)

JetLLMPlus-3.5, Qwen3.5 mimarisine özgü yenilikçi bir hibrit dikkat tasarımı kullanır. Standart transformer MoE modellerinden farklı olarak şunları birleştirir:

- **GatedDeltaNet lineer dikkat** (48 katmandan 36'sı): sub-quadratic karmaşıklıkla verimli uzun bağlam işleme
- **Tam global dikkat** (12 katman): yüksek kaliteli token etkileşimleri
- **Sparse MoE** routing: parametre verimliliği için feed-forward katmanlarında

Bu tasarım, benzer toplam parametre sayısına sahip tam-dikkat modellerine kıyasla çok daha düşük gecikmeyle yüksek throughput inference sağlar.

> ⚠️ **Deployment notu:** GatedDeltaNet katmanları, standart MoE modellerine kıyasla ek kısıtlamalar getirir. SGLang ile servis ederken `--attention-backend triton` ve `--kv-cache-dtype bf16` zorunludur. FP8 KV cache bu mimaride output bozulmasına yol açabileceğinden önerilmez. CUDA graph ve HiCache (prefix caching) DeltaNet katmanlarıyla uyumsuzluk nedeniyle devre dışı bırakılmalıdır.

### Ekosistem Uyumluluğu

- Hugging Face Transformers
- vLLM
- SGLang
- KTransformers

## Donanım Gereksinimleri

> JetLLMPlus-3.5, hafif 35B-A3B ile flagship 397B-A17B arasında konumlanmaktadır. Tam hassasiyette çoklu GPU altyapısı gerektirir ancak 2–4 datacenter GPU ile yönetilebilir düzeydedir.

### Referans Donanım

Tahmini GPU bellek gereksinimleri:

- **Quantize edilmemiş (BF16):** ~244GB VRAM: en az 4× A100 80GB veya eşdeğeri (3 × 80GB = 240GB ağırlıklar için yetmez)
- **FP8:** ~127GB — 2× A100 80GB veya eşdeğeri
- **GPTQ-Int4:** ~79GB — 1× H100 80GB veya 2× A100 40GB
- **Çoklu GPU:** tensor parallelism önerilir (vLLM'de `--tensor-parallel-size`, SGLang'da `--tp-size`; 4 veya 8)

> Not: Gereksinimler bağlam uzunluğu, KV cache ayarları ve batch size'a göre önemli ölçüde değişir. Bu model için FP8 KV cache, DeltaNet mimari kısıtlamaları nedeniyle önerilmez — BF16 KV kullanın.

### Öneri

Çoğu production ekip için en mantıklı yaklaşım:

1. en iyi bellek/kalite dengesi için **FP8 ağırlık + BF16 KV** kullanmak
2. tek GPU veya bellek kısıtlı dağıtımlar için **GPTQ-Int4** kullanmak
3. en yüksek throughput kazanımı için **MTP (Multi-Token Prediction)** etkinleştirmek — bu modelin mimarisinde birincil optimizasyon yoludur
4. vision gerekmiyorsa KV cache belleği açmak için `--language-model-only` kullanmak

## Yazılım Gereksinimleri

Önerilen ortam:

- Python 3.10+
- Linux
- CUDA destekli GPU altyapısı
- Şu runtime'lardan biri:
  - Transformers (en son `main` branch)
  - vLLM
  - SGLang
  - KTransformers

Yaygın bağımlılıklar:

- `torch`
- `transformers`
- `torchvision`
- `pillow`
- `accelerate`

## Hızlı Başlangıç

Transformers kurulumu:

    pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"

### Temel metin çıkarımı

    from transformers import AutoProcessor, AutoModelForImageTextToText
    import torch

    model_id = "Jetlink/JetLLMPlus-3.5"

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

    messages = [
        {"role": "user", "content": [{"type": "text", "text": "MoE ve dense modeller arasındaki farkı açıkla."}]}
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    print(processor.decode(output[0], skip_special_tokens=True))

### Thinking modu (derin akıl yürütme)

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=True,
    ).to(model.device)

### Non-thinking modu (doğrudan yanıt)

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        enable_thinking=False,
    ).to(model.device)

## Serving Örnekleri

### vLLM

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3

### vLLM Tool Use ile

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3 \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder

### vLLM MTP ile

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3 \
      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

### vLLM sadece metin modu

    vllm serve Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tensor-parallel-size 4 \
      --max-model-len 262144 \
      --reasoning-parser qwen3 \
      --language-model-only

### SGLang

> ⚠️ DeltaNet katmanları ek flag gerektirmektedir. `--attention-backend triton` ve `--kv-cache-dtype bf16` zorunludur.

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tp-size 4 \
      --mem-fraction-static 0.80 \
      --context-length 262144 \
      --reasoning-parser qwen3 \
      --attention-backend triton \
      --kv-cache-dtype bf16 \
      --disable-cuda-graph \
      --disable-radix-cache

### SGLang Tool Use ile

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tp-size 4 \
      --mem-fraction-static 0.80 \
      --context-length 262144 \
      --reasoning-parser qwen3 \
      --tool-call-parser qwen3_coder \
      --attention-backend triton \
      --kv-cache-dtype bf16 \
      --disable-cuda-graph \
      --disable-radix-cache

### SGLang Multi-Token Prediction (MTP) ile

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMPlus-3.5 \
      --port 8000 \
      --tp-size 4 \
      --mem-fraction-static 0.80 \
      --context-length 262144 \
      --reasoning-parser qwen3 \
      --attention-backend triton \
      --kv-cache-dtype bf16 \
      --disable-cuda-graph \
      --disable-radix-cache \
      --speculative-algo NEXTN \
      --speculative-num-steps 3 \
      --speculative-eagle-topk 1 \
      --speculative-num-draft-tokens 4

## Uzun Bağlam Notları

JetLLMPlus-3.5 yerel olarak **262.144 token** destekler.

Bu pencereyi aşan görevlerde Transformers, vLLM, KTransformers ve SGLang tarafından desteklenen YaRN tabanlı uzun bağlam ölçekleme ile **1.010.000 token'a** kadar genişletilebilir.

Hibrit GatedDeltaNet + tam dikkat mimarisi, lineer dikkat katmanlarında uzun bağlam girdileri için sub-quadratic ölçekleme sağlayarak benzer ölçekteki saf tam dikkat modellerine kıyasla uzun bağlam işlemeyi daha verimli hale getirir.

## Güçlü Yönler

- açık ağırlıklı orta kademe sınıfında çok güçlü bilgi ve vision benchmark'ları
- en iyi sınıf doküman anlama (OCRBench 92.1, OmniDocBench 89.8)
- Qwen3.5 serisinde öncü function calling performansı (BFCL-V4 72.2)
- güçlü GUI ve ekran otomasyon yetenekleri (ScreenSpot Pro 70.4)
- MoE sayesinde yüksek verimli inference — token başına yalnızca 10B parametre aktive edilir
- verimli uzun bağlam işleme için hibrit DeltaNet dikkat
- YaRN ile 262K yerel bağlam, 1M'a genişletilebilir
- 201 dil desteği
- Apache 2.0 lisansı

## Sınırlamalar

- aktif parametrelerden bağımsız olarak tam ağırlık matrisi (122B) bellekte tutulmalıdır
- GatedDeltaNet katmanları framework'e özgü kısıtlamalar getirir (FP8 KV yok, CUDA graph yok, SGLang'da prefix caching yok)
- quantize edilmemiş serving için çoklu GPU dağıtımı gereklidir
- uzun bağlam KV cache bellek baskısını ciddi ölçüde artırır
- multimodal kullanım ek yük getirir
- deployment karakteristiği framework ve konfigürasyona göre önemli ölçüde değişir

## Kapsam Dışı / Dikkat Gerektiren Kullanımlar

Diğer frontier-scale multimodal language model'lerde olduğu gibi, model çıktıları şu alanlarda insan denetimi olmadan kullanılmamalıdır:

- tıbbi karar verme
- hukuki tavsiye
- güvenlik kritik otomasyon
- yüksek riskli finansal kararlar
- korumasız tam otonom müşteri aksiyonları

İnsan incelemesi, politika kontrolleri ve tool seviyesinde doğrulama güçlü şekilde önerilir.

## Lisans

Bu depo, upstream sürümle aynı lisansı takip eder.

- **Lisans:** Apache-2.0
- Geçerli şartlar için upstream Qwen deposu ve lisans metni incelenmelidir.

Modeli yeniden dağıtıyor, fine-tune ediyor, quantize ediyor veya başka şekilde değiştiriyorsan; kullanımının upstream lisans ve attribution gereklilikleriyle uyumlu olduğundan emin olmalısın.

## Atıf

Orijinal model ve araştırma yayını **Qwen** ekibine aittir.

Upstream model:
- `Qwen/Qwen3.5-122B-A10B`

Bu depo, kurum tarafından yönetilen bir kopyadır ve **orijinal upstream kaynak değildir**.

## Atıf / Citation

Bu modeli araştırma, değerlendirme veya production dokümantasyonunda kullanıyorsan, lütfen orijinal Qwen sürümüne atıf yap.

```bibtex
@misc{qwen3.5,
  title        = {Qwen3.5 Technical Report},
  author       = {Qwen Team},
  year         = {2026},
  publisher    = {Alibaba Cloud},
  howpublished = {\url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}}
}
```