---
license: apache-2.0
library_name: transformers
tags:
  - gemma
  - gemma4
  - multimodal
  - vision-language
  - conversational
  - transformers
  - vllm
  - sglang
  - function-calling
  - reasoning
pipeline_tag: image-text-to-text
base_model: google/gemma-4-31b-it
---

# JetLLMLite-4.0

**JetLLMLite-4.0** is a multimodal instruction-tuned model published by **Jetlink**.

It is intended for teams that want to manage deployment, access, and internal distribution from their own namespace while preserving compatibility with the original upstream model ecosystem.

## Model Summary

JetLLMLite-4.0 is a 31B dense multimodal model with:

- **31B total parameters (dense architecture)**
- **Instruction-tuned (IT) variant**
- **262,144-token (256K) context length**
- **Multimodal: text + image input, text output**
- **Video understanding support (up to 60 seconds at 1 fps)**
- **Built-in reasoning / thinking mode**
- **Native function calling support**
- **Support for 140+ languages**
- Compatibility with **Transformers**, **vLLM**, **SGLang**, **llama.cpp**, **MLX**, **Ollama**

## Intended Use

This model is suitable for advanced workloads such as:

- multimodal chat assistants
- long-context document and PDF understanding
- reasoning and step-by-step problem solving
- agentic workflows with function calling
- coding assistants and code generation
- image, chart, and OCR tasks
- multilingual enterprise assistants
- research and benchmarking

## Model Details

### Architecture

- **Model type:** Dense Causal Language Model with Vision Encoder
- **Training stage:** Pre-training & Post-training (Instruction-tuned)
- **Total parameters:** 31B
- **Architecture style:** Dense (not MoE)
- **Attention mechanism:** Hybrid — alternating local sliding-window (1024 tokens) and global full-context attention
- **RoPE:** Dual config — standard RoPE for sliding layers, Proportional RoPE (p-RoPE) for global layers
- **Per-Layer Embeddings (PLE):** Yes
- **Shared KV Cache:** Yes (last N layers reuse KV states from earlier layers)
- **Native context length:** 262,144 tokens (256K)
- **Vision encoder:** Variable aspect ratio; configurable token budgets (70 / 140 / 280 / 560 / 1120 tokens)
- **Thinking mode:** Configurable via `<|think|>` token in system prompt
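
As a rough illustration of how the vision token budgets interact with the context window, the sketch below estimates the visual token cost of a clip sampled at 1 fps. It assumes every sampled frame consumes exactly the configured budget, which is a simplification and not a statement about the actual preprocessing; the helper name `estimate_visual_tokens` is hypothetical and not part of any library.

```python
# Budgets listed in this card; treating them as a fixed per-frame cost is an assumption.
VISION_TOKEN_BUDGETS = (70, 140, 280, 560, 1120)


def estimate_visual_tokens(seconds: int, budget: int, fps: int = 1) -> int:
    """Estimate visual tokens for a clip, assuming `budget` tokens per sampled frame."""
    if budget not in VISION_TOKEN_BUDGETS:
        raise ValueError(f"unsupported token budget: {budget}")
    if seconds < 0 or fps <= 0:
        raise ValueError("seconds must be >= 0 and fps must be > 0")
    return seconds * fps * budget


# Compare the cost of a 60-second clip across all budgets.
for budget in VISION_TOKEN_BUDGETS:
    print(budget, estimate_visual_tokens(60, budget))
```

Under this assumption, a 60-second clip at the 560-token budget (33,600 tokens) would already exceed a 32,768-token `--max-model-len`, so the budget and the serving context length are worth choosing together.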

### Ecosystem Compatibility

- Hugging Face Transformers
- vLLM
- SGLang
- llama.cpp
- MLX
- Ollama
- mistral.rs
- LM Studio

## Hardware Requirements

> JetLLMLite-4.0 can run unquantized (bfloat16) on a **single 80GB GPU**, which makes it considerably more accessible than large MoE or 100B+ scale models.

### Reference Hardware

Approximate GPU memory requirements by precision:

- **Unquantized (bfloat16):** fits on a single NVIDIA H100 (80GB) or H200 (141GB) GPU
- **4-bit quantized:** runs on consumer GPUs with 24GB+ VRAM (e.g. RTX 3090, RTX 4090)
- **Multi-GPU:** tensor parallelism supported via vLLM and SGLang for higher throughput

> Note: requirements vary based on context length, batch size, and KV cache settings. The above are practical reference points, not universal minimums.
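
The reference points above follow from simple arithmetic on the parameter count. The sketch below computes weight storage only, taking the 31B figure from this card; it ignores KV cache, activations, and framework overhead, and the helper name `weight_memory_gb` is illustrative.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 31e9  # 31B parameters, as stated in this card

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # bfloat16 stores 16 bits per parameter
int4_gb = weight_memory_gb(N_PARAMS, 4)   # idealized 4-bit, without scale/zero-point overhead

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{int4_gb:.1f} GB")
print(f"headroom on an 80 GB GPU in bfloat16: ~{80 - bf16_gb:.1f} GB")
```

The bfloat16 weights alone take roughly 62 GB, so an 80GB GPU leaves under 20 GB for KV cache and runtime overhead. That is why context length and batch size matter so much on a single card. Real 4-bit formats store extra quantization metadata, so their footprint is somewhat larger than the idealized figure.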

### Practical Guidance

#### Single GPU deployment

Unlike large-scale MoE models, JetLLMLite-4.0 can be served unquantized (bfloat16) from a **single 80GB datacenter GPU**, which makes it a good fit for single-node or cost-conscious deployments.

For consumer-grade hardware, quantized variants (GGUF, GPTQ, AWQ) significantly reduce memory requirements, typically at the cost of a small quality loss that depends on the quantization method.

#### Text-only deployment

Use the `--language-model-only` flag in vLLM to skip vision encoder profiling and free additional KV cache memory when your workload is purely text-based.

### Recommendation

For most production teams:

1. start with **vLLM** or **SGLang** for serving
2. use a **single H100/H200** for unquantized bfloat16 deployment
3. use **4-bit quantization** for consumer GPU or cost-optimized deployments
4. disable vision if not needed via `--language-model-only`

## Software Requirements

Recommended environment:

- Python 3.10+
- Linux
- CUDA-enabled GPU infrastructure
- One of the following runtimes:
  - Transformers (`>= 4.51.0` required for Gemma 4)
  - vLLM
  - SGLang
  - llama.cpp

Common dependencies:

- `torch`
- `transformers >= 4.51.0`
- `torchvision`
- `pillow`
- `accelerate`

## Quickstart

Install Transformers:

    pip install "transformers>=4.51.0"

### Basic text inference

    from transformers import pipeline
    import torch

    pipe = pipeline(
        "image-text-to-text",
        model="Jetlink/JetLLMLite-4.0",
        device="cuda",
        torch_dtype=torch.bfloat16
    )

    messages = [
        {"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]}
    ]

    # Pass the chat via `text=` so it is not interpreted as the `images` argument.
    output = pipe(text=messages, max_new_tokens=200)
    print(output[0]["generated_text"][-1]["content"])

### Multimodal inference (image + text)

    from transformers import AutoProcessor, AutoModelForImageTextToText
    import torch
    from PIL import Image

    model_id = "Jetlink/JetLLMLite-4.0"

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    image = Image.open("image.jpg")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Describe this image in detail."}
            ]
        }
    ]

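    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,  # return a dict of tensors so it can be unpacked into generate()
        return_tensors="pt"
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    print(processor.decode(new_tokens, skip_special_tokens=True))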

### Reasoning / Thinking mode

Enable thinking mode by adding `<|think|>` to the system prompt:

    messages = [
        {"role": "system", "content": "<|think|>"},
        {"role": "user", "content": [{"type": "text", "text": "Solve: If x² + 5x + 6 = 0, what are the values of x?"}]}
    ]

## Serving Examples

### vLLM

    vllm serve Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --dtype bfloat16
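
Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The following is a minimal sketch using only the standard library; it assumes the server from the command above is reachable at `localhost:8000`, and the helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: server from the command above, same host
MODEL = "Jetlink/JetLLMLite-4.0"


def build_chat_request(prompt: str, max_tokens: int = 200) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` should print the model's reply. Any OpenAI-compatible client library can be pointed at the same base URL instead.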

### vLLM with Tool Use

    vllm serve Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --dtype bfloat16 \
      --enable-auto-tool-choice \
      --tool-call-parser gemma
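
With tool calling enabled, requests can carry tool definitions in the OpenAI-style `tools` schema, and parsed calls come back in the `tool_calls` field of the response message. The sketch below builds such a request body and extracts calls from a response. The `get_weather` tool and both helper names are hypothetical, chosen purely for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI-style "tools" schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def build_tool_request_body(prompt: str) -> dict:
    """Build a chat completions body that offers the tool to the model."""
    return {
        "model": "Jetlink/JetLLMLite-4.0",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (function name, parsed arguments) pairs from a response body."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The request body can be POSTed to `/v1/chat/completions` on the server. Each extracted call should be validated against the tool's schema before being executed, since the arguments are model-generated.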

### vLLM text-only mode

    vllm serve Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --dtype bfloat16 \
      --language-model-only

### SGLang

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tp-size 1 \
      --mem-fraction-static 0.85 \
      --context-length 32768 \
      --dtype bfloat16

### SGLang with Tool Use

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tp-size 1 \
      --mem-fraction-static 0.85 \
      --context-length 32768 \
      --dtype bfloat16 \
      --tool-call-parser gemma4

### llama.cpp

    llama-server \
      -m JetLLMLite-4.0.Q4_K_M.gguf \
      --port 8080 \
      -ngl 99 \
      -c 8192

## Long Context Notes

JetLLMLite-4.0 natively supports **262,144 tokens** (256K) of context.

The hybrid attention mechanism (alternating sliding-window and global attention), combined with Proportional RoPE (p-RoPE), is designed to keep long-context processing efficient. For most practical deployments, set `--max-model-len` to a lower value (e.g. 32768) to limit KV cache memory pressure.
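
To see why a lower `--max-model-len` helps, the sketch below estimates per-sequence KV cache size for a hybrid layout where global layers cache the full context and sliding-window layers cache at most the 1024-token window. The layer counts, KV head count, and head dimension used here are placeholder values, not the real configuration of this model, and the sketch ignores shared-KV layers and any framework-specific allocation strategy.

```python
def kv_cache_gib(
    context_len: int,
    n_global_layers: int,
    n_sliding_layers: int,
    n_kv_heads: int,
    head_dim: int,
    window: int = 1024,        # sliding-window size stated in this card
    bytes_per_value: int = 2,  # bfloat16
) -> float:
    """Estimate KV cache size in GiB for one sequence under the stated assumptions."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value  # keys + values
    global_tokens = n_global_layers * context_len
    sliding_tokens = n_sliding_layers * min(context_len, window)
    return per_token_per_layer * (global_tokens + sliding_tokens) / 2**30


# PLACEHOLDER architecture values for illustration only; read the real ones from config.json.
PLACEHOLDER = dict(n_global_layers=10, n_sliding_layers=50, n_kv_heads=8, head_dim=128)

for ctx in (32_768, 131_072):
    print(ctx, round(kv_cache_gib(ctx, **PLACEHOLDER), 2), "GiB per sequence")
```

In this model of the cache, only the global layers grow with context length, so the cost scales roughly linearly with `--max-model-len` once the context exceeds the sliding window.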

## Thinking Mode Notes

JetLLMLite-4.0 supports configurable thinking mode inherited from the Gemma 4 architecture:

- **Thinking enabled:** add `<|think|>` token to the system prompt
- **Thinking disabled:** omit `<|think|>` from the system prompt

When thinking is enabled, the model emits its internal reasoning as `<|channel>thought\n[reasoning]<channel|>` before the final answer. In multi-turn conversations, strip the thought content from previous assistant turns before sending the next user turn.
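
One way to handle the multi-turn requirement is to strip thought blocks from earlier assistant messages before resending the history. The sketch below does this with a regular expression; it assumes the delimiters appear in the decoded text exactly as written above and that assistant content is a plain string. The helper name `strip_thoughts` is hypothetical.

```python
import re

# Delimiters copied from the format described above; treated here as an assumption.
THOUGHT_BLOCK = re.compile(r"<\|channel>thought\n.*?<channel\|>", re.DOTALL)


def strip_thoughts(messages: list[dict]) -> list[dict]:
    """Return a copy of `messages` with thought blocks removed from assistant turns."""
    cleaned = []
    for message in messages:
        content = message.get("content")
        if message.get("role") == "assistant" and isinstance(content, str):
            # Build a new dict so the caller's history is not mutated.
            message = {**message, "content": THOUGHT_BLOCK.sub("", content).lstrip()}
        cleaned.append(message)
    return cleaned


history = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<|channel>thought\nAdd the numbers.<channel|>4"},
]
print(strip_thoughts(history)[-1]["content"])
```

Serving frameworks with a dedicated reasoning parser may already separate the thought content from the answer, in which case this step is unnecessary.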

## Strengths

- single-GPU deployable unquantized in bfloat16 (80GB H100/H200 class)
- strong multimodal capabilities (image, video, OCR, document parsing)
- built-in reasoning / thinking mode
- native function calling support
- 256K token context window
- 140+ language support
- broad compatibility with inference frameworks
- dense architecture — predictable and consistent performance

## Limitations

- requires at least one high-memory GPU for unquantized deployment
- long context significantly increases KV cache memory pressure
- video understanding limited to 60 seconds at 1 fps
- multimodal usage adds memory overhead compared to text-only
- deployment characteristics depend on framework and quantization settings

## Out-of-Scope / Cautionary Use

As with other frontier-scale multimodal models, outputs should be reviewed before use in:

- medical decision-making
- legal advice
- safety-critical automation
- high-stakes financial decisions
- fully autonomous customer actions without guardrails

Human review, policy controls, and tool-level validation are strongly recommended.

## License

This repository follows the same license as the upstream release.

- **License:** Apache-2.0
- See the upstream Google Gemma repository and included license text for the governing terms.

If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.

## Attribution

Original model and research release by **Google DeepMind**.

Upstream model:
- `google/gemma-4-31b-it`

This repository is an organization-managed copy and is **not the original upstream source**.

## Citation

Please cite the original Gemma 4 release when using this model in research, evaluation, or production documentation.

```bibtex
@misc{gemma4,
  title        = {Gemma 4 Technical Report},
  author       = {Google DeepMind},
  year         = {2026},
  publisher    = {Google DeepMind},
  howpublished = {\url{https://huggingface.co/google/gemma-4-31b-it}}
}
```

---

# JetLLMLite-4.0 (Türkçe)

**JetLLMLite-4.0**, **Jetlink** tarafından yayınlanan multimodal bir instruction-tuned modeldir.

Bu depo; modeli kendi namespace'i altında yönetmek, erişimi kontrol etmek ve dağıtımı kolaylaştırmak isteyen ekipler için hazırlanmıştır.

## Model Özeti

JetLLMLite-4.0, aşağıdaki özelliklere sahip 31B parametreli dense bir multimodal modeldir:

- **31B toplam parametre (dense mimari)**
- **Instruction-tuned (IT) varyant**
- **262.144 token (256K) bağlam uzunluğu**
- **Multimodal: metin + görüntü girişi, metin çıkışı**
- **Video anlama desteği (saniyede 1 kare, 60 saniyeye kadar)**
- **Yerleşik reasoning / thinking modu**
- **Native function calling desteği**
- **140+ dil desteği**
- **Transformers**, **vLLM**, **SGLang**, **llama.cpp**, **MLX**, **Ollama** ile uyumluluk

## Kullanım Amacı

Bu model aşağıdaki gelişmiş kullanım senaryoları için uygundur:

- multimodal sohbet asistanları
- uzun bağlamlı doküman ve PDF anlama
- adım adım akıl yürütme ve problem çözme
- function calling ile agentic workflow yapıları
- kodlama asistanları ve kod üretimi
- görüntü, grafik ve OCR görevleri
- çok dilli kurumsal asistanlar
- araştırma ve benchmark çalışmaları

## Model Detayları

### Mimari

- **Model tipi:** Vision Encoder içeren Dense Causal Language Model
- **Eğitim aşaması:** Pre-training ve Post-training (Instruction-tuned)
- **Toplam parametre:** 31B
- **Mimari stili:** Dense (MoE değil)
- **Dikkat mekanizması:** Hibrit — local sliding-window (1024 token) ve global full-context attention
- **RoPE:** Çift konfigürasyon — sliding katmanlar için standart RoPE, global katmanlar için Proportional RoPE (p-RoPE)
- **Per-Layer Embeddings (PLE):** Evet
- **Paylaşılan KV Cache:** Evet
- **Yerel bağlam uzunluğu:** 262.144 token (256K)
- **Vision encoder:** Değişken en-boy oranı; yapılandırılabilir token bütçeleri (70 / 140 / 280 / 560 / 1120 token)
- **Thinking modu:** System prompt'a `<|think|>` token eklenerek etkinleştirilir

### Ekosistem Uyumluluğu

- Hugging Face Transformers
- vLLM
- SGLang
- llama.cpp
- MLX
- Ollama
- mistral.rs
- LM Studio

## Donanım Gereksinimleri

> JetLLMLite-4.0, tam hassasiyetle (bfloat16) **tek GPU'da çalışabilen** bir modeldir. Bu özelliği, büyük ölçekli MoE veya 100B+ modellerine kıyasla çok daha erişilebilir kılar.

### Referans Donanım

Tahmini GPU bellek gereksinimleri (bfloat16 / tam hassasiyet):

- **Quantize edilmemiş (bfloat16):** tek bir 80GB NVIDIA H100/H200 GPU'ya sığar
- **4-bit quantize:** 24GB+ VRAM'li consumer GPU'larda çalışır (ör. RTX 3090, RTX 4090)
- **Çoklu GPU:** daha yüksek throughput için vLLM ve SGLang üzerinden tensor parallelism desteklenir

> Not: gereksinimler bağlam uzunluğu, batch size ve KV cache ayarlarına göre değişir. Yukarıdakiler pratik referans noktaları olup evrensel minimum değildir.

### Pratik Rehber

#### Tek GPU dağıtımı

Büyük ölçekli MoE modellerinin aksine JetLLMLite-4.0, tam hassasiyetle **tek bir 80GB datacenter GPU'dan** servis edilebilir. Bu özellik, single-node veya maliyet odaklı dağıtım senaryoları için mükemmel bir seçenek sunar.

Consumer GPU'lar için quantize varyantlar (GGUF, GPTQ, AWQ) minimal kalite kaybıyla bellek gereksinimini önemli ölçüde azaltır.

#### Sadece metin kullanımı

vLLM'de `--language-model-only` bayrağını kullanarak vision encoder profiling'i atlayabilir ve KV cache için ek bellek açabilirsiniz.

### Öneri

Çoğu production ekip için en mantıklı yaklaşım:

1. serving için **vLLM** veya **SGLang** ile başlamak
2. quantize edilmemiş bfloat16 dağıtım için **tek H100/H200** kullanmak
3. consumer GPU veya maliyet optimize edilmiş dağıtımlar için **4-bit quantization** uygulamak
4. vision gerekmiyorsa `--language-model-only` ile devre dışı bırakmak

## Yazılım Gereksinimleri

Önerilen ortam:

- Python 3.10+
- Linux
- CUDA destekli GPU altyapısı
- Şu runtime'lardan biri:
  - Transformers (`>= 4.51.0` — Gemma 4 için zorunlu)
  - vLLM
  - SGLang
  - llama.cpp

Yaygın bağımlılıklar:

- `torch`
- `transformers >= 4.51.0`
- `torchvision`
- `pillow`
- `accelerate`

## Hızlı Başlangıç

Transformers kurulumu:

    pip install "transformers>=4.51.0"

### Temel metin çıkarımı

    from transformers import pipeline
    import torch

    pipe = pipeline(
        "image-text-to-text",
        model="Jetlink/JetLLMLite-4.0",
        device="cuda",
        torch_dtype=torch.bfloat16
    )

    messages = [
        {"role": "user", "content": [{"type": "text", "text": "Fransa'nın başkenti neresidir?"}]}
    ]

    # Pass the chat via `text=` so it is not interpreted as the `images` argument.
    output = pipe(text=messages, max_new_tokens=200)
    print(output[0]["generated_text"][-1]["content"])

### Multimodal çıkarım (görüntü + metin)

    from transformers import AutoProcessor, AutoModelForImageTextToText
    import torch
    from PIL import Image

    model_id = "Jetlink/JetLLMLite-4.0"

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto"
    )

    image = Image.open("goruntu.jpg")
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Bu görseli detaylı olarak açıkla."}
            ]
        }
    ]

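    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,  # return a dict of tensors so it can be unpacked into generate()
        return_tensors="pt"
    ).to(model.device)

    output = model.generate(**inputs, max_new_tokens=512)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    print(processor.decode(new_tokens, skip_special_tokens=True))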

### Reasoning / Thinking modu

Thinking modunu etkinleştirmek için system prompt'a `<|think|>` ekleyin:

    messages = [
        {"role": "system", "content": "<|think|>"},
        {"role": "user", "content": [{"type": "text", "text": "x² + 5x + 6 = 0 denkleminin kökleri nelerdir?"}]}
    ]

## Serving Örnekleri

### vLLM

    vllm serve Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --dtype bfloat16

### vLLM Tool Use ile

    vllm serve Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --dtype bfloat16 \
      --enable-auto-tool-choice \
      --tool-call-parser gemma

### vLLM sadece metin modu

    vllm serve Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tensor-parallel-size 1 \
      --max-model-len 32768 \
      --dtype bfloat16 \
      --language-model-only

### SGLang

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tp-size 1 \
      --mem-fraction-static 0.85 \
      --context-length 32768 \
      --dtype bfloat16

### SGLang Tool Use ile

    python -m sglang.launch_server \
      --model-path Jetlink/JetLLMLite-4.0 \
      --port 8000 \
      --tp-size 1 \
      --mem-fraction-static 0.85 \
      --context-length 32768 \
      --dtype bfloat16 \
      --tool-call-parser gemma4

### llama.cpp

    llama-server \
      -m JetLLMLite-4.0.Q4_K_M.gguf \
      --port 8080 \
      -ngl 99 \
      -c 8192

## Uzun Bağlam Notları

JetLLMLite-4.0 yerel olarak **262.144 token** (256K) destekler.

Hibrit dikkat mekanizması (alternatif sliding-window ve global attention) ve Proportional RoPE (p-RoPE) sayesinde verimli uzun bağlam işleme sağlanır. Çoğu production dağıtımında KV cache bellek baskısını yönetmek için `--max-model-len` değerini daha düşük tutmak (ör. 32768) önerilir.

## Thinking Modu Notları

JetLLMLite-4.0, Gemma 4 mimarisinden gelen yapılandırılabilir thinking modunu destekler:

- **Thinking etkin:** system prompt'a `<|think|>` token ekleyin
- **Thinking devre dışı:** `<|think|>` tokenını system prompt'tan çıkarın

Thinking etkinleştirildiğinde model, nihai yanıttan önce `<|channel>thought\n[akıl yürütme]<channel|>` yapısıyla iç mantığını çıktılar. Çok turlu konuşmalarda önceki turların thought içeriği bir sonraki kullanıcı turuna dahil edilmemelidir.

## Güçlü Yönler

- tam hassasiyetle tek GPU'da dağıtılabilir (80GB H100/H200)
- güçlü multimodal yetenekler (görüntü, video, OCR, doküman ayrıştırma)
- yerleşik reasoning / thinking modu
- native function calling desteği
- 256K token bağlam penceresi
- 140+ dil desteği
- inference framework'leriyle geniş uyumluluk
- dense mimari — öngörülebilir ve tutarlı performans

## Sınırlamalar

- quantize edilmemiş dağıtım için en az bir yüksek belleğe sahip GPU gerektirir
- uzun bağlam KV cache bellek baskısını ciddi ölçüde artırır
- video anlama, saniyede 1 kare hızında 60 saniyeyle sınırlıdır
- multimodal kullanım metin çıkarımına kıyasla ek bellek maliyeti getirir
- deployment karakteristiği framework ve quantization ayarlarına göre değişir

## Kapsam Dışı / Dikkat Gerektiren Kullanımlar

Diğer frontier-scale multimodal modellerde olduğu gibi, model çıktıları şu alanlarda insan denetimi olmadan kullanılmamalıdır:

- tıbbi karar verme
- hukuki tavsiye
- güvenlik kritik otomasyon
- yüksek riskli finansal kararlar
- korumasız tam otonom müşteri aksiyonları

İnsan incelemesi, politika kontrolleri ve tool seviyesinde doğrulama güçlü şekilde önerilir.

## Lisans

Bu depo, upstream sürümle aynı lisansı takip eder.

- **Lisans:** Apache-2.0
- Geçerli şartlar için upstream Google Gemma deposu ve lisans metni incelenmelidir.

Modeli yeniden dağıtıyor, fine-tune ediyor, quantize ediyor veya başka şekilde değiştiriyorsan; kullanımının upstream lisans ve attribution gereklilikleriyle uyumlu olduğundan emin olmalısın.

## Atıf

Orijinal model ve araştırma yayını **Google DeepMind** ekibine aittir.

Upstream model:
- `google/gemma-4-31b-it`

Bu depo, kurum tarafından yönetilen bir kopyadır ve **orijinal upstream kaynak değildir**.

## Atıf / Citation

Bu modeli araştırma, değerlendirme veya production dokümantasyonunda kullanıyorsan, lütfen orijinal Gemma 4 sürümüne atıf yap.

```bibtex
@misc{gemma4,
  title        = {Gemma 4 Technical Report},
  author       = {Google DeepMind},
  year         = {2026},
  publisher    = {Google DeepMind},
  howpublished = {\url{https://huggingface.co/google/gemma-4-31b-it}}
}
```