HaoxingChen committed 16 days ago

Commit

40a27ad

verified ·

1 Parent(s): 4e9959f

Upload folder using huggingface_hub

Files changed (37)

.gitattributes +4 -0
README.md +314 -3
assets/architecture.png +3 -0
assets/edit_example.png +3 -0
assets/llada_logo.png +0 -0
assets/performance.png +3 -0
assets/understanding_example.png +0 -0
config.json +48 -0
configuration_llada2uni_moe.py +122 -0
decoder-turbo/config.json +32 -0
decoder-turbo/decoder_model.safetensors +3 -0
decoder/config.json +32 -0
decoder/decoder_model.safetensors +3 -0
image_tokenizer/config.json +61 -0
image_tokenizer/image_tokenizer.safetensors +3 -0
image_tokenizer/preprocessor_config.json +16 -0
image_tokenizer/sigvq_embedding.pt +3 -0
model-00001-of-00013.safetensors +3 -0
model-00002-of-00013.safetensors +3 -0
model-00003-of-00013.safetensors +3 -0
model-00004-of-00013.safetensors +3 -0
model-00005-of-00013.safetensors +3 -0
model-00006-of-00013.safetensors +3 -0
model-00007-of-00013.safetensors +3 -0
model-00008-of-00013.safetensors +3 -0
model-00009-of-00013.safetensors +3 -0
model-00010-of-00013.safetensors +3 -0
model-00011-of-00013.safetensors +3 -0
model-00012-of-00013.safetensors +3 -0
model-00013-of-00013.safetensors +3 -0
model.safetensors.index.json +0 -0
modeling_llada2uni_moe.py +0 -0
special_tokens_map.json +37 -0
tokenizer.json +3 -0
tokenizer_config.json +0 -0
vae/config.json +38 -0
vae/diffusion_pytorch_model.safetensors +3 -0

.gitattributes CHANGED Viewed

@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/architecture.png filter=lfs diff=lfs merge=lfs -text
+assets/edit_example.png filter=lfs diff=lfs merge=lfs -text
+assets/performance.png filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md CHANGED Viewed

@@ -1,3 +1,314 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+tags:
+- multimodal
+- image-generation
+- image-understanding
+- image-editing
+- diffusion
+- moe
+- text-to-image
+library_name: transformers
+pipeline_tag: image-text-to-text
+---
+<p align="center">
+ <img src="./assets/llada_logo.png" width="20%"/>
+</p>
+<div align="center">
+ <h1> LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model </h1>
+  [[📑 Technical Report ]()] &emsp; [[🌐 Github ](https://github.com/inclusionAI/LLaDA2.0-Uni)]
+ <b>AGI Research Center, Inclusion AI </b>
+</div>
+## Model Capabilities
+**LLaDA2.0-Uni** is a unified diffusion Large Language Model (dLLM) based on Mixture-of-Experts (MoE) that seamlessly integrates multimodal understanding and generation within a single model. It supports:
+- 🖼️ **Text-to-Image Generation**: high-fidelity image synthesis with optional thinking/reasoning.
+- 🔍 **Image Understanding**: visual question answering, image captioning, document understanding, and related tasks.
+- ✏️ **Image Editing**: instruction-based editing from a single reference image or multiple reference images.
+- 🎨 **Interleaved Generation and Reasoning**: preliminary support for interleaved generation, which unlocks advanced interleaved reasoning.
+- ⚡ **Sprint Acceleration**: KV cache reuse and adaptive unmasking for faster inference.
+## Model Architecture
+<img src="./assets/architecture.png" width="100%"/>
+- **Unified dLLM-MoE Backbone**: Unifies multimodal understanding and generation into a simple Mask Token Prediction paradigm.
+- **Discrete Semantic Tokenizer**: Utilizes SigLIP-VQ to convert visual inputs into discrete semantic tokens, significantly enhancing multimodal understanding.
+- **Efficient Diffusion Decoder**: Pairs discrete tokens with a specialized diffusion decoder for high-fidelity generation, enabling rapid 8-step inference via distillation.
+## Evaluation Results
+<img src="./assets/performance.png" width="100%"/>
+## Quick Start
+> **Note:** Full installation instructions and CLI scripts are available in the [GitHub repository](https://github.com/inclusionAI/LLaDA2-Uni).
+### ⚙️ Installation
+#### 1. Create a conda environment
+```bash
+git clone https://github.com/inclusionAI/LLaDA2-Uni && cd LLaDA2-Uni
+conda create -n llada2_uni python=3.10 -y
+conda activate llada2_uni
+```
+#### 2. Install PyTorch (CUDA 12.4)
+```bash
+pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124
+```
+#### 3. Install Flash Attention 2 (required for efficient inference)
+```bash
+pip install flash-attn --no-build-isolation
+```
+#### 4. Install remaining dependencies
+```bash
+pip install -r requirements.txt
+```
+### 🌟 Text-to-Image Generation
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from decoder import decode_vq_tokens
+model_path = "inclusionAI/LLaDA2.0-Uni"
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
+).eval()
+model.tokenizer = tokenizer
+# Generate image tokens
+result = model.generate_image(
+    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
+    image_h=1024, image_w=1024,
+    steps=8, cfg_scale=2.0,
+)
+# Decode to PIL image (default: 50-step ODE)
+image = decode_vq_tokens(result["token_ids"], result["h"], result["w"], model_path, "cuda")
+image.save("output.png")
+```
+> [!Note]
+>  💡 **Faster decoding** — Use the **decoder-turbo** (distilled decoder) for **~10× faster** image decoding (8 steps instead of 50) with minimal quality loss:
+> ```python
+> image = decode_vq_tokens(
+>     result["token_ids"], result["h"], result["w"], model_path, "cuda",
+>     num_steps=8, decode_mode="decoder-turbo",
+> )
+> ```
+### 🌟 Text-to-Image Generation with Thinking
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from decoder import decode_vq_tokens
+model_path = "inclusionAI/LLaDA2.0-Uni"
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
+).eval()
+model.tokenizer = tokenizer
+# Generate image tokens with thinking process
+result = model.generate_image(
+    "A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.",
+    image_h=1024, image_w=1024,
+    mode="thinking",
+    steps=8, cfg_scale=2.0,
+    thinking_steps=32, thinking_gen_length=4096,
+)
+# Print thinking trace
+print("Thinking:", result["thinking"])
+# Decode to PIL image
+image = decode_vq_tokens(
+    result["token_ids"], result["h"], result["w"], model_path, "cuda",
+    num_steps=8, decode_mode="decoder-turbo",
+)
+image.save("output_thinking.png")
+```
+### 🌟 Image Understanding
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from encoder.image_tokenizer import ImageTokenizer
+from decoder.smart_img_process import smart_resize_images
+model_path = "inclusionAI/LLaDA2.0-Uni"
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
+).eval()
+model.tokenizer = tokenizer
+# Encode image to discrete tokens
+image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
+pil_image = smart_resize_images(["./assets/understanding_example.png"])[0]
+info = image_tokenizer.encode_with_info(pil_image)
+image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
+_, h, w = info["grid_thw"]
+# Understand the image
+response = model.understand_image(
+    image_tokens, h, w,
+    question="Describe this image in detail.",
+    steps=32, gen_length=2048,
+)
+print(response)
+```
+### 🌟 Image Editing
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from encoder.image_tokenizer import ImageTokenizer
+from decoder.utils import generate_crop_size_list, var_center_crop
+from decoder import decode_vq_tokens
+from PIL import Image
+model_path = "inclusionAI/LLaDA2.0-Uni"
+tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path, device_map="cuda", torch_dtype="bfloat16", trust_remote_code=True
+).eval()
+model.tokenizer = tokenizer
+# Encode source image
+image_tokenizer = ImageTokenizer(model_path=model_path, device="cuda")
+crop_size_list = generate_crop_size_list((512 // 32) ** 2, 32)
+pil_image = var_center_crop(Image.open("./assets/edit_example.png").convert("RGB"), crop_size_list=crop_size_list)
+info = image_tokenizer.encode_with_info(pil_image)
+image_tokens = [x + model.config.image_token_offset for x in info["token_ids"]]
+_, h, w = info["grid_thw"]
+# Edit the image
+result = model.edit_image(
+    image_tokens, h, w,
+    instruction="Change the background to a beach.",
+    steps=8, cfg_text_scale=4.0,
+)
+# Decode to PIL image
+edited_image = decode_vq_tokens(
+    result["token_ids"], result["h"], result["w"], model_path, "cuda",
+    num_steps=8, decode_mode="decoder-turbo",
+)
+edited_image.save("edited.png")
+```
+### 🌟 SPRINT Acceleration
+SPRINT accelerates inference by combining **KV cache reuse**, **adaptive unmasking**, and **threshold-based batch acceptance**:
+- **KV Cache Reuse & Pruning**: The prefix KV cache is computed once during warmup steps, then optionally pruned by importance scores (blending KV attention importance with token confidence). Subsequent denoising steps reuse the cached prefix, significantly reducing computation. Per-modality keep ratios (`image_keep_ratio`, `text_keep_ratio`) allow fine-grained control — e.g., retaining all image/text tokens for quality while still benefiting from cache reuse.
+- **Adaptive Unmasking**: Instead of unmasking a fixed number of tokens per step, Sprint uses model confidence to choose which tokens to reveal and recomputes the per-step budget as it goes. At each step, it computes confidence scores (via strategies like `low_confidence`, `top_k_margin`, or `neg_entropy`) and unmasks the top-k most confident tokens, where k is set as `ceil(remaining_masked / steps_left)`. Easy positions are resolved early, and the remaining steps concentrate compute on harder tokens.
+- **Batch Acceptance**: On top of adaptive scheduling, all tokens whose probability exceeds `threshold` are accepted in batch, further reducing the number of denoising iterations needed.
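The adaptive unmasking and batch acceptance rules above can be sketched in a few lines. This is a minimal illustration, not the code in `modeling_llada2uni_moe.py` (whose diff is not rendered on this page). The function name `select_positions_to_unmask`, the dict-of-confidences input, and the choice to take the union of the top-k set and the above-threshold set are all assumptions.

```python
import math


def select_positions_to_unmask(confidence, steps_left, threshold):
    """Pick masked positions to reveal at one denoising step.

    confidence: dict mapping each still-masked position to a score in [0, 1]
                (hypothetical input, e.g. the max token probability).
    steps_left: number of denoising steps remaining, including this one.
    threshold:  scores strictly above this are accepted in batch.
    """
    if not confidence:
        return []
    # Adaptive budget: spread the remaining masked tokens over the steps left.
    k = math.ceil(len(confidence) / steps_left)
    ranked = sorted(confidence, key=confidence.get, reverse=True)
    chosen = set(ranked[:k])
    # Batch acceptance: also take every position above the threshold.
    chosen.update(p for p, c in confidence.items() if c > threshold)
    return sorted(chosen)


scores = {0: 0.99, 1: 0.50, 2: 0.95, 3: 0.20, 4: 0.94, 5: 0.10}
print(select_positions_to_unmask(scores, steps_left=3, threshold=0.93))
# -> [0, 2, 4]
```

With six masked positions and three steps left, the schedule alone would reveal two positions. The threshold admits a third, which is how batch acceptance can reduce the number of iterations.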
+**Image Understanding** with Sprint:
+```python
+response = model.understand_image(
+    image_tokens, h, w,
+    question="Describe this image in detail.",
+    steps=32, gen_length=4096,
+    use_sprint=True,
+    threshold=0.93,
+    keep_ratio=0.5,
+    cache_warmup_steps=1,
+    image_keep_ratio=1.0,
+    text_keep_ratio=1.0,
+)
+```
+**Text-to-Image** with Sprint:
+```python
+result = model.generate_image(
+    "A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f/2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.",
+    image_h=1024, image_w=1024,
+    cfg_scale=2.0,
+    use_sprint=True,
+    block_length=32,
+    steps=8,
+    keep_ratio=0.5,
+    cache_warmup_steps=1,
+)
+```
+> [!Note]
+>  Sprint is supported for Simple CFG and no-CFG modes. When using Editing CFG (three-way guidance with `cfg_text_scale` / `cfg_image_scale`), Sprint automatically falls back to baseline.
+## Repository Structure
+```
+LLaDA2-Uni/
+├── config.json                          # Model configuration
+├── modeling_llada2uni_moe.py            # Model implementation (trust_remote_code)
+├── configuration_llada2uni_moe.py       # Config class
+├── tokenizer.json                       # Tokenizer
+├── model-00001-of-00013.safetensors     # MoE backbone weights (sharded, bf16)
+├── ...
+├── model-00013-of-00013.safetensors
+├── model.safetensors.index.json
+├── image_tokenizer/
+│   ├── config.json
+│   ├── image_tokenizer.safetensors      # SigLIP-VQ encoder
+│   ├── sigvq_embedding.pt               # SigVQ embedding + projector
+│   └── preprocessor_config.json
+├── decoder/
+│   ├── config.json
+│   └── decoder_model.safetensors        # Diffusion decoder (bf16, 12GB)
+├── decoder-turbo/
+│   ├── config.json
+│   └── decoder_model.safetensors        # Distilled few-step decoder (bf16, 12GB)
+└── vae/
+    ├── config.json
+    └── diffusion_pytorch_model.safetensors
+```
+## Hardware Requirements
+| Component | GPU Memory |
+|---|---|
+| MoE Backbone (bf16, 16B total) | ~32 GB |
+| Diffusion Decoder (bf16, 6.2B) | ~12 GB |
+| VAE + SigVQ + Tokenizer | ~3 GB |
+| **Total (generation/editing)** | **~47 GB** |
+| **Total (understanding only)** | **~35 GB** |
+> 💡 While only ~1B parameters are activated per token during inference, all 16B MoE parameters must be loaded into memory. The diffusion decoder is only needed for image generation/editing and is released afterwards.
+## 🚀 SGLang Support (Coming Soon)
+We are working on integrating [SGLang](https://github.com/sgl-project/sglang) for high-throughput serving and optimized inference. Stay tuned!
+## ⚠️ License
+This project is licensed under the terms of the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+## 📖 BibTeX
+```bibtex
+@article{LLaDA2Uni,
+title = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},
+author = {Inclusion AI},
+year = {2026}
+}
+```

assets/architecture.png ADDED Viewed

Git LFS Details

SHA256: 418b2990397db6f0bab58e01d331464e3d217e0353523d2927b7f226fba802df
Pointer size: 132 Bytes
Size of remote file: 1.13 MB

assets/edit_example.png ADDED Viewed

Git LFS Details

SHA256: b93b777561b5217b2545a85e02698d64c246e502c8fa6e15b86b1203da01138e
Pointer size: 132 Bytes
Size of remote file: 1.26 MB

assets/llada_logo.png ADDED Viewed

assets/performance.png ADDED Viewed

Git LFS Details

SHA256: 5f2fa37f61facb8b4280c6a97f64951be8c5840ea051c2792c7085a76e008019
Pointer size: 131 Bytes
Size of remote file: 802 kB

assets/understanding_example.png ADDED Viewed

config.json ADDED Viewed

	@@ -0,0 +1,48 @@

+{
+  "architectures": [
+    "LLaDA2MoeModelLM"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_llada2uni_moe.LLaDA2MoeConfig",
+    "AutoModel": "modeling_llada2uni_moe.LLaDA2MoeModelLM",
+    "AutoModelForCausalLM": "modeling_llada2uni_moe.LLaDA2MoeModelLM"
+  },
+  "model_type": "llada2_moe",
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.51.0",
+  "vocab_size": 173568,
+  "hidden_size": 2048,
+  "intermediate_size": 5120,
+  "num_hidden_layers": 20,
+  "num_attention_heads": 16,
+  "num_key_value_heads": 4,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "use_qkv_bias": false,
+  "use_qk_norm": true,
+  "use_bias": false,
+  "rms_norm_eps": 1e-06,
+  "attention_dropout": 0.0,
+  "initializer_range": 0.02,
+  "max_position_embeddings": 8192,
+  "rope_theta": 600000,
+  "rope_parameters": {
+    "rope_type": "default",
+    "rope_theta": 600000,
+    "partial_rotary_factor": 0.5
+  },
+  "partial_rotary_factor": 0.5,
+  "use_cache": false,
+  "sliding_window": null,
+  "pad_token_id": 156892,
+  "num_experts": 256,
+  "num_shared_experts": 1,
+  "num_experts_per_tok": 8,
+  "n_group": 8,
+  "topk_group": 4,
+  "routed_scaling_factor": 2.5,
+  "moe_intermediate_size": 512,
+  "first_k_dense_replace": 1,
+  "output_router_logits": false,
+  "image_token_offset": 157184
+}
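The README snippets shift VQ codebook indices by `image_token_offset` before passing them to the model. The sketch below shows that mapping using only values from this commit: `vocab_size` and `image_token_offset` from this file, and `num_embeddings` (16384) from `image_tokenizer/config.json`. The helper names are hypothetical, and the reading that image tokens fill the tail of the vocabulary is an inference from the arithmetic, not something the config states.

```python
VOCAB_SIZE = 173568          # config.json
IMAGE_TOKEN_OFFSET = 157184  # config.json
VQ_CODEBOOK_SIZE = 16384     # image_tokenizer/config.json, vq_config.num_embeddings


def vq_to_vocab(index):
    """Map a VQ codebook index to its id in the LLM vocabulary."""
    if not 0 <= index < VQ_CODEBOOK_SIZE:
        raise ValueError(f"VQ index out of range: {index}")
    return index + IMAGE_TOKEN_OFFSET


def vocab_to_vq(token_id):
    """Inverse mapping; rejects ids that are not image tokens."""
    index = token_id - IMAGE_TOKEN_OFFSET
    if not 0 <= index < VQ_CODEBOOK_SIZE:
        raise ValueError(f"not an image token id: {token_id}")
    return index


# The offset plus the codebook size lands exactly on the vocabulary size.
print(IMAGE_TOKEN_OFFSET + VQ_CODEBOOK_SIZE == VOCAB_SIZE)
print(vq_to_vocab(0), vq_to_vocab(VQ_CODEBOOK_SIZE - 1))
```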

configuration_llada2uni_moe.py ADDED Viewed

	@@ -0,0 +1,122 @@

+# Copyright 2025 Antgroup and The HuggingFace Inc. team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""LLaDA2 MoE model configuration."""
+from transformers.configuration_utils import PretrainedConfig
+class LLaDA2MoeConfig(PretrainedConfig):
+    r"""
+    Configuration class for the LLaDA2 MoE model (discrete-token multimodal LLM).
+    This config covers the LLM backbone only. Images are represented as discrete VQ tokens
+    in the extended vocabulary — no vision encoder config is needed.
+    ```python
+    >>> from configuration_llada2uni_moe import LLaDA2MoeConfig
+    >>> config = LLaDA2MoeConfig()
+    ```
+    """
+    model_type = "llada2_moe"
+    def __init__(
+        self,
+        vocab_size=30592,
+        hidden_size=1024,
+        intermediate_size=None,
+        num_hidden_layers=24,
+        num_attention_heads=16,
+        num_key_value_heads=0,
+        head_dim=None,
+        hidden_act="silu",
+        use_qkv_bias=False,
+        use_qk_norm=False,
+        use_bias=True,
+        rms_norm_eps=1e-05,
+        tie_word_embeddings=False,
+        attention_dropout=0.1,
+        initializer_range=0.02,
+        max_position_embeddings=16384,
+        rope_theta=10000.0,
+        rope_parameters=None,
+        partial_rotary_factor=0.5,
+        use_cache=True,
+        sliding_window=None,
+        pad_token_id=126081,
+        # Image
+        image_token_offset=157184,
+        # MoE
+        num_experts=16,
+        num_shared_experts=0,
+        num_experts_per_tok=2,
+        n_group=8,
+        topk_group=4,
+        routed_scaling_factor=2.5,
+        moe_intermediate_size=None,
+        first_k_dense_replace=0,
+        output_router_logits=False,
+        **kwargs,
+    ):
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.intermediate_size = intermediate_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.num_key_value_heads = num_key_value_heads
+        self.head_dim = head_dim or hidden_size // num_attention_heads
+        self.hidden_act = hidden_act
+        self.use_qkv_bias = use_qkv_bias
+        self.use_qk_norm = use_qk_norm
+        self.use_bias = use_bias
+        self.rms_norm_eps = rms_norm_eps
+        self.attention_dropout = attention_dropout
+        self.initializer_range = initializer_range
+        self.max_position_embeddings = max_position_embeddings
+        self.rope_theta = rope_theta
+        self.partial_rotary_factor = partial_rotary_factor
+        self.use_cache = use_cache
+        self.sliding_window = sliding_window
+        # Image token offset: VQ codebook indices are shifted by this amount in the vocabulary
+        self.image_token_offset = image_token_offset
+        # RoPE parameters dict — used by LLaDA2MoeRotaryEmbedding
+        if rope_parameters is None:
+            rope_parameters = {
+                "rope_type": "default",
+                "rope_theta": rope_theta,
+                "partial_rotary_factor": partial_rotary_factor,
+            }
+        self.rope_parameters = rope_parameters
+        # MoE
+        self.num_experts = num_experts
+        self.num_shared_experts = num_shared_experts
+        self.num_experts_per_tok = num_experts_per_tok
+        self.n_group = n_group
+        self.topk_group = topk_group
+        self.routed_scaling_factor = routed_scaling_factor
+        self.moe_intermediate_size = moe_intermediate_size
+        self.first_k_dense_replace = first_k_dense_replace
+        self.output_router_logits = output_router_logits
+        super().__init__(
+            pad_token_id=pad_token_id,
+            tie_word_embeddings=tie_word_embeddings,
+            **kwargs,
+        )
+__all__ = ["LLaDA2MoeConfig"]
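Two defaulting rules in `__init__` are easy to miss: `head_dim` falls back to `hidden_size // num_attention_heads`, and `rope_parameters` is built from `rope_theta` and `partial_rotary_factor` when it is not supplied. The stand-alone sketch below mirrors just those two rules without importing `transformers`. `resolve_derived_fields` is a hypothetical helper for illustration, not part of the uploaded file.

```python
def resolve_derived_fields(hidden_size, num_attention_heads, head_dim=None,
                           rope_theta=10000.0, partial_rotary_factor=0.5,
                           rope_parameters=None):
    """Mirror the head_dim and rope_parameters fallbacks of LLaDA2MoeConfig."""
    # Same expression as the config class: `or` also replaces head_dim=0.
    head_dim = head_dim or hidden_size // num_attention_heads
    if rope_parameters is None:
        rope_parameters = {
            "rope_type": "default",
            "rope_theta": rope_theta,
            "partial_rotary_factor": partial_rotary_factor,
        }
    return head_dim, rope_parameters


# Class defaults (hidden_size=1024, 16 heads) give a head_dim of 64 ...
print(resolve_derived_fields(1024, 16)[0])
# ... while the shipped config.json sets hidden_size=2048 and head_dim=128.
print(resolve_derived_fields(2048, 16, head_dim=128, rope_theta=600000)[0])
```

The class defaults differ from the released `config.json` in several places (for example `num_experts=16` versus 256), so `LLaDA2MoeConfig()` with no arguments does not describe the released checkpoint. Loading the config from the repository is the safer route.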

decoder-turbo/config.json ADDED Viewed

	@@ -0,0 +1,32 @@

+{
+  "_class_name": "ZImageTransformer2DModel",
+  "_diffusers_version": "0.37.0.dev0",
+  "all_f_patch_size": [
+    1
+  ],
+  "all_patch_size": [
+    2
+  ],
+  "axes_dims": [
+    32,
+    48,
+    48
+  ],
+  "axes_lens": [
+    1536,
+    512,
+    512
+  ],
+  "cap_feat_dim": 2560,
+  "dim": 3840,
+  "in_channels": 16,
+  "n_heads": 30,
+  "n_kv_heads": 30,
+  "n_layers": 30,
+  "n_refiner_layers": 2,
+  "norm_eps": 1e-05,
+  "qk_norm": true,
+  "rope_theta": 256.0,
+  "siglip_feat_dim": null,
+  "t_scale": 1000.0
+}

decoder-turbo/decoder_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:76007e00703e5289b6a64c1c0f44dc26ed0616e0f99018c64b79ef201f8ff248
+size 12321673696
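The three added lines above are a Git LFS pointer rather than the weights themselves: a spec version, a SHA-256 object id, and the size in bytes. A minimal parser sketch follows (the function name `parse_lfs_pointer` is hypothetical), which also makes the size behind the README's "12GB" figure easy to read off.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:76007e00703e5289b6a64c1c0f44dc26ed0616e0f99018c64b79ef201f8ff248
size 12321673696
"""
info = parse_lfs_pointer(pointer)
print(info["hash_algo"], f"{info['size'] / 1e9:.1f} GB")
```

Assuming 2 bytes per bf16 parameter, 12,321,673,696 bytes corresponds to roughly 6.2 B parameters, in line with the decoder row of the README's hardware table.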

decoder/config.json ADDED Viewed

	@@ -0,0 +1,32 @@

+{
+  "_class_name": "ZImageTransformer2DModel",
+  "_diffusers_version": "0.37.0.dev0",
+  "all_f_patch_size": [
+    1
+  ],
+  "all_patch_size": [
+    2
+  ],
+  "axes_dims": [
+    32,
+    48,
+    48
+  ],
+  "axes_lens": [
+    1536,
+    512,
+    512
+  ],
+  "cap_feat_dim": 2560,
+  "dim": 3840,
+  "in_channels": 16,
+  "n_heads": 30,
+  "n_kv_heads": 30,
+  "n_layers": 30,
+  "n_refiner_layers": 2,
+  "norm_eps": 1e-05,
+  "qk_norm": true,
+  "rope_theta": 256.0,
+  "siglip_feat_dim": null,
+  "t_scale": 1000.0
+}

decoder/decoder_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b4538abc88dc41ecbdced5b032a5f0ac1f0780f96b36256d37bb7a105930ae8f
+size 12321673696

image_tokenizer/config.json ADDED Viewed

	@@ -0,0 +1,61 @@

+{
+  "architectures": [
+    "GlmImageForConditionalGeneration"
+  ],
+  "image_start_token_id": 16384,
+  "image_end_token_id": 16385,
+  "image_token_id": 167855,
+  "model_type": "glm_image",
+  "text_config": {
+    "attention_dropout": 0.0,
+    "eos_token_id": 16385,
+    "pad_token_id": 167841,
+    "hidden_act": "silu",
+    "hidden_size": 4096,
+    "initializer_range": 0.02,
+    "intermediate_size": 13696,
+    "max_position_embeddings": 131072,
+    "model_type": "glm_image_text",
+    "num_attention_heads": 32,
+    "num_hidden_layers": 40,
+    "num_key_value_heads": 2,
+    "rms_norm_eps": 1e-05,
+    "dtype": "bfloat16",
+    "rope_parameters": {
+      "rope_theta": 10000,
+      "rope_type": "default",
+      "mrope_section": [
+        8,
+        12,
+        12
+      ],
+      "partial_rotary_factor": 0.5
+    },
+    "use_cache": true,
+    "vision_vocab_size": 16512,
+    "vocab_size": 168064
+  },
+  "transformers_version": "5.0.0.dev0",
+  "vision_config": {
+    "attention_bias": true,
+    "attention_dropout": 0.0,
+    "depth": 40,
+    "hidden_act": "gelu",
+    "hidden_size": 1536,
+    "image_size": 2048,
+    "in_channels": 3,
+    "intermediate_size": 6144,
+    "layer_norm_eps": 1e-06,
+    "model_type": "glm_image_vision",
+    "num_heads": 16,
+    "patch_size": 16
+  },
+  "vq_config": {
+    "embed_dim": 2048,
+    "in_channels": 3,
+    "initializer_range": 0.02,
+    "latent_channels": 1536,
+    "model_type": "glm_image_vqmodel",
+    "num_embeddings": 16384
+  }
+}

image_tokenizer/image_tokenizer.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:e0a11a82ad221ac1f3b917abfce31ffaaec3571200ae7ee5318a223ff2eedc49
+size 2398968416

image_tokenizer/preprocessor_config.json ADDED Viewed

	@@ -0,0 +1,16 @@

+{
+    "min_pixels": 262144,
+    "max_pixels": 4194304,
+    "do_rescale": true,
+    "do_normalize": true,
+    "do_resize": false,
+    "patch_size": 16,
+    "temporal_patch_size": 1,
+    "merge_size": 1,
+    "image_mean": [0.5, 0.5, 0.5],
+    "image_std": [0.5, 0.5, 0.5],
+    "image_processor_type": "GlmImageImageProcessor",
+    "processor_class": "GlmImageProcessor",
+    "resample": 3,
+    "rescale_factor": 0.00392156862745098
+}
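Read together, `rescale_factor`, `image_mean`, and `image_std` map 8-bit pixel values to roughly [-1, 1], and `min_pixels` / `max_pixels` equal 512² and 2048². The sketch below works through that arithmetic in plain Python. `normalize_pixel` and `within_pixel_budget` are hypothetical helpers. How the real `GlmImageImageProcessor` applies these fields, and whether the pixel bounds are enforced when `do_resize` is false, is not shown in this diff.

```python
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5
MIN_PIXELS = 262144
MAX_PIXELS = 4194304


def normalize_pixel(value):
    """Rescale an 8-bit channel value to [0, 1], then normalize to about [-1, 1]."""
    return (value * RESCALE_FACTOR - IMAGE_MEAN) / IMAGE_STD


def within_pixel_budget(height, width):
    """Check an image area against the configured min/max pixel counts."""
    return MIN_PIXELS <= height * width <= MAX_PIXELS


print(round(normalize_pixel(0), 6), round(normalize_pixel(255), 6))
print(MIN_PIXELS == 512 ** 2, MAX_PIXELS == 2048 ** 2)
print(within_pixel_budget(1024, 1024), within_pixel_budget(256, 256))
```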

image_tokenizer/sigvq_embedding.pt ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1f3689458a14ee04088def752dfaf8fe391910a1a61511decd9ce87fbdbe981b
+size 402688449

model-00001-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9de425701ab2118e3554f63a274ac98962c4efa4d95a05c8d81d409a45ffacd6
+size 5369025312

model-00002-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3f55a735671a05fcc5a8983181de16b0008e3bf8766273d10e490a9524a15df5
+size 5369025664

model-00003-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:26f578d62d76cecfd83d2ac28cb42618d6ea6ff37b6a28317ae54f1fb53ac284
+size 5369025576

model-00004-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d4fe3bacdc83f196430f374db78c5ca647fdf61e3ae8657fd8d1de7155429a97
+size 5369027896

model-00005-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:960858047af7d66f0c38c2d4fb76b4a80c65d5b4e04b8f55c5e3b7704ecda466
+size 5369027904

model-00006-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c16ab6dbd22b8273211d956cc0ead6aed6239a14457f9a071604d6ec25cf9c3f
+size 3821234896

model-00007-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cbcd0ef62f99aa2025fdc47d59ecb771e415162730628f74133dce4c533d1708
+size 63992360

model-00008-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:43e3be5de1c7b2b6f95f3cf31d842e9888115b2754775973979b33134738f2ee
+size 824220328

model-00009-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:455ef640a86513be99ee39d38164dc17908c51bae27cafe50601bfaf46f069fa
+size 124811128

model-00010-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b839b13466a6f65b1d8121ed21a80b02273b0f70cdf74b08de73a439c11c306e
+size 84965488

model-00011-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:1a39c7db1ca265dcefd0c7167dc1782fe8e782aef204bc94643518df1579cce7
+size 79725952

model-00012-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d2f905aa76429a6ac3b995b0a32019bf1f18df64d6b36106dafe7f770e1f7849
+size 84967960

model-00013-of-00013.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:541420db214a20bb6d107256b48b1ae139cfa2376df12decaaf745d981360d3a
+size 718288984
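As a cross-check on the README's hardware table, the thirteen shard sizes in the LFS pointers above can be summed. The sketch below assumes every stored tensor is bf16 (2 bytes per parameter) and ignores the small safetensors headers, so the parameter count is only an estimate.

```python
# Sizes in bytes, copied from the LFS pointers of model-00001 ... model-00013.
SHARD_SIZES = [
    5369025312, 5369025664, 5369025576, 5369027896, 5369027904,
    3821234896, 63992360, 824220328, 124811128, 84965488,
    79725952, 84967960, 718288984,
]
BYTES_PER_BF16 = 2  # assumption: all weights are stored as bfloat16

total_bytes = sum(SHARD_SIZES)
approx_params = total_bytes // BYTES_PER_BF16

print(f"total size : {total_bytes / 1e9:.1f} GB")
print(f"parameters : {approx_params / 1e9:.1f} B (approx.)")
```

The totals come to about 32.6 GB and 16.3 B parameters, consistent with the "16B total, ~32 GB" row for the MoE backbone.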

model.safetensors.index.json ADDED Viewed

The diff for this file is too large to render. See raw diff

modeling_llada2uni_moe.py ADDED Viewed

The diff for this file is too large to render. See raw diff

special_tokens_map.json ADDED Viewed

	@@ -0,0 +1,37 @@

+{
+  "bos_token": {
+    "content": "<|startoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<|mask|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2197aeddaf09785316673451ca6fb86dcfcfdb108972a3145d106b8fa4c927e6
+size 15297062

tokenizer_config.json ADDED Viewed

The diff for this file is too large to render. See raw diff

vae/config.json ADDED Viewed

	@@ -0,0 +1,38 @@

+{
+  "_class_name": "AutoencoderKL",
+  "_diffusers_version": "0.30.0.dev0",
+  "_name_or_path": "../checkpoints/flux-dev",
+  "act_fn": "silu",
+  "block_out_channels": [
+    128,
+    256,
+    512,
+    512
+  ],
+  "down_block_types": [
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D",
+    "DownEncoderBlock2D"
+  ],
+  "force_upcast": true,
+  "in_channels": 3,
+  "latent_channels": 16,
+  "latents_mean": null,
+  "latents_std": null,
+  "layers_per_block": 2,
+  "mid_block_add_attention": true,
+  "norm_num_groups": 32,
+  "out_channels": 3,
+  "sample_size": 1024,
+  "scaling_factor": 0.3611,
+  "shift_factor": 0.1159,
+  "up_block_types": [
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D",
+    "UpDecoderBlock2D"
+  ],
+  "use_post_quant_conv": false,
+  "use_quant_conv": false
+}

vae/diffusion_pytorch_model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5b59a26851551b67ae1fe58d32e76486e1e812def4696a4bea97f16604d40a3
+size 167666902