---
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language
- chat
---

# Rax 3.5 Chat

Rax 3.5 Chat is a compact 2B-parameter multimodal model for vision-language understanding and conversational AI. It accepts text and image inputs and supports an extended context of up to 262K tokens.

## Model Details

- **Parameters**: ~2B
- **Context Length**: 262,144 tokens
- **Input Modalities**: Text + Images
- **Attention**: Hybrid linear + full attention (24 layers)
- **Vision Encoder**: 24-layer transformer with hidden size 1024
- **Text Hidden Size**: 2048
- **Precision**: BFloat16

## Key Features

- **Multimodal Understanding**: Processes text and images with unified reasoning
- **Long Context**: Supports up to 262K tokens for extended conversations
- **Efficient Architecture**: Hybrid attention mechanism balances speed and quality
- **Production Ready**: Compatible with vLLM, SGLang, and Transformers

## Usage

### With Transformers

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model = AutoModelForVision2Seq.from_pretrained("raxcore/Rax-3.5-Chat", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("raxcore/Rax-3.5-Chat", trust_remote_code=True)

# Text-only conversation
messages = [{"role": "user", "content": "What is the capital of France?"}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))

# With an image
image = Image.open("image.jpg")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

### With vLLM

```bash
vllm serve raxcore/Rax-3.5-Chat --port 8000 --max-model-len 8192
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
response = client.chat.completions.create(
    model="raxcore/Rax-3.5-Chat",
    messages=[{"role": "user", "content": "Hello!"}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)
```

## Architecture Highlights

- **Hybrid Attention**: Alternates linear-attention and full-attention layers for efficiency
- **Vision Encoder**: 24-layer transformer with patch size 16 and 2x2 spatial merging
- **Efficient KV Cache**: 2 key-value heads reduce the memory footprint
- **Multi-resolution Position Embeddings**: Optimized for long-context understanding

## Best Practices

- Use temperature 0.6–0.8 for factual tasks and 0.8–1.0 for creative tasks
- For long contexts (>32K tokens), ensure sufficient GPU memory
- Pass `trust_remote_code=True` when loading the model

## Limitations

- At 2B parameters, complex reasoning may lag behind larger models
- Vision understanding is optimized for natural images
- Long contexts require significant memory resources

## License

Apache 2.0

## Citation

```bibtex
@misc{rax3.5chat,
  title={Rax 3.5 Chat: Efficient Multimodal Assistant Model},
  author={Raxcore},
  year={2026}
}
```
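## Appendix: Estimating Long-Context Memory

The Limitations section notes that long contexts require significant memory; most of that is the KV cache of the full-attention layers (linear-attention layers keep a fixed-size state regardless of sequence length). The sketch below is a back-of-envelope estimate only. It assumes a head dimension of 128 and that half of the 24 layers use full attention — neither figure is stated in this card; the 2 KV heads and bfloat16 precision are from the Model Details above.

```python
def kv_cache_bytes(seq_len: int,
                   n_full_attn_layers: int = 12,  # assumption: half of the 24 layers
                   n_kv_heads: int = 2,           # from the model card
                   head_dim: int = 128,           # assumption: not stated in the card
                   bytes_per_value: int = 2) -> int:  # bfloat16
    """Rough KV-cache size for the full-attention layers only."""
    # Each layer stores 2 tensors (K and V) of shape [n_kv_heads, seq_len, head_dim]
    return n_full_attn_layers * 2 * n_kv_heads * head_dim * seq_len * bytes_per_value

print(f"{kv_cache_bytes(262_144) / 2**30:.1f} GiB")  # ~3.0 GiB at the full 262K context
```

Under these assumptions the KV cache at the full 262K context is on the order of 3 GiB, on top of the model weights and activations — consistent with the guidance above to verify GPU memory before running past 32K tokens.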