# Using Local LLMs with StoryBox

This guide explains how to run StoryBox with local LLMs like **Gemma 4**, **Llama 3.1**, **Mistral**, **Phi-4**, etc.

## Supported Local LLM Options

### Option 1: Ollama (Recommended)

**Ollama** is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.

#### Step 1: Install Ollama

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download
```

#### Step 2: Pull Your Model

```bash
# Gemma 4 (Google's latest model)
ollama pull gemma4

# Gemma 4 with specific sizes
ollama pull gemma4:4b    # 4 billion parameters
ollama pull gemma4:9b    # 9 billion parameters
ollama pull gemma4:27b   # 27 billion parameters

# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1
```

#### Step 3: Verify Ollama is Running

```bash
# Check if the Ollama server is running
curl http://localhost:11434/api/tags

# Test the model
ollama run gemma4 "Hello, can you help me write a story?"
```

#### Step 4: Configure StoryBox

Edit `reverie/config/config.py`:

```python
# Change this line:
llm_model_name = 'gpt-4o-mini'

# To your local model:
llm_model_name = 'gemma4'        # or 'gemma4:9b', 'gemma4:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'

# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
```

#### Step 5: Run StoryBox

```bash
cd /app/storybox/reverie
python run.py
```

---

### Option 2: HuggingFace Transformers (Direct Loading)

For models not available via Ollama, you can load them directly with HuggingFace.

#### Step 1: Install Dependencies

```bash
pip install transformers accelerate bitsandbytes
```

#### Step 2: Modify `reverie/common/llm.py`

Add your model to the `get_chat_model()` function:

```python
# Huggingface direct loading
elif model_name in {'google/gemma-4-4b-it', 'google/gemma-4-9b-it', 'google/gemma-4-27b-it'}:
    # Skip any of these imports already present at the top of llm.py
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        load_in_8bit=True  # Use 8-bit quantization to save VRAM
    )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=4096,
        temperature=temperature
    )
    llm = HuggingFacePipeline(pipeline=pipe)
    chat_model = ChatHuggingFace(llm=llm)
```

#### Step 3: Update Config

```python
llm_model_name = 'google/gemma-4-9b-it'
```

---

### Option 3: vLLM (High-Throughput Serving)

For production use or multiple concurrent requests, **vLLM** offers much better throughput.

#### Step 1: Install vLLM

```bash
pip install vllm
```

#### Step 2: Start vLLM Server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-9b-it \
    --tensor-parallel-size 1 \
    --max-model-len 8192
```

#### Step 3: Configure StoryBox as OpenAI-Compatible

```python
# In config.py
llm_model_name = 'google/gemma-4-9b-it'
base_url = 'http://localhost:8000/v1'  # vLLM default port
api_key = 'not-needed-for-local'       # vLLM doesn't require auth by default
```
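If StoryBox cannot connect even though the server started, it can help to sanity-check the OpenAI-compatible endpoint directly. Below is a minimal sketch using the `openai` Python client; the port and model name are assumptions taken from the vLLM command above, so adjust them to whatever you actually served.

```python
# Minimal sanity check for an OpenAI-compatible local server (e.g., vLLM).
# Assumes `pip install openai`; base_url and model match the command above.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # vLLM default port
    api_key="not-needed-for-local",        # any non-empty string works without auth
)

response = client.chat.completions.create(
    model="google/gemma-4-9b-it",          # must match the --model you served
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=50,
)
print(response.choices[0].message.content)
```

If this prints a reply, the same `base_url` and model name should work in `config.py`.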
---

### Option 4: LM Studio (GUI Alternative)

**LM Studio** provides a user-friendly GUI for running local LLMs.

1. Download from https://lmstudio.ai
2. Download Gemma 4 (or any model) through the UI
3. Start the local server (default: `http://localhost:1234`)
4. Configure StoryBox:

```python
llm_model_name = 'gemma4'
base_url = 'http://localhost:1234/v1'  # LM Studio's OpenAI-compatible endpoint
```

---

## Hardware Requirements

| Model | VRAM Required | RAM Fallback | Speed (tokens/sec) |
|-------|---------------|--------------|--------------------|
| gemma4:4b | ~8 GB | 16 GB + CPU | ~30-50 |
| gemma4:9b | ~18 GB | 32 GB + CPU | ~15-25 |
| gemma4:27b | ~54 GB | Not recommended | ~5-10 |
| llama3.1:8b | ~16 GB | 32 GB + CPU | ~20-35 |
| mistral:7b | ~14 GB | 28 GB + CPU | ~25-40 |
| phi4 | ~14 GB | 28 GB + CPU | ~20-30 |

**Tips for limited VRAM:**

- Use quantization: pull a quantized model tag in Ollama (e.g. `q4_K_M`) or set `load_in_8bit=True` (HF)
- Use CPU offloading: `device_map="auto"` lets transformers split across GPU/CPU
- Reduce `max_context_length` in config.py (e.g., 32000 instead of 102400)
- Reduce `max_tokens` for generation (e.g., 4096 instead of 8000)

---

## Expected Runtime with Local LLMs

With a 24GB GPU (RTX 3090/4090) and Gemma 4 9B:

- **Simulation**: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
- **Story generation**: ~2-3 hours
- **Total**: ~10-15 hours

The slowdown comes from local models generating tokens far more slowly than API-based models, and StoryBox issues many LLM calls one after another. Consider:

- Running the simulation for fewer days (e.g., 7 days = 168 iterations)
- Using a smaller model for planning and a larger one for story generation
- Using vLLM for batching multiple requests

---

## Troubleshooting

### "Connection refused" to Ollama

```bash
# Make sure Ollama is running
ollama serve &

# Or start it as a service
sudo systemctl start ollama
```

### Out of Memory (OOM)

```python
# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096

# Use a smaller model
llm_model_name = 'gemma4:4b'
```

### JSON Parsing Failures

Local models sometimes produce malformed JSON. StoryBox has retry logic (`max_retries = 5`), but you can increase it:

```python
max_retries = 10
```

Or add a JSON-fixing post-processor in `reverie/common/utils.py` (see the sketch at the end of this guide).

### Slow Generation

- Use vLLM instead of Ollama for better throughput
- Enable Flash Attention: `pip install flash-attn`
- Use quantization (Q4_K_M or Q5_K_M)
- Reduce simulation days: `max_iteration = 24 * 7` (7 days instead of 14)

---

## Quick Reference: Config Changes

```python
# reverie/config/config.py

# For Ollama
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:11434'

# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'gemma4'
base_url = 'http://localhost:8000/v1'  # or 1234 for LM Studio
api_key = 'not-needed'

# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7  # 7 days instead of 14
```

---

## Model Recommendations

| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| Best quality | gemma4:27b or llama3.3 | Largest, most capable |
| Best speed/quality | gemma4:9b or llama3.1:8b | Good balance |
| Limited VRAM | gemma4:4b or phi4 | Fits on 8-16GB |
| Long context | qwen2.5:32b | Supports 128K context |
| Coding/planning | deepseek-r1 | Strong reasoning |
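---

## Sketch: JSON-Fixing Post-Processor

Referenced from the JSON Parsing Failures section above. This is only a sketch of the kind of helper you could add to `reverie/common/utils.py`; the function name `try_fix_json` and the specific cleanup steps are illustrative assumptions, not part of StoryBox.

```python
import json
import re


def try_fix_json(raw: str):
    """Best-effort cleanup of malformed JSON from a local model.

    Returns the parsed object, or None if it still cannot be parsed
    (so the existing retry logic can take over).
    """
    # 1. Strip Markdown code fences the model may have wrapped around the JSON
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())

    # 2. Keep only the outermost {...} or [...] span, dropping chatty prefixes/suffixes
    match = re.search(r"[\[{].*[\]}]", text, flags=re.DOTALL)
    if match:
        text = match.group(0)

    # 3. Remove trailing commas before a closing brace/bracket (a common failure mode)
    text = re.sub(r",\s*([\]}])", r"\1", text)

    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

A caller in `llm.py` could try `try_fix_json()` on the raw model output before counting an attempt against `max_retries`, which cuts down on retries for the most common formatting slips.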