# Using Local LLMs with StoryBox

This guide explains how to run StoryBox with local LLMs like **Gemma 4**, **Llama 3.1**, **Mistral**, **Phi-4**, etc.

## Supported Local LLM Options

### Option 1: Ollama (Recommended)

**Ollama** is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.

#### Step 1: Install Ollama

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download
```

#### Step 2: Pull Your Model

```bash
# Gemma 4 (Google's latest model)
ollama pull gemma4

# Gemma 4 with specific sizes
ollama pull gemma4:4b   # 4 billion parameters
ollama pull gemma4:9b   # 9 billion parameters
ollama pull gemma4:27b  # 27 billion parameters

# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1
```

#### Step 3: Verify Ollama is Running

```bash
# Check if Ollama server is running
curl http://localhost:11434/api/tags

# Test the model
ollama run gemma4 "Hello, can you help me write a story?"
```

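You can also run the same checks from Python, which is handy if you want to verify things from the environment that will run StoryBox. This sketch uses only the standard library and Ollama's HTTP API:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

# List the models the Ollama server has pulled
with urllib.request.urlopen(f"{OLLAMA_URL}/api/tags") as resp:
    models = [m["name"] for m in json.load(resp)["models"]]
print("Available models:", models)

# Send a single non-streaming prompt to the model pulled in Step 2
payload = json.dumps({
    "model": "gemma4",
    "prompt": "Hello, can you help me write a story?",
    "stream": False,
}).encode()
req = urllib.request.Request(
    f"{OLLAMA_URL}/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["response"])
```
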
#### Step 4: Configure StoryBox

Edit `reverie/config/config.py`:

```python
# Change this line:
llm_model_name = 'gpt-4o-mini'

# To your local model:
llm_model_name = 'gemma4'  # or 'gemma4:9b', 'gemma4:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'

# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
```

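The HuggingFace option later in this guide wires models up through LangChain wrappers, so the Ollama branch inside `get_chat_model()` plausibly reduces to something like the sketch below. This is an illustration of the wiring, not the actual contents of `reverie/common/llm.py`; `ChatOllama` comes from the `langchain-ollama` package:

```python
# Illustrative only; the real get_chat_model() in reverie/common/llm.py may differ.
from langchain_ollama import ChatOllama

chat_model = ChatOllama(
    model='gemma4',                      # llm_model_name from config.py
    base_url='http://localhost:11434',   # ollama_base_url from config.py
    temperature=0.7,
)
print(chat_model.invoke("Write a one-sentence story premise.").content)
```
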
#### Step 5: Run StoryBox

```bash
cd /app/storybox/reverie
python run.py
```

---

### Option 2: HuggingFace Transformers (Direct Loading)

For models not available via Ollama, you can load them directly with HuggingFace.

#### Step 1: Install Dependencies

```bash
pip install transformers accelerate bitsandbytes
```

#### Step 2: Modify `reverie/common/llm.py`

Add your model to the `get_chat_model()` function:

```python
# Huggingface direct loading
elif model_name in {'google/gemma-4-4b-it', 'google/gemma-4-9b-it', 'google/gemma-4-27b-it'}:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
    from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",            # split layers across GPU/CPU as needed
        torch_dtype=torch.bfloat16,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit quantization to save VRAM
    )

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=4096,
        do_sample=True,               # required for temperature to take effect
        temperature=temperature,
    )

    llm = HuggingFacePipeline(pipeline=pipe)
    chat_model = ChatHuggingFace(llm=llm)
```

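Once the branch is in place, you can sanity-check it from a Python shell before launching a full run. This assumes `get_chat_model()` accepts the model name and a temperature, as the snippet above implies; adjust to its actual signature:

```python
# Hypothetical smoke test; adapt to get_chat_model()'s real signature in llm.py.
from reverie.common.llm import get_chat_model

chat_model = get_chat_model('google/gemma-4-9b-it', temperature=0.7)
reply = chat_model.invoke("Summarize a cozy mystery plot in two sentences.")
print(reply.content)
```
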
#### Step 3: Update Config

```python
llm_model_name = 'google/gemma-4-9b-it'
```

---

### Option 3: vLLM (High-Throughput Serving)

For production use or multiple concurrent requests, **vLLM** offers much better throughput.

#### Step 1: Install vLLM

```bash
pip install vllm
```

#### Step 2: Start vLLM Server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-9b-it \
    --tensor-parallel-size 1 \
    --max-model-len 8192
```

#### Step 3: Configure StoryBox as OpenAI-Compatible

```python
# In config.py
llm_model_name = 'google/gemma-4-9b-it'
base_url = 'http://localhost:8000/v1'  # vLLM default port
api_key = 'not-needed-for-local'  # vLLM doesn't require auth by default
```

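Because vLLM exposes a standard OpenAI-compatible API, you can confirm the server works with the official `openai` client before pointing StoryBox at it; the model name must match whatever you passed to `--model`:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API; the key can be any placeholder string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")

response = client.chat.completions.create(
    model="google/gemma-4-9b-it",   # must match the --model passed to vLLM
    messages=[{"role": "user", "content": "Give me a one-line story idea."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```
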
---

### Option 4: LM Studio (GUI Alternative)

**LM Studio** provides a user-friendly GUI for running local LLMs.

1. Download from https://lmstudio.ai
2. Download Gemma 4 (or any model) through the UI
3. Start the local server (default: `http://localhost:1234`)
4. Configure StoryBox:

```python
llm_model_name = 'gemma4'               # the model identifier shown in LM Studio
base_url = 'http://localhost:1234/v1'   # LM Studio's OpenAI-compatible server
api_key = 'not-needed'
```

---

## Hardware Requirements

| Model | VRAM Required | RAM Fallback | Speed (tokens/sec) |
|-------|---------------|--------------|--------------------|
| gemma4:4b | ~8 GB | 16 GB + CPU | ~30-50 |
| gemma4:9b | ~18 GB | 32 GB + CPU | ~15-25 |
| gemma4:27b | ~54 GB | Not recommended | ~5-10 |
| llama3.1:8b | ~16 GB | 32 GB + CPU | ~20-35 |
| mistral:7b | ~14 GB | 28 GB + CPU | ~25-40 |
| phi4 | ~14 GB | 28 GB + CPU | ~20-30 |

**Tips for limited VRAM:**
- Use quantization: pull a quantized tag in Ollama (e.g., a `q4_K_M` variant of the model) or pass `BitsAndBytesConfig(load_in_8bit=True)` in HF
- Use CPU offloading: `device_map="auto"` lets transformers split across GPU/CPU
- Reduce `max_context_length` in config.py (e.g., 32000 instead of 102400)
- Reduce `max_tokens` for generation (e.g., 4096 instead of 8000)

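The table above roughly follows weights-size arithmetic: parameter count times bytes per parameter, plus headroom for the KV cache and activations. A back-of-the-envelope helper (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(num_params_billions: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    """Very rough VRAM estimate: weight memory plus ~20% for KV cache/activations."""
    weight_gb = num_params_billions * bits_per_param / 8  # 1B params at 8 bits ~ 1 GB
    return weight_gb * overhead

print(estimate_vram_gb(9))                    # fp16 9B model: ~21.6 GB
print(estimate_vram_gb(9, bits_per_param=4))  # 4-bit quantized (e.g., q4_K_M): ~5.4 GB
```
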
---

## Expected Runtime with Local LLMs

With a 24GB GPU (RTX 3090/4090) and Gemma 4 9B:
- **Simulation**: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
- **Story generation**: ~2-3 hours
- **Total**: ~10-15 hours

The slowdown comes from raw generation throughput: a single consumer GPU serving one request at a time produces tokens far more slowly than the heavily optimized, batched backends behind hosted APIs. Consider:
- Running simulation for fewer days (e.g., 7 days = 168 iterations)
- Using a smaller model for planning, larger for story generation
- Using vLLM for batching multiple requests

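If you want a rough estimate for your own hardware, time a few simulation iterations and extrapolate; the numbers in the sketch below are placeholders to illustrate the arithmetic, not measurements:

```python
# Back-of-the-envelope planning; replace the placeholder timing with one you measure.
iterations = 24 * 7            # 7 simulated days, one iteration per in-game hour
seconds_per_iteration = 150    # placeholder: time a few iterations on your setup

estimated_hours = iterations * seconds_per_iteration / 3600
print(f"~{estimated_hours:.1f} hours for {iterations} iterations")  # ~7.0 h with these numbers
```
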
---

## Troubleshooting

### "Connection refused" to Ollama
```bash
# Make sure Ollama is running
ollama serve &

# Or start as a service
sudo systemctl start ollama
```

### Out of Memory (OOM)
```python
# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096

# Use smaller model
llm_model_name = 'gemma4:4b'
```

### JSON Parsing Failures
Local models sometimes produce malformed JSON. StoryBox has retry logic (`max_retries = 5`), but you can increase it:
```python
max_retries = 10
```

Or add a JSON-fixing post-processor in `reverie/common/utils.py`.

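A minimal sketch of such a post-processor; the function name is hypothetical and where you call it from is up to you. It strips markdown fences, pulls out the outermost JSON object, and drops trailing commas before parsing:

```python
import json
import re

def coerce_json(raw: str) -> dict:
    """Best-effort cleanup of LLM output before json.loads()."""
    text = raw.strip()
    # Drop ```json ... ``` fences that local models often wrap around output
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Fall back to the outermost {...} block if there is extra prose around it
    if not text.startswith("{"):
        match = re.search(r"\{.*\}", text, re.DOTALL)
        if match:
            text = match.group(0)
    # Remove trailing commas before a closing brace/bracket
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```
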
### Slow Generation
- Use vLLM instead of Ollama for better throughput
- Enable FlashAttention for HuggingFace loading: `pip install flash-attn`, then opt in when loading the model (see the sketch below)
- Use quantization (Q4_K_M or Q5_K_M)
- Reduce simulation days: `max_iteration = 24 * 7` (7 days instead of 14)

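For the HuggingFace path, installing `flash-attn` alone is not enough; you also have to opt in when the model is loaded (Ollama and vLLM manage their own attention kernels server-side). With recent `transformers` that looks roughly like:

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-9b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```
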
---

## Quick Reference: Config Changes

```python
# reverie/config/config.py

# For Ollama
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:11434'

# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'google/gemma-4-9b-it'  # for vLLM, must match --model; for LM Studio, use its model identifier
base_url = 'http://localhost:8000/v1'    # or http://localhost:1234/v1 for LM Studio
api_key = 'not-needed'

# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7  # 7 days instead of 14
```

---

## Model Recommendations

| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| Best quality | gemma4:27b or llama3.3 | Largest, most capable |
| Best speed/quality | gemma4:9b or llama3.1:8b | Good balance |
| Limited VRAM | gemma4:4b or phi4 | Fits on 8-16GB |
| Long context | qwen2.5:32b | Supports 128K context |
| Coding/planning | deepseek-r1 | Strong reasoning |