File size: 6,678 Bytes
88346c6 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 | # Using Local LLMs with StoryBox
This guide explains how to run StoryBox with local LLMs like **Gemma 4**, **Llama 3.1**, **Mistral**, **Phi-4**, etc.
## Supported Local LLM Options
### Option 1: Ollama (Recommended)
**Ollama** is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.
#### Step 1: Install Ollama
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download
```
#### Step 2: Pull Your Model
```bash
# Gemma 4 (Google's latest model)
ollama pull gemma4
# Gemma 4 with specific sizes
ollama pull gemma4:4b # 4 billion parameters
ollama pull gemma4:9b # 9 billion parameters
ollama pull gemma4:27b # 27 billion parameters
# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1
```
#### Step 3: Verify Ollama is Running
```bash
# Check if Ollama server is running
curl http://localhost:11434/api/tags
# Test the model
ollama run gemma4 "Hello, can you help me write a story?"
```
#### Step 4: Configure StoryBox
Edit `reverie/config/config.py`:
```python
# Change this line:
llm_model_name = 'gpt-4o-mini'
# To your local model:
llm_model_name = 'gemma4' # or 'gemma4:9b', 'gemma4:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'
# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
```
#### Step 5: Run StoryBox
```bash
cd /app/storybox/reverie
python run.py
```
---
### Option 2: HuggingFace Transformers (Direct Loading)
For models not available via Ollama, you can load them directly with HuggingFace.
#### Step 1: Install Dependencies
```bash
pip install transformers accelerate bitsandbytes
```
#### Step 2: Modify `reverie/common/llm.py`
Add your model to the `get_chat_model()` function:
```python
# Huggingface direct loading
elif model_name in {'google/gemma-4-4b-it', 'google/gemma-4-9b-it', 'google/gemma-4-27b-it'}:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype=torch.bfloat16,
load_in_8bit=True # Use 8-bit quantization to save VRAM
)
pipe = pipeline(
"text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=4096,
temperature=temperature
)
llm = HuggingFacePipeline(pipeline=pipe)
chat_model = ChatHuggingFace(llm=llm)
```
#### Step 3: Update Config
```python
llm_model_name = 'google/gemma-4-9b-it'
```
---
### Option 3: vLLM (High-Throughput Serving)
For production use or multiple concurrent requests, **vLLM** offers much better throughput.
#### Step 1: Install vLLM
```bash
pip install vllm
```
#### Step 2: Start vLLM Server
```bash
python -m vllm.entrypoints.openai.api_server \
--model google/gemma-4-9b-it \
--tensor-parallel-size 1 \
--max-model-len 8192
```
#### Step 3: Configure StoryBox as OpenAI-Compatible
```python
# In config.py
llm_model_name = 'google/gemma-4-9b-it'
base_url = 'http://localhost:8000/v1' # vLLM default port
api_key = 'not-needed-for-local' # vLLM doesn't require auth by default
```
---
### Option 4: LM Studio (GUI Alternative)
**LM Studio** provides a user-friendly GUI for running local LLMs.
1. Download from https://lmstudio.ai
2. Download Gemma 4 (or any model) through the UI
3. Start the local server (default: `http://localhost:1234`)
4. Configure StoryBox:
```python
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:1234/v1'
```
---
## Hardware Requirements
| Model | VRAM Required | RAM Fallback | Speed (tokens/sec) |
|-------|--------------|--------------|-------------------|
| gemma4:4b | ~8 GB | 16 GB + CPU | ~30-50 |
| gemma4:9b | ~18 GB | 32 GB + CPU | ~15-25 |
| gemma4:27b | ~54 GB | Not recommended | ~5-10 |
| llama3.1:8b | ~16 GB | 32 GB + CPU | ~20-35 |
| mistral:7b | ~14 GB | 28 GB + CPU | ~25-40 |
| phi4 | ~14 GB | 28 GB + CPU | ~20-30 |
**Tips for limited VRAM:**
- Use quantization: `--quantization q4_k_m` (Ollama) or `load_in_8bit=True` (HF)
- Use CPU offloading: `device_map="auto"` lets transformers split across GPU/CPU
- Reduce `max_context_length` in config.py (e.g., 32000 instead of 102400)
- Reduce `max_tokens` for generation (e.g., 4096 instead of 8000)
---
## Expected Runtime with Local LLMs
With a 24GB GPU (RTX 3090/4090) and Gemma 4 9B:
- **Simulation**: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
- **Story generation**: ~2-3 hours
- **Total**: ~10-15 hours
The slowdown is because local models generate tokens sequentially and are much slower than API-based models. Consider:
- Running simulation for fewer days (e.g., 7 days = 168 iterations)
- Using a smaller model for planning, larger for story generation
- Using vLLM for batching multiple requests
---
## Troubleshooting
### "Connection refused" to Ollama
```bash
# Make sure Ollama is running
ollama serve &
# Or start as a service
sudo systemctl start ollama
```
### Out of Memory (OOM)
```python
# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096
# Use smaller model
llm_model_name = 'gemma4:4b'
```
### JSON Parsing Failures
Local models sometimes produce malformed JSON. StoryBox has retry logic (`max_retries = 5`), but you can increase it:
```python
max_retries = 10
```
Or add a JSON-fixing post-processor in `reverie/common/utils.py`.
### Slow Generation
- Use vLLM instead of Ollama for better throughput
- Enable Flash Attention: `pip install flash-attn`
- Use quantization (Q4_K_M or Q5_K_M)
- Reduce simulation days: `max_iteration = 24 * 7` (7 days instead of 14)
---
## Quick Reference: Config Changes
```python
# reverie/config/config.py
# For Ollama
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:11434'
# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'gemma4'
base_url = 'http://localhost:8000/v1' # or 1234 for LM Studio
api_key = 'not-needed'
# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7 # 7 days instead of 14
```
---
## Model Recommendations
| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| Best quality | gemma4:27b or llama3.3 | Largest, most capable |
| Best speed/quality | gemma4:9b or llama3.1:8b | Good balance |
| Limited VRAM | gemma4:4b or phi4 | Fits on 8-16GB |
| Long context | qwen2.5:32b | Supports 128K context |
| Coding/planning | deepseek-r1 | Strong reasoning |
|