File size: 6,678 Bytes
88346c6
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
# Using Local LLMs with StoryBox

This guide explains how to run StoryBox with local LLMs like **Gemma 4**, **Llama 3.1**, **Mistral**, **Phi-4**, etc.

## Supported Local LLM Options

### Option 1: Ollama (Recommended)

**Ollama** is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.

#### Step 1: Install Ollama

```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download
```

#### Step 2: Pull Your Model

```bash
# Gemma 4 (Google's latest model)
ollama pull gemma4

# Gemma 4 with specific sizes
ollama pull gemma4:4b    # 4 billion parameters
ollama pull gemma4:9b    # 9 billion parameters
ollama pull gemma4:27b   # 27 billion parameters

# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1
```

#### Step 3: Verify Ollama is Running

```bash
# Check if Ollama server is running
curl http://localhost:11434/api/tags

# Test the model
ollama run gemma4 "Hello, can you help me write a story?"
```

#### Step 4: Configure StoryBox

Edit `reverie/config/config.py`:

```python
# Change this line:
llm_model_name = 'gpt-4o-mini'

# To your local model:
llm_model_name = 'gemma4'           # or 'gemma4:9b', 'gemma4:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'

# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
```

#### Step 5: Run StoryBox

```bash
cd /app/storybox/reverie
python run.py
```

---

### Option 2: HuggingFace Transformers (Direct Loading)

For models not available via Ollama, you can load them directly with HuggingFace.

#### Step 1: Install Dependencies

```bash
pip install transformers accelerate bitsandbytes
```

#### Step 2: Modify `reverie/common/llm.py`

Add your model to the `get_chat_model()` function:

```python
# Huggingface direct loading
elif model_name in {'google/gemma-4-4b-it', 'google/gemma-4-9b-it', 'google/gemma-4-27b-it'}:
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        load_in_8bit=True  # Use 8-bit quantization to save VRAM
    )
    
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=4096,
        temperature=temperature
    )
    
    llm = HuggingFacePipeline(pipeline=pipe)
    chat_model = ChatHuggingFace(llm=llm)
```

#### Step 3: Update Config

```python
llm_model_name = 'google/gemma-4-9b-it'
```

---

### Option 3: vLLM (High-Throughput Serving)

For production use or multiple concurrent requests, **vLLM** offers much better throughput.

#### Step 1: Install vLLM

```bash
pip install vllm
```

#### Step 2: Start vLLM Server

```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-9b-it \
    --tensor-parallel-size 1 \
    --max-model-len 8192
```

#### Step 3: Configure StoryBox as OpenAI-Compatible

```python
# In config.py
llm_model_name = 'google/gemma-4-9b-it'
base_url = 'http://localhost:8000/v1'  # vLLM default port
api_key = 'not-needed-for-local'       # vLLM doesn't require auth by default
```

---

### Option 4: LM Studio (GUI Alternative)

**LM Studio** provides a user-friendly GUI for running local LLMs.

1. Download from https://lmstudio.ai
2. Download Gemma 4 (or any model) through the UI
3. Start the local server (default: `http://localhost:1234`)
4. Configure StoryBox:

```python
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:1234/v1'
```

---

## Hardware Requirements

| Model | VRAM Required | RAM Fallback | Speed (tokens/sec) |
|-------|--------------|--------------|-------------------|
| gemma4:4b | ~8 GB | 16 GB + CPU | ~30-50 |
| gemma4:9b | ~18 GB | 32 GB + CPU | ~15-25 |
| gemma4:27b | ~54 GB | Not recommended | ~5-10 |
| llama3.1:8b | ~16 GB | 32 GB + CPU | ~20-35 |
| mistral:7b | ~14 GB | 28 GB + CPU | ~25-40 |
| phi4 | ~14 GB | 28 GB + CPU | ~20-30 |

**Tips for limited VRAM:**
- Use quantization: `--quantization q4_k_m` (Ollama) or `load_in_8bit=True` (HF)
- Use CPU offloading: `device_map="auto"` lets transformers split across GPU/CPU
- Reduce `max_context_length` in config.py (e.g., 32000 instead of 102400)
- Reduce `max_tokens` for generation (e.g., 4096 instead of 8000)

---

## Expected Runtime with Local LLMs

With a 24GB GPU (RTX 3090/4090) and Gemma 4 9B:
- **Simulation**: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
- **Story generation**: ~2-3 hours
- **Total**: ~10-15 hours

The slowdown is because local models generate tokens sequentially and are much slower than API-based models. Consider:
- Running simulation for fewer days (e.g., 7 days = 168 iterations)
- Using a smaller model for planning, larger for story generation
- Using vLLM for batching multiple requests

---

## Troubleshooting

### "Connection refused" to Ollama
```bash
# Make sure Ollama is running
ollama serve &

# Or start as a service
sudo systemctl start ollama
```

### Out of Memory (OOM)
```python
# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096

# Use smaller model
llm_model_name = 'gemma4:4b'
```

### JSON Parsing Failures
Local models sometimes produce malformed JSON. StoryBox has retry logic (`max_retries = 5`), but you can increase it:
```python
max_retries = 10
```

Or add a JSON-fixing post-processor in `reverie/common/utils.py`.

### Slow Generation
- Use vLLM instead of Ollama for better throughput
- Enable Flash Attention: `pip install flash-attn`
- Use quantization (Q4_K_M or Q5_K_M)
- Reduce simulation days: `max_iteration = 24 * 7` (7 days instead of 14)

---

## Quick Reference: Config Changes

```python
# reverie/config/config.py

# For Ollama
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:11434'

# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'gemma4'
base_url = 'http://localhost:8000/v1'  # or 1234 for LM Studio
api_key = 'not-needed'

# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7  # 7 days instead of 14
```

---

## Model Recommendations

| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| Best quality | gemma4:27b or llama3.3 | Largest, most capable |
| Best speed/quality | gemma4:9b or llama3.1:8b | Good balance |
| Limited VRAM | gemma4:4b or phi4 | Fits on 8-16GB |
| Long context | qwen2.5:32b | Supports 128K context |
| Coding/planning | deepseek-r1 | Strong reasoning |