# Using Local LLMs with StoryBox
This guide explains how to run StoryBox with local LLMs such as **Gemma 3**, **Llama 3.1**, **Mistral**, and **Phi-4**.
## Supported Local LLM Options
### Option 1: Ollama (Recommended)
**Ollama** is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.
#### Step 1: Install Ollama
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download
```
#### Step 2: Pull Your Model
```bash
# Gemma 3 (Google's latest open model)
ollama pull gemma3
# Gemma 3 at specific sizes
ollama pull gemma3:4b   # 4 billion parameters
ollama pull gemma3:12b  # 12 billion parameters
ollama pull gemma3:27b  # 27 billion parameters
# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1
```
#### Step 3: Verify Ollama is Running
```bash
# Check if Ollama server is running
curl http://localhost:11434/api/tags
# Test the model
ollama run gemma3 "Hello, can you help me write a story?"
```
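The `/api/tags` endpoint returns JSON with a `models` array. A small parser can confirm your model was actually pulled (a sketch; `installed_models` is a hypothetical helper, not part of StoryBox):

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]

# Abridged example of the /api/tags response shape
sample = '{"models": [{"name": "gemma3:12b"}, {"name": "mistral:latest"}]}'
print(installed_models(sample))  # → ['gemma3:12b', 'mistral:latest']
```

If the name you set in `config.py` is not in this list, StoryBox's requests will fail with a model-not-found error.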
#### Step 4: Configure StoryBox
Edit `reverie/config/config.py`:
```python
# Change this line:
llm_model_name = 'gpt-4o-mini'
# To your local model:
llm_model_name = 'gemma3' # or 'gemma3:12b', 'gemma3:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'
# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
```
#### Step 5: Run StoryBox
```bash
cd /app/storybox/reverie
python run.py
```
---
### Option 2: HuggingFace Transformers (Direct Loading)
For models not available via Ollama, you can load them directly with HuggingFace.
#### Step 1: Install Dependencies
```bash
pip install transformers accelerate bitsandbytes
```
#### Step 2: Modify `reverie/common/llm.py`
Add your model to the `get_chat_model()` function:
```python
# Hugging Face direct loading (inside get_chat_model())
elif model_name in {'google/gemma-3-4b-it', 'google/gemma-3-12b-it', 'google/gemma-3-27b-it'}:
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, pipeline)
    from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        # 8-bit quantization to save VRAM
        # (the bare load_in_8bit= kwarg is deprecated in recent transformers)
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=4096,
        temperature=temperature,
    )
    llm = HuggingFacePipeline(pipeline=pipe)
    chat_model = ChatHuggingFace(llm=llm)
```
#### Step 3: Update Config
```python
llm_model_name = 'google/gemma-3-12b-it'
```
---
### Option 3: vLLM (High-Throughput Serving)
For production use or multiple concurrent requests, **vLLM** offers much better throughput.
#### Step 1: Install vLLM
```bash
pip install vllm
```
#### Step 2: Start vLLM Server
```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-3-12b-it \
    --tensor-parallel-size 1 \
    --max-model-len 8192
```
#### Step 3: Configure StoryBox as OpenAI-Compatible
```python
# In config.py
llm_model_name = 'google/gemma-3-12b-it'
base_url = 'http://localhost:8000/v1' # vLLM default port
api_key = 'not-needed-for-local' # vLLM doesn't require auth by default
```
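vLLM (like LM Studio, and Ollama's `/v1` endpoint) speaks the standard OpenAI chat-completions protocol, so any OpenAI-compatible client works unchanged. A minimal sketch of the request body such a client POSTs to `/v1/chat/completions` (`build_chat_request` is illustrative, not part of the StoryBox codebase):

```python
import json

def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return json.dumps(body)

req = build_chat_request("google/gemma-3-12b-it", "Write a one-line story.")
# POST req to http://localhost:8000/v1/chat/completions
# with header "Authorization: Bearer not-needed"
```

Because only the `base_url` differs, the same config shape covers vLLM, LM Studio, and any other OpenAI-compatible server.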
---
### Option 4: LM Studio (GUI Alternative)
**LM Studio** provides a user-friendly GUI for running local LLMs.
1. Download from https://lmstudio.ai
2. Download Gemma 3 (or any model) through the UI
3. Start the local server (default: `http://localhost:1234`)
4. Configure StoryBox:
```python
llm_model_name = 'gemma3'  # must match the model identifier loaded in LM Studio
base_url = 'http://localhost:1234/v1'  # LM Studio serves an OpenAI-compatible API
api_key = 'not-needed'
```
---
## Hardware Requirements
| Model | VRAM (fp16) | RAM Fallback | Speed (tokens/sec) |
|-------|--------------|--------------|-------------------|
| gemma3:4b | ~8 GB | 16 GB + CPU | ~30-50 |
| gemma3:12b | ~24 GB | 32 GB + CPU | ~15-25 |
| gemma3:27b | ~54 GB | Not recommended | ~5-10 |
| llama3.1:8b | ~16 GB | 32 GB + CPU | ~20-35 |
| mistral:7b | ~14 GB | 28 GB + CPU | ~25-40 |
| phi4 (14b) | ~28 GB | 32 GB + CPU | ~15-25 |
**Tips for limited VRAM:**
- Use quantization: pull a pre-quantized tag from the model's Ollama page (e.g. one ending in `-q4_K_M`) or pass `BitsAndBytesConfig(load_in_8bit=True)` (HF)
- Use CPU offloading: `device_map="auto"` lets transformers split across GPU/CPU
- Reduce `max_context_length` in config.py (e.g., 32000 instead of 102400)
- Reduce `max_tokens` for generation (e.g., 4096 instead of 8000)
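The fp16 figures in the table follow a simple rule of thumb: parameter count times bytes per parameter, for the weights alone (KV cache and activations add more on top):

```python
def approx_vram_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Weight memory only: parameters × bytes per parameter."""
    return params_billion * (bits_per_param / 8)

print(approx_vram_gb(27))     # fp16: 54.0 GB, matching the 27b row
print(approx_vram_gb(27, 4))  # 4-bit quantized: 13.5 GB
```

This is why 4-bit quantization lets a 27B model fit on a single 24 GB card that could never hold it at fp16.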
---
## Expected Runtime with Local LLMs
With a 24GB GPU (RTX 3090/4090) and a quantized Gemma 3 12B:
- **Simulation**: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
- **Story generation**: ~2-3 hours
- **Total**: ~10-15 hours
The slowdown is because local models produce far fewer tokens per second than hosted API models, so each simulation step takes longer. Consider:
- Running simulation for fewer days (e.g., 7 days = 168 iterations)
- Using a smaller model for planning, larger for story generation
- Using vLLM for batching multiple requests
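The iteration counts above are straightforward arithmetic, one iteration per simulated hour (an assumption based on the guide's `24 * 7` example):

```python
def simulation_iterations(days: int, steps_per_day: int = 24) -> int:
    """One iteration per simulated hour, per the 24 * 7 convention."""
    return days * steps_per_day

print(simulation_iterations(7))   # → 168
print(simulation_iterations(14))  # → 336
```

Halving the simulated days roughly halves the simulation phase of the total runtime.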
---
## Troubleshooting
### "Connection refused" to Ollama
```bash
# Make sure Ollama is running
ollama serve &
# Or start as a service
sudo systemctl start ollama
```
### Out of Memory (OOM)
```python
# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096
# Use smaller model
llm_model_name = 'gemma3:4b'
```
### JSON Parsing Failures
Local models sometimes produce malformed JSON. StoryBox has retry logic (`max_retries = 5`), but you can increase it:
```python
max_retries = 10
```
Or add a JSON-fixing post-processor in `reverie/common/utils.py`.
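Such a post-processor might strip Markdown fences and slice out the first balanced JSON object. A minimal sketch (the function name and placement are suggestions, not existing StoryBox code; note it does not handle braces inside string values):

```python
import json

def extract_json(text: str):
    """Pull the first JSON object out of noisy LLM output."""
    # Drop Markdown code fences the model may have wrapped around the JSON
    cleaned = text.replace("```json", "").replace("```", "")
    start = cleaned.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # Scan forward for the matching closing brace
    depth = 0
    for i, ch in enumerate(cleaned[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(cleaned[start:i + 1])
    raise ValueError("unbalanced JSON object")

print(extract_json('Here you go: {"mood": "happy"} Hope that helps!'))
```

Running the model's raw reply through a helper like this before `json.loads` removes the most common failure mode (chatty preamble plus fenced JSON) without burning a retry.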
### Slow Generation
- Use vLLM instead of Ollama for better throughput
- Enable Flash Attention: `pip install flash-attn`, then pass `attn_implementation="flash_attention_2"` to `from_pretrained()`
- Use quantization (Q4_K_M or Q5_K_M)
- Reduce simulation days: `max_iteration = 24 * 7` (7 days instead of 14)
---
## Quick Reference: Config Changes
```python
# reverie/config/config.py
# For Ollama
llm_model_name = 'gemma3'
ollama_base_url = 'http://localhost:11434'
# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'google/gemma-3-12b-it'  # for LM Studio, use its model identifier
base_url = 'http://localhost:8000/v1' # or 1234 for LM Studio
api_key = 'not-needed'
# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7 # 7 days instead of 14
```
---
## Model Recommendations
| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| Best quality | gemma3:27b or llama3.3 | Largest, most capable |
| Best speed/quality | gemma3:12b or llama3.1:8b | Good balance |
| Limited VRAM | gemma3:4b or phi4 | Fits on 8-16GB (quantized) |
| Long context | qwen2.5:32b | Supports 128K context |
| Coding/planning | deepseek-r1 | Strong reasoning |
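The recommendations can be folded into a small selection helper (a sketch; the thresholds follow the hardware table above and the names are Ollama tags):

```python
def recommend_model(vram_gb: float) -> str:
    """Pick a model tier from available VRAM, per the tables above."""
    if vram_gb >= 54:
        return "gemma3:27b"   # best quality, needs a multi-GPU or 80GB setup at fp16
    if vram_gb >= 24:
        return "gemma3:12b"   # best speed/quality balance on a 3090/4090
    return "gemma3:4b"        # limited VRAM: fits on 8-16GB cards

print(recommend_model(24))  # → gemma3:12b
```

Quantization shifts these thresholds down considerably, so treat them as fp16 upper bounds rather than hard requirements.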