# Using Local LLMs with StoryBox
This guide explains how to run StoryBox with local LLMs such as **Gemma 3**, **Llama 3.1**, **Mistral**, and **Phi-4**.
## Supported Local LLM Options
### Option 1: Ollama (Recommended)
**Ollama** is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.
#### Step 1: Install Ollama
```bash
# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh
# Or download from https://ollama.com/download
```
#### Step 2: Pull Your Model
```bash
# Gemma 3 (Google's latest open model)
ollama pull gemma3
# Gemma 3 at specific sizes
ollama pull gemma3:4b   # 4 billion parameters
ollama pull gemma3:12b  # 12 billion parameters
ollama pull gemma3:27b  # 27 billion parameters
# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1
```
#### Step 3: Verify Ollama is Running
```bash
# Check if Ollama server is running
curl http://localhost:11434/api/tags
# Test the model
ollama run gemma3 "Hello, can you help me write a story?"
```
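The `/api/tags` endpoint returns JSON with a `models` array. A small parser can confirm your model was actually pulled (a sketch; `installed_models` is a hypothetical helper, not part of StoryBox):

```python
import json

def installed_models(tags_json: str) -> list[str]:
    """Extract model names from an Ollama /api/tags response."""
    payload = json.loads(tags_json)
    return [m["name"] for m in payload.get("models", [])]

# Abridged example of the /api/tags response shape
sample = '{"models": [{"name": "gemma3:12b"}, {"name": "mistral:latest"}]}'
print(installed_models(sample))  # → ['gemma3:12b', 'mistral:latest']
```

If the name you set in `config.py` is not in this list, StoryBox's requests will fail with a model-not-found error.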
#### Step 4: Configure StoryBox
Edit `reverie/config/config.py`:
```python
# Change this line:
llm_model_name = 'gpt-4o-mini'
# To your local model:
llm_model_name = 'gemma3' # or 'gemma3:12b', 'gemma3:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'
# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
```
#### Step 5: Run StoryBox
```bash
cd /app/storybox/reverie
python run.py
```
---
### Option 2: HuggingFace Transformers (Direct Loading)
For models not available via Ollama, you can load them directly with HuggingFace.
#### Step 1: Install Dependencies
```bash
pip install transformers accelerate bitsandbytes
```
#### Step 2: Modify `reverie/common/llm.py`
Add your model to the `get_chat_model()` function:
```python
# Hugging Face direct loading (inside get_chat_model())
elif model_name in {'google/gemma-3-4b-it', 'google/gemma-3-12b-it', 'google/gemma-3-27b-it'}:
    import torch
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              BitsAndBytesConfig, pipeline)
    from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        # 8-bit quantization to save VRAM
        # (the bare load_in_8bit= kwarg is deprecated in recent transformers)
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    )
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=4096,
        temperature=temperature,
    )
    llm = HuggingFacePipeline(pipeline=pipe)
    chat_model = ChatHuggingFace(llm=llm)
```
#### Step 3: Update Config
```python
llm_model_name = 'google/gemma-3-12b-it'
```
---
### Option 3: vLLM (High-Throughput Serving)
For production use or multiple concurrent requests, **vLLM** offers much better throughput.
#### Step 1: Install vLLM
```bash
pip install vllm
```
#### Step 2: Start vLLM Server
```bash
python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-3-12b-it \
    --tensor-parallel-size 1 \
    --max-model-len 8192
```
#### Step 3: Configure StoryBox as OpenAI-Compatible
```python
# In config.py
llm_model_name = 'google/gemma-3-12b-it'
base_url = 'http://localhost:8000/v1' # vLLM default port
api_key = 'not-needed-for-local' # vLLM doesn't require auth by default
```
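vLLM (like LM Studio, and Ollama's `/v1` endpoint) speaks the standard OpenAI chat-completions protocol, so any OpenAI-compatible client works unchanged. A minimal sketch of the request body such a client POSTs to `/v1/chat/completions` (`build_chat_request` is illustrative, not part of the StoryBox codebase):

```python
import json

def build_chat_request(model: str, prompt: str, temperature: float = 0.7) -> str:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return json.dumps(body)

req = build_chat_request("google/gemma-3-12b-it", "Write a one-line story.")
# POST req to http://localhost:8000/v1/chat/completions
# with header "Authorization: Bearer not-needed"
```

Because only the `base_url` differs, the same config shape covers vLLM, LM Studio, and any other OpenAI-compatible server.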
---
### Option 4: LM Studio (GUI Alternative)
**LM Studio** provides a user-friendly GUI for running local LLMs.
1. Download from https://lmstudio.ai
2. Download Gemma 3 (or any model) through the UI
3. Start the local server (default: `http://localhost:1234`)
4. Configure StoryBox:
```python
llm_model_name = 'gemma3'  # must match the model identifier loaded in LM Studio
base_url = 'http://localhost:1234/v1'  # LM Studio serves an OpenAI-compatible API
api_key = 'not-needed'
```
---
## Hardware Requirements
| Model | VRAM (fp16) | RAM Fallback | Speed (tokens/sec) |
|-------|--------------|--------------|-------------------|
| gemma3:4b | ~8 GB | 16 GB + CPU | ~30-50 |
| gemma3:12b | ~24 GB | 32 GB + CPU | ~15-25 |
| gemma3:27b | ~54 GB | Not recommended | ~5-10 |
| llama3.1:8b | ~16 GB | 32 GB + CPU | ~20-35 |
| mistral:7b | ~14 GB | 28 GB + CPU | ~25-40 |
| phi4 (14b) | ~28 GB | 32 GB + CPU | ~15-25 |
**Tips for limited VRAM:**
- Use quantization: pull a pre-quantized tag from the model's Ollama page (e.g. one ending in `-q4_K_M`) or pass `BitsAndBytesConfig(load_in_8bit=True)` (HF)
- Use CPU offloading: `device_map="auto"` lets transformers split across GPU/CPU
- Reduce `max_context_length` in config.py (e.g., 32000 instead of 102400)
- Reduce `max_tokens` for generation (e.g., 4096 instead of 8000)
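The fp16 figures in the table follow a simple rule of thumb: parameter count times bytes per parameter, for the weights alone (KV cache and activations add more on top):

```python
def approx_vram_gb(params_billion: float, bits_per_param: int = 16) -> float:
    """Weight memory only: parameters × bytes per parameter."""
    return params_billion * (bits_per_param / 8)

print(approx_vram_gb(27))     # fp16: 54.0 GB, matching the 27b row
print(approx_vram_gb(27, 4))  # 4-bit quantized: 13.5 GB
```

This is why 4-bit quantization lets a 27B model fit on a single 24 GB card that could never hold it at fp16.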
---
## Expected Runtime with Local LLMs
With a 24GB GPU (RTX 3090/4090) and a quantized Gemma 3 12B:
- **Simulation**: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
- **Story generation**: ~2-3 hours
- **Total**: ~10-15 hours
The slowdown is because local models produce far fewer tokens per second than hosted API models, so each simulation step takes longer. Consider:
- Running simulation for fewer days (e.g., 7 days = 168 iterations)
- Using a smaller model for planning, larger for story generation
- Using vLLM for batching multiple requests
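The iteration counts above are straightforward arithmetic, one iteration per simulated hour (an assumption based on the guide's `24 * 7` example):

```python
def simulation_iterations(days: int, steps_per_day: int = 24) -> int:
    """One iteration per simulated hour, per the 24 * 7 convention."""
    return days * steps_per_day

print(simulation_iterations(7))   # → 168
print(simulation_iterations(14))  # → 336
```

Halving the simulated days roughly halves the simulation phase of the total runtime.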
---
## Troubleshooting
### "Connection refused" to Ollama
```bash
# Make sure Ollama is running
ollama serve &
# Or start as a service
sudo systemctl start ollama
```
### Out of Memory (OOM)
```python
# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096
# Use smaller model
llm_model_name = 'gemma3:4b'
```
### JSON Parsing Failures
Local models sometimes produce malformed JSON. StoryBox has retry logic (`max_retries = 5`), but you can increase it:
```python
max_retries = 10
```
Or add a JSON-fixing post-processor in `reverie/common/utils.py`.
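Such a post-processor might strip Markdown fences and slice out the first balanced JSON object. A minimal sketch (the function name and placement are suggestions, not existing StoryBox code; note it does not handle braces inside string values):

```python
import json

def extract_json(text: str):
    """Pull the first JSON object out of noisy LLM output."""
    # Drop Markdown code fences the model may have wrapped around the JSON
    cleaned = text.replace("```json", "").replace("```", "")
    start = cleaned.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # Scan forward for the matching closing brace
    depth = 0
    for i, ch in enumerate(cleaned[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(cleaned[start:i + 1])
    raise ValueError("unbalanced JSON object")

print(extract_json('Here you go: {"mood": "happy"} Hope that helps!'))
```

Running the model's raw reply through a helper like this before `json.loads` removes the most common failure mode (chatty preamble plus fenced JSON) without burning a retry.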
### Slow Generation
- Use vLLM instead of Ollama for better throughput
- Enable Flash Attention: `pip install flash-attn`, then pass `attn_implementation="flash_attention_2"` to `from_pretrained()`
- Use quantization (Q4_K_M or Q5_K_M)
- Reduce simulation days: `max_iteration = 24 * 7` (7 days instead of 14)
---
## Quick Reference: Config Changes
```python
# reverie/config/config.py
# For Ollama
llm_model_name = 'gemma3'
ollama_base_url = 'http://localhost:11434'
# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'google/gemma-3-12b-it'  # for LM Studio, use its model identifier
base_url = 'http://localhost:8000/v1' # or 1234 for LM Studio
api_key = 'not-needed'
# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7 # 7 days instead of 14
```
---
## Model Recommendations
| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| Best quality | gemma3:27b or llama3.3 | Largest, most capable |
| Best speed/quality | gemma3:12b or llama3.1:8b | Good balance |
| Limited VRAM | gemma3:4b or phi4 | Fits on 8-16GB (quantized) |
| Long context | qwen2.5:32b | Supports 128K context |
| Coding/planning | deepseek-r1 | Strong reasoning |
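The recommendations can be folded into a small selection helper (a sketch; the thresholds follow the hardware table above and the names are Ollama tags):

```python
def recommend_model(vram_gb: float) -> str:
    """Pick a model tier from available VRAM, per the tables above."""
    if vram_gb >= 54:
        return "gemma3:27b"   # best quality, needs a multi-GPU or 80GB setup at fp16
    if vram_gb >= 24:
        return "gemma3:12b"   # best speed/quality balance on a 3090/4090
    return "gemma3:4b"        # limited VRAM: fits on 8-16GB cards

print(recommend_model(24))  # → gemma3:12b
```

Quantization shifts these thresholds down considerably, so treat them as fp16 upper bounds rather than hard requirements.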