
Using Local LLMs with StoryBox

This guide explains how to run StoryBox with local LLMs such as Gemma 4, Llama 3.1, Mistral, and Phi-4.

Supported Local LLM Options

Option 1: Ollama (Recommended)

Ollama is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download

Step 2: Pull Your Model

# Gemma 4 (Google's latest model)
ollama pull gemma4

# Gemma 4 with specific sizes
ollama pull gemma4:4b    # 4 billion parameters
ollama pull gemma4:9b    # 9 billion parameters
ollama pull gemma4:27b   # 27 billion parameters

# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1

Step 3: Verify Ollama is Running

# Check if Ollama server is running
curl http://localhost:11434/api/tags

# Test the model
ollama run gemma4 "Hello, can you help me write a story?"
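
You can also hit Ollama's chat API directly from Python, which is the same server StoryBox will talk to. A minimal sanity check using the requests library (the model tag should match whatever you pulled above):

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4",   # whichever tag you pulled
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,     # return a single JSON response instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])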

Step 4: Configure StoryBox

Edit reverie/config/config.py:

# Change this line:
llm_model_name = 'gpt-4o-mini'

# To your local model:
llm_model_name = 'gemma4'           # or 'gemma4:9b', 'gemma4:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'

# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
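
How these values are consumed is defined in reverie/common/llm.py. As a rough sketch, they typically map onto a LangChain Ollama chat client like this (the langchain-ollama package is an assumption here, not necessarily what the repo uses):

from langchain_ollama import ChatOllama

# Sketch only: llm_model_name / ollama_base_url from config.py feeding a chat client
chat_model = ChatOllama(
    model='gemma4',                      # llm_model_name
    base_url='http://localhost:11434',   # ollama_base_url
    temperature=0.7,                     # illustrative default
)
print(chat_model.invoke("Write the first sentence of a story.").content)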

Step 5: Run StoryBox

cd /app/storybox/reverie
python run.py

Option 2: HuggingFace Transformers (Direct Loading)

For models not available via Ollama, you can load them directly with HuggingFace.

Step 1: Install Dependencies

pip install transformers accelerate bitsandbytes

Step 2: Modify reverie/common/llm.py

Add your model to the get_chat_model() function:

# HuggingFace direct loading
elif model_name in {'google/gemma-4-4b-it', 'google/gemma-4-9b-it', 'google/gemma-4-27b-it'}:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    # ChatHuggingFace / HuggingFacePipeline come from the langchain-huggingface package
    from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",           # split layers across available GPU(s) and CPU
        torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
        load_in_8bit=True            # 8-bit quantization to save VRAM (needs bitsandbytes)
    )

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=4096,
        temperature=temperature
    )

    # Wrap the pipeline so the rest of StoryBox can treat it as a chat model
    llm = HuggingFacePipeline(pipeline=pipe)
    chat_model = ChatHuggingFace(llm=llm)

Step 3: Update Config

llm_model_name = 'google/gemma-4-9b-it'

Option 3: vLLM (High-Throughput Serving)

For production use or multiple concurrent requests, vLLM offers much better throughput.

Step 1: Install vLLM

pip install vllm

Step 2: Start vLLM Server

python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-9b-it \
    --tensor-parallel-size 1 \
    --max-model-len 8192

Step 3: Configure StoryBox as OpenAI-Compatible

# In config.py
llm_model_name = 'google/gemma-4-9b-it'
base_url = 'http://localhost:8000/v1'  # vLLM default port
api_key = 'not-needed-for-local'       # vLLM doesn't require auth by default
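
Before launching StoryBox, you can sanity-check the endpoint with the OpenAI Python client; this is just a connectivity test, and the model name must match whatever you passed to --model:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # base_url from config.py
    api_key="not-needed-for-local",        # any non-empty string works for local vLLM
)
resp = client.chat.completions.create(
    model="google/gemma-4-9b-it",          # must match the vLLM --model argument
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)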

Option 4: LM Studio (GUI Alternative)

LM Studio provides a user-friendly GUI for running local LLMs.

  1. Download from https://lmstudio.ai
  2. Download Gemma 4 (or any model) through the UI
  3. Start the local server (default: http://localhost:1234)
  4. Configure StoryBox:
llm_model_name = 'gemma4'
base_url = 'http://localhost:1234/v1'   # LM Studio serves an OpenAI-compatible API
api_key = 'not-needed'
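
The sanity check from the vLLM section works here too; point the client's base_url at http://localhost:1234/v1 and keep any non-empty api_key.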

Hardware Requirements

| Model       | VRAM Required | RAM Fallback    | Speed (tokens/sec) |
|-------------|---------------|-----------------|--------------------|
| gemma4:4b   | ~8 GB         | 16 GB + CPU     | ~30-50             |
| gemma4:9b   | ~18 GB        | 32 GB + CPU     | ~15-25             |
| gemma4:27b  | ~54 GB        | Not recommended | ~5-10              |
| llama3.1:8b | ~16 GB        | 32 GB + CPU     | ~20-35             |
| mistral:7b  | ~14 GB        | 28 GB + CPU     | ~25-40             |
| phi4        | ~14 GB        | 28 GB + CPU     | ~20-30             |

Tips for limited VRAM:

  • Use quantization: pull a quantized model tag in Ollama (e.g., a q4_K_M variant) or set load_in_8bit=True with HuggingFace; a 4-bit example follows after this list
  • Use CPU offloading: device_map="auto" lets transformers split across GPU/CPU
  • Reduce max_context_length in config.py (e.g., 32000 instead of 102400)
  • Reduce max_tokens for generation (e.g., 4096 instead of 8000)
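
For the HuggingFace path, 4-bit loading cuts VRAM further than 8-bit. A minimal sketch using transformers and bitsandbytes (the model name is just an example; use whichever model you configured):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/gemma-4-9b-it"   # example only

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",               # spill layers to CPU if the GPU fills up
    quantization_config=bnb_config,
)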

Expected Runtime with Local LLMs

With a 24GB GPU (RTX 3090/4090) and Gemma 4 9B:

  • Simulation: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
  • Story generation: ~2-3 hours
  • Total: ~10-15 hours

The slowdown comes from running inference on consumer hardware, without the large-batch, optimized serving stacks behind hosted APIs. To cut total runtime, consider:

  • Running simulation for fewer days (e.g., 7 days = 168 iterations)
  • Using a smaller model for planning, larger for story generation
  • Using vLLM for batching multiple requests

Troubleshooting

"Connection refused" to Ollama

# Make sure Ollama is running
ollama serve &

# Or start as a service
sudo systemctl start ollama

Out of Memory (OOM)

# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096

# Use smaller model
llm_model_name = 'gemma4:4b'

JSON Parsing Failures

Local models sometimes produce malformed JSON. StoryBox has retry logic (max_retries = 5), but you can increase it:

max_retries = 10

Or add a JSON-fixing post-processor in reverie/common/utils.py.
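
If you go the post-processor route, a minimal sketch could look like this; the helper name and its placement are illustrative, not an existing StoryBox function:

import json
import re

def try_fix_json(raw: str):
    """Best-effort recovery of a JSON object from a noisy LLM response."""
    # Strip markdown code fences the model may have wrapped around the JSON
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # Keep only the outermost {...} span, dropping surrounding chatter
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        return None
    cleaned = cleaned[start:end + 1]
    # Remove trailing commas before closing braces/brackets, a common model mistake
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None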

Slow Generation

  • Use vLLM instead of Ollama for better throughput
  • Enable Flash Attention: pip install flash-attn
  • Use quantization (Q4_K_M or Q5_K_M)
  • Reduce simulation days: max_iteration = 24 * 7 (7 days instead of 14)

Quick Reference: Config Changes

# reverie/config/config.py

# For Ollama
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:11434'

# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'gemma4'
base_url = 'http://localhost:8000/v1'  # or 1234 for LM Studio
api_key = 'not-needed'

# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7  # 7 days instead of 14

Model Recommendations

| Use Case           | Recommended Model        | Why                    |
|--------------------|--------------------------|------------------------|
| Best quality       | gemma4:27b or llama3.3   | Largest, most capable  |
| Best speed/quality | gemma4:9b or llama3.1:8b | Good balance           |
| Limited VRAM       | gemma4:4b or phi4        | Fits on 8-16 GB        |
| Long context       | qwen2.5:32b              | Supports 128K context  |
| Coding/planning    | deepseek-r1              | Strong reasoning       |