
Using Local LLMs with StoryBox

This guide explains how to run StoryBox with local LLMs such as Gemma 4, Llama 3.1, Mistral, and Phi-4.

Supported Local LLM Options

Option 1: Ollama (Recommended)

Ollama is the easiest way to run local LLMs. It supports Gemma, Llama, Mistral, and many others.

Step 1: Install Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com/download

Step 2: Pull Your Model

# Gemma 4 (Google's latest model)
ollama pull gemma4

# Gemma 4 with specific sizes
ollama pull gemma4:4b    # 4 billion parameters
ollama pull gemma4:9b    # 9 billion parameters
ollama pull gemma4:27b   # 27 billion parameters

# Other popular models
ollama pull llama3.1:8b
ollama pull mistral
ollama pull phi4
ollama pull qwen2.5
ollama pull deepseek-r1

Step 3: Verify Ollama is Running

# Check if Ollama server is running
curl http://localhost:11434/api/tags

# Test the model
ollama run gemma4 "Hello, can you help me write a story?"
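
You can also hit Ollama's chat API directly from Python, which is the same server StoryBox will talk to. A minimal sanity check using the requests library (the model tag should match whatever you pulled above):

import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma4",   # whichever tag you pulled
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "stream": False,     # return a single JSON response instead of a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])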

Step 4: Configure StoryBox

Edit reverie/config/config.py:

# Change this line:
llm_model_name = 'gpt-4o-mini'

# To your local model:
llm_model_name = 'gemma4'           # or 'gemma4:9b', 'gemma4:27b'
# llm_model_name = 'llama3.1:8b'
# llm_model_name = 'mistral'
# llm_model_name = 'phi4'

# Ollama URL (default is localhost:11434)
ollama_base_url = 'http://localhost:11434'
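
How these values are consumed is defined in reverie/common/llm.py. As a rough sketch, they typically map onto a LangChain Ollama chat client like this (the langchain-ollama package is an assumption here, not necessarily what the repo uses):

from langchain_ollama import ChatOllama

# Sketch only: llm_model_name / ollama_base_url from config.py feeding a chat client
chat_model = ChatOllama(
    model='gemma4',                      # llm_model_name
    base_url='http://localhost:11434',   # ollama_base_url
    temperature=0.7,                     # illustrative default
)
print(chat_model.invoke("Write the first sentence of a story.").content)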

Step 5: Run StoryBox

cd /app/storybox/reverie
python run.py

Option 2: HuggingFace Transformers (Direct Loading)

For models not available via Ollama, you can load them directly with HuggingFace.

Step 1: Install Dependencies

pip install transformers accelerate bitsandbytes

Step 2: Modify reverie/common/llm.py

Add your model to the get_chat_model() function:

# HuggingFace direct loading
elif model_name in {'google/gemma-4-4b-it', 'google/gemma-4-9b-it', 'google/gemma-4-27b-it'}:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    # ChatHuggingFace / HuggingFacePipeline come from the langchain-huggingface package
    from langchain_huggingface import ChatHuggingFace, HuggingFacePipeline

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",           # split layers across available GPU(s) and CPU
        torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
        load_in_8bit=True            # 8-bit quantization to save VRAM (needs bitsandbytes)
    )

    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=4096,
        temperature=temperature
    )

    # Wrap the pipeline so the rest of StoryBox can treat it as a chat model
    llm = HuggingFacePipeline(pipeline=pipe)
    chat_model = ChatHuggingFace(llm=llm)

Step 3: Update Config

llm_model_name = 'google/gemma-4-9b-it'

Option 3: vLLM (High-Throughput Serving)

For production use or multiple concurrent requests, vLLM offers much better throughput.

Step 1: Install vLLM

pip install vllm

Step 2: Start vLLM Server

python -m vllm.entrypoints.openai.api_server \
    --model google/gemma-4-9b-it \
    --tensor-parallel-size 1 \
    --max-model-len 8192

Step 3: Configure StoryBox as OpenAI-Compatible

# In config.py
llm_model_name = 'google/gemma-4-9b-it'
base_url = 'http://localhost:8000/v1'  # vLLM default port
api_key = 'not-needed-for-local'       # vLLM doesn't require auth by default
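
Before launching StoryBox, you can sanity-check the endpoint with the OpenAI Python client; this is just a connectivity test, and the model name must match whatever you passed to --model:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # base_url from config.py
    api_key="not-needed-for-local",        # any non-empty string works for local vLLM
)
resp = client.chat.completions.create(
    model="google/gemma-4-9b-it",          # must match the vLLM --model argument
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)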

Option 4: LM Studio (GUI Alternative)

LM Studio provides a user-friendly GUI for running local LLMs.

  1. Download from https://lmstudio.ai
  2. Download Gemma 4 (or any model) through the UI
  3. Start the local server (default: http://localhost:1234)
  4. Configure StoryBox:
llm_model_name = 'gemma4'
base_url = 'http://localhost:1234/v1'   # LM Studio serves an OpenAI-compatible API
api_key = 'not-needed'
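
The sanity check from the vLLM section works here too; point the client's base_url at http://localhost:1234/v1 and keep any non-empty api_key.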

Hardware Requirements

| Model       | VRAM Required | RAM Fallback    | Speed (tokens/sec) |
|-------------|---------------|-----------------|--------------------|
| gemma4:4b   | ~8 GB         | 16 GB + CPU     | ~30-50             |
| gemma4:9b   | ~18 GB        | 32 GB + CPU     | ~15-25             |
| gemma4:27b  | ~54 GB        | Not recommended | ~5-10              |
| llama3.1:8b | ~16 GB        | 32 GB + CPU     | ~20-35             |
| mistral:7b  | ~14 GB        | 28 GB + CPU     | ~25-40             |
| phi4        | ~14 GB        | 28 GB + CPU     | ~20-30             |

Tips for limited VRAM:

  • Use quantization: pull a quantized model tag in Ollama (e.g., a q4_K_M variant) or set load_in_8bit=True with HuggingFace; a 4-bit example follows after this list
  • Use CPU offloading: device_map="auto" lets transformers split across GPU/CPU
  • Reduce max_context_length in config.py (e.g., 32000 instead of 102400)
  • Reduce max_tokens for generation (e.g., 4096 instead of 8000)
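
For the HuggingFace path, 4-bit loading cuts VRAM further than 8-bit. A minimal sketch using transformers and bitsandbytes (the model name is just an example; use whichever model you configured):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "google/gemma-4-9b-it"   # example only

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",               # spill layers to CPU if the GPU fills up
    quantization_config=bnb_config,
)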

Expected Runtime with Local LLMs

With a 24GB GPU (RTX 3090/4090) and Gemma 4 9B:

  • Simulation: ~8-12 hours for 14 days (vs ~4 hours with GPT-4o-mini)
  • Story generation: ~2-3 hours
  • Total: ~10-15 hours

The slowdown comes from running inference on consumer hardware, without the large-batch, optimized serving stacks behind hosted APIs. To cut total runtime, consider:

  • Running simulation for fewer days (e.g., 7 days = 168 iterations)
  • Using a smaller model for planning, larger for story generation
  • Using vLLM for batching multiple requests

Troubleshooting

"Connection refused" to Ollama

# Make sure Ollama is running
ollama serve &

# Or start as a service
sudo systemctl start ollama

Out of Memory (OOM)

# In config.py, reduce context:
max_context_length = 32000
max_tokens = 4096

# Use smaller model
llm_model_name = 'gemma4:4b'

JSON Parsing Failures

Local models sometimes produce malformed JSON. StoryBox has retry logic (max_retries = 5), but you can increase it:

max_retries = 10

Or add a JSON-fixing post-processor in reverie/common/utils.py.
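
If you go the post-processor route, a minimal sketch could look like this; the helper name and its placement are illustrative, not an existing StoryBox function:

import json
import re

def try_fix_json(raw: str):
    """Best-effort recovery of a JSON object from a noisy LLM response."""
    # Strip markdown code fences the model may have wrapped around the JSON
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # Keep only the outermost {...} span, dropping surrounding chatter
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start == -1 or end == -1:
        return None
    cleaned = cleaned[start:end + 1]
    # Remove trailing commas before closing braces/brackets, a common model mistake
    cleaned = re.sub(r",\s*([}\]])", r"\1", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None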

Slow Generation

  • Use vLLM instead of Ollama for better throughput
  • Enable Flash Attention: pip install flash-attn
  • Use quantization (Q4_K_M or Q5_K_M)
  • Reduce simulation days: max_iteration = 24 * 7 (7 days instead of 14)

Quick Reference: Config Changes

# reverie/config/config.py

# For Ollama
llm_model_name = 'gemma4'
ollama_base_url = 'http://localhost:11434'

# For vLLM / LM Studio (OpenAI-compatible)
llm_model_name = 'gemma4'
base_url = 'http://localhost:8000/v1'  # or 1234 for LM Studio
api_key = 'not-needed'

# Reduce resource usage
max_context_length = 32000
max_tokens = 4096
max_iteration = 24 * 7  # 7 days instead of 14

Model Recommendations

| Use Case           | Recommended Model        | Why                    |
|--------------------|--------------------------|------------------------|
| Best quality       | gemma4:27b or llama3.3   | Largest, most capable  |
| Best speed/quality | gemma4:9b or llama3.1:8b | Good balance           |
| Limited VRAM       | gemma4:4b or phi4        | Fits on 8-16 GB        |
| Long context       | qwen2.5:32b              | Supports 128K context  |
| Coding/planning    | deepseek-r1              | Strong reasoning       |