# Using NVIDIA NIM with StoryBox

NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.

## What is NVIDIA NIM?

NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing ChatOpenAI integration.
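
Because the API is OpenAI-compatible, a standard OpenAI client can talk to a NIM endpoint with nothing but a `base_url` change. As a minimal sketch (assuming the `openai` Python package and the cloud endpoint and model used later in this guide):

```python
# Minimal NIM smoke test: the stock OpenAI client works unchanged;
# only the base_url and API key differ from a regular OpenAI setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NIM cloud endpoint
    api_key=os.environ["NIM_API_KEY"],               # nvapi-... key from build.nvidia.com
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # NIM name, without StoryBox's "nvidia/" prefix
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```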

## Setup Options

### Option 1: NVIDIA AI Enterprise (Cloud)

Use NVIDIA-hosted models via the NIM API.

#### Step 1: Get API Key

  1. Go to https://build.nvidia.com
  2. Sign in with your NVIDIA account
  3. Generate an API key

#### Step 2: Set Environment Variables

```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"
# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```
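
Before touching any StoryBox config, you can sanity-check the key and endpoint by listing the models the endpoint serves; `/v1/models` is part of the OpenAI-compatible surface. A quick check, assuming the `openai` package is installed:

```python
# Verify NIM_API_KEY / NIM_BASE_URL by listing the served models.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
    api_key=os.environ["NIM_API_KEY"],
)

for model in client.models.list().data[:10]:
    print(model.id)  # e.g. meta/llama-3.1-8b-instruct
```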

#### Step 3: Configure StoryBox

Edit `reverie/config/config.py`:

```python
import os

# Use an NVIDIA NIM model.
# Format: nvidia/<model-name>
# The "nvidia/" prefix tells StoryBox to route the request to NIM.
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'

# NIM settings (read from environment variables by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '<YOUR_NIM_API_KEY>')
```
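
For intuition, the routing the comments describe can be pictured as a thin wrapper: strip the `nvidia/` prefix and hand the remainder to the existing `ChatOpenAI` integration with the NIM endpoint. The sketch below is illustrative only, not StoryBox's actual code; `make_chat_model` is a hypothetical helper name:

```python
# Illustrative sketch of the "nvidia/" routing -- not the actual StoryBox code.
import os
from langchain_openai import ChatOpenAI

def make_chat_model(llm_model_name: str) -> ChatOpenAI:  # hypothetical helper
    if llm_model_name.startswith("nvidia/"):
        # 'nvidia/meta/llama-3.1-8b-instruct' -> 'meta/llama-3.1-8b-instruct'
        nim_model = llm_model_name.removeprefix("nvidia/")
        return ChatOpenAI(
            model=nim_model,
            base_url=os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
            api_key=os.getenv("NIM_API_KEY"),
        )
    # Anything else falls through to the regular OpenAI path.
    return ChatOpenAI(model=llm_model_name)
```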

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

### Option 2: Self-Hosted NIM (Local/Docker)

Run NIM on your own GPU infrastructure.

#### Step 1: Prerequisites

- NVIDIA GPU with at least 24 GB VRAM (for 8B models)
- Docker with the NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+

#### Step 2: Pull and Run NIM Container

```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <YOUR_NGC_API_KEY>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```
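
On first start the container downloads model weights, which can take several minutes. A small poll loop (assuming the `-p 8000:8000` mapping above) tells you when the server is answering; `/v1/models` is the OpenAI-compatible route:

```python
# Poll the local NIM container until it responds.
import time
import requests

URL = "http://localhost:8000/v1/models"

for _ in range(60):  # up to ~5 minutes
    try:
        r = requests.get(URL, timeout=5)
        if r.status_code == 200:
            print("NIM is ready:", [m["id"] for m in r.json()["data"]])
            break
    except requests.ConnectionError:
        pass  # still starting up
    time.sleep(5)
else:
    print("NIM did not become ready; check the container's `docker logs`.")
```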

#### Step 3: Configure StoryBox for Local NIM

```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # local NIM doesn't require auth by default
```
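
With the container up, a one-shot completion confirms the whole path before running StoryBox. This mirrors the config above (assuming the `openai` package):

```python
# Smoke test against the local container.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",  # local NIM doesn't check this by default
)

out = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # NIM name, without the "nvidia/" prefix
    messages=[{"role": "user", "content": "One-line test, please."}],
    max_tokens=32,
)
print(out.choices[0].message.content)
```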

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

### Option 3: NIM on Kubernetes / Cloud

For production deployments, run NIM on Kubernetes or cloud GPU instances.

#### Example: AWS EC2 g5.xlarge (A10G GPU)

```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>

# Install Docker and the NVIDIA Container Toolkit
# ... (standard setup)

# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```
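
Before pointing StoryBox at the instance, a quick reachability check catches security-group and firewall issues early. A sketch, assuming `requests` is installed and `<instance-ip>` is replaced with your address; if you would rather not expose port 8000 publicly, an SSH tunnel (`ssh -L 8000:localhost:8000 ...`) lets `nim_base_url` stay on localhost:

```python
# Reachability check for a remote NIM instance.
import requests

INSTANCE_IP = "<instance-ip>"  # placeholder, as above

try:
    r = requests.get(f"http://{INSTANCE_IP}:8000/v1/models", timeout=10)
    r.raise_for_status()
    print("Reachable; serving:", [m["id"] for m in r.json()["data"]])
except requests.RequestException as exc:
    print("Not reachable; check security groups / firewall:", exc)
```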

## Available NIM Models

| Model | NIM Name | VRAM (self-hosted) | Context |
|---|---|---|---|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |

Note: For cloud NIM, check https://build.nvidia.com for the latest available models.


## Configuration Summary

```python
# reverie/config/config.py -- pick one of the three variants below

# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')

# NVIDIA NIM (self-hosted, local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'

# NVIDIA NIM (self-hosted, remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `NIM_API_KEY` | Your NVIDIA API key | `<YOUR_NIM_API_KEY>` |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |

## Troubleshooting

"Authentication failed"

"Model not found"

Connection timeout

  • For self-hosted: ensure the container is running and port is exposed
  • Check firewall rules for port 8000

### Out of memory (self-hosted)

- Use a smaller model (e.g., Phi-3 Mini instead of Llama 3.1 70B)
- Enable quantization where the model profile supports it, e.g. add `--env QUANTIZATION=int8` to `docker run`
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs
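
The first three issues can be checked in one round trip. A small diagnostic sketch, using the same environment variables as the rest of this guide:

```python
# Quick diagnostic for auth, connectivity, and model availability.
import os
import requests

base_url = os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1")
api_key = os.getenv("NIM_API_KEY", "")

# Authentication: cloud keys start with "nvapi-"
if "integrate.api.nvidia.com" in base_url and not api_key.startswith("nvapi-"):
    print("NIM_API_KEY does not look like a cloud key (expected nvapi-...)")

# Connectivity + model availability in one request
try:
    r = requests.get(
        f"{base_url}/models",  # base_url already ends in /v1
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    r.raise_for_status()
    print("Endpoint OK; serving:", [m["id"] for m in r.json()["data"]])
except requests.RequestException as exc:
    print("Endpoint check failed:", exc)
```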

## Performance Comparison

| Setup | Tokens/sec | Latency | Cost |
|---|---|---|---|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very low | Free |

## Quick Reference

```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."

# 2. Edit reverie/config/config.py:
#    llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# 3. Run
cd /app/storybox/reverie
python run.py
```

For more details, visit https://build.nvidia.com.