# Using NVIDIA NIM with StoryBox

NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.

## What is NVIDIA NIM?

NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing ChatOpenAI integration.
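
Because the API is OpenAI-compatible, a standard OpenAI client can talk to a NIM endpoint with nothing but a `base_url` change. As a minimal sketch (assuming the `openai` Python package and the cloud endpoint and model used later in this guide):

```python
# Minimal NIM smoke test: the stock OpenAI client works unchanged;
# only the base_url and API key differ from a regular OpenAI setup.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NIM cloud endpoint
    api_key=os.environ["NIM_API_KEY"],               # nvapi-... key from build.nvidia.com
)

response = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # NIM name, without StoryBox's "nvidia/" prefix
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```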

## Setup Options

### Option 1: NVIDIA AI Enterprise (Cloud)

Use NVIDIA-hosted models via the NIM API.

#### Step 1: Get API Key

  1. Go to https://build.nvidia.com
  2. Sign in with your NVIDIA account
  3. Generate an API key

#### Step 2: Set Environment Variables

```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"
# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```
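
Before touching any StoryBox config, you can sanity-check the key and endpoint by listing the models the endpoint serves; `/v1/models` is part of the OpenAI-compatible surface. A quick check, assuming the `openai` package is installed:

```python
# Verify NIM_API_KEY / NIM_BASE_URL by listing the served models.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
    api_key=os.environ["NIM_API_KEY"],
)

for model in client.models.list().data[:10]:
    print(model.id)  # e.g. meta/llama-3.1-8b-instruct
```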

#### Step 3: Configure StoryBox

Edit `reverie/config/config.py`:

```python
import os

# Use an NVIDIA NIM model.
# Format: nvidia/<model-name>
# The "nvidia/" prefix tells StoryBox to route the request to NIM.
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'

# NIM settings (read from environment variables by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '<YOUR_NIM_API_KEY>')
```
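
For intuition, the routing the comments describe can be pictured as a thin wrapper: strip the `nvidia/` prefix and hand the remainder to the existing `ChatOpenAI` integration with the NIM endpoint. The sketch below is illustrative only, not StoryBox's actual code; `make_chat_model` is a hypothetical helper name:

```python
# Illustrative sketch of the "nvidia/" routing -- not the actual StoryBox code.
import os
from langchain_openai import ChatOpenAI

def make_chat_model(llm_model_name: str) -> ChatOpenAI:  # hypothetical helper
    if llm_model_name.startswith("nvidia/"):
        # 'nvidia/meta/llama-3.1-8b-instruct' -> 'meta/llama-3.1-8b-instruct'
        nim_model = llm_model_name.removeprefix("nvidia/")
        return ChatOpenAI(
            model=nim_model,
            base_url=os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
            api_key=os.getenv("NIM_API_KEY"),
        )
    # Anything else falls through to the regular OpenAI path.
    return ChatOpenAI(model=llm_model_name)
```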

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

### Option 2: Self-Hosted NIM (Local/Docker)

Run NIM on your own GPU infrastructure.

#### Step 1: Prerequisites

- NVIDIA GPU with at least 24 GB VRAM (for 8B models)
- Docker with the NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+

#### Step 2: Pull and Run NIM Container

```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <YOUR_NGC_API_KEY>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```
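
On first start the container downloads model weights, which can take several minutes. A small poll loop (assuming the `-p 8000:8000` mapping above) tells you when the server is answering; `/v1/models` is the OpenAI-compatible route:

```python
# Poll the local NIM container until it responds.
import time
import requests

URL = "http://localhost:8000/v1/models"

for _ in range(60):  # up to ~5 minutes
    try:
        r = requests.get(URL, timeout=5)
        if r.status_code == 200:
            print("NIM is ready:", [m["id"] for m in r.json()["data"]])
            break
    except requests.ConnectionError:
        pass  # still starting up
    time.sleep(5)
else:
    print("NIM did not become ready; check the container's `docker logs`.")
```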

#### Step 3: Configure StoryBox for Local NIM

```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # local NIM doesn't require auth by default
```
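
With the container up, a one-shot completion confirms the whole path before running StoryBox. This mirrors the config above (assuming the `openai` package):

```python
# Smoke test against the local container.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",  # local NIM doesn't check this by default
)

out = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # NIM name, without the "nvidia/" prefix
    messages=[{"role": "user", "content": "One-line test, please."}],
    max_tokens=32,
)
print(out.choices[0].message.content)
```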

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

### Option 3: NIM on Kubernetes / Cloud

For production deployments, run NIM on Kubernetes or cloud GPU instances.

#### Example: AWS EC2 g5.xlarge (A10G GPU)

```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>

# Install Docker and the NVIDIA Container Toolkit
# ... (standard setup)

# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```
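
Before pointing StoryBox at the instance, a quick reachability check catches security-group and firewall issues early. A sketch, assuming `requests` is installed and `<instance-ip>` is replaced with your address; if you would rather not expose port 8000 publicly, an SSH tunnel (`ssh -L 8000:localhost:8000 ...`) lets `nim_base_url` stay on localhost:

```python
# Reachability check for a remote NIM instance.
import requests

INSTANCE_IP = "<instance-ip>"  # placeholder, as above

try:
    r = requests.get(f"http://{INSTANCE_IP}:8000/v1/models", timeout=10)
    r.raise_for_status()
    print("Reachable; serving:", [m["id"] for m in r.json()["data"]])
except requests.RequestException as exc:
    print("Not reachable; check security groups / firewall:", exc)
```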

## Available NIM Models

| Model | NIM Name | VRAM (self-hosted) | Context |
|---|---|---|---|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |

Note: For cloud NIM, check https://build.nvidia.com for the latest available models.


## Configuration Summary

```python
# reverie/config/config.py -- pick one of the three variants below

# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')

# NVIDIA NIM (self-hosted, local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'

# NVIDIA NIM (self-hosted, remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```

## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `NIM_API_KEY` | Your NVIDIA API key | `<YOUR_NIM_API_KEY>` |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |

## Troubleshooting

"Authentication failed"

"Model not found"

Connection timeout

  • For self-hosted: ensure the container is running and port is exposed
  • Check firewall rules for port 8000

### Out of memory (self-hosted)

- Use a smaller model (e.g., Phi-3 Mini instead of Llama 3.1 70B)
- Enable quantization where the model profile supports it, e.g. add `--env QUANTIZATION=int8` to `docker run`
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs
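
The first three issues can be checked in one round trip. A small diagnostic sketch, using the same environment variables as the rest of this guide:

```python
# Quick diagnostic for auth, connectivity, and model availability.
import os
import requests

base_url = os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1")
api_key = os.getenv("NIM_API_KEY", "")

# Authentication: cloud keys start with "nvapi-"
if "integrate.api.nvidia.com" in base_url and not api_key.startswith("nvapi-"):
    print("NIM_API_KEY does not look like a cloud key (expected nvapi-...)")

# Connectivity + model availability in one request
try:
    r = requests.get(
        f"{base_url}/models",  # base_url already ends in /v1
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    r.raise_for_status()
    print("Endpoint OK; serving:", [m["id"] for m in r.json()["data"]])
except requests.RequestException as exc:
    print("Endpoint check failed:", exc)
```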

## Performance Comparison

| Setup | Tokens/sec | Latency | Cost |
|---|---|---|---|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very low | Free |

## Quick Reference

```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."

# 2. Edit reverie/config/config.py:
#    llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# 3. Run
cd /app/storybox/reverie
python run.py
```

For more details, visit https://build.nvidia.com.