# Using NVIDIA NIM with StoryBox
NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.
## What is NVIDIA NIM?
NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing ChatOpenAI integration.
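Because the API surface is OpenAI-compatible, any OpenAI-style client can talk to NIM directly. As a minimal sketch (assuming the `langchain-openai` package, since StoryBox builds on ChatOpenAI):

```python
# Point LangChain's ChatOpenAI at the NIM endpoint instead of api.openai.com.
# A sketch, not StoryBox's actual wiring; requires `pip install langchain-openai`.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta/llama-3.1-8b-instruct",  # NIM model ID (no "nvidia/" prefix here)
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NIM_API_KEY"],
)
print(llm.invoke("Summarize NVIDIA NIM in one sentence.").content)
```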
## Setup Options

### Option 1: NVIDIA AI Enterprise (Cloud)

Use NVIDIA-hosted models via the NIM API.

#### Step 1: Get API Key

- Go to https://build.nvidia.com
- Sign in with your NVIDIA account
- Generate an API key
#### Step 2: Set Environment Variables

```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"

# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```
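To confirm the key works before touching StoryBox, you can list the models the endpoint serves via the OpenAI-compatible `/v1/models` route. A sketch using the official `openai` Python package (an assumption here; any OpenAI-compatible client works):

```python
# List models served by the NIM endpoint; a 401 here means the key is wrong.
import os

from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url=os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
    api_key=os.environ["NIM_API_KEY"],
)
for model in client.models.list().data:
    print(model.id)
```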
#### Step 3: Configure StoryBox

Edit `reverie/config/config.py`:
```python
import os

# Use an NVIDIA NIM model.
# Format: nvidia/<org>/<model-name>
# The "nvidia/" prefix tells StoryBox to route requests to NIM.
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'

# NIM settings (read from environment variables by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '<YOUR_NIM_API_KEY>')
```
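For intuition, the prefix-based routing could look roughly like the sketch below. These function and parameter names are illustrative only, not StoryBox's actual internals:

```python
# Hypothetical sketch of "nvidia/"-prefix routing; names are made up for
# illustration and do not reflect StoryBox's real code.
from langchain_openai import ChatOpenAI

def build_llm(llm_model_name: str, nim_base_url: str, nim_api_key: str):
    if llm_model_name.startswith("nvidia/"):
        # Strip the routing prefix; the remainder is the NIM model ID,
        # e.g. "nvidia/meta/llama-3.1-8b-instruct" -> "meta/llama-3.1-8b-instruct".
        nim_model = llm_model_name[len("nvidia/"):]
        return ChatOpenAI(model=nim_model, base_url=nim_base_url, api_key=nim_api_key)
    # Fall through to the default OpenAI path for non-NIM models.
    return ChatOpenAI(model=llm_model_name)
```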
#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```
### Option 2: Self-Hosted NIM (Local/Docker)

Run NIM on your own GPU infrastructure.

#### Step 1: Prerequisites

- NVIDIA GPU with at least 24 GB of VRAM (for 8B models)
- Docker with the NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+
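A quick way to confirm the GPU, VRAM, and driver version before pulling the container (assumes `nvidia-smi` is on your PATH):

```python
# Pre-flight check: print GPU name, total VRAM, and driver version.
import subprocess

out = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout
print(out.strip())  # e.g. "NVIDIA A10G, 23028 MiB, 535.183.01"
```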
#### Step 2: Pull and Run NIM Container

```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <YOUR_NGC_API_KEY>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```
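Model loading can take several minutes on first start. NIM exposes a readiness route at `/v1/health/ready` on the same port; a small polling sketch (using the `requests` package, an assumption here):

```python
# Wait until the local NIM container reports ready before sending requests.
import time

import requests  # pip install requests

def wait_for_nim(url: str = "http://localhost:8000/v1/health/ready",
                 timeout: int = 600) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                print("NIM is ready")
                return
        except requests.ConnectionError:
            pass  # container still starting up
        time.sleep(10)
    raise TimeoutError("NIM did not become ready within the timeout")

wait_for_nim()
```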
#### Step 3: Configure StoryBox for Local NIM

```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # local NIM doesn't require auth by default
```
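A one-off smoke test against the local instance, independent of StoryBox (assumes the `openai` Python package):

```python
# Send one chat completion to the local NIM server to verify it answers.
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-for-local")
resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # the NIM model ID, without "nvidia/"
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```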
#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```
### Option 3: NIM on Kubernetes / Cloud

For production deployments, run NIM on Kubernetes or on cloud GPU instances.

#### Example: AWS EC2 g5.xlarge (A10G GPU)

```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>

# Install Docker and the NVIDIA Container Toolkit
# ... (standard setup)

# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```
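Before pointing StoryBox at the remote endpoint, it's worth confirming port 8000 is reachable from your workstation (cloud security groups often block it by default):

```python
# TCP reachability check for the remote NIM port; replace the placeholder host.
import socket

host, port = "<instance-ip>", 8000  # substitute your instance's IP
with socket.socket() as s:
    s.settimeout(5)
    try:
        s.connect((host, port))
        print(f"{host}:{port} is reachable")
    except OSError as exc:
        print(f"Cannot reach {host}:{port}: {exc}")
```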
## Available NIM Models

| Model | NIM Name | VRAM (self-hosted) | Context |
|---|---|---|---|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |
Note: For cloud NIM, check https://build.nvidia.com for the latest available models.
## Configuration Summary
```python
# reverie/config/config.py
import os

# Pick ONE of the following blocks:

# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')

# NVIDIA NIM (self-hosted, local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'

# NVIDIA NIM (self-hosted, remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```
## Environment Variables

| Variable | Description | Default |
|---|---|---|
| `NIM_API_KEY` | Your NVIDIA API key | `<YOUR_NIM_API_KEY>` |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |
## Troubleshooting

### "Authentication failed"

- Check that `NIM_API_KEY` is set correctly
- For cloud NIM, ensure your key is active at https://build.nvidia.com

### "Model not found"

- Verify the model name format: `nvidia/<org>/<model-name>`
- Check available models at https://build.nvidia.com

### Connection timeout

- For self-hosted: ensure the container is running and the port is exposed
- Check firewall rules for port 8000

### Out of memory (self-hosted)

- Use a smaller model (e.g., Phi-3 Mini instead of Llama 3.1 70B)
- Enable quantization: add `--env QUANTIZATION=int8` to `docker run`
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs
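If it's unclear which of these you're hitting, a small diagnostic that exercises the basics (endpoint reachable, key accepted, model served) can narrow it down. A sketch using the `requests` package; the endpoint and model name are examples to adjust to your config:

```python
# Check connectivity, auth, and model availability against a NIM endpoint.
import os

import requests  # pip install requests

base_url = os.getenv("NIM_BASE_URL", "http://localhost:8000/v1")
api_key = os.getenv("NIM_API_KEY", "not-needed")
expected_model = "meta/llama-3.1-8b-instruct"  # example; match your config

try:
    resp = requests.get(f"{base_url}/models",
                        headers={"Authorization": f"Bearer {api_key}"},
                        timeout=10)
    resp.raise_for_status()  # 401 -> bad key, 404 -> wrong base URL
except requests.RequestException as exc:
    raise SystemExit(f"Cannot query NIM at {base_url}: {exc}")

served = [m["id"] for m in resp.json().get("data", [])]
print("Models served:", served)
if expected_model not in served:
    print(f"'{expected_model}' is not served here; check llm_model_name")
```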
## Performance Comparison
| Setup | Tokens/sec | Latency | Cost |
|---|---|---|---|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very Low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very Low | Free |
## Quick Reference

```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."

# 2. Edit config
# llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# 3. Run
python run.py
```
For more details, visit:
- https://build.nvidia.com (Cloud NIM)
- https://docs.nvidia.com/nim/ (Self-hosted NIM)