# Using NVIDIA NIM with StoryBox
NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.
## What is NVIDIA NIM?
NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing `ChatOpenAI` integration.
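"OpenAI-compatible" means NIM accepts the same `/chat/completions` endpoint, bearer-token header, and JSON body that OpenAI clients send, so existing clients work by swapping the base URL. A minimal sketch of that wire format (the helper name is illustrative, not part of StoryBox or NIM):

```python
import json

def build_chat_request(base_url, api_key, model, messages):
    """Assemble the OpenAI-compatible /chat/completions request NIM expects."""
    url = base_url.rstrip("/") + "/chat/completions"
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    body = json.dumps({"model": model, "messages": messages})
    return url, headers, body

url, headers, body = build_chat_request(
    "https://integrate.api.nvidia.com/v1",
    "nvapi-...",
    "meta/llama-3.1-8b-instruct",
    [{"role": "user", "content": "Hello"}],
)
print(url)  # https://integrate.api.nvidia.com/v1/chat/completions
```

Because the format is identical, the same request works against the cloud endpoint or a self-hosted container at `http://localhost:8000/v1`.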
## Setup Options
### Option 1: NVIDIA AI Enterprise (Cloud)
Use NVIDIA-hosted models via the NIM API.
#### Step 1: Get API Key
1. Go to https://build.nvidia.com
2. Sign in with your NVIDIA account
3. Generate an API key
#### Step 2: Set Environment Variables
```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"
# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```
#### Step 3: Configure StoryBox
Edit `reverie/config/config.py`:
```python
# Use NVIDIA NIM model
# Format: nvidia/<model-name>
# The "nvidia/" prefix tells StoryBox to route to NIM
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'
# NIM settings (reads from env vars by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '<YOUR_NIM_API_KEY>')
```
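The `nvidia/` prefix is only a routing hint for StoryBox; the model name actually sent to the NIM API has the prefix stripped. A hypothetical sketch of that dispatch logic (StoryBox's real routing code may differ in detail):

```python
def resolve_model(llm_model_name):
    """Split a StoryBox model string into (backend, api_model_name).

    'nvidia/meta/llama-3.1-8b-instruct' -> ('nim', 'meta/llama-3.1-8b-instruct')
    Names without the prefix go to the default (OpenAI) backend unchanged.
    """
    prefix = "nvidia/"
    if llm_model_name.startswith(prefix):
        return "nim", llm_model_name[len(prefix):]
    return "openai", llm_model_name
```

Note that for Nemotron the prefix and the model's org happen to coincide, which is why `nvidia/nvidia/nemotron-4-340b-instruct` is correct: the first `nvidia/` routes, the rest is the API model name.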
#### Step 4: Run
```bash
cd /app/storybox/reverie
python run.py
```
---
### Option 2: Self-Hosted NIM (Local/Docker)
Run NIM on your own GPU infrastructure.
#### Step 1: Prerequisites
- NVIDIA GPU with at least 24GB VRAM (for 8B models)
- Docker with NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+
#### Step 2: Pull and Run NIM Container
```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <YOUR_NGC_API_KEY>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```
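On first run the container downloads and loads model weights before it starts serving, which can take several minutes. A small, hypothetical readiness loop you could run before starting StoryBox (the probe is injected as a callable so the retry logic stands alone; in practice you would pass a check against `http://localhost:8000/v1/models`):

```python
import time

def wait_until_ready(probe, attempts=30, delay=2.0):
    """Poll `probe` (a zero-arg callable returning True once the endpoint
    answers) until it succeeds or the attempts run out."""
    for _ in range(attempts):
        try:
            if probe():
                return True
        except OSError:
            pass  # connection refused while the server is still loading
        time.sleep(delay)
    return False
```

For example: `wait_until_ready(lambda: urllib.request.urlopen("http://localhost:8000/v1/models").status == 200)`.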
#### Step 3: Configure StoryBox for Local NIM
```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # Local NIM doesn't check this by default, but the client needs a non-empty value
```
#### Step 4: Run
```bash
cd /app/storybox/reverie
python run.py
```
---
### Option 3: NIM on Kubernetes / Cloud
For production deployments, run NIM on Kubernetes or cloud GPU instances.
#### Example: AWS EC2 g5.xlarge (A10G GPU)
```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>
# Install Docker and NVIDIA Container Toolkit
# ... (standard setup)
# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```
---
## Available NIM Models
| Model | NIM Name | VRAM (self-hosted) | Context |
|-------|----------|-------------------|---------|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |
**Note:** For cloud NIM, check https://build.nvidia.com for the latest available models.
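The VRAM figures above are rough, but they make model selection mechanical: pick the largest model that fits your GPU budget. A small illustrative helper using the table's numbers (approximate, self-hosted, unquantized):

```python
# Approximate self-hosted VRAM needs in GB, taken from the table above.
VRAM_GB = {
    "microsoft/phi-3-mini-128k-instruct": 16,
    "meta/llama-3.1-8b-instruct": 24,
    "microsoft/phi-3-medium-128k-instruct": 48,
    "google/gemma-2-27b-it": 80,
    "meta/llama-3.1-70b-instruct": 140,
}

def largest_model_that_fits(budget_gb):
    """Return the most VRAM-hungry model still within budget, or None."""
    fitting = [(vram, name) for name, vram in VRAM_GB.items() if vram <= budget_gb]
    return max(fitting)[1] if fitting else None
```

For example, a single 24 GB A10G fits the 8B models but nothing in the 27B-and-up range without quantization.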
---
## Configuration Summary
```python
# reverie/config/config.py
# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')
# NVIDIA NIM (self-hosted local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'
# NVIDIA NIM (self-hosted remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```
---
## Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `NIM_API_KEY` | Your NVIDIA API key | `<YOUR_NIM_API_KEY>` |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |
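The resolution order is: an explicitly set environment variable wins, otherwise the documented default applies. A minimal sketch of that lookup, matching the defaults in the table (the function name is illustrative):

```python
import os

def nim_settings(env=None):
    """Resolve NIM settings from the environment, falling back to the
    documented defaults."""
    env = os.environ if env is None else env
    return {
        "api_key": env.get("NIM_API_KEY", "<YOUR_NIM_API_KEY>"),
        "base_url": env.get("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
    }
```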
---
## Troubleshooting
### "Authentication failed"
- Check your `NIM_API_KEY` is set correctly
- For cloud NIM, ensure your key is active at https://build.nvidia.com
### "Model not found"
- Verify the model name format: `nvidia/<org>/<model-name>`
- Check available models at https://build.nvidia.com
### Connection timeout
- For self-hosted: ensure the container is running and port is exposed
- Check firewall rules for port 8000
### Out of memory (self-hosted)
- Use a smaller model (e.g., Phi-3 Mini instead of Llama 70B)
- Run a quantized model profile (e.g. INT8/FP8) if the NIM image ships one; see the NIM documentation for how to select profiles
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs
---
## Performance Comparison
| Setup | Tokens/sec | Latency | Cost |
|-------|-----------|---------|------|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very Low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very Low | Free |
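The per-token rates above make the cost tradeoff easy to quantify with back-of-envelope arithmetic (rates are the illustrative figures from the table, not current pricing):

```python
def monthly_token_cost(tokens_per_month, usd_per_million):
    """API cost in USD for a given monthly token volume."""
    return tokens_per_month / 1_000_000 * usd_per_million

# e.g. at 500M tokens/month:
gpt4o_mini = monthly_token_cost(500_000_000, 0.60)  # 300.0 USD
nim_cloud = monthly_token_cost(500_000_000, 0.10)   # 50.0 USD
```

At that volume, self-hosting starts to pay off only once the monthly API bill exceeds the amortized hardware (or cloud GPU rental) cost.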
---
## Quick Reference
```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."
# 2. Edit config
# llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# 3. Run
python run.py
```
For more details, visit:
- https://build.nvidia.com (Cloud NIM)
- https://docs.nvidia.com/nim/ (Self-hosted NIM)