# Using NVIDIA NIM with StoryBox

NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.

## What is NVIDIA NIM?

NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing `ChatOpenAI` integration.

## Setup Options

### Option 1: NVIDIA AI Enterprise (Cloud)

Use NVIDIA-hosted models via the NIM API.

#### Step 1: Get API Key

1. Go to https://build.nvidia.com
2. Sign in with your NVIDIA account
3. Generate an API key

#### Step 2: Set Environment Variables

```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"

# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```

#### Step 3: Configure StoryBox

Edit `reverie/config/config.py`:

```python
# Use NVIDIA NIM model
# Format: nvidia/<provider>/<model>
# The "nvidia/" prefix tells StoryBox to route to NIM
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'

# NIM settings (read from env vars by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '')
```

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

---

### Option 2: Self-Hosted NIM (Local/Docker)

Run NIM on your own GPU infrastructure.

#### Step 1: Prerequisites

- NVIDIA GPU with at least 24 GB VRAM (for 8B models)
- Docker with the NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+

#### Step 2: Pull and Run the NIM Container

```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<your NGC API key> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<your NGC API key> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```

#### Step 3: Configure StoryBox for Local NIM

```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # Local NIM doesn't require auth by default
```

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

---

### Option 3: NIM on Kubernetes / Cloud

For production deployments, run NIM on Kubernetes or cloud GPU instances.

#### Example: AWS EC2 g5.xlarge (A10G GPU)

```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>

# Install Docker and NVIDIA Container Toolkit
# ... (standard setup)

# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```

---
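## Verifying the Endpoint

Whichever option you choose, it's worth confirming the endpoint responds before wiring it into StoryBox. The snippet below is a minimal sketch using the official `openai` Python package against the OpenAI-compatible API that NIM exposes; the URL and key defaults mirror the self-hosted example above, so adjust them for your setup.

```python
# Quick connectivity check for a NIM endpoint (cloud or self-hosted).
# Requires `pip install openai`; the defaults below assume a local NIM.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("NIM_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.getenv("NIM_API_KEY", "not-needed-for-local"),
)

# List the models this endpoint serves (OpenAI-compatible /v1/models).
for model in client.models.list():
    print(model.id)

# One-off chat completion to confirm inference works end to end.
resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # no "nvidia/" prefix here
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```

Note that the model name sent to NIM drops StoryBox's `nvidia/` routing prefix; the prefix only tells StoryBox which backend to use.

---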
## Available NIM Models

| Model | NIM Name | VRAM (self-hosted) | Context |
|-------|----------|--------------------|---------|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |

**Note:** For cloud NIM, check https://build.nvidia.com for the latest available models.

---

## Configuration Summary

```python
# reverie/config/config.py

# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')

# NVIDIA NIM (self-hosted local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'

# NVIDIA NIM (self-hosted remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```

---

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `NIM_API_KEY` | Your NVIDIA API key | (none) |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |

---

## Troubleshooting

### "Authentication failed"

- Check that `NIM_API_KEY` is set correctly
- For cloud NIM, ensure your key is active at https://build.nvidia.com

### "Model not found"

- Verify the model name format: `nvidia/<provider>/<model>`
- Check available models at https://build.nvidia.com

### Connection timeout

- For self-hosted: ensure the container is running and the port is exposed
- Check firewall rules for port 8000

### Out of memory (self-hosted)

- Use a smaller model (e.g., Phi-3 Mini instead of Llama 70B)
- Enable quantization: add `--env QUANTIZATION=int8` to `docker run`
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs

---

## Performance Comparison

| Setup | Tokens/sec | Latency | Cost |
|-------|------------|---------|------|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very Low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very Low | Free |

---

## Quick Reference

```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."

# 2. Edit config
# llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# 3. Run
python run.py
```

For more details, visit:

- https://build.nvidia.com (Cloud NIM)
- https://docs.nvidia.com/nim/ (Self-hosted NIM)
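
---

## How the `nvidia/` Routing Works (Sketch)

The routing described in this guide boils down to stripping the `nvidia/` prefix and handing the remainder to an OpenAI-compatible client. The sketch below assumes StoryBox builds on LangChain's `ChatOpenAI` (mentioned at the top of this guide); the helper name `build_chat_model` is illustrative, not StoryBox's actual internal API.

```python
# Minimal sketch of the "nvidia/" routing described above.
# Requires `pip install langchain-openai`; build_chat_model is illustrative
# and not StoryBox's actual internal API.
import os

from langchain_openai import ChatOpenAI


def build_chat_model(llm_model_name: str) -> ChatOpenAI:
    if llm_model_name.startswith("nvidia/"):
        # "nvidia/meta/llama-3.1-8b-instruct" -> "meta/llama-3.1-8b-instruct"
        nim_model = llm_model_name.removeprefix("nvidia/")
        return ChatOpenAI(
            model=nim_model,
            base_url=os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
            api_key=os.getenv("NIM_API_KEY", "not-needed-for-local"),
        )
    # Anything without the prefix falls through to the default OpenAI setup.
    return ChatOpenAI(model=llm_model_name)


chat = build_chat_model("nvidia/meta/llama-3.1-8b-instruct")
print(chat.invoke("Say hello in one sentence.").content)
```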