# Using NVIDIA NIM with StoryBox
|
|
NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.
|
|
## What is NVIDIA NIM?
|
|
NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing `ChatOpenAI` integration.
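Because the API is OpenAI-compatible, a quick way to sanity-check connectivity outside StoryBox is the standard `openai` Python client pointed at NIM's endpoint. A minimal sketch, assuming the `openai` package is installed and `NIM_API_KEY` is exported; note that the API itself takes the bare NIM model name (as listed in the table below), without StoryBox's `nvidia/` routing prefix:

```python
import os

from openai import OpenAI  # the standard OpenAI client works against NIM

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA-hosted NIM endpoint
    api_key=os.environ["NIM_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # bare NIM name, no "nvidia/" prefix
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```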
|
|
## Setup Options
|
|
### Option 1: NVIDIA AI Enterprise (Cloud)
|
|
Use NVIDIA-hosted models via the NIM API.
|
|
#### Step 1: Get API Key
|
|
1. Go to https://build.nvidia.com
2. Sign in with your NVIDIA account
3. Generate an API key
|
|
#### Step 2: Set Environment Variables
|
|
```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"
# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```
|
|
#### Step 3: Configure StoryBox
|
|
Edit `reverie/config/config.py`:
|
|
```python
import os  # needed for os.getenv below (skip if config.py already imports it)

# Use an NVIDIA NIM model
# Format: nvidia/<org>/<model-name>
# The "nvidia/" prefix tells StoryBox to route requests to NIM
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'

# NIM settings (read from env vars by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '<YOUR_NIM_API_KEY>')
```
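The prefix handling above can be sketched as a small helper. This is hypothetical illustration code, not StoryBox's actual routing logic: strip the leading `nvidia/` and pass the remainder as the model name NIM expects.

```python
def resolve_model(llm_model_name):
    """Return (provider, api_model_name) for a StoryBox model id.

    'nvidia/meta/llama-3.1-8b-instruct' -> ('nim', 'meta/llama-3.1-8b-instruct')
    Anything without the prefix falls through to the default OpenAI path.
    """
    prefix = "nvidia/"
    if llm_model_name.startswith(prefix):
        return "nim", llm_model_name[len(prefix):]
    return "openai", llm_model_name
```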
|
|
#### Step 4: Run
|
|
```bash
cd /app/storybox/reverie
python run.py
```
|
|
---
|
|
### Option 2: Self-Hosted NIM (Local/Docker)
|
|
Run NIM on your own GPU infrastructure.
|
|
#### Step 1: Prerequisites
|
|
- NVIDIA GPU with at least 24 GB VRAM (for 8B models)
- Docker with the NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+
|
|
#### Step 2: Pull and Run the NIM Container
|
|
```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <YOUR_NGC_API_KEY>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```
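The container takes a while to download and load the model on first start. Before pointing StoryBox at it, you can confirm the server is up via the OpenAI-compatible `/v1/models` route; the readiness probe path is an assumption to verify against your NIM image version:

```shell
# Should return a JSON list containing the served model once loading finishes
curl -s http://localhost:8000/v1/models

# Many NIM images also expose a readiness probe (check your image's docs)
curl -s http://localhost:8000/v1/health/ready
```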
|
|
#### Step 3: Configure StoryBox for Local NIM
|
|
```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # local NIM doesn't require auth by default
```
|
|
#### Step 4: Run
|
|
```bash
cd /app/storybox/reverie
python run.py
```
|
|
---
|
|
### Option 3: NIM on Kubernetes / Cloud
|
|
For production deployments, run NIM on Kubernetes or cloud GPU instances.
|
|
#### Example: AWS EC2 g5.xlarge (A10G GPU)
|
|
```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>

# Install Docker and the NVIDIA Container Toolkit
# ... (standard setup)

# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```
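If you'd rather not open port 8000 to the internet, an SSH tunnel lets StoryBox keep using a localhost URL. This is standard OpenSSH port forwarding; `key.pem` and `<instance-ip>` are the same placeholders as above:

```shell
# Forward local port 8000 to the NIM container running on the instance
ssh -i key.pem -N -L 8000:localhost:8000 ubuntu@<instance-ip>

# Then, in reverie/config/config.py:
# nim_base_url = 'http://localhost:8000/v1'
```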
|
|
---
|
|
## Available NIM Models
|
|
| Model | NIM Name | VRAM (self-hosted) | Context |
|-------|----------|--------------------|---------|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |
|
|
**Note:** For cloud NIM, check https://build.nvidia.com for the latest available models.
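As a rough illustration of reading the table, the sketch below picks the most capable self-hosted model that fits a given VRAM budget. The numbers are copied from the table above (a subset of rows only) and are approximate:

```python
# Approximate VRAM requirements in GB, from the table above (subset)
VRAM_GB = {
    "microsoft/phi-3-mini-128k-instruct": 16,
    "meta/llama-3.1-8b-instruct": 24,
    "mistralai/mistral-7b-instruct-v0.3": 24,
    "google/gemma-2-27b-it": 80,
    "meta/llama-3.1-70b-instruct": 140,
}

def largest_fitting(budget_gb):
    """Return the most VRAM-hungry model that still fits the budget, or None."""
    fitting = {m: v for m, v in VRAM_GB.items() if v <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None
```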
|
|
---
|
|
## Configuration Summary
|
|
```python
# reverie/config/config.py
import os  # needed for os.getenv below

# Pick exactly one of the three configurations:

# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')

# NVIDIA NIM (self-hosted local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'

# NVIDIA NIM (self-hosted remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```
|
|
---
|
|
## Environment Variables
|
|
| Variable | Description | Default |
|----------|-------------|---------|
| `NIM_API_KEY` | Your NVIDIA API key | `<YOUR_NIM_API_KEY>` |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |
|
|
---
|
|
## Troubleshooting
|
|
### "Authentication failed"
- Check that your `NIM_API_KEY` is set correctly
- For cloud NIM, ensure your key is active at https://build.nvidia.com
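A quick way to isolate auth problems from StoryBox configuration is to hit the endpoint directly, using standard bearer-token auth against the OpenAI-compatible `/v1/models` route:

```shell
# A 200 with a model list means the key is valid; a 401 means the key is the problem
curl -s -H "Authorization: Bearer $NIM_API_KEY" \
  https://integrate.api.nvidia.com/v1/models
```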
|
|
### "Model not found"
- Verify the model name format: `nvidia/<org>/<model-name>`
- Check available models at https://build.nvidia.com
|
|
### Connection timeout
- For self-hosted: ensure the container is running and the port is exposed
- Check firewall rules for port 8000
|
|
### Out of memory (self-hosted)
- Use a smaller model (e.g., Phi-3 Mini instead of Llama 3.1 70B)
- Enable quantization: add `--env QUANTIZATION=int8` to `docker run`
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs
|
|
---
|
|
## Performance Comparison
|
|
| Setup | Tokens/sec | Latency | Cost |
|-------|------------|---------|------|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very Low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very Low | Free |
|
|
---
|
|
## Quick Reference
|
|
```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."

# 2. Edit config
# llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# 3. Run
python run.py
```
|
|
For more details, visit:
- https://build.nvidia.com (Cloud NIM)
- https://docs.nvidia.com/nim/ (Self-hosted NIM)
|
|