# Using NVIDIA NIM with StoryBox

NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.

## What is NVIDIA NIM?

NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing `ChatOpenAI` integration.

## Setup Options

### Option 1: NVIDIA AI Enterprise (Cloud)

Use NVIDIA-hosted models via the NIM API.

#### Step 1: Get API Key

1. Go to https://build.nvidia.com
2. Sign in with your NVIDIA account
3. Generate an API key

#### Step 2: Set Environment Variables

```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"

# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```

#### Step 3: Configure StoryBox

Edit `reverie/config/config.py`:

```python
# Use NVIDIA NIM model
# Format: nvidia/<provider>/<model>
# The "nvidia/" prefix tells StoryBox to route to NIM
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'

# NIM settings (read from env vars by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '')
```

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

---

### Option 2: Self-Hosted NIM (Local/Docker)

Run NIM on your own GPU infrastructure.

#### Step 1: Prerequisites

- NVIDIA GPU with at least 24 GB VRAM (for 8B models)
- Docker with the NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+

#### Step 2: Pull and Run the NIM Container

```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <your NGC API key>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<your NGC API key> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<your NGC API key> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```

#### Step 3: Configure StoryBox for Local NIM

```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # Local NIM doesn't require auth by default
```

#### Step 4: Run

```bash
cd /app/storybox/reverie
python run.py
```

---

### Option 3: NIM on Kubernetes / Cloud

For production deployments, run NIM on Kubernetes or cloud GPU instances.

#### Example: AWS EC2 g5.xlarge (A10G GPU)

```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>

# Install Docker and NVIDIA Container Toolkit
# ... (standard setup)

# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```

---
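## Verifying the Endpoint

Whichever option you choose, it's worth confirming the endpoint responds before wiring it into StoryBox. The snippet below is a minimal sketch using the official `openai` Python package against the OpenAI-compatible API that NIM exposes; the URL and key defaults mirror the self-hosted example above, so adjust them for your setup.

```python
# Quick connectivity check for a NIM endpoint (cloud or self-hosted).
# Requires `pip install openai`; the defaults below assume a local NIM.
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.getenv("NIM_BASE_URL", "http://localhost:8000/v1"),
    api_key=os.getenv("NIM_API_KEY", "not-needed-for-local"),
)

# List the models this endpoint serves (OpenAI-compatible /v1/models).
for model in client.models.list():
    print(model.id)

# One-off chat completion to confirm inference works end to end.
resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # no "nvidia/" prefix here
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)
```

Note that the model name sent to NIM drops StoryBox's `nvidia/` routing prefix; the prefix only tells StoryBox which backend to use.

---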
## Available NIM Models

| Model | NIM Name | VRAM (self-hosted) | Context |
|-------|----------|--------------------|---------|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |

**Note:** For cloud NIM, check https://build.nvidia.com for the latest available models.

---

## Configuration Summary

```python
# reverie/config/config.py

# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')

# NVIDIA NIM (self-hosted local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'

# NVIDIA NIM (self-hosted remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```

---

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `NIM_API_KEY` | Your NVIDIA API key | (none) |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |

---

## Troubleshooting

### "Authentication failed"

- Check that `NIM_API_KEY` is set correctly
- For cloud NIM, ensure your key is active at https://build.nvidia.com

### "Model not found"

- Verify the model name format: `nvidia/<provider>/<model>`
- Check available models at https://build.nvidia.com

### Connection timeout

- For self-hosted: ensure the container is running and the port is exposed
- Check firewall rules for port 8000

### Out of memory (self-hosted)

- Use a smaller model (e.g., Phi-3 Mini instead of Llama 70B)
- Enable quantization: add `--env QUANTIZATION=int8` to `docker run`
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs

---

## Performance Comparison

| Setup | Tokens/sec | Latency | Cost |
|-------|------------|---------|------|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very Low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very Low | Free |

---

## Quick Reference

```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."

# 2. Edit config
# llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# 3. Run
python run.py
```

For more details, visit:

- https://build.nvidia.com (Cloud NIM)
- https://docs.nvidia.com/nim/ (Self-hosted NIM)
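
---

## How the `nvidia/` Routing Works (Sketch)

The routing described in this guide boils down to stripping the `nvidia/` prefix and handing the remainder to an OpenAI-compatible client. The sketch below assumes StoryBox builds on LangChain's `ChatOpenAI` (mentioned at the top of this guide); the helper name `build_chat_model` is illustrative, not StoryBox's actual internal API.

```python
# Minimal sketch of the "nvidia/" routing described above.
# Requires `pip install langchain-openai`; build_chat_model is illustrative
# and not StoryBox's actual internal API.
import os

from langchain_openai import ChatOpenAI


def build_chat_model(llm_model_name: str) -> ChatOpenAI:
    if llm_model_name.startswith("nvidia/"):
        # "nvidia/meta/llama-3.1-8b-instruct" -> "meta/llama-3.1-8b-instruct"
        nim_model = llm_model_name.removeprefix("nvidia/")
        return ChatOpenAI(
            model=nim_model,
            base_url=os.getenv("NIM_BASE_URL", "https://integrate.api.nvidia.com/v1"),
            api_key=os.getenv("NIM_API_KEY", "not-needed-for-local"),
        )
    # Anything without the prefix falls through to the default OpenAI setup.
    return ChatOpenAI(model=llm_model_name)


chat = build_chat_model("nvidia/meta/llama-3.1-8b-instruct")
print(chat.invoke("Say hello in one sentence.").content)
```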