# Using NVIDIA NIM with StoryBox
|
|
NVIDIA NIM provides optimized inference for LLMs via an OpenAI-compatible API. This guide shows how to use NIM with StoryBox.
|
|
## What is NVIDIA NIM?
|
|
NVIDIA NIM (NVIDIA Inference Microservices) is a set of easy-to-use microservices for deploying AI models. It exposes an OpenAI-compatible API, so it works seamlessly with StoryBox's existing `ChatOpenAI` integration.
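Because the API is OpenAI-compatible, a quick way to sanity-check connectivity outside StoryBox is the standard `openai` Python client pointed at NIM's endpoint. A minimal sketch, assuming the `openai` package is installed and `NIM_API_KEY` is exported; note that the API itself takes the bare NIM model name (as listed in the table below), without StoryBox's `nvidia/` routing prefix:

```python
import os

from openai import OpenAI  # the standard OpenAI client works against NIM

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # NVIDIA-hosted NIM endpoint
    api_key=os.environ["NIM_API_KEY"],
)

resp = client.chat.completions.create(
    model="meta/llama-3.1-8b-instruct",  # bare NIM name, no "nvidia/" prefix
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```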
|
|
## Setup Options
|
|
### Option 1: NVIDIA AI Enterprise (Cloud)
|
|
Use NVIDIA-hosted models via the NIM API.
|
|
#### Step 1: Get API Key
|
|
1. Go to https://build.nvidia.com
2. Sign in with your NVIDIA account
3. Generate an API key
|
|
#### Step 2: Set Environment Variables
|
|
```bash
export NIM_API_KEY="nvapi-xxxxxxxxxxxxxxxxxxxxxxxx"
# Optional: override the default endpoint
export NIM_BASE_URL="https://integrate.api.nvidia.com/v1"
```
|
|
#### Step 3: Configure StoryBox
|
|
Edit `reverie/config/config.py`:
|
|
```python
import os  # needed for os.getenv below (skip if config.py already imports it)

# Use an NVIDIA NIM model
# Format: nvidia/<org>/<model-name>
# The "nvidia/" prefix tells StoryBox to route requests to NIM
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
# llm_model_name = 'nvidia/meta/llama-3.1-70b-instruct'
# llm_model_name = 'nvidia/mistralai/mistral-7b-instruct-v0.3'
# llm_model_name = 'nvidia/nvidia/nemotron-4-340b-instruct'
# llm_model_name = 'nvidia/google/gemma-2-9b-it'
# llm_model_name = 'nvidia/microsoft/phi-3-mini-128k-instruct'

# NIM settings (read from env vars by default)
nim_base_url = os.getenv('NIM_BASE_URL', 'https://integrate.api.nvidia.com/v1')
nim_api_key = os.getenv('NIM_API_KEY', '<YOUR_NIM_API_KEY>')
```
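The prefix handling above can be sketched as a small helper. This is hypothetical illustration code, not StoryBox's actual routing logic: strip the leading `nvidia/` and pass the remainder as the model name NIM expects.

```python
def resolve_model(llm_model_name):
    """Return (provider, api_model_name) for a StoryBox model id.

    'nvidia/meta/llama-3.1-8b-instruct' -> ('nim', 'meta/llama-3.1-8b-instruct')
    Anything without the prefix falls through to the default OpenAI path.
    """
    prefix = "nvidia/"
    if llm_model_name.startswith(prefix):
        return "nim", llm_model_name[len(prefix):]
    return "openai", llm_model_name
```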
|
|
#### Step 4: Run
|
|
```bash
cd /app/storybox/reverie
python run.py
```
|
|
---
|
|
### Option 2: Self-Hosted NIM (Local/Docker)
|
|
Run NIM on your own GPU infrastructure.
|
|
#### Step 1: Prerequisites
|
|
- NVIDIA GPU with at least 24 GB VRAM (for 8B models)
- Docker with the NVIDIA Container Toolkit
- NVIDIA driver 535+ and CUDA 12.2+
|
|
#### Step 2: Pull and Run the NIM Container
|
|
```bash
# Log in to the NVIDIA Container Registry
docker login nvcr.io
# Username: $oauthtoken
# Password: <YOUR_NGC_API_KEY>

# Run Llama 3.1 8B NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# Or run Mistral 7B
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=<YOUR_NGC_API_KEY> \
  nvcr.io/nim/mistralai/mistral-7b-instruct-v0.3:latest
```
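The container takes a while to download and load the model on first start. Before pointing StoryBox at it, you can confirm the server is up via the OpenAI-compatible `/v1/models` route; the readiness probe path is an assumption to verify against your NIM image version:

```shell
# Should return a JSON list containing the served model once loading finishes
curl -s http://localhost:8000/v1/models

# Many NIM images also expose a readiness probe (check your image's docs)
curl -s http://localhost:8000/v1/health/ready
```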
|
|
#### Step 3: Configure StoryBox for Local NIM
|
|
```python
# In reverie/config/config.py
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# Point to your local NIM instance
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed-for-local'  # local NIM doesn't require auth by default
```
|
|
#### Step 4: Run
|
|
```bash
cd /app/storybox/reverie
python run.py
```
|
|
---
|
|
### Option 3: NIM on Kubernetes / Cloud
|
|
For production deployments, run NIM on Kubernetes or cloud GPU instances.
|
|
#### Example: AWS EC2 g5.xlarge (A10G GPU)
|
|
```bash
# SSH into your GPU instance
ssh -i key.pem ubuntu@<instance-ip>

# Install Docker and the NVIDIA Container Toolkit
# ... (standard setup)

# Run NIM
docker run --gpus all --rm \
  -p 8000:8000 \
  -e NGC_API_KEY=$NGC_API_KEY \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

# From your local machine, configure StoryBox:
# nim_base_url = 'http://<instance-ip>:8000/v1'
```
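If you'd rather not open port 8000 to the internet, an SSH tunnel lets StoryBox keep using a localhost URL. This is standard OpenSSH port forwarding; `key.pem` and `<instance-ip>` are the same placeholders as above:

```shell
# Forward local port 8000 to the NIM container running on the instance
ssh -i key.pem -N -L 8000:localhost:8000 ubuntu@<instance-ip>

# Then, in reverie/config/config.py:
# nim_base_url = 'http://localhost:8000/v1'
```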
|
|
---
|
|
## Available NIM Models
|
|
| Model | NIM Name | VRAM (self-hosted) | Context |
|-------|----------|--------------------|---------|
| Llama 3.1 8B | `meta/llama-3.1-8b-instruct` | ~24 GB | 128K |
| Llama 3.1 70B | `meta/llama-3.1-70b-instruct` | ~140 GB | 128K |
| Mistral 7B | `mistralai/mistral-7b-instruct-v0.3` | ~24 GB | 32K |
| Mixtral 8x7B | `mistralai/mixtral-8x7b-instruct-v0.1` | ~100 GB | 32K |
| Nemotron-4 340B | `nvidia/nemotron-4-340b-instruct` | ~700 GB | 4K |
| Gemma 2 9B | `google/gemma-2-9b-it` | ~24 GB | 8K |
| Gemma 2 27B | `google/gemma-2-27b-it` | ~80 GB | 8K |
| Phi-3 Mini | `microsoft/phi-3-mini-128k-instruct` | ~16 GB | 128K |
| Phi-3 Medium | `microsoft/phi-3-medium-128k-instruct` | ~48 GB | 128K |
| Qwen2.5 7B | `qwen/qwen2.5-7b-instruct` | ~24 GB | 128K |
|
|
**Note:** For cloud NIM, check https://build.nvidia.com for the latest available models.
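As a rough illustration of reading the table, the sketch below picks the most capable self-hosted model that fits a given VRAM budget. The numbers are copied from the table above (a subset of rows only) and are approximate:

```python
# Approximate VRAM requirements in GB, from the table above (subset)
VRAM_GB = {
    "microsoft/phi-3-mini-128k-instruct": 16,
    "meta/llama-3.1-8b-instruct": 24,
    "mistralai/mistral-7b-instruct-v0.3": 24,
    "google/gemma-2-27b-it": 80,
    "meta/llama-3.1-70b-instruct": 140,
}

def largest_fitting(budget_gb):
    """Return the most VRAM-hungry model that still fits the budget, or None."""
    fitting = {m: v for m, v in VRAM_GB.items() if v <= budget_gb}
    return max(fitting, key=fitting.get) if fitting else None
```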
|
|
---
|
|
## Configuration Summary
|
|
```python
# reverie/config/config.py
import os  # needed for os.getenv below

# Pick exactly one of the three configurations:

# NVIDIA NIM (cloud)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'https://integrate.api.nvidia.com/v1'
nim_api_key = os.getenv('NIM_API_KEY')

# NVIDIA NIM (self-hosted local)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://localhost:8000/v1'
nim_api_key = 'not-needed'

# NVIDIA NIM (self-hosted remote)
llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'
nim_base_url = 'http://your-server-ip:8000/v1'
nim_api_key = 'not-needed'
```
|
|
---
|
|
## Environment Variables
|
|
| Variable | Description | Default |
|----------|-------------|---------|
| `NIM_API_KEY` | Your NVIDIA API key | `<YOUR_NIM_API_KEY>` |
| `NIM_BASE_URL` | NIM endpoint URL | `https://integrate.api.nvidia.com/v1` |
|
|
---
|
|
## Troubleshooting
|
|
### "Authentication failed"
- Check that your `NIM_API_KEY` is set correctly
- For cloud NIM, ensure your key is active at https://build.nvidia.com
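A quick way to isolate auth problems from StoryBox configuration is to hit the endpoint directly, using standard bearer-token auth against the OpenAI-compatible `/v1/models` route:

```shell
# A 200 with a model list means the key is valid; a 401 means the key is the problem
curl -s -H "Authorization: Bearer $NIM_API_KEY" \
  https://integrate.api.nvidia.com/v1/models
```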
|
|
### "Model not found"
- Verify the model name format: `nvidia/<org>/<model-name>`
- Check available models at https://build.nvidia.com
|
|
### Connection timeout
- For self-hosted: ensure the container is running and the port is exposed
- Check firewall rules for port 8000
|
|
### Out of memory (self-hosted)
- Use a smaller model (e.g., Phi-3 Mini instead of Llama 3.1 70B)
- Enable quantization: add `--env QUANTIZATION=int8` to `docker run`
- Use tensor parallelism for large models: `--gpus all` with multiple GPUs
|
|
---
|
|
## Performance Comparison
|
|
| Setup | Tokens/sec | Latency | Cost |
|-------|------------|---------|------|
| OpenAI GPT-4o-mini | ~150 | Low | $0.60/M tokens |
| NVIDIA NIM Cloud (8B) | ~100 | Low | ~$0.10/M tokens |
| Self-hosted NIM (A100) | ~80 | Very Low | Hardware cost only |
| Self-hosted NIM (A10G) | ~40 | Low | Hardware cost only |
| Ollama (local) | ~30 | Very Low | Free |
|
|
---
|
|
## Quick Reference
|
|
```bash
# 1. Set API key (for cloud NIM)
export NIM_API_KEY="nvapi-..."

# 2. Edit config
# llm_model_name = 'nvidia/meta/llama-3.1-8b-instruct'

# 3. Run
python run.py
```
|
|
For more details, visit:
- https://build.nvidia.com (Cloud NIM)
- https://docs.nvidia.com/nim/ (Self-hosted NIM)
|
|