================================================================================
LLAMA.CPP SETUP & USAGE GUIDE FOR NVIDIA A100 80GB (CUDA 13)
================================================================================
|
|
1. COMPILATION & INSTALLATION (Optimized for Ampere A100)
--------------------------------------------------------------------------------
# Step 1: Set Environment Variables
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
export CUDACXX=/usr/local/cuda-13.0/bin/nvcc
|
|
# Step 2: Clean and Configure
# We force compute architecture 80 (A100) and disable CUDA compression to avoid GCC errors.
cd ~/llama.cpp
rm -rf build && mkdir build && cd build
|
|
cmake .. -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=80 \
  -DGGML_CUDA_COMPRESS_OPTIM_SIZE=OFF \
  -DCMAKE_BUILD_TYPE=Release
|
|
# Step 3: Build
cmake --build . --config Release -j $(nproc)
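A quick post-build sanity check is worth the few seconds it takes. The sketch below assumes the default CMake layout (binaries under build/bin) and is run from the build directory; it does nothing harmful if the build failed:

```shell
# Verify the binary was produced and that the CUDA backend is compiled in.
# llama-cli prints its version and build info, which lists the active backends.
if [ -x bin/llama-cli ]; then
  ./bin/llama-cli --version
else
  echo "bin/llama-cli not found; check the build log for errors"
fi
```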
|
|
|
|
2. ENVIRONMENT PERMANENCE
--------------------------------------------------------------------------------
# Add binaries and CUDA paths to your .bashrc to skip manual exports next time:
echo 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
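One caveat: re-running those echo commands appends duplicate lines to .bashrc each time. A guarded variant (same paths as above; `add_line` is a hypothetical helper name) only appends a line when it is not already present:

```shell
# Append a line to ~/.bashrc only if an identical line is not already there.
add_line() {
  grep -qxF "$1" "$HOME/.bashrc" 2>/dev/null || echo "$1" >> "$HOME/.bashrc"
}

add_line 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"'
add_line 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"'
```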
|
|
|
|
3. RUNNING INFERENCE (llama-cli)
--------------------------------------------------------------------------------
# Key flags for the A100 80GB:
# -ngl 99 : Offload all model layers to GPU VRAM
# -fa     : Enable Flash Attention (faster attention and a smaller KV cache)
# -c 8192 : Set context size (you can go much higher with 80GB)
|
|
llama-cli -m /path/to/your/model.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  -n 512 \
  -p "You are a helpful assistant. Explain the benefits of A100 GPUs."
|
|
|
|
4. SERVING AN API (llama-server)
--------------------------------------------------------------------------------
# Starts an OpenAI-compatible API server
llama-server -m /path/to/your/model.gguf \
  -ngl 99 \
  -fa \
  --port 8080 \
  --host 0.0.0.0
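Once the server is up, any OpenAI-compatible client can talk to it. A minimal curl sketch follows; the host and port match the flags above, and the payload values (message content, max_tokens) are placeholders to adjust:

```shell
# Write the request body to a file so it can be inspected and reused.
cat > /tmp/chat_request.json <<'EOF'
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the benefits of A100 GPUs."}
  ],
  "max_tokens": 256
}
EOF

# POST to the OpenAI-compatible chat completions endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/chat_request.json \
  || echo "server not reachable; is llama-server running?"
```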
|
|
|
|
5. PERFORMANCE BENCHMARKING
--------------------------------------------------------------------------------
# Test the tokens-per-second (t/s) capability of your hardware.
# Note: unlike llama-cli, llama-bench takes an explicit value for flash attention (-fa 1):
llama-bench -m /path/to/your/model.gguf -ngl 99 -fa 1
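llama-bench can also sweep several settings in one invocation via comma-separated values. A guarded sketch (the model path is a placeholder; the snippet only executes the benchmark when the binary and model actually exist):

```shell
MODEL=/path/to/your/model.gguf
if command -v llama-bench >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  # Benchmark prompt processing at 512 and 2048 tokens, and token
  # generation at 128 and 512 tokens, with flash attention enabled.
  llama-bench -m "$MODEL" -ngl 99 -fa 1 -p 512,2048 -n 128,512
else
  echo "llama-bench or model not found; set MODEL first"
fi
```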
|
|
|
|
6. OPTIMIZATION TIPS FOR A100 80GB
--------------------------------------------------------------------------------
* QUANTIZATION: With 80GB of VRAM, use Q8_0 or Q6_K quantizations for
  near-native quality. Reach for Q4_K_M only when running massive 100B+
  parameter models.
* FLASH ATTENTION: Always use the -fa flag. Flash Attention was originally
  developed and optimized on Ampere GPUs like the A100, and it also shrinks
  the KV cache.
* BATCHING: If serving multiple requests, increase '-b' (logical batch size)
  and '-ub' (physical batch size) to 2048 or higher to saturate the A100 cores.
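Putting the tips together, a throughput-oriented server launch might look like the sketch below. The flag values are starting points, not measured optima, and the model path is a placeholder; the snippet is guarded so it only launches when the binary and model exist:

```shell
# -c 16384      : large context window; 80GB of VRAM leaves plenty of headroom
# -b / -ub 2048 : logical / physical batch sizes, raised for throughput
# --parallel 4  : serve up to 4 requests concurrently (context is split per slot)
MODEL=/path/to/your/model.gguf
if command -v llama-server >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  llama-server -m "$MODEL" -ngl 99 -fa \
    -c 16384 -b 2048 -ub 2048 --parallel 4 \
    --host 0.0.0.0 --port 8080
else
  echo "llama-server or model not found; set MODEL first"
fi
```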
================================================================================
|
|