================================================================================
LLAMA.CPP SETUP & USAGE GUIDE FOR NVIDIA A100 80GB (CUDA 13)
================================================================================
|
|
1. COMPILATION & INSTALLATION (Optimized for Ampere A100)
--------------------------------------------------------------------------------
# Step 1: Set Environment Variables
export PATH=/usr/local/cuda-13.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH
export CUDACXX=/usr/local/cuda-13.0/bin/nvcc
|
|
# Step 2: Clean and Configure
# We force compute architecture 80 (A100) and disable CUDA compression to avoid GCC errors.
cd ~/llama.cpp
rm -rf build && mkdir build && cd build
|
|
cmake .. -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=80 \
  -DGGML_CUDA_COMPRESS_OPTIM_SIZE=OFF \
  -DCMAKE_BUILD_TYPE=Release
|
|
# Step 3: Build
cmake --build . --config Release -j $(nproc)
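A quick post-build sanity check is worth the few seconds it takes. The sketch below assumes the default CMake layout (binaries under build/bin) and is run from the build directory; it does nothing harmful if the build failed:

```shell
# Verify the binary was produced and that the CUDA backend is compiled in.
# llama-cli prints its version and build info, which lists the active backends.
if [ -x bin/llama-cli ]; then
  ./bin/llama-cli --version
else
  echo "bin/llama-cli not found; check the build log for errors"
fi
```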
|
|
|
|
2. ENVIRONMENT PERMANENCE
--------------------------------------------------------------------------------
# Add binaries and CUDA paths to your .bashrc to skip manual exports next time:
echo 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"' >> ~/.bashrc
source ~/.bashrc
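One caveat: re-running those echo commands appends duplicate lines to .bashrc each time. A guarded variant (same paths as above; `add_line` is a hypothetical helper name) only appends a line when it is not already present:

```shell
# Append a line to ~/.bashrc only if an identical line is not already there.
add_line() {
  grep -qxF "$1" "$HOME/.bashrc" 2>/dev/null || echo "$1" >> "$HOME/.bashrc"
}

add_line 'export PATH="/usr/local/cuda-13.0/bin:$HOME/llama.cpp/build/bin:$PATH"'
add_line 'export LD_LIBRARY_PATH="/usr/local/cuda-13.0/lib64:$LD_LIBRARY_PATH"'
```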
|
|
|
|
3. RUNNING INFERENCE (llama-cli)
--------------------------------------------------------------------------------
# Key flags for the A100 80GB:
# -ngl 99 : Offload all model layers to GPU VRAM
# -fa     : Enable Flash Attention (faster attention and a smaller KV cache)
# -c 8192 : Set context size (you can go much higher with 80GB)
|
|
llama-cli -m /path/to/your/model.gguf \
  -ngl 99 \
  -fa \
  -c 8192 \
  -n 512 \
  -p "You are a helpful assistant. Explain the benefits of A100 GPUs."
|
|
|
|
4. SERVING AN API (llama-server)
--------------------------------------------------------------------------------
# Starts an OpenAI-compatible API server
llama-server -m /path/to/your/model.gguf \
  -ngl 99 \
  -fa \
  --port 8080 \
  --host 0.0.0.0
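Once the server is up, any OpenAI-compatible client can talk to it. A minimal curl sketch follows; the host and port match the flags above, and the payload values (message content, max_tokens) are placeholders to adjust:

```shell
# Write the request body to a file so it can be inspected and reused.
cat > /tmp/chat_request.json <<'EOF'
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the benefits of A100 GPUs."}
  ],
  "max_tokens": 256
}
EOF

# POST to the OpenAI-compatible chat completions endpoint.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @/tmp/chat_request.json \
  || echo "server not reachable; is llama-server running?"
```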
|
|
|
|
5. PERFORMANCE BENCHMARKING
--------------------------------------------------------------------------------
# Test the tokens-per-second (t/s) capability of your hardware.
# Note: unlike llama-cli, llama-bench takes an explicit value for flash attention (-fa 1):
llama-bench -m /path/to/your/model.gguf -ngl 99 -fa 1
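llama-bench can also sweep several settings in one invocation via comma-separated values. A guarded sketch (the model path is a placeholder; the snippet only executes the benchmark when the binary and model actually exist):

```shell
MODEL=/path/to/your/model.gguf
if command -v llama-bench >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  # Benchmark prompt processing at 512 and 2048 tokens, and token
  # generation at 128 and 512 tokens, with flash attention enabled.
  llama-bench -m "$MODEL" -ngl 99 -fa 1 -p 512,2048 -n 128,512
else
  echo "llama-bench or model not found; set MODEL first"
fi
```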
|
|
|
|
6. OPTIMIZATION TIPS FOR A100 80GB
--------------------------------------------------------------------------------
* QUANTIZATION: With 80GB of VRAM, use Q8_0 or Q6_K quantizations for
  near-native quality. Reach for Q4_K_M only when running massive 100B+
  parameter models.
* FLASH ATTENTION: Always use the -fa flag. Flash Attention was originally
  developed and optimized on Ampere GPUs like the A100, and it also shrinks
  the KV cache.
* BATCHING: If serving multiple requests, increase '-b' (logical batch size)
  and '-ub' (physical batch size) to 2048 or higher to saturate the A100 cores.
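Putting the tips together, a throughput-oriented server launch might look like the sketch below. The flag values are starting points, not measured optima, and the model path is a placeholder; the snippet is guarded so it only launches when the binary and model exist:

```shell
# -c 16384      : large context window; 80GB of VRAM leaves plenty of headroom
# -b / -ub 2048 : logical / physical batch sizes, raised for throughput
# --parallel 4  : serve up to 4 requests concurrently (context is split per slot)
MODEL=/path/to/your/model.gguf
if command -v llama-server >/dev/null 2>&1 && [ -f "$MODEL" ]; then
  llama-server -m "$MODEL" -ngl 99 -fa \
    -c 16384 -b 2048 -ub 2048 --parallel 4 \
    --host 0.0.0.0 --port 8080
else
  echo "llama-server or model not found; set MODEL first"
fi
```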
================================================================================
|
|