REAP-55 quant version
#7
by JaheimLee - opened
Hi, thanks for this great work. Could you release a quantized REAP-55 version? It could run on 4 x 24 GB GPUs.
+1
I've tried running it on two 48 GB 4090 GPUs, and vLLM fails because there is not enough VRAM. I tried many settings and none of them worked.
CONTAINER_ID=$(docker run -d --restart unless-stopped --gpus all -p 8080:8000 \
  -e NCCL_P2P_DISABLE=1 \
  -e CUDA_VISIBLE_DEVICES=0,1 \
  -e TORCH_CUDA_ARCH_LIST="8.9" \
  -e VLLM_TOOL_JSON_ERROR_AUTOMATIC_RETRY=1 \
  -e TZ=Europe/Berlin \
  -e TIKTOKEN_CACHE_DIR=/root/.cache/tiktoken \
  -e HF_HUB_OFFLINE=1 \
  -e HF_HOME=/root/.cache/huggingface \
  -v $MOUNT_DIR:/root/.cache/huggingface \
  -v /root/.cache/tiktoken:/root/.cache/tiktoken \
  -v /var/log/vllm:/var/log/vllm \
  vllm/vllm-openai:nightly \
  --model $MODEL_NAME \
  --reasoning-parser glm45 \
  --gpu-memory-utilization 0.75 \
  --tool-call-parser glm47 \
  --max-model-len 8192 \
  --enable-auto-tool-choice \
  --max-num-seqs 16 \
  --tensor-parallel-size 2)
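For anyone checking whether a given checkpoint can fit, here is a rough back-of-the-envelope sketch of the memory budget implied by the flags above. It assumes that `--gpu-memory-utilization` caps the fraction of each GPU's memory that vLLM may use, and that with `--tensor-parallel-size 2` each GPU holds roughly half of the weights. The function name is hypothetical, and the estimate ignores everything else running on the GPUs.

```python
def vllm_memory_budget_gb(num_gpus, vram_per_gpu_gb, gpu_memory_utilization):
    """Return (per-GPU budget, total budget) in GB.

    Hypothetical helper: a rough estimate only. The budget has to cover
    the weight shard, activations, and the KV cache on each GPU.
    """
    per_gpu = vram_per_gpu_gb * gpu_memory_utilization
    return per_gpu, per_gpu * num_gpus


# Values taken from the command above: 2 GPUs, 48 GB each, utilization 0.75.
per_gpu, total = vllm_memory_budget_gb(2, 48, 0.75)
print(f"{per_gpu:.1f} GB per GPU, {total:.1f} GB total")
# prints "36.0 GB per GPU, 72.0 GB total"
```

If the weights on disk are larger than the total figure, no combination of `--max-model-len` or `--max-num-seqs` will help, since those only shrink the KV cache portion of the budget.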