0xSero/GLM-4.7-REAP-50-W4A16 · REAP-55 quant version

by JaheimLee - opened Jan 7

Discussion

JaheimLee

Jan 7

edited Jan 7

Hi, thanks for this great work. Could you release a REAP-55 quant version? It could run on 4 × 24 GB GPUs.

isevendays

Jan 29

I tried running this on two 4090 GPUs with 48 GB each (2 × 48 GB), and vLLM fails because there is not enough VRAM. I tried many settings and none of them worked. Here is my launch command:

CONTAINER_ID=$(docker run -d \
      --restart unless-stopped \
      --gpus all \
      -p 8080:8000 \
      -e NCCL_P2P_DISABLE=1 \
      -e CUDA_VISIBLE_DEVICES=0,1 \
      -e TORCH_CUDA_ARCH_LIST="8.9" \
      -e VLLM_TOOL_JSON_ERROR_AUTOMATIC_RETRY=1 \
      -e TZ=Europe/Berlin \
      -e TIKTOKEN_CACHE_DIR=/root/.cache/tiktoken \
      -e HF_HUB_OFFLINE=1 \
      -e HF_HOME=/root/.cache/huggingface \
      -v $MOUNT_DIR:/root/.cache/huggingface \
      -v /root/.cache/tiktoken:/root/.cache/tiktoken \
      -v /var/log/vllm:/var/log/vllm \
      vllm/vllm-openai:nightly \
      --model $MODEL_NAME \
      --reasoning-parser glm45 \
      --gpu-memory-utilization 0.75 \
      --tool-call-parser glm47 \
      --max-model-len 8192 \
      --enable-auto-tool-choice \
      --max-num-seqs 16 \
      --tensor-parallel-size 2)
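
For anyone sanity-checking a setup like this, a rough back-of-the-envelope estimate can help. With `--gpu-memory-utilization 0.75`, vLLM is only allowed to claim 75% of each GPU, and that budget has to hold the weights, activations, and KV cache. The Python sketch below just does the arithmetic. The helper names are made up for illustration, and `PARAMS_BILLION` is a hypothetical placeholder, not a figure from this thread; the real parameter count should come from the model card. The weight estimate also ignores quantization scales and any layers left unquantized, so the true footprint is likely somewhat larger.

```python
def vram_budget_gb(num_gpus: int, gb_per_gpu: float, utilization: float) -> float:
    """Total memory vLLM may claim across all GPUs, in GB."""
    return num_gpus * gb_per_gpu * utilization


def weight_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate quantized weight size in GB.

    Ignores quantization scales/zero points and unquantized layers.
    """
    return params_billion * bits_per_weight / 8


# Values taken from the launch command above: 2 GPUs, 48 GB each, 0.75 utilization.
budget = vram_budget_gb(num_gpus=2, gb_per_gpu=48, utilization=0.75)

# HYPOTHETICAL placeholder: replace with the parameter count from the model card.
PARAMS_BILLION = 100.0
weights = weight_size_gb(PARAMS_BILLION, bits_per_weight=4)  # W4A16 -> 4-bit weights

print(f"vLLM budget across GPUs: {budget:.1f} GB")
print(f"Estimated weights:       {weights:.1f} GB")
print(f"Left for KV cache etc.:  {budget - weights:.1f} GB")
```

If the estimated weights come close to or exceed the budget, raising `--gpu-memory-utilization` (the vLLM default is 0.9) or lowering `--max-model-len` and `--max-num-seqs` are the usual knobs to try first.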
