sglang inference

#9
by owao - opened

I'm super mega duper hyped to experiment with your model, as I already found the previous iteration to be unique!

Just sharing the sglang serve command I use to run the model in BF16:

  "Nanbeige4.1-3B_sglang_131K":
    cmd: |
      python -m sglang.launch_server
        --model-path Nanbeige/Nanbeige4.1-3B
        --host 0.0.0.0
        --trust-remote-code
        --enable-torch-compile
        --tp-size 1
        # --disable-cuda-graph <-- disabling this makes the model load faster, but you'll pay for it with lower throughput; I recommend leaving CUDA graphs enabled
        --reasoning-parser qwen3
        --tool-call-parser qwen
        --context-length 131072
        --port ${PORT}
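For anyone who wants to run the same thing outside llama-swap, here's the equivalent direct shell invocation with the trailing backslashes added for line continuations (the port is hardcoded to 8000 here as an example, since `${PORT}` is normally filled in by llama-swap):

```shell
# Same sglang launch command as above, runnable directly in a shell
python -m sglang.launch_server \
  --model-path Nanbeige/Nanbeige4.1-3B \
  --host 0.0.0.0 \
  --trust-remote-code \
  --enable-torch-compile \
  --tp-size 1 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen \
  --context-length 131072 \
  --port 8000
```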

On an RTX 3090 (just slightly undervolted), this gives very good throughput even deep into the context window, which is perfect for a model with very long CoTs like this one :)
It starts at ~110 t/s @ 50 tokens and only falls to ~100 t/s @ 32,000 tokens, versus starting at ~81 t/s through llama-cpp, also in BF16. But the major benefit is the prompt processing speed!
It takes ~21.5 GB of VRAM.

PS: as I always recommend, give llama-swap a try to manage all your models; it handles any inference engine (llama-cpp, sglang, vllm, etc.). You have just one config file and that's all, and the servers restart automatically when you edit it. The snippet above is taken directly from my config file, which is why it doesn't contain the trailing \ before each newline that the bash command would need if run directly. Do yourself a favor and ditch ollama...!
If anyone is interested, I'll be more than happy to share my config file and llama-swap cmd as a starting point example!
Just ask!
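In the meantime, here's a minimal sketch of what such a config file can look like: the Nanbeige entry from above plus a hypothetical second entry to illustrate mixing engines (schema inferred from my snippet; check the llama-swap README for the full list of options):

```yaml
models:
  "Nanbeige4.1-3B_sglang_131K":
    cmd: |
      python -m sglang.launch_server
        --model-path Nanbeige/Nanbeige4.1-3B
        --host 0.0.0.0
        --trust-remote-code
        --context-length 131072
        --port ${PORT}

  # hypothetical llama-cpp entry, just to show engine mixing
  "some-gguf-model":
    cmd: |
      llama-server
        --model /path/to/model.gguf
        --port ${PORT}
```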

hey!
I assume this won't be supported by vllm OOTB, right? The base model isn't a derivative of some common player, right?

Hey @skhadloya I didn't try it, but yes, it should be supported by vllm since the architecture is the good old, well-supported LlamaForCausalLM.

You can find it in config.json ;)

  "architectures": [
    "LlamaForCausalLM"
  ],
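So an untested vllm sketch could look like the following (flag names are from vllm's standard CLI; I haven't verified this model with it, and the max length may need adjusting to your VRAM):

```shell
# Hypothetical vllm launch for this model -- not tested by me
vllm serve Nanbeige/Nanbeige4.1-3B \
  --trust-remote-code \
  --max-model-len 131072 \
  --port 8000
```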

Oh nice - Missed this, thanks!!
