OpenAI API Compatible Inference

#4
by meganoob1337 - opened

Hello,
first of all, this is a nice model, thank you!

Now to my question:
Is there a way to deploy this easily (without a NIM) for inference as an OpenAI-API-compatible endpoint?
When trying to run it with vllm/vllm-openai, it doesn't seem to work properly.

Has anyone got it to work?

NVIDIA org

Hi! Thank you for your interest. Nemotron-Parse is now also supported in vLLM ToT (tip of tree), besides our fork; maybe that would work for you?

Hey!

Thank you for the information,
I will try it out with the vLLM nightly and report back!
EDIT:
Might need to wait a bit, as I'm using Docker and the nightly image has not been rebuilt since the merge.
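In the meantime, one way to check whether the local nightly image already contains the merge is to look at its build timestamp (plain Docker CLI; the tag is the one used in this thread):

```shell
# Show when the local nightly image was built
docker image inspect vllm/vllm-openai:nightly --format '{{.Created}}'
```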

I'm getting it to start with the new nightly build,
although it won't work with tensor parallelism.

But now I'm facing the problem that I cannot really use the endpoint, as the model doesn't have a chat template.
Do you maybe have a chat template you are using in the NIM container?
Or is there a specific endpoint I can use?
I tried /v1/chat/completions (here I get a "chat template missing" error)
and /v1/completions (here multi_modal_content doesn't work).

Any hints on how I could use it properly via the API?

NVIDIA org

Hi! I added an example with vllm serve / the OpenAI-compatible API and a chat template. Let me know if this works for you.

Hey @katerynaCh, thank you for your help!
It still doesn't seem to work, though; the image apparently isn't processed properly, as I'm getting this:

```
<x_0.1641><y_0.1844><tbc>**_S_**- \(\mathbf{u}'=\mathbf{v}^*\mathbf{s}+\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}'\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{1}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{2}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{3}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{4}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{5}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{6}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{7}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{8}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{9}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \
```

It feels like the output is not based on the image at all.
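For reference, the leading `<x_...><y_...>` tags in that output do look like the model's bounding-box tokens (the prompt asks for `<predict_bbox>`), so at least the tag structure is parseable even if the body is garbage. A small sketch of pulling the normalized coordinates out of a response string; the tag format here is inferred only from the sample above, not from official docs:

```python
import re

# Prefix of the model output quoted above
sample = "<x_0.1641><y_0.1844><tbc>"

# Extract normalized coordinates from <x_...>/<y_...> tags
# (tag format inferred from the sample output, not from official docs)
coords = [float(m) for m in re.findall(r"<[xy]_([0-9.]+)>", sample)]
print(coords)  # [0.1641, 0.1844]
```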

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Parse-v1.1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)

print(resp.choices[0].message.content)
```

served with these vLLM flags:

```shell
    --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
    --uvicorn-log-level info \
    --gpu-memory-utilization 0.60 \
    --limit-mm-per-prompt '{"image": 1}' \
    --max-model-len 8000 \
    --chat-template /templates/nemotron.jinja \
    --trust-remote-code \
    --dtype auto \
    --port ${PORT} \
    --host 0.0.0.0
```

I cannot explain why =/
Did you get it to work on your end the way you described it in the model card?

It does work for me following these steps:

  1. vllm/vllm-openai:nightly
  2. pip install albumentations
  3. vllm serve as in the README
  4. OpenAI example as in the README (setting max_tokens < 9000)
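Spelled out as commands, those steps look roughly like this (port, mounts, and the in-container install are assumptions; the serve flags follow the README minimally):

```shell
# 1) pull the nightly image
docker pull vllm/vllm-openai:nightly

# 2) + 3) install albumentations inside the container, then serve
docker run --rm --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash vllm/vllm-openai:nightly \
  -c "pip install albumentations && \
      vllm serve nvidia/NVIDIA-Nemotron-Parse-v1.1 --trust-remote-code --port 8000"

# 4) then point the OpenAI client at http://localhost:8000/v1 with max_tokens < 9000
```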

Are you using the same environment?

I'm using this Dockerfile to build vllm/vllm-openai:nemotron
(vLLM nightly pulled today):

```dockerfile
FROM vllm/vllm-openai:nightly
RUN pip install open_clip_torch timm albumentations
```

Installing only albumentations doesn't cut it; open_clip is also needed, or the server won't start, which is why I added it:

```
(APIServer pid=1) Encountered exception while importing open_clip: No module named 'open_clip'
(APIServer pid=1) Traceback (most recent call last):
```

I'm running the Docker container like this:

```shell
docker run --rm \
  --gpus 'all' \
  --network llama-swap_llama-swap \
  --name vllm-${PORT} \
  --shm-size 15gb \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -v /home/meganoob1337/.cache/huggingface/hub/:/root/.cache/huggingface/hub/ \
  -v /home/meganoob1337/projects/ollama/vllm_cache2:/root/.cache/vllm \
  vllm/vllm-openai:nemotron \
    --dtype bfloat16 \
    --max-num-seqs 8 \
    --limit-mm-per-prompt '{"image": 1}' \
    --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
    --uvicorn-log-level info \
    --gpu-memory-utilization 0.60 \
    --trust-remote-code \
    --port ${PORT} \
    --host 0.0.0.0
```

with this script:

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nemotron-parse-1-1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)

print(resp.choices[0].message.content)
```

and I'm still getting gibberish...
I don't really understand why this is happening.
I'm using the chat template from the repository (I also tried mounting it from the filesystem, but that didn't change anything).
I feel like I'm missing something, but I can't pinpoint it...

NVIDIA org

Hi, we have found some issues when running vLLM on non-H100 GPUs that were resolved by using the 0.14.1 image and setting `export VLLM_ATTENTION_BACKEND=TRITON_ATTN` when serving. Possibly this would resolve your issues too?
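Translated to the docker run setup from earlier in this thread, that workaround would look roughly like this (the exact image tag for the 0.14.1 release is an assumption; other flags trimmed for brevity):

```shell
docker run --rm --gpus all \
  -e VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  vllm/vllm-openai:v0.14.1 \
    --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
    --trust-remote-code \
    --port 8000
```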

@katerynaCh yes! That was it, thank you very much!!

meganoob1337 changed discussion status to closed
NVIDIA org

Great, happy it worked!
