OpenAI API Compatible Inference

#4
by meganoob1337 - opened

Hello,
first of all, this is a nice model, thank you!

Now to my question:
Is there a way to deploy this easily (without a NIM) for inference as an OpenAI-API-compatible endpoint?
When trying to run it with vllm/vllm-openai, it doesn't seem to work properly.

Has anyone got it to work?

NVIDIA org

Hi! Thank you for your interest. Nemotron-Parse is now also supported in vLLM ToT (tip of tree), besides our fork; maybe that would work for you?

Hey!

Thank you for the information,
I will try it out with the vLLM nightly and report back!
EDIT:
Might need to wait a bit, as I'm using Docker and the nightly image has not been rebuilt since the merge.
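In the meantime, one way to check whether the local nightly image already contains the merge is to look at its build timestamp (plain Docker CLI; the tag is the one used in this thread):

```shell
# Show when the local nightly image was built
docker image inspect vllm/vllm-openai:nightly --format '{{.Created}}'
```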

I'm getting it to start with the new nightly build,
although it won't work with tensor parallelism.

But now I'm facing the problem that I cannot really use the endpoint, as the model doesn't have a chat template.
Do you maybe have a chat template you are using in the NIM container?
Or is there a specific endpoint I can use?
I tried /v1/chat/completions (here I get a "chat template missing" error)
and /v1/completions (here multi_modal_content doesn't work).

Any hints on how I could use it properly via the API?

NVIDIA org

Hi! I added an example with vllm serve / the OpenAI-compatible API and a chat template. Let me know if this works for you.

Hey @katerynaCh, thank you for your help!
It still doesn't seem to work, though; the image apparently isn't processed properly, as I'm getting this:

```
<x_0.1641><y_0.1844><tbc>**_S_**- \(\mathbf{u}'=\mathbf{v}^*\mathbf{s}+\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}'\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{1}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{2}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{3}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{4}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{5}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{6}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{7}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{8}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{9}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \
```

It feels like the output is not based on the image at all.
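For reference, the leading `<x_...><y_...>` tags in that output do look like the model's bounding-box tokens (the prompt asks for `<predict_bbox>`), so at least the tag structure is parseable even if the body is garbage. A small sketch of pulling the normalized coordinates out of a response string; the tag format here is inferred only from the sample above, not from official docs:

```python
import re

# Prefix of the model output quoted above
sample = "<x_0.1641><y_0.1844><tbc>"

# Extract normalized coordinates from <x_...>/<y_...> tags
# (tag format inferred from the sample output, not from official docs)
coords = [float(m) for m in re.findall(r"<[xy]_([0-9.]+)>", sample)]
print(coords)  # [0.1641, 0.1844]
```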

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Parse-v1.1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)

print(resp.choices[0].message.content)
```

served with these vLLM flags:

```shell
    --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
    --uvicorn-log-level info \
    --gpu-memory-utilization 0.60 \
    --limit-mm-per-prompt '{"image": 1}' \
    --max-model-len 8000 \
    --chat-template /templates/nemotron.jinja \
    --trust-remote-code \
    --dtype auto \
    --port ${PORT} \
    --host 0.0.0.0
```

I cannot explain why =/
Did you get it to work on your end the way you described it in the model card?

It does work for me following these steps:

  1. vllm/vllm-openai:nightly
  2. pip install albumentations
  3. vllm serve as in the README
  4. OpenAI example as in the README (setting max_tokens < 9000)
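Spelled out as commands, those steps look roughly like this (port, mounts, and the in-container install are assumptions; the serve flags follow the README minimally):

```shell
# 1) pull the nightly image
docker pull vllm/vllm-openai:nightly

# 2) + 3) install albumentations inside the container, then serve
docker run --rm --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash vllm/vllm-openai:nightly \
  -c "pip install albumentations && \
      vllm serve nvidia/NVIDIA-Nemotron-Parse-v1.1 --trust-remote-code --port 8000"

# 4) then point the OpenAI client at http://localhost:8000/v1 with max_tokens < 9000
```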

Are you using the same environment?

I'm using this Dockerfile to build vllm/vllm-openai:nemotron
(vLLM nightly pulled today):

```dockerfile
FROM vllm/vllm-openai:nightly
RUN pip install open_clip_torch timm albumentations
```

Installing only albumentations doesn't cut it; open_clip is also needed, or the server won't start, which is why I added it:

```
(APIServer pid=1) Encountered exception while importing open_clip: No module named 'open_clip'
(APIServer pid=1) Traceback (most recent call last):
```

I'm running the Docker container like this:

```shell
docker run --rm \
  --gpus 'all' \
  --network llama-swap_llama-swap \
  --name vllm-${PORT} \
  --shm-size 15gb \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -v /home/meganoob1337/.cache/huggingface/hub/:/root/.cache/huggingface/hub/ \
  -v /home/meganoob1337/projects/ollama/vllm_cache2:/root/.cache/vllm \
  vllm/vllm-openai:nemotron \
    --dtype bfloat16 \
    --max-num-seqs 8 \
    --limit-mm-per-prompt '{"image": 1}' \
    --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
    --uvicorn-log-level info \
    --gpu-memory-utilization 0.60 \
    --trust-remote-code \
    --port ${PORT} \
    --host 0.0.0.0
```

with this script:

```python
import base64
from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nemotron-parse-1-1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)

print(resp.choices[0].message.content)
```

and I'm still getting gibberish...
I don't really understand why this is happening.
I'm using the chat template from the repository (I also tried mounting it from the filesystem, but that didn't change anything).
I feel like I'm missing something, but I can't pinpoint it...

NVIDIA org

Hi, we have found some issues when running vLLM on non-H100 GPUs that were resolved by using the 0.14.1 image and setting `export VLLM_ATTENTION_BACKEND=TRITON_ATTN` when serving. Possibly this would resolve your issues too?
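Translated to the docker run setup from earlier in this thread, that workaround would look roughly like this (the exact image tag for the 0.14.1 release is an assumption; other flags trimmed for brevity):

```shell
docker run --rm --gpus all \
  -e VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  vllm/vllm-openai:v0.14.1 \
    --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
    --trust-remote-code \
    --port 8000
```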

@katerynaCh yes! That was it, thank you very much!!

meganoob1337 changed discussion status to closed
NVIDIA org

Great, happy it worked!
