OpenAI-API-Compatible Inference
Hello,
first of all, this is a nice model, thank you!
Now to my question:
Is there a way to deploy this easily (without a NIM) for inference as an OpenAI-API-compatible endpoint?
When trying to run it with vllm/vllm-openai, it doesn't seem to work properly.
Has anyone gotten it to work?
Hi! Thank you for your interest. Nemotron-Parse is now also supported in vLLM ToT (top of tree), besides our fork - maybe that would work for you?
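If you prefer pip over the Docker image, the nightly wheels can be installed like this (index URL as documented by the vLLM project; worth double-checking against the current install docs):

```shell
# Install nightly vLLM wheels instead of pulling the nightly Docker image
pip install -U vllm \
    --pre \
    --extra-index-url https://wheels.vllm.ai/nightly
```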
Hey!
Thank you for the information,
I will try it out with the vLLM nightly and report back!
EDIT:
Might need to wait a bit, as I'm using Docker and the nightly image has not been rebuilt since the merge.
I'm getting it to start with the new nightly build,
although it won't work with tensor parallelism.
But now I'm facing the problem that I cannot really use the endpoint, because the model doesn't have a chat template.
Do you maybe have a chat template you are using in the NIM container?
Or is there a specific endpoint I can use?
I tried /v1/chat/completions (here I get a "chat template missing" error)
and /v1/completions (here multi_modal_content doesn't work).
Any hints on how I could use it properly via the API?
Hi! I added an example with vllm serve / the OpenAI-compatible API and a chat template. Let me know if this works for you.
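For intuition: since this model consumes a raw prompt string, the chat template essentially just has to pass the user's text content through unchanged. A minimal Python sketch of that behavior (`render_prompt` is my illustration of what such a template does, not the actual template shipped with the model):

```python
def render_prompt(messages):
    """Naive pass-through render: concatenate all text parts in order.

    This mirrors what a minimal chat template for a raw-prompt model
    would emit (an assumption for illustration, not the shipped template).
    """
    parts = []
    for message in messages:
        content = message["content"]
        # content may be a plain string or a list of typed parts
        if isinstance(content, str):
            parts.append(content)
            continue
        for part in content:
            if part.get("type") == "text":
                parts.append(part["text"])
    return "".join(parts)


messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "</s><s><predict_bbox><predict_classes><output_markdown>"},
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
    ],
}]
print(render_prompt(messages))  # -> </s><s><predict_bbox><predict_classes><output_markdown>
```

Image parts carry no text, so they contribute nothing to the rendered prompt; the serving stack handles the pixels separately.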
Hey @katerynaCh, thank you for your help!
It still doesn't seem to work, though. The image is apparently not being processed properly, as I'm getting this:
```
<x_0.1641><y_0.1844><tbc>**_S_**- \(\mathbf{u}'=\mathbf{v}^*\mathbf{s}+\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}'\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{0}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{1}\), **_s_**- \(\mathbf{u}''=\mathbf{s}''\mathbf{2}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{3}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{4}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{5}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{6}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{7}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{8}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{9}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \(\mathbf{s}''=\mathbf{s}''\mathbf{10}\), **_s_**- \
```
It feels like the output is not based on the image at all.
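One quick thing worth ruling out is a malformed data URL. A small self-contained check, built the same way the script below builds its URL (the fake PNG header here is just for the example):

```python
import base64

# Build a data URL the same way the request script does, then verify the
# payload round-trips and starts with the PNG magic bytes.
payload = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16  # fake PNG header, for the check only
img_b64 = base64.b64encode(payload).decode("utf-8")
url = f"data:image/png;base64,{img_b64}"

assert url.startswith("data:image/png;base64,")
decoded = base64.b64decode(url.split(",", 1)[1])
assert decoded[:8] == b"\x89PNG\r\n\x1a\n"  # PNG signature intact
print("data URL round-trips OK")
```

If this passes with the real `page_1.png` bytes substituted for `payload`, the encoding side is fine and the problem is on the serving side.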
```python
import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-Parse-v1.1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)
print(resp.choices[0].message.content)
```
I'm serving with these flags:

```shell
--model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
--uvicorn-log-level info \
--gpu-memory-utilization 0.60 \
--limit-mm-per-prompt '{"image": 1}' \
--max-model-len 8000 \
--chat-template /templates/nemotron.jinja \
--trust-remote-code \
--dtype auto \
--port ${PORT} \
--host 0.0.0.0
```
I cannot explain why =/
Did you get it to work on your end the way you described it in the model card?
It does work for me following these steps:
- vllm/vllm-openai:nightly
- pip install albumentations
- vllm serve as in the README
- the OpenAI example as in the README (setting max_tokens < 9000)

Are you using the same environment?
I'm using this Dockerfile to build vllm/vllm-openai:nemotron
(vLLM nightly pulled today):

```dockerfile
FROM vllm/vllm-openai:nightly
RUN pip install open_clip_torch timm albumentations
```
Only installing albumentations doesn't cut it; open_clip is also needed or the server won't start, which is why I added it:

```
(APIServer pid=1) Encountered exception while importing open_clip: No module named 'open_clip'
(APIServer pid=1) Traceback (most recent call last):
```
I'm running the Docker container like this:
```shell
docker run --rm \
  --gpus 'all' \
  --network llama-swap_llama-swap \
  --name vllm-${PORT} \
  --shm-size 15gb \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
  -e VLLM_SLEEP_WHEN_IDLE=1 \
  -v /home/meganoob1337/.cache/huggingface/hub/:/root/.cache/huggingface/hub/ \
  -v /home/meganoob1337/projects/ollama/vllm_cache2:/root/.cache/vllm \
  vllm/vllm-openai:nemotron \
  --dtype bfloat16 \
  --max-num-seqs 8 \
  --limit-mm-per-prompt '{"image": 1}' \
  --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
  --uvicorn-log-level info \
  --gpu-memory-utilization 0.60 \
  --trust-remote-code \
  --port ${PORT} \
  --host 0.0.0.0
```
with this script:
```python
import base64

from openai import OpenAI

client = OpenAI(
    base_url="https://my-domain-endpoint.com/v1",
    api_key="sk-not-needed",
)

# Read and base64-encode the image
with open("./page_1.png", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

prompt_text = "</s><s><predict_bbox><predict_classes><output_markdown>"

resp = client.chat.completions.create(
    model="nemotron-parse-1-1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt_text,
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{img_b64}",
                    },
                },
            ],
        }
    ],
    max_tokens=7000,
    temperature=0.0,
    extra_body={
        "repetition_penalty": 1.1,
        "top_k": 1,
        "skip_special_tokens": False,
    },
)
print(resp.choices[0].message.content)
```
And I'm still getting gibberish.
I don't really understand why this is happening.
I'm using the chat template from the repository (I also tried mounting it from the filesystem, but that didn't change anything).
I feel like I'm missing something, but I cannot pinpoint it...
Hi, we have found some issues when running vLLM on non-H100 GPUs that were resolved by using the 0.14.1 image and setting export VLLM_ATTENTION_BACKEND=TRITON_ATTN when serving - possibly this would resolve your issues too?
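Concretely, the workaround applied to the docker run from above might look like this; only the image tag (its exact format is my assumption, based on the 0.14.1 version mentioned) and the extra `-e` line change, and I've trimmed the command to the essential flags:

```shell
docker run --rm \
  --gpus 'all' \
  -e VLLM_ATTENTION_BACKEND=TRITON_ATTN \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  vllm/vllm-openai:v0.14.1 \
  --model nvidia/NVIDIA-Nemotron-Parse-v1.1 \
  --limit-mm-per-prompt '{"image": 1}' \
  --trust-remote-code \
  --port ${PORT} \
  --host 0.0.0.0
```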
Great, happy it worked!