Optimised implementation of this model

#2
by ptrdvn - opened

Is there any optimised implementation of this model?

I have got this model running with the provided transformers code, but I was wondering whether it is compatible with vLLM, SGLang, etc. in any way. I would like to increase throughput beyond one item at a time.

Thanks

Just dropping two alternative ways to use this model here.
Neither is faster than the one given on the model card, but they might be useful to someone.

With batched transformers code

This runs, but is not significantly quicker than the original implementation at scale.
It also does not produce identical scores, which I suspect may be an effect of the padding tokens (?)

import torch
from tqdm import trange
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_batch_rewards(prompts, responses):

    messages = [[
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response}
    ] for prompt, response in zip(prompts, responses)]

    formatted_prompts = [tokenizer.apply_chat_template(p, tokenize=False, add_generation_prompt=False) for p in messages]
    
    tokenized_batch = tokenizer(
        formatted_prompts, 
        padding=True, 
        return_tensors="pt"
    )
    tokenized_batch = {k: v.to(model.device) for k, v in tokenized_batch.items()}

    # Generate a single token; the reward is the logit of token id 0
    # at the first (and only) generation step.
    generation_output = model.generate(
        **tokenized_batch,
        max_new_tokens=1,
        return_dict_in_generate=True,
        output_scores=True
    )
    rewards = generation_output['scores'][0][:, 0].tolist()
    return rewards

model_name = "nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual"
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"

responses = ["1+1=2", "1+1=3"] * 10
prompts = ["What is 1+1?"] * len(responses)
batch_size = 8

rewards = []
for i in trange(0, len(prompts), batch_size):
    batch_prompts = prompts[i:i+batch_size]
    batch_responses = responses[i:i+batch_size]
    # Note: this collects one list of rewards per batch.
    rewards.append(get_batch_rewards(batch_prompts, batch_responses))
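Since the loop appends one list of rewards per batch, the result is nested. If a flat list aligned with the original prompts is needed, it can be flattened afterwards; a small sketch with hypothetical reward values:

```python
# batch_rewards stands in for the nested `rewards` list built above
# (hypothetical values, one inner list per batch of size 2).
batch_rewards = [[3.1, -11.4], [2.9, -10.8]]

# Flatten so that index i lines up with prompts[i] / responses[i].
flat_rewards = [r for batch in batch_rewards for r in batch]
print(flat_rewards)  # [3.1, -11.4, 2.9, -10.8]
```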

With vLLM

We take the logprob of token id 0, which appears to correlate with the score given by the model.
However, this implementation surprisingly does not run faster than the single-item transformers implementation at scale.
Would anyone have an intuition as to why this does not run faster? I usually find vLLM much faster for batched processing, but I am guessing the gains are minimal when you only generate 1 token per sequence.
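For anyone wanting to compare the two approaches empirically, a simple throughput harness around any batched scoring function might look like this (`score_fn` is a placeholder for e.g. the `get_batch_rewards` function above):

```python
import time

def measure_throughput(score_fn, prompts, responses, batch_size=8):
    """Return items scored per second for a batched scoring function.

    score_fn is assumed to take two equal-length lists
    (prompts, responses) and return one reward per pair.
    """
    start = time.perf_counter()
    for i in range(0, len(prompts), batch_size):
        score_fn(prompts[i:i + batch_size], responses[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(prompts) / elapsed
```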


from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual", 
    tensor_parallel_size=8, 
    max_model_len=8_000,
    max_num_seqs=128
)

responses = ["1+1=2", "1+1=3"]
prompts = ["What is 1+1?"] * len(responses)

conversation = [[
    {
        "role": "user",
        "content": prompt
    },
    {
        "role": "assistant",
        "content": response
    }
] for prompt, response in zip(prompts, responses)]

# Restrict decoding to token id 0 so its logprob is always available;
# that logprob then serves as the reward signal.
sampling_params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=20, allowed_token_ids=[0])
outputs = llm.chat(conversation, sampling_params)
# logprobs[0] maps token id -> Logprob for the first generated position.
rewards = [o.outputs[0].logprobs[0][0].logprob for o in outputs]
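Assuming, as with most reward models, that a higher value indicates the preferred response, picking the best answer from the extracted rewards is then just a sort; a minimal sketch with hypothetical reward values:

```python
# Hypothetical rewards for the two responses above.
responses = ["1+1=2", "1+1=3"]
rewards = [3.5, -12.0]

# Rank responses by reward, best first.
ranked = sorted(zip(responses, rewards), key=lambda pair: pair[1], reverse=True)
best_response, best_reward = ranked[0]
print(best_response)  # 1+1=2
```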
