Optimised implementation of this model
Is there an optimised implementation of this model?
I have got this model running with the provided transformers code, but I was wondering whether it is compatible with vLLM, SGLang, etc. in any way? I would like to increase throughput beyond one item at a time.
Thanks
Just dropping two alternative ways to use this model here.
Neither is faster than the one given on the model card, but they might be useful to someone.
With batched transformers code
This runs, but is not significantly quicker than the original implementation at scale.
It also does not produce identical rewards to the single-item version; I suspect this is caused by the padding tokens and/or small numerical differences between batched and unbatched kernels (?)
```python
import torch
from tqdm import trange
from transformers import AutoModelForCausalLM, AutoTokenizer


def get_batch_rewards(prompts, responses):
    messages = [
        [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        for prompt, response in zip(prompts, responses)
    ]
    formatted_prompts = [
        tokenizer.apply_chat_template(m, tokenize=False, add_generation_prompt=False)
        for m in messages
    ]
    tokenized_batch = tokenizer(
        formatted_prompts,
        padding=True,
        return_tensors="pt",
    )
    tokenized_batch = {k: v.to(model.device) for k, v in tokenized_batch.items()}
    output = model.generate(
        **tokenized_batch,
        max_new_tokens=1,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # The reward is the logit of token id 0 at the single generated position
    rewards = output["scores"][0][:, 0].tolist()
    return rewards


model_name = "nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Left padding keeps the scored (final) token at the end of every sequence
tokenizer.padding_side = "left"

responses = ["1+1=2", "1+1=3"] * 10
prompts = ["What is 1+1?"] * len(responses)

batch_size = 8
rewards = []
for i in trange(0, len(prompts), batch_size):
    batch_prompts = prompts[i : i + batch_size]
    batch_responses = responses[i : i + batch_size]
    # extend (not append) so we end up with a flat list of floats
    rewards.extend(get_batch_rewards(batch_prompts, batch_responses))
```
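On the non-identical rewards: besides padding, part of the discrepancy is probably just floating-point non-associativity. Batched GPU kernels reduce in a different order than single-item kernels, so logits can differ in the last few bits even for the same input. A minimal illustration of the underlying effect:

```python
# Floating-point addition is not associative, so changing the reduction
# order (as batched kernels do) can change the result slightly.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False: 0.6000000000000001 vs 0.6
```

In bfloat16 at 70B scale these tiny per-op differences accumulate, so small shifts in the reward values are expected rather than alarming.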
With vLLM
We read the logprob of token id 0, which seems to correlate with the score the model gives.
Surprisingly, however, this implementation does not run faster than the single-item transformers implementation at scale.
Would anyone have an intuition as to why? I usually find vLLM much faster for batched processing, but I am guessing the gains are minimal when you only generate one token per sequence.
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="nvidia/Llama-3.3-Nemotron-70B-Reward-Multilingual",
    tensor_parallel_size=8,
    max_model_len=8_000,
    max_num_seqs=128,
)

responses = ["1+1=2", "1+1=3"]
prompts = ["What is 1+1?"] * len(responses)
conversations = [
    [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    for prompt, response in zip(prompts, responses)
]

# Force the model to emit token id 0 and read its logprob as the reward
sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=1,
    logprobs=20,
    allowed_token_ids=[0],
)
outputs = llm.chat(conversations, sampling_params)
# logprobs[0] maps token id -> Logprob object for the first generated position
rewards = [o.outputs[0].logprobs[0][0].logprob for o in outputs]
```
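One rough intuition for the speed question above (a back-of-the-envelope sketch with assumed numbers, not a measurement): with `max_tokens=1`, essentially all of the work is prompt prefill, which a batched transformers loop already handles reasonably well. vLLM's largest wins come from continuous batching of the decode phase, and here there is almost no decode phase to batch.

```python
# Back-of-the-envelope: fraction of token-processing work that is decode
# when generating a single token. prompt_len is an illustrative assumption.
prompt_len = 500  # assumed average prompt + response length in tokens
gen_len = 1       # the reward trick generates exactly one token
decode_fraction = gen_len / (prompt_len + gen_len)
print(f"{decode_fraction:.1%}")  # -> 0.2%
```

So for reward scoring, throughput is dominated by how fast prefill runs, and both stacks end up compute-bound on roughly the same matmuls.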