ValueError: Transformers does not recognize the model type `llava_qwen`
this is my code:
from transformers import AutoProcessor, AutoModelForVision2Seq
model_path = "./ViDRiP_LLaVA_image" # adjust this to the folder you actually cloned
processor = AutoProcessor.from_pretrained(model_path)
model = AutoModelForVision2Seq.from_pretrained(model_path).to("cuda")
and the following is the error message:
ValueError: The checkpoint you are trying to load has model type `llava_qwen` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
thank you very much if any one could help!!!
Hi, could you please follow the environment setup from LLaVA-NeXT to install llava:
https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/requirements.txt
The "llava_qwen" model is implemented under the class "LlavaQwenForCausalLM" in the "llava" package.
Below is a snippet demonstrating how to use the model for inference on both image and video inputs.
Please let us know if you need any further assistance.
++++++++++++++++++++++++++++++++++++
We use "from llava.model.builder import load_pretrained_model" to load the model
++++++++++++++++++++++++++++++++++++
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
import os

# BUG FIX: `pretrained` was passed to load_pretrained_model below but never
# defined (NameError when run as shown). Point it at the cloned checkpoint
# directory from the original question.
pretrained = "./ViDRiP_LLaVA_image"
model_name = "llava_qwen"  # must match the architecture name registered by the llava package
device = "cuda"
device_map = "auto"
# load_pretrained_model returns (tokenizer, model, image_processor, context_length).
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map, attn_implementation=None)  # Add any other thing you want to pass in
model.eval()  # inference mode (disables dropout, etc.)
++++++++++++++++++++++++++++++++++++
Load image
++++++++++++++++++++++++++++++++++++
# Load one image and preprocess it into model-ready tensors.
image_file = "./ov_chat_images/img1.png"
pil_image = Image.open(image_file)
# process_images yields a list of tensors; move each to fp16 on the target device.
processed = process_images([pil_image], image_processor, model.config)
image_tensor = [tensor.to(dtype=torch.float16, device=device) for tensor in processed]
# Keep the original (width, height) so the model can account for resizing/padding.
image_sizes = [pil_image.size]
++++++++++++++++++++++++++++++++++++
Load video
++++++++++++++++++++++++++++++++++++
Function to extract frames from video
def load_video(video_path, max_frames_num):
    """Uniformly sample ``max_frames_num`` frames from a video.

    Args:
        video_path: Path to a video file, or a sequence whose first element
            is that path (both calling conventions appear in the demos).
        max_frames_num: Number of frames to sample, spread evenly over the clip.

    Returns:
        numpy array of shape (frames, height, width, channels).
    """
    # isinstance() is the idiomatic type check (was: type(video_path) == str).
    if isinstance(video_path, str):
        vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    else:
        vr = VideoReader(video_path[0], ctx=cpu(0))
    total_frame_num = len(vr)
    # Evenly spaced indices across the whole clip, endpoints included.
    uniform_sampled_frames = np.linspace(0, total_frame_num - 1, max_frames_num, dtype=int)
    frame_idx = uniform_sampled_frames.tolist()
    # frame_time = [i/vr.get_avg_fps() for i in frame_idx]
    sampled_frames = vr.get_batch(frame_idx).asnumpy()
    return sampled_frames  # (frames, height, width, channels)
# Sample 32 evenly spaced frames from the clip.
video_path = "./video.mp4"
video_frames = load_video(video_path, 32)
# Comment fixed: 32 frames were requested above, not 16.
print(video_frames.shape)  # (32, height, width, 3)
image_tensors = []
# Preprocess the whole frame stack in one call; fp16 on the GPU.
frames = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"].half().cuda()
image_tensors.append(frames)
# BUG FIX: for a numpy frame, `.size` is the total element count (H*W*C int),
# not an image size. Report PIL-style (width, height) tuples instead.
image_sizes = [(frame.shape[1], frame.shape[0]) for frame in video_frames]
++++++++++++++++++++++++++++++++++++
# Build the prompt with the chat template that matches the base LLM.
conv_template = "qwen_1_5"  # Make sure you use correct chat template for different models
user_prompt = DEFAULT_IMAGE_TOKEN + "\nWhat is the best diagnosis for this colon tissue image?"

conversation = copy.deepcopy(conv_templates[conv_template])
conversation.append_message(conversation.roles[0], user_prompt)
conversation.append_message(conversation.roles[1], None)  # empty slot for the assistant's reply
full_prompt = conversation.get_prompt()

# Tokenize, mapping the image placeholder to IMAGE_TOKEN_INDEX, and add a batch dim.
input_ids = tokenizer_image_token(full_prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

# Greedy decoding over the image + prompt.
generated_ids = model.generate(
    input_ids,
    images=image_tensor,
    image_sizes=image_sizes,
    do_sample=False,
    temperature=0,
    max_new_tokens=4096,
    # return_dict_in_generate=True,
    # output_attentions=True,
)
decoded = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
print(decoded)