E4B doesn't process the audio in a video file

#12
by jluixjurado - opened

Hi there!

I don't know if this behaviour is normal, but the E4B model doesn't process the audio data inside the passed video file. It correctly describes the video but complains about the audio: "Please provide the audio or a transcription of the video you would like me to process." (the video actually has audio inside)

Am I missing something?

Thanks in advance.

Hi @jluixjurado
I attempted to reproduce the issue with the E4B model using the multimodal video inference example from https://github.com/huggingface/huggingface-gemma-recipes/blob/main/notebooks/Gemma4_(E2B)-Multimodal.ipynb, but the code executed successfully on my end. Could you please share the specific code snippet you are using? I will try to reproduce it on my side.

Thanks

Thank you very much for your help.

It turns out I was missing the parameter load_audio_from_video=True in the processor.apply_chat_template call:

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    load_audio_from_video=True,
).to(model.device)

Now it works.
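For anyone landing here later, a minimal end-to-end sketch of such a call. The model id `google/gemma-3n-E4B-it` and the exact message layout are assumptions based on the standard transformers multimodal chat-template format, not details confirmed in this thread:

```python
# Hypothetical sketch: attach a video to a chat message so that
# apply_chat_template(..., load_audio_from_video=True) can also
# decode the video's audio track. Model id is an assumption.

def build_video_messages(video_path: str, prompt: str) -> list:
    """Build a chat-template message list that attaches one video.

    With load_audio_from_video=True, the processor extracts the audio
    track from this same video entry, so no separate audio item is needed.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


if __name__ == "__main__":
    # Heavy parts kept under the main guard: they download model weights.
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("google/gemma-3n-E4B-it")
    messages = build_video_messages("clip.mp4", "Describe the video and its audio.")
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        load_audio_from_video=True,  # the key flag: also decode the audio track
    )
```

Without that flag the processor only samples video frames, which matches the symptom above: the model sees the pictures but gets no audio features.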

Thanks again.

jluixjurado changed discussion status to closed
