E4B doesn't process the audio in a video file

#12
by jluixjurado - opened

Hi there!

I don't know if this behaviour is normal, but the E4B model doesn't process the audio data inside the passed video file. It correctly describes the video but complains about the audio: "Please provide the audio or a transcription of the video you would like me to process." (the video actually has audio inside)

Am I missing something?

Thanks in advance.

Hi @jluixjurado
I attempted to reproduce the issue with the E4B model using the multimodal video inference example from https://github.com/huggingface/huggingface-gemma-recipes/blob/main/notebooks/Gemma4_(E2B)-Multimodal.ipynb, but the code executed successfully on my end. Could you please share the specific code snippet you are using? I will try to reproduce it on my side.

Thanks

Thank you very much for your help.

It turns out I was missing the parameter load_audio_from_video=True in the processor.apply_chat_template call:

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    load_audio_from_video=True,
).to(model.device)

Now it works.
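For anyone landing here later, a minimal end-to-end sketch of such a call. The model id `google/gemma-3n-E4B-it` and the exact message layout are assumptions based on the standard transformers multimodal chat-template format, not details confirmed in this thread:

```python
# Hypothetical sketch: attach a video to a chat message so that
# apply_chat_template(..., load_audio_from_video=True) can also
# decode the video's audio track. Model id is an assumption.

def build_video_messages(video_path: str, prompt: str) -> list:
    """Build a chat-template message list that attaches one video.

    With load_audio_from_video=True, the processor extracts the audio
    track from this same video entry, so no separate audio item is needed.
    """
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


if __name__ == "__main__":
    # Heavy parts kept under the main guard: they download model weights.
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained("google/gemma-3n-E4B-it")
    messages = build_video_messages("clip.mp4", "Describe the video and its audio.")
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
        add_generation_prompt=True,
        load_audio_from_video=True,  # the key flag: also decode the audio track
    )
```

Without that flag the processor only samples video frames, which matches the symptom above: the model sees the pictures but gets no audio features.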

Thanks again.

jluixjurado changed discussion status to closed
