Prompt Enhancer LLM for Video Generation

Final project for MIT 6.7920: Reinforcement Learning

Model Info

This model translates simple video generation prompts into detailed, professional-grade prompts, eliminating the need for prompt engineering and bringing high-quality AI-generated videos to ordinary users.

Example:

Original prompt:
"A golden retriever playing in a park, chasing butterflies on a sunny afternoon"
Enhanced prompt:
"# SUNSHINE AND PLEASURE - [Scene: A golden retriever runs through a sunlit park, bounding after fluttering butterflies. Grass sways in gentle breeze.] - Camera: Tracking shot at ground level, capturing playful motion and joyful expression. Cut to wide view showing expanse of green. - Lighting: Bright sunlight with long shadows. Golden hour glow enhances fur texture. - Motion: Dog leaps gracefully, tail wagging wildly. Butterflies dance around nose. - Details: Ribbon on collar; muddy paws. Birds chirp in trees. - Atmosphere: Innocence, freedom, pure happiness. - Style: Whimsical nature documentary. Inspired by Up and Finding Nemo. - Color Grading: Warm tones with high saturation. Add soft focus and subtle vignette. - Sound Design: Puppy barks, laughter, and rustling leaves."

The model was trained specifically to improve Sora videos, but you can also try it with other video generation services such as Veo.

Usage

Requirements:

pip install torch==2.1.0 transformers==4.42.4 peft==0.11.1
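Because the requirements pin exact versions, it can help to confirm that the active environment matches them before loading a 14B model. The following is a minimal sketch using only the standard library; `check_pins` and `PINNED` are hypothetical names introduced here for illustration, and ignoring local build tags such as `+cu121` is an assumption about how torch wheels are usually labelled.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the pip command above.
PINNED = {"torch": "2.1.0", "transformers": "4.42.4", "peft": "0.11.1"}


def check_pins(pinned=PINNED):
    """Return a list of human-readable mismatches (empty if all pins match)."""
    problems = []
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        # Assumption: drop a local build tag such as "+cu121" before comparing.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in check_pins():
        print(line)
```

An empty result means the installed packages match the pinned versions. Other versions may also work, but they are not the ones listed in the requirements.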

Sample code to load and use the model:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model_id = "Qwen/Qwen2.5-14B-Instruct"
adapter_id = "dariakryvosheieva/video-prompt-enhancer"

tokenizer = AutoTokenizer.from_pretrained(adapter_id, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

base = AutoModelForCausalLM.from_pretrained(
    base_model_id, device_map="auto", torch_dtype="auto"
)
model = PeftModel.from_pretrained(base, adapter_id).eval()


def format_query(simple_prompt: str) -> str:
    instruction_text = (
        "Convert the following video generation prompt into a professional-grade prompt that will produce a high quality, aesthetic, and impressive video."
        "If the original prompt includes a style specification (such as 'anime', 'pixel', or 'cartoon'), keep it in the converted prompt."
        "Output only the converted prompt."
    )
    return f"{instruction_text}\n\nInput:\n{simple_prompt.strip()}\n\nOutput:\n"


prompt = "a cat riding a skateboard in a park at sunset"
text = format_query(prompt)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,  # make sampling explicit so temperature and top_p take effect
        temperature=0.7,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
    )

print(
    tokenizer.decode(out[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True)
)
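The enhanced prompt in the example above appears to follow a loose " - Label: text" layout (Scene, Camera, Lighting, and so on). If your outputs follow the same layout, a small parser can split the result into sections so that a single field can be inspected or overridden before the prompt is sent to a video service. The sketch below is an assumption based on that one example, not part of the model's API: `split_enhanced_prompt` is a hypothetical helper, and the model is not guaranteed to emit this layout every time.

```python
import re


def split_enhanced_prompt(enhanced: str) -> dict[str, str]:
    """Split an enhanced prompt into labelled sections (hypothetical helper).

    Assumes sections are separated by " - " and look like "Label: text",
    with an optional "# TITLE" part and optional square brackets.
    Parts that do not fit this pattern are skipped.
    """
    sections: dict[str, str] = {}
    for part in enhanced.split(" - "):
        part = part.strip().strip("[]").strip()
        if part.startswith("#"):
            sections["Title"] = part.lstrip("# ").strip()
            continue
        match = re.match(r"([A-Z][A-Za-z ]*):\s*(.+)", part, flags=re.DOTALL)
        if match:
            sections[match.group(1)] = match.group(2).strip()
    return sections


# Abridged version of the example output shown in Model Info.
sample = (
    "# SUNSHINE AND PLEASURE - [Scene: A golden retriever runs through a sunlit park.] "
    "- Camera: Tracking shot at ground level. "
    "- Color Grading: Warm tones with high saturation."
)
sections = split_enhanced_prompt(sample)
for label, text in sections.items():
    print(f"{label}: {text}")
```

A section body that itself contains " - " would be split incorrectly, so treat the result as a convenience for inspection rather than a lossless parse.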

Credits & Inspiration

Sora users @keigo_matsumaru and @kejia for prompt styles
Jina AI's PromptPerfect - an analogous tool for the text and image modalities

Training Procedure

See the GitHub repo.

Model tree for dariakryvosheieva/video-prompt-enhancer

Base model: Qwen/Qwen2.5-14B
Finetuned: Qwen/Qwen2.5-14B-Instruct
Finetuned: this model