# humeo-core

Reusable-rocket MCP server for long-video → 9:16 shorts.
First-principles design, from the HIVE paper + Bryan's rocket analogy: we don't build doors and windows (general subject-tracker UI, retraining models). We build the container (schemas), landing gear (deterministic local extraction), and five thrusters (the five 9:16 layouts this video format actually uses). Everything else is pluggable.
## The rocket, in one picture

```
┌────────────────────────────────────────┐
│       Control panel (MCP tools)        │   <- any MCP client
└────────────────────┬───────────────────┘
                     │ strict JSON
    ┌────────────────┼───────────────┬──────────────┬──────────────┐
    ▼                ▼               ▼              ▼              ▼
 ingest       classify_scenes   select_clips   plan_layout    render_clip
 (scenes +    (5-way layout     (clip picker,  (5 thrusters,  (ffmpeg compile,
  keyframes +  classifier,       heuristic +    pure filter    dry-run safe)
  transcript)  LLM-ready)        LLM-ready)     math)
                     │
                     ▼
          ┌─────────────────────┐
          │      LayoutKind     │
          │  ─────────────────  │
          │  zoom_call_center   │
          │  sit_center         │
          │  split_chart_person │
          │  split_two_persons  │
          │  split_two_charts   │
          └─────────────────────┘
```
Only the classifier and clip-selector have optional LLM hooks; everything else is deterministic, local, and cheap.
## Why five layouts? (the "max 2 items" rule)

The hard constraint for this format: a short shows at most two on-screen items, where an "item" is a person (a human speaker) or a chart (slide, graph, data visual, screenshare). That gives exactly five recipes:
- `zoom_call_center`: 1 person, tight zoom-call / webcam framing.
- `sit_center`: 1 person, interview / seated framing.
- `split_chart_person`: 1 chart + 1 person, stacked vertically (default: even 50/50 top/bottom, chart on top).
- `split_two_persons`: 2 speakers, stacked vertically.
- `split_two_charts`: 2 charts, stacked vertically.
Because the geometry is bounded, we do NOT need a general subject-tracker ML model or a drag-to-highlight UI. We need five small, correct pieces of crop/compose math, and that is exactly what `src/humeo_core/primitives/layouts.py` is.
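To make "five small pieces of crop/compose math" concrete, here is a minimal sketch of what one stacked split layout boils down to. It is not the actual `layouts.py` code: the `band_crop` / `split_chart_person_filters` names, the 1080x1920 canvas, and the landscape-source assumption are ours.

```python
# Illustrative sketch only, not the real src/humeo_core/primitives/layouts.py.
# Assumes a 1080x1920 output canvas and a landscape source that is wider than
# the band after scaling by height.
OUT_W, OUT_H = 1080, 1920

def band_crop(src_w: int, src_h: int, center_x_norm: float, band_h: int) -> str:
    """ffmpeg scale+crop chain that fills an OUT_W x band_h band while keeping
    the subject centred at center_x_norm (0..1) of the source width."""
    scaled_w = round(src_w * band_h / src_h)      # scale by height to fill the band
    crop_w = min(OUT_W, scaled_w)
    x = round(center_x_norm * scaled_w - crop_w / 2)
    x = max(0, min(x, scaled_w - crop_w))         # clamp the crop window inside the frame
    return f"scale={scaled_w}:{band_h},crop={crop_w}:{band_h}:{x}:0"

def split_chart_person_filters(src_w: int, src_h: int,
                               chart_x_norm: float, person_x_norm: float,
                               top_band_ratio: float = 0.5) -> tuple[str, str]:
    """Crop chains for the chart band (top) and the person band (bottom)."""
    top_h = round(OUT_H * top_band_ratio)
    return (band_crop(src_w, src_h, chart_x_norm, top_h),
            band_crop(src_w, src_h, person_x_norm, OUT_H - top_h))
```

A layout planner then only needs to stack the two bands (for example with ffmpeg's `vstack`) to get the final 9:16 frame; that is the whole "thruster".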
See `TERMINOLOGY.md` for the full glossary of terms used across these docs (subject, crop, band, seam, bbox, layout, etc.).
## Install

```bash
uv venv
uv sync
```
External requirements: `ffmpeg` and `ffprobe` on `PATH`. `scenedetect` requires OpenCV: install `opencv-python-headless` or `opencv-python` alongside it.
## Use it as an MCP server

```bash
humeo-core   # stdio transport (primary console script)
# humeo-mcp  # same entrypoint, kept so existing MCP configs keep working
```
Example Cursor/Claude Desktop config:
```json
{
  "mcpServers": {
    "humeo": { "command": "humeo-core" }
  }
}
```
Tools exposed:
| Tool | Purpose |
|---|---|
| `list_layouts` | Enumerate the 5 supported layouts. |
| `ingest` | Scene detection + keyframe extraction (+ optional transcript). |
| `classify_scenes` | Pixel-heuristic per-scene layout classification. |
| `detect_scene_regions` | Return the bbox prompt + per-scene jobs (the agent runs its own vision model). |
| `classify_scenes_with_vision` | Classify scenes from already-gathered `SceneRegions` bbox JSON + build layout instructions. |
| `select_clips` | Heuristic clip picker over a word-level transcript. |
| `plan_layout` | Return the exact ffmpeg `-filter_complex` for a layout. |
| `build_render_cmd` | Build the ffmpeg command without executing it, so you can review it before spending compute. |
| `render_clip` | Build + run ffmpeg to produce a 9:16 MP4. |
Resource: `humeo://layouts` (JSON listing of the 5 layouts).
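For orientation, here is a minimal sketch of driving those tools from the official MCP Python SDK (the `mcp` package). The tool names come from the table above; the `source_path` argument for `ingest` is illustrative and may not match the real input schema.

```python
# Sketch of an MCP client session against humeo-core over stdio.
# Tool names are from the table above; the ingest arguments are an assumption.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="humeo-core")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            layouts = await session.call_tool("list_layouts", arguments={})
            print(layouts)
            # Each later stage consumes the strict-JSON output of the previous one.
            ingest = await session.call_tool("ingest", arguments={"source_path": "talk.mp4"})
            print(ingest)

asyncio.run(main())
```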
## Three interchangeable region detectors
All three emit the same `SceneRegions` schema, so the layout planner and renderer don't care which one you used:

```
classify.py     (pixel variance, no ML)        ┐
face_detect.py  (MediaPipe, local)             ├──► SceneRegions ──► SceneClassification ──► LayoutInstruction ──► ffmpeg
vision.py       (multimodal LLM + OCR bboxes)  ┘
```
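In code, that interchangeability is just a shared signature: anything that maps a scene to a `SceneRegions` can be swapped in. A small sketch of that idea (the `RegionDetector` protocol name and `detect_all` helper are ours, not the repo's):

```python
# Sketch of the shared detector contract; only Scene and SceneRegions come
# from humeo_core.schemas, the Protocol itself is illustrative.
from typing import Protocol

from humeo_core.schemas import Scene, SceneRegions

class RegionDetector(Protocol):
    def __call__(self, scene: Scene) -> SceneRegions: ...

def detect_all(scenes: list[Scene], detect: RegionDetector) -> list[SceneRegions]:
    """Run whichever detector was chosen: pixel heuristic, MediaPipe, or vision-LLM."""
    return [detect(scene) for scene in scenes]
```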
## JSON contracts (non-negotiable)
All tools take and return Pydantic-validated JSON. The contracts live in `src/humeo_core/schemas.py`:
- `Scene` {scene_id, start_time, end_time, keyframe_path?}
- `TranscriptWord` {word, start_time, end_time}
- `IngestResult` {source_path, duration_sec, scenes[], transcript_words[], keyframes_dir?}
- `SceneClassification` {scene_id, layout, confidence, reason}
- `BoundingBox` {x1, y1, x2, y2, label, confidence} (all coords normalized)
- `SceneRegions` {scene_id, person_bbox?, chart_bbox?, ocr_text, raw_reason}
- `Clip` {clip_id, topic, start_time_sec, end_time_sec, viral_hook, virality_score, transcript, suggested_overlay_title, layout?}
- `ClipPlan` {source_path, clips[]}
- `LayoutInstruction` {clip_id, layout, zoom, person_x_norm, chart_x_norm, split_chart_region?, split_person_region?, split_second_chart_region?, split_second_person_region?, top_band_ratio, focus_stack_order}
- `RenderRequest` / `RenderResult`
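To make the contract concrete, here is a sketch of two of these models reconstructed from the field lists above; the real definitions in `src/humeo_core/schemas.py` may differ in field types, defaults, and validators.

```python
# Reconstructed from the field lists above; not copied from schemas.py.
from pydantic import BaseModel, Field

class BoundingBox(BaseModel):
    # All coordinates are normalized to [0, 1] relative to the frame.
    x1: float = Field(ge=0.0, le=1.0)
    y1: float = Field(ge=0.0, le=1.0)
    x2: float = Field(ge=0.0, le=1.0)
    y2: float = Field(ge=0.0, le=1.0)
    label: str
    confidence: float = Field(ge=0.0, le=1.0)

class SceneRegions(BaseModel):
    scene_id: int                        # type assumed; could be a string id
    person_bbox: BoundingBox | None = None
    chart_bbox: BoundingBox | None = None
    ocr_text: str = ""
    raw_reason: str = ""
```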
## First-principles decisions (what we intentionally did NOT build)
- No giant subject-tracker ML. The video format has 5 fixed layouts (with a hard "max 2 items" rule); pixel-level tracking is not needed.
- No drag-and-highlight UI. An MCP tool is a better "UI" for an agent-first workflow. If a human wants to override, they pass a `LayoutInstruction` with their own `person_x_norm` / `chart_x_norm` / `zoom`.
- No end-to-end video→video model. The HIVE paper's core insight is that decomposed orchestration beats monolithic generation. We reify that insight as six small composable tools.
## Extending the pilot
- Plug a real multimodal model into `classify_scenes_with_llm(vision_fn)`.
- Plug a real reasoning model into `select_clips_with_llm(text_fn)`.
- Plug a real vision-LLM into `detect_regions_with_llm(scenes, vision_fn)` to get per-scene bboxes + OCR text, then feed the results back through `classify_scenes_with_vision`. This is the scene-change → v3 images → LLM+OCR → bbox path; see `../docs/SOLUTIONS.md` §4 for rationale (and the sketch after this list).
- All of these hooks enforce strict JSON outputs, so bad model output can't corrupt downstream stages.
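A hedged sketch of what such a `vision_fn` could look like. Only the `detect_regions_with_llm(scenes, vision_fn)` entry point comes from the repo; the `(keyframe_path, prompt)` signature, the OpenAI client, and the model name are assumptions. Any callable that returns `SceneRegions`-shaped JSON (person_bbox / chart_bbox / ocr_text) would do.

```python
# Illustrative vision_fn plug-in; the signature and the OpenAI-backed call are
# assumptions, not the repo's contract.
import base64

from openai import OpenAI

client = OpenAI()

def vision_fn(keyframe_path: str, prompt: str) -> str:
    """Send one keyframe plus the bbox prompt, return the model's raw JSON text."""
    with open(keyframe_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```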
## Testing

```bash
python -m pytest
```
See `docs/ARCHITECTURE.md` for deeper rationale.
## License
MIT