
humeo-core

Reusable-rocket MCP server for long-video → 9:16 shorts.

First-principles design, from the HIVE paper + Bryan's rocket analogy: we don't build doors and windows (a general subject-tracking UI, model retraining). We build the container (schemas), the landing gear (deterministic local extraction), and the five thrusters (the five 9:16 layouts this video format actually uses). Everything else is pluggable.

The rocket, in one picture

            ┌──────────────────────────────────────────┐
            │        Control panel  (MCP tools)        │   <- any MCP client
            └───────────────────┬──────────────────────┘
                                │ strict JSON
    ┌────────────────┬──────────┼─────────────────┬────────────────┐
    ▼                ▼          ▼                 ▼                ▼
 ingest       classify_scenes  select_clips   plan_layout       render_clip
(scenes +     (5-way layout   (clip picker,   (5 thrusters,    (ffmpeg compile,
 keyframes +   classifier)     heuristic +     pure filter      dry-run safe)
 transcript)                   LLM-ready)      math)
                                                  │
                                                  ▼
                                        ┌────────────────────┐
                                        │   LayoutKind       │
                                        │  ────────────────  │
                                        │  zoom_call_center  │
                                        │  sit_center        │
                                        │  split_chart_person│
                                        │  split_two_persons │
                                        │  split_two_charts  │
                                        └────────────────────┘

Only the classifier and clip-selector have optional LLM hooks; everything else is deterministic, local, and cheap.

Why five layouts? (the "max 2 items" rule)

The hard constraint for this format: a short shows at most two on-screen items, where an "item" is a person (a human speaker) or a chart (slide, graph, data visual, screenshare). That gives exactly five recipes:

  1. zoom_call_center - 1 person, tight zoom-call / webcam framing.
  2. sit_center - 1 person, interview / seated framing.
  3. split_chart_person - 1 chart + 1 person, stacked vertically (default: even 50/50 top/bottom, chart on top).
  4. split_two_persons - 2 speakers, stacked vertically.
  5. split_two_charts - 2 charts, stacked vertically.

Because the geometry is bounded, we do NOT need a general subject-tracker ML model or a drag-to-highlight UI. We need five small, correct pieces of crop/compose math. That is exactly what src/humeo_core/primitives/layouts.py is.
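
For a feel of the size of one such piece, here is a minimal sketch of the 50/50 band crop behind the split layouts, assuming a 1080x1920 output canvas; the function name and signature are illustrative, not the actual layouts.py API:

def band_crop(src_w: int, src_h: int, center_x_norm: float,
              out_w: int = 1080, band_h: int = 960) -> tuple[int, int, int, int]:
    """Crop window (x, y, w, h) for one band of a 50/50 vertical split.

    An even split of a 1080x1920 canvas gives two 1080x960 bands (9:8),
    so we take the tallest 9:8 window that fits in the source and slide
    it horizontally to keep the subject centered.
    """
    band_aspect = out_w / band_h                      # 9:8 = 1.125
    crop_h = src_h
    crop_w = min(src_w, round(crop_h * band_aspect))
    x = round(center_x_norm * src_w - crop_w / 2)     # center on subject
    x = max(0, min(x, src_w - crop_w))                # clamp to frame
    return x, 0, crop_w, crop_h

Scaling each band to 1080x960 and stacking the two is plain crop/scale/vstack territory in ffmpeg terms.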

See TERMINOLOGY.md for the full glossary of terms used across these docs (subject, crop, band, seam, bbox, layout, etc.).

Install

uv venv
uv sync

External requirements: ffmpeg and ffprobe on PATH.

scenedetect requires OpenCV. Install opencv-python-headless or opencv-python alongside scenedetect.
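
For example, assuming you want the headless build:

uv pip install opencv-python-headless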

Use it as an MCP server

humeo-core         # stdio transport (primary console script)
# humeo-mcp        # same entrypoint, kept so existing MCP configs keep working

Example Cursor/Claude Desktop config:

{
  "mcpServers": {
    "humeo": { "command": "humeo-core" }
  }
}

Tools exposed:

Tool                           Purpose
list_layouts                   Enumerate the 5 supported layouts.
ingest                         Scene detection + keyframe extraction (+ optional transcript).
classify_scenes                Pixel-heuristic per-scene layout classification.
detect_scene_regions           Return the bbox prompt + per-scene jobs (agent runs its own vision model).
classify_scenes_with_vision    Classify scenes from already-gathered SceneRegions bbox JSON + build layout instructions.
select_clips                   Heuristic clip picker over a word-level transcript.
plan_layout                    Return the exact ffmpeg -filter_complex for a layout.
build_render_cmd               Build the ffmpeg command (no execution); review before spend.
render_clip                    Build + run ffmpeg to produce a 9:16 MP4.

Resource: humeo://layouts (JSON listing of the 5 layouts).
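
On the wire these are ordinary MCP tools/call requests. An illustrative call to the first stage (the source_path argument name is an assumption, taken from the IngestResult schema below):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "ingest",
    "arguments": { "source_path": "talk.mp4" }
  }
}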

Three interchangeable region detectors

All three emit the same SceneRegions schema, so the layout planner and renderer don't care which one you used:

classify.py   (pixel variance, no ML)
face_detect.py (MediaPipe, local)            ──► SceneRegions ──► SceneClassification ──► LayoutInstruction ──► ffmpeg
vision.py     (multimodal LLM + OCR bboxes)
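
That also makes a fourth detector cheap to add: anything that emits valid SceneRegions slots in ahead of classify_scenes_with_vision. A toy sketch (import path, constructor kwargs, and field types are assumptions based on the schema list below):

from humeo_core.schemas import BoundingBox, SceneRegions  # import path assumed

def detect_regions_stub(scene_id: int) -> SceneRegions:
    """Toy detector: pretends every scene is one centered speaker."""
    person = BoundingBox(x1=0.3, y1=0.1, x2=0.7, y2=0.9,
                         label="person", confidence=0.5)
    return SceneRegions(scene_id=scene_id, person_bbox=person,
                        chart_bbox=None, ocr_text="",
                        raw_reason="stub: hardcoded centered person")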

JSON contracts (non-negotiable)

All tools take and return Pydantic-validated JSON. The contracts live in src/humeo_core/schemas.py:

  • Scene {scene_id, start_time, end_time, keyframe_path?}
  • TranscriptWord {word, start_time, end_time}
  • IngestResult {source_path, duration_sec, scenes[], transcript_words[], keyframes_dir?}
  • SceneClassification {scene_id, layout, confidence, reason}
  • BoundingBox {x1, y1, x2, y2, label, confidence} (all coords normalized)
  • SceneRegions {scene_id, person_bbox?, chart_bbox?, ocr_text, raw_reason}
  • Clip {clip_id, topic, start_time_sec, end_time_sec, viral_hook, virality_score, transcript, suggested_overlay_title, layout?}
  • ClipPlan {source_path, clips[]}
  • LayoutInstruction {clip_id, layout, zoom, person_x_norm, chart_x_norm, split_chart_region?, split_person_region?, split_second_chart_region?, split_second_person_region?, top_band_ratio, focus_stack_order}
  • RenderRequest / RenderResult
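
As a concrete illustration of "strict", here is a plausible shape for the BoundingBox contract and how it fails loudly; the real definitions live in schemas.py, and the ge/le bounds here are an assumption drawn from "all coords normalized":

from pydantic import BaseModel, Field

class BoundingBox(BaseModel):
    x1: float = Field(ge=0.0, le=1.0)   # normalized coordinates
    y1: float = Field(ge=0.0, le=1.0)
    x2: float = Field(ge=0.0, le=1.0)
    y2: float = Field(ge=0.0, le=1.0)
    label: str
    confidence: float = Field(ge=0.0, le=1.0)

# A hallucinated pixel-space bbox is rejected at the boundary:
BoundingBox.model_validate_json(
    '{"x1": 340, "y1": 0, "x2": 910, "y2": 1, "label": "person", "confidence": 0.9}'
)  # raises pydantic.ValidationError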

First-principles decisions (what we intentionally did NOT build)

  • No giant subject-tracker ML. The video format has 5 fixed layouts (with a hard "max 2 items" rule); pixel-level tracking is not needed.
  • No drag-and-highlight UI. An MCP tool is a better "UI" for an agent-first workflow. If a human wants to override, they pass a LayoutInstruction with their own person_x_norm / chart_x_norm / zoom (see the example after this list).
  • No end-to-end videoβ†’video model. The HIVE paper's core insight is that decomposed orchestration beats monolithic generation. We reify that insight as six small composable tools.
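
The manual override mentioned above, as it might look; values are illustrative, the field set comes from the LayoutInstruction schema, and focus_stack_order as a string list is an assumption:

{
  "clip_id": "clip_001",
  "layout": "split_chart_person",
  "zoom": 1.0,
  "person_x_norm": 0.62,
  "chart_x_norm": 0.41,
  "top_band_ratio": 0.5,
  "focus_stack_order": ["chart", "person"]
}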

Extending the pilot

  • Plug a real multimodal model into classify_scenes_with_llm(vision_fn) (sketched after this list).
  • Plug a real reasoning model into select_clips_with_llm(text_fn).
  • Plug a real vision-LLM into detect_regions_with_llm(scenes, vision_fn) to get per-scene bboxes + OCR text, then feed the results back through classify_scenes_with_vision. This is the scene-change → v3 images → LLM+OCR → bbox path; see ../docs/SOLUTIONS.md §4 for rationale.
  • All three hooks enforce strict JSON outputs, so bad model output can't corrupt downstream stages.
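
A minimal sketch of wiring in the first of these hooks; the vision_fn contract, import path, and call shape are assumptions (check the real hook signatures in src/humeo_core/):

from humeo_core.classify import classify_scenes_with_llm  # import path assumed

def vision_fn(keyframe_path: str, prompt: str) -> str:
    """Send a keyframe + prompt to your multimodal model.

    Must return the model's raw text response, which the classifier
    then parses as strict JSON.
    """
    raise NotImplementedError("call your vision provider here")

classifications = classify_scenes_with_llm(scenes, vision_fn)  # call shape assumed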

Testing

python -m pytest

See docs/ARCHITECTURE.md for deeper rationale.

License

MIT