# humeo-core

Reusable-rocket MCP server for long-video → 9:16 shorts.
First-principles design, from the HIVE paper + Bryan's rocket analogy: we don't build doors and windows (general subject-tracker UI, retraining models). We build the container (schemas), landing gear (deterministic local extraction), and five thrusters (the five 9:16 layouts this video format actually uses). Everything else is pluggable.
## The rocket, in one picture

```
┌────────────────────────────────────────┐
│       Control panel (MCP tools)        │   <- any MCP client
└────────────────────┬───────────────────┘
                     │ strict JSON
    ┌────────────────┼───────────────┬──────────────┬──────────────┐
    ▼                ▼               ▼              ▼              ▼
 ingest       classify_scenes   select_clips   plan_layout    render_clip
 (scenes +    (5-way layout     (clip picker,  (5 thrusters,  (ffmpeg compile,
  keyframes +  classifier,       heuristic +    pure filter    dry-run safe)
  transcript)  LLM-ready)        LLM-ready)     math)
                     │
                     ▼
          ┌─────────────────────┐
          │      LayoutKind     │
          │  ─────────────────  │
          │  zoom_call_center   │
          │  sit_center         │
          │  split_chart_person │
          │  split_two_persons  │
          │  split_two_charts   │
          └─────────────────────┘
```
Only the classifier and clip-selector have optional LLM hooks; everything else is deterministic, local, and cheap.
## Why five layouts? (the "max 2 items" rule)

The hard constraint for this format: a short shows at most two on-screen items, where an "item" is a person (a human speaker) or a chart (slide, graph, data visual, screenshare). That gives exactly five recipes:
- `zoom_call_center`: 1 person, tight zoom-call / webcam framing.
- `sit_center`: 1 person, interview / seated framing.
- `split_chart_person`: 1 chart + 1 person, stacked vertically (default: even 50/50 top/bottom, chart on top).
- `split_two_persons`: 2 speakers, stacked vertically.
- `split_two_charts`: 2 charts, stacked vertically.
Because the geometry is bounded, we do NOT need a general subject-tracker ML model or a drag-to-highlight UI. We need five small, correct pieces of crop/compose math, and that is exactly what `src/humeo_core/primitives/layouts.py` is.
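To make "five small pieces of crop/compose math" concrete, here is a minimal sketch of what one stacked split layout boils down to. It is not the actual `layouts.py` code: the `band_crop` / `split_chart_person_filters` names, the 1080x1920 canvas, and the landscape-source assumption are ours.

```python
# Illustrative sketch only, not the real src/humeo_core/primitives/layouts.py.
# Assumes a 1080x1920 output canvas and a landscape source that is wider than
# the band after scaling by height.
OUT_W, OUT_H = 1080, 1920

def band_crop(src_w: int, src_h: int, center_x_norm: float, band_h: int) -> str:
    """ffmpeg scale+crop chain that fills an OUT_W x band_h band while keeping
    the subject centred at center_x_norm (0..1) of the source width."""
    scaled_w = round(src_w * band_h / src_h)      # scale by height to fill the band
    crop_w = min(OUT_W, scaled_w)
    x = round(center_x_norm * scaled_w - crop_w / 2)
    x = max(0, min(x, scaled_w - crop_w))         # clamp the crop window inside the frame
    return f"scale={scaled_w}:{band_h},crop={crop_w}:{band_h}:{x}:0"

def split_chart_person_filters(src_w: int, src_h: int,
                               chart_x_norm: float, person_x_norm: float,
                               top_band_ratio: float = 0.5) -> tuple[str, str]:
    """Crop chains for the chart band (top) and the person band (bottom)."""
    top_h = round(OUT_H * top_band_ratio)
    return (band_crop(src_w, src_h, chart_x_norm, top_h),
            band_crop(src_w, src_h, person_x_norm, OUT_H - top_h))
```

A layout planner then only needs to stack the two bands (for example with ffmpeg's `vstack`) to get the final 9:16 frame; that is the whole "thruster".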
See `TERMINOLOGY.md` for the full glossary of terms used across these docs (subject, crop, band, seam, bbox, layout, etc.).
## Install

```bash
uv venv
uv sync
```
External requirements: `ffmpeg` and `ffprobe` on `PATH`. `scenedetect` requires OpenCV: install `opencv-python-headless` or `opencv-python` alongside it.
## Use it as an MCP server

```bash
humeo-core   # stdio transport (primary console script)
# humeo-mcp  # same entrypoint, kept so existing MCP configs keep working
```
Example Cursor/Claude Desktop config:
```json
{
  "mcpServers": {
    "humeo": { "command": "humeo-core" }
  }
}
```
Tools exposed:
| Tool | Purpose |
|---|---|
| `list_layouts` | Enumerate the 5 supported layouts. |
| `ingest` | Scene detection + keyframe extraction (+ optional transcript). |
| `classify_scenes` | Pixel-heuristic per-scene layout classification. |
| `detect_scene_regions` | Return the bbox prompt + per-scene jobs (the agent runs its own vision model). |
| `classify_scenes_with_vision` | Classify scenes from already-gathered `SceneRegions` bbox JSON + build layout instructions. |
| `select_clips` | Heuristic clip picker over a word-level transcript. |
| `plan_layout` | Return the exact ffmpeg `-filter_complex` for a layout. |
| `build_render_cmd` | Build the ffmpeg command without executing it, so you can review it before spending compute. |
| `render_clip` | Build + run ffmpeg to produce a 9:16 MP4. |
Resource: `humeo://layouts` (JSON listing of the 5 layouts).
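For orientation, here is a minimal sketch of driving those tools from the official MCP Python SDK (the `mcp` package). The tool names come from the table above; the `source_path` argument for `ingest` is illustrative and may not match the real input schema.

```python
# Sketch of an MCP client session against humeo-core over stdio.
# Tool names are from the table above; the ingest arguments are an assumption.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="humeo-core")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            layouts = await session.call_tool("list_layouts", arguments={})
            print(layouts)
            # Each later stage consumes the strict-JSON output of the previous one.
            ingest = await session.call_tool("ingest", arguments={"source_path": "talk.mp4"})
            print(ingest)

asyncio.run(main())
```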
## Three interchangeable region detectors
All three emit the same `SceneRegions` schema, so the layout planner and renderer don't care which one you used:

```
classify.py     (pixel variance, no ML)        ┐
face_detect.py  (MediaPipe, local)             ├──► SceneRegions ──► SceneClassification ──► LayoutInstruction ──► ffmpeg
vision.py       (multimodal LLM + OCR bboxes)  ┘
```
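In code, that interchangeability is just a shared signature: anything that maps a scene to a `SceneRegions` can be swapped in. A small sketch of that idea (the `RegionDetector` protocol name and `detect_all` helper are ours, not the repo's):

```python
# Sketch of the shared detector contract; only Scene and SceneRegions come
# from humeo_core.schemas, the Protocol itself is illustrative.
from typing import Protocol

from humeo_core.schemas import Scene, SceneRegions

class RegionDetector(Protocol):
    def __call__(self, scene: Scene) -> SceneRegions: ...

def detect_all(scenes: list[Scene], detect: RegionDetector) -> list[SceneRegions]:
    """Run whichever detector was chosen: pixel heuristic, MediaPipe, or vision-LLM."""
    return [detect(scene) for scene in scenes]
```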
## JSON contracts (non-negotiable)
All tools take and return Pydantic-validated JSON. The contracts live in `src/humeo_core/schemas.py`:
- `Scene` {scene_id, start_time, end_time, keyframe_path?}
- `TranscriptWord` {word, start_time, end_time}
- `IngestResult` {source_path, duration_sec, scenes[], transcript_words[], keyframes_dir?}
- `SceneClassification` {scene_id, layout, confidence, reason}
- `BoundingBox` {x1, y1, x2, y2, label, confidence} (all coords normalized)
- `SceneRegions` {scene_id, person_bbox?, chart_bbox?, ocr_text, raw_reason}
- `Clip` {clip_id, topic, start_time_sec, end_time_sec, viral_hook, virality_score, transcript, suggested_overlay_title, layout?}
- `ClipPlan` {source_path, clips[]}
- `LayoutInstruction` {clip_id, layout, zoom, person_x_norm, chart_x_norm, split_chart_region?, split_person_region?, split_second_chart_region?, split_second_person_region?, top_band_ratio, focus_stack_order}
- `RenderRequest` / `RenderResult`
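To make the contract concrete, here is a sketch of two of these models reconstructed from the field lists above; the real definitions in `src/humeo_core/schemas.py` may differ in field types, defaults, and validators.

```python
# Reconstructed from the field lists above; not copied from schemas.py.
from pydantic import BaseModel, Field

class BoundingBox(BaseModel):
    # All coordinates are normalized to [0, 1] relative to the frame.
    x1: float = Field(ge=0.0, le=1.0)
    y1: float = Field(ge=0.0, le=1.0)
    x2: float = Field(ge=0.0, le=1.0)
    y2: float = Field(ge=0.0, le=1.0)
    label: str
    confidence: float = Field(ge=0.0, le=1.0)

class SceneRegions(BaseModel):
    scene_id: int                        # type assumed; could be a string id
    person_bbox: BoundingBox | None = None
    chart_bbox: BoundingBox | None = None
    ocr_text: str = ""
    raw_reason: str = ""
```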
## First-principles decisions (what we intentionally did NOT build)
- No giant subject-tracker ML. The video format has 5 fixed layouts (with a hard "max 2 items" rule); pixel-level tracking is not needed.
- No drag-and-highlight UI. An MCP tool is a better "UI" for an agent-first workflow. If a human wants to override, they pass a `LayoutInstruction` with their own `person_x_norm` / `chart_x_norm` / `zoom`.
- No end-to-end video→video model. The HIVE paper's core insight is that decomposed orchestration beats monolithic generation. We reify that insight as six small composable tools.
## Extending the pilot
- Plug a real multimodal model into `classify_scenes_with_llm(vision_fn)`.
- Plug a real reasoning model into `select_clips_with_llm(text_fn)`.
- Plug a real vision-LLM into `detect_regions_with_llm(scenes, vision_fn)` to get per-scene bboxes + OCR text, then feed the results back through `classify_scenes_with_vision`. This is the scene-change → v3 images → LLM+OCR → bbox path; see `../docs/SOLUTIONS.md` §4 for rationale (and the sketch after this list).
- All of these hooks enforce strict JSON outputs, so bad model output can't corrupt downstream stages.
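A hedged sketch of what such a `vision_fn` could look like. Only the `detect_regions_with_llm(scenes, vision_fn)` entry point comes from the repo; the `(keyframe_path, prompt)` signature, the OpenAI client, and the model name are assumptions. Any callable that returns `SceneRegions`-shaped JSON (person_bbox / chart_bbox / ocr_text) would do.

```python
# Illustrative vision_fn plug-in; the signature and the OpenAI-backed call are
# assumptions, not the repo's contract.
import base64

from openai import OpenAI

client = OpenAI()

def vision_fn(keyframe_path: str, prompt: str) -> str:
    """Send one keyframe plus the bbox prompt, return the model's raw JSON text."""
    with open(keyframe_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```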
## Testing

```bash
python -m pytest
```
See `docs/ARCHITECTURE.md` for deeper rationale.
## License
MIT