## Architecture: Reusable Rocket
"We don't need to build the door or windows β just a container with landing gear and thrusters that move in different directions." β Bryan
That analogy maps exactly onto this MCP:
| Rocket part | Codebase | Purpose |
|---|---|---|
| Container | `src/humeo_core/schemas.py` | Strict JSON contracts every stage reads/writes. |
| Landing gear | `src/humeo_core/primitives/ingest.py` | Deterministic local extraction (scenes, keyframes, transcript). |
| Thrusters (×5) | `src/humeo_core/primitives/layouts.py` | Five fixed 9:16 crop/compose recipes (max 2 on-screen items). |
| Pilot | `primitives/classify.py` + `primitives/select_clips.py` | Heuristic + LLM-ready decision makers. |
| Compiler | `src/humeo_core/primitives/compile.py` | Deterministic ffmpeg assembly. |
| Control panel | `src/humeo_core/server.py` | MCP tool surface exposing every primitive to agents and clients. |

## First-principles reasoning
The HIVE paper's core insight is that good short-video editing requires staged reasoning with strict intermediate artifacts, not a single giant model call. Three consequences flow from that:
1. **Extraction must be local and deterministic.** No model call should ever touch raw video bytes. `ingest.py` runs ffprobe + PySceneDetect + ffmpeg + (optional) faster-whisper. Everything it emits is JSON or a file path.
2. **Reasoning must be decomposed into narrow sub-tasks.** Classifying a scene's layout is a completely different task from selecting a viral clip. Each has its own schema, its own prompt, its own validation. This is why `primitives/` is five files instead of one.
3. **Every model call must emit schema-validated JSON.** Free-form model output is never allowed to enter the pipeline. `classify_scenes_with_llm` and `select_clips_with_llm` both `model_validate(...)` the raw output before returning; parse failures degrade gracefully to `SIT_CENTER` + low confidence, not crashes (sketched below).
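A minimal sketch of that validate-or-degrade pattern, assuming pydantic v2; the `SceneClassification` model and function name here are illustrative stand-ins for the real `schemas.py` contracts:

```python
import json
from pydantic import BaseModel, ValidationError

class SceneClassification(BaseModel):  # hypothetical stand-in schema
    layout: str
    confidence: float

def parse_llm_scene_output(raw: str) -> SceneClassification:
    """Validate raw model output; never let free-form text through."""
    try:
        return SceneClassification.model_validate(json.loads(raw))
    except (json.JSONDecodeError, ValidationError):
        # Parse failures degrade, they don't crash: fall back to the
        # safest layout with low confidence, as described above.
        return SceneClassification(layout="sit_center", confidence=0.1)
```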
## Why only five layouts?
The hard rule for this format: a short shows at most two on-screen items, where an "item" is a person or a chart. That gives exactly five recipes, all implemented as pure functions from `LayoutInstruction` to an ffmpeg filtergraph string in `layouts.py`:
| Layout | Items | Recipe |
|---|---|---|
| `zoom_call_center` | 1 person | Tight centered 9:16 crop (zoom ≥ 1.25). |
| `sit_center` | 1 person | Wider centered 9:16 crop. |
| `split_chart_person` | 1 chart + 1 person | Source partitioned L/R by bboxes, stacked top/bottom. |
| `split_two_persons` | 2 persons | L/R speakers, stacked top/bottom. |
| `split_two_charts` | 2 charts | L/R charts, stacked top/bottom. |
A general subject-tracker ML model is orders of magnitude more expensive
and less reliable than five hand-written crop recipes. If a new geometry
ever shows up in future source videos, adding a sixth thruster is strictly additive: write a new `plan_*` function, register it in `_DISPATCH`, and add an enum variant. No existing code has to change.
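A hypothetical sketch of that sixth thruster; every name below is illustrative, following the `plan_*`/`_DISPATCH` pattern described above rather than the actual module:

```python
from enum import Enum
from typing import Any, Callable

class LayoutKind(str, Enum):
    # ... the five existing variants elided ...
    PIP_PERSON_OVER_CHART = "pip_person_over_chart"  # 1. new enum variant

def plan_pip_person_over_chart(instr: Any) -> str:
    """2. New pure planner: LayoutInstruction in, ffmpeg filtergraph out."""
    ...

_DISPATCH: dict[LayoutKind, Callable[..., str]] = {
    # ... the five existing entries stay untouched ...
    LayoutKind.PIP_PERSON_OVER_CHART: plan_pip_person_over_chart,  # 3. register
}
```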
## 9:16 layout math
Source is assumed 16:9 (1920×1080 by default, but probed per-clip). Target is 1080×1920. For each layout:
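Per-clip probing needs nothing beyond stock ffprobe; one way to do it (illustrative, not necessarily `ingest.py`'s exact invocation):

```python
import json
import subprocess

def probe_dimensions(path: str) -> tuple[int, int]:
    """Return (width, height) of the first video stream via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=width,height", "-of", "json", path],
        capture_output=True, check=True, text=True,
    ).stdout
    stream = json.loads(out)["streams"][0]
    return stream["width"], stream["height"]
```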
### `zoom_call_center` and `sit_center`

Standard centered aspect-ratio crop to 9:16, then scale to 1080×1920:

```
crop=cw:ch:x:y,scale=1080:1920:flags=lanczos,setsar=1[vout]
```
`cw`, `ch` are the largest 9:16 window that fits in the source, divided by `zoom`. `x` centers the window on `person_x_norm`; `y` centers it vertically (0.5). Dimensions are rounded to even values so libx264 is happy, and the window is clamped inside the source so a high `person_x_norm` never crops outside the frame.
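In code, the window math looks roughly like this (a sketch assuming a 16:9 source; the function name and exact rounding are illustrative):

```python
def crop_window(src_w: int, src_h: int, zoom: float, person_x_norm: float):
    """Return (cw, ch, x, y) for the crop=cw:ch:x:y filter above."""
    # Largest 9:16 window in a 16:9 source is full height; shrink by zoom,
    # then round down to even values so libx264 accepts the dimensions.
    ch = int(src_h / zoom) // 2 * 2
    cw = int(ch * 9 / 16) // 2 * 2
    # Center horizontally on the person, vertically on the middle (0.5),
    # then clamp so the window never leaves the frame.
    x = max(0, min(int(person_x_norm * src_w - cw / 2), src_w - cw))
    y = (src_h - ch) // 2
    return cw, ch, x, y

# 1920x1080 source, zoom=1.25, person centered:
# crop_window(1920, 1080, 1.25, 0.5) == (486, 864, 717, 108)
```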
### Split layouts (`split_chart_person`, `split_two_persons`, `split_two_charts`)

All three splits share one recipe; only the items differ:
- **Horizontal partition.** The source is cut at a single vertical seam so the two source strips are complementary (no overlap, no gap). When both bboxes are set (Gemini vision), the seam is the midpoint between `left.x2` and `right.x1`. Otherwise the seam defaults to either an even 50/50 (two-of-a-kind splits) or a 2/3 | 1/3 split (legacy `split_chart_person` fallback).
- **Vertical crop.** Each strip's vertical extent comes from the corresponding bbox when provided, so each item fills its output band instead of being lost in full-height source context.
- **Cover-scale to the band.** Each strip is scaled with `force_original_aspect_ratio=increase` + center-cropped to the band dimensions. Bands are always fully painted; no letterbox bars.
- **Stack.** The two branches produced by `split=2` are `vstack`-ed into the final 1080×1920.
Band heights are controlled by `LayoutInstruction.top_band_ratio`, which defaults to 0.5 (an even 50/50, the symmetric look Bryan asked for after the uneven Cathy Wood shorts). The legacy 60/40 is still reachable by setting `top_band_ratio=0.6`. Stack order (for `split_chart_person`) is controlled by `focus_stack_order`: chart-on-top (default) or person-on-top.
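Putting the four steps together, here is an illustrative filtergraph for `split_two_persons` on a 1920×1080 source with a 50/50 seam and the default `top_band_ratio=0.5`; the exact strings `layouts.py` emits may differ:

```
split=2[a][b];
[a]crop=960:1080:0:0,scale=1080:960:force_original_aspect_ratio=increase,crop=1080:960[top];
[b]crop=960:1080:960:0,scale=1080:960:force_original_aspect_ratio=increase,crop=1080:960[bot];
[top][bot]vstack,setsar=1[vout]
```

Each 960×1080 strip cover-scales to 1080×1215, and the center crop trims it to its 1080×960 band, so both bands are fully painted with no letterboxing.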
## Extensibility story
- **Smarter classifier:** implement `LLMVisionFn` with any multimodal model and pass it to `classify_scenes_with_llm`. The fallback heuristic stays available for offline runs and tests (see the sketch after this list).
- **Smarter clip selector:** same pattern, `LLMTextFn` → `select_clips_with_llm`.
- **New layout:** add a `plan_*` planner, register it in `_DISPATCH`, add a `LayoutKind` variant. Tests in `test_layouts.py` automatically iterate over all `LayoutKind`s, so the dispatch coverage test catches a missing registration immediately.
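A minimal sketch of that injection pattern, assuming the primitive takes the callable as a keyword argument (the `llm_vision_fn` parameter name and `gemini_vision` signature are hypothetical; only `LLMVisionFn` and `classify_scenes_with_llm` come from this document):

```python
from humeo_core.primitives.classify import classify_scenes_with_llm

def gemini_vision(prompt: str, image_paths: list[str]) -> str:
    """Call any multimodal model here; return its raw text output."""
    ...

scenes = [...]  # scene payloads produced by ingest (placeholder)

# The primitive schema-validates whatever the callable returns, so swapping
# models never touches pipeline code; omit the callable for the heuristic.
classified = classify_scenes_with_llm(scenes, llm_vision_fn=gemini_vision)
```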
## What we intentionally did NOT build
- A drag-and-highlight subject-selector UI.
- A general ML subject-tracker.
- A monolithic video-in-video-out model.
- Any network calls in the core library. The MCP server is stdio-only; the CLI runs fully offline.
This keeps the rocket reusable: the same primitives power the MCP server, the CLI, a Python library, and a web UI someday, if one is ever warranted.