# Architecture: Reusable Rocket

> *"We don't need to build the door or windows, just a container with landing
> gear and thrusters that move in different directions."*
> -- Bryan

That analogy maps exactly onto this MCP:
| Rocket part     | Codebase                                                | Purpose                                                         |
| --------------- | ------------------------------------------------------- | --------------------------------------------------------------- |
| Container       | `src/humeo_core/schemas.py`                             | Strict JSON contracts every stage reads/writes.                 |
| Landing gear    | `src/humeo_core/primitives/ingest.py`                   | Deterministic local extraction (scenes, keyframes, transcript). |
| Thrusters (×5)  | `src/humeo_core/primitives/layouts.py`                  | Five fixed 9:16 crop/compose recipes (max 2 on-screen items).   |
| Pilot           | `primitives/classify.py` + `primitives/select_clips.py` | Heuristic + LLM-ready decision makers.                          |
| Compiler        | `src/humeo_core/primitives/compile.py`                  | Deterministic ffmpeg assembly.                                  |
| Control panel   | `src/humeo_core/server.py`                              | MCP tools exposing every primitive.                             |
| Control surface | `src/humeo_core/server.py`                              | MCP tool surface for agents and clients.                        |
## First-principles reasoning

The HIVE paper's core insight is that good short-video editing requires
**staged reasoning with strict intermediate artifacts**, not a single
giant model call. Three consequences flow from that:

1. **Extraction must be local and deterministic.** No model call should
   ever touch raw video bytes. `ingest.py` runs ffprobe + PySceneDetect
   + ffmpeg + (optional) faster-whisper. Everything it emits is JSON or
   a file path.
2. **Reasoning must be decomposed into narrow sub-tasks.** Classifying a
   scene's layout is a completely different task from selecting a viral
   clip. Each has its own schema, its own prompt, its own validation.
   This is why `primitives/` is five files instead of one.
3. **Every model call must emit schema-validated JSON.** Free-form model
   output is not allowed to enter the pipeline. `classify_scenes_with_llm`
   and `select_clips_with_llm` both `model_validate(...)` the raw output
   before returning; parse failures degrade gracefully to `SIT_CENTER` +
   low confidence, not crashes.
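The third point can be illustrated with a minimal sketch. It is not the project's code: the document says the real functions use `model_validate(...)`, whereas this version swaps in standard-library `json` plus a dataclass so it runs without dependencies. The `SceneClassification` type, the `parse_classification` helper, and the fallback confidence of 0.1 are all assumptions made for illustration; only the five layout names and the `SIT_CENTER` fallback come from the text above.

```python
import json
from dataclasses import dataclass
from enum import Enum


class LayoutKind(str, Enum):
    # The five layout names are from this document; the enum shape is assumed.
    ZOOM_CALL_CENTER = "zoom_call_center"
    SIT_CENTER = "sit_center"
    SPLIT_CHART_PERSON = "split_chart_person"
    SPLIT_TWO_PERSONS = "split_two_persons"
    SPLIT_TWO_CHARTS = "split_two_charts"


@dataclass(frozen=True)
class SceneClassification:
    # Hypothetical result type standing in for the real schema.
    layout: LayoutKind
    confidence: float


# Assumed fallback: SIT_CENTER with an arbitrary low confidence.
FALLBACK = SceneClassification(LayoutKind.SIT_CENTER, 0.1)


def parse_classification(raw: str) -> SceneClassification:
    """Validate raw model output; return FALLBACK instead of raising."""
    try:
        data = json.loads(raw)
        layout = LayoutKind(data["layout"])
        confidence = float(data["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError("confidence out of range")
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, unknown layout, missing key, or wrong type.
        return FALLBACK
    return SceneClassification(layout, confidence)
```

The point of the pattern is that downstream stages only ever see a value that already passed validation, so a bad model response costs quality on one scene rather than failing the run.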
## Why only five layouts?

The hard rule for this format: **a short shows at most two on-screen
items**, where an "item" is a `person` or a `chart`. That gives exactly
five recipes, all implemented as pure functions from
`LayoutInstruction` to an ffmpeg filtergraph string in `layouts.py`:

| Layout               | Items            | Recipe                                     |
| -------------------- | ---------------- | ------------------------------------------ |
| `zoom_call_center`   | 1 person         | tight centered 9:16 crop (zoom ≥ 1.25).    |
| `sit_center`         | 1 person         | wider centered 9:16 crop.                  |
| `split_chart_person` | 1 chart + person | source partitioned L/R by bboxes, stacked. |
| `split_two_persons`  | 2 persons        | L/R speakers, stacked top/bottom.          |
| `split_two_charts`   | 2 charts         | L/R charts, stacked top/bottom.            |

A general subject-tracker ML model is orders of magnitude more expensive
and less reliable than five hand-written crop recipes. If a new geometry
ever shows up in future source videos, adding a sixth thruster is
strictly additive: write a new `plan_*` function, add it to `_DISPATCH`,
add an enum variant. No existing code has to change.
## 9:16 layout math

Source is assumed 16:9 (1920×1080 by default, but probed per-clip).
Target is 1080×1920. For each layout:

### `zoom_call_center` and `sit_center`

Standard centered aspect-ratio crop to 9:16, then scale to 1080×1920:

```
crop=cw:ch:x:y,scale=1080:1920:flags=lanczos,setsar=1[vout]
```

`cw` and `ch` are the dimensions of the largest 9:16 window that fits in
the source, divided by `zoom`. `x` centers the window horizontally on
`person_x_norm`; `y` centers it vertically at 0.5.
Dimensions are rounded to even values so libx264 is happy. The window is
clamped inside the source so a high `person_x_norm` never crops outside.
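The arithmetic described above can be sketched as follows. This is an illustration, not the code in `layouts.py`: the function names are hypothetical, rounding *down* to even is an assumption (the text only says "rounded to even values"), and rejecting `zoom < 1` is an added guard.

```python
from typing import Tuple


def _even_floor(value: float) -> int:
    """Round down to an even integer (minimum 2). Rounding direction is assumed."""
    return max(2, int(value) // 2 * 2)


def centered_crop(src_w: int, src_h: int, zoom: float = 1.0,
                  person_x_norm: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (cw, ch, x, y) for a 9:16 crop window inside the source."""
    if zoom < 1.0:
        raise ValueError("zoom must be >= 1.0 so the window fits in the source")
    target_ar = 9 / 16
    # Largest 9:16 window that fits in the source.
    if src_w / src_h >= target_ar:
        base_w, base_h = src_h * target_ar, float(src_h)
    else:
        base_w, base_h = float(src_w), src_w / target_ar
    cw = _even_floor(base_w / zoom)
    ch = _even_floor(base_h / zoom)
    # Center horizontally on the person, vertically on the frame middle.
    x = round(person_x_norm * src_w - cw / 2)
    y = round(0.5 * src_h - ch / 2)
    # Clamp so the window never leaves the source frame.
    x = min(max(x, 0), src_w - cw)
    y = min(max(y, 0), src_h - ch)
    return cw, ch, x, y


def crop_filter(src_w: int, src_h: int, zoom: float = 1.0,
                person_x_norm: float = 0.5) -> str:
    """Format the crop window as the filter chain shown above."""
    cw, ch, x, y = centered_crop(src_w, src_h, zoom, person_x_norm)
    return f"crop={cw}:{ch}:{x}:{y},scale=1080:1920:flags=lanczos,setsar=1[vout]"
```

For a 1920×1080 source with default arguments this sketch yields `crop=606:1080:657:0,...`: the ideal width of 607.5 is floored to 606, and the window is centered at x = 960. With `person_x_norm=1.0` the clamp pulls `x` back to 1314 so the window ends exactly at the right edge.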
### Split layouts (`split_chart_person`, `split_two_persons`, `split_two_charts`)

All three splits share one recipe; only the items differ:

1. **Horizontal partition.** The source is cut at a single vertical seam
   so the two source strips are **complementary** (no overlap, no gap).
   When both bboxes are set (Gemini vision), the seam is the midpoint
   between `left.x2` and `right.x1`. Otherwise the seam defaults to
   either an even 50/50 (two-of-a-kind splits) or a 2/3 | 1/3 split
   (legacy `split_chart_person` fallback).
2. **Vertical crop.** Each strip's vertical extent comes from the
   corresponding bbox when provided, so each item **fills** its output
   band instead of being lost in full-height source context.
3. **Cover-scale to the band.** Each strip is scaled with
   `force_original_aspect_ratio=increase` + center-cropped to the band
   dimensions. Bands are always fully painted; no letterbox bars.
4. **Stack.** Two branches produced by `split=2` are `vstack`-ed into
   the final 1080×1920.
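Step 1 is the part most easily gotten wrong, so here is a small sketch of the seam rule. The bbox layout `(x1, y1, x2, y2)` in normalized coordinates, the function names, and snapping the seam to an even pixel are assumptions for illustration; the midpoint rule and the 50/50 and 2/3 defaults are from the list above.

```python
from typing import Optional, Tuple

# Assumed bbox convention: (x1, y1, x2, y2), each normalized to 0..1.
BBox = Tuple[float, float, float, float]


def seam_x_norm(left: Optional[BBox], right: Optional[BBox],
                default: float = 0.5) -> float:
    """Seam position as a fraction of source width."""
    if left is not None and right is not None:
        # Midpoint between the left item's right edge and the right item's left edge.
        return (left[2] + right[0]) / 2
    # No bboxes: 0.5 for two-of-a-kind splits, 2/3 for the legacy chart/person split.
    return default


def partition(src_w: int, seam_norm: float) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((x, width), (x, width)) for the left and right strips."""
    seam_px = round(seam_norm * src_w / 2) * 2  # snap to an even pixel (assumed)
    seam_px = min(max(seam_px, 2), src_w - 2)   # keep both strips non-empty
    # The right strip starts exactly where the left one ends: no overlap, no gap.
    return (0, seam_px), (seam_px, src_w - seam_px)
```

Deriving both strips from a single seam value is what guarantees they are complementary; computing each strip's width independently from its own bbox would allow overlap or a dropped column.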
**Band heights** are controlled by `LayoutInstruction.top_band_ratio`,
which defaults to **0.5** (an even 50/50, the symmetric look Bryan asked
for after the uneven Cathy Wood shorts). Legacy 60/40 is still reachable
by setting `top_band_ratio=0.6`.

**Stack order** (for `split_chart_person`) is controlled by
`focus_stack_order`: chart-on-top (default) or person-on-top.
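Putting the band heights together with the cover-scale and stack steps, a filtergraph for a split layout might be assembled roughly like this. The helper names, the stream labels (`a`, `b`, `top`, `bot`), and the even-pixel rounding of the top band are assumptions; the ffmpeg filters used (`split`, `crop`, `scale` with `force_original_aspect_ratio=increase`, `vstack`, `setsar`) are the ones named in this document.

```python
from typing import Tuple

Crop = Tuple[int, int, int, int]  # (width, height, x, y) in source pixels


def band_heights(top_band_ratio: float = 0.5, out_h: int = 1920) -> Tuple[int, int]:
    """Split the output height into (top, bottom) bands; top is kept even."""
    top = round(out_h * top_band_ratio / 2) * 2
    return top, out_h - top


def _branch(label_in: str, crop: Crop, out_w: int, band_h: int, label_out: str) -> str:
    cw, ch, x, y = crop
    return (
        f"[{label_in}]crop={cw}:{ch}:{x}:{y},"
        # Scale up until the band is covered, then center-crop the overflow.
        f"scale={out_w}:{band_h}:force_original_aspect_ratio=increase,"
        f"crop={out_w}:{band_h}[{label_out}]"
    )


def split_filtergraph(top_crop: Crop, bottom_crop: Crop,
                      top_band_ratio: float = 0.5,
                      out_w: int = 1080, out_h: int = 1920) -> str:
    """Build a two-band vertical stack from two source crops."""
    top_h, bot_h = band_heights(top_band_ratio, out_h)
    return ";".join([
        "[0:v]split=2[a][b]",
        _branch("a", top_crop, out_w, top_h, "top"),
        _branch("b", bottom_crop, out_w, bot_h, "bot"),
        "[top][bot]vstack=inputs=2,setsar=1[vout]",
    ])
```

In this sketch, stack order is just a matter of which crop is passed as `top_crop`, so a `focus_stack_order` switch would only swap the two arguments. With `top_band_ratio=0.6` the bands come out as 1152 and 768 pixels.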
## Extensibility story

- **Smarter classifier:** implement `LLMVisionFn` with any multimodal
  model and pass it to `classify_scenes_with_llm`. The fallback heuristic
  stays available for offline runs and tests.
- **Smarter clip selector:** same pattern, `LLMTextFn` → `select_clips_with_llm`.
- **New layout:** add a `plan_*` planner, register in `_DISPATCH`, add a
  `LayoutKind` variant. Tests in `test_layouts.py` automatically iterate
  over all `LayoutKind`s, so the dispatch coverage test will catch a
  missing registration immediately.
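The dispatch coverage idea from the last bullet could look something like the following. This is a guess at the shape of such a test, not the contents of `test_layouts.py`; `missing_registrations` and `check_dispatch_coverage` are hypothetical helpers, and the dispatch table is passed in as an argument so the sketch stays self-contained.

```python
from enum import Enum
from typing import Dict, List


class LayoutKind(str, Enum):
    # Layout names from this document; the enum shape is assumed.
    ZOOM_CALL_CENTER = "zoom_call_center"
    SIT_CENTER = "sit_center"
    SPLIT_CHART_PERSON = "split_chart_person"
    SPLIT_TWO_PERSONS = "split_two_persons"
    SPLIT_TWO_CHARTS = "split_two_charts"


def missing_registrations(dispatch: Dict[LayoutKind, object]) -> List[LayoutKind]:
    """List every enum variant that has no planner registered."""
    return [kind for kind in LayoutKind if kind not in dispatch]


def check_dispatch_coverage(dispatch: Dict[LayoutKind, object]) -> None:
    """Fail loudly if any LayoutKind lacks a planner."""
    missing = missing_registrations(dispatch)
    assert not missing, f"unregistered layouts: {[k.value for k in missing]}"
```

Because the check iterates the enum rather than a hard-coded list, adding a variant without registering a planner fails the test without anyone having to update the test itself.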
## What we intentionally did NOT build

- Drag-and-highlight subject-selector UI.
- A general ML subject-tracker.
- A monolithic video-in-video-out model.
- Any network calls in the core library. The MCP server is stdio-only;
  the CLI runs fully offline.

This keeps the rocket **reusable**: the same primitives power the MCP
server, the CLI, a Python library, and potentially a web UI if that's
ever warranted.