# Architecture — Reusable Rocket

> *"We don't need to build the door or windows — just a container with landing
> gear and thrusters that move in different directions."*
> — Bryan

That analogy maps exactly onto this MCP:

| Rocket part     | Codebase                                                | Purpose                                                         |
| --------------- | ------------------------------------------------------- | --------------------------------------------------------------- |
| Container       | `src/humeo_core/schemas.py`                             | Strict JSON contracts every stage reads/writes.                 |
| Landing gear    | `src/humeo_core/primitives/ingest.py`                   | Deterministic local extraction (scenes, keyframes, transcript). |
| Thrusters (×5)  | `src/humeo_core/primitives/layouts.py`                  | Five fixed 9:16 crop/compose recipes (max 2 on-screen items).   |
| Pilot           | `primitives/classify.py` + `primitives/select_clips.py` | Heuristic + LLM-ready decision makers.                          |
| Compiler        | `src/humeo_core/primitives/compile.py`                  | Deterministic ffmpeg assembly.                                  |
| Control panel   | `src/humeo_core/server.py`                              | MCP tools exposing every primitive.                             |
| Control surface | `src/humeo_core/server.py`                              | MCP tool surface for agents and clients.                        |

## First-principles reasoning

The HIVE paper's core insight is that good short-video editing requires
**staged reasoning with strict intermediate artifacts**, not a single giant
model call. Three consequences flow from that:

1. **Extraction must be local and deterministic.** No model call should ever
   touch raw video bytes. `ingest.py` runs ffprobe + PySceneDetect + ffmpeg +
   (optional) faster-whisper. Everything it emits is JSON or a file path.
2. **Reasoning must be decomposed into narrow sub-tasks.** Classifying a
   scene's layout is a completely different task from selecting a viral clip.
   Each has its own schema, its own prompt, its own validation. This is why
   `primitives/` is five files instead of one.
3. **Every model call must emit schema-validated JSON.** Free-form model
   output is not allowed to enter the pipeline.
   `classify_scenes_with_llm` and `select_clips_with_llm` both run
   `model_validate(...)` on the raw output before returning; parse failures
   degrade gracefully to `SIT_CENTER` + low confidence, not crashes.

## Why only five layouts?

The hard rule for this format: **a short shows at most two on-screen items**,
where an "item" is a `person` or a `chart`. That gives exactly five recipes,
all implemented as pure functions from `LayoutInstruction` to an ffmpeg
filtergraph string in `layouts.py`:

| Layout               | Items              | Recipe                                     |
| -------------------- | ------------------ | ------------------------------------------ |
| `zoom_call_center`   | 1 person           | tight centered 9:16 crop (zoom ≥ 1.25).    |
| `sit_center`         | 1 person           | wider centered 9:16 crop.                  |
| `split_chart_person` | 1 chart + 1 person | source partitioned L/R by bboxes, stacked. |
| `split_two_persons`  | 2 persons          | L/R speakers, stacked top/bottom.          |
| `split_two_charts`   | 2 charts           | L/R charts, stacked top/bottom.            |

A general subject-tracker ML model is orders of magnitude more expensive and
less reliable than five hand-written crop recipes. If a new geometry ever
shows up in future source videos, adding a sixth thruster is strictly
additive: write a new `plan_*` function, add it to `_DISPATCH`, add an enum
variant. No existing planner has to change.

## 9:16 layout math

Source is assumed 16:9 (1920×1080 by default, but probed per-clip). Target is
1080×1920. For each layout:

### `zoom_call_center` and `sit_center`

Standard centered aspect-ratio crop to 9:16, then scale to 1080×1920:

```
crop=cw:ch:x:y,scale=1080:1920:flags=lanczos,setsar=1[vout]
```

`cw` and `ch` are the dimensions of the largest 9:16 window that fits in the
source, divided by `zoom`. `x` and `y` position the window so that it is
centered horizontally on `person_x_norm` and vertically on 0.5. Dimensions are
rounded to even values, which libx264 requires for its default pixel format.
The window is clamped inside the source, so a high `person_x_norm` never
pushes the crop outside the frame.

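
The crop arithmetic above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `layouts.py`: the function name `centered_crop_filter`, the helper `even`, and the choice to round *down* to even values are all hypothetical; only the filter string shape, `zoom`, and `person_x_norm` come from the text.

```python
def even(value: float) -> int:
    """Round down to the nearest even integer (assumed rounding rule), minimum 2."""
    return max(2, int(value) // 2 * 2)


def centered_crop_filter(
    src_w: int,
    src_h: int,
    zoom: float = 1.0,
    person_x_norm: float = 0.5,
    out_w: int = 1080,
    out_h: int = 1920,
) -> str:
    """Build a crop+scale filter string for a centered 9:16 crop (hypothetical helper)."""
    target_ar = out_w / out_h  # 9:16 expressed as width / height

    # Largest 9:16 window that fits inside the source frame.
    base_w = min(src_w, src_h * target_ar)
    base_h = base_w / target_ar

    # Zooming in shrinks the window; keep both dimensions even.
    cw = even(base_w / zoom)
    ch = even(base_h / zoom)

    # Center horizontally on the subject and vertically on the frame midline.
    x = round(person_x_norm * src_w - cw / 2)
    y = round(0.5 * src_h - ch / 2)

    # Clamp so the window never leaves the source frame.
    x = min(max(x, 0), src_w - cw)
    y = min(max(y, 0), src_h - ch)

    return (
        f"crop={cw}:{ch}:{x}:{y},"
        f"scale={out_w}:{out_h}:flags=lanczos,setsar=1[vout]"
    )


if __name__ == "__main__":
    # 1920x1080 source, no zoom, subject centered.
    print(centered_crop_filter(1920, 1080))
    # crop=606:1080:657:0,scale=1080:1920:flags=lanczos,setsar=1[vout]
```

With `zoom=1.25` and `person_x_norm=0.95` on the same source, the unclamped `x` would be 1581, and the clamp pulls it back to 1434 so the 486-pixel-wide window stays inside the 1920-pixel frame.
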
### Split layouts (`split_chart_person`, `split_two_persons`, `split_two_charts`)

All three splits share one recipe; only the items differ:

1. **Horizontal partition.** The source is cut at a single vertical seam so
   the two source strips are **complementary** (no overlap, no gap). When both
   bboxes are set (Gemini vision), the seam is the midpoint between `left.x2`
   and `right.x1`. Otherwise the seam defaults to either an even 50/50
   (two-of-a-kind splits) or a 2/3 | 1/3 split (legacy `split_chart_person`
   fallback).
2. **Vertical crop.** Each strip's vertical extent comes from the
   corresponding bbox when provided, so each item **fills** its output band
   instead of being lost in full-height source context.
3. **Cover-scale to the band.** Each strip is scaled with
   `force_original_aspect_ratio=increase` + center-cropped to the band
   dimensions. Bands are always fully painted; no letterbox bars.
4. **Stack.** Two branches produced by `split=2` are `vstack`-ed into the
   final 1080×1920.

**Band heights** are controlled by `LayoutInstruction.top_band_ratio`, which
defaults to **0.5** (even 50/50, the symmetric look Bryan asked for after the
uneven Cathy Wood shorts). Legacy 60/40 is still reachable by setting
`top_band_ratio=0.6`.

**Stack order** (for `split_chart_person`) is controlled by
`focus_stack_order`: chart-on-top (default) or person-on-top.

## Extensibility story

- **Smarter classifier:** implement `LLMVisionFn` with any multimodal model
  and pass it to `classify_scenes_with_llm`. The fallback heuristic stays
  available for offline runs and tests.
- **Smarter clip selector:** same pattern, `LLMTextFn` →
  `select_clips_with_llm`.
- **New layout:** add a `plan_*` planner, register in `_DISPATCH`, add a
  `LayoutKind` variant. Tests in `test_layouts.py` automatically iterate over
  all `LayoutKind`s, so the dispatch coverage test will catch a missing
  registration immediately.

## What we intentionally did NOT build

- Drag-and-highlight subject-selector UI.
- A general ML subject-tracker.
- A monolithic video-in-video-out model.
- Any network calls in the core library. The MCP server is stdio-only; the
  CLI runs fully offline.

This keeps the rocket **reusable**: the same primitives power the MCP server,
the CLI, a Python library, and, if that's ever warranted, a web UI.
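
As a concrete companion to the split-layout rules in the 9:16 layout math section, the seam and band-height arithmetic can be sketched as below. This is an illustration under assumptions, not the code in `layouts.py`: the `BBox` tuple shape, the function names `seam_x_norm` and `band_heights`, the reading of "2/3 | 1/3" as the left strip taking two thirds, and the even-rounding of band heights are all hypothetical. Only the midpoint rule, the fallback ratios, and the `top_band_ratio` default of 0.5 come from the text.

```python
from typing import Optional, Tuple

# Assumed bbox shape: (x1, y1, x2, y2), all normalized to 0..1.
BBox = Tuple[float, float, float, float]


def seam_x_norm(
    left: Optional[BBox],
    right: Optional[BBox],
    two_of_a_kind: bool,
) -> float:
    """Normalized x position of the single vertical seam (hypothetical helper)."""
    if left is not None and right is not None:
        # Midpoint between the left item's right edge and the right item's left edge.
        return (left[2] + right[0]) / 2
    # Fallbacks when bboxes are missing: even split for two-of-a-kind layouts,
    # otherwise the legacy chart/person split (assumed: left strip gets 2/3).
    return 0.5 if two_of_a_kind else 2 / 3


def band_heights(top_band_ratio: float = 0.5, out_h: int = 1920) -> Tuple[int, int]:
    """Pixel heights of the top and bottom bands; they always sum to out_h."""
    top = round(out_h * top_band_ratio) // 2 * 2  # keep the top band even
    return top, out_h - top


if __name__ == "__main__":
    chart = (0.0, 0.1, 0.5, 0.9)
    person = (0.75, 0.2, 1.0, 0.8)
    print(seam_x_norm(chart, person, two_of_a_kind=False))  # 0.625
    print(band_heights())     # (960, 960)
    print(band_heights(0.6))  # (1152, 768)
```

Because the seam is a single x position, the two source strips are complementary by construction: the left strip spans `[0, seam)` and the right strip spans `[seam, 1]`.
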