# Architecture: Reusable Rocket
> *"We don't need to build the door or windows β just a container with landing
> gear and thrusters that move in different directions."*
> β Bryan
That analogy maps exactly onto this MCP:
| Rocket part | Codebase | Purpose |
| --------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------- |
| Container | `src/humeo_core/schemas.py` | Strict JSON contracts every stage reads/writes. |
| Landing gear | `src/humeo_core/primitives/ingest.py` | Deterministic local extraction (scenes, keyframes, transcript). |
| Thrusters (×5)  | `src/humeo_core/primitives/layouts.py`                            | Five fixed 9:16 crop/compose recipes (max 2 on-screen items).            |
| Pilot           | `src/humeo_core/primitives/classify.py` + `select_clips.py`       | Heuristic + LLM-ready decision makers.                                   |
| Compiler | `src/humeo_core/primitives/compile.py` | Deterministic ffmpeg assembly. |
| Control panel   | `src/humeo_core/server.py`                                         | MCP tool surface exposing every primitive to agents and clients.         |
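
To make the control panel concrete, here is a minimal sketch of what exposing one primitive as an MCP tool can look like, using the `FastMCP` helper from the official Python SDK. The tool name and signature below are illustrative, not the actual surface of `server.py`:

```python
# Minimal sketch of an MCP tool surface (illustrative, not the real server.py).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("humeo")

@mcp.tool()
def ingest_video(path: str) -> dict:
    """Run deterministic local extraction and return the JSON artifact."""
    # The real pipeline would delegate to primitives/ingest.py here.
    return {"scenes": [], "keyframes": [], "transcript": None, "source": path}

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, matching the stdio-only constraint
```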
## First-principles reasoning
The HIVE paper's core insight is that good short-video editing requires
**staged reasoning with strict intermediate artifacts**, not a single
giant model call. Three consequences flow from that:
1. **Extraction must be local and deterministic.** No model call should
ever touch raw video bytes. `ingest.py` runs ffprobe + PySceneDetect
+ ffmpeg + (optional) faster-whisper. Everything it emits is JSON or
a file path.
2. **Reasoning must be decomposed into narrow sub-tasks.** Classifying a
scene's layout is a completely different task from selecting a viral
clip. Each has its own schema, its own prompt, its own validation.
This is why `primitives/` is five files instead of one.
3. **Every model call must emit schema-validated JSON.** Free-form model
output is not allowed to enter the pipeline. `classify_scenes_with_llm`
and `select_clips_with_llm` both `model_validate(...)` the raw output
before returning; parse failures degrade gracefully to `SIT_CENTER` +
low confidence, not crashes.
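
The third rule is the load-bearing one, so here is a minimal sketch of the validate-or-degrade pattern it implies, assuming Pydantic v2; the `SceneLayout` model and its fields are illustrative stand-ins for the real definitions in `schemas.py`:

```python
from pydantic import BaseModel, ValidationError

# Illustrative stand-in for a schemas.py model; field names are assumptions.
class SceneLayout(BaseModel):
    layout: str        # e.g. "sit_center"
    confidence: float  # 0..1

def parse_classifier_output(raw: str) -> SceneLayout:
    """Validate raw model output; degrade gracefully instead of crashing."""
    try:
        return SceneLayout.model_validate_json(raw)
    except ValidationError:
        # Mirrors the pipeline's fallback: SIT_CENTER + low confidence.
        return SceneLayout(layout="sit_center", confidence=0.1)
```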
## Why only five layouts?
The hard rule for this format: **a short shows at most two on-screen
items**, where an "item" is a `person` or a `chart`. That gives exactly
five recipes, all implemented as pure functions from
`LayoutInstruction` to an ffmpeg filtergraph string in `layouts.py`:
| Layout | Items | Recipe |
| ---------------------- | --------------- | --------------------------------------------- |
| `zoom_call_center`      | 1 person        | tight centered 9:16 crop (zoom ≥ 1.25).       |
| `sit_center` | 1 person | wider centered 9:16 crop. |
| `split_chart_person`    | 1 chart + 1 person | source partitioned L/R by bboxes, stacked. |
| `split_two_persons` | 2 persons | L/R speakers, stacked top/bottom. |
| `split_two_charts` | 2 charts | L/R charts, stacked top/bottom. |
A general subject-tracker ML model is orders of magnitude more expensive
and less reliable than five hand-written crop recipes. If a new geometry
ever shows up in future source videos, adding a sixth thruster is
strictly additive: write a new `plan_*` function, add it to `_DISPATCH`,
add an enum variant. No existing code has to change.
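
As a sketch of what "strictly additive" means in practice (the enum values and planner signature below are assumptions; see `layouts.py` for the real ones):

```python
from enum import Enum
from typing import Callable

class LayoutKind(str, Enum):
    SIT_CENTER = "sit_center"
    # ... the other four existing variants ...
    PIP_CORNER = "pip_corner"  # hypothetical sixth geometry

def plan_pip_corner(instruction) -> str:
    """Pure function: LayoutInstruction -> ffmpeg filtergraph string."""
    return "crop=...,scale=1080:1920,setsar=1[vout]"  # placeholder recipe

_DISPATCH: dict[LayoutKind, Callable] = {
    # ... the five existing entries stay untouched ...
    LayoutKind.PIP_CORNER: plan_pip_corner,
}
```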
## 9:16 layout math
Source is assumed 16:9 (1920×1080 by default, but probed per-clip).
Target is 1080×1920. For each layout:
### `zoom_call_center` and `sit_center`
Standard centered aspect-ratio crop to 9:16, then scale to 1080Γ1920:
```
crop=cw:ch:x:y,scale=1080:1920:flags=lanczos,setsar=1[vout]
```
`cw`, `ch` are the largest 9:16 window that fits in the source, divided
by `zoom`. `x`, `y` center the window on `person_x_norm` / 0.5.
Dimensions are rounded to even values so libx264 is happy. The window is
clamped inside the source so a high `person_x_norm` never crops outside.
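
A sketch of that arithmetic in Python; the helper name is made up, but the steps mirror the prose above:

```python
def centered_crop(src_w: int, src_h: int, zoom: float, person_x_norm: float):
    """Largest 9:16 window fitting the source, shrunk by `zoom`,
    centered on the subject, clamped, and rounded to even dimensions."""
    ch = src_h                     # source height limits a 9:16 window in 16:9
    cw = ch * 9 / 16               # widest 9:16 window that fits
    cw, ch = cw / zoom, ch / zoom  # zoom in by shrinking the window
    cw, ch = int(cw) // 2 * 2, int(ch) // 2 * 2  # even values keep libx264 happy
    # Center on the subject horizontally (mid-frame vertically), then clamp
    # so a high person_x_norm never pushes the window outside the source.
    x = int(min(max(person_x_norm * src_w - cw / 2, 0), src_w - cw))
    y = int(min(max(0.5 * src_h - ch / 2, 0), src_h - ch))
    return cw, ch, x, y

# 1920x1080 source, zoom 1.25, subject right of center -> (486, 864, 966, 108)
print(centered_crop(1920, 1080, 1.25, 0.63))
```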
### Split layouts (`split_chart_person`, `split_two_persons`, `split_two_charts`)
All three splits share one recipe; only the items differ:
1. **Horizontal partition.** The source is cut at a single vertical seam
so the two source strips are **complementary** (no overlap, no gap).
When both bboxes are set (Gemini vision), the seam is the midpoint
between `left.x2` and `right.x1`. Otherwise the seam defaults to
either an even 50/50 (two-of-a-kind splits) or a 2/3 | 1/3 split
(legacy `split_chart_person` fallback).
2. **Vertical crop.** Each strip's vertical extent comes from the
corresponding bbox when provided, so each item **fills** its output
band instead of being lost in full-height source context.
3. **Cover-scale to the band.** Each strip is scaled with
`force_original_aspect_ratio=increase` + center-cropped to the band
dimensions. Bands are always fully painted; no letterbox bars.
4. **Stack.** Two branches produced by `split=2` are `vstack`-ed into
the final 1080Γ1920.
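
A hedged sketch of the filtergraph those four steps produce for an even 50/50 split with a given seam; the real `plan_*` functions also derive bbox-driven vertical crops, which are omitted here:

```python
def split_filtergraph(src_w: int, src_h: int, seam_x: int,
                      top_band_ratio: float = 0.5) -> str:
    """Illustrative split/vstack recipe; bbox-driven vertical crops omitted."""
    top_h = int(1920 * top_band_ratio) // 2 * 2
    bot_h = 1920 - top_h
    return (
        "split=2[a][b];"
        # Left strip -> top band: cover-scale, then center-crop to the band.
        f"[a]crop={seam_x}:{src_h}:0:0,"
        f"scale=1080:{top_h}:force_original_aspect_ratio=increase,"
        f"crop=1080:{top_h}[top];"
        # Right strip -> bottom band, complementary to the left strip.
        f"[b]crop={src_w - seam_x}:{src_h}:{seam_x}:0,"
        f"scale=1080:{bot_h}:force_original_aspect_ratio=increase,"
        f"crop=1080:{bot_h}[bot];"
        "[top][bot]vstack=inputs=2,setsar=1[vout]"
    )

print(split_filtergraph(1920, 1080, 960))  # even 50/50, centered seam
```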
**Band heights** are controlled by `LayoutInstruction.top_band_ratio`,
which defaults to **0.5** (an even 50/50, the symmetric look Bryan asked
for after the uneven Cathy Wood shorts). Legacy 60/40 is still reachable
by setting `top_band_ratio=0.6`.
**Stack order** (for `split_chart_person`) is controlled by
`focus_stack_order`: chart-on-top (default) or person-on-top.
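
Put together, requesting the legacy look might read like this (the constructor shape and enum spelling are assumptions; only the field names come from this document):

```python
from humeo_core.schemas import LayoutInstruction, LayoutKind  # paths per the table above

# Assumed constructor shape; the field names are documented above.
legacy = LayoutInstruction(
    kind=LayoutKind.SPLIT_CHART_PERSON,
    top_band_ratio=0.6,              # legacy 60/40 instead of the default 0.5
    focus_stack_order="person_top",  # flip the default chart-on-top order
)
```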
## Extensibility story
- **Smarter classifier:** implement `LLMVisionFn` with any multimodal
model and pass it to `classify_scenes_with_llm`. The fallback heuristic
stays available for offline runs and tests.
- **Smarter clip selector:** same pattern, `LLMTextFn` → `select_clips_with_llm`.
- **New layout:** add a `plan_*` planner, register in `_DISPATCH`, add a
`LayoutKind` variant. Tests in `test_layouts.py` automatically iterate
over all `LayoutKind`s, so the dispatch coverage test will catch a
missing registration immediately.
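
A sketch of the injection pattern behind the first two bullets, assuming `LLMVisionFn` is (morally) a plain callable; the signature is a guess, not the codebase's actual protocol:

```python
from typing import Protocol

class LLMVisionFn(Protocol):
    """Assumed shape: keyframe + prompt in, raw JSON text out."""
    def __call__(self, keyframe_path: str, prompt: str) -> str: ...

def gemini_vision(keyframe_path: str, prompt: str) -> str:
    """Hypothetical adapter around any multimodal model client."""
    raise NotImplementedError  # wire your model client in here

# scenes = classify_scenes_with_llm(scenes, llm=gemini_vision)
# Leaving the model out falls back to the offline heuristic, as noted above.
```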
## What we intentionally did NOT build
- Drag-and-highlight subject-selector UI.
- A general ML subject-tracker.
- A monolithic video-in-video-out model.
- Any network calls in the core library. The MCP server is stdio-only;
the CLI runs fully offline.
This keeps the rocket **reusable**: the same primitives power the MCP
server, the CLI, a Python library, and, down the line, a web UI if
that's ever warranted.