	# Architecture — Reusable Rocket

	> *"We don't need to build the door or windows — just a container with landing
	> gear and thrusters that move in different directions."*
	> — Bryan

	That analogy maps exactly onto this MCP:

	| Rocket part     | Codebase                                                 | Purpose                                                          |
	| --------------- | -------------------------------------------------------- | ---------------------------------------------------------------- |
	| Container       | `src/humeo_core/schemas.py`                              | Strict JSON contracts every stage reads/writes.                  |
	| Landing gear    | `src/humeo_core/primitives/ingest.py`                    | Deterministic local extraction (scenes, keyframes, transcript).  |
	| Thrusters (×5)  | `src/humeo_core/primitives/layouts.py`                   | Five fixed 9:16 crop/compose recipes (max 2 on-screen items).    |
	| Pilot           | `primitives/classify.py` + `primitives/select_clips.py`  | Heuristic + LLM-ready decision makers.                           |
	| Compiler        | `src/humeo_core/primitives/compile.py`                   | Deterministic ffmpeg assembly.                                   |
	| Control panel   | `src/humeo_core/server.py`                               | MCP tools exposing every primitive.                              |
	| Control surface | `src/humeo_core/server.py`                               | MCP tool surface for agents and clients.                         |

	## First-principles reasoning

	The HIVE paper's core insight is that good short-video editing requires
	staged reasoning with strict intermediate artifacts, not a single
	giant model call. Three consequences flow from that:

	1. Extraction must be local and deterministic. No model call should
	ever touch raw video bytes. `ingest.py` runs ffprobe + PySceneDetect
	+ ffmpeg + (optional) faster-whisper. Everything it emits is JSON or
	a file path.

	2. Reasoning must be decomposed into narrow sub-tasks. Classifying a
	scene's layout is a completely different task from selecting a viral
	clip. Each has its own schema, its own prompt, its own validation.
	This is why `primitives/` is five files instead of one.

	3. Every model call must emit schema-validated JSON. Free-form model
	output is not allowed to enter the pipeline. `classify_scenes_with_llm`
	and `select_clips_with_llm` both `model_validate(...)` the raw output
	before returning. A parse failure never crashes the pipeline: the
	classifier degrades gracefully to `SIT_CENTER` with low confidence.
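
The validate-or-degrade rule in point 3 can be sketched with the standard library alone. This is an illustration, not the project's code: the real pipeline validates with `model_validate(...)` against the schemas in `schemas.py`, and the `SceneClassification` fields, the `parse_classification` helper, and the fallback confidence of 0.1 are assumptions made here.

```python
import json
from dataclasses import dataclass

# Layout names taken from the layout table in this document.
ALLOWED_LAYOUTS = {
    "zoom_call_center",
    "sit_center",
    "split_chart_person",
    "split_two_persons",
    "split_two_charts",
}


@dataclass(frozen=True)
class SceneClassification:
    """Hypothetical stand-in for the real schema class."""
    layout: str
    confidence: float


# Fallback described in the text: sit_center with low confidence.
# The value 0.1 is an assumption, not taken from the codebase.
FALLBACK = SceneClassification(layout="sit_center", confidence=0.1)


def parse_classification(raw: str) -> SceneClassification:
    """Validate raw model output; return FALLBACK instead of raising."""
    try:
        data = json.loads(raw)
        layout = data["layout"]
        confidence = float(data["confidence"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return FALLBACK
    if not isinstance(layout, str) or layout not in ALLOWED_LAYOUTS:
        return FALLBACK
    if not 0.0 <= confidence <= 1.0:
        return FALLBACK
    return SceneClassification(layout=layout, confidence=confidence)
```

Anything that is not valid JSON, lacks a field, or names an unknown layout comes back as `FALLBACK`, so downstream stages only ever see a well-formed record.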

	## Why only five layouts?

	The hard rule for this format: **a short shows at most two on-screen
	items**, where an "item" is a `person` or a `chart`. That gives exactly
	five recipes — all implemented as pure functions from
	`LayoutInstruction` to an ffmpeg filtergraph string in `layouts.py`:

	| Layout               | Items              | Recipe                                      |
	| -------------------- | ------------------ | ------------------------------------------- |
	| `zoom_call_center`   | 1 person           | tight centered 9:16 crop (zoom ≥ 1.25).     |
	| `sit_center`         | 1 person           | wider centered 9:16 crop.                   |
	| `split_chart_person` | 1 chart + 1 person | source partitioned L/R by bboxes, stacked.  |
	| `split_two_persons`  | 2 persons          | L/R speakers, stacked top/bottom.           |
	| `split_two_charts`   | 2 charts           | L/R charts, stacked top/bottom.             |

	A general subject-tracker ML model is orders of magnitude more expensive
	and less reliable than five hand-written crop recipes. If a new geometry
	ever shows up in future source videos, adding a sixth thruster is
	strictly additive: write a new `plan_*` function, add it to `_DISPATCH`,
	and add an enum variant. No existing planner has to change.

	## 9:16 layout math

	Source is assumed 16:9 (1920×1080 by default, but probed per-clip).
	Target is 1080×1920. For each layout:

	### `zoom_call_center` and `sit_center`

	Standard centered aspect-ratio crop to 9:16, then scale to 1080×1920:

	```
	crop=cw:ch:x:y,scale=1080:1920:flags=lanczos,setsar=1[vout]
	```

	`cw` and `ch` are the dimensions of the largest 9:16 window that fits in
	the source, divided by `zoom`. `x` centers the window horizontally on
	`person_x_norm`; `y` centers it vertically at 0.5. Dimensions are rounded
	to even values so libx264 is happy. The window is clamped inside the
	source, so a high `person_x_norm` never crops outside the frame.
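
The crop arithmetic can be sketched as below. This is an illustration under stated assumptions, not the code in `layouts.py`: the function names are hypothetical, rounding down to the nearest even value is one possible reading of "rounded to even values", and clamping `zoom` to at least 1.0 is an added safeguard.

```python
def even_floor(value: float) -> int:
    """Round down to the nearest even integer, never below 2."""
    return max(2, int(value) // 2 * 2)


def centered_crop_filter(
    src_w: int,
    src_h: int,
    zoom: float = 1.0,
    person_x_norm: float = 0.5,
    out_w: int = 1080,
    out_h: int = 1920,
) -> str:
    """Build the crop+scale filter string for the single-person layouts."""
    # Largest window with the output aspect ratio that fits in the source.
    if src_w * out_h > src_h * out_w:
        base_w, base_h = src_h * out_w / out_h, float(src_h)  # height-limited
    else:
        base_w, base_h = float(src_w), src_w * out_h / out_w  # width-limited

    zoom = max(zoom, 1.0)  # assumption: zoom below 1 would leave the source
    cw = even_floor(base_w / zoom)
    ch = even_floor(base_h / zoom)

    # Center horizontally on the person, vertically on the frame middle.
    x = round(person_x_norm * src_w - cw / 2)
    y = round(0.5 * src_h - ch / 2)

    # Clamp so the window never leaves the source frame.
    x = min(max(x, 0), src_w - cw)
    y = min(max(y, 0), src_h - ch)

    return (
        f"crop={cw}:{ch}:{x}:{y},"
        f"scale={out_w}:{out_h}:flags=lanczos,setsar=1[vout]"
    )
```

For a 1920×1080 source at `zoom=1.0` this yields `crop=606:1080:657:0`. With `person_x_norm=1.0` the clamp pulls `x` back to 1314, the rightmost position at which the 606-pixel window still fits.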

	### Split layouts (`split_chart_person`, `split_two_persons`, `split_two_charts`)

	All three splits share one recipe — only the items differ:

	1. Horizontal partition. The source is cut at a single vertical seam
	so the two source strips are complementary (no overlap, no gap).
	When both bboxes are set (Gemini vision), the seam is the midpoint
	between `left.x2` and `right.x1`. Otherwise the seam defaults to
	either an even 50/50 (two-of-a-kind splits) or a 2/3 \| 1/3 split
	(legacy `split_chart_person` fallback).
	2. Vertical crop. Each strip's vertical extent comes from the
	corresponding bbox when provided, so each item fills its output
	band instead of being lost in full-height source context.
	3. Cover-scale to the band. Each strip is scaled with
	`force_original_aspect_ratio=increase` + center-cropped to the band
	dimensions. Bands are always fully painted; no letterbox bars.
	4. Stack. Two branches produced by `split=2` are `vstack`-ed into
	the final 1080×1920.
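
Step 1 (the seam) can be sketched as follows. The `BBox` type with normalized coordinates, the function names, and the choice to give the left strip the 2/3 share in the legacy fallback are assumptions for illustration. Only the midpoint rule and the two default ratios come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BBox:
    """Hypothetical bounding box in normalized [0, 1] source coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float


def seam_x_norm(left: Optional[BBox], right: Optional[BBox], layout: str) -> float:
    """Return the vertical seam position as a fraction of source width."""
    if left is not None and right is not None:
        # Midpoint of the gap between the two detected items.
        return (left.x2 + right.x1) / 2
    if layout == "split_chart_person":
        return 2 / 3  # legacy fallback; left strip gets 2/3 (assumption)
    return 0.5  # two-of-a-kind splits default to an even cut


def partition(src_w: int, seam_norm: float) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return (x, width) for the left and right strips.

    The strips are complementary: together they cover the source width
    with no overlap and no gap.
    """
    seam_px = int(round(seam_norm * src_w)) // 2 * 2  # keep widths even
    seam_px = min(max(seam_px, 2), src_w - 2)
    return (0, seam_px), (seam_px, src_w - seam_px)
```

With `left.x2 = 0.5` and `right.x1 = 0.75` on a 1920-pixel-wide source, the seam lands at 0.625, so the strips are 1200 and 720 pixels wide.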

	Band heights are controlled by `LayoutInstruction.top_band_ratio`,
	which defaults to 0.5 (even 50/50 — the symmetric look Bryan asked
	for after the uneven Cathy Wood shorts). Legacy 60/40 is still reachable
	by setting `top_band_ratio=0.6`.
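
The band arithmetic, together with steps 3 and 4 of the split recipe, can be sketched like this. The function names and the `(x, y, w, h)` crop tuples are hypothetical. The ffmpeg filters used (`split`, `crop`, `scale` with `force_original_aspect_ratio=increase`, `vstack`) are real, but the exact graph emitted by `layouts.py` may differ.

```python
from typing import Tuple

Crop = Tuple[int, int, int, int]  # hypothetical (x, y, w, h) in source pixels


def band_heights(top_band_ratio: float = 0.5, out_h: int = 1920) -> Tuple[int, int]:
    """Split the output height into an even-valued top band and the rest."""
    top = int(round(out_h * top_band_ratio)) // 2 * 2
    return top, out_h - top


def split_filtergraph(
    top_crop: Crop,
    bottom_crop: Crop,
    top_band_ratio: float = 0.5,
    out_w: int = 1080,
    out_h: int = 1920,
) -> str:
    """Assemble a two-branch cover-scale and vstack filtergraph string."""
    top_h, bottom_h = band_heights(top_band_ratio, out_h)

    def branch(label_in: str, crop: Crop, band_h: int, label_out: str) -> str:
        x, y, w, h = crop
        return (
            f"[{label_in}]crop={w}:{h}:{x}:{y},"
            # Cover-scale: grow until the band is fully covered...
            f"scale={out_w}:{band_h}:force_original_aspect_ratio=increase,"
            # ...then center-crop the overflow (crop centers by default).
            f"crop={out_w}:{band_h}[{label_out}]"
        )

    return ";".join([
        "[0:v]split=2[a][b]",
        branch("a", top_crop, top_h, "top"),
        branch("b", bottom_crop, bottom_h, "bottom"),
        "[top][bottom]vstack=inputs=2,setsar=1[vout]",
    ])
```

`band_heights()` gives `(960, 960)` and `band_heights(0.6)` gives `(1152, 768)`, matching the default 50/50 look and the legacy 60/40 split.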

	Stack order (for `split_chart_person`) is controlled by
	`focus_stack_order`: chart-on-top (default) or person-on-top.

	## Extensibility story

	- Smarter classifier: implement `LLMVisionFn` with any multimodal
	model and pass it to `classify_scenes_with_llm`. The fallback heuristic
	stays available for offline runs and tests.
	- Smarter clip selector: same pattern, `LLMTextFn` → `select_clips_with_llm`.
	- New layout: add a `plan_*` planner, register in `_DISPATCH`, add a
	`LayoutKind` variant. Tests in `test_layouts.py` automatically iterate
	over all `LayoutKind`s, so the dispatch coverage test will catch a
	missing registration immediately.
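
The pluggable-classifier pattern can be sketched as below. The `LLMVisionFn` signature shown here (keyframe path in, raw JSON text out), the `classify_keyframes` helper, and the heuristic's return value are all assumptions for illustration. The real `classify_scenes_with_llm` signature is not given in this document.

```python
import json
from typing import Callable, Dict, List, Optional

# Hypothetical signature: keyframe path in, raw JSON text out.
LLMVisionFn = Callable[[str], str]


def heuristic_layout(keyframe_path: str) -> Dict[str, object]:
    """Offline fallback; a real heuristic would inspect the frame."""
    return {"layout": "sit_center", "confidence": 0.3}


def classify_keyframes(
    keyframes: List[str],
    vision_fn: Optional[LLMVisionFn] = None,
) -> List[Dict[str, object]]:
    """Use the injected model when given, else the heuristic."""
    results = []
    for path in keyframes:
        if vision_fn is None:
            results.append(heuristic_layout(path))
            continue
        try:
            parsed = json.loads(vision_fn(path))
            results.append({
                "layout": str(parsed["layout"]),
                "confidence": float(parsed["confidence"]),
            })
        except (ValueError, KeyError, TypeError):
            # Malformed model output degrades to the heuristic.
            results.append(heuristic_layout(path))
    return results


def fake_vision_model(keyframe_path: str) -> str:
    """Test double standing in for any multimodal model."""
    return json.dumps({"layout": "split_two_persons", "confidence": 0.8})
```

Because the model is passed in as a plain callable, tests and offline runs can omit it or swap in a stub like `fake_vision_model` without touching the classifier.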

	## What we intentionally did NOT build

	- Drag-and-highlight subject-selector UI.
	- A general ML subject-tracker.
	- A monolithic video-in-video-out model.
	- Any network calls in the core library. The MCP server is stdio-only;
	the CLI runs fully offline.

	This keeps the rocket reusable: the same primitives power the MCP
	server, the CLI, a Python library, and, if it is ever warranted, a
	web UI.