# humeo-core
**Reusable-rocket MCP server for long-video → 9:16 shorts.**
First-principles design, following the HIVE paper and Bryan's rocket analogy:
we don't build doors and windows (a general subject-tracker UI, model
retraining). We build the **container** (schemas), **landing gear** (deterministic
local extraction), and **five thrusters** (the five 9:16 layouts this video
format actually uses). Everything else is pluggable.
## The rocket, in one picture
```
                      β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                      β”‚     Control panel (MCP tools)      β”‚  <- any MCP client
                      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                         β”‚  strict JSON
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β–Ό                   β–Ό                β–Ό               β–Ό                β–Ό
 ingest          classify_scenes   select_clips     plan_layout      render_clip
(scenes +       (5-way layout     (clip picker,    (5 thrusters,    (ffmpeg compile,
 keyframes +     classifier)       heuristic +      pure filter      dry-run safe)
 transcript)                       LLM-ready)       math)
                        β”‚
                        β–Ό
             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
             β”‚     LayoutKind     β”‚
             β”‚ ────────────────── β”‚
             β”‚ zoom_call_center   β”‚
             β”‚ sit_center         β”‚
             β”‚ split_chart_person β”‚
             β”‚ split_two_persons  β”‚
             β”‚ split_two_charts   β”‚
             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
Only the classifier and clip-selector have optional LLM hooks; everything
else is deterministic, local, and cheap.
## Why five layouts? (the "max 2 items" rule)
The hard constraint for this format: **a short shows at most two on-screen
items** — where an "item" is a `person` (a human speaker) or a `chart`
(slide, graph, data visual, screenshare). That gives exactly five recipes:
1. **`zoom_call_center`** — 1 person, tight zoom-call / webcam framing.
2. **`sit_center`** — 1 person, interview / seated framing.
3. **`split_chart_person`** — 1 chart + 1 person, stacked vertically
   (default: **even 50/50** top/bottom, chart on top).
4. **`split_two_persons`** — 2 speakers, stacked vertically.
5. **`split_two_charts`** — 2 charts, stacked vertically.
Because the geometry is bounded, we do NOT need a general subject-tracker
ML model or a drag-to-highlight UI. We need five small, correct pieces of
crop/compose math. That is exactly what `src/humeo_core/primitives/layouts.py`
is.
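For intuition, here is a minimal sketch of the band math behind the stacked
split layouts, assuming a 1080x1920 output cut into two 1080x960 bands (it is
not the actual `layouts.py` code): each band takes the largest 9:8 window from
the source, horizontally centred on its item, and scales it to 1080x960.

```python
# Minimal sketch of the even 50/50 split band math (not the actual layouts.py code).
# Assumes a 1080x1920 output made of two stacked 1080x960 bands (aspect 9:8 each).

def band_crop(src_w: int, src_h: int, center_x_norm: float,
              band_w: int = 1080, band_h: int = 960):
    """Largest 9:8 window in the source, horizontally centred on the item.

    Returns (crop_w, crop_h, x, y), i.e. the arguments for ffmpeg's crop=w:h:x:y.
    """
    aspect = band_w / band_h                   # 1080 / 960 = 1.125
    crop_h = src_h
    crop_w = round(crop_h * aspect)
    if crop_w > src_w:                         # very narrow source: pin to full width
        crop_w = src_w
        crop_h = round(crop_w / aspect)
    x = round(center_x_norm * src_w - crop_w / 2)
    x = max(0, min(x, src_w - crop_w))         # keep the window inside the frame
    y = (src_h - crop_h) // 2
    return crop_w, crop_h, x, y

# 1920x1080 source, chart centred at x=0.30 -> crop=1215:1080:0:0, then scale=1080:960
print(band_crop(1920, 1080, 0.30))
```

Crop and scale each band this way, stack the two bands with ffmpeg's `vstack`,
and you have the 1080x1920 frame; `plan_layout` returns the equivalent
`-filter_complex` string for you.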
See [`TERMINOLOGY.md`](../TERMINOLOGY.md) for the full glossary of terms
used across these docs (subject, crop, band, seam, bbox, layout, etc.).
## Install
```bash
uv venv
uv sync
```
External requirements: `ffmpeg` and `ffprobe` on PATH.
`scenedetect` requires OpenCV. Install `opencv-python-headless` or
`opencv-python` alongside `scenedetect`.
## Use it as an MCP server
```bash
humeo-core # stdio transport (primary console script)
# humeo-mcp # same entrypoint — kept so existing MCP configs keep working
```
Example Cursor/Claude Desktop config:
```json
{
  "mcpServers": {
    "humeo": { "command": "humeo-core" }
  }
}
```
Tools exposed:
| Tool | Purpose |
| --------------------------------- | --------------------------------------------------------------------------- |
| `list_layouts` | Enumerate the 5 supported layouts. |
| `ingest` | Scene detection + keyframe extraction (+ optional transcript). |
| `classify_scenes` | Pixel-heuristic per-scene layout classification. |
| `detect_scene_regions` | Return the bbox prompt + per-scene jobs (agent runs its own vision model). |
| `classify_scenes_with_vision` | Classify scenes from already-gathered `SceneRegions` bbox JSON + build layout instructions. |
| `select_clips` | Heuristic clip picker over a word-level transcript. |
| `plan_layout` | Return the exact `ffmpeg -filter_complex` for a layout. |
| `build_render_cmd` | Build the ffmpeg command (no execution) — review before spend. |
| `render_clip` | Build + run ffmpeg to produce a 9:16 MP4. |
Resource: `humeo://layouts` (JSON listing of the 5 layouts).
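As a quick smoke test, here is one way to drive those tools from a script,
assuming the official `mcp` Python SDK as the client (any MCP client works;
only the tool names and the `humeo-core` command come from this README):

```python
# Sketch of calling humeo-core over stdio with the official MCP Python SDK
# (`pip install mcp`). Tool names come from the table above; everything else
# here is client-side plumbing, not part of this repo.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    server = StdioServerParameters(command="humeo-core")
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            layouts = await session.call_tool("list_layouts", arguments={})
            print(layouts)
            # Review the ffmpeg command before spending compute, e.g.:
            # cmd = await session.call_tool("build_render_cmd", arguments={...})

asyncio.run(main())
```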
### Three interchangeable region detectors
All three emit the same `SceneRegions` schema, so the layout planner and renderer don't care which one you used:
```
classify.py (pixel variance, no ML)
face_detect.py (MediaPipe, local) ──► SceneRegions ──► SceneClassification ──► LayoutInstruction ──► ffmpeg
vision.py (multimodal LLM + OCR bboxes)
```
## JSON contracts (non-negotiable)
All tools take and return Pydantic-validated JSON. The contracts live in
[`src/humeo_core/schemas.py`](src/humeo_core/schemas.py):
- `Scene` `{scene_id, start_time, end_time, keyframe_path?}`
- `TranscriptWord` `{word, start_time, end_time}`
- `IngestResult` `{source_path, duration_sec, scenes[], transcript_words[], keyframes_dir?}`
- `SceneClassification` `{scene_id, layout, confidence, reason}`
- `BoundingBox` `{x1, y1, x2, y2, label, confidence}` (all coords normalized)
- `SceneRegions` `{scene_id, person_bbox?, chart_bbox?, ocr_text, raw_reason}`
- `Clip` `{clip_id, topic, start_time_sec, end_time_sec, viral_hook, virality_score, transcript, suggested_overlay_title, layout?}`
- `ClipPlan` `{source_path, clips[]}`
- `LayoutInstruction` `{clip_id, layout, zoom, person_x_norm, chart_x_norm, split_chart_region?, split_person_region?, split_second_chart_region?, split_second_person_region?, top_band_ratio, focus_stack_order}`
- `RenderRequest` / `RenderResult`
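For orientation, here are two of those contracts sketched as Pydantic models;
the field names follow the list above, but the concrete types in `schemas.py`
may differ:

```python
# Illustrative shapes only: field names follow this README, exact types live in
# src/humeo_core/schemas.py.
from pydantic import BaseModel

class BoundingBox(BaseModel):
    x1: float            # all coordinates normalized to [0, 1]
    y1: float
    x2: float
    y2: float
    label: str           # e.g. "person" or "chart"
    confidence: float

class SceneRegions(BaseModel):
    scene_id: int
    person_bbox: BoundingBox | None = None
    chart_bbox: BoundingBox | None = None
    ocr_text: str = ""
    raw_reason: str = ""
```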
## First-principles decisions (what we intentionally did NOT build)
- **No giant subject-tracker ML.** The video format has 5 fixed layouts
(with a hard "max 2 items" rule); pixel-level tracking is not needed.
- **No drag-and-highlight UI.** An MCP tool is a better "UI" for an
agent-first workflow. If a human wants to override, they pass a
`LayoutInstruction` with their own `person_x_norm` / `chart_x_norm` /
`zoom` (see the example after this list).
- **No end-to-end video→video model.** The HIVE paper's core insight is
that decomposed orchestration beats monolithic generation. We reify
that insight as six small composable tools.
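For example, a hand-written override for the stacked chart + person layout
could look roughly like this; the field names come from the `LayoutInstruction`
schema above, but the values (and which optional fields you set) are purely
illustrative:

```python
# Hypothetical manual override. Field names follow the LayoutInstruction schema
# listed above; the values are illustrative, not defaults from the codebase.
manual_override = {
    "clip_id": "clip_001",
    "layout": "split_chart_person",
    "zoom": 1.0,
    "person_x_norm": 0.62,   # horizontal centre of the speaker in the source frame
    "chart_x_norm": 0.35,    # horizontal centre of the chart in the source frame
    "top_band_ratio": 0.5,   # keep the default even 50/50 split, chart on top
}
```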
## Extending the pilot
- Plug a real multimodal model into `classify_scenes_with_llm(vision_fn)`.
- Plug a real reasoning model into `select_clips_with_llm(text_fn)`.
- Plug a real vision-LLM into `detect_regions_with_llm(scenes, vision_fn)`
to get per-scene bboxes + OCR text, then feed the results back through
`classify_scenes_with_vision`. This is the scene-change → v3 images →
LLM+OCR → bbox path; see `../docs/SOLUTIONS.md §4` for rationale. A stub
`vision_fn` is sketched after this list.
- All three hooks enforce strict JSON outputs, so bad model output can't
corrupt downstream stages.
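A stub `vision_fn` could look like the sketch below; the callback signature is
an assumption (the real hook may differ), but the returned JSON mirrors the
`SceneRegions` fields so strict validation can do its job:

```python
# Stub vision_fn for detect_regions_with_llm. The callback signature is an
# assumption; the returned dict mirrors the SceneRegions fields in this README.
def vision_fn(keyframe_path: str, prompt: str) -> dict:
    # Replace this with a real multimodal model call (keyframe image + bbox prompt).
    return {
        "person_bbox": {"x1": 0.55, "y1": 0.10, "x2": 0.95, "y2": 0.90,
                        "label": "person", "confidence": 0.9},
        "chart_bbox": None,
        "ocr_text": "",
        "raw_reason": "stub: single centred speaker, no chart detected",
    }

# regions = detect_regions_with_llm(ingest_result.scenes, vision_fn)   # then feed
# the regions through classify_scenes_with_vision as described above.
```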
## Testing
```bash
python -m pytest
```
See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for deeper rationale.
## License
MIT