# humeo-core

**Reusable-rocket MCP server for long-video → 9:16 shorts.**

First-principles design, from the HIVE paper + Bryan's rocket analogy: we don't build doors and windows (a general subject-tracker UI, retraining models). We build the **container** (schemas), the **landing gear** (deterministic local extraction), and the **five thrusters** (the five 9:16 layouts this video format actually uses). Everything else is pluggable.

## The rocket, in one picture

```
                   ┌──────────────────────────────────────────┐
                   │        Control panel (MCP tools)         │  <- any MCP client
                   └────────────────────┬─────────────────────┘
                                        │ strict JSON
    ┌───────────────────┬───────────────┼─────────────────┬─────────────────┐
    ▼                   ▼               ▼                 ▼                 ▼
 ingest          classify_scenes   select_clips      plan_layout       render_clip
 (scenes +       (5-way layout     (clip picker,     (5 thrusters,     (ffmpeg compile,
  keyframes +     classifier)       heuristic +       pure filter       dry-run safe)
  transcript)                       LLM-ready)        math)
                                                          │
                                                          ▼
                                                ┌────────────────────┐
                                                │     LayoutKind     │
                                                │ ────────────────── │
                                                │ zoom_call_center   │
                                                │ sit_center         │
                                                │ split_chart_person │
                                                │ split_two_persons  │
                                                │ split_two_charts   │
                                                └────────────────────┘
```

Only the scene classifier, region detector, and clip selector have optional LLM hooks; everything else is deterministic, local, and cheap.

## Why five layouts? (the "max 2 items" rule)

The hard constraint for this format: **a short shows at most two on-screen items** — where an "item" is a `person` (a human speaker) or a `chart` (slide, graph, data visual, screenshare). That gives exactly five recipes:

1. **`zoom_call_center`** — 1 person, tight zoom-call / webcam framing.
2. **`sit_center`** — 1 person, interview / seated framing.
3. **`split_chart_person`** — 1 chart + 1 person, stacked vertically (default: **even 50/50** top/bottom, chart on top).
4. **`split_two_persons`** — 2 speakers, stacked vertically.
5. **`split_two_charts`** — 2 charts, stacked vertically.

Because the geometry is bounded, we do NOT need a general subject-tracker ML model or a drag-to-highlight UI. We need five small, correct pieces of crop/compose math. That is exactly what `src/humeo_core/primitives/layouts.py` is (a sketch of that math appears below, after the JSON contracts).

See [`TERMINOLOGY.md`](../TERMINOLOGY.md) for the full glossary of terms used across these docs (subject, crop, band, seam, bbox, layout, etc.).

## Install

```bash
uv venv
uv sync
```

External requirements: `ffmpeg` and `ffprobe` on PATH. `scenedetect` requires OpenCV; install `opencv-python-headless` or `opencv-python` alongside it.

## Use it as an MCP server

```bash
humeo-core   # stdio transport (primary console script)
# humeo-mcp  # same entrypoint — kept so existing MCP configs keep working
```

Example Cursor/Claude Desktop config:

```json
{
  "mcpServers": {
    "humeo": { "command": "humeo-core" }
  }
}
```

Tools exposed:

| Tool | Purpose |
| ------------------------------ | --------------------------------------------------------------------------- |
| `list_layouts` | Enumerate the 5 supported layouts. |
| `ingest` | Scene detection + keyframe extraction (+ optional transcript). |
| `classify_scenes` | Pixel-heuristic per-scene layout classification. |
| `detect_scene_regions` | Return the bbox prompt + per-scene jobs (agent runs its own vision model). |
| `classify_scenes_with_vision` | Classify scenes from already-gathered `SceneRegions` bbox JSON + build layout instructions. |
| `select_clips` | Heuristic clip picker over a word-level transcript. |
| `plan_layout` | Return the exact `ffmpeg -filter_complex` for a layout. |
| `build_render_cmd` | Build the ffmpeg command (no execution) — review before spend. |
| `render_clip` | Build + run ffmpeg to produce a 9:16 MP4. |

Resource: `humeo://layouts` (JSON listing of the 5 layouts).

### Three interchangeable region detectors

All three emit the same `SceneRegions` schema, so the layout planner and renderer don't care which one you used:

```
classify.py      (pixel variance, no ML)
face_detect.py   (MediaPipe, local)             ──►  SceneRegions ──► SceneClassification ──► LayoutInstruction ──► ffmpeg
vision.py        (multimodal LLM + OCR bboxes)
```

## JSON contracts (non-negotiable)

All tools take and return Pydantic-validated JSON. The contracts live in [`src/humeo_core/schemas.py`](src/humeo_core/schemas.py):

- `Scene` `{scene_id, start_time, end_time, keyframe_path?}`
- `TranscriptWord` `{word, start_time, end_time}`
- `IngestResult` `{source_path, duration_sec, scenes[], transcript_words[], keyframes_dir?}`
- `SceneClassification` `{scene_id, layout, confidence, reason}`
- `BoundingBox` `{x1, y1, x2, y2, label, confidence}` (all coords normalized)
- `SceneRegions` `{scene_id, person_bbox?, chart_bbox?, ocr_text, raw_reason}`
- `Clip` `{clip_id, topic, start_time_sec, end_time_sec, viral_hook, virality_score, transcript, suggested_overlay_title, layout?}`
- `ClipPlan` `{source_path, clips[]}`
- `LayoutInstruction` `{clip_id, layout, zoom, person_x_norm, chart_x_norm, split_chart_region?, split_person_region?, split_second_chart_region?, split_second_person_region?, top_band_ratio, focus_stack_order}`
- `RenderRequest` / `RenderResult`
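To make "non-negotiable" concrete, here is a minimal sketch of the validation step, assuming Pydantic v2 and Python 3.10+. The field names come from the list above, but these stand-in models are illustrative; the real definitions live in `src/humeo_core/schemas.py` and may differ in detail.

```python
from pydantic import BaseModel, ValidationError

# Illustrative stand-ins for two of the contracts above (not the real models).
class BoundingBox(BaseModel):
    x1: float  # all coordinates normalized to 0..1
    y1: float
    x2: float
    y2: float
    label: str
    confidence: float

class SceneRegions(BaseModel):
    scene_id: int
    person_bbox: BoundingBox | None = None
    chart_bbox: BoundingBox | None = None
    ocr_text: str = ""
    raw_reason: str = ""

# A well-formed payload parses into a typed object...
regions = SceneRegions.model_validate_json(
    '{"scene_id": 3, "person_bbox": {"x1": 0.55, "y1": 0.08,'
    ' "x2": 0.97, "y2": 0.92, "label": "person", "confidence": 0.88}}'
)

# ...while malformed model output fails loudly instead of flowing downstream.
try:
    SceneRegions.model_validate_json('{"scene_id": 3, "person_bbox": {"x1": "left"}}')
except ValidationError as err:
    print(err.error_count(), "validation errors")  # bad type + missing fields
```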
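And a flavor of the "five small, correct pieces of crop/compose math" promised earlier: below is a hand-rolled sketch of the `split_chart_person` recipe (even 50/50 split, chart on top). It is illustrative, not the actual `layouts.py` code; the normalized x-centers play the role that `chart_x_norm` / `person_x_norm` play in `LayoutInstruction`.

```python
# Sketch: crop a chart band and a person band out of one 16:9 source and
# stack them onto a 1080x1920 canvas. Illustrative only, not layouts.py.
OUT_W, OUT_H = 1080, 1920

def split_chart_person_filter(chart_cx: float, person_cx: float) -> str:
    """Build an ffmpeg -filter_complex string; cx values are normalized 0..1."""
    band_h = OUT_H // 2  # even 50/50 split: two 960-px bands

    def band(src: str, cx: float, tag: str) -> str:
        # Scale so the band fills half the canvas height, then slide a
        # 1080-px-wide crop window across the source as cx goes 0 -> 1.
        return (f"[{src}]scale=-2:{band_h},"
                f"crop={OUT_W}:{band_h}:(iw-{OUT_W})*{cx:.3f}:0[{tag}]")

    return ";".join([
        "[0:v]split=2[c][p]",            # one decoded stream, two branches
        band("c", chart_cx, "chart"),    # chart band (top, per the default)
        band("p", person_cx, "person"),  # person band (bottom)
        "[chart][person]vstack=inputs=2[v]",
    ])

print(split_chart_person_filter(chart_cx=0.50, person_cx=0.70))
```

The other four recipes vary only the band count and crop geometry, which is why bounded layout math can stand in for a general subject tracker.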
## First-principles decisions (what we intentionally did NOT build)

- **No giant subject-tracker ML.** The video format has 5 fixed layouts (with a hard "max 2 items" rule); pixel-level tracking is not needed.
- **No drag-and-highlight UI.** An MCP tool is a better "UI" for an agent-first workflow. If a human wants to override, they pass a `LayoutInstruction` with their own `person_x_norm` / `chart_x_norm` / `zoom`.
- **No end-to-end video→video model.** The HIVE paper's core insight is that decomposed orchestration beats monolithic generation. We reify that insight as a set of small, composable tools.

## Extending the pilot

- Plug a real multimodal model into `classify_scenes_with_llm(vision_fn)`.
- Plug a real reasoning model into `select_clips_with_llm(text_fn)`.
- Plug a real vision-LLM into `detect_regions_with_llm(scenes, vision_fn)` to get per-scene bboxes + OCR text, then feed the results back through `classify_scenes_with_vision`. This is the scene-change → v3 images → LLM+OCR → bbox path; see `../docs/SOLUTIONS.md §4` for the rationale.
- All three hooks enforce strict JSON outputs, so bad model output can't corrupt downstream stages.

## Testing

```bash
python -m pytest
```

See [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) for deeper rationale.

## License

MIT