---
title: YouTubeIntel
license: mit
tags:
- youtube
- analytics
- nlp
- llm
- fastapi
- react
---

<div align="center">

# YouTubeIntel

### From raw YouTube comments to **themes, audience positions, editorial briefings, and moderation-ready creator signals**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
[![React 18](https://img.shields.io/badge/React_18-61DAFB?style=for-the-badge&logo=react&logoColor=black)](https://react.dev)
[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-4169E1?style=for-the-badge&logo=postgresql&logoColor=white)](https://postgresql.org)
[![Redis](https://img.shields.io/badge/Redis-DC382D?style=for-the-badge&logo=redis&logoColor=white)](https://redis.io)
[![OpenAI](https://img.shields.io/badge/OpenAI-412991?style=for-the-badge&logo=openai&logoColor=white)](https://openai.com)

<br />

**YouTubeIntel** is a portfolio-grade, production-style analytics platform that treats YouTube comments as operational data — not just sentiment fluff.

It combines a full-stack product surface with two substantial pipelines:
- a **Topic Intelligence Pipeline** for clustering, labeling, briefing, and audience positions;
- an **Appeal Analytics Pipeline** for criticism, questions, appeals, toxicity, and moderation support.

[🚀 Quick Start](#-quick-start) · [🏗 Architecture](#-architecture) · [🔬 Topic Pipeline](#-topic-intelligence-pipeline) · [🎯 Appeal Pipeline](#-appeal-analytics-pipeline) · [📡 API](#-api-reference)

> ### 🚀 Desktop Version Available
>
> **[👉 huggingface.co/herman3996/YouTubeIntelDesktop](https://huggingface.co/herman3996/YouTubeIntelDesktop)**


> If this project makes you think **“this should have more stars”** — you’re probably right ⭐

</div>

---

## The pitch

Most YouTube analytics tools stop at vanity metrics.

They tell you:
- views,
- engagement,
- rough sentiment,
- maybe some keyword clouds.

They usually **do not** tell you:
- what the audience is *actually arguing about*;
- which positions exist inside each topic;
- what comments are actionable for the next episode;
- what criticism is genuinely constructive;
- which toxic messages should be reviewed or escalated.

**YouTubeIntel** is built to solve exactly that.

<table>
<tr>
<td width="50%">

### 🔬 Semantic topic intelligence
Reveal real audience themes, cluster structure, representative quotes, and position-level disagreement.

</td>
<td width="50%">

### 🎯 Creator-facing signal extraction
Separate criticism, questions, appeals, and toxicity instead of mixing everything into one noisy bucket.

</td>
</tr>
<tr>
<td width="50%">

### 🛡 Moderation-aware workflow
Support toxic review queues, target detection, and manual/automatic moderation actions.

</td>
<td width="50%">

### 📦 Exportable decision artifacts
Persist reports and analytics blocks in PostgreSQL and export Markdown/HTML reporting artifacts.

</td>
</tr>
</table>

---

## Why this repo is worth starring

<table>
<tr>
<td width="33%">

### 🧠 Real pipeline depth
Not a toy CRUD app. This repo includes clustering, embeddings, LLM classification, refinement passes, moderation logic, diagnostics, and background execution.

</td>
<td width="33%">

### 🖥 Full product surface
Backend, frontend, API docs, Docker deployment, Celery workers, runtime settings, operator dashboard, and a desktop packaging companion.

</td>
<td width="33%">

### 🧰 Portfolio + practical utility
Designed to look strong on GitHub **and** to behave like a credible internal analytics product.

</td>
</tr>
</table>

---

## ✨ What the product does today

### Core capabilities
- **Topic Intelligence Pipeline** for semantic clustering, topic labeling, audience-position extraction, and daily briefing generation
- **Appeal Analytics Pipeline** for author-directed criticism, questions, appeals, and toxicity routing
- **Hybrid moderation** with rule-based filtering, optional LLM borderline moderation, toxic target detection, and review queues
- **Operator UI** for runs, reports, runtime settings, budget visibility, and appeal review
- **Asynchronous processing** via Celery + Redis
- **Docker-first environment** with PostgreSQL, Redis, backend, worker, and beat
- **Desktop companion** in [`desktop/`](./desktop) for local packaged delivery

### Operator-facing routes
- `/ui` — dashboard shell
- `/ui/videos` — recent videos + status monitor
- `/ui/budget` — budget + runtime settings
- `/ui/reports/:videoId` — full topic report detail
- `/ui/appeal/:videoId` — appeal analytics + toxic moderation workflow
- `/docs` — Swagger UI

---

## 🏗 Architecture

```text
YouTube Data API -> Comment fetch -> Preprocess + moderation
                                  -> Topic Intelligence Pipeline -> Markdown/HTML reports
                                  -> Topic Intelligence Pipeline -> PostgreSQL
                                  -> Appeal Analytics Pipeline    -> PostgreSQL

React + Vite SPA -> FastAPI -> PostgreSQL
Celery worker -> FastAPI
Celery beat -> Celery worker
Redis -> Celery worker
Redis -> Celery beat
```

### Stack overview
| Layer | Technology |
|---|---|
| API layer | FastAPI |
| Persistence | SQLAlchemy + PostgreSQL |
| Background execution | Celery + Redis |
| ML / NLP | SentenceTransformers, HDBSCAN, scikit-learn |
| LLM layer | OpenAI-compatible chat + embeddings |
| Frontend | React 18, TypeScript, Vite |
| Desktop packaging companion | [`desktop/`](./desktop) |

---

## 🔬 Topic Intelligence Pipeline

> **Entry point:** `app/services/pipeline/runner.py` → `DailyRunService`

This is the “what is the audience discussing, how are they split, and what should the creator do next?” pipeline.

```text
1) Context
 -> 2) Comments fetch
 -> 3) Preprocess + moderation
 -> 4) Persist comments
 -> 5) Embeddings
 -> 6) Clustering
 -> 7) Episode match (compatibility stage, skipped)
 -> 8) Labeling + audience positions
 -> 9) Persist clusters
 -> 10) Briefing build
 -> 11) Report export
```

### What happens in each stage
| # | Stage | What it does |
|:-:|---|---|
| 1 | Context | Loads prior report context for continuity |
| 2 | Comments fetch | Pulls comments via the YouTube Data API |
| 3 | Preprocess + moderation | Filters low-signal items, normalizes text, applies rule-based moderation, optionally runs borderline LLM moderation |
| 4 | Persist comments | Stores processed comments in PostgreSQL |
| 5 | Embeddings | Builds vectors with local embeddings or OpenAI |
| 6 | Clustering | Groups comments into themes with fallback logic for edge cases |
| 7 | Episode match | Preserved as a compatibility stage and **explicitly skipped** in the active runtime |
| 8 | Labeling + audience positions | Produces titles, summaries, and intra-topic positions |
| 9 | Persist clusters | Saves clusters and membership mappings |
| 10 | Briefing build | Generates executive summary, actions, risks, and topic-level findings |
| 11 | Report export | Writes Markdown/HTML reports and stores structured JSON |
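
The stage order above, including the skipped compatibility stage, can be sketched as a small ordered runner. This is only an illustration, not the code in `DailyRunService`: the stage keys, `RunLog`, and `run_pipeline` are hypothetical names, and each handler is assumed to return a dict of state updates.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Stage keys mirror the table above; the identifiers themselves are hypothetical.
STAGES: List[str] = [
    "context", "comments_fetch", "preprocess_moderation", "persist_comments",
    "embeddings", "clustering", "episode_match", "labeling_positions",
    "persist_clusters", "briefing_build", "report_export",
]
# Kept in the stage list for compatibility, but never executed.
SKIPPED = {"episode_match"}


@dataclass
class RunLog:
    executed: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)


def run_pipeline(handlers: Dict[str, Callable[[dict], dict]], state: dict) -> RunLog:
    """Run stages in order, merging each handler's updates into the shared state."""
    log = RunLog()
    for name in STAGES:
        if name in SKIPPED:
            log.skipped.append(name)
            continue
        handler = handlers.get(name)
        if handler is not None:  # stages without a handler are treated as no-ops
            state.update(handler(state))
        log.executed.append(name)
    return log


state: dict = {}
log = run_pipeline({"comments_fetch": lambda s: {"comments": ["a", "b"]}}, state)
print(log.skipped)        # ['episode_match']
print(len(log.executed))  # 10
```

Keeping the skipped stage in the list, rather than deleting it, preserves the stage numbering that stored reports and diagnostics may refer to.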

### Main outputs
- top themes by discussion weight
- representative quotes and question comments
- audience positions inside each theme
- editorial briefing for the next content cycle, including actions, misunderstandings, audience requests, risks, and trend deltas
- moderation and degradation diagnostics

---

## 🎯 Appeal Analytics Pipeline

> **Entry point:** `app/services/appeal_analytics/runner.py` → `AppealAnalyticsService`

This is the “what is being said *to the creator*?” pipeline.

```text
Load/fetch video comments
 -> Unified LLM classification
 -> Question refiner
 -> Political criticism filter
 -> Toxic target classification
 -> Confidence + target routing:
      - Auto-ban block
      - Manual review block
      - Ignore third-party insults
 -> Persist appeal blocks
```

### Persisted blocks
| Block | Meaning |
|---|---|
| `constructive_question` | creator-directed questions |
| `constructive_criticism` | criticism backed by an actual political argument |
| `author_appeal` | direct requests / appeals to the creator |
| `toxic_auto_banned` | toxic comments that passed final verification and were auto-moderated |
| `toxic_manual_review` | toxic comments queued for admin review |

`skip` is still used internally as a classification outcome, but it is **not persisted as a block**.

### Pipeline behaviors that matter
- question candidates get a **second-pass refiner**;
- criticism with question signal can be promoted into the question block;
- low-value `attack_ragebait` / `meme_one_liner` question candidates are demoted out of `constructive_question`;
- toxic comments are classified by **target** (`author`, `guest`, `content`, `undefined`, `third_party`);
- routing splits comments into **auto-ban**, **manual review**, or **ignore**;
- comments with toxicity confidence **>= 0.80** enter the auto-ban path, but a final strict verification pass can still downgrade them into manual review;
- auto-banned authors can be **unbanned from the UI** if the operator spots a false positive;
- per-video guest names can improve targeting accuracy.
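
The confidence and target routing described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the logic in `AppealAnalyticsService`: `ToxicComment`, `route`, and the `verify` callback (a stand-in for the final strict verification pass) are hypothetical, and the sketch assumes that every toxic comment below the threshold and not aimed at a third party goes to manual review.

```python
from dataclasses import dataclass
from typing import Callable

AUTO_BAN_THRESHOLD = 0.80  # documented default for AUTO_BAN_THRESHOLD


@dataclass(frozen=True)
class ToxicComment:
    text: str
    confidence: float  # toxicity confidence in [0, 1]
    target: str        # author | guest | content | undefined | third_party


def route(
    comment: ToxicComment,
    verify: Callable[[ToxicComment], bool],
    threshold: float = AUTO_BAN_THRESHOLD,
) -> str:
    """Return the block name for a toxic comment, or 'ignore'."""
    # Insults aimed at third parties are not moderated.
    if comment.target == "third_party":
        return "ignore"
    # High-confidence comments enter the auto-ban path, but the strict
    # verification pass can still downgrade them to manual review.
    if comment.confidence >= threshold and verify(comment):
        return "toxic_auto_banned"
    # Assumption: everything else lands in the manual review queue.
    return "toxic_manual_review"


always_confirm = lambda c: True
print(route(ToxicComment("...", 0.95, "author"), always_confirm))       # toxic_auto_banned
print(route(ToxicComment("...", 0.95, "third_party"), always_confirm))  # ignore
print(route(ToxicComment("...", 0.50, "guest"), always_confirm))        # toxic_manual_review
```

Passing the verification step in as a callback keeps the routing decision testable without an LLM call.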

---

## 📡 API Reference

Interactive docs live at **`/docs`** when the backend is running.

### Core runs and reports
| Method | Endpoint | Purpose |
|:---:|---|---|
| `GET` | `/health` | health and OpenAI endpoint metadata |
| `POST` | `/run/latest` | run Topic Intelligence for the latest playlist video |
| `POST` | `/run/video` | run Topic Intelligence for a specific video URL |
| `GET` | `/videos` | recent videos |
| `GET` | `/videos/statuses` | progress/status dashboard payload |
| `GET` | `/videos/{video_id}` | single-video metadata |
| `GET` | `/reports/latest` | latest report |
| `GET` | `/reports/{video_id}` | latest report for one video |
| `GET` | `/reports/{video_id}/detail` | enriched report with comments and positions |

### Appeal analytics and moderation
| Method | Endpoint | Purpose |
|:---:|---|---|
| `POST` | `/appeal/run` | run Appeal Analytics |
| `GET` | `/appeal/{video_id}` | latest appeal analytics result |
| `GET` | `/appeal/{video_id}/author/{author_name}` | all comments by one author |
| `GET` | `/appeal/{video_id}/toxic-review` | manual toxic-review queue |
| `POST` | `/appeal/ban-user` | manual ban action |
| `POST` | `/appeal/unban-user` | restore a previously banned commenter |

### Runtime and settings
| Method | Endpoint | Purpose |
|:---:|---|---|
| `GET` | `/settings/video-guests/{video_id}` | load guest names |
| `PUT` | `/settings/video-guests/{video_id}` | update guest names |
| `GET` | `/budget` | OpenAI usage snapshot |
| `GET` | `/settings/runtime` | current mutable runtime settings |
| `PUT` | `/settings/runtime` | update mutable runtime settings |
| `GET` | `/app/setup/status` | desktop-only first-run setup status |
| `POST` | `/app/setup` | desktop-only save desktop bootstrap secrets |
| `PUT` | `/app/setup` | desktop-only rotate desktop bootstrap secrets / OAuth values |

For request examples, see [`docs/requests.md`](./docs/requests.md).
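
As a rough illustration, a standard-library client for the endpoints above might look like the sketch below. The base URL matches the Quick Start defaults; the `video_url` body field is a hypothetical name, so check `/docs` or `docs/requests.md` for the real request schema before relying on it.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8000"  # matches the Quick Start app URLs


def build_run_video_request(video_url: str, base_url: str = BASE_URL) -> request.Request:
    """Build (but do not send) a POST /run/video request."""
    # "video_url" is an assumed field name, not confirmed by the API schema.
    body = json.dumps({"video_url": video_url}).encode("utf-8")
    return request.Request(
        f"{base_url}/run/video",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(req_or_url, timeout: float = 10.0):
    """Send a request (or GET a URL) and decode the JSON response."""
    with request.urlopen(req_or_url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# With the backend running:
#   fetch_json(f"{BASE_URL}/health")
#   fetch_json(build_run_video_request("https://www.youtube.com/watch?v=<video-id>"))
```

Separating request construction from sending makes it possible to inspect the payload without a running backend.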

---

## 🚀 Quick Start

### Docker Compose
```bash
cp .env-docker.example .env-docker
# Same template as `.env.example`, but DATABASE_URL / Celery URLs use Compose hostnames (`db`, `redis`).
# Fill in YOUTUBE_API_KEY, YOUTUBE_PLAYLIST_ID, OPENAI_API_KEY (and OAuth if you use moderation actions).

docker compose up --build
```

Services started:
- PostgreSQL
- Redis
- FastAPI backend
- Celery worker
- Celery beat

App URLs:
- UI: `http://localhost:8000/ui`
- Swagger: `http://localhost:8000/docs`
- Health: `http://localhost:8000/health`

### Local development
```bash
cp .env.example .env
python -m venv .venv
source .venv/bin/activate        # Linux/macOS
# .\.venv\Scripts\Activate.ps1   # Windows PowerShell

pip install -U pip
pip install -r requirements-dev.txt

docker compose up -d db redis
alembic upgrade head

uvicorn app.main:app --reload
```

Frontend in a separate terminal:
```bash
npm --prefix frontend ci
npm --prefix frontend run dev
```

Workers in separate terminals:
```bash
celery -A app.workers.celery_app:celery_app worker --loglevel=INFO
celery -A app.workers.celery_app:celery_app beat --loglevel=INFO
```

---

## ⚙️ Configuration

Primary configuration surfaces:
- [`.env.example`](./.env.example)
- [`.env-docker.example`](./.env-docker.example)
- [`app/core/config.py`](./app/core/config.py)

### Most important variables
| Variable | Why it matters |
|---|---|
| `YOUTUBE_API_KEY` | YouTube Data API access |
| `YOUTUBE_PLAYLIST_ID` | latest-video runs |
| `OPENAI_API_KEY` | classification, labeling, moderation |
| `OPENAI_MAX_USD_PER_RUN` / `OPENAI_HARD_BUDGET_ENFORCED` | optional per-run spend guardrails |
| `YOUTUBE_OAUTH_CLIENT_ID` | optional YouTube moderation / restore actions |
| `YOUTUBE_OAUTH_CLIENT_SECRET` | optional YouTube moderation / restore actions |
| `YOUTUBE_OAUTH_REFRESH_TOKEN` | optional YouTube moderation / restore actions |
| `DATABASE_URL` | PostgreSQL persistence |
| `CELERY_BROKER_URL` | Redis broker |
| `CELERY_RESULT_BACKEND` | Redis result storage |
| `EMBEDDING_MODE` | `local` or `openai` |
| `LOCAL_EMBEDDING_MODEL` | recommended local topic-clustering model |
| `AUTO_BAN_THRESHOLD` | appeal/toxic routing: auto-hide a comment when its **UI confidence** is ≥ this value (default `0.80`) |
| `TOXIC_AUTOBAN_PRECISION_REVIEW_THRESHOLD` | should match `AUTO_BAN_THRESHOLD` so that the first-pass score is trusted; the stricter second LLM check runs only when the score is below this value |

### Runtime notes
- `episode_match` / transcription fields still exist for compatibility, but that stage is **skipped in the active runtime**;
- budget usage is tracked and visible via UI/API;
- guest-name configuration improves appeal/toxic targeting quality;
- for historical A/B testing of local embedding models, run `PYTHONPATH=. python scripts/benchmark_topic_models.py`.
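
Because the two toxic-routing thresholds in the table above are meant to agree, a small startup check can catch drift between them. The sketch below is not the project's `app/core/config.py`: `ToxicThresholds` and `load_thresholds` are hypothetical names, and falling back to `AUTO_BAN_THRESHOLD` when the review threshold is unset is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ToxicThresholds:
    auto_ban: float
    precision_review: float

    @property
    def aligned(self) -> bool:
        """True when both thresholds match, as the table recommends."""
        return self.auto_ban == self.precision_review


def load_thresholds(env: Optional[Mapping[str, str]] = None) -> ToxicThresholds:
    """Read both thresholds from the environment and validate their range."""
    env = os.environ if env is None else env
    auto_ban = float(env.get("AUTO_BAN_THRESHOLD", "0.80"))  # documented default
    # Assumption: reuse the auto-ban value when the review threshold is unset.
    review = float(env.get("TOXIC_AUTOBAN_PRECISION_REVIEW_THRESHOLD", str(auto_ban)))
    for name, value in (
        ("AUTO_BAN_THRESHOLD", auto_ban),
        ("TOXIC_AUTOBAN_PRECISION_REVIEW_THRESHOLD", review),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return ToxicThresholds(auto_ban=auto_ban, precision_review=review)


thresholds = load_thresholds({})
print(thresholds.auto_ban, thresholds.aligned)  # 0.8 True
```

Accepting an explicit mapping instead of always reading `os.environ` keeps the check easy to exercise in tests.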

---

## 🛠 Quality Gates

Recommended checks:

```bash
ruff check .
black --check .
pytest -q
( cd desktop && pytest -q )
npm --prefix frontend run build
```

CI in `.github/workflows/ci.yml` covers:
- Python lint / formatting (`ruff`, `black`) on the full tree (including `desktop/`)
- root `pytest` and `desktop/` `pytest`
- frontend production build

---

## 📂 Project Structure

```text
app/                 FastAPI app, schemas, services, workers
frontend/            React SPA
alembic/             database migrations
tests/               pytest suite
scripts/             startup scripts for api/worker/beat, model benchmark script
desktop/             desktop packaging companion
docs/PIPELINE.md     pipeline-level notes
docs/requests.md     endpoint request reference
```

### Helpful companion docs
- [`docs/PIPELINE.md`](./docs/PIPELINE.md)
- [`docs/requests.md`](./docs/requests.md)
- [`docs/README.md`](./docs/README.md)
- [`app/README.md`](./app/README.md)
- [`frontend/README.md`](./frontend/README.md)
- [`tests/README.md`](./tests/README.md)
- [`desktop/README.md`](./desktop/README.md)

---

## Final note

This repo is deliberately built to feel like **a serious internal analytics product**, not just a demo.

If you like projects that combine:
- real product thinking,
- non-trivial data/LLM pipelines,
- backend + frontend + infra,
- and a strong GitHub presentation,

**YouTubeIntel** was made for that exact intersection.

---

## License

See [`LICENSE`](./LICENSE).