---
title: YouTubeIntel
license: mit
tags:
- youtube
- analytics
- nlp
- llm
- fastapi
- react
---
# YouTubeIntel

### From raw YouTube comments to **themes, audience positions, editorial briefings, and moderation-ready creator signals**

[![Python 3.11+](https://img.shields.io/badge/python-3.11+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://python.org)
[![FastAPI](https://img.shields.io/badge/FastAPI-009688?style=for-the-badge&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
[![React 18](https://img.shields.io/badge/React_18-61DAFB?style=for-the-badge&logo=react&logoColor=black)](https://react.dev)
[![PostgreSQL](https://img.shields.io/badge/PostgreSQL-4169E1?style=for-the-badge&logo=postgresql&logoColor=white)](https://postgresql.org)
[![Redis](https://img.shields.io/badge/Redis-DC382D?style=for-the-badge&logo=redis&logoColor=white)](https://redis.io)
[![OpenAI](https://img.shields.io/badge/OpenAI-412991?style=for-the-badge&logo=openai&logoColor=white)](https://openai.com)

**YouTubeIntel** is a portfolio-grade, production-style analytics platform that treats YouTube comments as operational data, not just sentiment fluff.

It combines a full-stack product surface with two substantial pipelines:

- a **Topic Intelligence Pipeline** for clustering, labeling, briefing, and audience positions;
- an **Appeal Analytics Pipeline** for criticism, questions, appeals, toxicity, and moderation support.

[🚀 Quick Start](#-quick-start) · [🏗 Architecture](#-architecture) · [🔬 Topic Pipeline](#-topic-intelligence-pipeline) · [🎯 Appeal Pipeline](#-appeal-analytics-pipeline) · [📡 API](#-api-reference)

> ### 🚀 Desktop Version Available
>
> **[👉 huggingface.co/herman3996/YouTubeIntelDesktop](https://huggingface.co/herman3996/YouTubeIntelDesktop)**

> If this project makes you think **“this should have more stars”**, you’re probably right ⭐

---

## The pitch

Most YouTube analytics tools stop at vanity metrics.

They tell you:

- views,
- engagement,
- rough sentiment,
- maybe some keyword clouds.

They usually **do not** tell you:

- what the audience is *actually arguing about*;
- which positions exist inside each topic;
- what comments are actionable for the next episode;
- what criticism is genuinely constructive;
- which toxic messages should be reviewed or escalated.

**YouTubeIntel** is built to solve exactly that.

### 🔬 Semantic topic intelligence

Reveal real audience themes, cluster structure, representative quotes, and position-level disagreement.

### 🎯 Creator-facing signal extraction

Separate criticism, questions, appeals, and toxicity instead of mixing everything into one noisy bucket.

### 🛡 Moderation-aware workflow

Support toxic review queues, target detection, and manual/automatic moderation actions.

### 📦 Exportable decision artifacts

Persist reports and analytics blocks in PostgreSQL and export Markdown/HTML reporting artifacts.

---

## Why this repo is worth starring

### 🧠 Real pipeline depth

Not a toy CRUD app. This repo includes clustering, embeddings, LLM classification, refinement passes, moderation logic, diagnostics, and background execution.

### 🖥 Full product surface

Backend, frontend, API docs, Docker deployment, Celery workers, runtime settings, operator dashboard, and a desktop packaging companion.

### 🧰 Portfolio + practical utility

Designed to look strong on GitHub **and** to behave like a credible internal analytics product.

---

## ✨ What the product does today

### Core capabilities

- **Topic Intelligence Pipeline** for semantic clustering, topic labeling, audience-position extraction, and daily briefing generation
- **Appeal Analytics Pipeline** for author-directed criticism, questions, appeals, and toxicity routing
- **Hybrid moderation** with rule-based filtering, optional LLM borderline moderation, toxic target detection, and review queues
- **Operator UI** for runs, reports, runtime settings, budget visibility, and appeal review
- **Asynchronous processing** via Celery + Redis
- **Docker-first environment** with PostgreSQL, Redis, backend, worker, and beat
- **Desktop companion** in [`desktop/`](./desktop) for local packaged delivery

### Operator-facing routes

- `/ui`: dashboard shell
- `/ui/videos`: recent videos + status monitor
- `/ui/budget`: budget + runtime settings
- `/ui/reports/:videoId`: full topic report detail
- `/ui/appeal/:videoId`: appeal analytics + toxic moderation workflow
- `/docs`: Swagger UI

---

## 🏗 Architecture

```text
YouTube Data API -> Comment fetch -> Preprocess + moderation
  -> Topic Intelligence Pipeline -> Markdown/HTML reports
  -> Topic Intelligence Pipeline -> PostgreSQL
  -> Appeal Analytics Pipeline -> PostgreSQL

React + Vite SPA -> FastAPI -> PostgreSQL
Celery worker -> FastAPI
Celery beat -> Celery worker
Redis -> Celery worker
Redis -> Celery beat
```

### Stack overview

| Layer | Technology |
|---|---|
| API layer | FastAPI |
| Persistence | SQLAlchemy + PostgreSQL |
| Background execution | Celery + Redis |
| ML / NLP | SentenceTransformers, HDBSCAN, scikit-learn |
| LLM layer | OpenAI-compatible chat + embeddings |
| Frontend | React 18, TypeScript, Vite |
| Desktop packaging companion | [`desktop/`](./desktop) |

---

## 🔬 Topic Intelligence Pipeline

> **Entry point:** `app/services/pipeline/runner.py` → `DailyRunService`

This is the “what is the audience discussing, how are they split, and what should the creator do next?” pipeline.

```text
1) Context
-> 2) Comments fetch
-> 3) Preprocess + moderation
-> 4) Persist comments
-> 5) Embeddings
-> 6) Clustering
-> 7) Episode match (compatibility stage, skipped)
-> 8) Labeling + audience positions
-> 9) Persist clusters
-> 10) Briefing build
-> 11) Report export
```

### What happens in each stage

| # | Stage | What it does |
|:-:|---|---|
| 1 | Context | Loads prior report context for continuity |
| 2 | Comments fetch | Pulls comments via the YouTube Data API |
| 3 | Preprocess + moderation | Filters low-signal items, normalizes text, applies rule-based moderation, optionally runs borderline LLM moderation |
| 4 | Persist comments | Stores processed comments in PostgreSQL |
| 5 | Embeddings | Builds vectors with local embeddings or OpenAI |
| 6 | Clustering | Groups comments into themes with fallback logic for edge cases |
| 7 | Episode match | Preserved as a compatibility stage and **explicitly skipped** in the active runtime |
| 8 | Labeling + audience positions | Produces titles, summaries, and intra-topic positions |
| 9 | Persist clusters | Saves clusters and membership mappings |
| 10 | Briefing build | Generates executive summary, actions, risks, and topic-level findings |
| 11 | Report export | Writes Markdown/HTML reports and stores structured JSON |

### Main outputs

- top themes by discussion weight
- representative quotes and question comments
- audience positions inside each theme
- editorial briefing for the next content cycle, including actions, misunderstandings, audience requests, risks, and trend deltas
- moderation and degradation diagnostics

---

## 🎯 Appeal Analytics Pipeline

> **Entry point:** `app/services/appeal_analytics/runner.py` → `AppealAnalyticsService`

This is the “what is being said *to the creator*?” pipeline.

```text
Load/fetch video comments
  -> Unified LLM classification
  -> Question refiner
  -> Political criticism filter
  -> Toxic target classification
  -> Confidence + target routing:
       - Auto-ban block
       - Manual review block
       - Ignore third-party insults
  -> Persist appeal blocks
```

### Persisted blocks

| Block | Meaning |
|---|---|
| `constructive_question` | creator-directed questions |
| `constructive_criticism` | criticism backed by an actual political argument |
| `author_appeal` | direct requests / appeals to the creator |
| `toxic_auto_banned` | toxic comments that passed final verification and were auto-moderated |
| `toxic_manual_review` | toxic comments queued for admin review |

`skip` is still used internally as a classification outcome, but it is **not persisted as a block**.

### Pipeline behaviors that matter

- question candidates get a **second-pass refiner**;
- criticism with question signal can be promoted into the question block;
- low-value `attack_ragebait` / `meme_one_liner` question candidates are demoted out of `constructive_question`;
- toxic comments are classified by **target** (`author`, `guest`, `content`, `undefined`, `third_party`);
- routing splits comments into **auto-ban**, **manual review**, or **ignore**;
- comments with toxicity confidence **>= 0.80** enter the auto-ban path, but a final strict verification pass can still downgrade them into manual review;
- auto-banned authors can be **unbanned from the UI** if the operator spots a false positive;
- per-video guest names can improve targeting accuracy.

---

## 📡 API Reference

Interactive docs live at **`/docs`** when the backend is running.
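
The confidence and target routing described under *Pipeline behaviors that matter* decides which comments the moderation endpoints return, so a small sketch may help. This is not the repository's code: `ToxicComment`, `route`, and the `verify` callback are hypothetical names, and it assumes every target other than `third_party` is handled the same way, which the README does not state.

```python
from dataclasses import dataclass
from typing import Callable

# README default for AUTO_BAN_THRESHOLD.
AUTO_BAN_THRESHOLD = 0.80


@dataclass(frozen=True)
class ToxicComment:
    """Hypothetical container for one toxic comment after target classification."""

    text: str
    target: str  # author, guest, content, undefined, or third_party
    confidence: float  # toxicity confidence in [0, 1]


def route(
    comment: ToxicComment,
    verify: Callable[[ToxicComment], bool],
    threshold: float = AUTO_BAN_THRESHOLD,
) -> str:
    """Return the persisted block name for a comment, or "ignore".

    `verify` stands in for the final strict verification pass: it is only
    consulted for comments that already reached the auto-ban path.
    """
    # Insults aimed at third parties are dropped rather than persisted.
    if comment.target == "third_party":
        return "ignore"
    # High-confidence comments enter the auto-ban path, but a failed
    # verification downgrades them to manual review.
    if comment.confidence >= threshold and verify(comment):
        return "toxic_auto_banned"
    return "toxic_manual_review"
```

Passing the verification step as a callable keeps the routing rule testable without an LLM call; in this sketch a comment at exactly `0.80` confidence still enters the auto-ban path because the comparison is inclusive.
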

### Core runs and reports

| Method | Endpoint | Purpose |
|:---:|---|---|
| `GET` | `/health` | health and OpenAI endpoint metadata |
| `POST` | `/run/latest` | run Topic Intelligence for the latest playlist video |
| `POST` | `/run/video` | run Topic Intelligence for a specific video URL |
| `GET` | `/videos` | recent videos |
| `GET` | `/videos/statuses` | progress/status dashboard payload |
| `GET` | `/videos/{video_id}` | single-video metadata |
| `GET` | `/reports/latest` | latest report |
| `GET` | `/reports/{video_id}` | latest report for one video |
| `GET` | `/reports/{video_id}/detail` | enriched report with comments and positions |

### Appeal analytics and moderation

| Method | Endpoint | Purpose |
|:---:|---|---|
| `POST` | `/appeal/run` | run Appeal Analytics |
| `GET` | `/appeal/{video_id}` | latest appeal analytics result |
| `GET` | `/appeal/{video_id}/author/{author_name}` | all comments by one author |
| `GET` | `/appeal/{video_id}/toxic-review` | manual toxic-review queue |
| `POST` | `/appeal/ban-user` | manual ban action |
| `POST` | `/appeal/unban-user` | restore a previously banned commenter |

### Runtime and settings

| Method | Endpoint | Purpose |
|:---:|---|---|
| `GET` | `/settings/video-guests/{video_id}` | load guest names |
| `PUT` | `/settings/video-guests/{video_id}` | update guest names |
| `GET` | `/budget` | OpenAI usage snapshot |
| `GET` | `/settings/runtime` | current mutable runtime settings |
| `PUT` | `/settings/runtime` | update mutable runtime settings |
| `GET` | `/app/setup/status` | desktop-only first-run setup status |
| `POST` | `/app/setup` | desktop-only save desktop bootstrap secrets |
| `PUT` | `/app/setup` | desktop-only rotate desktop bootstrap secrets / OAuth values |

For request examples, see [`docs/requests.md`](./docs/requests.md).
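
For quick experiments against the read-only `GET` endpoints above, a standard-library helper along these lines may be enough. It is a sketch, not part of the repo: `endpoint` and `get_json` are hypothetical helpers, the base URL is the one listed under Quick Start, and nothing is assumed about the response schema beyond it being JSON.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Backend address from the Quick Start section; adjust for other deployments.
BASE_URL = "http://localhost:8000"


def endpoint(template: str, base_url: str = BASE_URL, **path_params: str) -> str:
    """Fill a path template such as "/reports/{video_id}/detail" and prefix the base URL.

    Path parameters are percent-encoded so values like author names with
    spaces produce a valid URL.
    """
    encoded = {name: quote(str(value), safe="") for name, value in path_params.items()}
    return base_url.rstrip("/") + template.format(**encoded)


def get_json(url: str, timeout: float = 10.0):
    """Issue a GET request and decode the JSON body (the schema is not assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the backend running, `get_json(endpoint("/health"))` would return the health payload, and `get_json(endpoint("/reports/{video_id}/detail", video_id="<id>"))` the enriched report for one video. The `POST` and `PUT` endpoints are left out on purpose because their request bodies are documented in `docs/requests.md` rather than here.
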

---

## 🚀 Quick Start

### Docker Compose

```bash
cp .env-docker.example .env-docker
# Same template as `.env.example`, but DATABASE_URL / Celery URLs use Compose hostnames (`db`, `redis`).
# Fill in YOUTUBE_API_KEY, YOUTUBE_PLAYLIST_ID, OPENAI_API_KEY (and OAuth if you use moderation actions).
docker compose up --build
```

Services started:

- PostgreSQL
- Redis
- FastAPI backend
- Celery worker
- Celery beat

App URLs:

- UI: `http://localhost:8000/ui`
- Swagger: `http://localhost:8000/docs`
- Health: `http://localhost:8000/health`

### Local development

```bash
cp .env.example .env
python -m venv .venv
source .venv/bin/activate          # Linux/macOS
# .\.venv\Scripts\Activate.ps1    # Windows PowerShell
pip install -U pip
pip install -r requirements-dev.txt
docker compose up -d db redis
alembic upgrade head
uvicorn app.main:app --reload
```

Frontend in a separate terminal:

```bash
npm --prefix frontend ci
npm --prefix frontend run dev
```

Workers in separate terminals:

```bash
celery -A app.workers.celery_app:celery_app worker --loglevel=INFO
celery -A app.workers.celery_app:celery_app beat --loglevel=INFO
```

---

## ⚙️ Configuration

Primary configuration surfaces:

- [`.env.example`](./.env.example)
- [`.env-docker.example`](./.env-docker.example)
- [`app/core/config.py`](./app/core/config.py)

### Most important variables

| Variable | Why it matters |
|---|---|
| `YOUTUBE_API_KEY` | YouTube Data API access |
| `YOUTUBE_PLAYLIST_ID` | latest-video runs |
| `OPENAI_API_KEY` | classification, labeling, moderation |
| `OPENAI_MAX_USD_PER_RUN` / `OPENAI_HARD_BUDGET_ENFORCED` | optional per-run spend guardrails |
| `YOUTUBE_OAUTH_CLIENT_ID` | optional YouTube moderation / restore actions |
| `YOUTUBE_OAUTH_CLIENT_SECRET` | optional YouTube moderation / restore actions |
| `YOUTUBE_OAUTH_REFRESH_TOKEN` | optional YouTube moderation / restore actions |
| `DATABASE_URL` | PostgreSQL persistence |
| `CELERY_BROKER_URL` | Redis broker |
| `CELERY_RESULT_BACKEND` | Redis result storage |
| `EMBEDDING_MODE` | `local` or `openai` |
| `LOCAL_EMBEDDING_MODEL` | recommended local topic-clustering model |
| `AUTO_BAN_THRESHOLD` | appeal/toxic: auto-hide when **UI confidence** ≥ this (default `0.80`) |
| `TOXIC_AUTOBAN_PRECISION_REVIEW_THRESHOLD` | should match `AUTO_BAN_THRESHOLD` so first-pass score is trusted; stricter second LLM only when score is below this |

### Runtime notes

- `episode_match` / transcription fields still exist for compatibility, but that stage is **skipped in the active runtime**;
- budget usage is tracked and visible via UI/API;
- guest-name configuration improves appeal/toxic targeting quality;
- for historical A/B testing of local embedding models, run `PYTHONPATH=. python scripts/benchmark_topic_models.py`.

---

## 🛠 Quality Gates

Recommended checks:

```bash
ruff check .
black --check .
pytest -q
( cd desktop && pytest -q )
npm --prefix frontend run build
```

CI in `.github/workflows/ci.yml` covers:

- Python lint / formatting (`ruff`, `black`) on the full tree (including `desktop/`)
- root `pytest` and `desktop/` `pytest`
- frontend production build

---

## 📂 Project Structure

```text
app/               FastAPI app, schemas, services, workers
frontend/          React SPA
alembic/           database migrations
tests/             pytest suite
scripts/           startup scripts for api/worker/beat
desktop/           desktop packaging companion
docs/PIPELINE.md   pipeline-level notes
docs/requests.md   endpoint request reference
```

### Helpful companion docs

- [`docs/PIPELINE.md`](./docs/PIPELINE.md)
- [`docs/requests.md`](./docs/requests.md)
- [`docs/README.md`](./docs/README.md)
- [`app/README.md`](./app/README.md)
- [`frontend/README.md`](./frontend/README.md)
- [`tests/README.md`](./tests/README.md)
- [`desktop/README.md`](./desktop/README.md)

---

## Final note

This repo is deliberately built to feel like **a serious internal analytics product**, not just a demo.

If you like projects that combine:

- real product thinking,
- non-trivial data/LLM pipelines,
- backend + frontend + infra,
- and a strong GitHub presentation,

**YouTubeIntel** was made for that exact intersection.

---

## License

See [`LICENSE`](./LICENSE).