---
title: YouTubeIntel
license: mit
tags:
  - youtube
  - analytics
  - nlp
  - llm
  - fastapi
  - react
---

YouTubeIntel

From raw YouTube comments to themes, audience positions, editorial briefings, and moderation-ready creator signals

Python 3.11+ · FastAPI · React 18 · PostgreSQL · Redis · OpenAI


YouTubeIntel is a portfolio-grade, production-style analytics platform that treats YouTube comments as operational data, not just sentiment fluff.

It combines a full-stack product surface with two substantial pipelines:

  • a Topic Intelligence Pipeline for clustering, labeling, briefing, and audience positions;
  • an Appeal Analytics Pipeline for criticism, questions, appeals, toxicity, and moderation support.

🚀 Quick Start · 🏗 Architecture · 🔬 Topic Pipeline · 🎯 Appeal Pipeline · 📑 API

🚀 Desktop Version Available

👉 huggingface.co/herman3996/YouTubeIntelDesktop

If this project makes you think “this should have more stars”, you’re probably right ⭐


The pitch

Most YouTube analytics tools stop at vanity metrics.

They tell you:

  • views,
  • engagement,
  • rough sentiment,
  • maybe some keyword clouds.

They usually do not tell you:

  • what the audience is actually arguing about;
  • which positions exist inside each topic;
  • what comments are actionable for the next episode;
  • what criticism is genuinely constructive;
  • which toxic messages should be reviewed or escalated.

YouTubeIntel is built to solve exactly that.

🔬 Semantic topic intelligence

Reveal real audience themes, cluster structure, representative quotes, and position-level disagreement.

🎯 Creator-facing signal extraction

Separate criticism, questions, appeals, and toxicity instead of mixing everything into one noisy bucket.

🛑 Moderation-aware workflow

Support toxic review queues, target detection, and manual/automatic moderation actions.

📦 Exportable decision artifacts

Persist reports and analytics blocks in PostgreSQL and export Markdown/HTML reporting artifacts.


Why this repo is worth starring

🧠 Real pipeline depth

Not a toy CRUD app. This repo includes clustering, embeddings, LLM classification, refinement passes, moderation logic, diagnostics, and background execution.

🖥 Full product surface

Backend, frontend, API docs, Docker deployment, Celery workers, runtime settings, operator dashboard, and a desktop packaging companion.

🧰 Portfolio + practical utility

Designed to look strong on GitHub and to behave like a credible internal analytics product.


✨ What the product does today

Core capabilities

  • Topic Intelligence Pipeline for semantic clustering, topic labeling, audience-position extraction, and daily briefing generation
  • Appeal Analytics Pipeline for author-directed criticism, questions, appeals, and toxicity routing
  • Hybrid moderation with rule-based filtering, optional LLM borderline moderation, toxic target detection, and review queues
  • Operator UI for runs, reports, runtime settings, budget visibility, and appeal review
  • Asynchronous processing via Celery + Redis
  • Docker-first environment with PostgreSQL, Redis, backend, worker, and beat
  • Desktop companion in desktop/ for local packaged delivery

Operator-facing routes

  • /ui — dashboard shell
  • /ui/videos — recent videos + status monitor
  • /ui/budget — budget + runtime settings
  • /ui/reports/:videoId — full topic report detail
  • /ui/appeal/:videoId — appeal analytics + toxic moderation workflow
  • /docs — Swagger UI

πŸ— Architecture

```
YouTube Data API -> Comment fetch -> Preprocess + moderation
                                  -> Topic Intelligence Pipeline -> Markdown/HTML reports
                                  -> Topic Intelligence Pipeline -> PostgreSQL
                                  -> Appeal Analytics Pipeline   -> PostgreSQL

React + Vite SPA -> FastAPI -> PostgreSQL
Celery worker -> FastAPI
Celery beat -> Celery worker
Redis -> Celery worker
Redis -> Celery beat
```

Stack overview

| Layer | Technology |
| --- | --- |
| API layer | FastAPI |
| Persistence | SQLAlchemy + PostgreSQL |
| Background execution | Celery + Redis |
| ML / NLP | SentenceTransformers, HDBSCAN, scikit-learn |
| LLM layer | OpenAI-compatible chat + embeddings |
| Desktop packaging | companion in desktop/ |
| Frontend | React 18, TypeScript, Vite |

🔬 Topic Intelligence Pipeline

Entry point: app/services/pipeline/runner.py → DailyRunService

This is the “what is the audience discussing, how are they split, and what should the creator do next?” pipeline.

```
1) Context
 -> 2) Comments fetch
 -> 3) Preprocess + moderation
 -> 4) Persist comments
 -> 5) Embeddings
 -> 6) Clustering
 -> 7) Episode match (compatibility stage, skipped)
 -> 8) Labeling + audience positions
 -> 9) Persist clusters
 -> 10) Briefing build
 -> 11) Report export
```

What happens in each stage

| # | Stage | What it does |
| --- | --- | --- |
| 1 | Context | Loads prior report context for continuity |
| 2 | Comments fetch | Pulls comments via the YouTube Data API |
| 3 | Preprocess + moderation | Filters low-signal items, normalizes text, applies rule-based moderation, optionally runs borderline LLM moderation |
| 4 | Persist comments | Stores processed comments in PostgreSQL |
| 5 | Embeddings | Builds vectors with local embeddings or OpenAI |
| 6 | Clustering | Groups comments into themes with fallback logic for edge cases |
| 7 | Episode match | Preserved as a compatibility stage and explicitly skipped in the active runtime |
| 8 | Labeling + audience positions | Produces titles, summaries, and intra-topic positions |
| 9 | Persist clusters | Saves clusters and membership mappings |
| 10 | Briefing build | Generates executive summary, actions, risks, and topic-level findings |
| 11 | Report export | Writes Markdown/HTML reports and stores structured JSON |
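The staged flow above can be sketched as a small sequential executor that threads a shared context through each stage and skips compatibility stages. This is an illustrative sketch, not the actual DailyRunService code; the stage functions and context shape are hypothetical:

```python
# Illustrative staged runner: each stage takes and returns a shared context
# dict; stages listed in `skipped` are recorded but never executed.
def run_stages(ctx, stages, skipped=frozenset({"episode_match"})):
    for name, stage_fn in stages:
        if name in skipped:
            ctx.setdefault("skipped", []).append(name)  # compatibility stage
            continue
        ctx = stage_fn(ctx)
    return ctx

# Two toy stages standing in for the real fetch/cluster logic.
def fetch_comments(ctx):
    ctx["comments"] = ["great video!", "what mic do you use?"]
    return ctx

def cluster(ctx):
    ctx["clusters"] = {0: ctx["comments"]}  # trivial single-cluster fallback
    return ctx

result = run_stages({}, [
    ("comments_fetch", fetch_comments),
    ("episode_match", lambda ctx: ctx),  # preserved but skipped
    ("clustering", cluster),
])
```

The real pipeline persists intermediate results and handles failures per stage; the point here is only the ordered, skippable stage structure.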

Main outputs

  • top themes by discussion weight
  • representative quotes and question comments
  • audience positions inside each theme
  • editorial briefing for the next content cycle, including actions, misunderstandings, audience requests, risks, and trend deltas
  • moderation and degradation diagnostics

🎯 Appeal Analytics Pipeline

Entry point: app/services/appeal_analytics/runner.py → AppealAnalyticsService

This is the “what is being said to the creator?” pipeline.

```
Load/fetch video comments
 -> Unified LLM classification
 -> Question refiner
 -> Political criticism filter
 -> Toxic target classification
 -> Confidence + target routing:
      - Auto-ban block
      - Manual review block
      - Ignore third-party insults
 -> Persist appeal blocks
```

Persisted blocks

| Block | Meaning |
| --- | --- |
| constructive_question | creator-directed questions |
| constructive_criticism | criticism backed by an actual political argument |
| author_appeal | direct requests / appeals to the creator |
| toxic_auto_banned | toxic comments that passed final verification and were auto-moderated |
| toxic_manual_review | toxic comments queued for admin review |

`skip` is still used internally as a classification outcome, but it is not persisted as a block.
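The persistence step can be illustrated as grouping classified comments by block label and dropping skipped items. The block names come from the table above; the grouping helper itself is a hypothetical sketch, not the project's persistence code:

```python
from collections import defaultdict

# Block labels persisted by the pipeline; "skip" is intentionally absent.
PERSISTED_BLOCKS = {
    "constructive_question",
    "constructive_criticism",
    "author_appeal",
    "toxic_auto_banned",
    "toxic_manual_review",
}

def group_into_blocks(classified):
    """Group (comment, label) pairs into persisted blocks, dropping 'skip'."""
    blocks = defaultdict(list)
    for comment, label in classified:
        if label in PERSISTED_BLOCKS:
            blocks[label].append(comment)
    return dict(blocks)

blocks = group_into_blocks([
    ("Why did you cut the interview?", "constructive_question"),
    ("lol", "skip"),
    ("Please invite her back", "author_appeal"),
])
```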

Pipeline behaviors that matter

  • question candidates get a second-pass refiner;
  • criticism with question signal can be promoted into the question block;
  • low-value attack_ragebait / meme_one_liner question candidates are demoted out of constructive_question;
  • toxic comments are classified by target (author, guest, content, undefined, third_party);
  • routing splits comments into auto-ban, manual review, or ignore;
  • comments with toxicity confidence >= 0.80 enter the auto-ban path, but a final strict verification pass can still downgrade them into manual review;
  • auto-banned authors can be unbanned from the UI if the operator spots a false positive;
  • per-video guest names can improve targeting accuracy.
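The confidence and target routing described above condenses into one decision function. This is a simplified sketch of the documented behavior (the final strict verification pass that can downgrade auto-ban candidates is left out, and the function name is hypothetical):

```python
AUTO_BAN_THRESHOLD = 0.80  # documented default

def route_toxic(target: str, confidence: float) -> str:
    """Route a toxicity-flagged comment by target and classifier confidence.

    Third-party insults are ignored; high-confidence comments become
    auto-ban candidates (still subject to final verification in the real
    pipeline); everything else goes to the manual review queue.
    """
    if target == "third_party":
        return "ignore"
    if confidence >= AUTO_BAN_THRESHOLD:
        return "auto_ban_candidate"
    return "manual_review"
```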

📑 API Reference

Interactive docs live at /docs when the backend is running.

Core runs and reports

| Method | Endpoint | Purpose |
| --- | --- | --- |
| GET | /health | health and OpenAI endpoint metadata |
| POST | /run/latest | run Topic Intelligence for the latest playlist video |
| POST | /run/video | run Topic Intelligence for a specific video URL |
| GET | /videos | recent videos |
| GET | /videos/statuses | progress/status dashboard payload |
| GET | /videos/{video_id} | single-video metadata |
| GET | /reports/latest | latest report |
| GET | /reports/{video_id} | latest report for one video |
| GET | /reports/{video_id}/detail | enriched report with comments and positions |

Appeal analytics and moderation

| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /appeal/run | run Appeal Analytics |
| GET | /appeal/{video_id} | latest appeal analytics result |
| GET | /appeal/{video_id}/author/{author_name} | all comments by one author |
| GET | /appeal/{video_id}/toxic-review | manual toxic-review queue |
| POST | /appeal/ban-user | manual ban action |
| POST | /appeal/unban-user | restore a previously banned commenter |

Runtime and settings

| Method | Endpoint | Purpose |
| --- | --- | --- |
| GET | /settings/video-guests/{video_id} | load guest names |
| PUT | /settings/video-guests/{video_id} | update guest names |
| GET | /budget | OpenAI usage snapshot |
| GET | /settings/runtime | current mutable runtime settings |
| PUT | /settings/runtime | update mutable runtime settings |
| GET | /app/setup/status | desktop-only first-run setup status |
| POST | /app/setup | desktop-only save desktop bootstrap secrets |
| PUT | /app/setup | desktop-only rotate desktop bootstrap secrets / OAuth values |

For request examples, see docs/requests.md.
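As a quick illustration, the endpoints above can be driven with a tiny stdlib-only JSON helper. The request body field names in the commented example are assumptions; the authoritative request shapes live in docs/requests.md:

```python
import json
from urllib import request

BASE = "http://localhost:8000"  # default local backend address

def api(method: str, path: str, payload=None):
    """Minimal JSON client for the YouTubeIntel API (illustrative sketch)."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = request.Request(
        BASE + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example flow against a running backend (body fields are assumptions):
# api("POST", "/run/video", {"url": "https://youtu.be/..."})
# api("GET", "/reports/latest")
```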


🚀 Quick Start

Docker Compose

```bash
cp .env-docker.example .env-docker
# Same template as `.env.example`, but DATABASE_URL / Celery URLs use Compose hostnames (`db`, `redis`).
# Fill in YOUTUBE_API_KEY, YOUTUBE_PLAYLIST_ID, OPENAI_API_KEY (and OAuth if you use moderation actions).

docker compose up --build
```

Services started:

  • PostgreSQL
  • Redis
  • FastAPI backend
  • Celery worker
  • Celery beat

App URLs:

  • UI: http://localhost:8000/ui
  • Swagger: http://localhost:8000/docs
  • Health: http://localhost:8000/health

Local development

```bash
cp .env.example .env
python -m venv .venv
source .venv/bin/activate        # Linux/macOS
# .\.venv\Scripts\Activate.ps1   # Windows PowerShell

pip install -U pip
pip install -r requirements-dev.txt

docker compose up -d db redis
alembic upgrade head

uvicorn app.main:app --reload
```

Frontend in a separate terminal:

```bash
npm --prefix frontend ci
npm --prefix frontend run dev
```

Workers in separate terminals:

```bash
celery -A app.workers.celery_app:celery_app worker --loglevel=INFO
celery -A app.workers.celery_app:celery_app beat --loglevel=INFO
```

βš™οΈ Configuration

Configuration comes from environment variables (seeded from .env.example or .env-docker.example) plus mutable runtime settings exposed through /settings/runtime and the /ui/budget page.

Most important variables

| Variable | Why it matters |
| --- | --- |
| YOUTUBE_API_KEY | YouTube Data API access |
| YOUTUBE_PLAYLIST_ID | latest-video runs |
| OPENAI_API_KEY | classification, labeling, moderation |
| OPENAI_MAX_USD_PER_RUN / OPENAI_HARD_BUDGET_ENFORCED | optional per-run spend guardrails |
| YOUTUBE_OAUTH_CLIENT_ID | optional YouTube moderation / restore actions |
| YOUTUBE_OAUTH_CLIENT_SECRET | optional YouTube moderation / restore actions |
| YOUTUBE_OAUTH_REFRESH_TOKEN | optional YouTube moderation / restore actions |
| DATABASE_URL | PostgreSQL persistence |
| CELERY_BROKER_URL | Redis broker |
| CELERY_RESULT_BACKEND | Redis result storage |
| EMBEDDING_MODE | local or openai |
| LOCAL_EMBEDDING_MODEL | recommended local topic-clustering model |
| AUTO_BAN_THRESHOLD | appeal/toxic: auto-hide when UI confidence ≥ this (default 0.80) |
| TOXIC_AUTOBAN_PRECISION_REVIEW_THRESHOLD | should match AUTO_BAN_THRESHOLD so the first-pass score is trusted; the stricter second LLM pass runs only when the score is below this |
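A minimal sketch of how a few of these variables might be read at startup. The loader function and the defaults for EMBEDDING_MODE and the URLs are assumptions; only the 0.80 default for AUTO_BAN_THRESHOLD is documented above:

```python
import os

def load_runtime_settings(env=None):
    """Read key runtime knobs from the environment (illustrative only)."""
    env = os.environ if env is None else env
    return {
        "embedding_mode": env.get("EMBEDDING_MODE", "local"),       # assumed default
        "auto_ban_threshold": float(env.get("AUTO_BAN_THRESHOLD", "0.80")),
        "database_url": env.get("DATABASE_URL", ""),
        "celery_broker_url": env.get("CELERY_BROKER_URL", ""),
    }

settings = load_runtime_settings({"EMBEDDING_MODE": "openai"})
```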

Runtime notes

  • episode_match / transcription fields still exist for compatibility, but that stage is skipped in the active runtime;
  • budget usage is tracked and visible via the UI/API;
  • guest-name configuration improves appeal/toxic targeting quality;
  • for historical A/B testing of local embedding models, run `PYTHONPATH=. python scripts/benchmark_topic_models.py`.

🛠 Quality Gates

Recommended checks:

```bash
ruff check .
black --check .
pytest -q
( cd desktop && pytest -q )
npm --prefix frontend run build
```

CI in .github/workflows/ci.yml covers:

  • Python lint / formatting (ruff, black) on the full tree (including desktop/)
  • root pytest and desktop/ pytest
  • frontend production build

📂 Project Structure

```
app/                 FastAPI app, schemas, services, workers
frontend/            React SPA
alembic/             database migrations
tests/               pytest suite
scripts/             startup scripts for api/worker/beat
desktop/             desktop packaging companion
docs/PIPELINE.md     pipeline-level notes
docs/requests.md     endpoint request reference
```

Helpful companion docs

  • docs/PIPELINE.md — pipeline-level notes
  • docs/requests.md — endpoint request reference


Final note

This repo is deliberately built to feel like a serious internal analytics product, not just a demo.

If you like projects that combine:

  • real product thinking,
  • non-trivial data/LLM pipelines,
  • backend + frontend + infra,
  • and a strong GitHub presentation,

YouTubeIntel was made for that exact intersection.


License

See LICENSE.