---
title: YouTubeIntel
license: mit
tags:
  - youtube
  - analytics
  - nlp
  - llm
  - fastapi
  - react
---

YouTubeIntel

From raw YouTube comments to themes, audience positions, editorial briefings, and moderation-ready creator signals

Python 3.11+ · FastAPI · React 18 · PostgreSQL · Redis · OpenAI


YouTubeIntel is a portfolio-grade, production-style analytics platform that treats YouTube comments as operational data, not just sentiment fluff.

It combines a full-stack product surface with two substantial pipelines:

  • a Topic Intelligence Pipeline for clustering, labeling, briefing, and audience positions;
  • an Appeal Analytics Pipeline for criticism, questions, appeals, toxicity, and moderation support.

🚀 Quick Start · 🏗 Architecture · 🔬 Topic Pipeline · 🎯 Appeal Pipeline · 📑 API

🚀 Desktop Version Available

👉 huggingface.co/herman3996/YouTubeIntelDesktop

If this project makes you think “this should have more stars”, you’re probably right ⭐


The pitch

Most YouTube analytics tools stop at vanity metrics.

They tell you:

  • views,
  • engagement,
  • rough sentiment,
  • maybe some keyword clouds.

They usually do not tell you:

  • what the audience is actually arguing about;
  • which positions exist inside each topic;
  • what comments are actionable for the next episode;
  • what criticism is genuinely constructive;
  • which toxic messages should be reviewed or escalated.

YouTubeIntel is built to solve exactly that.

🔬 Semantic topic intelligence

Reveal real audience themes, cluster structure, representative quotes, and position-level disagreement.

🎯 Creator-facing signal extraction

Separate criticism, questions, appeals, and toxicity instead of mixing everything into one noisy bucket.

🛑 Moderation-aware workflow

Support toxic review queues, target detection, and manual/automatic moderation actions.

📦 Exportable decision artifacts

Persist reports and analytics blocks in PostgreSQL and export Markdown/HTML reporting artifacts.


Why this repo is worth starring

🧠 Real pipeline depth

Not a toy CRUD app. This repo includes clustering, embeddings, LLM classification, refinement passes, moderation logic, diagnostics, and background execution.

🖥 Full product surface

Backend, frontend, API docs, Docker deployment, Celery workers, runtime settings, operator dashboard, and a desktop packaging companion.

🧰 Portfolio + practical utility

Designed to look strong on GitHub and to behave like a credible internal analytics product.


✨ What the product does today

Core capabilities

  • Topic Intelligence Pipeline for semantic clustering, topic labeling, audience-position extraction, and daily briefing generation
  • Appeal Analytics Pipeline for author-directed criticism, questions, appeals, and toxicity routing
  • Hybrid moderation with rule-based filtering, optional LLM borderline moderation, toxic target detection, and review queues
  • Operator UI for runs, reports, runtime settings, budget visibility, and appeal review
  • Asynchronous processing via Celery + Redis
  • Docker-first environment with PostgreSQL, Redis, backend, worker, and beat
  • Desktop companion in desktop/ for local packaged delivery

Operator-facing routes

  • /ui — dashboard shell
  • /ui/videos — recent videos + status monitor
  • /ui/budget — budget + runtime settings
  • /ui/reports/:videoId — full topic report detail
  • /ui/appeal/:videoId — appeal analytics + toxic moderation workflow
  • /docs — Swagger UI

πŸ— Architecture

```
YouTube Data API -> Comment fetch -> Preprocess + moderation
                                  -> Topic Intelligence Pipeline -> Markdown/HTML reports
                                  -> Topic Intelligence Pipeline -> PostgreSQL
                                  -> Appeal Analytics Pipeline   -> PostgreSQL

React + Vite SPA -> FastAPI -> PostgreSQL
Celery worker -> FastAPI
Celery beat -> Celery worker
Redis -> Celery worker
Redis -> Celery beat
```

Stack overview

| Layer | Technology |
| --- | --- |
| API layer | FastAPI |
| Persistence | SQLAlchemy + PostgreSQL |
| Background execution | Celery + Redis |
| ML / NLP | SentenceTransformers, HDBSCAN, scikit-learn |
| LLM layer | OpenAI-compatible chat + embeddings |
| Desktop packaging | companion in desktop/ |
| Frontend | React 18, TypeScript, Vite |

🔬 Topic Intelligence Pipeline

Entry point: app/services/pipeline/runner.py → DailyRunService

This is the “what is the audience discussing, how are they split, and what should the creator do next?” pipeline.

```
1) Context
 -> 2) Comments fetch
 -> 3) Preprocess + moderation
 -> 4) Persist comments
 -> 5) Embeddings
 -> 6) Clustering
 -> 7) Episode match (compatibility stage, skipped)
 -> 8) Labeling + audience positions
 -> 9) Persist clusters
 -> 10) Briefing build
 -> 11) Report export
```

What happens in each stage

| # | Stage | What it does |
| --- | --- | --- |
| 1 | Context | Loads prior report context for continuity |
| 2 | Comments fetch | Pulls comments via the YouTube Data API |
| 3 | Preprocess + moderation | Filters low-signal items, normalizes text, applies rule-based moderation, optionally runs borderline LLM moderation |
| 4 | Persist comments | Stores processed comments in PostgreSQL |
| 5 | Embeddings | Builds vectors with local embeddings or OpenAI |
| 6 | Clustering | Groups comments into themes with fallback logic for edge cases |
| 7 | Episode match | Preserved as a compatibility stage and explicitly skipped in the active runtime |
| 8 | Labeling + audience positions | Produces titles, summaries, and intra-topic positions |
| 9 | Persist clusters | Saves clusters and membership mappings |
| 10 | Briefing build | Generates executive summary, actions, risks, and topic-level findings |
| 11 | Report export | Writes Markdown/HTML reports and stores structured JSON |
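The staged flow above can be sketched as a small sequential executor that threads a shared context through each stage and skips compatibility stages. This is an illustrative sketch, not the actual DailyRunService code; the stage functions and context shape are hypothetical:

```python
# Illustrative staged runner: each stage takes and returns a shared context
# dict; stages listed in `skipped` are recorded but never executed.
def run_stages(ctx, stages, skipped=frozenset({"episode_match"})):
    for name, stage_fn in stages:
        if name in skipped:
            ctx.setdefault("skipped", []).append(name)  # compatibility stage
            continue
        ctx = stage_fn(ctx)
    return ctx

# Two toy stages standing in for the real fetch/cluster logic.
def fetch_comments(ctx):
    ctx["comments"] = ["great video!", "what mic do you use?"]
    return ctx

def cluster(ctx):
    ctx["clusters"] = {0: ctx["comments"]}  # trivial single-cluster fallback
    return ctx

result = run_stages({}, [
    ("comments_fetch", fetch_comments),
    ("episode_match", lambda ctx: ctx),  # preserved but skipped
    ("clustering", cluster),
])
```

The real pipeline persists intermediate results and handles failures per stage; the point here is only the ordered, skippable stage structure.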

Main outputs

  • top themes by discussion weight
  • representative quotes and question comments
  • audience positions inside each theme
  • editorial briefing for the next content cycle, including actions, misunderstandings, audience requests, risks, and trend deltas
  • moderation and degradation diagnostics

🎯 Appeal Analytics Pipeline

Entry point: app/services/appeal_analytics/runner.py → AppealAnalyticsService

This is the “what is being said to the creator?” pipeline.

```
Load/fetch video comments
 -> Unified LLM classification
 -> Question refiner
 -> Political criticism filter
 -> Toxic target classification
 -> Confidence + target routing:
      - Auto-ban block
      - Manual review block
      - Ignore third-party insults
 -> Persist appeal blocks
```

Persisted blocks

| Block | Meaning |
| --- | --- |
| constructive_question | creator-directed questions |
| constructive_criticism | criticism backed by an actual political argument |
| author_appeal | direct requests / appeals to the creator |
| toxic_auto_banned | toxic comments that passed final verification and were auto-moderated |
| toxic_manual_review | toxic comments queued for admin review |

`skip` is still used internally as a classification outcome, but it is not persisted as a block.
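The persistence step can be illustrated as grouping classified comments by block label and dropping skipped items. The block names come from the table above; the grouping helper itself is a hypothetical sketch, not the project's persistence code:

```python
from collections import defaultdict

# Block labels persisted by the pipeline; "skip" is intentionally absent.
PERSISTED_BLOCKS = {
    "constructive_question",
    "constructive_criticism",
    "author_appeal",
    "toxic_auto_banned",
    "toxic_manual_review",
}

def group_into_blocks(classified):
    """Group (comment, label) pairs into persisted blocks, dropping 'skip'."""
    blocks = defaultdict(list)
    for comment, label in classified:
        if label in PERSISTED_BLOCKS:
            blocks[label].append(comment)
    return dict(blocks)

blocks = group_into_blocks([
    ("Why did you cut the interview?", "constructive_question"),
    ("lol", "skip"),
    ("Please invite her back", "author_appeal"),
])
```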

Pipeline behaviors that matter

  • question candidates get a second-pass refiner;
  • criticism with question signal can be promoted into the question block;
  • low-value attack_ragebait / meme_one_liner question candidates are demoted out of constructive_question;
  • toxic comments are classified by target (author, guest, content, undefined, third_party);
  • routing splits comments into auto-ban, manual review, or ignore;
  • comments with toxicity confidence >= 0.80 enter the auto-ban path, but a final strict verification pass can still downgrade them into manual review;
  • auto-banned authors can be unbanned from the UI if the operator spots a false positive;
  • per-video guest names can improve targeting accuracy.
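The confidence and target routing described above condenses into one decision function. This is a simplified sketch of the documented behavior (the final strict verification pass that can downgrade auto-ban candidates is left out, and the function name is hypothetical):

```python
AUTO_BAN_THRESHOLD = 0.80  # documented default

def route_toxic(target: str, confidence: float) -> str:
    """Route a toxicity-flagged comment by target and classifier confidence.

    Third-party insults are ignored; high-confidence comments become
    auto-ban candidates (still subject to final verification in the real
    pipeline); everything else goes to the manual review queue.
    """
    if target == "third_party":
        return "ignore"
    if confidence >= AUTO_BAN_THRESHOLD:
        return "auto_ban_candidate"
    return "manual_review"
```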

📑 API Reference

Interactive docs live at /docs when the backend is running.

Core runs and reports

| Method | Endpoint | Purpose |
| --- | --- | --- |
| GET | /health | health and OpenAI endpoint metadata |
| POST | /run/latest | run Topic Intelligence for the latest playlist video |
| POST | /run/video | run Topic Intelligence for a specific video URL |
| GET | /videos | recent videos |
| GET | /videos/statuses | progress/status dashboard payload |
| GET | /videos/{video_id} | single-video metadata |
| GET | /reports/latest | latest report |
| GET | /reports/{video_id} | latest report for one video |
| GET | /reports/{video_id}/detail | enriched report with comments and positions |

Appeal analytics and moderation

| Method | Endpoint | Purpose |
| --- | --- | --- |
| POST | /appeal/run | run Appeal Analytics |
| GET | /appeal/{video_id} | latest appeal analytics result |
| GET | /appeal/{video_id}/author/{author_name} | all comments by one author |
| GET | /appeal/{video_id}/toxic-review | manual toxic-review queue |
| POST | /appeal/ban-user | manual ban action |
| POST | /appeal/unban-user | restore a previously banned commenter |

Runtime and settings

| Method | Endpoint | Purpose |
| --- | --- | --- |
| GET | /settings/video-guests/{video_id} | load guest names |
| PUT | /settings/video-guests/{video_id} | update guest names |
| GET | /budget | OpenAI usage snapshot |
| GET | /settings/runtime | current mutable runtime settings |
| PUT | /settings/runtime | update mutable runtime settings |
| GET | /app/setup/status | desktop-only first-run setup status |
| POST | /app/setup | desktop-only save desktop bootstrap secrets |
| PUT | /app/setup | desktop-only rotate desktop bootstrap secrets / OAuth values |

For request examples, see docs/requests.md.
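As a quick illustration, the endpoints above can be driven with a tiny stdlib-only JSON helper. The request body field names in the commented example are assumptions; the authoritative request shapes live in docs/requests.md:

```python
import json
from urllib import request

BASE = "http://localhost:8000"  # default local backend address

def api(method: str, path: str, payload=None):
    """Minimal JSON client for the YouTubeIntel API (illustrative sketch)."""
    data = json.dumps(payload).encode() if payload is not None else None
    req = request.Request(
        BASE + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example flow against a running backend (body fields are assumptions):
# api("POST", "/run/video", {"url": "https://youtu.be/..."})
# api("GET", "/reports/latest")
```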


🚀 Quick Start

Docker Compose

```bash
cp .env-docker.example .env-docker
# Same template as `.env.example`, but DATABASE_URL / Celery URLs use Compose hostnames (`db`, `redis`).
# Fill in YOUTUBE_API_KEY, YOUTUBE_PLAYLIST_ID, OPENAI_API_KEY (and OAuth if you use moderation actions).

docker compose up --build
```

Services started:

  • PostgreSQL
  • Redis
  • FastAPI backend
  • Celery worker
  • Celery beat

App URLs:

  • UI: http://localhost:8000/ui
  • Swagger: http://localhost:8000/docs
  • Health: http://localhost:8000/health

Local development

```bash
cp .env.example .env
python -m venv .venv
source .venv/bin/activate        # Linux/macOS
# .\.venv\Scripts\Activate.ps1   # Windows PowerShell

pip install -U pip
pip install -r requirements-dev.txt

docker compose up -d db redis
alembic upgrade head

uvicorn app.main:app --reload
```

Frontend in a separate terminal:

```bash
npm --prefix frontend ci
npm --prefix frontend run dev
```

Workers in separate terminals:

```bash
celery -A app.workers.celery_app:celery_app worker --loglevel=INFO
celery -A app.workers.celery_app:celery_app beat --loglevel=INFO
```

βš™οΈ Configuration

Configuration comes from environment variables (seeded from .env.example or .env-docker.example) plus mutable runtime settings exposed through /settings/runtime and the /ui/budget page.

Most important variables

| Variable | Why it matters |
| --- | --- |
| YOUTUBE_API_KEY | YouTube Data API access |
| YOUTUBE_PLAYLIST_ID | latest-video runs |
| OPENAI_API_KEY | classification, labeling, moderation |
| OPENAI_MAX_USD_PER_RUN / OPENAI_HARD_BUDGET_ENFORCED | optional per-run spend guardrails |
| YOUTUBE_OAUTH_CLIENT_ID | optional YouTube moderation / restore actions |
| YOUTUBE_OAUTH_CLIENT_SECRET | optional YouTube moderation / restore actions |
| YOUTUBE_OAUTH_REFRESH_TOKEN | optional YouTube moderation / restore actions |
| DATABASE_URL | PostgreSQL persistence |
| CELERY_BROKER_URL | Redis broker |
| CELERY_RESULT_BACKEND | Redis result storage |
| EMBEDDING_MODE | local or openai |
| LOCAL_EMBEDDING_MODEL | recommended local topic-clustering model |
| AUTO_BAN_THRESHOLD | appeal/toxic: auto-hide when UI confidence ≥ this (default 0.80) |
| TOXIC_AUTOBAN_PRECISION_REVIEW_THRESHOLD | should match AUTO_BAN_THRESHOLD so the first-pass score is trusted; the stricter second LLM pass runs only when the score is below this |
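A minimal sketch of how a few of these variables might be read at startup. The loader function and the defaults for EMBEDDING_MODE and the URLs are assumptions; only the 0.80 default for AUTO_BAN_THRESHOLD is documented above:

```python
import os

def load_runtime_settings(env=None):
    """Read key runtime knobs from the environment (illustrative only)."""
    env = os.environ if env is None else env
    return {
        "embedding_mode": env.get("EMBEDDING_MODE", "local"),       # assumed default
        "auto_ban_threshold": float(env.get("AUTO_BAN_THRESHOLD", "0.80")),
        "database_url": env.get("DATABASE_URL", ""),
        "celery_broker_url": env.get("CELERY_BROKER_URL", ""),
    }

settings = load_runtime_settings({"EMBEDDING_MODE": "openai"})
```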

Runtime notes

  • episode_match / transcription fields still exist for compatibility, but that stage is skipped in the active runtime;
  • budget usage is tracked and visible via the UI/API;
  • guest-name configuration improves appeal/toxic targeting quality;
  • for historical A/B testing of local embedding models, run `PYTHONPATH=. python scripts/benchmark_topic_models.py`.

🛠 Quality Gates

Recommended checks:

```bash
ruff check .
black --check .
pytest -q
( cd desktop && pytest -q )
npm --prefix frontend run build
```

CI in .github/workflows/ci.yml covers:

  • Python lint / formatting (ruff, black) on the full tree (including desktop/)
  • root pytest and desktop/ pytest
  • frontend production build

📂 Project Structure

```
app/                 FastAPI app, schemas, services, workers
frontend/            React SPA
alembic/             database migrations
tests/               pytest suite
scripts/             startup scripts for api/worker/beat
desktop/             desktop packaging companion
docs/PIPELINE.md     pipeline-level notes
docs/requests.md     endpoint request reference
```

Helpful companion docs

  • docs/PIPELINE.md — pipeline-level notes
  • docs/requests.md — endpoint request reference


Final note

This repo is deliberately built to feel like a serious internal analytics product, not just a demo.

If you like projects that combine:

  • real product thinking,
  • non-trivial data/LLM pipelines,
  • backend + frontend + infra,
  • and a strong GitHub presentation,

YouTubeIntel was made for that exact intersection.


License

See LICENSE.