vaibhav12332112312 committed
Commit 225cdfe · 1 Parent(s): 360c721
blog/blog.md DELETED
@@ -1,211 +0,0 @@
# Viraltest: We Taught an LLM to Run an Instagram Account for 30 Days — and It Started Getting Smart

> **Theme #3.1 — Professional Tasks (World Modeling)**
> An OpenEnv environment where an LLM doesn't *play* Instagram, it *runs* one. No reset button on bad days. No leaked rules. Just a sparse observation, eight discoverable tools, and a 30-day calendar quietly judging every choice.

---

## TL;DR

Most LLM benchmarks are one-shot trivia. Viraltest is different: **a 30-day, partially observable, research-calibrated simulation of an Instagram creator's life**, dropped into [OpenEnv](https://github.com/meta-pytorch/OpenEnv). Every constant — when audiences are awake, how reels decay, when sleep loss starts hurting decisions, what "burnout" actually looks like — comes from a peer-reviewed paper or a 1M+ post industry study. We trained Qwen2.5-3B with **two-phase reward-weighted LoRA** (first learn *when* to post, then learn *what* to post). The reward curve climbs. The agent stops spamming text posts at 3 AM. It starts asking the right questions on day 1.

This blog is the story of why, and how.

---

## 1. The Problem: LLMs Can Write a Caption, but Can They Run a Brand?

Ask any LLM to write you "an Instagram caption about morning coffee" — flawless. Ask it to run a creator account for a month, where:

- you have a finite energy budget,
- audiences sleep at night and skip work-hour reels,
- the algorithm punishes you for going dark for 3 days,
- spamming comments gets you shadowbanned,
- collabs only help if your audiences barely overlap,
- and burnout is a slow, accumulating thing — not a flag,

…and the model collapses. It posts ten reels on a Tuesday morning. It uses the same three hashtags forever. It schedules a story at 4 AM. It tries to "engage" by liking 80 posts. None of these are *wrong* tokens — they're wrong *strategies*.

That's the capability gap we wanted to test:

> **Can an LLM build and maintain an internal world model — across 30 long-horizon steps — when nobody hands it the rules?**

The creator economy is the perfect testbed. It's a $250B market with 67M creators ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)), 73% of whom report burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)). The tradeoffs are real, the data is public, and — crucially — the domain is wildly underexplored in RL/LLM training. Most environments stop at chess, gridworlds, and toy text games. We wanted something a researcher could actually publish a paper on.

## 2. Meet the Environment

Every step is **one day**. Episodes run **30 days**. Each day the agent gets a deliberately *sparse* observation:

```python
observation = ViraltestObservation(
    creator_energy=0.78,
    followers=10_420,
    reward=0.31,
    engagement_rate=0.041,
    notes="Day 1: I have no idea what people like.",
    # ...and barely anything else, until you ask.
)
```

To learn the world, it must call tools — and it has to discover that they exist.

| Tool | Cost | What it reveals |
|---|---|---|
| `query_trends` | 1 | Trending topics + tags for a niche |
| `query_competitor` | 2 | What 7 archetypal creators are doing |
| `query_audience` | 2 | Segment affinities + active hours |
| `query_tag_history` | 1 | Your own past performance per tag |
| `predict_engagement` | 3 | Counterfactual: "what if I posted this?" |
| `draft_review` | 3 | Strengths/weaknesses of a plan |
| `query_creator_pool` | 1 | Available collab partners + overlap |
| `propose_collab` | 5 | Co-author with another creator |

The agent's **first move on day 1** has to be `GET /tools`. There's no list in the prompt. World modeling, by construction.
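Because tools have different costs and the deck's per-episode API budget is 100, an agent has to budget its curiosity. Here is a minimal client-side sketch of that bookkeeping; the cost table mirrors the one above, and `ToolBudget` is a hypothetical helper for illustration, not part of the env API.

```python
# Hypothetical client-side helper: track tool spend against the
# per-episode API budget (costs mirror the tool table above).
TOOL_COSTS = {
    "query_trends": 1, "query_competitor": 2, "query_audience": 2,
    "query_tag_history": 1, "predict_engagement": 3, "draft_review": 3,
    "query_creator_pool": 1, "propose_collab": 5,
}

class ToolBudget:
    def __init__(self, budget: int = 100):
        self.remaining = budget

    def can_afford(self, tool: str) -> bool:
        return TOOL_COSTS[tool] <= self.remaining

    def charge(self, tool: str) -> int:
        """Deduct a tool's cost; returns the remaining budget."""
        if not self.can_afford(tool):
            raise RuntimeError(f"budget exhausted: cannot call {tool}")
        self.remaining -= TOOL_COSTS[tool]
        return self.remaining
```

With only 100 units a month, five `propose_collab` calls already eat a quarter of the budget — the prioritization pressure is the point.
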

### The Reward, Decomposed Like Instagram Actually Ranks Posts

Instagram's head Adam Mosseri publicly confirmed the top ranking signals in January 2025. We don't reward "engagement" as one number — we decompose it:

```python
reward = 0.40 * watch_time
       + 0.30 * sends_per_reach
       + 0.20 * saves
       + 0.10 * likes_per_reach
       - fatigue_penalty
       - sleep_penalty
       - shadowban_penalty
       + collab_uplift
```

Each format has a natural strength. Reels are watch-time machines. Stories drive sends. Carousels get saved. Text posts get liked. The agent has to learn this — we don't tell it.
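As a runnable sketch, the decomposition is just a weighted sum minus penalties. The weights match the formula above; the assumption that engagement inputs are normalized to [0, 1] is ours, for illustration.

```python
def daily_reward(watch_time, sends_per_reach, saves, likes_per_reach,
                 fatigue_penalty=0.0, sleep_penalty=0.0,
                 shadowban_penalty=0.0, collab_uplift=0.0):
    """Mosseri-aligned engagement signals, weighted as in the formula above.

    Engagement inputs are assumed normalized to [0, 1] for this sketch.
    """
    base = (0.40 * watch_time
            + 0.30 * sends_per_reach
            + 0.20 * saves
            + 0.10 * likes_per_reach)
    return base - fatigue_penalty - sleep_penalty - shadowban_penalty + collab_uplift
```

A perfect day on every signal scores 1.0 before penalties, which is why a shadowban or sleep debt can wipe out an otherwise great post.
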

## 3. The Best Part: Every Number Comes From a Paper

This is where Viraltest stops being a hackathon toy and starts looking like research infrastructure. Here's how the literature shaped the simulation:

| Mechanic | What it does | Source |
|---|---|---|
| **Hour heatmap (7×24)** | When you post matters — Wed 12pm slaps, Sat 4 AM doesn't | [Buffer 9.6M posts](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/) |
| **Sleep model** | Quality decays linearly past 16h awake, floor at 30% | [Van Dongen et al. 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469) — the canonical sleep deprivation RCT |
| **Fatigue tiers** | 2 posts/day = 1.0×, 5+ collapse to 0.25× | [Buffer 2.1M posts × 102K accounts](https://buffer.com/resources/how-often-to-post-on-instagram/) |
| **Tiered diminishing returns (no hard caps)** | Marginal-cost over binary thresholds | [Cen et al. 2024, arXiv:2410.13108](https://arxiv.org/abs/2410.13108) — disengagement-aware policies |
| **Format reach multipliers** | Reels reach 2.25× static images | [Socialinsider 31M post study](https://www.socialinsider.io/blog/instagram-content-research) |
| **Niche × niche engagement curves** | Tech 0.33%, Higher Ed 2.10%, etc. | [Rival IQ 1.9M posts × 2,100 brands](https://www.rivaliq.com/blog/social-media-industry-benchmark-report/) |
| **Collab math** | Same niche + low overlap = HIGH; diff niche capped below | [Later 2023](https://later.com/blog/instagram-collab-posts) + [HypeAuditor 2024](https://hypeauditor.com/blog/influencer-collaboration) |
| **Burnout accumulator** | Stress → exhaustion → reduced perf | [Cao et al. 2024, *Educ Inf Technol*](https://doi.org/10.1007/s10639-023-12213-6) + [Wen et al. 2026, *Sci Rep*](https://www.nature.com/articles/s41598-026-42958-2) |
| **Reward decomposition (4 signals)** | Watch + sends + saves + likes, weighted | Mosseri Jan-2025 (Tier 3 official) |

We even maintain a **rejection list** — 13 SEO/affiliate blogs we *refused* to cite because they don't disclose methodology. The full bibliography (with DOIs, PMIDs, sample sizes) lives in [`RESEARCH.md`](../RESEARCH.md). Any reviewer can audit any number in this environment in under five minutes.
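To make one of these mechanics concrete, the Van Dongen-calibrated sleep model is piecewise linear: full quality up to 16 hours awake, then a linear decay floored at 30%. The 16h threshold and 0.30 floor come from the table above; the decay slope used here is an illustrative default, not the env's actual constant.

```python
def sleep_quality(hours_awake: float, decay_per_hour: float = 0.10) -> float:
    """Piecewise-linear sleep model (Van Dongen-style calibration).

    Full quality up to 16h awake, then linear decay, floored at 0.30.
    `decay_per_hour` is an illustrative default, not the env's value.
    """
    if hours_awake <= 16:
        return 1.0
    return max(0.30, 1.0 - decay_per_hour * (hours_awake - 16))
```

The floor matters for training: an all-nighter hurts, but it never zeroes the agent out, so the gradient toward "sleep more" stays smooth rather than cliff-shaped.
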

## 4. Two-Phase Training: The "Sweet Spot" Has Two Dimensions

Here's the design idea we're proudest of. Real creator success isn't one skill — it's at least two:

1. **WHEN to post** (timing, frequency, cadence — heatmap-driven)
2. **WHAT to post** (format mix, intent variety, tag discovery — content-driven)

A single reward signal makes the LLM split the difference and master neither. So we **split training into phases**, each with its own reward shaping:

| Phase | Reward focus | What the agent learns |
|---|---|---|
| **Phase 1 — Timing** | Heatmap multiplier, fatigue penalty, sleep model | Stop posting at 4 AM. Don't drop 6 reels on Monday. Sleep matters. |
| **Phase 2 — Content** | Format diversity, intent matching, tag discovery | Mix reels + carousels. Match `intent` to format. Explore tags before exploiting. |

Phase 1's LoRA adapter persists into Phase 2 — so timing competence isn't *forgotten*, it's *built on*. This is closer to how a human creator levels up: first you stop sabotaging yourself, then you get clever.

And the architecture is **extensible**. Want to train a "collab specialist"? Add a `collab` reward mode. Want to study "burnout-aware posting"? Add a `wellness` mode. Want to teach the agent to optimize for **a specific environment variable** — say, posts-per-day, or audience segment retention, or shadowban risk? Plug a new reward mode into `env.reset(reward_mode="...")` and a new system prompt into the phase config. The training loop doesn't care.

```python
PHASES = [
    {"name": "phase1_timing", "reward_mode": "timing", "system": SYSTEM_PROMPT_TIMING},
    {"name": "phase2_content", "reward_mode": "content", "system": SYSTEM_PROMPT_CONTENT},
    # add your own phase here ↓
    # {"name": "phase3_collab", "reward_mode": "collab", "system": SYSTEM_PROMPT_COLLAB},
]
```

This is the kind of design that researchers can fork. It's basically a curriculum-learning template for any multi-objective creator problem.
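Conceptually, the curriculum loop is just an iteration over that phase list, resetting the env with each phase's reward mode while the same adapter keeps training. The sketch below strips it to the bone: `StubEnv` is a hypothetical stand-in for the real client, system prompts are omitted, and the LoRA update is elided to a comment.

```python
# Simplified phase list (system prompts omitted for brevity).
PHASES = [
    {"name": "phase1_timing", "reward_mode": "timing"},
    {"name": "phase2_content", "reward_mode": "content"},
]

class StubEnv:
    """Hypothetical stand-in for the Viraltest client."""
    def __init__(self):
        self.modes_seen = []

    def reset(self, reward_mode: str):
        self.modes_seen.append(reward_mode)
        return {"day": 1}  # sparse day-1 observation

def run_curriculum(env, phases):
    # The adapter trained in each phase is carried into the next one.
    for phase in phases:
        obs = env.reset(reward_mode=phase["reward_mode"])
        # ... roll out episodes from `obs`, score them, take a LoRA update ...
    return env.modes_seen
```

Adding a third phase is a one-line change to the list, which is exactly what makes the ablation surface clean.
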

## 5. Did It Actually Learn? (The Bit That Counts for 20%)

Yes. Here are the real numbers from `run-output/plots/training_summary.json` — Qwen2.5-3B-Instruct, LoRA SFT, 2 rounds × 6 episodes:

**Reward climbs round-over-round:**

| Round | avg episode reward | max episode reward | avg grader | max grader | train loss |
|---|---|---|---|---|---|
| 1 | 3.904 | 4.514 | 0.620 | 0.827 | 2.672 |
| 2 | **4.215** | **4.658** | **0.732** | **0.870** | **2.593** |

That's **+8% mean reward**, **+18% mean grader score**, and **train loss dropping** — the model is genuinely learning weights, not just resampling prompts.
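The headline percentages come straight from the table; making the arithmetic auditable takes three lines:

```python
# Round-over-round gains, computed from the table above.
round1 = {"avg_reward": 3.904, "avg_grader": 0.620}
round2 = {"avg_reward": 4.215, "avg_grader": 0.732}

def pct_gain(before: float, after: float) -> float:
    return 100.0 * (after - before) / before

reward_gain = pct_gain(round1["avg_reward"], round2["avg_reward"])  # ~8%
grader_gain = pct_gain(round1["avg_grader"], round2["avg_grader"])  # ~18%
```
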

**Vs. baseline (the smart heuristic) on the held-out evaluation:**

| Task | Smart heuristic baseline | Trained agent (after) |
|---|---|---|
| `monthly_engage` | 0.7352 | **1.000** |
| `monthly_strategic` | 0.9043 | 0.842 |
| `monthly_competitive` | 0.9066 | **0.964** |

The trained agent **matches or beats** the rule-based heuristic on 2 of 3 tasks. The slight regression on `monthly_strategic` is honest: it's the most multi-objective of the three (tag discovery + energy management + consistency), and after only 2 rounds the LoRA hasn't fully traded off correctly. More rounds and a third "diversity" phase are the obvious next steps — and the architecture supports them without code changes.

**Plots:**
- `plots/reward_curve.png` — round-by-round reward
- `plots/before_after.png` — baseline vs trained
- `plots/training_trajectories.png` — per-task learning curves
- `plots/baseline_leaderboard.png` — 5 heuristic baselines we beat

## 6. Where We're Honest About Shortcomings

A research-quality environment has to admit what's mocked vs. real. Here's the unvarnished list:

| Concern | Status today | Why / Plan |
|---|---|---|
| **Negative comments / sentiment hits** | Not implemented — comments only ever *help* engagement right now | On real Instagram, posts can attract negativity; some go viral *for the wrong reasons*. Modeling this needs an LLM-based sentiment scorer in the env loop. **Future update:** add a `comment_sentiment` channel where mass negative comments suppress reach (mirrors Cen 2024's disengagement model). |
| **Followers always grow if you post** | Currently true | This is the biggest "video game" assumption. In reality, a tone-deaf post can lose followers. **Future update:** introduce `follower_loss_rate` driven by content-audience mismatch + sentiment. |
| **Abusive / unsafe content detection** | Not implemented | Detecting toxicity reliably needs an LLM-in-the-loop (à la Llama Guard). For the hackathon we kept the env deterministic and reproducible. **Future:** optional moderation hook that downgrades reach + adds a policy violation to `JudgeReport`. |
| **Sponsorship offers** | Mocked: deterministic schedule per archetype | Real sponsorships depend on niche, follower count, recency, and engagement quality. We have the building blocks — just not the marketplace yet. |
| **Collaborator follower counts** | Mocked from `audience_overlap_matrix.json` | Real follower numbers are noisy and platform-API-gated. The mock distribution matches Rival IQ's industry medians, so reasoning about collab uplift is still calibrated — just not personalized. |
| **Hour heatmap, fatigue tiers, sleep curve, niche multipliers, format reach** | **Real** — backed by the studies in §3 | These are the load-bearing numbers, and they're sourced. |

We list this openly because we want a researcher to read it and think *"these are tractable extensions, not foundational holes."* They are.

## 7. Why This Matters (and Who Should Care)

- **For RL/LLM researchers:** A reproducible, partially observable, long-horizon environment with a *believable* reward landscape — calibrated to public datasets. Multi-episode brand chains let you study **distribution shift** (`shift_label="baseline"` vs `"shifted"` in `reset()`). The headline `vs_baseline_pct`, `score_per_tool_call`, and `retention_under_shift` metrics are built into every final observation.
- **For curriculum-learning folks:** Two-phase training with reward-mode switching is a clean ablation surface. Add phases. Reorder them. See what catastrophically forgets.
- **For agent-eval people:** Every day emits a deterministic, explainable `JudgeReport(policy_compliance, sustainability_risk, strategic_quality, violations)`. Auditable rules cite their sources (Buffer 2.1M, Van Dongen, Cen 2024). It's basically a regulator built into the env.
- **For creators / agencies:** The `predict_engagement` tool is genuinely useful — it's a counterfactual sandbox for "what if I shifted my Monday reel to Wednesday afternoon?", calibrated to industry data.

> A reviewer should be able to read our README in 3–5 minutes and want to try the env. We've tried hard to earn that.

## 8. The Journey, In One Paragraph

We started with the same instinct everyone has — *"build a chess clone, but for tweets"* — and threw it out within a week. The interesting question wasn't "can the LLM win at engagement?" — it was *"can it learn the world from sparse signals?"* So we shrank the observation, exploded the tool catalog, and went paper-hunting. We rejected 13 SEO blogs that wouldn't show their math. We redid the heatmap when Sprout Social's 2B-engagement dataset disagreed with Buffer's 9.6M. We split training into two phases the moment we realized timing and content competence were genuinely different skills. We watched a 3B-parameter model go from posting carousels at 3 AM to politely asking `query_audience` for the segment's active hours. That moment — when the loss curve dropped and the agent stopped sabotaging itself — is why we built this.

## 9. Try It

- **HuggingFace Space:** [Viraltest live env](#) *(replace with your published Space URL)*
- **GitHub repo:** [`viraltest`](#)
- **Training notebook (Colab T4):** [`training/train_grpo.ipynb`](../training/train_grpo.ipynb)
- **Full bibliography:** [`RESEARCH.md`](../RESEARCH.md) — every constant traceable to a DOI / PMID / arXiv ID
- **Design notes:** [`DESIGN.md`](../DESIGN.md)
- **2-min video script:** [`blog/youtube_script.md`](youtube_script.md)
- **Pitch deck outline:** [`blog/slide_outline.md`](slide_outline.md)

Quick local spin-up:

```bash
git clone <repo-url> && cd viraltest
uv sync
uvicorn server.app:app --host 0.0.0.0 --port 8000
# in another terminal:
export HF_TOKEN=hf_... MODEL_NAME=Qwen/Qwen2.5-3B-Instruct
.venv/bin/python inference.py
```

If you fork it to add a sentiment channel, a sponsorship marketplace, or a third training phase — please tell us. That's exactly the point.

---

*Built for the OpenEnv Hackathon. Numbers are from real runs in `run-output/plots/training_summary.json`. Every claim about Instagram dynamics traces to a Tier 1–3 source in [`RESEARCH.md`](../RESEARCH.md). If you can't audit it, we didn't cite it.*
blog/slide_outline.md DELETED
@@ -1,58 +0,0 @@
# Viraltest v2 — Pitch Deck Outline (8 slides)

## Slide 1: Title
- **Viraltest v2: Teaching LLMs World Modeling Through Instagram Strategy**
- Theme #3.1 — Professional Tasks
- OpenEnv Hackathon India 2026
- Team: [your team name]

## Slide 2: The Problem
- $250B creator economy, 67M creators (Goldman Sachs 2025)
- 73% experience burnout; Instagram drives 88% of it (Awin 2024)
- Algorithm changes constantly — no one tells you the rules
- Existing tools show analytics but don't teach strategy
- **Gap:** No RL environment captures this tradeoff with realistic dynamics

## Slide 3: The World
- 30-day Instagram simulation (monthly cycle)
- Mosseri-aligned signals: watch_time, sends, saves, likes (official Jan 2025)
- Hour-by-hour heatmap (Buffer 9.6M + Sprout 2B)
- 7 competitor archetypes, 5 audience segments, ~120 tags
- Piecewise-linear sleep model (Van Dongen 2003, *Sleep*)
- Tiered audience fatigue (Buffer 2.1M)

## Slide 4: The Tools (Theme #3.1 Fit)
- Agent starts with a SPARSE observation (energy, followers, reward)
- 8 discoverable tools: query_trends, query_competitor, query_audience, query_tag_history, predict_engagement, draft_review, query_creator_pool, propose_collab
- API budget (100/episode) — can't query everything, must prioritize
- Notes field for hypothesis tracking across days
- Counterfactual coach: "here's what would have happened with optimal timing"

## Slide 5: Training Pipeline
- TRL GRPO on Qwen2.5-1.5B-Instruct (free Colab T4)
- Reward: per-step env reward + 2× terminal grader score
- 200 episodes, batch 4, 50 GRPO steps
- 3 tasks: monthly_engage → monthly_strategic → monthly_competitive
- Multi-episode chain: brand state persists across months

## Slide 6: Results
- [Embed reward_curve.png — ascending curve over training]
- [Embed before_after.png — smart baseline vs trained agent per task]
- Trained agent: uses tools on day 1, adapts strategy by day 5, manages energy throughout
- Score improvement on monthly_competitive: [X% → Y%]

## Slide 7: Sources & Verifiability
- 4-tier source quality bar (peer-reviewed → industry → official → survey)
- 7 Tier-1 papers, 9 Tier-2 studies, 1 Tier-3 official statement
- Every constant has a DOI/PMID/arXiv ID
- Tier-5 SEO blogs explicitly rejected (13 sites listed with rationale)
- Full bibliography: RESEARCH.md (~6 pages)
- **Any number in this presentation can be debated — we welcome it**

## Slide 8: Try It
- HF Space: [link]
- GitHub: [link]
- Training notebook: [Colab link]
- Blog: [HF post link]
- Video: [YouTube link]
- **Questions?**
blog/youtube_script.md DELETED
@@ -1,40 +0,0 @@
# Viraltest v2 — YouTube Script (<2 minutes)

## Storyboard

### Shot 1: Hook (0:00–0:10)
**Visual:** Split screen — left: scrolling Instagram feed, right: an LLM terminal making decisions
**Voiceover:** "What if an AI agent could learn to run your Instagram account — not from a prompt, but by discovering the rules of the world itself?"
**On-screen text:** "Viraltest v2 — World Modeling for Instagram"

### Shot 2: The Problem (0:10–0:25)
**Visual:** Stats flying in — "$250B creator economy" (Goldman Sachs 2025), "73% burnout" (Awin 2024), "67M creators"
**Voiceover:** "67 million creators compete for attention. 73% burn out. The algorithm changes constantly. No one tells you the rules."
**Citation badge:** Goldman Sachs 2025 · Awin 2024

### Shot 3: The Environment (0:25–0:50)
**Visual:** Animated diagram — agent receives sparse observation → calls tools → gets data → plans day
**Voiceover:** "We built a 30-day Instagram simulation. The agent sees almost nothing — just energy, followers, and last reward. To learn, it must use 8 discoverable tools: query trends, check competitors, test plans before committing."
**On-screen text:** "8 tools · 5 audience segments · 7 competitor archetypes · 30-day horizon"
**Citation badge:** Buffer 9.6M · Sprout Social 2B · Van Dongen 2003

### Shot 4: The Science (0:50–1:10)
**Visual:** Side-by-side comparison tables showing env constants vs. source data
**Voiceover:** "Every number comes from real research. Engagement rates from Socialinsider's 31-million-post study. Peak hours from Buffer's 9.6-million-post analysis. Sleep decay from a 2003 *Sleep* journal paper. Algorithm signals from Instagram's own head, Adam Mosseri."
**Citation badge:** Mosseri Jan-2025 · Socialinsider 2026 · PMID 12683469

### Shot 5: Training Results (1:10–1:30)
**Visual:** Reward curve plot (ascending), before/after bar chart
**Voiceover:** "We trained Qwen 2.5 1.5B using TRL GRPO. After 200 episodes, the agent learned to use tools strategically, post at peak hours, diversify content types, and manage energy — outperforming the baseline on all three tasks."
**On-screen text:** reward curve + score comparison

### Shot 6: Theme Fit + Close (1:30–1:50)
**Visual:** Theme #3.1 checklist being checked off — tool discovery, partial observability, persistent state, causal reasoning, multi-step workflow
**Voiceover:** "This is Theme 3.1: World Modeling. Real tool interaction. Persistent state across months. Causal reasoning through counterfactual feedback. Not a toy — a simulation grounded in science."
**On-screen text:** "All sources: RESEARCH.md · Code: github.com/... · Try it: HF Spaces"

---

**Total runtime:** ~1:50
**Music:** Upbeat lo-fi instrumental (no lyrics)
**Aspect ratio:** 16:9 landscape
training/hf_run_space_train_job.sh CHANGED
@@ -8,7 +8,7 @@
 set -euo pipefail
 
 IMAGE="${HF_JOB_IMAGE:-pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime}"
-FLAVOR="${HF_JOB_FLAVOR:-a10g-largex4}"
+FLAVOR="${HF_JOB_FLAVOR:-a100x4}"
 TIMEOUT="${HF_JOB_TIMEOUT:-8h}"
 SPACE_REPO="${HF_SPACE_REPO_ID:-vaibhavkhandare/train-bhai-train}"
 NB_EXEC_TIMEOUT="${NB_EXEC_TIMEOUT:-3600}"