anuragredbus committed on
Commit 034a807 · 1 Parent(s): 7a5c462

Update hf_mini_blog.md

Files changed (1):
  1. blog/hf_mini_blog.md +84 -24

blog/hf_mini_blog.md CHANGED
@@ -1,39 +1,99 @@
- # Viraltest v2: Teaching LLMs to Be Instagram Strategists Through World Modeling

- **TL;DR:** We built an OpenEnv environment where an LLM agent manages an Instagram creator account for 30 simulated days. The agent receives sparse observations and must discover the world — trending topics, competitor behavior, audience segments, posting heatmaps — through a catalog of 8 tools. Every constant is calibrated against peer-reviewed research and large-N industry studies.

- ## The Problem

- The $250B creator economy (Goldman Sachs, 2025) has 67 million creators, but 73% experience burnout (Awin, 2024). The core tension: post enough to stay visible in the algorithm, but not so much that quality drops and audiences fatigue. No existing RL environment captures this tradeoff with realistic dynamics.

- ## The Environment

- **Viraltest v2** simulates a 30-day Instagram creator lifecycle grounded in 10+ verified data sources:

- - **Engagement signals** decomposed into watch_time, sends_per_reach, saves, and likes_per_reach, matching Adam Mosseri's Jan-2025 official ranking-signal confirmation
- - **Hour-by-hour heatmap** from Buffer's 9.6M-post study, cross-validated with Sprout Social's 2B-engagement analysis
- - **Sleep/cognitive model** based on Van Dongen et al. (2003, *Sleep*, PMID 12683469) — performance lapses grow linearly above 16 hours awake
- - **Tiered audience fatigue** from Buffer's 2.1M-post frequency study — not a cliff but a gradual decay
- - **7 competitor archetypes** with realistic posting cadences (3–5/week, not per-day)

- ## Theme #3.1: Why This Is World Modeling

- The agent starts each day with almost no information — just energy, followers, and last reward. To plan effectively, it must:

- 1. **Discover tools** (`GET /tools`) on day 1
- 2. **Query the world** — trending topics, competitor activity, audience preferences
- 3. **Form hypotheses** and persist them in a scratchpad (`notes` field)
- 4. **Test plans** via `predict_engagement` before committing
- 5. **Learn from counterfactual feedback** — the environment shadow-runs the optimal heatmap plan and shows the delta

- This isn't prompt engineering. The agent must build and maintain an internal world model across 30 steps.

- ## Training

- We trained Qwen2.5-1.5B-Instruct using TRL's GRPO trainer. Reward = per-step environment reward + terminal grader score. After 200 episodes, the trained agent outperforms the untrained baseline on all three tasks (monthly_engage, monthly_strategic, monthly_competitive).

- ## Every Number Is Verifiable

- We classify our sources into 4 tiers (peer-reviewed, industry, official, survey) and explicitly reject SEO/affiliate blogs. Full bibliography with DOIs, PMIDs, arXiv IDs, methodology extracts, and sample sizes lives in [RESEARCH.md](../RESEARCH.md).

- [Environment on HF Spaces](#) | [GitHub repo](#) | [Training notebook](#)
+ # We Trained an LLM to Survive Instagram

+ ### Why we built Creator Copilot, an OpenEnv environment where the agent learns by living a creator's life, not by reading about it.

+ ---

+ ## The scene we couldn't shake

+ A creator wakes up at 7:42 AM. Yesterday's reel did 12% of what last week's did. Nobody at the platform will tell her why. There is a heatmap somewhere, a ranking change last Tuesday, an audience segment that quietly shifted, a "trending" tag that peaked six hours ago. She doesn't have access to any of it. So she does the only thing she can do: she posts more. Eventually 73% of creators in her cohort report burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)).

+ The creator economy is a $250B industry running on guesswork ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)). 67 million people are running businesses inside a black box, against an algorithm that nobody outside Meta fully understands, while their own bodies push back at 16 hours of wakefulness ([Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469)).

+ That is a *real* world-model problem. And we couldn't find a single RL environment that took it seriously.

+ ## Creator Copilot in one sentence

+ **An OpenEnv environment where an LLM agent runs an Instagram creator account for 7 simulated days, gets almost nothing for free, and has to discover the rules of the world through a catalog of 8 tools and a notebook.**

+ It is the smallest version we could build of "operate a real account in a real economy."
+ ## The bet: discovery, not instruction

+ Most agent environments hand the model a verbose observation and ask it to pick from 4 actions. Creator Copilot does the opposite. The default observation is *deliberately sparse* — just `energy`, `followers`, `last reward`. Everything interesting (trending topics, competitor cadence, audience segments, hour-by-hour engagement, your own past tag performance) is hidden behind tools the agent has to *discover* by hitting `GET /tools`.

+ This is the move that makes Creator Copilot a world-modeling environment rather than a recommendation problem:

+ - The agent has to **plan inquiry**: queries are the only way to reduce uncertainty, so it has to choose which questions are worth asking.
+ - The agent has to **carry beliefs forward**: a `notes` scratchpad persists across all 7 days. If the agent doesn't write down "Tuesdays at 12pm worked," it has no memory.
+ - The agent has to **test before committing**: `predict_engagement` lets it simulate a plan, and `coach_feedback` shows the *counterfactual delta* between its plan and a heatmap-optimal plan. That second signal is the secret sauce — it teaches causality, not just outcomes.
+ - The agent has to **stay alive**: `creator_energy` decays with posting and recovers with rest, calibrated to a real sleep-deprivation study. Burn out and the episode ends early.

+ The model doesn't get a tutorial. It gets a phone, a calendar, a sleep cycle, and a question: *can you grow this account without breaking the human?*
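Concretely, the daily loop reads something like the sketch below. This is a hedged illustration against a stubbed environment: `FakeEnv`, its tag names, and the `get_trends`/`post` tool names are made up for the example; only `GET /tools`, `predict_engagement`, and the `notes` scratchpad come from the environment described in this post.

```python
class FakeEnv:
    """Stand-in for the real OpenEnv client; the real API may differ."""
    def __init__(self):
        self.posted = []

    def call(self, tool, **kwargs):
        if tool == "GET /tools":          # day-1 discovery
            return ["get_trends", "predict_engagement", "post"]
        if tool == "get_trends":          # query the world
            return ["#mealprep", "#deskstretch"]
        if tool == "predict_engagement":  # test a plan before committing
            return {"#mealprep": 0.62, "#deskstretch": 0.31}[kwargs["tag"]]
        if tool == "post":                # commit to a plan
            self.posted.append(kwargs["tag"])
            return {"reward": 0.62}

def run_day(env, notes):
    tools = env.call("GET /tools")                      # discover the toolbox
    assert "predict_engagement" in tools
    best = max(env.call("get_trends"),                  # pick the plan with the
               key=lambda tag: env.call("predict_engagement", tag=tag))
    reward = env.call("post", tag=best)["reward"]       # highest predicted score
    notes.append(f"{best} -> reward {reward}")          # carry beliefs forward
    return notes

env = FakeEnv()
print(run_day(env, []))  # ['#mealprep -> reward 0.62']
```

The counterfactual `coach_feedback` signal would slot in after the `post` call, comparing the realized reward against the shadow-run of the heatmap-optimal plan.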
+ ## The moat: every number is auditable

+ We were tired of RL environments where the rewards are vibes. So we drew a hard line: **every constant in Creator Copilot is backed by a Tier 1–3 source.** We even wrote a source-quality rubric and explicitly *rejected* 13 SEO/affiliate blogs that didn't meet it.

+ | Simulator component | Source |
+ |---|---|
+ | Engagement decomposition (watch_time, sends, saves, likes) | [Adam Mosseri, Head of Instagram, Jan 2025 statement](https://about.fb.com/news/) |
+ | 7×24 hour-of-day heatmap | [Buffer 9.6M-post study](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/) |
+ | Sleep-driven cognitive decay | [Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469) |
+ | Tiered audience fatigue from over-posting | [Buffer 2.1M-post frequency study](https://buffer.com/resources/how-often-to-post-on-instagram/) |
+ | Algorithmic disengagement model | [Cen et al., 2024 — arXiv:2410.13108](https://arxiv.org/abs/2410.13108) |
+ | Engagement vs. utility split | [Aouali et al., 2024 — arXiv:2406.01611](https://arxiv.org/abs/2406.01611) |

+ If a judge wants to challenge a single number, they can open `RESEARCH.md`, find the DOI/PMID/arXiv ID, and read the methodology. We *want* that fight.
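To make that concrete, here is roughly how a sourced finding becomes simulator code. This is an illustrative sketch, not the environment's actual implementation: only the 16-hour linear-lapse threshold comes from Van Dongen et al. (2003); the slope and floor values below are placeholders, not the paper's fit.

```python
def cognitive_capacity(hours_awake, slope=0.0625, floor=0.25):
    """0..1 multiplier on output quality: flat until 16h awake, then linear decay.
    slope/floor are illustrative placeholders, not sourced constants."""
    if hours_awake <= 16:
        return 1.0
    return max(floor, 1.0 - slope * (hours_awake - 16))

print(cognitive_capacity(12))  # 1.0  (within a normal waking day)
print(cognitive_capacity(20))  # 0.75 (four hours past the lapse threshold)
print(cognitive_capacity(40))  # 0.25 (clamped at the floor)
```

The auditability claim is that for every function like this in the simulator, the threshold and shape trace back to a line in `RESEARCH.md`.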
+ That auditability is also why we believe a researcher could write a paper on top of this environment — not "an LLM played a game," but "an LLM learned a strategy that survives a known sleep-deprivation curve."

+ ## What the agent gets graded on

+ We didn't want a single-number reward we could game. So the environment ships a **JudgeReport every day** — a deterministic, source-cited audit of three things:

+ - `policy_compliance` — did the agent break sourced sustainability rules? (e.g. more than 5 posts/day, from Buffer's 2.1M-post study; the weekly collab cap, from Cen et al. 2024; more than 22h awake, from Van Dongen et al. 2003)
+ - `sustainability_risk` — energy floor, sleep debt, and low-energy ratio over the day
+ - `strategic_quality` — engagement-per-post × intent diversity × format diversity

+ Plus three task graders calibrated to a *smart heuristic* baseline (`weekly_engage`, `weekly_strategic`, `weekly_competitive`). The agent isn't competing against zero — it's competing against a known-good rule-based player.

+ This composability is the OpenEnv Rubric idea taken seriously: separable, auditable signals that a researcher can swap in and out, not a monolithic black-box reward.
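As a sketch of what "separable, auditable signals" means in practice: each grader is a small pure function over the day's log, and the JudgeReport is just their dict. The thresholds echo the rules named above, but the function bodies and field names are illustrative stand-ins, not the environment's real formulas.

```python
def judge_report(day):
    """Deterministic daily audit built from small, swappable grader functions."""
    graders = {
        # sourced sustainability rules (thresholds echo the post's examples)
        "policy_compliance":
            lambda d: 0.0 if d["posts"] > 5 or d["hours_awake"] > 22 else 1.0,
        # higher = riskier: sleep debt normalized to a full night
        "sustainability_risk":
            lambda d: min(1.0, d["sleep_debt"] / 8.0),
        # engagement-per-post x intent diversity x format diversity
        "strategic_quality":
            lambda d: d["eng_per_post"] * d["intent_div"] * d["format_div"],
    }
    return {name: grade(day) for name, grade in graders.items()}

day = {"posts": 3, "hours_awake": 17, "sleep_debt": 2.0,
       "eng_per_post": 0.5, "intent_div": 0.75, "format_div": 1.0}
print(judge_report(day))
# {'policy_compliance': 1.0, 'sustainability_risk': 0.25, 'strategic_quality': 0.375}
```

Because the report is a dict of pure functions, swapping in a stricter compliance rule or a differently sourced fatigue curve is a one-entry change — that is the swap-in/swap-out property being claimed.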
+ ## Did the agent actually learn?

+ Yes — and we're being honest about where.

+ We trained Qwen2.5-3B-Instruct (Q4-quantized, running on a local M4 Mac via Ollama, no T4 needed) over 4 rounds of 6 episodes each, with temperature annealed from 1.4 to 0.7. Reward = per-step environment reward + 2× terminal grader score.
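In code, the reward shaping and the schedule read roughly like this. The linear anneal is our assumption for the sketch — the post only pins the 1.4 and 0.7 endpoints.

```python
def episode_return(step_rewards, grader_score, grader_weight=2.0):
    """Per-step environment reward plus 2x the terminal grader score."""
    return sum(step_rewards) + grader_weight * grader_score

def temperature(round_idx, n_rounds=4, t_start=1.4, t_end=0.7):
    """Anneal sampling temperature across training rounds (linearity assumed)."""
    frac = round_idx / (n_rounds - 1)
    return t_start + frac * (t_end - t_start)

print(episode_return([0.25, 0.5, 0.25], grader_score=0.5))  # 2.0
print([round(temperature(r), 3) for r in range(4)])         # [1.4, 1.167, 0.933, 0.7]
```

High early temperature pushes the agent to try many tool-call strategies; the decay toward 0.7 shifts it to exploiting whatever its notes say worked.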
+ | Task | Untrained | Trained | Δ (points) |
+ |---|---|---|---|
+ | `weekly_engage` | 0.355 | **0.409** | **+0.054** |
+ | `weekly_competitive` | 0.374 | **0.510** | **+0.136** |
+ | `weekly_strategic` | 0.680 | 0.627 | −0.053 |

+ The wins are largest on the *hardest* task — `weekly_competitive` — which is where the world model bites: the agent has to query competitors, differentiate its content, and time its posts. Exactly where we'd expect tool discovery to matter.

+ The strategic-task regression is real and we're not hiding it: the model kept exploring on a task where exploitation matters more, and our 4-round budget wasn't long enough to anneal that out. An honest result from a small training run.

+ What we *can* show qualitatively: the trained agent calls `GET /tools` on day 1, queries trends and competitors before posting, makes `predict_engagement` calls on the days it has a clear plan, and keeps `creator_energy` above 0.5 through the week. The untrained baseline posts blindly for the first few days and burns out.

+ Plots and the full per-episode log live in `plots/training_summary.json` and `plots/training_log.csv`.
+ ## Why this is the submission to remember

+ There were going to be a lot of grid-worlds at this hackathon. A lot of toy puzzles. A lot of "we trained on a math benchmark."

+ Creator Copilot is something different. It's the smallest possible environment that tests whether an LLM can be an **operator** — discover an unknown world, plan inquiry under a budget, hold beliefs across time, weigh strategy against the operator's own physical constraints, and beat a smart human-style baseline on it.

+ That's not just an Instagram problem. That's the shape of every interesting LLM deployment in the next two years: customer-success agents, ad ops, account managers, founders' assistants, ops engineers. Operator-class agents will live or die on whether they can do this loop. We don't have a benchmark for it yet. So we built one.

+ If you train an LLM on Creator Copilot and it gets better, you've taught it something it could not previously do — and you can prove it, line by line, against the literature.

+ That's the bet. That's why we built it.

+ ---

+ **Try it:** the environment is on Hugging Face Spaces, the training notebook is in `training/`, the per-day audit logs are in `server/simulation_history.json`, and every numeric constant has a citation in [`RESEARCH.md`](../RESEARCH.md).

+ We don't think we've solved the creator economy. We think we've built the first environment honest enough to fail against it. Come argue with our numbers.