anuragredbus committed on
Commit 034a807 · 1 Parent(s): 7a5c462

Update hf_mini_blog.md

Files changed (1):
  1. blog/hf_mini_blog.md +84 -24

blog/hf_mini_blog.md CHANGED
@@ -1,39 +1,99 @@
- # Viraltest v2: Teaching LLMs to Be Instagram Strategists Through World Modeling

- **TL;DR:** We built an OpenEnv environment where an LLM agent manages an Instagram creator account for 30 simulated days. The agent receives sparse observations and must discover the world — trending topics, competitor behavior, audience segments, posting heatmaps — through a catalog of 8 tools. Every constant is calibrated against peer-reviewed research and large-N industry studies.

- ## The Problem

- The $250B creator economy (Goldman Sachs, 2025) has 67 million creators, but 73% experience burnout (Awin, 2024). The core tension: post enough to stay visible in the algorithm, but not so much that quality drops and audiences fatigue. No existing RL environment captures this tradeoff with realistic dynamics.

- ## The Environment

- **Viraltest v2** simulates a 30-day Instagram creator lifecycle grounded in 10+ verified data sources:

- - **Engagement signals** decomposed into watch_time, sends_per_reach, saves, and likes_per_reach, matching Adam Mosseri's Jan-2025 official ranking-signal confirmation
- - **Hour-by-hour heatmap** from Buffer's 9.6M-post study, cross-validated with Sprout Social's 2B-engagement analysis
- - **Sleep/cognitive model** based on Van Dongen et al. (2003, *Sleep*, PMID 12683469) — performance lapses grow linearly above 16 hours awake
- - **Tiered audience fatigue** from Buffer's 2.1M-post frequency study — not a cliff but a gradual decay
- - **7 competitor archetypes** with realistic posting cadences (3–5/week, not per-day)

- ## Theme #3.1: Why This Is World Modeling

- The agent starts each day with almost no information — just energy, followers, and last reward. To plan effectively, it must:

- 1. **Discover tools** (`GET /tools`) on day 1
- 2. **Query the world** — trending topics, competitor activity, audience preferences
- 3. **Form hypotheses** and persist them in a scratchpad (`notes` field)
- 4. **Test plans** via `predict_engagement` before committing
- 5. **Learn from counterfactual feedback** — the environment shadow-runs the optimal heatmap plan and shows the delta

- This isn't prompt engineering. The agent must build and maintain an internal world model across 30 steps.

- ## Training

- We trained Qwen2.5-1.5B-Instruct using TRL's GRPO trainer. Reward = per-step environment reward + terminal grader score. After 200 episodes, the trained agent outperforms the untrained baseline on all three tasks (monthly_engage, monthly_strategic, monthly_competitive).

- ## Every Number Is Verifiable

- We classify our sources into 4 tiers (peer-reviewed, industry, official, survey) and explicitly reject SEO/affiliate blogs. Full bibliography with DOIs, PMIDs, arXiv IDs, methodology extracts, and sample sizes lives in [RESEARCH.md](../RESEARCH.md).

- [Environment on HF Spaces](#) | [GitHub repo](#) | [Training notebook](#)
+ # We Trained an LLM to Survive Instagram

+ ### Why we built Creator Copilot, an OpenEnv environment where the agent learns by living a creator's life, not by reading about it.

+ ---

+ ## The scene we couldn't shake

+ A creator wakes up at 7:42 AM. Yesterday's reel did 12% of what last week's did. Nobody at the platform will tell her why. There is a heatmap somewhere, a ranking change last Tuesday, an audience segment that quietly shifted, a "trending" tag that peaked six hours ago. She doesn't have access to any of it. So she does the only thing she can do: she posts more. Eventually 73% of creators in her cohort report burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)).

+ The creator economy is a $250B industry running on guesswork ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)). 67 million people are running businesses inside a black box, against an algorithm that nobody outside Meta fully understands, while their own bodies push back at 16 hours of wakefulness ([Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469)).

+ That is a *real* world-model problem. And we couldn't find a single RL environment that took it seriously.

+ ## Creator Copilot in one sentence

+ **An OpenEnv environment where an LLM agent runs an Instagram creator account for 7 simulated days, gets almost nothing for free, and has to discover the rules of the world through a catalog of 8 tools and a notebook.**

+ It is the smallest version we could build of "operate a real account in a real economy."
+ ## The bet: discovery, not instruction

+ Most agent environments hand the model a verbose observation and ask it to pick from 4 actions. Creator Copilot does the opposite. The default observation is *deliberately sparse* — just `energy`, `followers`, `last reward`. Everything interesting (trending topics, competitor cadence, audience segments, hour-by-hour engagement, your own past tag performance) is hidden behind tools the agent has to *discover* by hitting `GET /tools`.

+ This is the move that makes Creator Copilot a world-modeling environment rather than a recommendation problem:

+ - The agent has to **plan inquiry**: queries are the only way to reduce uncertainty, so it has to choose which questions are worth asking.
+ - The agent has to **carry beliefs forward**: a `notes` scratchpad persists across all 7 days. If the agent doesn't write down "Tuesdays at 12pm worked," it has no memory.
+ - The agent has to **test before committing**: `predict_engagement` lets it simulate a plan, and `coach_feedback` shows the *counterfactual delta* between its plan and a heatmap-optimal plan. That second signal is the secret sauce — it teaches causality, not just outcomes.
+ - The agent has to **stay alive**: `creator_energy` decays with posting and recovers with rest, calibrated to a real sleep-deprivation study. Burn out and the episode ends early.

+ The model doesn't get a tutorial. It gets a phone, a calendar, a sleep cycle, and a question: *can you grow this account without breaking the human?*
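Concretely, the daily loop reads something like the sketch below. This is a hedged illustration against a stubbed environment: `FakeEnv`, its tag names, and the `get_trends`/`post` tool names are made up for the example; only `GET /tools`, `predict_engagement`, and the `notes` scratchpad come from the environment described in this post.

```python
class FakeEnv:
    """Stand-in for the real OpenEnv client; the real API may differ."""
    def __init__(self):
        self.posted = []

    def call(self, tool, **kwargs):
        if tool == "GET /tools":          # day-1 discovery
            return ["get_trends", "predict_engagement", "post"]
        if tool == "get_trends":          # query the world
            return ["#mealprep", "#deskstretch"]
        if tool == "predict_engagement":  # test a plan before committing
            return {"#mealprep": 0.62, "#deskstretch": 0.31}[kwargs["tag"]]
        if tool == "post":                # commit to a plan
            self.posted.append(kwargs["tag"])
            return {"reward": 0.62}

def run_day(env, notes):
    tools = env.call("GET /tools")                      # discover the toolbox
    assert "predict_engagement" in tools
    best = max(env.call("get_trends"),                  # pick the plan with the
               key=lambda tag: env.call("predict_engagement", tag=tag))
    reward = env.call("post", tag=best)["reward"]       # highest predicted score
    notes.append(f"{best} -> reward {reward}")          # carry beliefs forward
    return notes

env = FakeEnv()
print(run_day(env, []))  # ['#mealprep -> reward 0.62']
```

The counterfactual `coach_feedback` signal would slot in after the `post` call, comparing the realized reward against the shadow-run of the heatmap-optimal plan.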
+ ## The moat: every number is auditable

+ We were tired of RL environments where the rewards are vibes. So we drew a hard line: **every constant in Creator Copilot is backed by a Tier 1–3 source.** We even wrote a source-quality rubric and explicitly *rejected* 13 SEO/affiliate blogs that didn't meet it.

+ | Simulator component | Source |
+ |---|---|
+ | Engagement decomposition (watch_time, sends, saves, likes) | [Adam Mosseri, Head of Instagram, Jan 2025 statement](https://about.fb.com/news/) |
+ | 7×24 hour-of-day heatmap | [Buffer 9.6M-post study](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/) |
+ | Sleep-driven cognitive decay | [Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469) |
+ | Tiered audience fatigue from over-posting | [Buffer 2.1M-post frequency study](https://buffer.com/resources/how-often-to-post-on-instagram/) |
+ | Algorithmic disengagement model | [Cen et al., 2024 — arXiv:2410.13108](https://arxiv.org/abs/2410.13108) |
+ | Engagement vs. utility split | [Aouali et al., 2024 — arXiv:2406.01611](https://arxiv.org/abs/2406.01611) |

+ If a judge wants to challenge a single number, they can open `RESEARCH.md`, find the DOI/PMID/arXiv ID, and read the methodology. We *want* that fight.
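To make that concrete, here is roughly how a sourced finding becomes simulator code. This is an illustrative sketch, not the environment's actual implementation: only the 16-hour linear-lapse threshold comes from Van Dongen et al. (2003); the slope and floor values below are placeholders, not the paper's fit.

```python
def cognitive_capacity(hours_awake, slope=0.0625, floor=0.25):
    """0..1 multiplier on output quality: flat until 16h awake, then linear decay.
    slope/floor are illustrative placeholders, not sourced constants."""
    if hours_awake <= 16:
        return 1.0
    return max(floor, 1.0 - slope * (hours_awake - 16))

print(cognitive_capacity(12))  # 1.0  (within a normal waking day)
print(cognitive_capacity(20))  # 0.75 (four hours past the lapse threshold)
print(cognitive_capacity(40))  # 0.25 (clamped at the floor)
```

The auditability claim is that for every function like this in the simulator, the threshold and shape trace back to a line in `RESEARCH.md`.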
+ That auditability is also why we believe a researcher could write a paper on top of this environment — not "an LLM played a game," but "an LLM learned a strategy that survives a known sleep-deprivation curve."

+ ## What the agent gets graded on

+ We didn't want a single-number reward we could game. So the environment ships a **JudgeReport every day** — a deterministic, source-cited audit of three things:

+ - `policy_compliance` — did the agent break sourced sustainability rules? (e.g. more than 5 posts/day, from Buffer's 2.1M-post study; the weekly collab cap, from Cen et al. 2024; more than 22h awake, from Van Dongen et al. 2003)
+ - `sustainability_risk` — energy floor, sleep debt, and low-energy ratio over the day
+ - `strategic_quality` — engagement-per-post × intent diversity × format diversity

+ Plus three task graders calibrated to a *smart heuristic* baseline (`weekly_engage`, `weekly_strategic`, `weekly_competitive`). The agent isn't competing against zero — it's competing against a known-good rule-based player.

+ This composability is the OpenEnv Rubric idea taken seriously: separable, auditable signals that a researcher can swap in and out, not a monolithic black-box reward.
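As a sketch of what "separable, auditable signals" means in practice: each grader is a small pure function over the day's log, and the JudgeReport is just their dict. The thresholds echo the rules named above, but the function bodies and field names are illustrative stand-ins, not the environment's real formulas.

```python
def judge_report(day):
    """Deterministic daily audit built from small, swappable grader functions."""
    graders = {
        # sourced sustainability rules (thresholds echo the post's examples)
        "policy_compliance":
            lambda d: 0.0 if d["posts"] > 5 or d["hours_awake"] > 22 else 1.0,
        # higher = riskier: sleep debt normalized to a full night
        "sustainability_risk":
            lambda d: min(1.0, d["sleep_debt"] / 8.0),
        # engagement-per-post x intent diversity x format diversity
        "strategic_quality":
            lambda d: d["eng_per_post"] * d["intent_div"] * d["format_div"],
    }
    return {name: grade(day) for name, grade in graders.items()}

day = {"posts": 3, "hours_awake": 17, "sleep_debt": 2.0,
       "eng_per_post": 0.5, "intent_div": 0.75, "format_div": 1.0}
print(judge_report(day))
# {'policy_compliance': 1.0, 'sustainability_risk': 0.25, 'strategic_quality': 0.375}
```

Because the report is a dict of pure functions, swapping in a stricter compliance rule or a differently sourced fatigue curve is a one-entry change — that is the swap-in/swap-out property being claimed.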
+ ## Did the agent actually learn?

+ Yes — and we're being honest about where.

+ We trained Qwen2.5-3B-Instruct (Q4-quantized, running on a local M4 Mac via Ollama, no T4 needed) over 4 rounds of 6 episodes each, with temperature annealed from 1.4 to 0.7. Reward = per-step environment reward + 2× terminal grader score.
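In code, the reward shaping and the schedule read roughly like this. The linear anneal is our assumption for the sketch — the post only pins the 1.4 and 0.7 endpoints.

```python
def episode_return(step_rewards, grader_score, grader_weight=2.0):
    """Per-step environment reward plus 2x the terminal grader score."""
    return sum(step_rewards) + grader_weight * grader_score

def temperature(round_idx, n_rounds=4, t_start=1.4, t_end=0.7):
    """Anneal sampling temperature across training rounds (linearity assumed)."""
    frac = round_idx / (n_rounds - 1)
    return t_start + frac * (t_end - t_start)

print(episode_return([0.25, 0.5, 0.25], grader_score=0.5))  # 2.0
print([round(temperature(r), 3) for r in range(4)])         # [1.4, 1.167, 0.933, 0.7]
```

High early temperature pushes the agent to try many tool-call strategies; the decay toward 0.7 shifts it to exploiting whatever its notes say worked.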
+ | Task | Untrained | Trained | Δ (points) |
+ |---|---|---|---|
+ | `weekly_engage` | 0.355 | **0.409** | **+0.054** |
+ | `weekly_competitive` | 0.374 | **0.510** | **+0.136** |
+ | `weekly_strategic` | 0.680 | 0.627 | −0.053 |

+ The wins are largest on the *hardest* task — `weekly_competitive` — which is where the world model bites: the agent has to query competitors, differentiate its content, and time its posts. Exactly where we'd expect tool discovery to matter.

+ The strategic-task regression is real and we're not hiding it: the model kept exploring on a task where exploitation matters more, and our 4-round budget wasn't long enough to anneal that out. An honest result from a small training run.

+ What we *can* show qualitatively: the trained agent calls `GET /tools` on day 1, queries trends and competitors before posting, makes `predict_engagement` calls on the days it has a clear plan, and keeps `creator_energy` above 0.5 through the week. The untrained baseline posts blindly for the first few days and burns out.

+ Plots and the full per-episode log live in `plots/training_summary.json` and `plots/training_log.csv`.
+ ## Why this is the submission to remember

+ There were going to be a lot of grid-worlds at this hackathon. A lot of toy puzzles. A lot of "we trained on a math benchmark."

+ Creator Copilot is something different. It's the smallest possible environment that tests whether an LLM can be an **operator** — discover an unknown world, plan inquiry under a budget, hold beliefs across time, weigh strategy against the operator's own physical constraints, and beat a smart human-style baseline on it.

+ That's not just an Instagram problem. That's the shape of every interesting LLM deployment in the next two years: customer-success agents, ad ops, account managers, founders' assistants, ops engineers. Operator-class agents will live or die on whether they can do this loop. We don't have a benchmark for it yet. So we built one.

+ If you train an LLM on Creator Copilot and it gets better, you've taught it something it could not previously do — and you can prove it, line by line, against the literature.

+ That's the bet. That's why we built it.

+ ---

+ **Try it:** the environment is on Hugging Face Spaces, the training notebook is in `training/`, the per-day audit logs are in `server/simulation_history.json`, and every numeric constant has a citation in [`RESEARCH.md`](../RESEARCH.md).

+ We don't think we've solved the creator economy. We think we've built the first environment honest enough to fail against it. Come argue with our numbers.