# Viraltest v2: Teaching LLMs to Be Instagram Strategists Through World Modeling

**TL;DR:** We built an OpenEnv environment where an LLM agent manages an Instagram creator account for 30 simulated days. The agent receives sparse observations and must discover the world — trending topics, competitor behavior, audience segments, posting heatmaps — through a catalog of 8 tools. Every constant is calibrated against peer-reviewed research and large-N industry studies.

## The Problem

The $250B creator economy (Goldman Sachs, 2025) has 67 million creators, but 73% of them experience burnout (Awin, 2024). The core tension: post enough to stay visible in the algorithm, but not so much that quality drops and audiences fatigue. No existing RL environment captures this tradeoff with realistic dynamics.

## The Environment

**Viraltest v2** simulates a 30-day Instagram creator lifecycle grounded in 10+ verified data sources:

- **Engagement signals** decomposed into watch_time, sends_per_reach, saves, and likes_per_reach — matching Adam Mosseri's Jan-2025 official confirmation of the ranking signals
- **Hour-by-hour heatmap** from Buffer's 9.6M-post study, cross-validated with Sprout Social's 2B-engagement analysis
- **Sleep/cognitive model** based on Van Dongen et al. (2003, *Sleep*, PMID 12683469) — performance lapses grow linearly beyond 16 hours awake
- **Tiered audience fatigue** from Buffer's 2.1M-post frequency study — not a cliff but a gradual decay
- **7 competitor archetypes** with realistic posting cadences (3–5 posts per week, not per day)

## Why This Is World Modeling

The agent starts each day with almost no information — just energy, followers, and last reward. To plan effectively, it must:

1. **Discover tools** (`GET /tools`) on day 1
2. **Query the world** — trending topics, competitor activity, audience preferences
3. **Form hypotheses** and persist them in a scratchpad (the `notes` field)
4. **Test plans** via `predict_engagement` before committing
5.
   **Learn from counterfactual feedback** — the environment shadow-runs the optimal heatmap plan and shows the delta

This isn't prompt engineering. The agent must build and maintain an internal world model across 30 steps.

## Training

We trained Qwen2.5-1.5B-Instruct using TRL's GRPO trainer. Reward = per-step environment reward + 2× terminal grader score. After 200 episodes, the trained agent outperforms the untrained baseline on all three tasks (monthly_engage, monthly_strategic, monthly_competitive).

## Every Number Is Verifiable

We classify our sources into four tiers (peer-reviewed → industry → official → survey) and explicitly reject SEO/affiliate blogs. Full bibliography with DOIs, PMIDs, arXiv IDs, methodology extracts, and sample sizes lives in [RESEARCH.md](../RESEARCH.md).

[Environment on HF Spaces](#) | [GitHub repo](#) | [Training notebook](#)
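As a postscript, the discover/query/hypothesize/test/learn loop from the world-modeling section can be sketched in a few lines of Python. Everything here is a stand-in: the `StubEnv` class, its return values, and the tool names other than `predict_engagement` and the `notes` field are illustrative assumptions, not the real Viraltest v2 API.

```python
import json

class StubEnv:
    """Toy stand-in for the Viraltest v2 interface (illustrative only)."""

    def __init__(self):
        self.day = 0

    def get_tools(self):
        # Step 1: the agent discovers the tool catalog on day 1 (GET /tools).
        return ["get_trending", "get_competitors", "predict_engagement", "post"]

    def call(self, tool, **kwargs):
        # Steps 2 and 4: query the world, or test a plan before committing.
        if tool == "predict_engagement":
            return {"predicted_reward": 0.40}
        return {"data": f"stub result for {tool}"}

    def step(self, action, notes):
        # Steps 3 and 5: the scratchpad persists across days, and the
        # observation carries the counterfactual delta against the
        # shadow-run optimal heatmap plan.
        self.day += 1
        return {"day": self.day, "reward": 0.35,
                "counterfactual_delta": -0.05, "notes": notes}

env = StubEnv()
tools = env.get_tools()                                   # day-1 discovery
scratchpad = {"hypotheses": ["evening posts outperform mornings"]}
forecast = env.call("predict_engagement", hour=19)        # test before posting
obs = env.step(action={"post_at_hour": 19}, notes=json.dumps(scratchpad))
```

A negative `counterfactual_delta` tells the agent its plan underperformed the shadow-run optimum, a signal it can fold back into its scratchpad for the next day.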
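The training reward described above (per-step environment reward plus 2× the terminal grader score) reduces to a one-liner. The function below is our sketch of that combination; the name and argument shapes are assumptions, not TRL's API.

```python
def episode_return(step_rewards, grader_score, grader_weight=2.0):
    # Training reward from the post: sum of per-step environment
    # rewards, plus twice the terminal grader score.
    return sum(step_rewards) + grader_weight * grader_score
```

For example, an episode with per-step rewards of 1.0 and 0.5 and a terminal grader score of 0.8 yields a return of 3.1.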