Viraltest v2: Teaching LLMs to Be Instagram Strategists Through World Modeling
TL;DR: We built an OpenEnv environment where an LLM agent manages an Instagram creator account for 30 simulated days. The agent receives sparse observations and must discover the world — trending topics, competitor behavior, audience segments, posting heatmaps — through a catalog of 8 tools. Every constant is calibrated against peer-reviewed research and large-N industry studies.
The Problem
The $250B creator economy (Goldman Sachs, 2025) has 67 million creators, but 73% experience burnout (Awin, 2024). The core tension: post often enough to stay visible to the algorithm, but not so often that quality drops and the audience fatigues. No existing RL environment captures this tradeoff with realistic dynamics.
The Environment
Viraltest v2 simulates a 30-day Instagram creator lifecycle grounded in 10+ verified data sources:
- Engagement signals decomposed into watch_time, sends_per_reach, saves, and likes_per_reach, matching the ranking signals Adam Mosseri officially confirmed in January 2025
- Hour-by-hour heatmap from Buffer's 9.6M-post study cross-validated with Sprout Social's 2B-engagement analysis
- Sleep/cognitive model based on Van Dongen et al. (2003, Sleep, PMID 12683469): performance lapses grow near-linearly beyond 16 hours awake
- Tiered audience fatigue from Buffer's 2.1M-post frequency study — not a cliff but a gradual decay
- 7 competitor archetypes with realistic posting cadences (3–5/week, not per-day)
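To make these pieces concrete, here is a minimal sketch of how the calibrated dynamics might compose. Everything in it is illustrative: the weights, decay rates, and names (ENGAGEMENT_WEIGHTS, fatigue_multiplier, cognitive_penalty) are hypothetical stand-ins, not the environment's actual constants, which are documented in RESEARCH.md.

```python
# Illustrative sketch only; all weights and rates are hypothetical
# placeholders for the calibrated constants in RESEARCH.md.

# Engagement decomposed into the four confirmed ranking signals.
ENGAGEMENT_WEIGHTS = {
    "watch_time": 0.40,        # hypothetical weight
    "sends_per_reach": 0.30,
    "saves": 0.20,
    "likes_per_reach": 0.10,
}

def engagement_score(signals: dict[str, float]) -> float:
    """Weighted combination of the per-post ranking signals."""
    return sum(ENGAGEMENT_WEIGHTS[k] * signals[k] for k in ENGAGEMENT_WEIGHTS)

def fatigue_multiplier(posts_this_week: int) -> float:
    """Tiered audience fatigue: a gradual decay past a healthy cadence,
    not a cliff. The 4%-per-extra-post rate is illustrative."""
    if posts_this_week <= 5:
        return 1.0
    return max(0.5, 1.0 - 0.04 * (posts_this_week - 5))

def cognitive_penalty(hours_awake: float) -> float:
    """Near-linear lapse growth beyond ~16 hours awake (Van Dongen 2003).
    The slope is a hypothetical stand-in."""
    return max(0.0, 0.05 * (hours_awake - 16.0))
```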
Why This Is World Modeling
The agent starts each day with almost no information — just energy, followers, and last reward. To plan effectively, it must:
- Discover tools (`GET /tools`) on day 1
- Query the world: trending topics, competitor activity, audience preferences
- Form hypotheses and persist them in a scratchpad (the `notes` field)
- Test plans via `predict_engagement` before committing
- Learn from counterfactual feedback: the environment shadow-runs the optimal heatmap plan and shows the delta
This isn't prompt engineering. The agent must build and maintain an internal world model across 30 steps.
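As a rough illustration, one episode from the agent's side might look like the loop below. Only `GET /tools`, `predict_engagement`, the `notes` scratchpad, and the counterfactual delta come from the environment description; `env.call_tool`, `env.step`, the 4-tuple return, and the agent methods (`pick_queries`, `update_notes`, `propose_plan`, `finalize`) are assumptions about the interface, not its documented surface.

```python
# Hypothetical episode loop; method names and return shapes are assumed.
def run_episode(env, agent, days: int = 30) -> float:
    obs = env.reset()                    # sparse: energy, followers, last reward
    tools = env.call_tool("GET /tools")  # day-1 discovery of the tool catalog
    notes = ""                           # scratchpad persisted across days
    total = 0.0
    for _ in range(days):
        # Query the world, then fold observations into hypotheses.
        world = {t: env.call_tool(t) for t in agent.pick_queries(tools, obs, notes)}
        notes = agent.update_notes(notes, world)
        # Dry-run a candidate plan before committing to it.
        plan = agent.propose_plan(obs, notes)
        forecast = env.call_tool("predict_engagement", plan)
        obs, reward, done, info = env.step(agent.finalize(plan, forecast))
        # Counterfactual feedback: delta vs. the shadow-run optimal heatmap plan.
        notes = agent.update_notes(notes, {"counterfactual_delta": info.get("delta")})
        total += reward
        if done:
            break
    return total
```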
Training
We trained Qwen2.5-1.5B-Instruct using TRL's GRPO trainer. Reward = per-step environment reward + 2× terminal grader score. After 200 episodes, the trained agent outperforms the untrained baseline on all three tasks (monthly_engage, monthly_strategic, monthly_competitive).
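The reward shaping is a direct reading of the formula above; this sketch just makes the arithmetic explicit, with episode_reward and its argument names as illustrative placeholders rather than the training code's actual identifiers.

```python
TERMINAL_GRADER_WEIGHT = 2.0  # reward = per-step env reward + 2x grader score

def episode_reward(step_rewards: list[float], grader_score: float) -> float:
    """Scalar episode return: summed per-step environment reward
    plus a doubled terminal grader score."""
    return sum(step_rewards) + TERMINAL_GRADER_WEIGHT * grader_score

# e.g. 30 daily rewards averaging 0.2 plus a grader score of 0.8:
# 30 * 0.2 + 2.0 * 0.8 = 7.6
```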
Every Number Is Verifiable
We classify our sources into 4 tiers (peer-reviewed → industry → official → survey) and explicitly reject SEO/affiliate blogs. Full bibliography with DOIs, PMIDs, arXiv IDs, methodology extracts, and sample sizes lives in RESEARCH.md.