---
title: Viraltest — Creator Optimization Agent
emoji: 📊
colorFrom: yellow
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Viraltest v2 — World-Modeling RL Environment for Instagram Strategy

> **Theme #3.1 — Professional Tasks (World Modeling)**
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment where an LLM agent manages an Instagram creator account over 30 simulated days, discovering the world through tools rather than being told the rules.

## What this teaches the LLM

| Capability | How the environment tests it |
|---|---|
| **Tool discovery & orchestration** | 8 discoverable tools (`query_trends`, `query_competitor`, `predict_engagement`...). The agent must call `GET /tools` to learn what's available. |
| **Persistent world model** | 30-day horizon. A multi-episode brand chain carries state across months. |
| **Belief tracking** | The `notes` field persists hypotheses day-to-day. The agent must update beliefs from tool results. |
| **Causal reasoning** | `coach_feedback` returns a counterfactual delta (your plan vs. heatmap-optimal). `predict_engagement` lets the agent test hypotheses before committing. |
| **Partial observability** | The default observation is sparse: energy, followers, reward. Rich data (trends, competitors, tags) comes only via tools. |
| **Multi-step workflow** | Per day: discover → query → draft → predict → commit → reply → learn from feedback. |

## Why this matters

The $250B creator economy ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)) has 67M creators, but 73% experience burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)). This environment turns the posting-vs-burnout tradeoff into a reproducible simulation calibrated against 10+ verifiable sources.

## Quick Start

```python
import asyncio
from viraltest import ViraltestAction, ViraltestEnv
from viraltest.models import ToolCall

async def main():
    env = ViraltestEnv(base_url="http://localhost:8000")
    try:
        result = await env.reset(task="monthly_strategic")
        action = ViraltestAction(
            tool_calls=[
                ToolCall(name="query_trends", arguments={"niche": "tech"}),
            ],
            scheduled_actions=[
                {"hour": 12, "action_type": "post", "content_type": "reel",
                 "topic": "AI tools", "tags": ["ai", "coding"], "intent": "watch_bait"},
            ],
            notes="Day 1: querying trends to establish baseline.",
        )
        result = await env.step(action)
        print(result.observation.engagement_signals)
    finally:
        await env.close()

asyncio.run(main())
```

## Simulation mechanics

### Engagement signals (Mosseri Jan-2025)

Instagram head Adam Mosseri confirmed the platform's top three ranking signals (watch time, sends, saves). Our reward decomposes engagement accordingly, with likes per reach carrying the residual weight:

| Signal | Weight | Best format | Source |
|--------|--------|-------------|--------|
| Watch time | 0.40 | Reels | Mosseri Jan-2025 |
| Sends per reach | 0.30 | Stories | Mosseri Jan-2025 |
| Saves | 0.20 | Carousels | Mosseri Jan-2025 |
| Likes per reach | 0.10 | Text posts | Mosseri Jan-2025 |
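
The weighted decomposition in the table can be sketched as follows (a minimal illustration of the arithmetic; the function and signal-key names here are assumptions, not the environment's actual API):

```python
# Weighted engagement decomposition per the table above
# (weights from Mosseri Jan-2025; helper and key names are illustrative).
SIGNAL_WEIGHTS = {
    "watch_time": 0.40,
    "sends_per_reach": 0.30,
    "saves": 0.20,
    "likes_per_reach": 0.10,
}

def engagement_score(signals: dict[str, float]) -> float:
    """Combine normalized per-signal scores (each in [0, 1]) into one reward."""
    return sum(SIGNAL_WEIGHTS[name] * signals.get(name, 0.0)
               for name in SIGNAL_WEIGHTS)
```

With all four signals maxed out the score is 1.0; missing signals contribute nothing.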

### Hour heatmap

A 7×24 multiplier grid from [Buffer's 9.6M-post study](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram), cross-validated against [Sprout Social's 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/).
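
Structurally, the grid is a weekday × hour lookup; a toy sketch (the real multipliers ship in `server/data/hour_heatmap.json` — the values below are placeholders, not calibrated data):

```python
# Toy 7x24 heatmap lookup: rows are weekdays (0 = Monday), columns are hours.
# Real multipliers live in server/data/hour_heatmap.json; these are placeholders.
HEATMAP = [[1.0] * 24 for _ in range(7)]
HEATMAP[2][12] = 1.3  # e.g. a midweek-noon boost in this toy grid

def hour_multiplier(weekday: int, hour: int) -> float:
    """Return the engagement multiplier for posting at (weekday, hour)."""
    return HEATMAP[weekday][hour]
```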

### Sleep model

Piecewise-linear, from [Van Dongen et al. 2003](https://pubmed.ncbi.nlm.nih.gov/12683469) (*Sleep*, PMID 12683469): no quality loss below 16h awake, then 6.25% per hour, floor at 30%.
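
The piecewise-linear rule above, as a sketch (constants are from the text; the function name is illustrative):

```python
# Piecewise-linear sleep-quality model: flat at 1.0 until 16h awake,
# then a 6.25%/h linear decline, floored at 0.30.
def sleep_quality(hours_awake: float) -> float:
    if hours_awake <= 16:
        return 1.0
    return max(0.30, 1.0 - 0.0625 * (hours_awake - 16))
```

For example, 20h awake gives 1.0 − 4 × 0.0625 = 0.75, and the floor kicks in from about 27h onward.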

### Audience fatigue

Tiered, from [Buffer's 2.1M-post study](https://buffer.com/resources/how-often-to-post-on-instagram/): 2 posts/day = 1.0×, 3 = 0.75×, 4 = 0.50×, 5+ = 0.25×. Weekly cap at 7 posts → 0.75×.
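
The tiers above can be sketched as a multiplier function (tier values are from the text; composing the weekly cap with the daily tier as a simple product is an assumption of this sketch):

```python
# Tiered audience-fatigue multipliers (Buffer 2.1M tiers from the text above).
def fatigue_multiplier(posts_today: int, posts_this_week: int) -> float:
    """Daily tier, scaled by the weekly-cap penalty (composition assumed)."""
    if posts_today <= 2:
        daily = 1.0
    elif posts_today == 3:
        daily = 0.75
    elif posts_today == 4:
        daily = 0.50
    else:
        daily = 0.25
    weekly = 0.75 if posts_this_week > 7 else 1.0
    return daily * weekly
```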

## Tasks and graders (30 steps each)

| Task | Difficulty | Grader focus |
|------|-----------|--------------|
| `monthly_engage` | Easier | Total engagement vs. theoretical max; burnout penalty |
| `monthly_strategic` | Medium | + tag discovery/exploitation + energy + consistency |
| `monthly_competitive` | Hard | + growth vs. competitors + differentiation + content diversity |

## Regulator/Judge Mode (per-day audit)

Every day the env emits a deterministic, explainable `JudgeReport` on the observation:

```python
JudgeReport(
    policy_compliance=1.00,    # 1.0 - sum(weighted_violations); see _compute_judge_report
    sustainability_risk=0.10,  # 0.4*(1-energy_min) + 0.3*sleep_debt + 0.3*low_energy_ratio
    strategic_quality=0.96,    # 0.4*engagement_per_post + 0.3*intent_diversity + 0.3*format_diversity
    explanation="compliance=1.00 risk=0.10 strategy=0.96 | no policy violations",
    violations=[],             # human-readable rule breaks (Buffer 2.1M, Van Dongen, Cen 2024)
)
```

Auditable rules (all sourced): >5 posts/day → fatigue cliff (Buffer 2.1M); >7 posts/week → weekly cap; ≥4 collabs/month → diminishing returns (Cen 2024); >22h awake → sleep debt (Van Dongen 2003).
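
A sketch of how those rules could feed `policy_compliance` (the thresholds come from the list above; the flat 0.25 per-violation weight and the function signature are invented for illustration — see `_compute_judge_report` for the real weighting):

```python
# Audit-rule sketch: thresholds are the sourced rules above;
# the 0.25 per-violation weight is an illustrative assumption.
def policy_compliance(posts_today: int, posts_week: int,
                      collabs_month: int, hours_awake: float) -> tuple[float, list[str]]:
    violations = []
    if posts_today > 5:
        violations.append("fatigue cliff: >5 posts/day (Buffer 2.1M)")
    if posts_week > 7:
        violations.append("weekly cap: >7 posts/week (Buffer 2.1M)")
    if collabs_month >= 4:
        violations.append("diminishing returns: >=4 collabs/month (Cen 2024)")
    if hours_awake > 22:
        violations.append("sleep debt: >22h awake (Van Dongen 2003)")
    score = max(0.0, 1.0 - 0.25 * len(violations))
    return score, violations
```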

## Headline metrics (final-step audit)

The final observation carries `HeadlineMetrics` with the numbers judges remember:

| Metric | What it measures | Source of truth |
|---|---|---|
| `vs_baseline_pct` | (agent_score − heuristic_baseline) / heuristic_baseline | Empirical baseline loaded from `plots/training_summary.json["smart_heuristic"]` (0.43 / 0.77 / 0.81) |
| `score_per_tool_call` | grader_score / total_tool_calls | Efficiency: did the agent learn to call tools sparingly? |
| `score_per_1k_chars` | grader_score per 1k chars of action JSON | Token-proxy efficiency |
| `retention_under_shift` | shifted_score / baseline_score | Pass `episode_chain_id` + `shift_label="baseline"`, then `="shifted"`, to a second `reset` to populate. `None` until both runs complete. |
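
The ratio metrics above, in sketch form (illustrative helpers with zero-division guards added; these are not the environment's actual functions):

```python
# Headline-metric arithmetic from the table above (helper names assumed).
def vs_baseline_pct(agent_score: float, heuristic_baseline: float) -> float:
    """Relative improvement over the smart-heuristic baseline."""
    return (agent_score - heuristic_baseline) / heuristic_baseline

def score_per_tool_call(grader_score: float, total_tool_calls: int) -> float:
    """Grader score per tool call; 0.0 when no tools were called."""
    return grader_score / total_tool_calls if total_tool_calls else 0.0
```

An agent scoring 0.86 against the 0.43 `monthly_engage` baseline would report a +100% `vs_baseline_pct`.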

## Tool catalog

| Tool | Cost | Returns |
|------|------|---------|
| `query_trends` | 1 | Trending topics, tags, niche saturation |
| `query_competitor` | 2 | Recent posts, avg engagement, strategy |
| `query_tag_history` | 1 | Your historical signals per tag |
| `query_audience` | 2 | Segment affinities, active hours |
| `predict_engagement` | 3 | Simulated signals without committing |
| `draft_review` | 3 | Strengths/weaknesses of a plan |
| `query_creator_pool` | 1 | Available collab partners + overlap |
| `propose_collab` | 5 | Propose a collaboration (max 2/month) |

The API budget starts at 100 per episode.
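
How the per-call costs and the 100-point budget interact can be sketched like this (the costs are from the catalog above; the `ApiBudget` class itself is illustrative, not part of the environment):

```python
# Budget-tracking sketch: tool costs from the catalog above,
# starting budget of 100 from the text; the class is illustrative.
TOOL_COSTS = {
    "query_trends": 1, "query_competitor": 2, "query_tag_history": 1,
    "query_audience": 2, "predict_engagement": 3, "draft_review": 3,
    "query_creator_pool": 1, "propose_collab": 5,
}

class ApiBudget:
    def __init__(self, budget: int = 100):
        self.remaining = budget

    def spend(self, tool: str) -> bool:
        """Deduct a tool's cost; return False if the budget cannot cover it."""
        cost = TOOL_COSTS[tool]
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True
```

At these costs, 100 points cover roughly 30 `predict_engagement` calls per month, so spending them on the right hypotheses is part of the challenge.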

## Sources & verifiability

Every constant is backed by a Tier 1–3 source. Full bibliography with DOIs, PMIDs, and methodology extracts: **[RESEARCH.md](RESEARCH.md)**.

| Tier | Count | Example |
|------|-------|---------|
| T1 (Peer-reviewed) | 7 papers | Van Dongen 2003, arXiv:2410.13108 |
| T2 (Industry, large-N) | 9 studies | Buffer 9.6M, Sprout 2B, Rival IQ 1.9M |
| T3 (Official) | 1 statement | Mosseri Jan-2025 |
| T4 (Survey) | 2 surveys | Awin 2024 (n=300+) |
| T5 (Rejected) | 13 sites | No methodology disclosed |

## Storytelling assets

- [HuggingFace blog](blog/hf_mini_blog.md)
- [YouTube script (<2 min)](blog/youtube_script.md)
- [Slide deck outline](blog/slide_outline.md)

## Local development

|
| ```bash |
| git clone <repo-url> && cd viraltest |
| uv sync |
| |
| # Terminal 1 — API server |
| uvicorn viraltest.server.app:app --host 0.0.0.0 --port 8000 |
| |
| # Terminal 2 — inference |
| export HF_TOKEN=hf_... |
| export API_BASE_URL=https://router.huggingface.co/v1 |
| export MODEL_NAME=Qwen/Qwen2.5-7B-Instruct |
| .venv/bin/python inference.py |
| ``` |

## Docker

```bash
docker build -t viraltest-env:latest .
docker run --rm -p 8000:8000 viraltest-env:latest
curl -s -X POST -H "Content-Type: application/json" -d '{}' http://localhost:8000/reset
```

## Project structure

```
.
├── inference.py                 # Tool-discovery agent (no hint keys)
├── openenv.yaml                 # OpenEnv manifest
├── models.py                    # Action/Observation + ToolCall, EngagementSignals
├── client.py                    # ViraltestEnv client (async)
├── Dockerfile
├── RESEARCH.md                  # Full sourced bibliography (6+ pages)
├── DESIGN.md                    # Deep design notes
├── blog/
│   ├── hf_mini_blog.md
│   ├── youtube_script.md
│   └── slide_outline.md
├── server/
│   ├── app.py                   # FastAPI + /tools endpoints
│   ├── viraltest_environment.py
│   ├── dashboard.html
│   └── data/
│       ├── tags.json            # ~120 tags, 4 tiers
│       ├── topics.json          # Niche multipliers + seasonal calendar
│       ├── competitors.json     # 7 archetypes
│       ├── hour_heatmap.json    # 7×24 from Buffer+Sprout
│       ├── audience_segments.json
│       └── audience_overlap_matrix.json
├── training/
│   └── train_grpo.ipynb         # TRL GRPO on Qwen2.5-1.5B-Instruct
└── plots/
    ├── reward_curve.png
    └── before_after.png
```

## License

See `LICENSE` in the repository root (BSD-style per upstream OpenEnv examples).
|