---
title: Viraltest — Creator Optimization Agent
emoji: 📊
colorFrom: yellow
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Viraltest v2 — World-Modeling RL Environment for Instagram Strategy
> **Theme #3.1 — Professional Tasks (World Modeling)**
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment where an LLM agent manages an Instagram creator account over 30 simulated days, discovering the world through tools rather than being told the rules.
## What this teaches the LLM
| Capability | How the environment tests it |
|---|---|
| **Tool discovery & orchestration** | 8 discoverable tools (`query_trends`, `query_competitor`, `predict_engagement`...). Agent must call `GET /tools` to learn what's available. |
| **Persistent world model** | 30-day horizon. Multi-episode brand chain carries state across months. |
| **Belief tracking** | `notes` field persists hypotheses day-to-day. Agent must update beliefs from tool results. |
| **Causal reasoning** | `coach_feedback` returns counterfactual delta (your plan vs. heatmap-optimal). `predict_engagement` lets agent test hypotheses before committing. |
| **Partial observability** | Default observation is sparse: energy, followers, reward. Rich data (trends, competitors, tags) only via tools. |
| **Multi-step workflow** | Per day: discover → query → draft → predict → commit → reply → learn from feedback. |
## Why this matters
The $250B creator economy ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)) has 67M creators, but 73% experience burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)). This environment turns the posting-vs-burnout tradeoff into a reproducible simulation calibrated against 10+ verifiable sources.
## Quick Start
```python
import asyncio

from viraltest import ViraltestAction, ViraltestEnv
from viraltest.models import ToolCall


async def main():
    env = ViraltestEnv(base_url="http://localhost:8000")
    try:
        result = await env.reset(task="monthly_strategic")
        action = ViraltestAction(
            tool_calls=[
                ToolCall(name="query_trends", arguments={"niche": "tech"}),
            ],
            scheduled_actions=[
                {"hour": 12, "action_type": "post", "content_type": "reel",
                 "topic": "AI tools", "tags": ["ai", "coding"], "intent": "watch_bait"},
            ],
            notes="Day 1: querying trends to establish baseline.",
        )
        result = await env.step(action)
        print(result.observation.engagement_signals)
    finally:
        await env.close()


asyncio.run(main())
```
## Simulation mechanics
### Engagement signals (Mosseri Jan-2025)
Instagram's head, Adam Mosseri, confirmed the top ranking signals in January 2025. Our reward decomposes engagement into the four weighted signals below:
| Signal | Weight | Best format | Source |
|--------|--------|-------------|--------|
| Watch time | 0.40 | Reels | Mosseri Jan-2025 |
| Sends per reach | 0.30 | Stories | Mosseri Jan-2025 |
| Saves | 0.20 | Carousels | Mosseri Jan-2025 |
| Likes per reach | 0.10 | Text posts | Mosseri Jan-2025 |
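
As a minimal sketch of how the weights in the table could combine into one scalar, the snippet below takes a weighted sum. The weights come from the table; the function name, the dictionary keys, and the assumption that each signal is already normalized to [0, 1] are illustrative and not taken from the environment's code.

```python
# Weights copied from the table above. Key names are hypothetical.
SIGNAL_WEIGHTS = {
    "watch_time": 0.40,
    "sends_per_reach": 0.30,
    "saves": 0.20,
    "likes_per_reach": 0.10,
}


def weighted_engagement(signals):
    """Weighted sum of signals; missing signals count as 0.0.

    Assumes every signal value is already normalized to [0, 1].
    """
    return sum(
        weight * signals.get(name, 0.0)
        for name, weight in SIGNAL_WEIGHTS.items()
    )
```

Because the weights sum to 1.0, a post that maxes out every signal scores 1.0 under this sketch, and a Reel that only earns watch time is capped at 0.40.
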
### Hour heatmap
7×24 multiplier grid from [Buffer 9.6M posts](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/).
### Sleep model
Piecewise-linear quality curve derived from [Van Dongen et al. 2003](https://pubmed.ncbi.nlm.nih.gov/12683469) (*Sleep*, PMID 12683469): no quality loss up to 16h awake, then a 6.25% drop per additional hour, with a floor at 30%.
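
The curve described above can be sketched as a single function. This is an illustration under stated assumptions, not the environment's implementation: the function name is hypothetical, and the 6.25% is read as percentage points of quality lost per hour past 16.

```python
def sleep_quality(hours_awake):
    """Content-quality multiplier as a function of hours awake.

    Assumption: 6.25 percentage points are lost per hour beyond 16,
    and the multiplier never drops below 0.30.
    """
    if hours_awake <= 16:
        return 1.0
    return max(0.30, 1.0 - 0.0625 * (hours_awake - 16))
```

Under this reading the floor is reached a little after 27 hours awake, since 0.70 / 0.0625 = 11.2 hours past the 16-hour threshold.
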
### Audience fatigue
Tiered multipliers from the [Buffer 2.1M study](https://buffer.com/resources/how-often-to-post-on-instagram/): 2 posts/day = 1.0×, 3 = 0.75×, 4 = 0.50×, 5+ = 0.25×. Exceeding the weekly cap of 7 posts applies a 0.75× multiplier.
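
A minimal sketch of the tiers above as lookup functions. The function names are hypothetical, and three details are assumptions: one post per day is treated like two (1.0×), the weekly penalty triggers only above 7 posts, and the daily and weekly multipliers combine by multiplication.

```python
def daily_fatigue_multiplier(posts_today):
    """Per-day tier: up to 2 posts = 1.0, 3 = 0.75, 4 = 0.50, 5+ = 0.25."""
    if posts_today <= 2:
        return 1.0
    if posts_today == 3:
        return 0.75
    if posts_today == 4:
        return 0.50
    return 0.25


def weekly_fatigue_multiplier(posts_this_week):
    """Assumption: the 0.75 penalty applies only above 7 posts per week."""
    return 0.75 if posts_this_week > 7 else 1.0


def fatigue_multiplier(posts_today, posts_this_week):
    """Assumption: daily and weekly effects multiply."""
    return daily_fatigue_multiplier(posts_today) * weekly_fatigue_multiplier(posts_this_week)
```
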
## Tasks and graders (30 steps each)
| Task | Difficulty | Grader focus |
|------|-----------|--------------|
| `monthly_engage` | Easy | Total engagement vs theoretical max; burnout penalty |
| `monthly_strategic` | Medium | + tag discovery/exploitation + energy + consistency |
| `monthly_competitive` | Hard | + growth vs competitors + differentiation + content diversity |
## Regulator/Judge Mode (per-day audit)
Every day the env emits a deterministic, explainable `JudgeReport` on the observation:
```python
JudgeReport(
    policy_compliance=1.00,    # 1.0 - sum(weighted_violations); see _compute_judge_report
    sustainability_risk=0.10,  # 0.4*(1-energy_min) + 0.3*sleep_debt + 0.3*low_energy_ratio
    strategic_quality=0.96,    # 0.4*engagement_per_post + 0.3*intent_diversity + 0.3*format_diversity
    explanation="compliance=1.00 risk=0.10 strategy=0.96 | no policy violations",
    violations=[],             # human-readable rule breaks (Buffer 2.1M, Van Dongen, Cen 2024)
)
```
Auditable rules (all sourced): >5 posts/day → fatigue cliff (Buffer 2.1M); >7 posts/week → weekly cap; ≥4 collabs/month → diminishing returns (Cen 2024); >22h awake → sleep debt (Van Dongen 2003).
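
The three score formulas quoted in the `JudgeReport` comments can be written out as plain functions. This is a sketch, not the body of `_compute_judge_report`: the function names are hypothetical, all inputs are assumed to be normalized to [0, 1], and clamping compliance at 0.0 is an added assumption.

```python
def policy_compliance(weighted_violations):
    """1.0 minus the summed violation weights.

    Assumption: the result is clamped so it cannot go below 0.0.
    """
    return max(0.0, 1.0 - sum(weighted_violations))


def sustainability_risk(energy_min, sleep_debt, low_energy_ratio):
    """0.4*(1-energy_min) + 0.3*sleep_debt + 0.3*low_energy_ratio."""
    return 0.4 * (1.0 - energy_min) + 0.3 * sleep_debt + 0.3 * low_energy_ratio


def strategic_quality(engagement_per_post, intent_diversity, format_diversity):
    """0.4*engagement_per_post + 0.3*intent_diversity + 0.3*format_diversity."""
    return (
        0.4 * engagement_per_post
        + 0.3 * intent_diversity
        + 0.3 * format_diversity
    )
```

For instance, `sustainability_risk(0.75, 0.0, 0.0)` gives about 0.10 and `strategic_quality(0.9, 1.0, 1.0)` gives about 0.96, which is one way (among many) to arrive at the values in the sample report.
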
## Headline metrics (final-step audit)
The final observation carries `HeadlineMetrics` with the four numbers judges remember:
| Metric | What it measures | Source of truth |
|---|---|---|
| `vs_baseline_pct` | (agent_score − heuristic_baseline) / heuristic_baseline | Empirical baseline loaded from `plots/training_summary.json["smart_heuristic"]` (0.43 / 0.77 / 0.81) |
| `score_per_tool_call` | grader_score / total_tool_calls | Efficiency: did the agent learn to call tools sparingly? |
| `score_per_1k_chars` | grader_score per 1k action JSON chars | Token-proxy efficiency |
| `retention_under_shift` | shifted_score / baseline_score | Pass `episode_chain_id` with `shift_label="baseline"` to one `reset`, then with `shift_label="shifted"` to a second `reset`. `None` until both runs complete. |
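
The formulas in the table can be sketched as small helpers. The function names mirror the metric names, but the signatures and the `None` handling for zero or missing denominators are assumptions; only `retention_under_shift` is documented as returning `None` before both runs complete.

```python
def vs_baseline_pct(agent_score, heuristic_baseline):
    """(agent_score - heuristic_baseline) / heuristic_baseline, as a ratio."""
    return (agent_score - heuristic_baseline) / heuristic_baseline


def score_per_tool_call(grader_score, total_tool_calls):
    """Assumption: undefined (None) when no tools were called."""
    if total_tool_calls == 0:
        return None
    return grader_score / total_tool_calls


def score_per_1k_chars(grader_score, action_json_chars):
    """Assumption: undefined (None) when no action JSON was emitted."""
    if action_json_chars == 0:
        return None
    return grader_score / (action_json_chars / 1000)


def retention_under_shift(shifted_score, baseline_score):
    """None until both the baseline and the shifted run have a score."""
    if shifted_score is None or baseline_score is None:
        return None
    return shifted_score / baseline_score
```

As a worked example, an agent scoring 0.86 against a heuristic baseline of 0.43 yields a `vs_baseline_pct` of about 1.0, meaning double the baseline.
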
## Tool catalog
| Tool | Cost | Returns |
|------|------|---------|
| `query_trends` | 1 | Trending topics, tags, niche saturation |
| `query_competitor` | 2 | Recent posts, avg engagement, strategy |
| `query_tag_history` | 1 | Your historical signals per tag |
| `query_audience` | 2 | Segment affinities, active hours |
| `predict_engagement` | 3 | Simulated signals without committing |
| `draft_review` | 3 | Strengths/weaknesses of a plan |
| `query_creator_pool` | 1 | Available collab partners + overlap |
| `propose_collab` | 5 | Propose collaboration (max 2/month) |
API budget starts at 100 per episode.
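
An agent may want to track its own spend against the budget before issuing calls. The sketch below is a client-side helper only: the costs are copied from the table, while the `ToolBudget` class and its methods are hypothetical, and how the server reacts to an exhausted budget is not specified here.

```python
# Costs copied from the tool catalog above.
TOOL_COSTS = {
    "query_trends": 1,
    "query_competitor": 2,
    "query_tag_history": 1,
    "query_audience": 2,
    "predict_engagement": 3,
    "draft_review": 3,
    "query_creator_pool": 1,
    "propose_collab": 5,
}


class ToolBudget:
    """Hypothetical client-side tracker for the per-episode API budget."""

    def __init__(self, total=100):
        self.remaining = total

    def charge(self, tool):
        """Deduct the tool's cost; return False (and deduct nothing) if unaffordable."""
        cost = TOOL_COSTS[tool]
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True
```

With the default budget of 100, calling `predict_engagement` on each of the 30 days would already consume 90 points, which is why the efficiency metrics reward sparing tool use.
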
## Sources & verifiability
Every constant is backed by a Tier 1–3 source. Full bibliography with DOIs, PMIDs, and methodology extracts: **[RESEARCH.md](RESEARCH.md)**.
| Tier | Count | Example |
|------|-------|---------|
| T1 (Peer-reviewed) | 7 papers | Van Dongen 2003, arxiv:2410.13108 |
| T2 (Industry, large-N) | 9 studies | Buffer 9.6M, Sprout 2B, Rival IQ 1.9M |
| T3 (Official) | 1 statement | Mosseri Jan-2025 |
| T4 (Survey) | 2 surveys | Awin 2024 (n=300+) |
| T5 (Rejected) | 13 sites | No methodology disclosed |
## Storytelling assets
- [Full blog — story, science, results](blog/blog.md)
- [HuggingFace mini-blog](blog/hf_mini_blog.md)
- [YouTube script (<2 min)](blog/youtube_script.md)
- [Slide deck outline](blog/slide_outline.md)
## Local development
```bash
git clone <repo-url> && cd viraltest
uv sync
# Terminal 1 — API server
uvicorn viraltest.server.app:app --host 0.0.0.0 --port 8000
# Terminal 2 — inference
export HF_TOKEN=hf_...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
.venv/bin/python inference.py
```
## Docker
```bash
docker build -t viraltest-env:latest .
docker run --rm -p 8000:8000 viraltest-env:latest
curl -s -X POST -H "Content-Type: application/json" -d '{}' http://localhost:8000/reset
```
## Project structure
```
.
├── inference.py                  # Tool-discovery agent (no hint keys)
├── openenv.yaml                  # OpenEnv manifest
├── models.py                     # Action/Observation + ToolCall, EngagementSignals
├── client.py                     # ViraltestEnv client (async)
├── Dockerfile
├── RESEARCH.md                   # Full sourced bibliography (6+ pages)
├── DESIGN.md                     # Deep design notes
├── blog/
│   ├── hf_mini_blog.md
│   ├── youtube_script.md
│   └── slide_outline.md
├── server/
│   ├── app.py                    # FastAPI + /tools endpoints
│   ├── viraltest_environment.py
│   ├── dashboard.html
│   └── data/
│       ├── tags.json             # ~120 tags, 4 tiers
│       ├── topics.json           # Niche multipliers + seasonal calendar
│       ├── competitors.json      # 7 archetypes
│       ├── hour_heatmap.json     # 7×24 from Buffer+Sprout
│       ├── audience_segments.json
│       └── audience_overlap_matrix.json
├── training/
│   └── train_grpo.ipynb          # TRL GRPO on Qwen2.5-1.5B-Instruct
└── plots/
    ├── reward_curve.png
    └── before_after.png
```
## License
See `LICENSE` in the repository root (BSD-style per upstream OpenEnv examples).