---
title: Viraltest Creator Optimization Agent
emoji: 📊
colorFrom: yellow
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
---
# Viraltest v2 — World-Modeling RL Environment for Instagram Strategy
> **Theme #3.1 — Professional Tasks (World Modeling)**
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment where an LLM agent manages an Instagram creator account over 30 simulated days, discovering the world through tools rather than being told the rules.
## What this teaches the LLM
| Capability | How the environment tests it |
|---|---|
| **Tool discovery & orchestration** | 8 discoverable tools (`query_trends`, `query_competitor`, `predict_engagement`...). Agent must call `GET /tools` to learn what's available. |
| **Persistent world model** | 30-day horizon. Multi-episode brand chain carries state across months. |
| **Belief tracking** | `notes` field persists hypotheses day-to-day. Agent must update beliefs from tool results. |
| **Causal reasoning** | `coach_feedback` returns counterfactual delta (your plan vs. heatmap-optimal). `predict_engagement` lets agent test hypotheses before committing. |
| **Partial observability** | Default observation is sparse: energy, followers, reward. Rich data (trends, competitors, tags) only via tools. |
| **Multi-step workflow** | Per day: discover → query → draft → predict → commit → reply → learn from feedback. |
## Why this matters
The $250B creator economy ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)) has 67M creators, but 73% experience burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)). This environment turns the posting-vs-burnout tradeoff into a reproducible simulation calibrated against 10+ verifiable sources.
## Quick Start
```python
import asyncio
from viraltest import ViraltestAction, ViraltestEnv
from viraltest.models import ToolCall
async def main():
    env = ViraltestEnv(base_url="http://localhost:8000")
    try:
        result = await env.reset(task="monthly_strategic")
        action = ViraltestAction(
            tool_calls=[
                ToolCall(name="query_trends", arguments={"niche": "tech"}),
            ],
            scheduled_actions=[
                {"hour": 12, "action_type": "post", "content_type": "reel",
                 "topic": "AI tools", "tags": ["ai", "coding"], "intent": "watch_bait"},
            ],
            notes="Day 1: querying trends to establish baseline.",
        )
        result = await env.step(action)
        print(result.observation.engagement_signals)
    finally:
        await env.close()

asyncio.run(main())
```
## Simulation mechanics
### Engagement signals (Mosseri Jan-2025)
Instagram's head confirmed the top-3 ranking signals. Our reward decomposes engagement accordingly:
| Signal | Weight | Best format | Source |
|--------|--------|-------------|--------|
| Watch time | 0.40 | Reels | Mosseri Jan-2025 |
| Sends per reach | 0.30 | Stories | Mosseri Jan-2025 |
| Saves | 0.20 | Carousels | Mosseri Jan-2025 |
| Likes per reach | 0.10 | Text posts | Mosseri Jan-2025 |
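The weights in the table can be read as a weighted sum. The sketch below assumes each signal has already been normalized to the range 0 to 1; the function name `weighted_engagement` and the dictionary keys are illustrative and may not match the environment's internal names.

```python
# Weights taken from the table above; key names are hypothetical.
SIGNAL_WEIGHTS = {
    "watch_time": 0.40,
    "sends_per_reach": 0.30,
    "saves": 0.20,
    "likes_per_reach": 0.10,
}


def weighted_engagement(signals: dict) -> float:
    """Weighted sum of normalized signals; a missing signal counts as 0."""
    return sum(
        weight * signals.get(name, 0.0)
        for name, weight in SIGNAL_WEIGHTS.items()
    )


# Made-up signal values for illustration only.
example = {
    "watch_time": 0.5,
    "sends_per_reach": 0.2,
    "saves": 0.1,
    "likes_per_reach": 0.4,
}
print(round(weighted_engagement(example), 3))
```

Because the weights sum to 1.0, the result stays in the same 0 to 1 range as the inputs.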
### Hour heatmap
7×24 multiplier grid from [Buffer 9.6M posts](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/).
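As an illustration of how a 7×24 multiplier grid can be queried, the sketch below ranks the hours of one day by multiplier. The `grid[day][hour]` layout, the day indexing, and the helper name `best_hours` are assumptions about `hour_heatmap.json`, and the values are placeholders rather than the Buffer or Sprout Social numbers.

```python
def best_hours(grid, day: int, k: int = 3) -> list:
    """Return the k hours (0-23) with the highest multiplier for a given day."""
    row = grid[day]
    return sorted(range(24), key=lambda hour: row[hour], reverse=True)[:k]


# Placeholder grid: flat 1.0 everywhere with two invented peaks.
grid = [[1.0] * 24 for _ in range(7)]
grid[2][12] = 1.5  # hypothetical midday peak on day index 2
grid[2][18] = 1.3  # hypothetical evening peak on day index 2

print(best_hours(grid, day=2, k=2))
```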
### Sleep model
Piecewise-linear from [Van Dongen et al. 2003](https://pubmed.ncbi.nlm.nih.gov/12683469) (*Sleep*, PMID 12683469): no quality loss below 16h awake, then 6.25% per hour, floor at 30%.
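The piecewise-linear rule above can be sketched as follows. This reading assumes the 6.25% is subtracted per hour from a baseline of 1.0 (percentage points, not compounding); the function name `sleep_quality` is illustrative.

```python
def sleep_quality(hours_awake: float) -> float:
    """Quality multiplier: 1.0 up to 16h awake, then -0.0625/h, floored at 0.30."""
    if hours_awake <= 16.0:
        return 1.0
    return max(0.30, 1.0 - 0.0625 * (hours_awake - 16.0))


for hours in (8, 16, 20, 24, 30):
    print(hours, sleep_quality(hours))
```

Under this reading the multiplier is 0.75 after 20 hours awake, 0.50 after 24 hours, and reaches the 0.30 floor a little past 27 hours.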
### Audience fatigue
Tiered from [Buffer 2.1M study](https://buffer.com/resources/how-often-to-post-on-instagram/): 2 posts/day=1.0×, 3=0.75×, 4=0.50×, 5+=0.25×. Weekly cap at 7 posts → 0.75×.
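A minimal sketch of the tiers above. Two points are assumptions: that the weekly 0.75× applies once more than 7 posts have been made in the week, and that the daily and weekly factors multiply. The function names are illustrative.

```python
def daily_fatigue(posts_today: int) -> float:
    """Daily tier from the list above: up to 2 posts is free, 5 or more hits 0.25."""
    if posts_today <= 2:
        return 1.0
    if posts_today == 3:
        return 0.75
    if posts_today == 4:
        return 0.50
    return 0.25


def fatigue_multiplier(posts_today: int, posts_this_week: int) -> float:
    """Combine daily and weekly factors (multiplicative combination is assumed)."""
    weekly = 0.75 if posts_this_week > 7 else 1.0
    return daily_fatigue(posts_today) * weekly


print(fatigue_multiplier(posts_today=3, posts_this_week=8))
```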
## Tasks and graders (30 steps each)
| Task | Difficulty | Grader focus |
|------|-----------|--------------|
| `monthly_engage` | Easier | Total engagement vs theoretical max; burnout penalty |
| `monthly_strategic` | Medium | + tag discovery/exploitation + energy + consistency |
| `monthly_competitive` | Hard | + growth vs competitors + differentiation + content diversity |
## Regulator/Judge Mode (per-day audit)
Every day the env emits a deterministic, explainable `JudgeReport` on the observation:
```python
JudgeReport(
    policy_compliance=1.00,    # 1.0 - sum(weighted_violations); see _compute_judge_report
    sustainability_risk=0.10,  # 0.4*(1-energy_min) + 0.3*sleep_debt + 0.3*low_energy_ratio
    strategic_quality=0.96,    # 0.4*engagement_per_post + 0.3*intent_diversity + 0.3*format_diversity
    explanation="compliance=1.00 risk=0.10 strategy=0.96 | no policy violations",
    violations=[],             # human-readable rule breaks (Buffer 2.1M, Van Dongen, Cen 2024)
)
```
Auditable rules (all sourced): >5 posts/day → fatigue cliff (Buffer 2.1M); >7 posts/week → weekly cap; ≥4 collabs/month → diminishing returns (Cen 2024); >22h awake → sleep debt (Van Dongen 2003).
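The two weighted formulas in the `JudgeReport` comments can be reproduced in a few lines. This is a sketch, not the body of `_compute_judge_report`: it assumes every input is already scaled to the range 0 to 1, and the clamping step is an added safeguard rather than documented behavior.

```python
def clamp01(value: float) -> float:
    """Keep a score inside [0, 1] (assumed; not confirmed by the source)."""
    return max(0.0, min(1.0, value))


def sustainability_risk(energy_min: float, sleep_debt: float,
                        low_energy_ratio: float) -> float:
    """0.4*(1-energy_min) + 0.3*sleep_debt + 0.3*low_energy_ratio."""
    return clamp01(0.4 * (1.0 - energy_min)
                   + 0.3 * sleep_debt
                   + 0.3 * low_energy_ratio)


def strategic_quality(engagement_per_post: float, intent_diversity: float,
                      format_diversity: float) -> float:
    """0.4*engagement_per_post + 0.3*intent_diversity + 0.3*format_diversity."""
    return clamp01(0.4 * engagement_per_post
                   + 0.3 * intent_diversity
                   + 0.3 * format_diversity)


# Minimum energy of 0.75 with no sleep debt and no low-energy days.
print(round(sustainability_risk(0.75, 0.0, 0.0), 2))
```

With a minimum energy of 0.75 and no sleep debt or low-energy days, the risk works out to 0.10, the same value shown in the example report.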
## Headline metrics (final-step audit)
The final observation carries `HeadlineMetrics` with the numbers judges remember:
| Metric | What it measures | Source of truth |
|---|---|---|
| `vs_baseline_pct` | (agent_score − heuristic_baseline) / heuristic_baseline | Empirical baseline loaded from `plots/training_summary.json["smart_heuristic"]` (0.43 / 0.77 / 0.81) |
| `score_per_tool_call` | grader_score / total_tool_calls | Efficiency: did the agent learn to call tools sparingly? |
| `score_per_1k_chars` | grader_score per 1k action JSON chars | Token-proxy efficiency |
| `retention_under_shift` | shifted_score / baseline_score | Pass `episode_chain_id` with `shift_label="baseline"` on one `reset`, then with `shift_label="shifted"` on a second `reset`. `None` until both runs complete. |
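The ratios in the table are simple to compute by hand. The sketch below follows the formulas as written; the function signatures and the choice to return `None` when a denominator is zero or a run is missing are assumptions, not the environment's confirmed behavior.

```python
from typing import Optional


def vs_baseline_pct(agent_score: float, heuristic_baseline: float) -> float:
    """(agent_score - heuristic_baseline) / heuristic_baseline, as in the table."""
    return (agent_score - heuristic_baseline) / heuristic_baseline


def score_per_tool_call(grader_score: float,
                        total_tool_calls: int) -> Optional[float]:
    """Score per tool call; None when no tools were called (assumed handling)."""
    if total_tool_calls == 0:
        return None
    return grader_score / total_tool_calls


def score_per_1k_chars(grader_score: float,
                       action_json_chars: int) -> Optional[float]:
    """Score per 1,000 characters of action JSON."""
    if action_json_chars == 0:
        return None
    return grader_score / (action_json_chars / 1000.0)


def retention_under_shift(shifted_score: Optional[float],
                          baseline_score: Optional[float]) -> Optional[float]:
    """shifted_score / baseline_score; None until both runs exist."""
    if shifted_score is None or not baseline_score:
        return None
    return shifted_score / baseline_score


# Hypothetical agent score of 0.86 against the 0.43 heuristic baseline.
print(vs_baseline_pct(0.86, 0.43))
```

As written in the table, `vs_baseline_pct` is a fraction (1.0 means double the baseline); whether the environment scales it by 100 is not stated here.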
## Tool catalog
| Tool | Cost | Returns |
|------|------|---------|
| `query_trends` | 1 | Trending topics, tags, niche saturation |
| `query_competitor` | 2 | Recent posts, avg engagement, strategy |
| `query_tag_history` | 1 | Your historical signals per tag |
| `query_audience` | 2 | Segment affinities, active hours |
| `predict_engagement` | 3 | Simulated signals without committing |
| `draft_review` | 3 | Strengths/weaknesses of a plan |
| `query_creator_pool` | 1 | Available collab partners + overlap |
| `propose_collab` | 5 | Propose collaboration (max 2/month) |
API budget starts at 100 per episode.
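A quick way to reason about the budget is to price a daily tool routine against the 100-unit allowance. The sketch below uses the costs from the table and assumes the budget is not replenished during the 30-step episode; the helper names are illustrative.

```python
# Costs copied from the tool catalog above.
TOOL_COSTS = {
    "query_trends": 1,
    "query_competitor": 2,
    "query_tag_history": 1,
    "query_audience": 2,
    "predict_engagement": 3,
    "draft_review": 3,
    "query_creator_pool": 1,
    "propose_collab": 5,
}
EPISODE_BUDGET = 100
EPISODE_DAYS = 30


def plan_cost(tool_names) -> int:
    """Total budget cost of a list of tool calls."""
    return sum(TOOL_COSTS[name] for name in tool_names)


def affordable_days(daily_tools, budget: int = EPISODE_BUDGET) -> int:
    """How many days the same daily routine can run before the budget is spent."""
    cost = plan_cost(daily_tools)
    if cost == 0:
        return EPISODE_DAYS
    return min(EPISODE_DAYS, budget // cost)


routine = ["query_trends", "predict_engagement"]
print(plan_cost(routine), affordable_days(routine))
```

A routine of `query_trends` plus `predict_engagement` costs 4 per day and lasts 25 of the 30 days, so an agent that wants to use tools every day has to average a little over 3 units per step.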
## Sources & verifiability
Every constant is backed by a Tier 1–3 source. Full bibliography with DOIs, PMIDs, and methodology extracts: **[RESEARCH.md](RESEARCH.md)**.
| Tier | Count | Example |
|------|-------|---------|
| T1 (Peer-reviewed) | 7 papers | Van Dongen 2003, arxiv:2410.13108 |
| T2 (Industry, large-N) | 9 studies | Buffer 9.6M, Sprout 2B, Rival IQ 1.9M |
| T3 (Official) | 1 statement | Mosseri Jan-2025 |
| T4 (Survey) | 2 surveys | Awin 2024 (n=300+) |
| T5 (Rejected) | 13 sites | No methodology disclosed |
## Storytelling assets
- [Full blog — story, science, results](blog/blog.md)
- [HuggingFace mini-blog](blog/hf_mini_blog.md)
- [YouTube script (<2 min)](blog/youtube_script.md)
- [Slide deck outline](blog/slide_outline.md)
## Local development
```bash
git clone <repo-url> && cd viraltest
uv sync
# Terminal 1 — API server
uvicorn viraltest.server.app:app --host 0.0.0.0 --port 8000
# Terminal 2 — inference
export HF_TOKEN=hf_...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
.venv/bin/python inference.py
```
## Docker
```bash
docker build -t viraltest-env:latest .
docker run --rm -p 8000:8000 viraltest-env:latest
curl -s -X POST -H "Content-Type: application/json" -d '{}' http://localhost:8000/reset
```
## Project structure
```
.
├── inference.py                      # Tool-discovery agent (no hint keys)
├── openenv.yaml                      # OpenEnv manifest
├── models.py                         # Action/Observation + ToolCall, EngagementSignals
├── client.py                         # ViraltestEnv client (async)
├── Dockerfile
├── RESEARCH.md                       # Full sourced bibliography (6+ pages)
├── DESIGN.md                         # Deep design notes
├── blog/
│   ├── hf_mini_blog.md
│   ├── youtube_script.md
│   └── slide_outline.md
├── server/
│   ├── app.py                        # FastAPI + /tools endpoints
│   ├── viraltest_environment.py
│   ├── dashboard.html
│   └── data/
│       ├── tags.json                 # ~120 tags, 4 tiers
│       ├── topics.json               # Niche multipliers + seasonal calendar
│       ├── competitors.json          # 7 archetypes
│       ├── hour_heatmap.json         # 7×24 from Buffer+Sprout
│       ├── audience_segments.json
│       └── audience_overlap_matrix.json
├── training/
│   └── train_grpo.ipynb              # TRL GRPO on Qwen2.5-1.5B-Instruct
└── plots/
    ├── reward_curve.png
    └── before_after.png
```
## License
See `LICENSE` in the repository root (BSD-style per upstream OpenEnv examples).