---
title: Viraltest Creator Optimization Agent
emoji: 📊
colorFrom: yellow
colorTo: indigo
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Viraltest v2 — World-Modeling RL Environment for Instagram Strategy

> **Theme #3.1 — Professional Tasks (World Modeling)**
> An [OpenEnv](https://github.com/meta-pytorch/OpenEnv) environment where an LLM agent manages an Instagram creator account over 30 simulated days, discovering the world through tools rather than being told the rules.

## What this teaches the LLM

| Capability | How the environment tests it |
|---|---|
| **Tool discovery & orchestration** | 8 discoverable tools (`query_trends`, `query_competitor`, `predict_engagement`...). Agent must call `GET /tools` to learn what's available. |
| **Persistent world model** | 30-day horizon. Multi-episode brand chain carries state across months. |
| **Belief tracking** | `notes` field persists hypotheses day-to-day. Agent must update beliefs from tool results. |
| **Causal reasoning** | `coach_feedback` returns counterfactual delta (your plan vs. heatmap-optimal). `predict_engagement` lets agent test hypotheses before committing. |
| **Partial observability** | Default observation is sparse: energy, followers, reward. Rich data (trends, competitors, tags) only via tools. |
| **Multi-step workflow** | Per day: discover → query → draft → predict → commit → reply → learn from feedback. |
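
The discovery step can be sketched as a client-side parse of the `GET /tools` response. The payload shape used here (a `tools` list with `name` and `cost` fields) is an assumption for illustration; the real schema is defined in `server/app.py`:

```python
import json


def parse_tool_catalog(raw: str) -> dict[str, int]:
    """Map tool name -> API cost from a GET /tools JSON payload.

    Assumes a {"tools": [{"name": ..., "cost": ...}]} shape; see
    server/app.py for the actual response schema.
    """
    return {tool["name"]: tool["cost"] for tool in json.loads(raw)["tools"]}
```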

## Why this matters

The $250B creator economy ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)) has 67M creators, but 73% experience burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)). This environment turns the posting-vs-burnout tradeoff into a reproducible simulation calibrated against 10+ verifiable sources.

## Quick Start

```python
import asyncio
from viraltest import ViraltestAction, ViraltestEnv
from viraltest.models import ToolCall

async def main():
    env = ViraltestEnv(base_url="http://localhost:8000")
    try:
        result = await env.reset(task="monthly_strategic")
        action = ViraltestAction(
            tool_calls=[
                ToolCall(name="query_trends", arguments={"niche": "tech"}),
            ],
            scheduled_actions=[
                {"hour": 12, "action_type": "post", "content_type": "reel",
                 "topic": "AI tools", "tags": ["ai", "coding"], "intent": "watch_bait"},
            ],
            notes="Day 1: querying trends to establish baseline.",
        )
        result = await env.step(action)
        print(result.observation.engagement_signals)
    finally:
        await env.close()

asyncio.run(main())
```

## Simulation mechanics

### Engagement signals (Mosseri Jan-2025)

Instagram head Adam Mosseri confirmed the platform's top three ranking signals in January 2025. Our reward decomposes engagement accordingly:

| Signal | Weight | Best format | Source |
|--------|--------|-------------|--------|
| Watch time | 0.40 | Reels | Mosseri Jan-2025 |
| Sends per reach | 0.30 | Stories | Mosseri Jan-2025 |
| Saves | 0.20 | Carousels | Mosseri Jan-2025 |
| Likes per reach | 0.10 | Text posts | Mosseri Jan-2025 |
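
Under these weights, a combined score reduces to a weighted sum. The signal field names below are illustrative assumptions; the real fields are defined on `EngagementSignals` in `models.py`:

```python
# Weights from the table above (Mosseri Jan-2025).
WEIGHTS = {
    "watch_time": 0.40,
    "sends_per_reach": 0.30,
    "saves": 0.20,
    "likes_per_reach": 0.10,
}


def engagement_score(signals: dict[str, float]) -> float:
    """Collapse normalized per-signal values (0..1) into one weighted score."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())
```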

### Hour heatmap

7×24 multiplier grid from [Buffer 9.6M posts](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/).
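
A minimal lookup over that grid, assuming `hour_heatmap.json` deserializes to a 7×24 nested list indexed by weekday then hour (the real schema lives in `server/data/hour_heatmap.json`):

```python
def best_hour(heatmap: list[list[float]], weekday: int) -> int:
    """Return the hour (0-23) with the highest multiplier for a weekday (0-6)."""
    row = heatmap[weekday]
    return max(range(24), key=lambda hour: row[hour])
```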

### Sleep model

Piecewise-linear from [Van Dongen et al. 2003](https://pubmed.ncbi.nlm.nih.gov/12683469) (*Sleep*, PMID 12683469): no quality loss below 16h awake, then 6.25% per hour, floor at 30%.
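
The piecewise-linear curve reduces to a short clamp; a sketch of the stated numbers:

```python
def sleep_quality(hours_awake: float) -> float:
    """Content-quality multiplier vs. time awake (Van Dongen 2003 calibration).

    Full quality through 16h awake, then a 6.25% drop per extra hour,
    clamped at a 30% floor.
    """
    if hours_awake <= 16:
        return 1.0
    return max(0.30, 1.0 - 0.0625 * (hours_awake - 16))
```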

### Audience fatigue

Tiered from [Buffer 2.1M study](https://buffer.com/resources/how-often-to-post-on-instagram/): 2 posts/day=1.0×, 3=0.75×, 4=0.50×, 5+=0.25×. Weekly cap at 7 posts → 0.75×.
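
The tiers compose into a single multiplier. This sketch assumes the daily and weekly factors multiply; how the server actually combines them is defined in `server/viraltest_environment.py`:

```python
def fatigue_multiplier(posts_today: int, posts_this_week: int) -> float:
    """Tiered audience-fatigue multiplier (Buffer 2.1M calibration)."""
    if posts_today <= 2:
        daily = 1.0
    elif posts_today == 3:
        daily = 0.75
    elif posts_today == 4:
        daily = 0.50
    else:
        daily = 0.25  # 5+ posts/day
    weekly = 0.75 if posts_this_week > 7 else 1.0  # weekly cap at 7 posts
    return daily * weekly
```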

## Tasks and graders (30 steps each)

| Task | Difficulty | Grader focus |
|------|-----------|--------------|
| `monthly_engage` | Easier | Total engagement vs theoretical max; burnout penalty |
| `monthly_strategic` | Medium | + tag discovery/exploitation + energy + consistency |
| `monthly_competitive` | Hard | + growth vs competitors + differentiation + content diversity |

## Regulator/Judge Mode (per-day audit)

Every day the env emits a deterministic, explainable `JudgeReport` on the observation:

```python
JudgeReport(
    policy_compliance=1.00,    # 1.0 - sum(weighted_violations); see _compute_judge_report
    sustainability_risk=0.10,  # 0.4*(1-energy_min) + 0.3*sleep_debt + 0.3*low_energy_ratio
    strategic_quality=0.96,    # 0.4*engagement_per_post + 0.3*intent_diversity + 0.3*format_diversity
    explanation="compliance=1.00 risk=0.10 strategy=0.96 | no policy violations",
    violations=[],             # human-readable rule breaks (Buffer 2.1M, Van Dongen, Cen 2024)
)
```

Auditable rules (all sourced): >5 posts/day → fatigue cliff (Buffer 2.1M); >7 posts/week → weekly cap; ≥4 collabs/month → diminishing returns (Cen 2024); >22h awake → sleep debt (Van Dongen 2003).
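
These rules can be sketched as pure threshold checks (the real implementation lives in `_compute_judge_report`; the argument names here are illustrative):

```python
def audit_day(posts_today: int, posts_this_week: int,
              collabs_this_month: int, hours_awake: float) -> list[str]:
    """Return human-readable violations for the day's sourced audit rules."""
    violations = []
    if posts_today > 5:
        violations.append("fatigue cliff: >5 posts/day (Buffer 2.1M)")
    if posts_this_week > 7:
        violations.append("weekly cap: >7 posts/week (Buffer 2.1M)")
    if collabs_this_month >= 4:
        violations.append("diminishing returns: >=4 collabs/month (Cen 2024)")
    if hours_awake > 22:
        violations.append("sleep debt: >22h awake (Van Dongen 2003)")
    return violations
```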

## Headline metrics (final-step audit)

The final observation carries `HeadlineMetrics` with the three numbers judges remember:

| Metric | What it measures | Source of truth |
|---|---|---|
| `vs_baseline_pct` | (agent_score − heuristic_baseline) / heuristic_baseline | Empirical baseline loaded from `plots/training_summary.json["smart_heuristic"]` (0.43 / 0.77 / 0.81) |
| `score_per_tool_call` | grader_score / total_tool_calls | Efficiency: did the agent learn to call tools sparingly? |
| `score_per_1k_chars` | grader_score per 1k action JSON chars | Token-proxy efficiency |
| `retention_under_shift` | shifted_score / baseline_score | Pass `episode_chain_id` with `shift_label="baseline"` on one `reset` and `shift_label="shifted"` on a second; the metric stays `None` until both runs complete. |
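
For example, `vs_baseline_pct` is a plain relative delta. A hypothetical agent scoring 0.86 against a 0.43 baseline (one of the three values in `training_summary.json`; the task-to-baseline mapping is an assumption here) would report +100%:

```python
def vs_baseline_pct(agent_score: float, baseline_score: float) -> float:
    """(agent - baseline) / baseline, as a signed fraction."""
    return (agent_score - baseline_score) / baseline_score
```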

## Tool catalog

| Tool | Cost | Returns |
|------|------|---------|
| `query_trends` | 1 | Trending topics, tags, niche saturation |
| `query_competitor` | 2 | Recent posts, avg engagement, strategy |
| `query_tag_history` | 1 | Your historical signals per tag |
| `query_audience` | 2 | Segment affinities, active hours |
| `predict_engagement` | 3 | Simulated signals without committing |
| `draft_review` | 3 | Strengths/weaknesses of a plan |
| `query_creator_pool` | 1 | Available collab partners + overlap |
| `propose_collab` | 5 | Propose collaboration (max 2/month) |

Each tool call deducts its cost from an API budget that starts at 100 per episode.
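
A planned batch of calls can be pre-checked against the remaining budget; the costs mirror the table above (a client-side sketch, not part of the server API):

```python
TOOL_COSTS = {
    "query_trends": 1, "query_competitor": 2, "query_tag_history": 1,
    "query_audience": 2, "predict_engagement": 3, "draft_review": 3,
    "query_creator_pool": 1, "propose_collab": 5,
}


def affordable(budget_left: int, planned_calls: list[str]) -> bool:
    """True if the planned batch of tool calls fits the remaining API budget."""
    return sum(TOOL_COSTS[name] for name in planned_calls) <= budget_left
```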

## Sources & verifiability

Every constant is backed by a Tier 1–3 source. Full bibliography with DOIs, PMIDs, and methodology extracts: **[RESEARCH.md](RESEARCH.md)**.

| Tier | Count | Example |
|------|-------|---------|
| T1 (Peer-reviewed) | 7 papers | Van Dongen 2003, arxiv:2410.13108 |
| T2 (Industry, large-N) | 9 studies | Buffer 9.6M, Sprout 2B, Rival IQ 1.9M |
| T3 (Official) | 1 statement | Mosseri Jan-2025 |
| T4 (Survey) | 2 surveys | Awin 2024 (n=300+) |
| T5 (Rejected) | 13 sites | No methodology disclosed |

## Storytelling assets

- [Full blog — story, science, results](blog/blog.md)
- [HuggingFace mini-blog](blog/hf_mini_blog.md)
- [YouTube script (<2 min)](blog/youtube_script.md)
- [Slide deck outline](blog/slide_outline.md)

## Local development

```bash
git clone <repo-url> && cd viraltest
uv sync

# Terminal 1 — API server
uvicorn viraltest.server.app:app --host 0.0.0.0 --port 8000

# Terminal 2 — inference
export HF_TOKEN=hf_...
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
.venv/bin/python inference.py
```

## Docker

```bash
docker build -t viraltest-env:latest .
docker run --rm -p 8000:8000 viraltest-env:latest
curl -s -X POST -H "Content-Type: application/json" -d '{}' http://localhost:8000/reset
```

## Project structure

```
.
├── inference.py                # Tool-discovery agent (no hint keys)
├── openenv.yaml                # OpenEnv manifest
├── models.py                   # Action/Observation + ToolCall, EngagementSignals
├── client.py                   # ViraltestEnv client (async)
├── Dockerfile
├── RESEARCH.md                 # Full sourced bibliography (6+ pages)
├── DESIGN.md                   # Deep design notes
├── blog/
│   ├── hf_mini_blog.md
│   ├── youtube_script.md
│   └── slide_outline.md
├── server/
│   ├── app.py                  # FastAPI + /tools endpoints
│   ├── viraltest_environment.py
│   ├── dashboard.html
│   └── data/
│       ├── tags.json           # ~120 tags, 4 tiers
│       ├── topics.json         # Niche multipliers + seasonal calendar
│       ├── competitors.json    # 7 archetypes
│       ├── hour_heatmap.json   # 7×24 from Buffer+Sprout
│       ├── audience_segments.json
│       └── audience_overlap_matrix.json
├── training/
│   └── train_grpo.ipynb        # TRL GRPO on Qwen2.5-1.5B-Instruct
└── plots/
    ├── reward_curve.png
    └── before_after.png
```

## License

See `LICENSE` in the repository root (BSD-style per upstream OpenEnv examples).