File size: 8,749 Bytes
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
fc3950d
034a807
 
 
 
fc3950d
034a807
fc3950d
034a807
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
# We Trained an LLM to Survive Instagram

### Why we built Creator Copilot, an OpenEnv where the agent learns by living a creator's life — not by reading about it.

---

## The scene we couldn't shake

A creator wakes up at 7:42 AM. Yesterday's reel did 12% of what last week's did. Nobody at the platform will tell her why. There is a heatmap somewhere, a ranking change last Tuesday, an audience segment that quietly shifted, a "trending" tag that peaked six hours ago. She doesn't have access to any of it. So she does the only thing she can do: she posts more. Eventually 73% of creators in her cohort report burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)).

The creator economy is a $250B industry running on guesswork ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)). 67 million people are running businesses inside a black box, against an algorithm that nobody outside Meta fully understands, while their own bodies push back at 16 hours of wakefulness ([Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469)).

That is a *real* world model problem. And we couldn't find a single RL environment that took it seriously.

## Creator Copilot in one sentence

**An OpenEnv environment where an LLM agent runs an Instagram creator account for 7 simulated days, gets almost nothing for free, and has to discover the rules of the world through 8 tool calls and a notebook.**

It is the smallest version we could build of "operate a real account in a real economy."

## The bet: discovery, not instruction

Most agent environments hand the model a verbose observation and ask it to pick from 4 actions. Creator Copilot does the opposite. The default observation is *deliberately sparse* — just `energy`, `followers`, `last reward`. Everything interesting (trending topics, competitor cadence, audience segments, hour-by-hour engagement, your own past tag performance) is hidden behind tools the agent has to *discover* by hitting `GET /tools`.

This is the move that makes the environment a world-modeling environment instead of a recommendation problem:

- The agent has to **plan inquiry**: queries are the only way to reduce uncertainty, so it has to choose which questions are worth asking.
- The agent has to **carry beliefs forward**: a `notes` scratchpad persists across all 7 days. If the agent doesn't write down "Tuesdays at 12pm worked," it has no memory.
- The agent has to **test before committing**: `predict_engagement` lets it simulate a plan; `coach_feedback` shows the *counterfactual delta* between its plan and a heatmap-optimal plan. That second signal is the secret sauce — it teaches causality, not just outcomes.
- The agent has to **stay alive**: `creator_energy` decays with posting and recovers with rest, calibrated to a real sleep-deprivation paper. Burn out and the episode ends early.

The model doesn't get a tutorial. It gets a phone, a calendar, a sleep cycle, and a question: *can you grow this account without breaking the human?*

## The moat: every number is auditable

We were tired of RL environments where the rewards are vibes. So we drew a hard line: **every constant in Creator Copilot is backed by a Tier 1–3 source.** We even wrote a source-quality rubric and explicitly *rejected* 13 SEO/affiliate blogs that didn't meet it.

| What it controls | What it's based on |
|---|---|
| Engagement decomposition (watch_time, sends, saves, likes) | [Adam Mosseri, Head of Instagram, Jan 2025 statement](https://about.fb.com/news/) |
| 7×24 hour-of-day heatmap | [Buffer 9.6M post study](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/) |
| Sleep-driven cognitive decay | [Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469) |
| Tiered audience fatigue from over-posting | [Buffer 2.1M post frequency study](https://buffer.com/resources/how-often-to-post-on-instagram/) |
| Algorithmic disengagement model | [Cen et al., 2024 — arXiv:2410.13108](https://arxiv.org/abs/2410.13108) |
| Engagement vs. utility split | [Aouali et al., 2024 — arXiv:2406.01611](https://arxiv.org/abs/2406.01611) |

If a judge wants to challenge a single number, they can open `RESEARCH.md`, find the DOI/PMID/arXiv ID, and read the methodology. We *want* that fight.

That auditability is also why we believe a researcher could write a paper on top of this environment — not "an LLM played a game," but "an LLM learned a strategy that survives a known sleep-deprivation curve."

## What the agent gets graded on

We didn't want a single-number reward we could game. So the environment ships a **JudgeReport every day** — a deterministic, source-cited audit of three things:

- `policy_compliance` — did the agent break sourced sustainability rules? (e.g. >5 posts/day from Buffer 2.1M, weekly collab cap from Cen 2024, >22h awake from Van Dongen 2003)
- `sustainability_risk` — energy floor, sleep debt, and low-energy ratio over the day
- `strategic_quality` — engagement-per-post × intent diversity × format diversity

Plus three task graders calibrated to a *smart heuristic* baseline (`weekly_engage`, `weekly_strategic`, `weekly_competitive`). The agent isn't competing against zero — it's competing against a known-good rule-based player.

This composability is the OpenEnv Rubric idea taken seriously: separable, auditable signals that a researcher can swap in and out, not a monolithic black-box reward.

## Did the agent actually learn?

Yes — and we're being honest about where.

We trained Qwen2.5-3B-Instruct (Q4 quantized, running on a local M4 Mac via Ollama, no T4 needed) over 4 rounds, 6 episodes each, with temperature annealing from 1.4 → 0.7. Reward = per-step environment reward + 2× terminal grader score.

| Task | Untrained | Trained | Δ |
|---|---|---|---|
| `weekly_engage` | 0.355 | **0.409** | **+5.4%** |
| `weekly_competitive` | 0.374 | **0.510** | **+13.6%** |
| `weekly_strategic` | 0.680 | 0.627 | −5.2% |

The wins are largest on the *hardest* task — `weekly_competitive` — which is where the world model bites: the agent has to query competitors, differentiate its content, and time its posts. Exactly where we'd expect tool discovery to matter.

The strategic task regression is real and we're not hiding it: the model started doing too much exploration on a task where exploitation matters more, and our 4-round budget wasn't long enough to anneal that out. Honest result on a small training run.

What we *can* show qualitatively: the trained agent calls `GET /tools` on day 1, queries trends and competitors before posting, drops `predict_engagement` calls on the days it has a clear plan, and keeps `creator_energy` above 0.5 through the week. The untrained baseline posts blindly for the first few days and burns out.

Plots and the full per-episode log live in `plots/training_summary.json` and `plots/training_log.csv`.

## Why this is the submission to remember

There were going to be a lot of grid-worlds at this hackathon. A lot of toy puzzles. A lot of "we trained on a math benchmark."

Creator Copilot is something different. It's the smallest possible environment that tests whether an LLM can be an **operator** — discover an unknown world, plan inquiry under a budget, hold beliefs across time, weigh strategy against the operator's own physical constraints, and beat a smart human-style baseline on it.

That's not just an Instagram problem. That's the shape of every interesting LLM deployment in the next two years: customer success agents, ad ops, account managers, founders' assistants, ops engineers. Operator-class agents will live or die on whether they can do this loop. We don't have a benchmark for it yet. So we built one.

If you train an LLM on Creator Copilot and it gets better, you've taught it something it could not previously do — and you can prove it, line by line, against the literature.

That's the bet. That's why we built it.

---

**Try it:** the environment is on Hugging Face Spaces, the training notebook is in `training/`, the per-day audit logs are in `server/simulation_history.json`, and every numeric constant has a citation in [`RESEARCH.md`](../RESEARCH.md).

We don't think we've solved the creator economy. We think we've built the first environment honest enough to fail against it. Come argue with our numbers.