# We Trained an LLM to Survive Instagram
Why we built Creator Copilot, an OpenEnv where the agent learns by living a creator's life — not by reading about it.
## The scene we couldn't shake
A creator wakes up at 7:42 AM. Yesterday's reel did 12% of what last week's did. Nobody at the platform will tell her why. There is a heatmap somewhere, a ranking change last Tuesday, an audience segment that quietly shifted, a "trending" tag that peaked six hours ago. She doesn't have access to any of it. So she does the only thing she can do: she posts more. Eventually 73% of creators in her cohort report burnout (Awin, 2024).
The creator economy is a $250B industry running on guesswork (Goldman Sachs, 2025). 67 million people are running businesses inside a black box, against an algorithm that nobody outside Meta fully understands, while their own bodies push back at 16 hours of wakefulness (Van Dongen et al., 2003, Sleep, PMID 12683469).
That is a real world-model problem. And we couldn't find a single RL environment that took it seriously.
## Creator Copilot in one sentence
An OpenEnv environment where an LLM agent runs an Instagram creator account for 7 simulated days, gets almost nothing for free, and has to discover the rules of the world through 8 tool calls and a notebook.
It is the smallest version we could build of "operate a real account in a real economy."
## The bet: discovery, not instruction
Most agent environments hand the model a verbose observation and ask it to pick from 4 actions. Creator Copilot does the opposite. The default observation is deliberately sparse: just energy, followers, last reward. Everything interesting (trending topics, competitor cadence, audience segments, hour-by-hour engagement, your own past tag performance) is hidden behind tools the agent has to discover by hitting `GET /tools`.
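To make that concrete, here is a minimal sketch of first contact with the environment. Only `GET /tools` and the three sparse observation fields are named in this post; the base URL, the `/observation` endpoint, and the response shapes are illustrative assumptions.

```python
import requests

BASE = "http://localhost:8000"  # hypothetical local deployment of the env server

# The default observation is deliberately sparse: energy, followers, last reward.
obs = requests.get(f"{BASE}/observation").json()
print(obs)  # e.g. {"energy": 1.0, "followers": 1200, "last_reward": 0.0}

# Everything else lives behind tools the agent has to discover for itself.
tools = requests.get(f"{BASE}/tools").json()
for tool in tools:
    # assumed response shape: a list of {"name": ..., "description": ...}
    print(tool["name"], "->", tool.get("description", ""))
```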
This is the move that makes the environment a world-modeling environment instead of a recommendation problem:
- The agent has to plan inquiry: queries are the only way to reduce uncertainty, so it has to choose which questions are worth asking.
- The agent has to carry beliefs forward: a `notes` scratchpad persists across all 7 days. If the agent doesn't write down "Tuesdays at 12pm worked," it has no memory.
- The agent has to test before committing: `predict_engagement` lets it simulate a plan; `coach_feedback` shows the counterfactual delta between its plan and a heatmap-optimal plan. That second signal is the secret sauce: it teaches causality, not just outcomes.
- The agent has to stay alive: `creator_energy` decays with posting and recovers with rest, calibrated to a real sleep-deprivation paper (see the sketch after this list). Burn out and the episode ends early.
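A minimal sketch of that energy dynamic, assuming a simple linear cost/recovery form. The environment's real constants are calibrated to Van Dongen et al. (2003); the round numbers below are placeholders.

```python
# Placeholder constants: the real values are calibrated to the sleep-
# deprivation literature cited in RESEARCH.md, not these round numbers.
POST_COST = 0.15       # energy spent per post (assumed)
REST_RECOVERY = 0.30   # energy regained per rest action (assumed)

def step_energy(energy: float, action: str) -> float:
    """Advance creator_energy one step; clamp to [0, 1]."""
    if action == "post":
        energy -= POST_COST
    elif action == "rest":
        energy += REST_RECOVERY
    return max(0.0, min(1.0, energy))

def burned_out(energy: float) -> bool:
    # Hitting the floor ends the episode early.
    return energy <= 0.0
```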
The model doesn't get a tutorial. It gets a phone, a calendar, a sleep cycle, and a question: can you grow this account without breaking the human?
## The moat: every number is auditable
We were tired of RL environments where the rewards are vibes. So we drew a hard line: every constant in Creator Copilot is backed by a Tier 1–3 source. We even wrote a source-quality rubric and explicitly rejected 13 SEO/affiliate blogs that didn't meet it.
| What it controls | What it's based on |
|---|---|
| Engagement decomposition (watch_time, sends, saves, likes) | Adam Mosseri, Head of Instagram, Jan 2025 statement |
| 7×24 hour-of-day heatmap | Buffer 9.6M post study cross-validated with Sprout Social 2B engagements |
| Sleep-driven cognitive decay | Van Dongen et al., 2003, Sleep, PMID 12683469 |
| Tiered audience fatigue from over-posting | Buffer 2.1M post frequency study |
| Algorithmic disengagement model | Cen et al., 2024 — arXiv:2410.13108 |
| Engagement vs. utility split | Aouali et al., 2024 — arXiv:2406.01611 |
If a judge wants to challenge a single number, they can open `RESEARCH.md`, find the DOI/PMID/arXiv ID, and read the methodology. We want that fight.
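The pattern is easy to sketch: pair every constant with the identifier that resolves it in `RESEARCH.md`, so the number and its provenance travel together. The class and constant names below are hypothetical; the two limits are the sourced ones this post cites.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourcedConstant:
    value: float
    source: str  # DOI / PMID / arXiv ID resolvable in RESEARCH.md

# Two of the sourced limits named in this post; names are illustrative.
MAX_AWAKE_HOURS = SourcedConstant(22.0, "Van Dongen et al. 2003, PMID 12683469")
MAX_POSTS_PER_DAY = SourcedConstant(5.0, "Buffer 2.1M-post frequency study")
```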
That auditability is also why we believe a researcher could write a paper on top of this environment — not "an LLM played a game," but "an LLM learned a strategy that survives a known sleep-deprivation curve."
## What the agent gets graded on
We didn't want a single-number reward we could game. So the environment ships a JudgeReport every day — a deterministic, source-cited audit of three things:
- `policy_compliance`: did the agent break sourced sustainability rules? (e.g. >5 posts/day from Buffer 2.1M, weekly collab cap from Cen 2024, >22h awake from Van Dongen 2003)
- `sustainability_risk`: energy floor, sleep debt, and low-energy ratio over the day
- `strategic_quality`: engagement-per-post × intent diversity × format diversity
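A sketch of that report's shape, assuming scalar scores. The field names follow the three signals above; the real schema may differ.

```python
from dataclasses import dataclass

@dataclass
class JudgeReport:
    """One deterministic, source-cited audit per simulated day (assumed shape)."""
    policy_compliance: float    # sourced sustainability rules respected?
    sustainability_risk: float  # energy floor, sleep debt, low-energy ratio
    strategic_quality: float    # engagement-per-post × intent × format diversity
```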
On top of the daily report, the environment ships three task graders calibrated to a smart heuristic baseline (`weekly_engage`, `weekly_strategic`, `weekly_competitive`). The agent isn't competing against zero; it's competing against a known-good rule-based player.
This composability is the OpenEnv Rubric idea taken seriously: separable, auditable signals that a researcher can swap in and out, not a monolithic black-box reward.
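In code, that composability amounts to graders being separable callables over an episode trace. A sketch with assumed signatures and trace keys; only the grader names come from this post.

```python
from typing import Callable, Dict

Grader = Callable[[dict], float]  # episode trace -> score in [0, 1]

def weekly_engage(trace: dict) -> float:
    # Placeholder body: engagement relative to the rule-based baseline.
    return min(1.0, trace["engagement"] / max(trace["baseline_engagement"], 1e-9))

# Researchers can swap signals in and out without touching the environment.
graders: Dict[str, Grader] = {"weekly_engage": weekly_engage}
graders["energy_floor_kept"] = lambda trace: float(trace["min_energy"] >= 0.5)
```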
## Did the agent actually learn?
Yes — and we're being honest about where.
We trained Qwen2.5-3B-Instruct (Q4 quantized, running on a local M4 Mac via Ollama, no T4 needed) over 4 rounds, 6 episodes each, with temperature annealing from 1.4 → 0.7. Reward = per-step environment reward + 2× terminal grader score.
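The two scalar choices that matter are easy to state in code. The annealing endpoints and the 2× terminal weight are from our run; the linear schedule shape is the one assumption below.

```python
ROUNDS = 4
T_START, T_END = 1.4, 0.7

def temperature(round_idx: int) -> float:
    """Anneal sampling temperature from 1.4 to 0.7 across 4 rounds (assumed linear)."""
    frac = round_idx / max(ROUNDS - 1, 1)
    return T_START + frac * (T_END - T_START)

def episode_return(step_rewards: list[float], terminal_grader_score: float) -> float:
    # Reward = per-step environment reward + 2x terminal grader score.
    return sum(step_rewards) + 2.0 * terminal_grader_score
```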
| Task | Untrained | Trained | Δ (points) |
|---|---|---|---|
| `weekly_engage` | 0.355 | 0.409 | +5.4 |
| `weekly_competitive` | 0.374 | 0.510 | +13.6 |
| `weekly_strategic` | 0.680 | 0.627 | −5.2 |
The wins are largest on the hardest task — weekly_competitive — which is where the world model bites: the agent has to query competitors, differentiate its content, and time its posts. Exactly where we'd expect tool discovery to matter.
The strategic task regression is real and we're not hiding it: the model started doing too much exploration on a task where exploitation matters more, and our 4-round budget wasn't long enough to anneal that out. Honest result on a small training run.
What we can show qualitatively: the trained agent calls `GET /tools` on day 1, queries trends and competitors before posting, drops `predict_engagement` calls on the days it has a clear plan, and keeps `creator_energy` above 0.5 through the week. The untrained baseline posts blindly for the first few days and burns out.
Plots and the full per-episode log live in `plots/training_summary.json` and `plots/training_log.csv`.
## Why this is the submission to remember
There were going to be a lot of grid-worlds at this hackathon. A lot of toy puzzles. A lot of "we trained on a math benchmark."
Creator Copilot is something different. It's the smallest possible environment that tests whether an LLM can be an operator — discover an unknown world, plan inquiry under a budget, hold beliefs across time, weigh strategy against the operator's own physical constraints, and beat a smart human-style baseline on it.
That's not just an Instagram problem. That's the shape of every interesting LLM deployment in the next two years: customer success agents, ad ops, account managers, founders' assistants, ops engineers. Operator-class agents will live or die on whether they can do this loop. We don't have a benchmark for it yet. So we built one.
If you train an LLM on Creator Copilot and it gets better, you've taught it something it could not previously do — and you can prove it, line by line, against the literature.
That's the bet. That's why we built it.
Try it: the environment is on Hugging Face Spaces, the training notebook is in `training/`, the per-day audit logs are in `server/simulation_history.json`, and every numeric constant has a citation in `RESEARCH.md`.
We don't think we've solved the creator economy. We think we've built the first environment honest enough to fail against it. Come argue with our numbers.