Commit · 034a807
1 Parent(s): 7a5c462
Update hf_mini_blog.md
blog/hf_mini_blog.md CHANGED (+84 -24)
|
@@ -1,39 +1,99 @@
|
|
| 1 |
-
#
|
| 2 |
|
| 3 |
-
|
| 4 |
|
| 5 |
-
|
| 6 |
|
| 7 |
-
|
| 8 |
|
| 9 |
-
|
| 10 |
|
| 11 |
-
|
| 12 |
|
| 13 |
-
|
| 14 |
-
- **Hour-by-hour heatmap** from Buffer's 9.6M-post study cross-validated with Sprout Social's 2B-engagement analysis
|
| 15 |
-
- **Sleep/cognitive model** based on Van Dongen et al. (2003, *Sleep*, PMID 12683469) — performance lapses are linear above 16 hours awake
|
| 16 |
-
- **Tiered audience fatigue** from Buffer's 2.1M-post frequency study — not a cliff but a gradual decay
|
| 17 |
-
- **7 competitor archetypes** with realistic posting cadences (3–5/week, not per-day)
|
| 18 |
|
| 19 |
-
##
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
-
2. **Query the world** — trending topics, competitor activity, audience preferences
|
| 25 |
-
3. **Form hypotheses** and persist them in a scratchpad (`notes` field)
|
| 26 |
-
4. **Test plans** via `predict_engagement` before committing
|
| 27 |
-
5. **Learn from counterfactual feedback** — the environment shadow-runs the optimal heatmap plan and shows the delta
|
| 28 |
|
| 29 |
-
|
| 30 |
|
| 31 |
-
|
| 32 |
|
| 33 |
-
|
| 34 |
|
| 35 |
-
|
| 36 |
|
| 37 |
-
|
| 38 |
|
| 39 |
-
|
| 1 |
+
# We Trained an LLM to Survive Instagram
|
| 2 |
|
| 3 |
+
### Why we built Creator Copilot, an OpenEnv environment where the agent learns by living a creator's life, not by reading about it.
|
| 4 |
|
| 5 |
+
---
|
| 6 |
|
| 7 |
+
## The scene we couldn't shake
|
| 8 |
|
| 9 |
+
A creator wakes up at 7:42 AM. Yesterday's reel did 12% of what last week's did. Nobody at the platform will tell her why. There is a heatmap somewhere, a ranking change last Tuesday, an audience segment that quietly shifted, a "trending" tag that peaked six hours ago. She doesn't have access to any of it. So she does the only thing she can do: she posts more. Eventually 73% of creators in her cohort report burnout ([Awin, 2024](https://www.prweb.com/releases/a-majority-of-content-creators-and-influencers-struggle-with-burnout-as-concerns-for-ai-begin-to-surface-according-to-a-new-awin-group-survey-research-302257152.html)).
|
| 10 |
|
| 11 |
+
The creator economy is a $250B industry running on guesswork ([Goldman Sachs, 2025](https://www.goldmansachs.com/insights/articles/the-creator-economy-could-approach-half-a-trillion-dollars-by-2027)). 67 million people are running businesses inside a black box, against an algorithm that nobody outside Meta fully understands, while their own bodies push back at 16 hours of wakefulness ([Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469)).
|
| 12 |
|
| 13 |
+
That is a *real* world model problem. And we couldn't find a single RL environment that took it seriously.
|
| 14 |
|
| 15 |
+
## Creator Copilot in one sentence
|
| 16 |
|
| 17 |
+
**An OpenEnv environment where an LLM agent runs an Instagram creator account for 7 simulated days, gets almost nothing for free, and has to discover the rules of the world through 8 tool calls and a notebook.**
|
| 18 |
|
| 19 |
+
It is the smallest version we could build of "operate a real account in a real economy."
|
| 20 |
|
| 21 |
+
## The bet: discovery, not instruction
|
| 22 |
|
| 23 |
+
Most agent environments hand the model a verbose observation and ask it to pick from 4 actions. Creator Copilot does the opposite. The default observation is *deliberately sparse* — just `energy`, `followers`, `last reward`. Everything interesting (trending topics, competitor cadence, audience segments, hour-by-hour engagement, your own past tag performance) is hidden behind tools the agent has to *discover* by hitting `GET /tools`.
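
To make the sparse-observation idea concrete, here is a minimal offline sketch. The three observation fields and the `GET /tools` route come from the text above. The `Observation` class, the `discover_tools` helper, the stubbed HTTP client, and the response schema (a JSON array of objects with a `name` key) are assumptions made for illustration and may not match the real environment.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Observation:
    """The sparse default observation: three numbers and nothing else."""
    energy: float
    followers: int
    last_reward: float


def discover_tools(http_get: Callable[[str], str]) -> list[str]:
    """Return the tool names advertised by GET /tools.

    ASSUMPTION: the endpoint returns a JSON array of objects that each
    carry a "name" key. The real response schema may differ.
    """
    payload = json.loads(http_get("/tools"))
    return sorted(tool["name"] for tool in payload)


def fake_http_get(path: str) -> str:
    """Hypothetical stand-in for an HTTP client so the sketch runs offline."""
    if path != "/tools":
        raise ValueError(f"unknown path: {path}")
    return json.dumps([
        {"name": "predict_engagement"},
        {"name": "coach_feedback"},
    ])


# The agent starts with only the sparse observation...
obs = Observation(energy=1.0, followers=1000, last_reward=0.0)
# ...and has to ask the environment which tools exist.
tools = discover_tools(fake_http_get)
print(obs)
print(tools)
```

Passing the HTTP client in as a function keeps the sketch testable without a running server; a real agent would swap in a client pointed at the environment's URL.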
|
| 24 |
|
| 25 |
+
This is the move that makes the environment a world-modeling environment instead of a recommendation problem:
|
| 26 |
|
| 27 |
+
- The agent has to **plan inquiry**: queries are the only way to reduce uncertainty, so it has to choose which questions are worth asking.
|
| 28 |
+
- The agent has to **carry beliefs forward**: a `notes` scratchpad persists across all 7 days. If the agent doesn't write down "Tuesdays at 12pm worked," it has no memory.
|
| 29 |
+
- The agent has to **test before committing**: `predict_engagement` lets it simulate a plan; `coach_feedback` shows the *counterfactual delta* between its plan and a heatmap-optimal plan. That second signal is the secret sauce — it teaches causality, not just outcomes.
|
| 30 |
+
- The agent has to **stay alive**: `creator_energy` decays with posting and recovers with rest, calibrated to a real sleep-deprivation paper. Burn out and the episode ends early.
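
The test-before-committing loop can be sketched with a toy heatmap. This is an illustration under loud assumptions, not the environment's implementation: the heatmap values are made up, day index 0 is taken to be Monday, and `predicted_engagement`, `heatmap_optimal_plan`, and `counterfactual_delta` are hypothetical helpers. Only the idea of comparing the agent's plan with a heatmap-optimal plan and persisting the lesson in `notes` comes from the text.

```python
import heapq

# ASSUMPTION: a toy 7x24 heatmap of relative engagement. Values are made up.
HEATMAP = [[0.2] * 24 for _ in range(7)]
HEATMAP[1][12] = 1.0  # Tuesday, 12:00
HEATMAP[3][18] = 0.9  # Thursday, 18:00
HEATMAP[5][10] = 0.8  # Saturday, 10:00

Plan = list[tuple[int, int]]  # (day index 0-6, hour 0-23)


def predicted_engagement(plan: Plan) -> float:
    """Score a plan by summing the heatmap value of each posting slot."""
    return sum(HEATMAP[day][hour] for day, hour in plan)


def heatmap_optimal_plan(n_posts: int) -> Plan:
    """Pick the n_posts highest-valued slots in the week."""
    slots = [(d, h) for d in range(7) for h in range(24)]
    return heapq.nlargest(n_posts, slots, key=lambda s: HEATMAP[s[0]][s[1]])


def counterfactual_delta(plan: Plan) -> float:
    """Engagement the heatmap-optimal plan would have earned beyond this plan."""
    best = heatmap_optimal_plan(len(plan))
    return predicted_engagement(best) - predicted_engagement(plan)


# Scratchpad that the agent carries across days (modelled here as a dict).
notes: dict[str, str] = {}

agent_plan = [(1, 12), (2, 3)]  # Tuesday noon and Wednesday 03:00
delta = counterfactual_delta(agent_plan)
if delta > 0:
    notes["day_1"] = f"plan trailed the optimal one by {delta:.2f}; avoid 03:00"
print(round(delta, 2), notes)
```

The delta is computed against a plan with the same number of posts, so it isolates timing choices from posting volume.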
|
| 31 |
|
| 32 |
+
The model doesn't get a tutorial. It gets a phone, a calendar, a sleep cycle, and a question: *can you grow this account without breaking the human?*
|
| 33 |
|
| 34 |
+
## The moat: every number is auditable
|
| 35 |
+
|
| 36 |
+
We were tired of RL environments where the rewards are vibes. So we drew a hard line: **every constant in Creator Copilot is backed by a Tier 1–3 source.** We even wrote a source-quality rubric and explicitly *rejected* 13 SEO/affiliate blogs that didn't meet it.
|
| 37 |
+
|
| 38 |
+
| What it controls | What it's based on |
|
| 39 |
+
|---|---|
|
| 40 |
+
| Engagement decomposition (watch_time, sends, saves, likes) | [Adam Mosseri, Head of Instagram, Jan 2025 statement](https://about.fb.com/news/) |
|
| 41 |
+
| 7×24 hour-of-day heatmap | [Buffer 9.6M post study](https://buffer.com/resources/when-is-the-best-time-to-post-on-instagram) cross-validated with [Sprout Social 2B engagements](https://sproutsocial.com/insights/best-times-to-post-on-social-media/) |
|
| 42 |
+
| Sleep-driven cognitive decay | [Van Dongen et al., 2003, *Sleep*, PMID 12683469](https://pubmed.ncbi.nlm.nih.gov/12683469) |
|
| 43 |
+
| Tiered audience fatigue from over-posting | [Buffer 2.1M post frequency study](https://buffer.com/resources/how-often-to-post-on-instagram/) |
|
| 44 |
+
| Algorithmic disengagement model | [Cen et al., 2024 — arXiv:2410.13108](https://arxiv.org/abs/2410.13108) |
|
| 45 |
+
| Engagement vs. utility split | [Aouali et al., 2024 — arXiv:2406.01611](https://arxiv.org/abs/2406.01611) |
|
| 46 |
+
|
| 47 |
+
If a judge wants to challenge a single number, they can open `RESEARCH.md`, find the DOI/PMID/arXiv ID, and read the methodology. We *want* that fight.
|
| 48 |
+
|
| 49 |
+
That auditability is also why we believe a researcher could write a paper on top of this environment — not "an LLM played a game," but "an LLM learned a strategy that survives a known sleep-deprivation curve."
|
| 50 |
+
|
| 51 |
+
## What the agent gets graded on
|
| 52 |
+
|
| 53 |
+
We didn't want a single-number reward we could game. So the environment ships a **JudgeReport every day** — a deterministic, source-cited audit of three things:
|
| 54 |
+
|
| 55 |
+
- `policy_compliance` — did the agent break sourced sustainability rules? (e.g. >5 posts/day from Buffer 2.1M, weekly collab cap from Cen 2024, >22h awake from Van Dongen 2003)
|
| 56 |
+
- `sustainability_risk` — energy floor, sleep debt, and low-energy ratio over the day
|
| 57 |
+
- `strategic_quality` — engagement-per-post × intent diversity × format diversity
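
As a rough sketch of how two of these signals could be computed: the thresholds (more than 5 posts per day, more than 22 hours awake) and the three-factor product for `strategic_quality` are taken from the list above. The `DayLog` structure, the function names, and the assumption that each factor is already normalised to [0, 1] are hypothetical, and the weekly collab cap is left out because its value is not stated here.

```python
from dataclasses import dataclass

MAX_POSTS_PER_DAY = 5    # from the text: ">5 posts/day" breaks the rule
MAX_HOURS_AWAKE = 22.0   # from the text: ">22h awake" breaks the rule


@dataclass(frozen=True)
class DayLog:
    """Hypothetical per-day summary that a judge could audit."""
    posts: int
    hours_awake: float
    engagement_per_post: float  # ASSUMPTION: normalised to [0, 1]
    intent_diversity: float     # ASSUMPTION: normalised to [0, 1]
    format_diversity: float     # ASSUMPTION: normalised to [0, 1]


def policy_violations(day: DayLog) -> list[str]:
    """List the sourced sustainability rules this day breaks."""
    violations = []
    if day.posts > MAX_POSTS_PER_DAY:
        violations.append("posts_per_day")
    if day.hours_awake > MAX_HOURS_AWAKE:
        violations.append("hours_awake")
    return violations


def strategic_quality(day: DayLog) -> float:
    """Product of the three factors, so a zero in any one zeroes the score."""
    return day.engagement_per_post * day.intent_diversity * day.format_diversity


day = DayLog(posts=6, hours_awake=17.0, engagement_per_post=0.5,
             intent_diversity=0.8, format_diversity=0.5)
print(policy_violations(day))
print(strategic_quality(day))
```

Because the score is a product, an agent cannot trade high engagement for zero diversity; every factor has to be non-trivial.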
|
| 58 |
+
|
| 59 |
+
Plus three task graders calibrated to a *smart heuristic* baseline (`weekly_engage`, `weekly_strategic`, `weekly_competitive`). The agent isn't competing against zero — it's competing against a known-good rule-based player.
|
| 60 |
+
|
| 61 |
+
This composability is the OpenEnv Rubric idea taken seriously: separable, auditable signals that a researcher can swap in and out, not a monolithic black-box reward.
|
| 62 |
+
|
| 63 |
+
## Did the agent actually learn?
|
| 64 |
+
|
| 65 |
+
Yes — and we're being honest about where.
|
| 66 |
+
|
| 67 |
+
We trained Qwen2.5-3B-Instruct (Q4 quantized, running on a local M4 Mac via Ollama, no T4 needed) over 4 rounds, 6 episodes each, with temperature annealing from 1.4 → 0.7. Reward = per-step environment reward + 2× terminal grader score.
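
The reward recipe and temperature schedule above can be written down in a few lines. This sketch assumes the per-step rewards are summed over the episode and that the temperature is interpolated linearly across rounds. Neither detail is stated in the post, and the function names are hypothetical.

```python
def episode_return(step_rewards: list[float], grader_score: float,
                   terminal_weight: float = 2.0) -> float:
    """Per-step environment reward plus 2x the terminal grader score.

    ASSUMPTION: per-step rewards are summed over the episode.
    """
    return sum(step_rewards) + terminal_weight * grader_score


def round_temperature(round_idx: int, n_rounds: int = 4,
                      start: float = 1.4, end: float = 0.7) -> float:
    """Sampling temperature for a given round (0-based).

    ASSUMPTION: linear interpolation from start to end. The post only
    gives the endpoints, not the shape of the schedule.
    """
    if n_rounds == 1:
        return start
    frac = round_idx / (n_rounds - 1)
    return start + frac * (end - start)


temps = [round(round_temperature(i), 3) for i in range(4)]
example_return = episode_return([0.1, 0.2, 0.3], grader_score=0.5)
print(temps)
print(example_return)
```

Weighting the terminal grader score at 2x makes the end-of-week outcome count for more than any single step, which discourages chasing short-term per-post reward.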
|
| 68 |
+
|
| 69 |
+
| Task | Untrained | Trained | Δ (absolute difference × 100) |
|
| 70 |
+
|---|---|---|---|
|
| 71 |
+
| `weekly_engage` | 0.355 | **0.409** | **+5.4%** |
|
| 72 |
+
| `weekly_competitive` | 0.374 | **0.510** | **+13.6%** |
|
| 73 |
+
| `weekly_strategic` | 0.680 | 0.627 | −5.2% |
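
For readers checking the arithmetic: the Δ values match the absolute difference between the two scores scaled by 100, not a relative change. The sketch below recomputes both from the rounded scores in the table. The `deltas` helper is ours, and recomputing from rounded inputs gives −5.3 for `weekly_strategic` rather than the reported −5.2, presumably because the reported figure was derived from unrounded scores.

```python
# Scores copied from the table: task -> (untrained, trained)
SCORES = {
    "weekly_engage": (0.355, 0.409),
    "weekly_competitive": (0.374, 0.510),
    "weekly_strategic": (0.680, 0.627),
}


def deltas(untrained: float, trained: float) -> tuple[float, float]:
    """Return (absolute difference x 100, relative change in percent)."""
    diff = trained - untrained
    return diff * 100, diff / untrained * 100


results = {task: deltas(before, after) for task, (before, after) in SCORES.items()}
for task, (points, relative) in results.items():
    print(f"{task}: {points:+.1f} points, {relative:+.1f}% relative")
```

On a relative basis the gains on `weekly_engage` and `weekly_competitive` are larger than the absolute figures suggest, and so is the regression on `weekly_strategic`.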
|
| 74 |
+
|
| 75 |
+
The wins are largest on the *hardest* task — `weekly_competitive` — which is where the world model bites: the agent has to query competitors, differentiate its content, and time its posts. Exactly where we'd expect tool discovery to matter.
|
| 76 |
+
|
| 77 |
+
The strategic task regression is real and we're not hiding it: the model started doing too much exploration on a task where exploitation matters more, and our 4-round budget wasn't long enough to anneal that out. Honest result on a small training run.
|
| 78 |
+
|
| 79 |
+
What we *can* show qualitatively: the trained agent calls `GET /tools` on day 1, queries trends and competitors before posting, skips `predict_engagement` calls on the days it already has a clear plan, and keeps `creator_energy` above 0.5 through the week. The untrained baseline posts blindly for the first few days and burns out.
|
| 80 |
+
|
| 81 |
+
Plots and the full per-episode log live in `plots/training_summary.json` and `plots/training_log.csv`.
|
| 82 |
+
|
| 83 |
+
## Why this is the submission to remember
|
| 84 |
+
|
| 85 |
+
There were going to be a lot of grid-worlds at this hackathon. A lot of toy puzzles. A lot of "we trained on a math benchmark."
|
| 86 |
+
|
| 87 |
+
Creator Copilot is something different. It's the smallest possible environment that tests whether an LLM can be an **operator** — discover an unknown world, plan inquiry under a budget, hold beliefs across time, weigh strategy against the operator's own physical constraints, and beat a smart human-style baseline on it.
|
| 88 |
+
|
| 89 |
+
That's not just an Instagram problem. That's the shape of every interesting LLM deployment in the next two years: customer success agents, ad ops, account managers, founders' assistants, ops engineers. Operator-class agents will live or die on whether they can do this loop. We don't have a benchmark for it yet. So we built one.
|
| 90 |
+
|
| 91 |
+
If you train an LLM on Creator Copilot and it gets better, you've taught it something it could not previously do — and you can prove it, line by line, against the literature.
|
| 92 |
+
|
| 93 |
+
That's the bet. That's why we built it.
|
| 94 |
+
|
| 95 |
+
---
|
| 96 |
+
|
| 97 |
+
**Try it:** the environment is on Hugging Face Spaces, the training notebook is in `training/`, the per-day audit logs are in `server/simulation_history.json`, and every numeric constant has a citation in [`RESEARCH.md`](../RESEARCH.md).
|
| 98 |
+
|
| 99 |
+
We don't think we've solved the creator economy. We think we've built the first environment honest enough to fail against it. Come argue with our numbers.
|