InosLihka committed on
Commit
ece0bbe
·
1 Parent(s): 839a758

Algorithm Distillation: grader v2 with belief_accuracy + SFT pipeline

Browse files

Grader changes:
- Add belief_accuracy term (weight 0.20) to _grade_episode (sketched below)
- env.record_belief() lets callers register the agent's emitted belief
- Redistribute weights: crash 0.20→0.15, progress 0.25→0.20, conn 0.15→0.10, adapt 0.30→0.25
- Heuristic + random baselines still emit no belief, so they score 0 on this component
- Validation: gpt-5.4 teacher beats the heuristic baseline 0.617 vs 0.449 across 30/30 episodes
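
Sketch of the v2 composition (illustrative only: weights as above, belief_accuracy = 1 - MAE
as documented in docs/entity_definitions.md; this is not the literal _grade_episode code, and
belief and profile vectors are assumed to be normalised to [0, 1] before comparison):

```
def belief_accuracy(last_belief, true_profile):
    # 1 - mean absolute error between the agent's LAST recorded belief and the
    # true hidden profile vector; 0.0 if env.record_belief() was never called
    # (which is why the heuristic/random baselines score 0 on this component).
    if last_belief is None:
        return 0.0
    mae = sum(abs(b - t) for b, t in zip(last_belief, true_profile)) / len(true_profile)
    return max(0.0, 1.0 - mae)

def final_score(c):
    # c: per-component scores, each already in [0, 1]
    return (0.15 * c["crash_free"] + 0.20 * c["progress"] + 0.10 * c["connection"]
            + 0.25 * c["adaptation"] + 0.10 * c["efficiency"]
            + 0.20 * c["belief_accuracy"])
```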

Distillation pipeline:
- scripts/generate_teacher_trajectories.py: Azure OpenAI teacher rollouts
- scripts/reeval_teacher_trajectories.py: offline re-score under v2 grader
- scripts/upload_teacher_data.py: push trajectories to HF Hub dataset
- training/sft_prime.py: TRL SFTTrainer fine-tune of Qwen 2.5-3B on teacher data (rough sketch below)
- scripts/sft_on_hf.py: HF Jobs orchestrator for SFT prime stage
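
Rough shape of the SFT prime stage, for orientation only. The real entry point is
training/sft_prime.py; the model id and hyperparameters below are placeholders, and each
JSONL row is assumed to already carry a "text" field (prompt + teacher completion):

```
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder config -- see training/sft_prime.py / scripts/sft_on_hf.py for the real one.
train_ds = load_dataset("json", data_files="data/teacher_150ep_indist.jsonl", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",   # assumed student checkpoint id
    train_dataset=train_ds,             # assumes a "text" column per row
    args=SFTConfig(output_dir="outputs/sft_prime", num_train_epochs=1),
)
trainer.train()
```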

Parser + prompt:
- extract_action_and_belief now handles both single-line and CoT-prefixed formats
(takes the LAST <digit> <digit> <digit> <ACTION> match; see sketch below)
- SYSTEM_PROMPT in dataset.py unified with teacher's CoT-asking prompt
- inference_eval.py records belief during model rollouts
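
Illustrative sketch of the parsing rule (the real extract_action_and_belief may differ in
normalisation and error handling; the ten action names are assembled from the env's action
set as it appears elsewhere in the repo):

```
import re

ACTIONS = ("DEEP_WORK", "ADMIN_WORK", "LEARN", "SLEEP", "EXERCISE",
           "MEDITATE", "FAMILY_TIME", "SOCIALIZE", "ME_TIME", "BINGE_WATCH")
ANSWER_RE = re.compile(r"(\d)\s+(\d)\s+(\d)\s+(" + "|".join(ACTIONS) + r")")

def extract_action_and_belief(completion: str):
    # Take the LAST `S M W ACTION_NAME` match, so any CoT text emitted before
    # the final answer line is ignored.
    matches = ANSWER_RE.findall(completion.upper())
    if not matches:
        return None, None
    s, m, w, action = matches[-1]
    return [int(s), int(m), int(w)], action
```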

Cleanup (separate task, bundled here):
- Removed env_reward_simple, --use_simple_reward flag, --profile_mode flag
- Removed 'discrete' profile_mode random branch (was unreachable)
- Deduped heuristic_action (now lives in dataset.py only)
- Promoted pipeline_dryrun.py → tests/test_pipeline_smoke.py
- Deleted: docs/logdump.txt, scripts/analyze_logdump.py, diagnostic_replay.py,
legacy blog_post.md, eval_results_v1.json, references/ subdir
- 50 tests pass (45 existing + 5 new belief grader tests)

.env.example ADDED
@@ -0,0 +1,11 @@
1
+ # Azure OpenAI credentials for teacher trajectory generation.
2
+ # Copy this file to .env and fill in your real values:
3
+ # cp .env.example .env
4
+ # then edit .env to put in your actual key
5
+ #
6
+ # .env is in .gitignore so it never gets committed.
7
+
8
+ AZURE_OPENAI_ENDPOINT=https://metahackathon-resource.cognitiveservices.azure.com/
9
+ AZURE_OPENAI_API_KEY=PASTE_YOUR_KEY_HERE
10
+ AZURE_OPENAI_DEPLOYMENT=gpt-5.4
11
+ AZURE_OPENAI_API_VERSION=2024-12-01-preview
.gitignore CHANGED
@@ -7,3 +7,10 @@ __pycache__/
7
  *.egg-info/
8
  dist/
9
  build/
 
 
 
 
 
 
 
 
7
  *.egg-info/
8
  dist/
9
  build/
10
+
11
+ # Local-only artifacts (not committed; uploaded to HF Hub when needed)
12
+ data/
13
+ iter1_results/
14
+ iter2_results/
15
+ iter5_results/
16
+ outputs/
blog_post.md DELETED
@@ -1,90 +0,0 @@
1
- # Teaching an AI to Know You (Without Asking)
2
-
3
- Ask someone how they'd build a personal AI assistant, and they'll say: give it a personality quiz. A preferences form. Maybe a settings page where you pick "introvert" or "morning person" from a dropdown.
4
-
5
- Sounds reasonable. It's the wrong approach entirely.
6
-
7
- Think about the people who actually know you well — a close friend, a partner, a sibling. None of them sat you down with a questionnaire. They figured you out by *watching*. They noticed that you get irritable after too many social events. That you do your best thinking before noon. That skipping exercise makes you anxious by Wednesday.
8
-
9
- They learned your hidden patterns through trial, error, and feedback. RhythmEnv is an experiment in doing that with an RL agent.
10
-
11
- ## Why personality can't be captured in a settings page
12
-
13
- I work on AI at Microsoft. One thing I kept running into building assistant features was the gap between what users *say* they want and what actually helps them. People are bad at introspecting their own patterns. The introvert who says "I don't mind meetings" because they've normalized the drain. The workaholic who checks "I value work-life balance" because they know they should.
14
-
15
- Preference forms capture what people believe about themselves. Behavior reveals what's actually true.
16
-
17
- So the right question isn't "can we ask better questions?" It's "can we learn without asking at all?"
18
-
19
- ## What "knowing a person" actually means
20
-
21
- Here's how I decomposed it. Every person has a set of hidden traits — atomic behavioral properties that describe how they *respond* to activities, not just what they like:
22
-
23
- How much does socializing physically drain you? When does your brain work best — morning or evening? Does leisure make you feel guilty, or does it recharge you? Does progress at work give you inner peace, or just tire you out?
24
-
25
- No single trait defines a person. It's the combination. An introvert who peaks in the morning has high social drain, early cognitive peak, and solo time as their recharge mechanism. An extrovert night owl has the opposite: socializing barely costs vitality, evening is when they come alive, and being alone doesn't restore them. Same list of traits. Completely different values. Completely different person.
26
-
27
- But traits are only half of it.
28
-
29
- ## The thing I couldn't solve with traits alone
30
-
31
- Two people can do the exact same activities and have completely different days. Not because the activities are different — because they *define a good day differently*.
32
-
33
- This is the second layer: hidden reward weights. A definition of what a good week means to each person.
34
-
35
- The introvert values serenity above everything else (60% of their score). A week where they maintained inner peace and made some progress is a great week. Connection barely registers. The extrovert values connection above all (75%). A week full of meaningful social interactions is a great week, even if they didn't make much career progress. The workaholic values progress above all (70%). Deep productive work is the whole point. Everything else is secondary.
36
-
37
- The agent sees the same five meters. Takes the same ten actions. Gets a scalar reward. But the reward is secretly computed using these hidden weights. Same action, same meter changes, completely different reward signal depending on who you're helping.
38
-
39
- ## The environment itself
40
-
41
- RhythmEnv simulates one week in a person's life — seven days, four time slots each, 28 decisions. Each decision is an activity: deep work, exercise, sleep, meditation, family time, socializing. Ten options total.
42
-
43
- Five meters track the person's state. Picture them like fuel gauges on a dashboard. Vitality is physical energy — sleep fills it, sustained work drains it. Cognition is mental sharpness, highest in the morning for some people, evening for others. Progress is career momentum, the only meter that only goes up through work. Serenity is inner calm — meditation and rest help, overwork kills it. And Connection, the most interesting one: it decays passively every single time slot. If you don't actively socialise, it drops on its own. The agent can't ignore it and come back to it later.
44
-
45
- The hidden profile changes what these meters *mean*. Tell the introvert to socialise: their vitality drops three times faster than the base rate. Their body physically rejects it. Tell the extrovert the same: barely any drain. They could socialise all day.
46
-
47
- Tell the introvert to meditate: they get a +0.10 serenity bonus on top of the base effect. Alone time is their recharge. Tell the workaholic the same thing: their serenity *drops* by 0.10, because idle activities make them anxious.
48
-
49
- Tell the workaholic to do deep work: they recover +0.06 vitality — productive work energises them. Tell the introvert to do deep work in the morning: their progress and cognition gains are doubled. Same action, completely different physiological response.
50
-
51
- ## What the agent must figure out, without being told
52
-
53
- The agent sees meters, time of day, and reward. It doesn't see which person type it's helping, or the trait values, or how the reward is being computed.
54
-
55
- After a few actions, the patterns start showing. "I socialised and my vitality crashed — this person drains from socialising." "I meditated and got a huge reward — serenity must be heavily weighted for them." "Deep work in the morning gave double progress — this person peaks early."
56
-
57
- A good agent should probe in the first few steps, infer the person type from the unexpected meter changes, then adapt its strategy for the rest of the week. An agent that discovers it's helping an introvert should meditate more and socialise less. One that discovers it's helping a workaholic should maximise productive hours and cut idle time.
58
-
59
- ## The training signal
60
-
61
- Here's what makes this tractable for RL. At the same starting state — Monday morning, all meters at 0.7 — the best action is completely different per profile:
62
-
63
- | Profile | Best action | Reward | Worst action | Reward |
64
- |---|---|---|---|---|
65
- | Introvert | MEDITATE | +1.76 | SOCIALIZE | +0.03 |
66
- | Extrovert | FAMILY_TIME | +2.63 | ME_TIME | −0.42 |
67
- | Workaholic | DEEP_WORK | +1.57 | ME_TIME | −0.27 |
68
-
69
- GRPO — Group Relative Policy Optimization — generates multiple candidate actions for each state, scores them all against the real environment, then updates the model to prefer the higher-scoring ones. Think of it as the model getting to observe "if I had done X instead of Y here, the outcome would have been this" — and slowly building intuition for which choices work for which person.
70
-
71
- The model is Qwen 2.5-3B with 4-bit quantization and LoRA. Small enough to train on a free Colab T4.
72
-
73
- ## What I'm watching for
74
-
75
- The rule-based heuristic baseline — fixed logic, no profile adaptation, treats everyone the same — scores around 0.76–0.82 depending on the profile. It works *despite* the hidden dynamics, not because it understands them. Sleep when vitality is low. Meditate when serenity is low. Socialise when connection drops. Reasonable advice for anyone.
76
-
77
- The goal for the trained agent isn't just higher scores. It's qualitatively different action sequences per person. The introvert's week should look nothing like the extrovert's week. The workaholic's Monday should look nothing like the introvert's Monday. If the agent is just scoring higher by exploiting a pattern that works across all profiles, that's not discovery — that's luck.
78
-
79
- No questionnaire. No settings page. Just attention, inference, and adjustment.
80
-
81
- That's what I think personal AI should actually feel like.
82
-
83
- ---
84
-
85
- **Links:**
86
- - [Live Environment (HF Space)](https://huggingface.co/spaces/InosLihka/rhythm_env)
87
- - [Training Notebook (Colab)](training/RhythmEnv_GRPO_Training.ipynb)
88
- - [Source Code](https://huggingface.co/spaces/InosLihka/rhythm_env)
89
-
90
- *Built for the Meta PyTorch OpenEnv Hackathon Grand Finale, Bangalore, April 2026.*
docs/architecture.md CHANGED
@@ -10,7 +10,7 @@ profile, real numbers from the reward calculation).
10
 
11
  ```
12
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
13
- β”‚ AGENT (Qwen 2.5-3B + LoRA r=16, 4-bit) β”‚
14
  β”‚ β”‚
15
  β”‚ Input: prompt (state + history) β”‚
16
  β”‚ Output: "3 7 5 DEEP_WORK" β”‚
@@ -400,25 +400,30 @@ DATASET (~3000 rows, generated ONCE before training)
400
  β”‚ final_score β”‚
401
  β”‚ ∈ [0, 1] β”‚
402
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
403
- β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
404
- β”‚ β”‚ β”‚ β”‚ β”‚
405
- β–Ό β–Ό β–Ό β–Ό β–Ό
406
- β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
407
- β”‚ crash_free β”‚ β”‚ progress β”‚ β”‚ connectionβ”‚ β”‚ adaptation β”‚ β”‚ efficiency β”‚
408
- β”‚ Γ— 0.20 β”‚ β”‚ Γ— 0.25 β”‚ β”‚ Γ— 0.15 β”‚ β”‚ Γ— 0.30 β”‚ β”‚ Γ— 0.10 β”‚
409
- β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
410
- β”‚ 1 - crashesβ”‚ β”‚ final P β”‚ β”‚ final Cn β”‚ β”‚ late-half β”‚ β”‚ avg_reward β”‚
411
- β”‚ /total_ck β”‚ β”‚ value β”‚ β”‚ value β”‚ β”‚ mean rewardβ”‚ β”‚ normalized β”‚
412
- β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ - early β”‚ β”‚ to [0,1] β”‚
413
- β”‚ e.g. 0.95 β”‚ β”‚ e.g. 0.42 β”‚ β”‚ e.g. 0.51β”‚ β”‚ e.g. +0.18 β”‚ β”‚ e.g. 0.55 β”‚
414
- β”‚ Γ—0.20=0.19 β”‚ β”‚ Γ—0.25=0.105β”‚ β”‚ Γ—0.15=0.08β”‚ β”‚ Γ—0.30=0.054β”‚ β”‚ Γ—0.10=0.055β”‚
415
- β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
416
-
417
- Ξ£ = 0.19 + 0.105 + 0.08 + 0.054 + 0.055
418
- = 0.484 ← final_score
419
-
420
- Plus iter 4 sparse terminal reward (added to step 27's per-step reward):
421
- terminal_bonus = (0.484 - 0.5) Γ— 5 = -0.08
 
 
 
 
 
422
 
423
  This means: at step 27, agent gets last per-step reward + bonus from grader.
424
  This is the only direct gradient signal pointing at the actual episode quality.
@@ -433,7 +438,7 @@ This is the only direct gradient signal pointing at the actual episode quality.
433
 
434
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
435
  β”‚ discrete-3-profiles β”‚ continuous-in-distributionβ”‚ continuous-OOD β”‚
436
- β”‚ (legacy comparison) β”‚ (was the agent able to β”‚ (does meta-policy β”‚
437
  β”‚ β”‚ learn the meta-policy?) β”‚ generalize?) β”‚
438
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
439
  β”‚ env.reset(seed=N, β”‚ env.reset(seed=N) β”‚ env.reset(seed=10000+N) β”‚
 
10
 
11
  ```
12
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
13
+ β”‚ AGENT (Qwen 2.5-3B + LoRA r=8, 4-bit) β”‚
14
  β”‚ β”‚
15
  β”‚ Input: prompt (state + history) β”‚
16
  β”‚ Output: "3 7 5 DEEP_WORK" β”‚
 
400
  β”‚ final_score β”‚
401
  β”‚ ∈ [0, 1] β”‚
402
  β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
403
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
404
+ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
405
+ β–Ό β–Ό β–Ό β–Ό β–Ό β–Ό
406
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
407
+ β”‚ crash_free β”‚ β”‚ progress β”‚ β”‚ connectionβ”‚ β”‚ adaptation β”‚ β”‚ efficiency β”‚ β”‚ belief β”‚
408
+ β”‚ Γ— 0.15 β”‚ β”‚ Γ— 0.20 β”‚ β”‚ Γ— 0.10 β”‚ β”‚ Γ— 0.25 β”‚ β”‚ Γ— 0.10 β”‚ β”‚ accuracy β”‚
409
+ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ Γ— 0.20 β”‚
410
+ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
411
+ β”‚ 1 - crashesβ”‚ β”‚ final P β”‚ β”‚ final Cn β”‚ β”‚ late-half β”‚ β”‚ avg_reward β”‚ β”‚ 1 - MAE β”‚
412
+ β”‚ /total_ck β”‚ β”‚ value β”‚ β”‚ value β”‚ β”‚ mean rewardβ”‚ β”‚ normalized β”‚ β”‚ vs true β”‚
413
+ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ β”‚ - early β”‚ β”‚ to [0,1] β”‚ β”‚ profile β”‚
414
+ β”‚ e.g. 0.95 β”‚ β”‚ e.g. 0.42 β”‚ β”‚ e.g. 0.51β”‚ β”‚ e.g. +0.18 β”‚ β”‚ e.g. 0.55 β”‚ β”‚ e.g. 0.80 β”‚
415
+ β”‚ Γ—0.15=0.14 β”‚ β”‚ Γ—0.20=0.084β”‚ β”‚ Γ—0.10=0.05β”‚ β”‚ Γ—0.25=0.045β”‚ β”‚ Γ—0.10=0.055β”‚ β”‚ Γ—0.20=0.16 β”‚
416
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
417
+
418
+ Σ = 0.14 + 0.084 + 0.05 + 0.045 + 0.055 + 0.16
419
+ = 0.534 ← final_score (with inference)
420
+
421
+ Heuristic / random baselines never call env.record_belief(), so the belief
422
+ component scores 0 for them — by design: the meta-RL skill is INFERENCE,
423
+ and only agents that actually try get credit on this axis.
424
+
425
+ Plus a sparse terminal reward (added to step 27's per-step reward):
426
+ terminal_bonus = (final_score - 0.5) × 5 → e.g. (0.534 - 0.5) × 5 = +0.17
427
 
428
  This means: at step 27, agent gets last per-step reward + bonus from grader.
429
  This is the only direct gradient signal pointing at the actual episode quality.
 
438
 
439
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
440
  β”‚ discrete-3-profiles β”‚ continuous-in-distributionβ”‚ continuous-OOD β”‚
441
+ β”‚ (3 reference profiles) β”‚ (was the agent able to β”‚ (does meta-policy β”‚
442
  β”‚ β”‚ learn the meta-policy?) β”‚ generalize?) β”‚
443
  β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
444
  β”‚ env.reset(seed=N, β”‚ env.reset(seed=N) β”‚ env.reset(seed=10000+N) β”‚
docs/entity_definitions.md CHANGED
@@ -197,13 +197,31 @@ extrovert_night_owl: βˆ’0.39 (connection weight = 75%; deep work gives 0 connec
197
  Score in [0.0, 1.0]:
198
 
199
  ```
200
- score = 0.30 Γ— balance_score (mean of meters βˆ’ variance)
201
- + 0.25 Γ— crash_free_ratio (1 βˆ’ crash_count / total_possible_crashes)
202
  + 0.20 Γ— progress (final progress meter value)
203
- + 0.15 Γ— connection (final connection meter value)
 
 
 
 
204
  + 0.10 Γ— efficiency_score (avg step reward normalised to [0, 1])
 
 
 
 
205
  ```
206
 
 
 
 
 
 
 
 
 
 
 
 
207
  ---
208
 
209
  ## Internal Tracking Variables
@@ -216,5 +234,6 @@ Not in the observation. Used by the environment to compute rewards and grade.
216
  | `_rng` | Seeded random instance for event rolls and profile selection |
217
  | `_crash_count` | Steps where any meter fell below 0.10 |
218
  | `_total_reward` | Running sum of step rewards for efficiency score |
219
- | `_recent_actions` | Last N action names for repetition dampening |
 
220
  | `_timestep` | Current step index (0–27) |
 
197
  Score in [0.0, 1.0]:
198
 
199
  ```
200
+ score = 0.15 × crash_free_ratio (1 − crash_count / total_possible_crashes)
 
201
  + 0.20 × progress (final progress meter value)
202
+ + 0.10 × connection (final connection meter value)
203
+ + 0.25 × adaptation_score (late-half mean per-step reward minus
204
+ early-half mean — gated by absolute
205
+ late-half quality so a "terrible-then-
206
+ mediocre" exploit cannot win)
207
  + 0.10 × efficiency_score (avg step reward normalised to [0, 1])
208
+ + 0.20 × belief_accuracy (1 − MAE between agent's last-emitted
209
+ belief vector and the true profile
210
+ vector; 0 if the agent never emitted a
211
+ belief — heuristic / random baselines)
212
  ```
213
 
214
+ Two meta-RL signals: `adaptation_score` is implicit (rewards getting better
215
+ over time, since per-step rewards are profile-weighted), and `belief_accuracy`
216
+ is explicit (rewards INFERRING the profile correctly). Without the explicit
217
+ term, agents that play heuristic-style "keep meters healthy" score the same
218
+ as agents that actually do inference, since the other components don't
219
+ differentiate inference from reflex.
220
+
221
+ To emit a belief, the agent calls `env.record_belief([s, m, w])` once per
222
+ step (typically right after parsing its own completion). The grader uses the
223
+ LAST recorded belief.
224
+
225
  ---
226
 
227
  ## Internal Tracking Variables
 
234
  | `_rng` | Seeded random instance for event rolls and profile selection |
235
  | `_crash_count` | Steps where any meter fell below 0.10 |
236
  | `_total_reward` | Running sum of step rewards for efficiency score |
237
+ | `_step_history` | Rolling window of completed steps (action, reward, deltas, anomalies). Used both as the agent-visible history and to compute repetition dampening. |
238
+ | `_step_rewards` | Per-step reward list for adaptation_score in the grader |
239
  | `_timestep` | Current step index (0–27) |
docs/iterations.md CHANGED
@@ -197,7 +197,7 @@ just not what we wanted it to do.
197
 
198
  ---
199
 
200
- ## Iter 3: Align reward + restructure format (in flight at time of writing, ~$5 budgeted, 800 steps)
201
 
202
  **5 architectural fixes**:
203
 
@@ -237,7 +237,112 @@ penalty + exploration bonus + terminal supervision, the agent should:
237
  - Beat random in 2/3 conditions on final_score
238
  - Show positive (or less-negative) adaptation than baselines
239
 
240
- **Result**: TBD when iter 3 completes (~30-40 min after submission).
241
 
242
  ---
243
 
@@ -247,9 +352,12 @@ penalty + exploration bonus + terminal supervision, the agent should:
247
  |---|---|---|---|
248
  | 1 | ~$0.50 | 200 | Mode collapse to single action |
249
  | 2 | ~$1.50 | 400 | Mode collapse to 2-cycle |
250
- | 3 | ~$5 (est) | 800 | TBD |
251
- | **Subtotal** | **~$7** | | |
252
- | Budget | $30 | | $23 remaining |
 
 
 
253
 
254
  ---
255
 
 
197
 
198
  ---
199
 
200
+ ## Iter 3: Align reward + restructure format (CANCELLED before run — stale code, $0)
201
 
202
  **5 architectural fixes**:
203
 
 
237
  - Beat random in 2/3 conditions on final_score
238
  - Show positive (or less-negative) adaptation than baselines
239
 
240
+ **Result**: Iter 3 was never actually launched. Pre-flight inspection of the
241
+ HF Space confirmed the cloned snapshot still had stale code, and a re-launched
242
+ external review surfaced 7 deeper bugs (see Round 2 below) that needed to
243
+ land before any further GPU spend was justified.
244
+
245
+ ---
246
+
247
+ ## Round 2 fixes (applied for iter 4+, after external bug review)
248
+
249
+ An external agent surfaced 7 issues that survived all prior reviews. All landed
250
+ on `round2` branch and on the HF Space `main` before iter 4 launched:
251
+
252
+ 1. **Anomalies surfaced in prompt** (`StepRecord` + `format_observation_prompt`
253
+ + `inference.py`): per-meter anomaly signals were computed each step but
254
+ never made visible to the agent. The agent was supposed to learn from them.
255
+ 2. **Belief baseline subtraction** in `belief_accuracy`: reward is now
256
+ `similarity − constant_baseline_similarity`. The constant `5 5 5` belief
257
+ no longer earns a free +1/step floor.
258
+ 3. **Profile weight cap 0.80 → 0.45** in `sample_profile`. Forces every
259
+ sampled profile to weight 3+ meters meaningfully (originally to kill the
260
+ "single-meter dominant → SLEEP-spam optimal" exploit).
261
+ 4. **Scaled-down shaping** in `_compute_reward`: -0.10 / -0.15 / +0.07
262
+ (was -0.30 / -0.40 / +0.20). Reduces noise-floor of shaping vs. the
263
+ real signal layers.
264
+ 5. **Step-0 belief reward = 0**: agent has no information at step 0, so
265
+ penalizing belief-vs-target there just punishes initialization.
266
+ 6. **Belief-action coupling reward** (±0.15): rewards if the chosen action
267
+ matches the agent's emitted belief, penalizes if it contradicts. Forces
268
+ the belief to be *causally useful*, not decorative.
269
+ 7. **`grader_bias` moved out of `_compute_reward` into `env_reward`**:
270
+ keeps per-step env reward pure for inference-signal analysis. The
271
+ progress/connection bias still lands in the GRPO advantage, just via
272
+ the env-reward layer.
273
+
274
+ ---
275
+
276
+ ## Iter 4: Round 2 fixes — partial run, mistakenly cancelled (2026-04-26, ~$2.10, 235/800 steps)
277
+
278
+ **Config**: a10g-large, LoRA rank 16, num_generations 8, 800 steps, all
279
+ Round 1 + Iter 3 architectural fixes + Round 2 (above).
280
+
281
+ **Hypothesis**: With anomalies in the prompt, baseline subtraction killing
282
+ the belief-spam floor, belief-action coupling forcing causal use of belief,
283
+ and grader_bias keeping env-reward pure, the agent should show monotonic
284
+ belief_accuracy growth without hitting a 2-cycle hack.
285
+
286
+ **What we got** (from 235-step partial — see `docs/iter4_partial_analysis.txt`):
287
+
288
+ Working:
289
+ - Total reward: -3.4 → +0.39 (climbing)
290
+ - format_valid: -1.20 → +0.44 (slow but climbing)
291
+ - env_reward: -2.01 → +0.44 (climbing)
292
+ - grad_norm normalized to ~10 by step 60 from initial 36+
293
+ - No catastrophic mode collapse
294
+
295
+ Broken — the unsolved core:
296
+ - **`belief_accuracy/mean` flat at -0.10 throughout 235 steps**
297
+ - Linear slope: +0.0007 per 100 steps (essentially zero, well under noise)
298
+ - Agent emits beliefs SLIGHTLY WORSE than constant baseline
299
+
300
+ **Why the run ended at 235**: I cancelled the job based on stale HF API
301
+ log output that suggested the run was stuck. The HF UI showed it was
302
+ healthy. ~$2.10 wasted. Lesson banked: **trust the live UI over the
303
+ `/logs` API endpoint**, which lags severely.
304
+
305
+ **Root-cause hypothesis** (post-mortem analysis):
306
+
307
+ The profile cap (0.80 → 0.45) and the baseline subtraction interact
308
+ negatively. With weights clamped to ≤0.45, sampled profiles cluster
309
+ toward balanced; `profile_to_belief_vector` (whose `work_pref` axis is
310
+ 30%-weighted on the progress reward weight) consequently lands closer to
311
+ [0.5, 0.5, 0.5]. The constant `5 5 5` belief already has high cosine
312
+ similarity with that target, so after baseline subtraction there is
313
+ almost no headroom for the agent to "win" against it.
314
+
315
+ **Why we missed it**:
316
+ - The Round 2 fixes were treated as independent, but #2 (baseline
317
+ subtraction) and #3 (profile cap) share the same denominator — the
318
+ spread of the belief target distribution. An analytical check on
319
+ belief-target stddev under the new cap would have caught it before
320
+ spending compute.
321
+ - The `grader_bias` term (#7) was the original justification for
322
+ needing a tighter profile cap (kill the SLEEP-spam exploit). Once
323
+ grader_bias was in env_reward, the cap could have been reverted.
324
+ We applied both fixes simultaneously.
325
+
326
+ ---
327
+
328
+ ## Iter 5: Identical fixes, smaller config (2026-04-26 05:18 UTC, RUNNING)
329
+
330
+ **Config**: a10g-large, **LoRA rank 8**, **num_generations 4**, **500 steps**.
331
+ Same fix set as iter 4 — Round 1 + Iter 3 architectural + Round 2.
332
+
333
+ **Hypothesis**: With a smaller config, validate that iter 4's partial-run
334
+ trajectory was real (climbing total reward, flat belief_accuracy) rather
335
+ than a fluke of the cancelled-mid-run snapshot.
336
+
337
+ **Expected outcome** (informed by iter 4 partial): same flat belief_accuracy
338
+ because the underlying cap × baseline interaction is unchanged. This run
339
+ exists to confirm the hypothesis cheaply before spending on the iter 6
340
+ profile-cap revert.
341
+
342
+ **Job**: `69eda027d70108f37acdf9a7` —
343
+ `https://huggingface.co/jobs/InosLihka/69eda027d70108f37acdf9a7`
344
+
345
+ **Result**: TBD — currently running.
346
 
347
  ---
348
 
 
352
  |---|---|---|---|
353
  | 1 | ~$0.50 | 200 | Mode collapse to single action |
354
  | 2 | ~$1.50 | 400 | Mode collapse to 2-cycle |
355
+ | 3 | $0 | — | Cancelled pre-run (stale code) |
356
+ | 4 (a100/l40s/h200 attempts) | ~$1.50 | — | Capacity-cancelled or hardware incompatibility |
357
+ | 4 (a10g) | ~$2.10 | 235/800 | Cancelled by mistake; partial data shows flat belief_accuracy |
358
+ | 5 (a10g) | TBD | 500 (running) | TBD |
359
+ | **Subtotal** | **~$5.60** | | |
360
+ | Budget | $30 | | ~$24.40 remaining |
361
 
362
  ---
363
 
docs/logdump.txt DELETED
The diff for this file is too large to render. See raw diff
 
docs/references/React Orchestrator Linkedin/V1_ReACT_based_Orchestrator.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
docs/references/React Orchestrator Linkedin/V2_ReACT_Based_Orchestrator.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
docs/references/React Orchestrator Linkedin/o3_mini_V1_ReACT_Based_Orchestrator.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
docs/references/React Orchestrator Linkedin/o3_mini_v2_improved_react_orchestrator.ipynb DELETED
The diff for this file is too large to render. See raw diff
 
eval_baselines_v2.json ADDED
@@ -0,0 +1,12 @@
1
+ {
2
+ "grader_version": "v2_with_belief_accuracy",
3
+ "note": "baselines emit no belief; belief_accuracy component is 0 for them by design",
4
+ "indist": {
5
+ "heuristic_mean": 0.4488,
6
+ "random_mean": 0.402
7
+ },
8
+ "ood": {
9
+ "heuristic_mean": 0.4539,
10
+ "random_mean": 0.3974
11
+ }
12
+ }
eval_results_v1.json DELETED
@@ -1,758 +0,0 @@
1
- [
2
- {
3
- "profile": "introvert_morning",
4
- "strategy": "heuristic",
5
- "seed": 0,
6
- "final_score": 0.7696,
7
- "total_reward": 5.84,
8
- "vitality": 0.76,
9
- "cognition": 0.41,
10
- "progress": 1.0,
11
- "serenity": 1.0,
12
- "connection": 0.43,
13
- "actions": [
14
- "deep_work",
15
- "admin_work",
16
- "meditate",
17
- "sleep",
18
- "deep_work",
19
- "admin_work",
20
- "meditate",
21
- "sleep",
22
- "deep_work",
23
- "admin_work",
24
- "meditate",
25
- "sleep",
26
- "deep_work",
27
- "admin_work",
28
- "socialize",
29
- "sleep",
30
- "deep_work",
31
- "admin_work",
32
- "meditate",
33
- "sleep",
34
- "deep_work",
35
- "admin_work",
36
- "socialize",
37
- "sleep",
38
- "deep_work",
39
- "admin_work",
40
- "meditate",
41
- "sleep"
42
- ]
43
- },
44
- {
45
- "profile": "introvert_morning",
46
- "strategy": "heuristic",
47
- "seed": 1,
48
- "final_score": 0.7526,
49
- "total_reward": 6.88,
50
- "vitality": 0.64,
51
- "cognition": 0.33,
52
- "progress": 1.0,
53
- "serenity": 0.87,
54
- "connection": 0.43,
55
- "actions": [
56
- "deep_work",
57
- "admin_work",
58
- "meditate",
59
- "sleep",
60
- "deep_work",
61
- "admin_work",
62
- "meditate",
63
- "sleep",
64
- "deep_work",
65
- "admin_work",
66
- "meditate",
67
- "sleep",
68
- "deep_work",
69
- "admin_work",
70
- "meditate",
71
- "sleep",
72
- "deep_work",
73
- "admin_work",
74
- "meditate",
75
- "sleep",
76
- "deep_work",
77
- "admin_work",
78
- "socialize",
79
- "sleep",
80
- "deep_work",
81
- "admin_work",
82
- "socialize",
83
- "sleep"
84
- ]
85
- },
86
- {
87
- "profile": "introvert_morning",
88
- "strategy": "heuristic",
89
- "seed": 2,
90
- "final_score": 0.7723,
91
- "total_reward": 8.97,
92
- "vitality": 0.84,
93
- "cognition": 0.44,
94
- "progress": 1.0,
95
- "serenity": 1.0,
96
- "connection": 0.39,
97
- "actions": [
98
- "deep_work",
99
- "admin_work",
100
- "meditate",
101
- "sleep",
102
- "deep_work",
103
- "admin_work",
104
- "meditate",
105
- "sleep",
106
- "deep_work",
107
- "admin_work",
108
- "socialize",
109
- "sleep",
110
- "deep_work",
111
- "admin_work",
112
- "socialize",
113
- "sleep",
114
- "exercise",
115
- "admin_work",
116
- "socialize",
117
- "sleep",
118
- "deep_work",
119
- "admin_work",
120
- "meditate",
121
- "sleep",
122
- "exercise",
123
- "admin_work",
124
- "meditate",
125
- "sleep"
126
- ]
127
- },
128
- {
129
- "profile": "extrovert_night_owl",
130
- "strategy": "heuristic",
131
- "seed": 0,
132
- "final_score": 0.8197,
133
- "total_reward": 4.21,
134
- "vitality": 1.0,
135
- "cognition": 1.0,
136
- "progress": 1.0,
137
- "serenity": 1.0,
138
- "connection": 0.43,
139
- "actions": [
140
- "deep_work",
141
- "learn",
142
- "meditate",
143
- "sleep",
144
- "deep_work",
145
- "learn",
146
- "meditate",
147
- "sleep",
148
- "deep_work",
149
- "admin_work",
150
- "meditate",
151
- "sleep",
152
- "deep_work",
153
- "admin_work",
154
- "socialize",
155
- "sleep",
156
- "deep_work",
157
- "admin_work",
158
- "meditate",
159
- "sleep",
160
- "deep_work",
161
- "admin_work",
162
- "meditate",
163
- "sleep",
164
- "deep_work",
165
- "admin_work",
166
- "meditate",
167
- "sleep"
168
- ]
169
- },
170
- {
171
- "profile": "extrovert_night_owl",
172
- "strategy": "heuristic",
173
- "seed": 1,
174
- "final_score": 0.8209,
175
- "total_reward": 4.59,
176
- "vitality": 1.0,
177
- "cognition": 1.0,
178
- "progress": 0.97,
179
- "serenity": 1.0,
180
- "connection": 0.46,
181
- "actions": [
182
- "deep_work",
183
- "learn",
184
- "meditate",
185
- "sleep",
186
- "deep_work",
187
- "learn",
188
- "meditate",
189
- "sleep",
190
- "deep_work",
191
- "admin_work",
192
- "meditate",
193
- "sleep",
194
- "deep_work",
195
- "admin_work",
196
- "meditate",
197
- "sleep",
198
- "deep_work",
199
- "admin_work",
200
- "meditate",
201
- "sleep",
202
- "deep_work",
203
- "admin_work",
204
- "socialize",
205
- "sleep",
206
- "deep_work",
207
- "admin_work",
208
- "meditate",
209
- "sleep"
210
- ]
211
- },
212
- {
213
- "profile": "extrovert_night_owl",
214
- "strategy": "heuristic",
215
- "seed": 2,
216
- "final_score": 0.8164,
217
- "total_reward": 5.85,
218
- "vitality": 0.97,
219
- "cognition": 0.82,
220
- "progress": 0.93,
221
- "serenity": 1.0,
222
- "connection": 0.53,
223
- "actions": [
224
- "deep_work",
225
- "learn",
226
- "meditate",
227
- "sleep",
228
- "deep_work",
229
- "learn",
230
- "meditate",
231
- "sleep",
232
- "deep_work",
233
- "learn",
234
- "socialize",
235
- "sleep",
236
- "deep_work",
237
- "admin_work",
238
- "meditate",
239
- "sleep",
240
- "deep_work",
241
- "admin_work",
242
- "meditate",
243
- "sleep",
244
- "deep_work",
245
- "admin_work",
246
- "socialize",
247
- "sleep",
248
- "deep_work",
249
- "admin_work",
250
- "meditate",
251
- "sleep"
252
- ]
253
- },
254
- {
255
- "profile": "workaholic_stoic",
256
- "strategy": "heuristic",
257
- "seed": 0,
258
- "final_score": 0.7461,
259
- "total_reward": 11.98,
260
- "vitality": 0.59,
261
- "cognition": 0.23,
262
- "progress": 1.0,
263
- "serenity": 0.95,
264
- "connection": 0.41,
265
- "actions": [
266
- "deep_work",
267
- "learn",
268
- "meditate",
269
- "sleep",
270
- "deep_work",
271
- "admin_work",
272
- "socialize",
273
- "sleep",
274
- "deep_work",
275
- "admin_work",
276
- "socialize",
277
- "sleep",
278
- "deep_work",
279
- "admin_work",
280
- "meditate",
281
- "sleep",
282
- "deep_work",
283
- "admin_work",
284
- "socialize",
285
- "sleep",
286
- "exercise",
287
- "admin_work",
288
- "socialize",
289
- "sleep",
290
- "deep_work",
291
- "admin_work",
292
- "socialize",
293
- "sleep"
294
- ]
295
- },
296
- {
297
- "profile": "workaholic_stoic",
298
- "strategy": "heuristic",
299
- "seed": 1,
300
- "final_score": 0.7585,
301
- "total_reward": 12.67,
302
- "vitality": 0.72,
303
- "cognition": 0.32,
304
- "progress": 1.0,
305
- "serenity": 0.95,
306
- "connection": 0.38,
307
- "actions": [
308
- "deep_work",
309
- "learn",
310
- "meditate",
311
- "sleep",
312
- "deep_work",
313
- "admin_work",
314
- "socialize",
315
- "sleep",
316
- "deep_work",
317
- "admin_work",
318
- "meditate",
319
- "sleep",
320
- "deep_work",
321
- "admin_work",
322
- "meditate",
323
- "sleep",
324
- "deep_work",
325
- "admin_work",
326
- "socialize",
327
- "sleep",
328
- "deep_work",
329
- "admin_work",
330
- "socialize",
331
- "sleep",
332
- "exercise",
333
- "admin_work",
334
- "socialize",
335
- "sleep"
336
- ]
337
- },
338
- {
339
- "profile": "workaholic_stoic",
340
- "strategy": "heuristic",
341
- "seed": 2,
342
- "final_score": 0.7782,
343
- "total_reward": 13.16,
344
- "vitality": 0.74,
345
- "cognition": 0.34,
346
- "progress": 1.0,
347
- "serenity": 0.95,
348
- "connection": 0.44,
349
- "actions": [
350
- "deep_work",
351
- "learn",
352
- "meditate",
353
- "sleep",
354
- "deep_work",
355
- "admin_work",
356
- "socialize",
357
- "sleep",
358
- "deep_work",
359
- "admin_work",
360
- "socialize",
361
- "sleep",
362
- "deep_work",
363
- "admin_work",
364
- "socialize",
365
- "sleep",
366
- "exercise",
367
- "admin_work",
368
- "socialize",
369
- "sleep",
370
- "exercise",
371
- "admin_work",
372
- "socialize",
373
- "sleep",
374
- "exercise",
375
- "admin_work",
376
- "socialize",
377
- "sleep"
378
- ]
379
- },
380
- {
381
- "profile": "introvert_morning",
382
- "strategy": "random",
383
- "seed": 0,
384
- "final_score": 0.7141,
385
- "total_reward": 1.82,
386
- "vitality": 0.45,
387
- "cognition": 0.56,
388
- "progress": 0.66,
389
- "serenity": 1.0,
390
- "connection": 0.71,
391
- "actions": [
392
- "socialize",
393
- "me_time",
394
- "binge_watch",
395
- "socialize",
396
- "exercise",
397
- "family_time",
398
- "sleep",
399
- "admin_work",
400
- "meditate",
401
- "binge_watch",
402
- "admin_work",
403
- "deep_work",
404
- "sleep",
405
- "meditate",
406
- "sleep",
407
- "family_time",
408
- "exercise",
409
- "deep_work",
410
- "admin_work",
411
- "meditate",
412
- "socialize",
413
- "binge_watch",
414
- "exercise",
415
- "meditate",
416
- "learn",
417
- "socialize",
418
- "admin_work",
419
- "sleep"
420
- ]
421
- },
422
- {
423
- "profile": "introvert_morning",
424
- "strategy": "random",
425
- "seed": 1,
426
- "final_score": 0.6924,
427
- "total_reward": 4.02,
428
- "vitality": 0.08,
429
- "cognition": 0.5,
430
- "progress": 0.76,
431
- "serenity": 0.97,
432
- "connection": 0.68,
433
- "actions": [
434
- "exercise",
435
- "meditate",
436
- "sleep",
437
- "meditate",
438
- "meditate",
439
- "me_time",
440
- "meditate",
441
- "learn",
442
- "meditate",
443
- "meditate",
444
- "socialize",
445
- "socialize",
446
- "deep_work",
447
- "meditate",
448
- "socialize",
449
- "deep_work",
450
- "meditate",
451
- "sleep",
452
- "learn",
453
- "socialize",
454
- "deep_work",
455
- "socialize",
456
- "learn",
457
- "sleep",
458
- "family_time",
459
- "meditate",
460
- "meditate",
461
- "admin_work"
462
- ]
463
- },
464
- {
465
- "profile": "introvert_morning",
466
- "strategy": "random",
467
- "seed": 2,
468
- "final_score": 0.6715,
469
- "total_reward": 6.12,
470
- "vitality": 0.61,
471
- "cognition": 0.22,
472
- "progress": 1.0,
473
- "serenity": 0.86,
474
- "connection": 0.17,
475
- "actions": [
476
- "me_time",
477
- "meditate",
478
- "learn",
479
- "meditate",
480
- "learn",
481
- "family_time",
482
- "deep_work",
483
- "family_time",
484
- "me_time",
485
- "admin_work",
486
- "sleep",
487
- "meditate",
488
- "sleep",
489
- "admin_work",
490
- "meditate",
491
- "me_time",
492
- "sleep",
493
- "sleep",
494
- "binge_watch",
495
- "admin_work",
496
- "deep_work",
497
- "admin_work",
498
- "admin_work",
499
- "binge_watch",
500
- "learn",
501
- "sleep",
502
- "me_time",
503
- "deep_work"
504
- ]
505
- },
506
- {
507
- "profile": "extrovert_night_owl",
508
- "strategy": "random",
509
- "seed": 0,
510
- "final_score": 0.9368,
511
- "total_reward": 8.5,
512
- "vitality": 1.0,
513
- "cognition": 0.83,
514
- "progress": 1.0,
515
- "serenity": 1.0,
516
- "connection": 0.98,
517
- "actions": [
518
- "socialize",
519
- "me_time",
520
- "binge_watch",
521
- "socialize",
522
- "exercise",
523
- "family_time",
524
- "sleep",
525
- "admin_work",
526
- "meditate",
527
- "binge_watch",
528
- "admin_work",
529
- "deep_work",
530
- "sleep",
531
- "meditate",
532
- "sleep",
533
- "family_time",
534
- "exercise",
535
- "deep_work",
536
- "admin_work",
537
- "meditate",
538
- "socialize",
539
- "binge_watch",
540
- "exercise",
541
- "meditate",
542
- "learn",
543
- "socialize",
544
- "admin_work",
545
- "sleep"
546
- ]
547
- },
548
- {
549
- "profile": "extrovert_night_owl",
550
- "strategy": "random",
551
- "seed": 1,
552
- "final_score": 0.9054,
553
- "total_reward": 8.2,
554
- "vitality": 0.75,
555
- "cognition": 0.69,
556
- "progress": 1.0,
557
- "serenity": 0.97,
558
- "connection": 0.97,
559
- "actions": [
560
- "exercise",
561
- "meditate",
562
- "sleep",
563
- "meditate",
564
- "meditate",
565
- "me_time",
566
- "meditate",
567
- "learn",
568
- "meditate",
569
- "meditate",
570
- "socialize",
571
- "socialize",
572
- "deep_work",
573
- "meditate",
574
- "socialize",
575
- "deep_work",
576
- "meditate",
577
- "sleep",
578
- "learn",
579
- "socialize",
580
- "deep_work",
581
- "socialize",
582
- "learn",
583
- "sleep",
584
- "family_time",
585
- "meditate",
586
- "meditate",
587
- "admin_work"
588
- ]
589
- },
590
- {
591
- "profile": "extrovert_night_owl",
592
- "strategy": "random",
593
- "seed": 2,
594
- "final_score": 0.7462,
595
- "total_reward": 4.12,
596
- "vitality": 0.75,
597
- "cognition": 0.32,
598
- "progress": 1.0,
599
- "serenity": 0.95,
600
- "connection": 0.4,
601
- "actions": [
602
- "me_time",
603
- "meditate",
604
- "learn",
605
- "meditate",
606
- "learn",
607
- "family_time",
608
- "deep_work",
609
- "family_time",
610
- "me_time",
611
- "admin_work",
612
- "sleep",
613
- "meditate",
614
- "sleep",
615
- "admin_work",
616
- "meditate",
617
- "me_time",
618
- "sleep",
619
- "sleep",
620
- "binge_watch",
621
- "admin_work",
622
- "deep_work",
623
- "admin_work",
624
- "admin_work",
625
- "binge_watch",
626
- "learn",
627
- "sleep",
628
- "me_time",
629
- "deep_work"
630
- ]
631
- },
632
- {
633
- "profile": "workaholic_stoic",
634
- "strategy": "random",
635
- "seed": 0,
636
- "final_score": 0.6185,
637
- "total_reward": 4.07,
638
- "vitality": 0.4,
639
- "cognition": 0.52,
640
- "progress": 0.55,
641
- "serenity": 0.95,
642
- "connection": 0.41,
643
- "actions": [
644
- "socialize",
645
- "me_time",
646
- "binge_watch",
647
- "socialize",
648
- "exercise",
649
- "family_time",
650
- "sleep",
651
- "admin_work",
652
- "meditate",
653
- "binge_watch",
654
- "admin_work",
655
- "deep_work",
656
- "sleep",
657
- "meditate",
658
- "sleep",
659
- "family_time",
660
- "exercise",
661
- "deep_work",
662
- "admin_work",
663
- "meditate",
664
- "socialize",
665
- "binge_watch",
666
- "exercise",
667
- "meditate",
668
- "learn",
669
- "socialize",
670
- "admin_work",
671
- "sleep"
672
- ]
673
- },
674
- {
675
- "profile": "workaholic_stoic",
676
- "strategy": "random",
677
- "seed": 1,
678
- "final_score": 0.6094,
679
- "total_reward": 5.39,
680
- "vitality": 0.04,
681
- "cognition": 0.55,
682
- "progress": 0.6,
683
- "serenity": 1.0,
684
- "connection": 0.44,
685
- "actions": [
686
- "exercise",
687
- "meditate",
688
- "sleep",
689
- "meditate",
690
- "meditate",
691
- "me_time",
692
- "meditate",
693
- "learn",
694
- "meditate",
695
- "meditate",
696
- "socialize",
697
- "socialize",
698
- "deep_work",
699
- "meditate",
700
- "socialize",
701
- "deep_work",
702
- "meditate",
703
- "sleep",
704
- "learn",
705
- "socialize",
706
- "deep_work",
707
- "socialize",
708
- "learn",
709
- "sleep",
710
- "family_time",
711
- "meditate",
712
- "meditate",
713
- "admin_work"
714
- ]
715
- },
716
- {
717
- "profile": "workaholic_stoic",
718
- "strategy": "random",
719
- "seed": 2,
720
- "final_score": 0.5782,
721
- "total_reward": 7.33,
722
- "vitality": 0.4,
723
- "cognition": 0.23,
724
- "progress": 0.88,
725
- "serenity": 0.99,
726
- "connection": 0.0,
727
- "actions": [
728
- "me_time",
729
- "meditate",
730
- "learn",
731
- "meditate",
732
- "learn",
733
- "family_time",
734
- "deep_work",
735
- "family_time",
736
- "me_time",
737
- "admin_work",
738
- "sleep",
739
- "meditate",
740
- "sleep",
741
- "admin_work",
742
- "meditate",
743
- "me_time",
744
- "sleep",
745
- "sleep",
746
- "binge_watch",
747
- "admin_work",
748
- "deep_work",
749
- "admin_work",
750
- "admin_work",
751
- "binge_watch",
752
- "learn",
753
- "sleep",
754
- "me_time",
755
- "deep_work"
756
- ]
757
- }
758
- ]
models.py CHANGED
@@ -66,7 +66,8 @@ class StepRecord(BaseModel):
66
  progress_delta: float = 0.0
67
  serenity_delta: float = 0.0
68
  connection_delta: float = 0.0
69
- # Iter 4 fix: anomalies (was computed in env but not exposed to agent)
 
70
  vitality_anomaly: float = 0.0
71
  cognition_anomaly: float = 0.0
72
  progress_anomaly: float = 0.0
 
66
  progress_delta: float = 0.0
67
  serenity_delta: float = 0.0
68
  connection_delta: float = 0.0
69
+ # Per-meter anomalies: actual_delta minus expected_delta_under_neutral_profile.
70
+ # Surfaced to the agent in the prompt — the cleanest profile-inference signal.
71
  vitality_anomaly: float = 0.0
72
  cognition_anomaly: float = 0.0
73
  progress_anomaly: float = 0.0
scripts/analyze_logdump.py DELETED
@@ -1,124 +0,0 @@
1
- """Parse logdump.txt (HF Jobs UI export) and produce a training trajectory analysis."""
2
-
3
- import json
4
- import re
5
- from pathlib import Path
6
-
7
- LOG_PATH = Path("docs/logdump.txt")
8
-
9
- # Each metric line looks like a Python dict literal β€” parse with eval-ish via JSON
10
- # (HF prints them as Python repr, so single quotes β†’ need to handle)
11
- DICT_RE = re.compile(r"^\{'loss':.*\}$")
12
-
13
-
14
- def parse_dict_line(line: str) -> dict | None:
15
- line = line.strip()
16
- if not line.startswith("{") or "'loss'" not in line:
17
- return None
18
- # Convert Python single-quoted dict to JSON: replace ' with " (naive but works for our shape)
19
- try:
20
- py_dict = eval(line, {"__builtins__": {}}, {})
21
- if isinstance(py_dict, dict):
22
- return py_dict
23
- except Exception:
24
- return None
25
- return None
26
-
27
-
28
- def main():
29
- with open(LOG_PATH, encoding="utf-8") as f:
30
- lines = f.readlines()
31
-
32
- rows = []
33
- for ln in lines:
34
- d = parse_dict_line(ln)
35
- if d is not None:
36
- rows.append(d)
37
-
38
- print(f"Parsed {len(rows)} metric rows")
39
- if not rows:
40
- return
41
-
42
- # Snapshots at percentiles
43
- n = len(rows)
44
- snaps = [0, n // 8, n // 4, n // 2, 3 * n // 4, n - 1]
45
- snaps = sorted(set(snaps))
46
-
47
- metrics = [
48
- ("loss", lambda r: r.get("loss")),
49
- ("reward", lambda r: r.get("reward")),
50
- ("reward_std", lambda r: r.get("reward_std")),
51
- ("frac_zero_std", lambda r: r.get("frac_reward_zero_std")),
52
- ("format_valid", lambda r: r.get("rewards/format_valid/mean")),
53
- ("action_legal", lambda r: r.get("rewards/action_legal/mean")),
54
- ("env_reward", lambda r: r.get("rewards/env_reward/mean")),
55
- ("belief_accuracy", lambda r: r.get("rewards/belief_accuracy/mean")),
56
- ("kl", lambda r: r.get("kl")),
57
- ("compl_length", lambda r: r.get("completion_length")),
58
- ("grad_norm", lambda r: r.get("grad_norm")),
59
- ]
60
-
61
- print()
62
- header = f"{'metric':<18} " + " ".join(f"step~{s:>5}" for s in snaps)
63
- print(header)
64
- print("-" * len(header))
65
- for label, getter in metrics:
66
- vals = []
67
- for s in snaps:
68
- v = getter(rows[s])
69
- vals.append(f"{v:+.3f}" if isinstance(v, (int, float)) else "-")
70
- print(f"{label:<18} " + " ".join(f"{v:>10}" for v in vals))
71
-
72
- # Compute trends: linear-fit slope per metric (eyeball trend direction)
73
- import statistics
74
-
75
- print()
76
- print("=== Linear trend (slope) over the run β€” units per 100 steps ===")
77
- n = len(rows)
78
- xs = list(range(n))
79
- for label, getter in metrics:
80
- ys = [getter(r) for r in rows]
81
- ys_clean = [y for y in ys if isinstance(y, (int, float))]
82
- xs_clean = [x for x, y in zip(xs, ys) if isinstance(y, (int, float))]
83
- if len(ys_clean) < 5:
84
- continue
85
- # Simple least-squares
86
- x_mean = statistics.mean(xs_clean)
87
- y_mean = statistics.mean(ys_clean)
88
- num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs_clean, ys_clean))
89
- den = sum((x - x_mean) ** 2 for x in xs_clean)
90
- slope = (num / den) * 100 if den > 0 else 0
91
- # Mean of last 20 vs mean of first 20 (more robust signal)
92
- first_mean = statistics.mean(ys_clean[:20]) if len(ys_clean) >= 20 else float("nan")
93
- last_mean = statistics.mean(ys_clean[-20:]) if len(ys_clean) >= 20 else float("nan")
94
- delta = last_mean - first_mean
95
- direction = "UP" if delta > 0.01 else ("DOWN" if delta < -0.01 else "FLAT")
96
- print(f" {label:<18} slope/100steps={slope:+.4f} first20-mean={first_mean:+.3f} last20-mean={last_mean:+.3f} delta={delta:+.3f} [{direction}]")
97
-
98
- # Look at action-distribution implicit signal: completion_length stability
99
- print()
100
- print("=== Mode-collapse warning signs ===")
101
- last_50 = rows[-50:] if len(rows) >= 50 else rows
102
- avg_zero_std = statistics.mean(r.get("frac_reward_zero_std", 0) for r in last_50 if isinstance(r.get("frac_reward_zero_std"), (int, float)))
103
- avg_reward_std = statistics.mean(r.get("reward_std", 0) for r in last_50 if isinstance(r.get("reward_std"), (int, float)))
104
- print(f"Last-50 mean frac_reward_zero_std: {avg_zero_std:.2f} (1.0 = full collapse)")
105
- print(f"Last-50 mean reward_std: {avg_reward_std:.3f} (β‰₯0.3 = healthy variance)")
106
-
107
- print()
108
- print(f"Final step in log: {n} (iter 4 was canceled at ~step 235)")
109
- if n >= 1:
110
- last = rows[-1]
111
- print()
112
- print("=== Final-step metrics ===")
113
- for k in [
114
- "loss", "reward", "reward_std", "frac_reward_zero_std",
115
- "rewards/format_valid/mean", "rewards/action_legal/mean",
116
- "rewards/env_reward/mean", "rewards/belief_accuracy/mean",
117
- "kl", "completion_length", "grad_norm",
118
- ]:
119
- v = last.get(k)
120
- print(f" {k:<35} {v}")
121
-
122
-
123
- if __name__ == "__main__":
124
- main()
scripts/diagnostic_replay.py DELETED
@@ -1,90 +0,0 @@
1
- """
2
- Diagnostic: verify env_reward replay matches live env.
3
-
4
- For 10 seeds, plays a random episode while recording each (action, reward).
5
- Then for each step independently, replays the prefix and asserts the reward
6
- from the replay matches the recorded reward within 1e-6.
7
-
8
- Run from rhythm_env root:
9
- python scripts/diagnostic_replay.py
10
- """
11
-
12
- import os
13
- import random
14
- import sys
15
-
16
- sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
17
-
18
- from models import ActionType, RhythmAction
19
- from server.rhythm_environment import RhythmEnvironment, MAX_STEPS
20
- from training.reward_functions import env_reward
21
-
22
-
23
- def play_and_record(seed: int):
24
- """Play one random episode, return list of (action_history_at_step, action_taken, reward_received)."""
25
- env = RhythmEnvironment()
26
- env.reset(seed=seed)
27
- rng = random.Random(seed + 7777)
28
- actions_so_far = []
29
- records = []
30
- for _ in range(MAX_STEPS):
31
- action_type = rng.choice(list(ActionType))
32
- history_snapshot = list(actions_so_far)
33
- obs = env.step(RhythmAction(action_type=action_type))
34
- records.append((history_snapshot, action_type, obs.reward))
35
- actions_so_far.append(action_type.value)
36
- if obs.done:
37
- break
38
- return records
39
-
40
-
41
- def main():
42
- print("=" * 70)
43
- print("env_reward replay diagnostic")
44
- print("=" * 70)
45
-
46
- total_steps = 0
47
- mismatches = 0
48
- max_diff = 0.0
49
-
50
- for seed in range(10):
51
- records = play_and_record(seed)
52
- for history, action_type, recorded_reward in records:
53
- # Build a fake completion that yields this exact action.
54
- completions = [[{"content": action_type.value.upper()}]]
55
-
56
- scores = env_reward(
57
- completions,
58
- seed=[seed],
59
- step_index=[len(history)],
60
- action_history=[history],
61
- )
62
- replay_reward = scores[0]
63
- diff = abs(replay_reward - recorded_reward)
64
- max_diff = max(max_diff, diff)
65
- if diff > 1e-6:
66
- mismatches += 1
67
- if mismatches <= 3:
68
- print(
69
- f" MISMATCH seed={seed} step={len(history)} "
70
- f"action={action_type.value} recorded={recorded_reward:.6f} "
71
- f"replay={replay_reward:.6f} diff={diff:.6f}"
72
- )
73
- total_steps += 1
74
-
75
- print()
76
- print(f"Total steps checked: {total_steps}")
77
- print(f"Mismatches (>1e-6): {mismatches}")
78
- print(f"Max diff: {max_diff:.6e}")
79
- print()
80
- if mismatches == 0:
81
- print("PASS: env_reward replay is deterministic and matches live env.")
82
- return 0
83
- else:
84
- print("FAIL: env_reward replay diverges from live env.")
85
- print("Likely cause: non-determinism in env (RNG state, profile selection, etc.)")
86
- return 1
87
-
88
-
89
- if __name__ == "__main__":
90
- sys.exit(main())
scripts/generate_teacher_trajectories.py ADDED
@@ -0,0 +1,506 @@
1
+ """
2
+ Generate teacher trajectories for Algorithm Distillation.
3
+
4
+ For each seed, plays one full RhythmEnv episode where the action at each step
5
+ is chosen by a teacher LLM (gpt-5.4 via Azure OpenAI). The teacher is prompted
6
+ to emit `<reasoning>...</reasoning>` followed by `S M W ACTION_NAME` on a final
7
+ line. We parse the answer line, step the env, save the full (prompt, response,
8
+ action, reward) tuple to JSONL, and aggregate per-episode metrics for gating.
9
+
10
+ Required env vars (no secrets in code):
11
+ AZURE_OPENAI_ENDPOINT e.g. https://metahackathon-resource.cognitiveservices.azure.com/
12
+ AZURE_OPENAI_API_KEY your Azure OpenAI key (do NOT paste in chat)
13
+ AZURE_OPENAI_DEPLOYMENT the deployment name you chose, e.g. gpt-5.4
14
+ AZURE_OPENAI_API_VERSION e.g. 2024-12-01-preview (default if unset)
15
+
16
+ Usage from rhythm_env root:
17
+
18
+ # Stage 1a: 30-episode validation (~$3-5)
19
+ python scripts/generate_teacher_trajectories.py \
20
+ --seeds 0-29 \
21
+ --output data/teacher_30ep_validation.jsonl \
22
+ --concurrency 3
23
+
24
+ # Stage 1b: scale to 150 episodes (~$15-20)
25
+ python scripts/generate_teacher_trajectories.py \
26
+ --seeds 0-99 \
27
+ --output data/teacher_150ep_indist.jsonl \
28
+ --concurrency 5
29
+ python scripts/generate_teacher_trajectories.py \
30
+ --seeds 10000-10049 \
31
+ --output data/teacher_150ep_ood.jsonl \
32
+ --concurrency 5
33
+
34
+ The script prints PASS/FAIL gate verdicts at the end so you can decide whether
35
+ to scale or fix the teacher prompt before spending more.
36
+ """
37
+
38
+ import argparse
39
+ import asyncio
40
+ import json
41
+ import os
42
+ import re
43
+ import sys
44
+ import time
45
+ from collections import Counter
46
+ from pathlib import Path
47
+
48
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
49
+
50
+ # Load .env (repo root) before reading os.environ so credentials don't have
51
+ # to be exported in the shell. The .env file is in .gitignore.
52
+ try:
53
+ from dotenv import load_dotenv
54
+ _ENV_PATH = Path(__file__).resolve().parent.parent / ".env"
55
+ if _ENV_PATH.exists():
56
+ load_dotenv(_ENV_PATH)
57
+ except ImportError:
58
+ pass # dotenv not installed -> fall back to whatever's in the shell
59
+
60
+ from openai import AsyncAzureOpenAI
61
+ from openai import APIError, RateLimitError, APIConnectionError, APITimeoutError
62
+
63
+ from models import ActionType, RhythmAction
64
+ from server.rhythm_environment import (
65
+ MAX_STEPS,
66
+ RhythmEnvironment,
67
+ )
68
+ from training.dataset import format_observation_prompt
69
+
70
+
71
+ # ---------------------------------------------------------------------------
72
+ # Teacher system prompt
73
+ # ---------------------------------------------------------------------------
74
+ # The student will eventually be SFT'd to match this contract: emit a
75
+ # <reasoning>...</reasoning> block then a final answer line `S M W ACTION_NAME`.
76
+ # Keep this in sync with whatever SYSTEM_PROMPT the SFT'd student will use.
77
+ TEACHER_SYSTEM_PROMPT = """You are a life-management agent helping a person whose preferences are HIDDEN.
78
+ You see 5 life meters and a rolling history of recent steps. The same action
79
+ affects different people differently β€” you must INFER who you're helping from
80
+ rewards, meter changes, and per-meter ANOMALY signals.
81
+
82
+ Each step, do TWO things:
83
+
84
+ 1. Reason briefly about what the observations imply about the person.
85
+ Focus on:
86
+ - Anomalies (actual delta vs neutral-profile expectation): big positive
87
+ social_serenity / connection responses -> high S; big morning cognition
88
+ gains -> high M; productive work giving vitality back -> high W
89
+ - Current meter state: any meter under 0.15 needs urgent recovery
90
+ - What action best fits BOTH the inferred profile and the current state
91
+
92
+ 2. Output your final answer on the LAST line in this exact format:
93
+ S M W ACTION_NAME
94
+ where S, M, W are belief digits 0-9 (0=low, 9=high) representing your best
95
+ estimate of social_pref, morning_pref, work_pref. ACTION_NAME is one of:
96
+ DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE, FAMILY_TIME,
97
+ SOCIALIZE, ME_TIME, BINGE_WATCH
98
+
99
+ Wrap your reasoning in <reasoning>...</reasoning> tags. Keep reasoning under
100
+ 120 tokens. The final answer line MUST be the last line of your response.
101
+
102
+ Belief->action quick reference:
103
+ - High S (extrovert): SOCIALIZE, FAMILY_TIME boost connection cheaply
104
+ - High M (morning person): DEEP_WORK / LEARN in early slots gets bonus cognition
105
+ - High W (workaholic): DEEP_WORK, LEARN drive progress and may energize
106
+ - Low S (introvert): MEDITATE, ME_TIME for solo recharge; avoid SOCIALIZE
107
+ - Low M (night owl): DEEP_WORK / LEARN in evening/night slots
108
+ - Watch crashes: any meter under 0.10 = -0.30 penalty per crashed meter
109
+ - Connection decays passively β€” actively maintain via SOCIALIZE/FAMILY_TIME
110
+ - Don't repeat the same action 3+ times in a row β€” repetition penalty applies
111
+
112
+ Strategy: probe varied actions in the first ~5 steps to gather profile evidence,
113
+ then exploit your sharpened belief by picking actions that match the inferred
114
+ profile + current meter state.
115
+
116
+ Example output:
117
+ <reasoning>
118
+ Last step's socialize gave V-0.12 (anom -0.06, much worse than neutral) β€” high
119
+ social drain, suggests low S. Morning DEEP_WORK earlier gave bonus cognition
120
+ (anom +0.04) -> high M. Vitality at 0.6 still ok, serenity dropping. With low S +
121
+ high M, MEDITATE is the recovery play that fits.
122
+ </reasoning>
123
+ 2 8 5 MEDITATE"""
124
+
125
+
126
+ # ---------------------------------------------------------------------------
127
+ # Answer parsing β€” find the LAST `S M W ACTION_NAME` pattern in the response
128
+ # ---------------------------------------------------------------------------
129
+ VALID_ACTIONS = [at.value.upper() for at in ActionType]
130
+ ANSWER_PATTERN = re.compile(
131
+ r'(\d)\s+(\d)\s+(\d)\s+(' + '|'.join(VALID_ACTIONS) + r')\b',
132
+ re.IGNORECASE,
133
+ )
134
+
135
+
136
+ def parse_teacher_response(text: str):
137
+ """Extract (action_type, belief_vector, raw_match) from teacher output.
138
+
139
+ Returns (None, None, None) if no answer line is parseable.
140
+ """
141
+ if not text:
142
+ return None, None, None
143
+ matches = list(ANSWER_PATTERN.finditer(text))
144
+ if not matches:
145
+ return None, None, None
146
+ last = matches[-1]
147
+ s, m, w, action_name = last.groups()
148
+ try:
149
+ belief = [int(s) / 9.0, int(m) / 9.0, int(w) / 9.0]
150
+ action = ActionType(action_name.lower())
151
+ return action, belief, last.group(0)
152
+ except (ValueError, KeyError):
153
+ return None, None, None
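# --- Illustrative parser check (editor's sketch; not part of the committed file) ---
# The LAST-match rule means a response whose reasoning mentions an earlier
# candidate answer still resolves to the final answer line:
#
#     sample = ("<reasoning>\nMy prior was 5 5 5 SOCIALIZE, but the serenity "
#               "anomaly says low S, so I switch.\n</reasoning>\n2 7 4 MEDITATE")
#     action, belief, raw = parse_teacher_response(sample)
#     # action == ActionType.MEDITATE
#     # belief == [2/9, 7/9, 4/9]  (digits rescaled to [0, 1])
#     # raw == "2 7 4 MEDITATE"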
154
+
155
+
156
+ # ---------------------------------------------------------------------------
157
+ # Async API calls with retry
158
+ # ---------------------------------------------------------------------------
159
+ async def call_teacher(
160
+ client: AsyncAzureOpenAI,
161
+ deployment: str,
162
+ user_prompt: str,
163
+ temperature: float = 0.5,
164
+ max_completion_tokens: int = 400,
165
+ max_retries: int = 4,
166
+ ) -> str:
167
+ """Call the teacher with retries on transient errors. Returns response text."""
168
+ last_err: Exception | None = None
169
+ for attempt in range(max_retries):
170
+ try:
171
+ resp = await client.chat.completions.create(
172
+ model=deployment,
173
+ messages=[
174
+ {"role": "system", "content": TEACHER_SYSTEM_PROMPT},
175
+ {"role": "user", "content": user_prompt},
176
+ ],
177
+ temperature=temperature,
178
+ max_completion_tokens=max_completion_tokens,
179
+ )
180
+ return resp.choices[0].message.content or ""
181
+ except (RateLimitError, APIConnectionError, APITimeoutError) as e:
182
+ last_err = e
183
+ wait = min(60, 2 ** attempt)
184
+ await asyncio.sleep(wait)
185
+ except APIError as e:
186
+ # Non-transient API error β€” log and bail (don't waste retries)
187
+ last_err = e
188
+ break
189
+ raise RuntimeError(f"Teacher call failed after {max_retries} retries: {last_err}")
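# Retry schedule with the defaults above (editor's note): attempts 0-3 back off
# for min(60, 2**attempt) = 1s, 2s, 4s, 8s after each transient failure, so a
# persistently rate-limited call raises after roughly 15s of cumulative sleep
# instead of hanging. Non-transient APIErrors break out of the loop immediately.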
190
+
191
+
192
+ # ---------------------------------------------------------------------------
193
+ # Episode rollout
194
+ # ---------------------------------------------------------------------------
195
+ async def play_episode(
196
+ client: AsyncAzureOpenAI,
197
+ deployment: str,
198
+ seed: int,
199
+ ) -> tuple[list[dict], dict]:
200
+ """Run a full episode with the teacher. Returns (per-step rows, summary)."""
201
+ env = RhythmEnvironment()
202
+ obs = env.reset(seed=seed)
203
+ true_belief = env.get_belief_target()
204
+ profile_name = env.state.profile_name
205
+
206
+ step_rows: list[dict] = []
207
+ actions_taken: list[str] = []
208
+ rewards: list[float] = []
209
+ final_belief: list[float] | None = None
210
+
211
+ for step_idx in range(MAX_STEPS):
212
+ if obs.done:
213
+ break
214
+
215
+ user_prompt = format_observation_prompt(obs)
216
+ try:
217
+ teacher_resp = await call_teacher(client, deployment, user_prompt)
218
+ except RuntimeError as e:
219
+ # Hard failure β€” abort this episode rather than corrupt the dataset
220
+ return step_rows, {
221
+ "seed": seed,
222
+ "profile_name": profile_name,
223
+ "true_belief": [round(x, 3) for x in true_belief],
224
+ "final_belief": [round(x, 3) for x in final_belief] if final_belief else None,
225
+ "belief_mae": None,
226
+ "final_score": 0.0,
227
+ "total_reward": round(sum(rewards), 2),
228
+ "n_steps": len(step_rows),
229
+ "actions": actions_taken,
230
+ "action_distribution": dict(Counter(actions_taken)),
231
+ "n_parse_failures": sum(1 for r in step_rows if r["parse_failed"]),
232
+ "aborted": True,
233
+ "error": str(e),
234
+ }
235
+
236
+ action, belief, raw_match = parse_teacher_response(teacher_resp)
237
+ parse_failed = action is None
238
+ if parse_failed:
239
+ # Fallback: SLEEP keeps the episode alive without skewing exploration
240
+ action = ActionType.SLEEP
241
+ belief = [0.5, 0.5, 0.5]
242
+ else:
243
+ final_belief = belief
244
+
245
+ # Tell the env about the emitted belief so the grader's belief_accuracy
246
+ # component scores it. Without this call, final_score logged below is
247
+ # artificially low (belief component scores 0 even when the teacher
248
+ # actually emitted a belief).
249
+ env.record_belief(belief)
250
+
251
+ rhythm_action = RhythmAction(action_type=action)
252
+ actions_taken.append(action.value)
253
+ next_obs = env.step(rhythm_action)
254
+ rewards.append(next_obs.reward)
255
+
256
+ step_rows.append({
257
+ "seed": seed,
258
+ "step": step_idx,
259
+ "profile_name": profile_name,
260
+ "user_prompt": user_prompt,
261
+ "teacher_response": teacher_resp,
262
+ "parsed_action": action.value,
263
+ "parsed_belief": belief,
264
+ "answer_match": raw_match,
265
+ "env_reward": round(next_obs.reward, 4),
266
+ "parse_failed": parse_failed,
267
+ "true_belief": [round(x, 3) for x in true_belief],
268
+ })
269
+
270
+ obs = next_obs
271
+
272
+ final_score = obs.reward_breakdown.get("final_score", 0.0)
273
+ belief_mae = (
274
+ sum(abs(b - t) for b, t in zip(final_belief, true_belief)) / 3.0
275
+ if final_belief is not None else None
276
+ )
277
+
278
+ return step_rows, {
279
+ "seed": seed,
280
+ "profile_name": profile_name,
281
+ "true_belief": [round(x, 3) for x in true_belief],
282
+ "final_belief": [round(x, 3) for x in final_belief] if final_belief else None,
283
+ "belief_mae": round(belief_mae, 4) if belief_mae is not None else None,
284
+ "final_score": round(final_score, 4),
285
+ "total_reward": round(sum(rewards), 2),
286
+ "n_steps": len(step_rows),
287
+ "actions": actions_taken,
288
+ "action_distribution": dict(Counter(actions_taken)),
289
+ "n_parse_failures": sum(1 for r in step_rows if r["parse_failed"]),
290
+ "aborted": False,
291
+ }
292
+
293
+
294
+ # ---------------------------------------------------------------------------
295
+ # Resume helpers
296
+ # ---------------------------------------------------------------------------
297
+ def already_completed_seeds(jsonl_path: Path) -> set[int]:
298
+ """Seeds whose final step (MAX_STEPS - 1 = 27) is already in the file."""
299
+ if not jsonl_path.exists():
300
+ return set()
301
+ seed_max_step: dict[int, int] = {}
302
+ with open(jsonl_path) as f:
303
+ for line in f:
304
+ try:
305
+ row = json.loads(line)
306
+ except json.JSONDecodeError:
307
+ continue
308
+ sd = row.get("seed")
309
+ st = row.get("step", -1)
310
+ if sd is None:
311
+ continue
312
+ seed_max_step[sd] = max(seed_max_step.get(sd, -1), st)
313
+ return {s for s, mx in seed_max_step.items() if mx >= MAX_STEPS - 1}
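# Example (editor's note): if the JSONL already holds steps 0..27 for seed 3 but
# only steps 0..10 for seed 4, this returns {3}; seed 4 counts as incomplete and
# is replayed from step 0 on resume (its earlier partial rows stay in the file).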
314
+
315
+
316
+ # ---------------------------------------------------------------------------
317
+ # Main
318
+ # ---------------------------------------------------------------------------
319
+ def parse_seed_arg(seed_str: str) -> list[int]:
320
+ if "-" in seed_str and "," not in seed_str:
321
+ lo, hi = seed_str.split("-")
322
+ return list(range(int(lo), int(hi) + 1))
323
+ return [int(s.strip()) for s in seed_str.split(",") if s.strip()]
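# Illustrative behaviour of parse_seed_arg (editor's note, not in the script):
#   parse_seed_arg("0-29")  -> [0, 1, ..., 29]
#   parse_seed_arg("0,1,5") -> [0, 1, 5]
# A mixed string such as "0-5,9" falls through to the comma branch and raises
# ValueError on int("0-5"), so use either a single range or a plain comma list.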
324
+
325
+
326
+ async def main() -> None:
327
+ parser = argparse.ArgumentParser(description=__doc__.split("\n\n")[0])
328
+ parser.add_argument("--seeds", type=str, required=True,
329
+ help="Seed range '0-29' or comma list '0,1,5'")
330
+ parser.add_argument("--output", type=str, required=True,
331
+ help="Output JSONL path for per-step trajectories")
332
+ parser.add_argument("--summary", type=str, default=None,
333
+ help="Output JSON path for episode summaries (default: <output>.summary.json)")
334
+ parser.add_argument("--concurrency", type=int, default=3,
335
+ help="Episodes to run concurrently (default 3; 1500 RPM allows up to ~5)")
336
+ parser.add_argument("--temperature", type=float, default=0.5,
337
+ help="Teacher sampling temperature (default 0.5; lower = more consistent)")
338
+ parser.add_argument("--no-resume", action="store_true",
339
+ help="Do not skip seeds already in the output file")
340
+ args = parser.parse_args()
341
+
342
+ seeds = parse_seed_arg(args.seeds)
343
+ output_path = Path(args.output)
344
+ summary_path = Path(args.summary) if args.summary else output_path.with_suffix(".summary.json")
345
+ output_path.parent.mkdir(parents=True, exist_ok=True)
346
+
347
+ if not args.no_resume:
348
+ completed = already_completed_seeds(output_path)
349
+ if completed:
350
+ print(f"Resume: {len(completed)} seeds already complete; "
351
+ f"{len(seeds) - len(completed & set(seeds))} remaining of {len(seeds)}")
352
+ seeds = [s for s in seeds if s not in completed]
353
+
354
+ # Azure config (read from env so secrets never touch the repo)
355
+ try:
356
+ endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
357
+ api_key = os.environ["AZURE_OPENAI_API_KEY"]
358
+ deployment = os.environ["AZURE_OPENAI_DEPLOYMENT"]
359
+ except KeyError as e:
360
+ sys.exit(f"ERROR: missing env var {e}. Set AZURE_OPENAI_ENDPOINT, "
361
+ f"AZURE_OPENAI_API_KEY, AZURE_OPENAI_DEPLOYMENT.")
362
+ api_version = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")
363
+
364
+ print(f"Endpoint: {endpoint}")
365
+ print(f"Deployment: {deployment}")
366
+ print(f"API version: {api_version}")
367
+ print(f"Seeds: {len(seeds)} (concurrency={args.concurrency}, temp={args.temperature})")
368
+ print(f"Output: {output_path}")
369
+ print(f"Summary: {summary_path}")
370
+ print()
371
+
372
+ if not seeds:
373
+ print("No seeds to process. Exiting.")
374
+ return
375
+
376
+ client = AsyncAzureOpenAI(
377
+ azure_endpoint=endpoint,
378
+ api_key=api_key,
379
+ api_version=api_version,
380
+ )
381
+
382
+ sem = asyncio.Semaphore(args.concurrency)
383
+ file_lock = asyncio.Lock()
384
+ summaries: list[dict] = []
385
+
386
+ async def run_one(seed: int) -> dict | None:
387
+ async with sem:
388
+ t0 = time.time()
389
+ print(f" [seed {seed}] starting", flush=True)
390
+ try:
391
+ step_rows, summary = await play_episode(client, deployment, seed)
392
+ except Exception as e:
393
+ print(f" [seed {seed}] CRASHED: {e}", flush=True)
394
+ return None
395
+ # Append per-step rows atomically (prevents interleaved writes)
396
+ async with file_lock:
397
+ with open(output_path, "a") as f:
398
+ for row in step_rows:
399
+ f.write(json.dumps(row) + "\n")
400
+ dt = time.time() - t0
401
+ mae_str = f"{summary['belief_mae']:.3f}" if summary['belief_mae'] is not None else "n/a"
402
+ print(f" [seed {seed}] done in {dt:.1f}s: "
403
+ f"final={summary['final_score']:.3f} mae={mae_str} "
404
+ f"unique_actions={len(summary['action_distribution'])} "
405
+ f"parse_fails={summary['n_parse_failures']}", flush=True)
406
+ return summary
407
+
408
+ tasks = [run_one(s) for s in seeds]
409
+ results = await asyncio.gather(*tasks)
410
+ summaries = [r for r in results if r is not None]
411
+
412
+ # Merge with any prior summaries (for resume)
413
+ prior_summaries: list[dict] = []
414
+ if summary_path.exists() and not args.no_resume:
415
+ try:
416
+ with open(summary_path) as f:
417
+ prior_summaries = json.load(f).get("episodes", [])
418
+ except (json.JSONDecodeError, KeyError):
419
+ prior_summaries = []
420
+ seen = {s["seed"] for s in summaries}
421
+ summaries = summaries + [s for s in prior_summaries if s["seed"] not in seen]
422
+
423
+ # Aggregate
424
+ n = len(summaries)
425
+ if n == 0:
426
+ print("No episodes completed.")
427
+ return
428
+
429
+ valid = [s for s in summaries if not s.get("aborted")]
430
+ avg_score = sum(s["final_score"] for s in valid) / max(len(valid), 1)
431
+ valid_mae = [s["belief_mae"] for s in valid if s["belief_mae"] is not None]
432
+ avg_mae = sum(valid_mae) / len(valid_mae) if valid_mae else None
433
+ all_actions: Counter = Counter()
434
+ for s in valid:
435
+ all_actions.update(s["action_distribution"])
436
+ n_unique = len(all_actions)
437
+ n_parse_fails = sum(s["n_parse_failures"] for s in valid)
438
+ n_aborted = sum(1 for s in summaries if s.get("aborted"))
439
+
440
+ summary_blob = {
441
+ "n_episodes": n,
442
+ "n_aborted": n_aborted,
443
+ "avg_final_score": round(avg_score, 4),
444
+ "avg_belief_mae": round(avg_mae, 4) if avg_mae is not None else None,
445
+ "n_unique_actions_overall": n_unique,
446
+ "action_distribution_overall": dict(all_actions),
447
+ "n_parse_failures_total": n_parse_fails,
448
+ "deployment": deployment,
449
+ "api_version": api_version,
450
+ "episodes": summaries,
451
+ }
452
+ with open(summary_path, "w") as f:
453
+ json.dump(summary_blob, f, indent=2)
454
+
455
+ # Gates
456
+ BAR_HEURISTIC = 0.587
457
+ BAR_GATE_SCORE = 0.65
458
+ BAR_GATE_MAE = 0.20
459
+ BAR_GATE_ACTIONS = 6
460
+
461
+ print()
462
+ print("=" * 72)
463
+ print("BATCH SUMMARY")
464
+ print("=" * 72)
465
+ print(f"Episodes completed: {n} (aborted: {n_aborted})")
466
+ print(f"Avg final_score: {avg_score:.4f} "
467
+ f"(heuristic baseline: {BAR_HEURISTIC}, random: 0.516)")
468
+ if avg_mae is not None:
469
+ print(f"Avg belief MAE: {avg_mae:.4f} (lower is better)")
470
+ print(f"Unique actions: {n_unique} of 10")
471
+ print(f"Parse failures: {n_parse_fails} (across all step calls)")
472
+ print()
473
+ print("VALIDATION GATES:")
474
+ g_score = avg_score >= BAR_GATE_SCORE
475
+ g_mae = avg_mae is not None and avg_mae < BAR_GATE_MAE
476
+ g_actions = n_unique >= BAR_GATE_ACTIONS
477
+ g_parse = n_parse_fails < 0.05 * n * MAX_STEPS # < 5% parse failure rate
478
+ print(f" [{'PASS' if g_score else 'FAIL'}] avg_final_score >= {BAR_GATE_SCORE}: "
479
+ f"{avg_score:.3f}")
480
+ mae_disp = f"{avg_mae:.3f}" if avg_mae is not None else "n/a"
481
+ print(f" [{'PASS' if g_mae else 'FAIL'}] avg_belief_mae < {BAR_GATE_MAE}: {mae_disp}")
482
+ print(f" [{'PASS' if g_actions else 'FAIL'}] unique_actions >= {BAR_GATE_ACTIONS}: "
483
+ f"{n_unique}")
484
+ print(f" [{'PASS' if g_parse else 'FAIL'}] parse_failures < 5% of calls: "
485
+ f"{n_parse_fails}/{n * MAX_STEPS}")
486
+ print()
487
+ if g_score and g_mae and g_actions and g_parse:
488
+ print("ALL GATES PASS β€” safe to scale to production batch.")
489
+ else:
490
+ print("ONE OR MORE GATES FAILED β€” investigate before scaling.")
491
+ if not g_score:
492
+ print(" -> Teacher quality too low. Consider escalating model "
493
+ "(e.g. gpt-5-pro) or refining the prompt.")
494
+ if not g_mae:
495
+ print(" -> Teacher's beliefs aren't tracking the true profile. "
496
+ "Check anomaly visibility in observation prompt.")
497
+ if not g_actions:
498
+ print(" -> Teacher converged on a narrow action set. Encourage "
499
+ "exploration in the prompt.")
500
+ if not g_parse:
501
+ print(" -> Many responses didn't end with the answer pattern. "
502
+ "Strengthen format instruction in the system prompt.")
503
+
504
+
505
+ if __name__ == "__main__":
506
+ asyncio.run(main())
scripts/pipeline_dryrun.py DELETED
@@ -1,121 +0,0 @@
1
- """
2
- Pipeline dry-run: validate the full meta-RL reward stack end-to-end without GPU.
3
-
4
- Generates a small dataset, synthesizes completions of varying quality
5
- (random valid, perfect, garbage, action-only, action+belief, etc.), and
6
- runs all 4 reward functions. Reports score distributions and prompt sizes.
7
-
8
- This is the local Gate 2 smoke check β€” proves the dataset, parser, and
9
- reward stack are internally consistent before kicking off real training.
10
-
11
- Run from rhythm_env root:
12
- python scripts/pipeline_dryrun.py
13
- """
14
-
15
- import os
16
- import random
17
- import sys
18
-
19
- sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
20
-
21
- from training.dataset import generate_dataset
22
- from training.reward_functions import (
23
- extract_action_and_belief,
24
- format_valid,
25
- action_legal,
26
- env_reward,
27
- belief_accuracy,
28
- )
29
- from models import ActionType
30
- from server.rhythm_environment import sample_profile, profile_to_belief_vector
31
-
32
-
33
- def synth_completion(prompt_seed: int, kind: str) -> str:
34
- """Synthesize a completion of a given quality."""
35
- rng = random.Random(prompt_seed)
36
- actions = list(ActionType)
37
- action_str = rng.choice(actions).value.upper()
38
- s, m, w = rng.randint(0, 9), rng.randint(0, 9), rng.randint(0, 9)
39
- if kind == "perfect":
40
- # Perfect belief means matching the profile
41
- true = profile_to_belief_vector(sample_profile(prompt_seed))
42
- s = round(true[0] * 9)
43
- m = round(true[1] * 9)
44
- w = round(true[2] * 9)
45
- return f"{action_str} {s} {m} {w}"
46
- if kind == "good":
47
- return f"{action_str} {s} {m} {w}"
48
- if kind == "action_only":
49
- return action_str
50
- if kind == "garbage":
51
- return "I don't know what to do here"
52
- if kind == "verbose":
53
- return f"My choice is {action_str} with belief {s} {m} {w} based on the rewards I see."
54
- if kind == "wrong_belief":
55
- # Output opposite of true belief
56
- true = profile_to_belief_vector(sample_profile(prompt_seed))
57
- s = round((1 - true[0]) * 9)
58
- m = round((1 - true[1]) * 9)
59
- w = round((1 - true[2]) * 9)
60
- return f"{action_str} {s} {m} {w}"
61
- return action_str
62
-
63
-
64
- def main():
65
- print("=" * 70)
66
- print("Pipeline Dry-Run")
67
- print("=" * 70)
68
-
69
- # Generate a small dataset (continuous profiles, 10% hint)
70
- samples = generate_dataset(
71
- num_episodes=20,
72
- strategy="mixed",
73
- max_samples=80,
74
- profile_mode="continuous",
75
- hint_fraction=0.1,
76
- )
77
-
78
- # Prompt size analysis
79
- sizes = [len(s["prompt"][0]["content"]) + len(s["prompt"][1]["content"]) for s in samples]
80
- print(f"\nPrompt sizes (chars): min={min(sizes)}, max={max(sizes)}, mean={sum(sizes)/len(sizes):.0f}")
81
- # Rough token estimate: ~4 chars per token
82
- print(f"Estimated tokens: min={min(sizes)//4}, max={max(sizes)//4}, mean={sum(sizes)//len(sizes)//4}")
83
-
84
- print("\n" + "=" * 70)
85
- print("Reward distributions across completion kinds")
86
- print("=" * 70)
87
-
88
- # For each completion kind, generate completions for first N samples
89
- kinds = ["perfect", "good", "action_only", "garbage", "verbose", "wrong_belief"]
90
- n = 30
91
-
92
- sub = samples[:n]
93
- seeds_col = [s["seed"] for s in sub]
94
- history_col = [s["action_history"] for s in sub]
95
- mode_col = [s["profile_mode"] for s in sub]
96
-
97
- print(f"\n{'kind':<14} | {'fmt':>6} {'leg':>6} {'env':>6} {'bel':>6} | {'TOTAL':>6}")
98
- print("-" * 60)
99
- for kind in kinds:
100
- completions = [[{"content": synth_completion(s["seed"], kind)}] for s in sub]
101
- f_scores = format_valid(completions)
102
- l_scores = action_legal(completions)
103
- e_scores = env_reward(completions, seed=seeds_col, action_history=history_col, profile_mode=mode_col)
104
- b_scores = belief_accuracy(completions, seed=seeds_col, action_history=history_col, profile_mode=mode_col)
105
- f_avg = sum(f_scores) / len(f_scores)
106
- l_avg = sum(l_scores) / len(l_scores)
107
- e_avg = sum(e_scores) / len(e_scores)
108
- b_avg = sum(b_scores) / len(b_scores)
109
- total = f_avg + l_avg + e_avg + b_avg
110
- print(f"{kind:<14} | {f_avg:+6.2f} {l_avg:+6.2f} {e_avg:+6.2f} {b_avg:+6.2f} | {total:+6.2f}")
111
-
112
- print()
113
- print("Expected ordering (best -> worst total):")
114
- print(" perfect > good > wrong_belief, action_only > verbose > garbage")
115
- print("If `perfect > wrong_belief`, the belief signal is gradient-providing.")
116
- print("If `good > action_only`, format_valid pushes toward emitting belief.")
117
- print("If `garbage` is most negative, format penalty is doing its job.")
118
-
119
-
120
- if __name__ == "__main__":
121
- main()
scripts/reeval_teacher_trajectories.py ADDED
@@ -0,0 +1,154 @@
1
+ """
2
+ Re-evaluate existing teacher trajectories under the NEW grader (with belief_accuracy term).
3
+
4
+ For each episode in the JSONL:
5
+ 1. Replay the env from the recorded seed
6
+ 2. Step with the recorded action sequence
7
+ 3. Call env.record_belief(parsed_belief) at each step (using the LAST step's
8
+ belief for the grader)
9
+ 4. Read final_score (now under new grader)
10
+
11
+ Also runs heuristic + random baselines on the same seeds for comparison, so
12
+ we can directly answer: does the teacher beat heuristic+random under the new
13
+ grader, and by how much?
14
+
15
+ Usage:
16
+ python scripts/reeval_teacher_trajectories.py --jsonl data/teacher_30ep_validation.jsonl
17
+ """
18
+
19
+ import argparse
20
+ import json
21
+ import os
22
+ import random
23
+ import sys
24
+ from collections import defaultdict
25
+ from pathlib import Path
26
+ from statistics import mean, median, stdev
27
+
28
+ sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
29
+
30
+ from models import ActionType, RhythmAction
31
+ from server.rhythm_environment import RhythmEnvironment, MAX_STEPS
32
+ from training.dataset import heuristic_action
33
+
34
+
35
+ def replay_with_beliefs(seed: int, action_seq: list[str], belief_seq: list[list[float]]) -> dict:
36
+ """Replay an episode with given actions+beliefs, return final_score breakdown."""
37
+ env = RhythmEnvironment()
38
+ obs = env.reset(seed=seed)
39
+ for action_name, belief in zip(action_seq, belief_seq):
40
+ if obs.done:
41
+ break
42
+ env.record_belief(belief)
43
+ obs = env.step(RhythmAction(action_type=ActionType(action_name)))
44
+ final_score = obs.reward_breakdown.get("final_score", 0.0)
45
+ return {
46
+ "final_score": final_score,
47
+ "vitality": obs.vitality,
48
+ "cognition": obs.cognition,
49
+ "progress": obs.progress,
50
+ "serenity": obs.serenity,
51
+ "connection": obs.connection,
52
+ }
53
+
54
+
55
+ def play_heuristic(seed: int) -> float:
56
+ env = RhythmEnvironment()
57
+ obs = env.reset(seed=seed)
58
+ for _ in range(MAX_STEPS):
59
+ if obs.done:
60
+ break
61
+ action = heuristic_action(obs)
62
+ obs = env.step(RhythmAction(action_type=action))
63
+ return obs.reward_breakdown.get("final_score", 0.0)
64
+
65
+
66
+ def play_random(seed: int) -> float:
67
+ env = RhythmEnvironment()
68
+ obs = env.reset(seed=seed)
69
+ rng = random.Random(seed + 12345)
70
+ actions = list(ActionType)
71
+ for _ in range(MAX_STEPS):
72
+ if obs.done:
73
+ break
74
+ action = rng.choice(actions)
75
+ obs = env.step(RhythmAction(action_type=action))
76
+ return obs.reward_breakdown.get("final_score", 0.0)
77
+
78
+
79
+ def main() -> None:
80
+ parser = argparse.ArgumentParser()
81
+ parser.add_argument("--jsonl", type=str, required=True,
82
+ help="Teacher trajectories JSONL")
83
+ args = parser.parse_args()
84
+
85
+ # Group rows by seed
86
+ by_seed: dict[int, list[dict]] = defaultdict(list)
87
+ with open(args.jsonl) as f:
88
+ for line in f:
89
+ row = json.loads(line)
90
+ by_seed[row["seed"]].append(row)
91
+ for seed in by_seed:
92
+ by_seed[seed].sort(key=lambda r: r["step"])
93
+
94
+ seeds = sorted(by_seed.keys())
95
+ print(f"Loaded {len(seeds)} episodes from {args.jsonl}\n")
96
+
97
+ teacher_scores = []
98
+ heuristic_scores = []
99
+ random_scores = []
100
+
101
+ for seed in seeds:
102
+ rows = by_seed[seed]
103
+ action_seq = [r["parsed_action"] for r in rows]
104
+ belief_seq = [r["parsed_belief"] for r in rows]
105
+
106
+ result = replay_with_beliefs(seed, action_seq, belief_seq)
107
+ teacher_scores.append(result["final_score"])
108
+ heuristic_scores.append(play_heuristic(seed))
109
+ random_scores.append(play_random(seed))
110
+
111
+ def stats(vs, label):
112
+ return f"{label:<10} mean={mean(vs):.4f} median={median(vs):.4f} std={stdev(vs):.4f} min={min(vs):.4f} max={max(vs):.4f}"
113
+
114
+ print("=" * 78)
115
+ print("FINAL_SCORE UNDER NEW GRADER (belief_accuracy weighted 0.20)")
116
+ print("=" * 78)
117
+ print(stats(teacher_scores, "teacher:"))
118
+ print(stats(heuristic_scores, "heuristic:"))
119
+ print(stats(random_scores, "random:"))
120
+
121
+ teacher_avg = mean(teacher_scores)
122
+ heur_avg = mean(heuristic_scores)
123
+ rand_avg = mean(random_scores)
124
+ margin_h = teacher_avg - heur_avg
125
+ margin_r = teacher_avg - rand_avg
126
+
127
+ print()
128
+ print(f"teacher - heuristic: {margin_h:+.4f}")
129
+ print(f"teacher - random: {margin_r:+.4f}")
130
+ print()
131
+
132
+ # Per-episode comparison
133
+ teacher_beats_heur = sum(1 for t, h in zip(teacher_scores, heuristic_scores) if t > h)
134
+ print(f"Episodes where teacher > heuristic: {teacher_beats_heur} / {len(seeds)}")
135
+
136
+ # Verdict
137
+ print()
138
+ print("=" * 78)
139
+ print("VERDICT")
140
+ print("=" * 78)
141
+ GATE = 0.05  # teacher must beat heuristic by at least 0.05 final_score (5 points on the 0-1 scale) on average
142
+ if margin_h >= GATE:
143
+ print(f"PASS β€” teacher beats heuristic by {margin_h:.3f} (>= {GATE} required)")
144
+ print(" New grader differentiates inference from reflex. Proceed to scale + SFT.")
145
+ elif margin_h > 0:
146
+ print(f"WEAK β€” teacher beats heuristic by only {margin_h:.3f} (need {GATE}+)")
147
+ print(" Consider raising belief_accuracy weight or improving teacher prompt.")
148
+ else:
149
+ print(f"FAIL β€” teacher does NOT beat heuristic ({margin_h:.3f})")
150
+ print(" Either grader weights are wrong or teacher is genuinely weak.")
151
+
152
+
153
+ if __name__ == "__main__":
154
+ main()
scripts/sft_on_hf.py ADDED
@@ -0,0 +1,168 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "torch",
5
+ # "transformers==4.56.2",
6
+ # "trl==0.22.2",
7
+ # "datasets",
8
+ # "peft",
9
+ # "accelerate",
10
+ # "bitsandbytes",
11
+ # "unsloth",
12
+ # "openenv-core",
13
+ # "fastapi",
14
+ # "uvicorn",
15
+ # "pydantic",
16
+ # "huggingface_hub",
17
+ # ]
18
+ # ///
19
+ """
20
+ HF Jobs orchestrator for SFT prime stage.
21
+
22
+ Submits the SFT prime training as an HF Jobs run. Clones the rhythm_env
23
+ HF Space, downloads the teacher trajectory JSONL files from a HF dataset
24
+ or model repo, runs training/sft_prime.py, and uploads the SFT'd model.
25
+
26
+ Submit from local with:
27
+ hf jobs uv run --flavor a10g-large --secrets HF_TOKEN \\
28
+ -e TEACHER_DATA_REPO=InosLihka/rhythm-env-teacher-trajectories \\
29
+ -e MODEL_REPO_SUFFIX=sft-primed \\
30
+ -e EPOCHS=2 \\
31
+ -d scripts/sft_on_hf.py
32
+
33
+ Cost on a10g-large at $1.50/hr: ~$2-3 for ~30-45 min training.
34
+ """
35
+
36
+ import json
37
+ import os
38
+ import shutil
39
+ import subprocess
40
+ import sys
41
+ from pathlib import Path
42
+
43
+ REPO_URL = os.environ.get("REPO_URL", "https://huggingface.co/spaces/InosLihka/rhythm_env")
44
+ WORK_DIR = "/tmp/rhythm_env"
45
+ OUTPUT_DIR = "/tmp/rhythm_env/outputs/rhythm-env-sft-primed"
46
+
47
+ # Teacher trajectory data must be uploaded to a HF dataset/model repo before
48
+ # this job runs (HF Jobs containers don't have access to local files). The
49
+ # repo should contain the teacher_*.jsonl files at its root.
50
+ TEACHER_DATA_REPO = os.environ.get(
51
+ "TEACHER_DATA_REPO",
52
+ "InosLihka/rhythm-env-teacher-trajectories",
53
+ )
54
+ TEACHER_FILES = os.environ.get(
55
+ "TEACHER_FILES",
56
+ "teacher_30ep_validation.jsonl,teacher_indist_30_99.jsonl,teacher_ood_10000_10049.jsonl",
57
+ ).split(",")
58
+
59
+ EPOCHS = int(os.environ.get("EPOCHS", "2"))
60
+ MAX_STEPS = int(os.environ.get("MAX_STEPS", "-1")) # -1 = use epochs
61
+ LORA_RANK = int(os.environ.get("LORA_RANK", "16"))
62
+ LEARNING_RATE = float(os.environ.get("LEARNING_RATE", "2e-4"))
63
+ MAX_SEQ_LENGTH = int(os.environ.get("MAX_SEQ_LENGTH", "2048"))
64
+
65
+ SUFFIX = os.environ.get("MODEL_REPO_SUFFIX", "sft-primed")
66
+ DEFAULT_REPO = f"InosLihka/rhythm-env-meta-trained-{SUFFIX}"
67
+ MODEL_REPO = os.environ.get("MODEL_REPO", DEFAULT_REPO)
68
+
69
+ print("=== SFT prime config ===")
70
+ print(f" TEACHER_DATA_REPO: {TEACHER_DATA_REPO}")
71
+ print(f" TEACHER_FILES: {TEACHER_FILES}")
72
+ print(f" EPOCHS={EPOCHS}, MAX_STEPS={MAX_STEPS}, LORA_RANK={LORA_RANK}")
73
+ print(f" LR={LEARNING_RATE}, MAX_SEQ_LENGTH={MAX_SEQ_LENGTH}")
74
+ print(f" MODEL_REPO={MODEL_REPO}")
75
+ print()
76
+
77
+
78
+ def run(cmd):
79
+ print(f"\n>>> {' '.join(cmd) if isinstance(cmd, list) else cmd}", flush=True)
80
+ subprocess.run(cmd, check=True)
81
+
82
+
83
+ def main():
84
+ # 1. Clone repo
85
+ if Path(WORK_DIR).exists():
86
+ shutil.rmtree(WORK_DIR)
87
+ run(["git", "clone", REPO_URL, WORK_DIR])
88
+ os.chdir(WORK_DIR)
89
+ sys.path.insert(0, WORK_DIR)
90
+ sys.path.insert(0, os.path.join(WORK_DIR, "training"))
91
+
92
+ # 2. Download teacher trajectories from HF Hub
93
+ from huggingface_hub import hf_hub_download
94
+
95
+ Path("data").mkdir(exist_ok=True)
96
+ local_paths = []
97
+ for fn in TEACHER_FILES:
98
+ fn = fn.strip()
99
+ if not fn:
100
+ continue
101
+ print(f"Downloading {fn} from {TEACHER_DATA_REPO}...")
102
+ local = hf_hub_download(
103
+ repo_id=TEACHER_DATA_REPO,
104
+ filename=fn,
105
+ repo_type="dataset",
106
+ local_dir="data",
107
+ )
108
+ local_paths.append(local)
109
+ print(f"Downloaded {len(local_paths)} JSONL files")
110
+
111
+ # 3. Run SFT
112
+ sft_args = [
113
+ "python", "training/sft_prime.py",
114
+ "--teacher_jsonls", *local_paths,
115
+ "--output_dir", OUTPUT_DIR,
116
+ "--lora_rank", str(LORA_RANK),
117
+ "--learning_rate", str(LEARNING_RATE),
118
+ "--max_seq_length", str(MAX_SEQ_LENGTH),
119
+ "--epochs", str(EPOCHS),
120
+ ]
121
+ if MAX_STEPS > 0:
122
+ sft_args.extend(["--max_steps", str(MAX_STEPS)])
123
+ run(sft_args)
124
+
125
+ # 4. Eval (3 conditions: discrete-3 / in-dist / OOD)
126
+ eval_args = [
127
+ "python", "training/inference_eval.py",
128
+ "--model_path", OUTPUT_DIR,
129
+ "--num_episodes", "5",
130
+ "--output_file", "eval_results.json",
131
+ ]
132
+ run(eval_args)
133
+
134
+ # 5. Upload to HF Hub
135
+ token = os.environ.get("HF_TOKEN")
136
+ if not token:
137
+ print("WARNING: HF_TOKEN not set, skipping upload")
138
+ print(f"Outputs at: {OUTPUT_DIR}")
139
+ return
140
+
141
+ from huggingface_hub import HfApi, login
142
+ login(token=token)
143
+ api = HfApi()
144
+ api.create_repo(MODEL_REPO, exist_ok=True, repo_type="model")
145
+
146
+ api.upload_folder(
147
+ folder_path=OUTPUT_DIR,
148
+ repo_id=MODEL_REPO,
149
+ repo_type="model",
150
+ commit_message=f"SFT prime ({EPOCHS} epochs, lora r={LORA_RANK}) on teacher trajectories",
151
+ )
152
+ api.upload_file(
153
+ path_or_fileobj="eval_results.json",
154
+ path_in_repo="eval_results.json",
155
+ repo_id=MODEL_REPO,
156
+ repo_type="model",
157
+ )
158
+
159
+ print()
160
+ print("=" * 60)
161
+ print("DONE")
162
+ print(f" SFT'd model: https://huggingface.co/{MODEL_REPO}")
163
+ print(f" Eval JSON: https://huggingface.co/{MODEL_REPO}/blob/main/eval_results.json")
164
+ print("=" * 60)
165
+
166
+
167
+ if __name__ == "__main__":
168
+ main()
scripts/upload_teacher_data.py ADDED
@@ -0,0 +1,105 @@
1
+ """
2
+ Upload teacher trajectory JSONL files to a HF Hub dataset repo so the
3
+ SFT-on-HF-Jobs orchestrator can download them.
4
+
5
+ HF Jobs containers don't have access to local files β€” the teacher data
6
+ has to live on HF Hub first.
7
+
8
+ Usage:
9
+ python scripts/upload_teacher_data.py \\
10
+ --files data/teacher_30ep_validation.jsonl \\
11
+ data/teacher_indist_30_99.jsonl \\
12
+ data/teacher_ood_10000_10049.jsonl \\
13
+ --repo InosLihka/rhythm-env-teacher-trajectories
14
+
15
+ Requires HF_TOKEN env var (or `hf auth login` already done).
16
+ """
17
+
18
+ import argparse
19
+ import os
20
+ import sys
21
+ from pathlib import Path
22
+
23
+ from huggingface_hub import HfApi, login
24
+
25
+
26
+ def main() -> None:
27
+ parser = argparse.ArgumentParser()
28
+ parser.add_argument("--files", nargs="+", required=True,
29
+ help="Local JSONL files to upload")
30
+ parser.add_argument("--repo", type=str, required=True,
31
+ help="HF Hub dataset repo (e.g. InosLihka/rhythm-env-teacher-trajectories)")
32
+ parser.add_argument("--commit_message", type=str,
33
+ default="Add teacher trajectories from gpt-5.4 + grader v2")
34
+ args = parser.parse_args()
35
+
36
+ token = os.environ.get("HF_TOKEN")
37
+ if token:
38
+ login(token=token)
39
+
40
+ api = HfApi()
41
+ api.create_repo(args.repo, exist_ok=True, repo_type="dataset", private=False)
42
+ print(f"Repo: https://huggingface.co/datasets/{args.repo}")
43
+
44
+ for path in args.files:
45
+ p = Path(path)
46
+ if not p.exists():
47
+ print(f"SKIP missing: {path}")
48
+ continue
49
+ print(f"Uploading {p.name} ({p.stat().st_size / 1024:.1f} KB)...")
50
+ api.upload_file(
51
+ path_or_fileobj=str(p),
52
+ path_in_repo=p.name,
53
+ repo_id=args.repo,
54
+ repo_type="dataset",
55
+ commit_message=args.commit_message,
56
+ )
57
+
58
+ # Add a small README documenting the dataset format
59
+ readme = """# RhythmEnv teacher trajectories
60
+
61
+ Per-step (state, prompt, teacher_response, action, belief, reward) tuples
62
+ collected by replaying RhythmEnv with gpt-5.4 (Azure AI Foundry) as the
63
+ acting agent. Used as the SFT corpus for Algorithm Distillation.
64
+
65
+ ## Files
66
+
67
+ Each JSONL row is one step. Schema:
68
+
69
+ ```
70
+ {
71
+ "seed": int, # episode seed (also determines hidden profile)
72
+ "step": int, # step index 0..27
73
+ "profile_name": str, # 'sampled_<seed>' for continuous-mode profiles
74
+ "user_prompt": str, # observation prompt the student will see at inference
75
+ "teacher_response": str, # full teacher output: "<reasoning>...</reasoning>\\nS M W ACTION_NAME"
76
+ "parsed_action": str, # action name (e.g. "deep_work")
77
+ "parsed_belief": [s, m, w], # 3-dim belief in [0, 1]
78
+ "answer_match": str, # raw matched substring of the answer line
79
+ "env_reward": float, # per-step env reward
80
+ "parse_failed": bool, # True if response couldn't be parsed into action+belief
81
+ "true_belief": [s, m, w] # ground-truth belief vector for the active profile
82
+ }
83
+ ```
84
+
85
+ ## Generation
86
+
87
+ Generated using `scripts/generate_teacher_trajectories.py` from the
88
+ [InosLihka/rhythm_env Space](https://huggingface.co/spaces/InosLihka/rhythm_env).
89
+ Teacher: `gpt-5.4` (Azure AI Foundry, version 2026-03-05). Sampling
90
+ temperature 0.5. ~840 (state, response) pairs per 30-episode batch.
91
+ """
92
+ api.upload_file(
93
+ path_or_fileobj=readme.encode("utf-8"),
94
+ path_in_repo="README.md",
95
+ repo_id=args.repo,
96
+ repo_type="dataset",
97
+ commit_message="Add dataset README",
98
+ )
99
+
100
+ print()
101
+ print(f"Done. Dataset: https://huggingface.co/datasets/{args.repo}")
102
+
103
+
104
+ if __name__ == "__main__":
105
+ main()
server/rhythm_environment.py CHANGED
@@ -19,8 +19,9 @@ Key design principles for learnability:
19
  in every observation so the agent can detect personality anomalies
20
  - *_anomaly fields: per-meter deviation from neutral-profile expectation,
21
  giving a direct fingerprint of the hidden profile each step
22
- - discovery_bonus: 15% of final grade rewards profile-adapted strategy
23
- in the second half of the week (steps 14–27)
 
24
  - Profile assignment uses a scrambled seed to prevent memorization
25
  of seed -> profile mappings during training
26
  """
@@ -215,9 +216,11 @@ def sample_profile(seed: int) -> Dict[str, Any]:
215
  raw = [rng.gammavariate(a, 1.0) for a in alphas]
216
  total = sum(raw)
217
  weights = [w / total for w in raw]
218
- # Iter 4 fix: tighter clamp (0.45 max) forces every profile to weight 3+
219
- # meters meaningfully. Old 0.80 cap allowed single-meter dominant profiles
220
- # where SLEEP-spam was correctly the optimal play (env wasn't lying).
 
 
221
  weights = [max(0.05, min(0.45, w)) for w in weights]
222
  total = sum(weights)
223
  weights = [w / total for w in weights]
@@ -303,8 +306,8 @@ class RhythmEnvironment(Environment):
303
  - Anomaly signals: actual delta minus neutral-profile expectation
304
  - Rolling step_history (last 7 steps) with actions, rewards, deltas
305
 
306
- The final grade rewards profile-appropriate strategy in the second half
307
- of the week (discovery_bonus, 15% of score).
308
  """
309
 
310
  SUPPORTS_CONCURRENT_SESSIONS: bool = True
@@ -324,9 +327,12 @@ class RhythmEnvironment(Environment):
324
  self._timestep: int = 0
325
  self._crash_count: int = 0
326
  self._total_reward: float = 0.0
327
- self._recent_actions: list = []
328
  self._step_history: list = []
329
  self._step_rewards: list = [] # per-step rewards (for adaptation_score in grader)
 
 
 
 
330
 
331
  def get_metadata(self) -> EnvironmentMetadata:
332
  return EnvironmentMetadata(
@@ -357,18 +363,13 @@ class RhythmEnvironment(Environment):
357
 
358
  self._rng = random.Random(effective_seed)
359
 
360
- # Profile selection β€” three modes:
361
- # 1. Explicit hardcoded profile name β†’ legacy 3-profile mode (eval comparison)
362
- # 2. profile_mode='discrete' β†’ pick from 3 hardcoded profiles by scrambled seed
363
- # 3. Default β†’ sampled continuous profile (meta-RL training distribution)
364
  profile_name = kwargs.get("profile")
365
- profile_mode = kwargs.get("profile_mode", "continuous")
366
  if profile_name and profile_name in PROFILE_MAP:
367
  self._profile = deepcopy(PROFILE_MAP[profile_name])
368
- elif profile_mode == "discrete":
369
- profile_rng = random.Random(effective_seed ^ 0xA3C5F729)
370
- profile_index = profile_rng.randint(0, len(PROFILES) - 1)
371
- self._profile = deepcopy(PROFILES[profile_index])
372
  else:
373
  self._profile = sample_profile(effective_seed)
374
 
@@ -384,9 +385,9 @@ class RhythmEnvironment(Environment):
384
  self._timestep = 0
385
  self._crash_count = 0
386
  self._total_reward = 0.0
387
- self._recent_actions = []
388
  self._step_history = []
389
  self._step_rewards = []
 
390
 
391
  self._state = RhythmState(
392
  episode_id=episode_id or str(uuid4()),
@@ -430,7 +431,8 @@ class RhythmEnvironment(Environment):
430
  effects = dict(ACTION_EFFECTS[action_name])
431
 
432
  # --- 2b. Repetition dampening ---
433
- repeat_count = self._recent_actions[-3:].count(action_name)
 
434
  if repeat_count > 0:
435
  dampening = 1.0 - 0.25 * repeat_count # 0.75, 0.50, 0.25
436
  for meter in METERS:
@@ -486,7 +488,6 @@ class RhythmEnvironment(Environment):
486
 
487
  # --- 10. Advance timestep ---
488
  self._timestep += 1
489
- self._recent_actions.append(action_name)
490
  new_day = self._timestep // SLOTS_PER_DAY
491
  new_slot = self._timestep % SLOTS_PER_DAY
492
 
@@ -509,9 +510,10 @@ class RhythmEnvironment(Environment):
509
  if done:
510
  final_score = self._grade_episode()
511
  reward_breakdown["final_score"] = round(final_score, 4)
512
- # Iter 3 fix: sparse terminal reward β€” direct supervision on grader
513
- # final_score is in [0, 1]; baseline-relative bonus gives strong signal
514
- # for ending the episode well. Range: -2.5 (terrible) to +2.5 (perfect).
 
515
  terminal_bonus = (final_score - 0.5) * 5.0
516
  reward = max(-3.0, min(3.0, reward + terminal_bonus))
517
  self._total_reward += terminal_bonus # update tracking too
@@ -530,8 +532,9 @@ class RhythmEnvironment(Environment):
530
  self._state.active_event = active_event
531
 
532
  # --- 15. Append completed step to rolling history ---
533
- # Iter 4 fix: include anomalies (was computed but only stashed in
534
- # reward_breakdown which the prompt builder never read)
 
535
  self._step_history.append({
536
  "step": current_step,
537
  "action": action_name,
@@ -575,12 +578,26 @@ class RhythmEnvironment(Environment):
575
  """
576
  return profile_to_belief_vector(self._profile)
577
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
578
  def get_profile_hint(self) -> Dict[str, float]:
579
  """Return a coarse profile hint usable in observation during curriculum.
580
 
581
- Returns the 3-dim belief vector with descriptive keys. The hint is
582
- included in observations only when the env is started with
583
- profile_visible=True β€” this is the curriculum's "visible" phase.
584
  """
585
  b = profile_to_belief_vector(self._profile)
586
  return {"social_pref": round(b[0], 3), "morning_pref": round(b[1], 3), "work_pref": round(b[2], 3)}
@@ -711,15 +728,15 @@ class RhythmEnvironment(Environment):
711
  self._vitality = max(0.0, self._vitality - vd)
712
 
713
  def _compute_reward(self, deltas: Dict[str, float]) -> float:
714
- """Compute pure profile-weighted per-step reward.
715
-
716
- Iter 4 fix: REMOVED the grader_bias term from here (moved to the
717
- TRAINING reward function in reward_functions.py). Keeping the env's
718
- per-step reward pure means:
719
- - Inference signal (which depends on profile_weights) is uncontaminated
720
- - Grader's adaptation_score isn't computed on biased rewards (no
721
- self-cancelling alignment)
722
- - The env's reward semantics match what an honest deployment would see
723
  """
724
  weights = self._profile["reward_weights"]
725
  return sum(deltas[m] * weights[m] for m in METERS) * REWARD_SCALE
@@ -729,32 +746,37 @@ class RhythmEnvironment(Environment):
729
  Compute final episode score in [0, 1].
730
 
731
  Components (meta-learning aligned):
732
- 0.20 β€” crash_free_ratio: no critical meter drops
733
- 0.25 β€” progress: career/skill growth
734
- 0.15 β€” connection: relationship maintained
735
- 0.30 β€” adaptation_score: agent got better as it learned the user
736
  0.10 β€” efficiency: bounded normalized average reward
 
737
 
738
- adaptation_score is the meta-learning signal: late-half mean per-step
739
- reward minus early-half mean, gated by absolute late-half quality so
740
- a "terrible-then-mediocre" exploit cannot win.
 
 
 
741
 
 
 
742
  Per-step reward is already profile-weighted via _compute_reward(), so
743
- a high late-half mean directly means the agent figured out and
744
- optimized for THIS profile's preferences.
745
  """
746
  steps = max(self._timestep, 1)
747
 
748
- # 1. Crash-free ratio (0.20)
749
  crash_free_ratio = 1.0 - (self._crash_count / (steps * len(METERS)))
750
 
751
- # 2. Progress (0.25)
752
  progress_score = self._progress
753
 
754
- # 3. Connection (0.15)
755
  connection_score = self._connection
756
 
757
- # 4. Adaptation score (0.30) β€” KEY meta-learning signal.
758
  # Split rewards in halves; positive only if late half is non-negative
759
  # AND late > early. Normalized to [0, 1].
760
  half = max(steps // 2, 1)
@@ -763,9 +785,10 @@ class RhythmEnvironment(Environment):
763
  if early and late:
764
  mean_early = sum(early) / len(early)
765
  mean_late = sum(late) / len(late)
766
- # Iter 2 fix: per-step rewards are CLAMPED to [-3, +3] in step(), not [-1, +1].
767
- # Old normalization saturated late_quality at 1.0 for any mean_late >= +1,
768
- # making the grader unable to distinguish good from excellent late-half.
 
769
  late_quality = max(0.0, min(1.0, (mean_late + 3.0) / 6.0))
770
  gain = mean_late - mean_early
771
  # gain in [-6, +6]; normalize to [0, 1] (only positive gain counts)
@@ -778,12 +801,23 @@ class RhythmEnvironment(Environment):
778
  avg_reward = self._total_reward / steps
779
  efficiency_score = max(0.0, min(1.0, (avg_reward + 1.0) / 2.0))
780
 
 
 
 
 
 
 
 
 
 
 
781
  score = (
782
- 0.20 * crash_free_ratio
783
- + 0.25 * progress_score
784
- + 0.15 * connection_score
785
- + 0.30 * adaptation_score
786
  + 0.10 * efficiency_score
 
787
  )
788
  return max(0.0, min(1.0, score))
789
 
 
19
  in every observation so the agent can detect personality anomalies
20
  - *_anomaly fields: per-meter deviation from neutral-profile expectation,
21
  giving a direct fingerprint of the hidden profile each step
22
+ - adaptation_score: 30% of final grade β€” late-half mean per-step reward
23
+ minus early-half mean (gated by absolute late-half quality). Rewards
24
+ the agent for getting better as it learns the user.
25
  - Profile assignment uses a scrambled seed to prevent memorization
26
  of seed β†’ profile mappings during training
27
  """
 
216
  raw = [rng.gammavariate(a, 1.0) for a in alphas]
217
  total = sum(raw)
218
  weights = [w / total for w in raw]
219
+ # Cap each weight at 0.45 so every sampled profile weights 3+ meters
220
+ # meaningfully. With an 0.80 cap, single-meter-dominant profiles let
221
+ # SLEEP-spam (or any single recovery action) be optimal β€” the env wasn't
222
+ # lying, the agent was right to spam. Forcing balance makes belief
223
+ # inference matter for action selection.
224
  weights = [max(0.05, min(0.45, w)) for w in weights]
225
  total = sum(weights)
226
  weights = [w / total for w in weights]
 
306
  - Anomaly signals: actual delta minus neutral-profile expectation
307
  - Rolling step_history (last 7 steps) with actions, rewards, deltas
308
 
309
+ The final grade rewards profile-appropriate strategy via adaptation_score
310
+ (30% of grade): late-half mean per-step reward minus early-half mean.
311
  """
312
 
313
  SUPPORTS_CONCURRENT_SESSIONS: bool = True
 
327
  self._timestep: int = 0
328
  self._crash_count: int = 0
329
  self._total_reward: float = 0.0
 
330
  self._step_history: list = []
331
  self._step_rewards: list = [] # per-step rewards (for adaptation_score in grader)
332
+ # Latest emitted belief vector β€” set by callers via record_belief() and
333
+ # consumed by _grade_episode. Stays None if the agent never emits a belief
334
+ # (e.g. heuristic baseline) β€” that case scores 0 on the belief component.
335
+ self._final_belief: Optional[List[float]] = None
336
 
337
  def get_metadata(self) -> EnvironmentMetadata:
338
  return EnvironmentMetadata(
 
363
 
364
  self._rng = random.Random(effective_seed)
365
 
366
+ # Profile selection β€” two modes:
367
+ # 1. Explicit hardcoded profile name β†’ one of the 3 reference profiles
368
+ # (used by tests + the legacy 3-profile eval condition)
369
+ # 2. Default β†’ sampled continuous profile (meta-RL training distribution)
370
  profile_name = kwargs.get("profile")
 
371
  if profile_name and profile_name in PROFILE_MAP:
372
  self._profile = deepcopy(PROFILE_MAP[profile_name])
 
 
 
 
373
  else:
374
  self._profile = sample_profile(effective_seed)
375
 
 
385
  self._timestep = 0
386
  self._crash_count = 0
387
  self._total_reward = 0.0
 
388
  self._step_history = []
389
  self._step_rewards = []
390
+ self._final_belief = None
391
 
392
  self._state = RhythmState(
393
  episode_id=episode_id or str(uuid4()),
 
431
  effects = dict(ACTION_EFFECTS[action_name])
432
 
433
  # --- 2b. Repetition dampening ---
434
+ recent3 = [h["action"] for h in self._step_history[-3:]]
435
+ repeat_count = recent3.count(action_name)
436
  if repeat_count > 0:
437
  dampening = 1.0 - 0.25 * repeat_count # 0.75, 0.50, 0.25
438
  for meter in METERS:
 
488
 
489
  # --- 10. Advance timestep ---
490
  self._timestep += 1
 
491
  new_day = self._timestep // SLOTS_PER_DAY
492
  new_slot = self._timestep % SLOTS_PER_DAY
493
 
 
510
  if done:
511
  final_score = self._grade_episode()
512
  reward_breakdown["final_score"] = round(final_score, 4)
513
+ # Sparse terminal reward: directly supervise on grader final_score.
514
+ # Centered on 0.5 (the "average" episode), scaled by 5x to give a
515
+ # range of [-2.5, +2.5] β€” strong enough to dominate any local
516
+ # reward-hack the agent might find on per-step shaping alone.
517
  terminal_bonus = (final_score - 0.5) * 5.0
518
  reward = max(-3.0, min(3.0, reward + terminal_bonus))
519
  self._total_reward += terminal_bonus # update tracking too
 
532
  self._state.active_event = active_event
533
 
534
  # --- 15. Append completed step to rolling history ---
535
+ # History entries carry per-meter anomalies (actual - expected_under_neutral).
536
+ # The prompt builder reads these directly to surface the agent's clearest
537
+ # profile-inference signal.
538
  self._step_history.append({
539
  "step": current_step,
540
  "action": action_name,
 
578
  """
579
  return profile_to_belief_vector(self._profile)
580
 
581
+ def record_belief(self, belief: List[float]) -> None:
582
+ """Record the agent's emitted belief for the current step.
583
+
584
+ The grader (`_grade_episode`) uses the LAST recorded belief to compute
585
+ the belief_accuracy component of final_score. Callers should invoke
586
+ this once per step after parsing the agent's completion. Heuristic /
587
+ random baselines that don't emit a belief never call this, and the
588
+ belief component scores 0 for them β€” that's intentional: the meta-RL
589
+ skill is INFERENCE, and only agents that actually try get credit.
590
+ """
591
+ if len(belief) != 3:
592
+ raise ValueError(f"belief must have 3 elements, got {len(belief)}")
593
+ self._final_belief = [max(0.0, min(1.0, float(b))) for b in belief]
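# Typical caller pattern (editor's sketch; the real call sites are
# scripts/generate_teacher_trajectories.py and training/inference_eval.py --
# `parse_completion` below is a stand-in name for whichever parser the caller uses):
#
#     action, belief, _ = parse_completion(model_output)
#     if belief is not None:
#         env.record_belief(belief)
#     obs = env.step(RhythmAction(action_type=action))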
594
+
595
  def get_profile_hint(self) -> Dict[str, float]:
596
  """Return a coarse profile hint usable in observation during curriculum.
597
 
598
+ Returns the 3-dim belief vector with descriptive keys. The dataset
599
+ generator passes this into the prompt for the fraction of samples
600
+ with show_profile_hint=True (the curriculum's "visible" warmup phase).
601
  """
602
  b = profile_to_belief_vector(self._profile)
603
  return {"social_pref": round(b[0], 3), "morning_pref": round(b[1], 3), "work_pref": round(b[2], 3)}
 
728
  self._vitality = max(0.0, self._vitality - vd)
729
 
730
  def _compute_reward(self, deltas: Dict[str, float]) -> float:
731
+ """Pure profile-weighted per-step reward.
732
+
733
+ Deliberately uncontaminated: the grader-aligned bias (progress +
734
+ connection deltas) lives in the TRAINING reward function in
735
+ reward_functions.py, not here. Keeping the env's per-step reward
736
+ pure means (1) the agent's inference signal stays a clean function
737
+ of the hidden profile_weights, (2) the grader's adaptation_score
738
+ isn't computed on biased rewards, and (3) the env's reward matches
739
+ what an honest deployment would surface to the agent.
740
  """
741
  weights = self._profile["reward_weights"]
742
  return sum(deltas[m] * weights[m] for m in METERS) * REWARD_SCALE
 
746
  Compute final episode score in [0, 1].
747
 
748
  Components (meta-learning aligned):
749
+ 0.15 β€” crash_free_ratio: no critical meter drops
750
+ 0.20 β€” progress: career/skill growth
751
+ 0.10 β€” connection: relationship maintained
752
+ 0.25 β€” adaptation_score: agent got better as it learned the user
753
  0.10 β€” efficiency: bounded normalized average reward
754
+ 0.20 β€” belief_accuracy: how close last-emitted belief is to true profile
755
 
756
+ belief_accuracy is the explicit meta-RL inference signal: an agent
757
+ that doesn't emit a belief scores 0 here, and an agent that emits
758
+ a belief close to the hidden profile vector scores up to 1. Without
759
+ this term, agents that play heuristic-style "keep meters healthy"
760
+ score the same as agents that actually infer the profile, since the
761
+ other components don't differentiate inference from reflex.
762
 
763
+ adaptation_score remains the implicit signal: late-half mean per-step
764
+ reward minus early-half mean, gated by absolute late-half quality.
765
  Per-step reward is already profile-weighted via _compute_reward(), so
766
+ a high late-half mean still means the agent figured out the profile.
 
767
  """
768
  steps = max(self._timestep, 1)
769
 
770
+ # 1. Crash-free ratio (0.15)
771
  crash_free_ratio = 1.0 - (self._crash_count / (steps * len(METERS)))
772
 
773
+ # 2. Progress (0.20)
774
  progress_score = self._progress
775
 
776
+ # 3. Connection (0.10)
777
  connection_score = self._connection
778
 
779
+ # 4. Adaptation score (0.25) — implicit inference signal.
780
  # Split rewards in halves; positive only if late half is non-negative
781
  # AND late > early. Normalized to [0, 1].
782
  half = max(steps // 2, 1)
 
785
  if early and late:
786
  mean_early = sum(early) / len(early)
787
  mean_late = sum(late) / len(late)
788
+ # Per-step rewards are clamped to [-3, +3] in step(), so normalize
789
+ # late_quality with the [-3, +3] range (NOT [-1, +1]) — otherwise
790
+ # the gate saturates at 1.0 for any mean_late ≥ 1 and the grader
791
+ # cannot distinguish good from excellent late-half quality.
792
  late_quality = max(0.0, min(1.0, (mean_late + 3.0) / 6.0))
793
  gain = mean_late - mean_early
794
  # gain in [-6, +6]; normalize to [0, 1] (only positive gain counts)
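A quick check of the range choice above, with illustrative numbers (not from a real episode):

    mean_late = 1.5
    late_quality = max(0.0, min(1.0, (mean_late + 3.0) / 6.0))   # 0.75 under the [-3, +3] range
    wrong_range = max(0.0, min(1.0, (mean_late + 1.0) / 2.0))    # 1.0, already saturated under [-1, +1]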
 
801
  avg_reward = self._total_reward / steps
802
  efficiency_score = max(0.0, min(1.0, (avg_reward + 1.0) / 2.0))
803
 
804
+ # 6. Belief accuracy (0.20) — explicit inference signal.
805
+ # Score = 1 - mean_absolute_error against the true belief vector.
806
+ # If no belief was recorded (heuristic / random baselines), score = 0.
807
+ if self._final_belief is not None:
808
+ true_belief = profile_to_belief_vector(self._profile)
809
+ mae = sum(abs(b - t) for b, t in zip(self._final_belief, true_belief)) / 3.0
810
+ belief_accuracy_score = max(0.0, 1.0 - mae)
811
+ else:
812
+ belief_accuracy_score = 0.0
813
+
814
  score = (
815
+ 0.15 * crash_free_ratio
816
+ + 0.20 * progress_score
817
+ + 0.10 * connection_score
818
+ + 0.25 * adaptation_score
819
  + 0.10 * efficiency_score
820
+ + 0.20 * belief_accuracy_score
821
  )
822
  return max(0.0, min(1.0, score))
823
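A worked example of the weighted sum above, with illustrative component values (not taken from any real run):

    score = (0.15 * 0.95    # crash_free_ratio
             + 0.20 * 0.60  # progress
             + 0.10 * 0.50  # connection
             + 0.25 * 0.40  # adaptation
             + 0.10 * 0.55  # efficiency
             + 0.20 * 0.80) # belief_accuracy
    # = 0.1425 + 0.12 + 0.05 + 0.10 + 0.055 + 0.16 = 0.6275
    # The same episode with no recorded belief loses the last term and lands at 0.4675.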
 
tests/test_pipeline_smoke.py ADDED
@@ -0,0 +1,235 @@
 
 
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ """End-to-end smoke tests for the meta-RL training pipeline (no GPU).
8
+
9
+ Validates that dataset generation, the LLM-output parser, and all four
10
+ reward functions agree with each other before any GPU training spend.
11
+ The most important check is test_no_reward_layer_is_constant_across_kinds —
12
+ that's what catches the iter-1 mode-collapse class of bug (a reward
13
+ layer returns the same value for every completion in a GRPO group,
14
+ contributing zero to advantage).
15
+ """
16
+
17
+ import random
18
+
19
+ import pytest
20
+
21
+ from models import ActionType
22
+ from server.rhythm_environment import sample_profile, profile_to_belief_vector
23
+ from training.dataset import generate_dataset
24
+ from training.reward_functions import (
25
+ action_legal,
26
+ belief_accuracy,
27
+ env_reward,
28
+ extract_action_and_belief,
29
+ format_valid,
30
+ )
31
+
32
+
33
+ @pytest.fixture(scope="module")
34
+ def small_dataset():
35
+ """20 episodes, ~80 samples max β€” enough variety for reward checks."""
36
+ return generate_dataset(
37
+ num_episodes=20,
38
+ strategy="mixed",
39
+ max_samples=80,
40
+ hint_fraction=0.1,
41
+ )
42
+
43
+
44
+ def _synth_completion(seed: int, kind: str) -> str:
45
+ """Synthesize a completion of a given quality."""
46
+ rng = random.Random(seed)
47
+ action_str = rng.choice(list(ActionType)).value.upper()
48
+ s, m, w = rng.randint(0, 9), rng.randint(0, 9), rng.randint(0, 9)
49
+
50
+ if kind == "perfect":
51
+ true = profile_to_belief_vector(sample_profile(seed))
52
+ s, m, w = round(true[0] * 9), round(true[1] * 9), round(true[2] * 9)
53
+ return f"{s} {m} {w} {action_str}"
54
+ if kind == "good":
55
+ return f"{s} {m} {w} {action_str}"
56
+ if kind == "action_only":
57
+ return action_str
58
+ if kind == "wrong_belief":
59
+ true = profile_to_belief_vector(sample_profile(seed))
60
+ s = round((1 - true[0]) * 9)
61
+ m = round((1 - true[1]) * 9)
62
+ w = round((1 - true[2]) * 9)
63
+ return f"{s} {m} {w} {action_str}"
64
+ if kind == "garbage":
65
+ return "I don't know what to do here"
66
+ return action_str
67
+
68
+
69
+ # ---------------------------------------------------------------------------
70
+ # Dataset shape
71
+ # ---------------------------------------------------------------------------
72
+
73
+
74
+ def test_dataset_is_non_empty(small_dataset):
75
+ assert len(small_dataset) > 0
76
+
77
+
78
+ def test_dataset_rows_have_required_replay_columns(small_dataset):
79
+ expected = {"prompt", "seed", "step_index", "action_history", "profile_mode"}
80
+ for row in small_dataset:
81
+ missing = expected - row.keys()
82
+ assert not missing, f"row missing columns: {missing}"
83
+
84
+
85
+ def test_dataset_prompts_are_chat_messages(small_dataset):
86
+ for row in small_dataset[:5]:
87
+ msgs = row["prompt"]
88
+ assert isinstance(msgs, list) and len(msgs) == 2
89
+ assert msgs[0]["role"] == "system"
90
+ assert msgs[1]["role"] == "user"
91
+
92
+
93
+ # ---------------------------------------------------------------------------
94
+ # Parser
95
+ # ---------------------------------------------------------------------------
96
+
97
+
98
+ def test_parser_belief_first_format():
99
+ action, belief, provided = extract_action_and_belief("3 8 7 DEEP_WORK")
100
+ assert action == ActionType.DEEP_WORK
101
+ assert belief == pytest.approx([3 / 9, 8 / 9, 7 / 9], abs=1e-3)
102
+ assert provided is True
103
+
104
+
105
+ def test_parser_action_only_returns_default_belief():
106
+ action, belief, provided = extract_action_and_belief("DEEP_WORK")
107
+ assert action == ActionType.DEEP_WORK
108
+ assert belief == [0.5, 0.5, 0.5]
109
+ assert provided is False
110
+
111
+
112
+ def test_parser_garbage_returns_none():
113
+ action, _, _ = extract_action_and_belief("I don't know what to do here")
114
+ assert action is None
115
+
116
+
117
+ # ---------------------------------------------------------------------------
118
+ # Reward layers run end-to-end on synth completions
119
+ # ---------------------------------------------------------------------------
120
+
121
+
122
+ @pytest.fixture(scope="module")
123
+ def replay_columns(small_dataset):
124
+ # Skip step_index=0 samples — belief_accuracy intentionally returns 0
125
+ # at step 0 (no info yet), which would mask gradient checks.
126
+ sub = [s for s in small_dataset if s["step_index"] > 0][:30]
127
+ return {
128
+ "samples": sub,
129
+ "seed": [s["seed"] for s in sub],
130
+ "history": [s["action_history"] for s in sub],
131
+ "mode": [s["profile_mode"] for s in sub],
132
+ "step_index": [s["step_index"] for s in sub],
133
+ }
134
+
135
+
136
+ @pytest.mark.parametrize("kind", ["perfect", "good", "action_only", "wrong_belief", "garbage"])
137
+ def test_all_reward_layers_return_floats(kind, replay_columns):
138
+ sub = replay_columns["samples"]
139
+ completions = [[{"content": _synth_completion(s["seed"], kind)}] for s in sub]
140
+
141
+ f = format_valid(completions)
142
+ l = action_legal(completions)
143
+ e = env_reward(
144
+ completions,
145
+ seed=replay_columns["seed"],
146
+ action_history=replay_columns["history"],
147
+ profile_mode=replay_columns["mode"],
148
+ step_index=replay_columns["step_index"],
149
+ )
150
+ b = belief_accuracy(
151
+ completions,
152
+ seed=replay_columns["seed"],
153
+ action_history=replay_columns["history"],
154
+ profile_mode=replay_columns["mode"],
155
+ step_index=replay_columns["step_index"],
156
+ )
157
+
158
+ for layer_name, scores in [("format_valid", f), ("action_legal", l), ("env_reward", e), ("belief_accuracy", b)]:
159
+ assert len(scores) == len(completions), f"{layer_name} length mismatch"
160
+ for s in scores:
161
+ assert isinstance(s, float), f"{layer_name} returned non-float: {type(s)}"
162
+
163
+
164
+ # ---------------------------------------------------------------------------
165
+ # Reward layers DISCRIMINATE between completion qualities (anti-mode-collapse)
166
+ # ---------------------------------------------------------------------------
167
+
168
+
169
+ def test_format_valid_discriminates_belief_vs_action_only(replay_columns):
170
+ """format_valid must reward belief+action higher than action-only."""
171
+ sub = replay_columns["samples"]
172
+ good = [[{"content": _synth_completion(s["seed"], "good")}] for s in sub]
173
+ action_only = [[{"content": _synth_completion(s["seed"], "action_only")}] for s in sub]
174
+
175
+ good_avg = sum(format_valid(good)) / len(good)
176
+ action_only_avg = sum(format_valid(action_only)) / len(action_only)
177
+ assert good_avg > action_only_avg, (
178
+ f"format_valid did not push toward belief output: "
179
+ f"good={good_avg:.3f} action_only={action_only_avg:.3f}"
180
+ )
181
+
182
+
183
+ def test_belief_accuracy_discriminates_perfect_vs_wrong(replay_columns):
184
+ """The whole point of the meta-RL signal: better belief β†’ higher reward."""
185
+ sub = replay_columns["samples"]
186
+ perfect = [[{"content": _synth_completion(s["seed"], "perfect")}] for s in sub]
187
+ wrong = [[{"content": _synth_completion(s["seed"], "wrong_belief")}] for s in sub]
188
+
189
+ perfect_avg = sum(belief_accuracy(
190
+ perfect,
191
+ seed=replay_columns["seed"],
192
+ action_history=replay_columns["history"],
193
+ profile_mode=replay_columns["mode"],
194
+ step_index=replay_columns["step_index"],
195
+ )) / len(perfect)
196
+ wrong_avg = sum(belief_accuracy(
197
+ wrong,
198
+ seed=replay_columns["seed"],
199
+ action_history=replay_columns["history"],
200
+ profile_mode=replay_columns["mode"],
201
+ step_index=replay_columns["step_index"],
202
+ )) / len(wrong)
203
+
204
+ assert perfect_avg > wrong_avg, (
205
+ f"belief_accuracy gave no gradient: perfect={perfect_avg:.3f} wrong={wrong_avg:.3f}"
206
+ )
207
+
208
+
209
+ def test_no_reward_layer_is_constant_across_kinds(replay_columns):
210
+ """The iter-1 collapse trap: a reward layer returning the same value for every
211
+ completion in a GRPO group contributes zero to advantage. At least one of
212
+ {good, action_only, garbage, wrong_belief} must produce a different mean
213
+ score from the others for each layer that's supposed to be a learning signal.
214
+ """
215
+ sub = replay_columns["samples"]
216
+ kinds = ["good", "action_only", "garbage", "wrong_belief"]
217
+ completions_by_kind = {
218
+ kind: [[{"content": _synth_completion(s["seed"], kind)}] for s in sub] for kind in kinds
219
+ }
220
+
221
+ # format_valid and belief_accuracy MUST discriminate.
222
+ f_means = [sum(format_valid(completions_by_kind[k])) / len(sub) for k in kinds]
223
+ assert max(f_means) - min(f_means) > 0.1, f"format_valid is near-constant: {f_means}"
224
+
225
+ b_means = [
226
+ sum(belief_accuracy(
227
+ completions_by_kind[k],
228
+ seed=replay_columns["seed"],
229
+ action_history=replay_columns["history"],
230
+ profile_mode=replay_columns["mode"],
231
+ step_index=replay_columns["step_index"],
232
+ )) / len(sub)
233
+ for k in kinds
234
+ ]
235
+ assert max(b_means) - min(b_means) > 0.05, f"belief_accuracy is near-constant: {b_means}"
tests/test_rhythm_env.py CHANGED
@@ -356,18 +356,15 @@ class TestEdgeCases:
356
 
357
  def test_state_exposes_profile(self, env):
358
  """State should include profile_name for debugging."""
359
- # Default mode: continuous profile (name like 'sampled_0')
360
  env.reset(seed=0)
361
  assert env.state.profile_name != ""
362
  assert env.state.profile_name.startswith("sampled_")
363
 
364
- # Discrete mode: name from hardcoded PROFILES list
365
- env.reset(seed=0, profile_mode="discrete")
366
- assert env.state.profile_name in [p["name"] for p in PROFILES]
367
-
368
- # Explicit profile: name matches the requested profile
369
  env.reset(seed=0, profile="workaholic_stoic")
370
  assert env.state.profile_name == "workaholic_stoic"
 
371
 
372
  def test_all_action_types_valid(self, env):
373
  """Every ActionType should be processable without error."""
@@ -377,3 +374,98 @@ class TestEdgeCases:
377
  e.reset(seed=0)
378
  obs = e.step(make_action(action_type))
379
  assert isinstance(obs, RhythmObservation)
 
 
 
356
 
357
  def test_state_exposes_profile(self, env):
358
  """State should include profile_name for debugging."""
359
+ # Default: continuous profile (name like 'sampled_0')
360
  env.reset(seed=0)
361
  assert env.state.profile_name != ""
362
  assert env.state.profile_name.startswith("sampled_")
363
 
364
+ # Explicit profile: name matches the requested reference profile
 
 
 
 
365
  env.reset(seed=0, profile="workaholic_stoic")
366
  assert env.state.profile_name == "workaholic_stoic"
367
+ assert env.state.profile_name in [p["name"] for p in PROFILES]
368
 
369
  def test_all_action_types_valid(self, env):
370
  """Every ActionType should be processable without error."""
 
374
  e.reset(seed=0)
375
  obs = e.step(make_action(action_type))
376
  assert isinstance(obs, RhythmObservation)
377
+
378
+
379
+ # ---------------------------------------------------------------------------
380
+ # Belief-accuracy grader component
381
+ # ---------------------------------------------------------------------------
382
+
383
+
384
+ class TestBeliefAccuracyGrader:
385
+ """The grader awards 0.20 weight to belief_accuracy. Agents that don't
386
+ emit beliefs get 0 on this component; agents whose final belief matches
387
+ the true profile vector get up to 0.20 added to final_score.
388
+ """
389
+
390
+ def _run_episode_with_belief(self, seed, belief, profile=None):
391
+ env = RhythmEnvironment()
392
+ if profile:
393
+ obs = env.reset(seed=seed, profile=profile)
394
+ else:
395
+ obs = env.reset(seed=seed)
396
+ for _ in range(MAX_STEPS):
397
+ if obs.done:
398
+ break
399
+ env.record_belief(belief)
400
+ obs = env.step(make_action(ActionType.SLEEP))
401
+ return obs.reward_breakdown.get("final_score", 0.0)
402
+
403
+ def _run_episode_no_belief(self, seed, profile=None):
404
+ env = RhythmEnvironment()
405
+ if profile:
406
+ obs = env.reset(seed=seed, profile=profile)
407
+ else:
408
+ obs = env.reset(seed=seed)
409
+ for _ in range(MAX_STEPS):
410
+ if obs.done:
411
+ break
412
+ obs = env.step(make_action(ActionType.SLEEP))
413
+ return obs.reward_breakdown.get("final_score", 0.0)
414
+
415
+ def test_no_belief_means_zero_belief_component(self, env):
416
+ """Agent that never calls record_belief gets 0 on the belief component."""
417
+ score = self._run_episode_no_belief(seed=42)
418
+ # Without belief, max possible score is 0.80 (all weights ex belief).
419
+ # Realistic ceiling is much lower since SLEEP-only doesn't max meters.
420
+ assert score <= 0.80
421
+
422
+ def test_perfect_belief_lifts_score(self, env):
423
+ """An agent that emits the TRUE belief vector should score higher
424
+ than the same actions with no belief — by up to +0.20."""
425
+ # Use a known reference profile so we can hand-pick the perfect belief.
426
+ from server.rhythm_environment import (
427
+ PROFILE_MAP,
428
+ profile_to_belief_vector,
429
+ )
430
+ profile_name = "workaholic_stoic"
431
+ true_belief = profile_to_belief_vector(PROFILE_MAP[profile_name])
432
+
433
+ no_belief_score = self._run_episode_no_belief(seed=7, profile=profile_name)
434
+ perfect_score = self._run_episode_with_belief(
435
+ seed=7, belief=true_belief, profile=profile_name
436
+ )
437
+ # Perfect belief contributes 0.20 to final_score
438
+ assert perfect_score > no_belief_score
439
+ assert (perfect_score - no_belief_score) == pytest.approx(0.20, abs=0.01)
440
+
441
+ def test_wrong_belief_scores_less_than_perfect(self, env):
442
+ """Wrong belief still counts (0 ≀ score ≀ 1) but less than perfect."""
443
+ from server.rhythm_environment import (
444
+ PROFILE_MAP,
445
+ profile_to_belief_vector,
446
+ )
447
+ profile_name = "introvert_morning"
448
+ true_belief = profile_to_belief_vector(PROFILE_MAP[profile_name])
449
+ wrong_belief = [1.0 - b for b in true_belief] # opposite
450
+
451
+ perfect_score = self._run_episode_with_belief(
452
+ seed=7, belief=true_belief, profile=profile_name
453
+ )
454
+ wrong_score = self._run_episode_with_belief(
455
+ seed=7, belief=wrong_belief, profile=profile_name
456
+ )
457
+ assert perfect_score > wrong_score
458
+
459
+ def test_record_belief_validates_length(self, env):
460
+ env.reset(seed=0)
461
+ with pytest.raises(ValueError):
462
+ env.record_belief([0.5, 0.5]) # wrong length
463
+ with pytest.raises(ValueError):
464
+ env.record_belief([0.5, 0.5, 0.5, 0.5]) # too long
465
+
466
+ def test_record_belief_clamps_to_unit_interval(self, env):
467
+ """Beliefs outside [0, 1] should be clamped, not rejected."""
468
+ env.reset(seed=0)
469
+ env.record_belief([-0.5, 1.5, 0.5])
470
+ # Internal state should be clamped
471
+ assert env._final_belief == [0.0, 1.0, 0.5]
training/dataset.py CHANGED
@@ -1,17 +1,14 @@
1
  """
2
  Dataset generator for RhythmEnv GRPO training (meta-RL version).
3
 
4
- Generates observation prompts by playing episodes with random/heuristic
5
- strategies, capturing diverse states across continuously-sampled profiles.
6
-
7
- Key changes vs single-task version:
8
- - Default profile_mode is "continuous" (sampled profile per episode)
9
- - System prompt asks for ACTION_NAME + 3 belief digits
10
- - Curriculum support: a fraction of samples include profile hint in obs
11
- (training set up to show hint early, mask later, teaching the *skill*
12
- of using profile signals before forcing inference)
13
- - Dataset rows include seed, step_index, action_history, profile_mode
14
- so env_reward and belief_accuracy can replay deterministically
15
  """
16
 
17
  import sys
@@ -26,37 +23,53 @@ from server.rhythm_environment import RhythmEnvironment, MAX_STEPS, METERS
26
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
27
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
28
 
29
- SYSTEM_PROMPT = (
30
- "You are a life-management agent helping a person whose preferences are HIDDEN.\n"
31
- "You see 5 life meters and a rolling 7-step history. The same action affects\n"
32
- "different people differently β€” you must INFER who you're helping from the\n"
33
- "rewards and meter changes you observe.\n\n"
34
- "Each step, output ONE LINE in this exact format:\n"
35
- " S M W ACTION_NAME\n\n"
36
- "First, write your belief about the person as 3 digits (0-9):\n"
37
- " S = social preference (0=hates being social, 9=loves being social)\n"
38
- " M = morning preference (0=night owl, 9=morning person)\n"
39
- " W = work preference (0=avoids work, 9=workaholic)\n\n"
40
- "Then choose the action that BEST FITS that belief, from:\n"
41
- " DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE,\n"
42
- " FAMILY_TIME, SOCIALIZE, ME_TIME, BINGE_WATCH\n\n"
43
- "Example: 3 8 7 DEEP_WORK (this person is moderately introverted, strongly\n"
44
- "morning-oriented, fairly work-driven β€” so deep work in the morning fits)\n\n"
45
- "Belief-action coupling guide:\n"
46
- "- High S (extrovert): SOCIALIZE, FAMILY_TIME boost connection cheaply\n"
47
- "- High M (morning person): DEEP_WORK in early slots gets bonus cognition\n"
48
- "- High W (workaholic): DEEP_WORK, LEARN drive progress AND may energize\n"
49
- "- Low S (introvert): MEDITATE, ME_TIME for solo recharge; avoid SOCIALIZE\n"
50
- "- Low M (night owl): DEEP_WORK in evening/night slots\n\n"
51
- "Tactics:\n"
52
- "- Early in the week, PROBE varied actions to gather information.\n"
53
- "- Update your belief from the rewards you see.\n"
54
- "- Late in the week, EXPLOIT β€” pick actions matching your sharpened belief.\n"
55
- "- Don't repeat the same action excessively; you'll get a repetition penalty.\n"
56
- "- Watch for crashes: any meter under 0.1 = big penalty.\n"
57
- "- Connection decays passively β€” actively maintain it via SOCIALIZE/FAMILY_TIME.\n"
58
- "Respond with ONLY the format line, no other text."
59
- )
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
60
 
61
 
62
  def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
@@ -72,10 +85,10 @@ def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
72
 
73
  history_lines = []
74
  for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
75
- # Iter 4 fix: include ANOMALIES (actual_delta - expected_under_neutral_profile).
76
- # Anomalies are the cleanest profile-inference signal: they show how
77
- # this person's response DEVIATES from average. Previously the env
78
- # computed these but never exposed them to the agent.
79
  anom_str = (
80
  f" [anom V{h.vitality_anomaly:+.2f} C{h.cognition_anomaly:+.2f} "
81
  f"P{h.progress_anomaly:+.2f} S{h.serenity_anomaly:+.2f} "
@@ -170,7 +183,7 @@ def generate_episode_samples(
170
  if strategy == "random":
171
  action_type = rng.choice(all_actions)
172
  elif strategy == "heuristic":
173
- action_type = _heuristic_action(obs)
174
  else:
175
  action_type = rng.choice(all_actions)
176
 
@@ -181,8 +194,12 @@ def generate_episode_samples(
181
  return samples
182
 
183
 
184
- def _heuristic_action(obs) -> ActionType:
185
- """Simple heuristic for diverse state generation."""
 
 
 
 
186
  slot = obs.slot
187
  v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection
188
 
 
1
  """
2
  Dataset generator for RhythmEnv GRPO training (meta-RL version).
3
 
4
+ Plays episodes under a continuously-sampled profile per seed and emits
5
+ observation prompts at each step, paired with the replay metadata
6
+ (seed, step_index, action_history) the reward functions need to
7
+ reconstruct env state deterministically.
8
+
9
+ The system prompt asks for "S M W ACTION_NAME" — three belief digits then
10
+ the action. A `hint_fraction` slice of episodes carries a true-belief hint
11
+ in the prompt as a curriculum warmup; the rest force pure inference.
 
 
 
12
  """
13
 
14
  import sys
 
23
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
24
  DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
25
 
26
+ SYSTEM_PROMPT = """You are a life-management agent helping a person whose preferences are HIDDEN.
27
+ You see 5 life meters and a rolling history of recent steps. The same action
28
+ affects different people differently — you must INFER who you're helping from
29
+ rewards, meter changes, and per-meter ANOMALY signals.
30
+
31
+ Each step, do TWO things:
32
+
33
+ 1. Reason briefly about what the observations imply about the person.
34
+ Focus on:
35
+ - Anomalies (actual delta vs neutral-profile expectation): big positive
36
+ social_serenity / connection responses → high S; big morning cognition
37
+ gains → high M; productive work giving vitality back → high W
38
+ - Current meter state: any meter under 0.15 needs urgent recovery
39
+ - What action best fits BOTH the inferred profile and the current state
40
+
41
+ 2. Output your final answer on the LAST line in this exact format:
42
+ S M W ACTION_NAME
43
+ where S, M, W are belief digits 0-9 (0=low, 9=high) representing your best
44
+ estimate of social_pref, morning_pref, work_pref. ACTION_NAME is one of:
45
+ DEEP_WORK, ADMIN_WORK, LEARN, SLEEP, EXERCISE, MEDITATE, FAMILY_TIME,
46
+ SOCIALIZE, ME_TIME, BINGE_WATCH
47
+
48
+ Wrap your reasoning in <reasoning>...</reasoning> tags. Keep reasoning under
49
+ 120 tokens. The final answer line MUST be the last line of your response.
50
+
51
+ Belief→action quick reference:
52
+ - High S (extrovert): SOCIALIZE, FAMILY_TIME boost connection cheaply
53
+ - High M (morning person): DEEP_WORK / LEARN in early slots gets bonus cognition
54
+ - High W (workaholic): DEEP_WORK, LEARN drive progress and may energize
55
+ - Low S (introvert): MEDITATE, ME_TIME for solo recharge; avoid SOCIALIZE
56
+ - Low M (night owl): DEEP_WORK / LEARN in evening/night slots
57
+ - Watch crashes: any meter under 0.10 = -0.30 penalty per crashed meter
58
+ - Connection decays passively — actively maintain via SOCIALIZE/FAMILY_TIME
59
+ - Don't repeat the same action 3+ times in a row — repetition penalty applies
60
+
61
+ Strategy: probe varied actions in the first ~5 steps to gather profile evidence,
62
+ then exploit your sharpened belief by picking actions that match the inferred
63
+ profile + current meter state.
64
+
65
+ Example output:
66
+ <reasoning>
67
+ Last step's socialize gave V-0.12 (anom -0.06, much worse than neutral) — high
68
+ social drain, suggests low S. Morning DEEP_WORK earlier gave bonus cognition
69
+ (anom +0.04) → high M. Vitality at 0.6 still ok, serenity dropping. With low S +
70
+ high M, MEDITATE is the recovery play that fits.
71
+ </reasoning>
72
+ 2 8 5 MEDITATE"""
73
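How a dataset row's prompt column is assembled from this prompt (consistent with test_dataset_prompts_are_chat_messages in the new smoke tests; the exact wiring lives in generate_episode_samples, not shown in this hunk):

    hint = env.get_profile_hint() if show_profile_hint else None   # curriculum warmup slice only
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": format_observation_prompt(obs, profile_hint=hint)},
    ]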
 
74
 
75
  def format_observation_prompt(obs, profile_hint: dict | None = None) -> str:
 
85
 
86
  history_lines = []
87
  for h in (obs.step_history or [])[-5:]: # last 5 only to fit prompt budget
88
+ # Per-meter anomalies (actual_delta - expected_under_neutral_profile)
89
+ # are the cleanest profile-inference signal — they show how this person's
90
+ # response DEVIATES from the average person. Surfacing them here in the
91
+ # prompt is what gives the agent a fingerprint to learn from.
92
  anom_str = (
93
  f" [anom V{h.vitality_anomaly:+.2f} C{h.cognition_anomaly:+.2f} "
94
  f"P{h.progress_anomaly:+.2f} S{h.serenity_anomaly:+.2f} "
 
183
  if strategy == "random":
184
  action_type = rng.choice(all_actions)
185
  elif strategy == "heuristic":
186
+ action_type = heuristic_action(obs)
187
  else:
188
  action_type = rng.choice(all_actions)
189
 
 
194
  return samples
195
 
196
 
197
+ def heuristic_action(obs) -> ActionType:
198
+ """Priority-based heuristic baseline (profile-blind).
199
+
200
+ Used both during dataset generation (to roll out diverse states) and
201
+ by inference_eval as the heuristic baseline strategy.
202
+ """
203
  slot = obs.slot
204
  v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection
205
 
training/inference_eval.py CHANGED
@@ -2,8 +2,8 @@
2
  RhythmEnv Inference Evaluation β€” Baseline vs Trained, with meta-RL eval suite.
3
 
4
  Three evaluation conditions:
5
- 1. discrete-3-profiles: Original 3 hardcoded profiles for legacy comparison
6
- (compare against eval_results_v1.json from the single-task version)
7
  2. continuous-in-distribution: Sampled profiles from the training distribution
8
  (was the agent able to learn the meta-policy?)
9
  3. continuous-OOD: Profiles from a held-out region of the parameter space
@@ -28,6 +28,7 @@ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
28
 
29
  from models import ActionType, RhythmAction
30
  from server.rhythm_environment import RhythmEnvironment, MAX_STEPS, sample_profile, profile_to_belief_vector
 
31
 
32
  DISCRETE_PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
33
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
@@ -39,40 +40,14 @@ IN_DIST_SEEDS_DEFAULT = list(range(100, 110)) # 10 unseen-by-training in-distr
39
  OOD_SEEDS_DEFAULT = list(range(10000, 10010)) # 10 OOD seeds
40
 
41
 
42
- def heuristic_action(obs) -> ActionType:
43
- """Priority-based heuristic baseline (profile-blind)."""
44
- slot = obs.slot
45
- v, c, p, s, cn = obs.vitality, obs.cognition, obs.progress, obs.serenity, obs.connection
46
-
47
- if v < 0.15:
48
- return ActionType.SLEEP
49
- if s < 0.15:
50
- return ActionType.MEDITATE
51
- if cn < 0.15:
52
- return ActionType.FAMILY_TIME
53
- if slot == 3:
54
- return ActionType.SLEEP
55
- if slot == 0:
56
- return ActionType.DEEP_WORK if (v > 0.4 and c > 0.3) else ActionType.EXERCISE
57
- if slot == 1:
58
- if cn < 0.3:
59
- return ActionType.FAMILY_TIME
60
- if p < 0.3 and v > 0.3:
61
- return ActionType.LEARN
62
- return ActionType.ADMIN_WORK
63
- if cn < 0.4:
64
- return ActionType.SOCIALIZE
65
- if s < 0.5:
66
- return ActionType.ME_TIME
67
- return ActionType.MEDITATE
68
-
69
-
70
  def random_action(rng) -> ActionType:
71
  return rng.choice(list(ActionType))
72
 
73
 
74
  def model_action(obs, model, tokenizer, return_belief: bool = False):
75
  """Get action (and optionally belief) from trained model."""
 
 
76
  from training.dataset import format_observation_prompt, SYSTEM_PROMPT
77
  from training.reward_functions import extract_action_and_belief
78
 
@@ -129,6 +104,10 @@ def run_episode(
129
  elif strategy == "model" and model is not None:
130
  action_type, belief = model_action(obs, model, tokenizer, return_belief=True)
131
  beliefs_seen.append(belief)
 
 
 
 
132
  else:
133
  action_type = random_action(rng)
134
 
@@ -249,13 +228,14 @@ def main():
249
 
250
  all_results = []
251
 
252
- # Condition 1: Discrete 3 hardcoded profiles (legacy comparison)
 
253
  discrete_runs = [
254
- {"seed": ep, "profile": p, "profile_mode": "discrete"}
255
  for p in DISCRETE_PROFILES for ep in range(args.num_episodes)
256
  ]
257
  all_results += eval_condition(
258
- "discrete-3-profiles (legacy)",
259
  strategies, discrete_runs,
260
  model=model, tokenizer=tokenizer,
261
  )
@@ -276,7 +256,7 @@ def main():
276
  model=model, tokenizer=tokenizer,
277
  )
278
 
279
- # Per-profile breakdown for discrete (preserves legacy table format)
280
  print(f"\n{'=' * 70}")
281
  print("DISCRETE-3-PROFILE BREAKDOWN")
282
  print(f"{'=' * 70}")
@@ -285,7 +265,7 @@ def main():
285
  print(f"{s:>10}", end="")
286
  print()
287
  print("-" * 70)
288
- discrete = [r for r in all_results if r["condition"] == "discrete-3-profiles (legacy)"]
289
  for profile in DISCRETE_PROFILES:
290
  row = f"{profile:<25} "
291
  for s in strategies:
 
2
  RhythmEnv Inference Evaluation β€” Baseline vs Trained, with meta-RL eval suite.
3
 
4
  Three evaluation conditions:
5
+ 1. discrete-3-profiles: 3 hardcoded reference profiles. A sanity check
6
+ that the meta-trained agent still handles the original named profiles.
7
  2. continuous-in-distribution: Sampled profiles from the training distribution
8
  (was the agent able to learn the meta-policy?)
9
  3. continuous-OOD: Profiles from a held-out region of the parameter space
 
28
 
29
  from models import ActionType, RhythmAction
30
  from server.rhythm_environment import RhythmEnvironment, MAX_STEPS, sample_profile, profile_to_belief_vector
31
+ from training.dataset import heuristic_action
32
 
33
  DISCRETE_PROFILES = ["introvert_morning", "extrovert_night_owl", "workaholic_stoic"]
34
  SLOT_NAMES = ["Morning", "Afternoon", "Evening", "Night"]
 
40
  OOD_SEEDS_DEFAULT = list(range(10000, 10010)) # 10 OOD seeds
41
 
42
 
 
 
43
  def random_action(rng) -> ActionType:
44
  return rng.choice(list(ActionType))
45
 
46
 
47
  def model_action(obs, model, tokenizer, return_belief: bool = False):
48
  """Get action (and optionally belief) from trained model."""
49
+ # Lazy imports: keep the heavy training-stack imports out of module load
50
+ # so this script can run in baseline-only mode without unsloth/transformers.
51
  from training.dataset import format_observation_prompt, SYSTEM_PROMPT
52
  from training.reward_functions import extract_action_and_belief
53
 
 
104
  elif strategy == "model" and model is not None:
105
  action_type, belief = model_action(obs, model, tokenizer, return_belief=True)
106
  beliefs_seen.append(belief)
107
+ # Tell the env about the emitted belief so the grader can score
108
+ # belief_accuracy. Heuristic / random skip this — they get 0 on
109
+ # the belief component, by design.
110
+ env.record_belief(belief)
111
  else:
112
  action_type = random_action(rng)
113
 
 
228
 
229
  all_results = []
230
 
231
+ # Condition 1: 3 hardcoded reference profiles (sanity-check the agent
232
+ # still handles the named profiles — no longer the primary eval signal).
233
  discrete_runs = [
234
+ {"seed": ep, "profile": p}
235
  for p in DISCRETE_PROFILES for ep in range(args.num_episodes)
236
  ]
237
  all_results += eval_condition(
238
+ "discrete-3-profiles",
239
  strategies, discrete_runs,
240
  model=model, tokenizer=tokenizer,
241
  )
 
256
  model=model, tokenizer=tokenizer,
257
  )
258
 
259
+ # Per-profile breakdown for the 3 reference profiles
260
  print(f"\n{'=' * 70}")
261
  print("DISCRETE-3-PROFILE BREAKDOWN")
262
  print(f"{'=' * 70}")
 
265
  print(f"{s:>10}", end="")
266
  print()
267
  print("-" * 70)
268
+ discrete = [r for r in all_results if r["condition"] == "discrete-3-profiles"]
269
  for profile in DISCRETE_PROFILES:
270
  row = f"{profile:<25} "
271
  for s in strategies:
training/reward_functions.py CHANGED
@@ -4,32 +4,27 @@ Reward functions for RhythmEnv GRPO training (meta-RL version).
4
  Four-layer reward stack:
5
  1. format_valid β€” does the LLM output have a parseable belief + action format?
6
  2. action_legal β€” is the action one of the 10 valid actions?
7
- 3. env_reward β€” actual environment reward (seed-replay) for the chosen action
8
- plus diversity/exploration shaping
9
  4. belief_accuracy β€” how close is the belief vector to the hidden profile's true vector?
10
 
11
- ITER 3 FORMAT (belief-first): "S M W ACTION_NAME"
12
  - S, M, W: single digits 0-9 representing the agent's belief about the user
13
  S = social preference (0=hates social, 9=loves social)
14
  M = morning preference (0=night owl, 9=morning person)
15
  W = work preference (0=avoids work, 9=workaholic)
16
  - ACTION_NAME: one of 10 valid actions
17
 
18
- Example: "3 8 7 DEEP_WORK"
19
- β†’ belief=[0.33, 0.89, 0.78], action=DEEP_WORK
20
 
21
- Why belief-first: in causal LM generation, tokens generated EARLIER condition
22
- LATER tokens. With "S M W ACTION", the model commits to a belief first, and
23
- the action is then conditioned on that belief β€” making the belief functionally
24
- useful for action selection. The previous "ACTION S M W" order made belief a
25
- post-hoc afterthought that didn't influence behavior.
26
-
27
- The parser ALSO accepts the old "ACTION S M W" format for backward compatibility.
28
 
29
  Each function returns a list of floats (one per completion).
30
  """
31
 
32
- import math
33
  import os
34
  import re
35
  import sys
@@ -46,95 +41,57 @@ VALID_ACTIONS = {at.value.upper(): at for at in ActionType}
46
  DEFAULT_BELIEF = [0.5, 0.5, 0.5]
47
 
48
 
 
 
 
 
 
 
49
  def extract_action_and_belief(text: str) -> tuple[ActionType | None, list[float], bool]:
50
- """Parse 'ACTION_NAME S M W' format.
 
 
 
 
 
 
 
 
 
51
 
52
  Returns:
53
  (action, belief, belief_provided):
54
- action: parsed ActionType or None
55
- belief: 3-dim vector in [0, 1], DEFAULT_BELIEF if not parseable
56
- belief_provided: True iff at least one belief digit was parsed from output
57
  """
58
  if not text:
59
  return None, list(DEFAULT_BELIEF), False
60
 
61
- line = text.strip().split("\n")[0].strip()
62
- if not line:
63
- return None, list(DEFAULT_BELIEF), False
64
-
65
- # Normalize: uppercase, comma β†’ space
66
- parts = line.upper().replace(",", " ").split()
67
- if not parts:
68
- return None, list(DEFAULT_BELIEF), False
69
-
70
- # Find action and its index in parts
71
- action: ActionType | None = None
72
- action_idx = -1
73
- for idx, p in enumerate(parts):
74
- if p in VALID_ACTIONS:
75
- action = VALID_ACTIONS[p]
76
- action_idx = idx
77
- break
78
- if action is None:
79
- for idx, p in enumerate(parts):
80
- for name, at in VALID_ACTIONS.items():
81
- if name in p:
82
- action = at
83
- action_idx = idx
84
- break
85
- if action is not None:
86
- break
87
-
88
- # Iter 3: parse belief from BEFORE the action (belief-first format).
89
- # Falls back to AFTER the action (legacy format) if no digits found before.
90
- def _parse_digit(token: str) -> float | None:
91
- token = token.strip().rstrip(".")
92
- if not token:
93
- return None
94
  try:
95
- if len(token) == 1 and token.isdigit():
96
- return int(token) / 9.0
97
- val = float(token)
98
- if val > 1.0:
99
- val = val / 9.0
100
- return max(0.0, min(1.0, val))
101
- except (ValueError, IndexError):
102
- return None
103
-
104
- belief: list[float] = []
105
- belief_provided = False
106
-
107
- if action_idx >= 0:
108
- # Try belief-first: 3 digits BEFORE the action
109
- if action_idx >= 3:
110
- cand = [_parse_digit(parts[action_idx - 3 + i]) for i in range(3)]
111
- if all(c is not None for c in cand):
112
- belief = cand # type: ignore[assignment]
113
- belief_provided = True
114
-
115
- # If belief-first didn't work, try legacy after-action format
116
- if not belief_provided:
117
- after_belief: list[float] = []
118
- after_provided = False
119
- for i in range(3):
120
- j = action_idx + 1 + i
121
- if j < len(parts):
122
- d = _parse_digit(parts[j])
123
- if d is not None:
124
- after_belief.append(d)
125
- after_provided = True
126
- else:
127
- after_belief.append(0.5)
128
- else:
129
- after_belief.append(0.5)
130
- if after_provided:
131
- belief = after_belief
132
- belief_provided = True
133
-
134
- if not belief or len(belief) != 3:
135
- belief = list(DEFAULT_BELIEF)
136
-
137
- return action, belief, belief_provided
138
 
139
 
140
  def extract_action(text: str) -> ActionType | None:
@@ -168,10 +125,11 @@ def action_legal(completions, **kwargs) -> list[float]:
168
  """
169
  Layer 2: Is the parsed action one of the 10 valid actions?
170
 
171
- All 10 actions are always legal in this env (no state-dependent validity).
172
- Iter 2 fix: returns 0 for valid (was +0.5) so this layer becomes pure penalty
173
- for malformed outputs. The +0.5 was a constant reward that contributed zero
174
- to GRPO advantage and helped trigger mode collapse in iter 1.
 
175
  """
176
  scores = []
177
  for completion in completions:
@@ -230,8 +188,9 @@ def env_reward(
230
  ep_history = prompt_data.get("action_history", [])
231
  ep_mode = prompt_data.get("profile_mode", "continuous")
232
  else:
233
- # Iter 2 fix: mix index with prime to break deterministic seed clusters
234
- # (avoids all completions in a position-class getting identical env_reward)
 
235
  ep_seed = (i * 17) ^ 0xBEEF
236
  ep_history = []
237
  ep_mode = "continuous"
@@ -245,36 +204,38 @@ def env_reward(
245
  reward = obs.reward
246
  chosen = action_type.value
247
 
248
- # Iter 4 fix (Issue 11): grader-aligned bias moved here from env.
249
- # Per-step env reward stays pure (drives belief inference); the
250
- # bias only shapes the GRPO-visible training reward.
 
251
  progress_delta = env._progress - pre_progress
252
  connection_delta = env._connection - pre_connection
253
  reward += 0.5 * progress_delta + 0.4 * connection_delta
254
 
255
- # Iter 4 fix (Issue 3): SCALED-DOWN shaping. Iter 3 had
256
- # -0.3/-0.4/+0.2 which dominated the Β±0.5 env signal. Now
257
- # roughly 1/3 of original magnitudes β€” nudges, not overrides.
 
 
258
  if ep_history and len(ep_history) >= 2:
259
  recent3 = ep_history[-3:]
260
  if recent3.count(chosen) >= 2:
261
- reward -= 0.10 # was -0.3
262
 
263
  if ep_history and len(ep_history) >= 5:
264
  last6 = ep_history[-5:] + [chosen]
265
  if len(set(last6)) <= 2:
266
- reward -= 0.15 # was -0.4
267
 
268
  if ep_history is not None:
269
  seen = set(ep_history)
270
  if chosen not in seen and len(seen) < 6:
271
- reward += 0.07 # was +0.2
272
 
273
- # Iter 4 fix (Issue 10): BELIEF-ACTION COUPLING reward.
274
- # Parse the agent's emitted belief and reward consistency between
275
- # belief and action choice. Without this, the belief-first format
276
- # only enforces consistency via causal attention (weak); now there's
277
- # an explicit gradient signal.
278
  _, b, b_provided = extract_action_and_belief(response)
279
  if b_provided:
280
  s_pref, m_pref, w_pref = b
@@ -308,7 +269,6 @@ def env_reward(
308
 
309
  def belief_accuracy(
310
  completions,
311
- prompts=None,
312
  seed=None,
313
  step_index=None,
314
  action_history=None,
@@ -316,21 +276,17 @@ def belief_accuracy(
316
  **kwargs,
317
  ) -> list[float]:
318
  """
319
- Layer 4: Belief-vector accuracy reward (META-LEARNING signal).
320
 
321
- ITER 4 FIX (Issue 4 from external review): The previous formula
322
- `(1 - mae) - 0.5` gave constant emission "5 5 5" a free +0.336 reward
323
- per step (Γ— 3.0 weight = +1.0 per step = +28 per episode for ZERO learning).
324
- This recreated the iter-1 collapse mechanism in disguise.
 
 
325
 
326
- New formula: subtract the per-profile baseline. The baseline is what a
327
- constant 0.5 emission WOULD score for THIS profile. Now:
328
- - Constant emission β†’ reward β‰ˆ 0 (no free reward)
329
- - Better-than-baseline belief β†’ positive
330
- - Worse-than-baseline belief β†’ negative
331
-
332
- Plus iter 4 (Issue 9): no belief reward at step 0 (no information available
333
- to commit a belief) β€” prevents pulling the policy toward a constant prior.
334
  """
335
  scores = []
336
  for i, completion in enumerate(completions):
@@ -351,8 +307,9 @@ def belief_accuracy(
351
  scores.append(0.0)
352
  continue
353
 
354
- # Iter 4 fix (Issue 9): step-0 commitment with no info biases toward
355
- # constant prior. Skip belief reward at step 0.
 
356
  if ep_step == 0:
357
  scores.append(0.0)
358
  continue
@@ -360,7 +317,6 @@ def belief_accuracy(
360
  try:
361
  env = _replay_env(ep_seed, ep_history, ep_mode)
362
  true_belief = env.get_belief_target()
363
- # Iter 4 fix (Issue 4): subtract the constant-baseline reward
364
  mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
365
  similarity = 1.0 - mae
366
  baseline_mae = sum(abs(0.5 - t) for t in true_belief) / 3.0
@@ -371,69 +327,3 @@ def belief_accuracy(
371
  scores.append(0.0)
372
 
373
  return scores
374
-
375
-
376
- def env_reward_simple(completions, prompts=None, **kwargs) -> list[float]:
377
- """
378
- State-aware heuristic reward (no env replay), used for fast smoke-training.
379
- Identical to original simple reward β€” the format change is handled by extract_action.
380
- """
381
- scores = []
382
- for i, completion in enumerate(completions):
383
- response = completion[0]["content"] if isinstance(completion, list) else completion
384
- action_type = extract_action(response)
385
-
386
- if action_type is None:
387
- scores.append(-2.0)
388
- continue
389
-
390
- prompt_text = ""
391
- if prompts and i < len(prompts):
392
- p = prompts[i]
393
- if isinstance(p, list):
394
- prompt_text = p[-1].get("content", "") if p else ""
395
- elif isinstance(p, dict):
396
- prompt_text = p.get("content", "")
397
- else:
398
- prompt_text = str(p)
399
-
400
- meters = {}
401
- for meter_name in ["Vitality", "Cognition", "Progress", "Serenity", "Connection"]:
402
- match = re.search(rf"{meter_name}:\s*([\d.]+)", prompt_text)
403
- if match:
404
- meters[meter_name.lower()] = float(match.group(1))
405
-
406
- is_morning = "Morning" in prompt_text
407
- is_night = "Night" in prompt_text
408
-
409
- score = 0.0
410
- v = meters.get("vitality", 0.5)
411
- s = meters.get("serenity", 0.5)
412
- cn = meters.get("connection", 0.5)
413
-
414
- if v < 0.2 and action_type in (ActionType.SLEEP, ActionType.EXERCISE):
415
- score += 0.5
416
- elif v < 0.2:
417
- score -= 0.3
418
-
419
- if s < 0.2 and action_type in (ActionType.MEDITATE, ActionType.ME_TIME):
420
- score += 0.5
421
- elif s < 0.2 and action_type != ActionType.SLEEP:
422
- score -= 0.2
423
-
424
- if cn < 0.2 and action_type in (ActionType.FAMILY_TIME, ActionType.SOCIALIZE):
425
- score += 0.5
426
- elif cn < 0.2:
427
- score -= 0.1
428
-
429
- if is_morning and action_type in (ActionType.DEEP_WORK, ActionType.LEARN):
430
- score += 0.2
431
- if is_night and action_type == ActionType.SLEEP:
432
- score += 0.3
433
-
434
- if action_type == ActionType.BINGE_WATCH:
435
- score -= 0.3
436
-
437
- scores.append(score)
438
-
439
- return scores
 
4
  Four-layer reward stack:
5
1. format_valid — does the LLM output have a parseable belief + action format?
6
  2. action_legal — is the action one of the 10 valid actions?
7
+ 3. env_reward — actual environment reward (seed-replay) for the chosen action,
8
+ plus grader-aligned bias and diversity/exploration shaping
9
  4. belief_accuracy — how close is the belief vector to the hidden profile's true vector?
10
 
11
+ Output format: "S M W ACTION_NAME" (belief first)
12
  - S, M, W: single digits 0-9 representing the agent's belief about the user
13
  S = social preference (0=hates social, 9=loves social)
14
  M = morning preference (0=night owl, 9=morning person)
15
  W = work preference (0=avoids work, 9=workaholic)
16
  - ACTION_NAME: one of 10 valid actions
17
 
18
+ Example: "3 8 7 DEEP_WORK" → belief=[0.33, 0.89, 0.78], action=DEEP_WORK
 
19
 
20
+ Belief-first matters because tokens generated earlier condition tokens generated
21
+ later in causal LMs — the action ends up causally conditioned on the belief, so
22
+ the belief is functionally useful for action selection rather than a post-hoc
23
+ afterthought. The parser also accepts a legacy "ACTION S M W" ordering as fallback.
 
 
 
24
 
25
  Each function returns a list of floats (one per completion).
26
  """
27
 
 
28
  import os
29
  import re
30
  import sys
 
41
  DEFAULT_BELIEF = [0.5, 0.5, 0.5]
42
 
43
 
44
+ _ANSWER_PATTERN = re.compile(
45
+ r"(\d)\s+(\d)\s+(\d)\s+(" + "|".join(at.value.upper() for at in ActionType) + r")\b",
46
+ re.IGNORECASE,
47
+ )
48
+
49
+
50
  def extract_action_and_belief(text: str) -> tuple[ActionType | None, list[float], bool]:
51
+ """Parse the agent's output and extract (action, belief, belief_provided).
52
+
53
+ Supports both formats:
54
+ - Plain answer line: "S M W ACTION_NAME"
55
+ - With CoT prefix: "<reasoning>...</reasoning>\nS M W ACTION_NAME"
56
+
57
+ Strategy: search the entire response for the LAST occurrence of the
58
+ "<digit> <digit> <digit> <ACTION>" pattern. Taking the last match handles
59
+ cases where the model mentions an example mid-reasoning then commits to
60
+ a different answer at the end.
61
 
62
  Returns:
63
  (action, belief, belief_provided):
64
+ action: parsed ActionType or None if no valid action anywhere
65
+ belief: 3-dim vector in [0, 1], DEFAULT_BELIEF if no belief parsed
66
+ belief_provided: True iff a belief was extracted (full S M W ACTION pattern)
67
  """
68
  if not text:
69
  return None, list(DEFAULT_BELIEF), False
70
 
71
+ # Primary path: full belief+action pattern, take the LAST occurrence.
72
+ matches = list(_ANSWER_PATTERN.finditer(text))
73
+ if matches:
74
+ last = matches[-1]
75
+ s, m, w, action_name = last.groups()
 
 
 
76
  try:
77
+ belief = [int(s) / 9.0, int(m) / 9.0, int(w) / 9.0]
78
+ action = ActionType(action_name.lower())
79
+ return action, belief, True
80
+ except (ValueError, KeyError):
81
+ pass # fall through to action-only fallback
82
+
83
+ # Fallback: action-only output (no belief digits). Search for any valid
84
+ # ACTION_NAME token in the response and return that with default belief.
85
+ for line in text.strip().split("\n"):
86
+ for token in line.upper().replace(",", " ").split():
87
+ token = token.strip(".")
88
+ if token in VALID_ACTIONS:
89
+ return VALID_ACTIONS[token], list(DEFAULT_BELIEF), False
90
+ for name, at in VALID_ACTIONS.items():
91
+ if name in token:
92
+ return at, list(DEFAULT_BELIEF), False
93
+
94
+ return None, list(DEFAULT_BELIEF), False
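Illustrative parses, matching the parser smoke tests (values rounded):

    extract_action_and_belief("3 8 7 DEEP_WORK")
    # -> (ActionType.DEEP_WORK, [0.33, 0.89, 0.78], True)

    extract_action_and_belief("<reasoning>5 5 5 SLEEP would be neutral, but the anomalies say otherwise.</reasoning>\n2 8 5 MEDITATE")
    # -> (ActionType.MEDITATE, [0.22, 0.89, 0.56], True)    # LAST match wins over the mid-reasoning mention

    extract_action_and_belief("DEEP_WORK")
    # -> (ActionType.DEEP_WORK, [0.5, 0.5, 0.5], False)     # action-only fallback, default belief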
 
 
 
 
95
 
96
 
97
  def extract_action(text: str) -> ActionType | None:
 
125
  """
126
  Layer 2: Is the parsed action one of the 10 valid actions?
127
 
128
+ All 10 actions are always legal in this env (no state-dependent validity),
129
+ so this layer is a pure penalty: 0 for any valid action, -1 for unparseable
130
+ output. Returning 0 (rather than a positive constant) is intentional —
131
+ a constant reward across all completions in a GRPO group contributes
132
+ exactly zero to the advantage and was a major contributor to mode collapse.
133
  """
134
  scores = []
135
  for completion in completions:
 
188
  ep_history = prompt_data.get("action_history", [])
189
  ep_mode = prompt_data.get("profile_mode", "continuous")
190
  else:
191
+ # Fallback seed mixes index with a prime to break deterministic
192
+ # seed clusters — without it, all completions in a position-class
193
+ # land on the same env_reward and contribute zero to GRPO advantage.
194
  ep_seed = (i * 17) ^ 0xBEEF
195
  ep_history = []
196
  ep_mode = "continuous"
 
204
  reward = obs.reward
205
  chosen = action_type.value
206
 
207
+ # Grader-aligned bias (progress + connection deltas): shapes only
208
+ # the GRPO-visible training reward, not the env's per-step reward.
209
+ # Lives here rather than in env.step() so that the env's per-step
210
+ # reward (used by adaptation_score in the grader) stays pure.
211
  progress_delta = env._progress - pre_progress
212
  connection_delta = env._connection - pre_connection
213
  reward += 0.5 * progress_delta + 0.4 * connection_delta
214
 
215
+ # Diversity shaping: small nudges (~1/3 the magnitude of the env signal)
216
+ # so they don't dominate it. Three terms:
217
+ # - repetition penalty (action ≥2× in last 3)
218
+ # - low-entropy window penalty (last 6 actions ≤2 unique)
219
+ # - new-action exploration bonus (until 6 distinct actions tried)
220
  if ep_history and len(ep_history) >= 2:
221
  recent3 = ep_history[-3:]
222
  if recent3.count(chosen) >= 2:
223
+ reward -= 0.10
224
 
225
  if ep_history and len(ep_history) >= 5:
226
  last6 = ep_history[-5:] + [chosen]
227
  if len(set(last6)) <= 2:
228
+ reward -= 0.15
229
 
230
  if ep_history is not None:
231
  seen = set(ep_history)
232
  if chosen not in seen and len(seen) < 6:
233
+ reward += 0.07
234
 
235
+ # Belief-action coupling: rewards consistency between the agent's
236
+ # emitted belief and its chosen action. Without this term, the
237
+ # belief-first format only enforces consistency via causal attention
238
+ # (weak signal); this provides an explicit gradient.
 
239
  _, b, b_provided = extract_action_and_belief(response)
240
  if b_provided:
241
  s_pref, m_pref, w_pref = b
 
269
 
270
  def belief_accuracy(
271
  completions,
 
272
  seed=None,
273
  step_index=None,
274
  action_history=None,
 
276
  **kwargs,
277
  ) -> list[float]:
278
  """
279
+ Layer 4: Belief-vector accuracy reward (the meta-learning signal).
280
 
281
+ Reward = (1 - belief_mae) - constant_baseline_similarity, where the
282
+ constant baseline is what a "5 5 5" emission would score for THIS profile.
283
+ Subtracting the baseline matters: without it, a constant neutral emission
284
+ earns positive reward on every step (~+0.34 × the layer weight) for zero
285
+ actual learning, which silently re-creates the iter-1 mode-collapse pull.
286
+ With it: constant emission ≈ 0, better-than-baseline > 0, worse < 0.
287
 
288
+ Belief reward is also skipped at step 0 — the agent has no information yet,
289
+ so penalizing belief-vs-target there just biases toward a constant prior.
 
 
 
 
 
 
290
  """
291
  scores = []
292
  for i, completion in enumerate(completions):
 
307
  scores.append(0.0)
308
  continue
309
 
310
+ # Step 0: agent has no information yet — committing a belief here
311
+ # would just pull the policy toward whatever constant prior the
312
+ # base model emits. Skip the reward for that step.
313
  if ep_step == 0:
314
  scores.append(0.0)
315
  continue
 
317
  try:
318
  env = _replay_env(ep_seed, ep_history, ep_mode)
319
  true_belief = env.get_belief_target()
 
320
  mae = sum(abs(b - t) for b, t in zip(belief, true_belief)) / 3.0
321
  similarity = 1.0 - mae
322
  baseline_mae = sum(abs(0.5 - t) for t in true_belief) / 3.0
 
327
  scores.append(0.0)
328
 
329
  return scores
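A worked example of the baseline subtraction, using an illustrative hidden profile (not a repo fixture):

    true_belief = [0.2, 0.9, 0.7]
    baseline_mae = (abs(0.5 - 0.2) + abs(0.5 - 0.9) + abs(0.5 - 0.7)) / 3   # 0.30
    baseline_similarity = 1.0 - baseline_mae                                # 0.70

    # constant "5 5 5" emission (belief ~ [0.556, 0.556, 0.556]):
    #   mae ~ 0.281, similarity ~ 0.719, reward ~ 0.719 - 0.70 = +0.02   (about zero)
    # exact belief [0.2, 0.9, 0.7]:
    #   mae = 0.0,  similarity = 1.0,  reward = 1.0 - 0.70 = +0.30
    # inverted belief [0.8, 0.1, 0.3]:
    #   mae = 0.6,  similarity = 0.4,  reward = 0.4 - 0.70 = -0.30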
 
 
 
training/sft_prime.py ADDED
@@ -0,0 +1,230 @@
 
 
1
+ """
2
+ SFT prime: teach Qwen 2.5-3B the teacher's CoT-then-answer format.
3
+
4
+ This is Stage 2 of Algorithm Distillation. We've already collected
5
+ teacher trajectories (Stage 1). Here we fine-tune the student on the
6
+ teacher's full responses — `<reasoning>...</reasoning>\nS M W ACTION_NAME` —
7
+ so the student learns BOTH the format and the reasoning pattern that
8
+ produced each answer.
9
+
10
+ After this stage, the student should beat heuristic baselines on the
11
+ v2 grader (which awards 0.20 for belief_accuracy). GRPO refinement is
12
+ optional — only if the SFT'd model regresses on something.
13
+
14
+ Usage (from rhythm_env root):
15
+ python training/sft_prime.py \
16
+ --teacher_jsonls data/teacher_30ep_validation.jsonl \
17
+ data/teacher_indist_30_99.jsonl \
18
+ data/teacher_ood_10000_10049.jsonl \
19
+ --output_dir outputs/rhythm-env-sft-primed \
20
+ --max_steps 600 \
21
+ --epochs 2
22
+
23
+ Designed to run on HF Jobs with a10g-large flavor.
24
+ """
25
+
26
+ import argparse
27
+ import json
28
+ import os
29
+ import sys
30
+ from pathlib import Path
31
+
32
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
33
+
34
+ # The teacher's system prompt is the canonical contract — student must learn
35
+ # to respond to this exact prompt. Imported from the teacher script for SSOT.
36
+ from scripts.generate_teacher_trajectories import TEACHER_SYSTEM_PROMPT
37
+
38
+
39
+ def load_teacher_dataset(jsonl_paths: list[str], drop_parse_fails: bool = True) -> list[dict]:
40
+ """Read teacher JSONL files and return list of {prompt, response} pairs.
41
+
42
+ Each input row is one step from one teacher episode. We turn it into a
43
+ chat-format SFT example: messages = [system, user, assistant(teacher's response)].
44
+ Steps where the teacher's response failed to parse are dropped (we
45
+ don't want to teach the student bad outputs).
46
+ """
47
+ pairs: list[dict] = []
48
+ n_total = 0
49
+ n_dropped = 0
50
+ for path in jsonl_paths:
51
+ with open(path) as f:
52
+ for line in f:
53
+ row = json.loads(line)
54
+ n_total += 1
55
+ if drop_parse_fails and row.get("parse_failed"):
56
+ n_dropped += 1
57
+ continue
58
+ resp = row.get("teacher_response", "")
59
+ if not resp or not resp.strip():
60
+ n_dropped += 1
61
+ continue
62
+ pairs.append({
63
+ "messages": [
64
+ {"role": "system", "content": TEACHER_SYSTEM_PROMPT},
65
+ {"role": "user", "content": row["user_prompt"]},
66
+ {"role": "assistant", "content": resp},
67
+ ],
68
+ })
69
+ print(f"Loaded {len(pairs)} SFT examples ({n_dropped}/{n_total} dropped: "
70
+ f"parse-failed or empty)")
71
+ return pairs
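For reference, a hypothetical minimal row shape (only the keys this loader reads; the generator script writes additional bookkeeping fields not needed here):

    row = {
        "user_prompt": "<observation text produced by format_observation_prompt>",
        "teacher_response": "<reasoning>...</reasoning>\n2 8 5 MEDITATE",
        "parse_failed": False,
    }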
72
+
73
+
74
+ def main() -> None:
75
+ parser = argparse.ArgumentParser()
76
+ parser.add_argument("--teacher_jsonls", nargs="+", required=True,
77
+ help="One or more teacher trajectory JSONL files")
78
+ parser.add_argument("--output_dir", type=str, default="outputs/rhythm-env-sft-primed")
79
+ parser.add_argument("--model_name", type=str, default="unsloth/Qwen2.5-3B-Instruct")
80
+ parser.add_argument("--epochs", type=int, default=2,
81
+ help="SFT epochs over the dataset (2 is plenty for ~3000 examples)")
82
+ parser.add_argument("--max_steps", type=int, default=-1,
83
+ help="Override epochs with a step count (-1 = use epochs)")
84
+ parser.add_argument("--lora_rank", type=int, default=16)
85
+ parser.add_argument("--learning_rate", type=float, default=2e-4)
86
+ parser.add_argument("--max_seq_length", type=int, default=2048,
87
+ help="Must fit system + user + CoT response. ~600 user + ~120 CoT + ~10 ans + slack")
88
+ parser.add_argument("--per_device_batch_size", type=int, default=1)
89
+ parser.add_argument("--grad_accum", type=int, default=8,
90
+ help="Effective batch size = per_device * grad_accum")
91
+ parser.add_argument("--warmup_ratio", type=float, default=0.1)
92
+ parser.add_argument("--save_method", type=str, default="merged_16bit",
93
+ choices=["lora", "merged_16bit", "merged_4bit"])
94
+ args = parser.parse_args()
95
+
96
+ # ---- 1. Load + format the dataset ----
97
+ print("=" * 60)
98
+ print("Step 1: Loading teacher dataset")
99
+ print("=" * 60)
100
+ pairs = load_teacher_dataset(args.teacher_jsonls)
101
+ if not pairs:
102
+ sys.exit("ERROR: no SFT examples loaded β€” check JSONL paths")
103
+
104
+ from datasets import Dataset
105
+ raw_ds = Dataset.from_list(pairs)
106
+ print(f"Dataset size: {len(raw_ds)} examples")
107
+
108
+ # ---- 2. Load Qwen base via Unsloth ----
109
+ print("\n" + "=" * 60)
110
+ print(f"Step 2: Loading base model {args.model_name}")
111
+ print("=" * 60)
112
+ from unsloth import FastLanguageModel
113
+
114
+ model, tokenizer = FastLanguageModel.from_pretrained(
115
+ model_name=args.model_name,
116
+ load_in_4bit=True,
117
+ max_seq_length=args.max_seq_length,
118
+ )
119
+ model = FastLanguageModel.get_peft_model(
120
+ model,
121
+ r=args.lora_rank,
122
+ target_modules=[
123
+ "q_proj", "k_proj", "v_proj", "o_proj",
124
+ "gate_proj", "up_proj", "down_proj",
125
+ ],
126
+ lora_alpha=args.lora_rank * 2,
127
+ use_gradient_checkpointing="unsloth",
128
+ random_state=3407,
129
+ )
130
+ print(f"LoRA rank {args.lora_rank}, alpha {args.lora_rank * 2}")
131
+
132
+ # ---- 3. Map to chat-template strings + tokenize ----
133
+ print("\n" + "=" * 60)
134
+ print("Step 3: Preparing dataset")
135
+ print("=" * 60)
136
+
137
+ def format_example(ex):
138
+ text = tokenizer.apply_chat_template(
139
+ ex["messages"],
140
+ tokenize=False,
141
+ add_generation_prompt=False,
142
+ )
143
+ return {"text": text}
144
+
145
+ ds = raw_ds.map(format_example, remove_columns=raw_ds.column_names)
146
+ print("Sample formatted text (first 800 chars):")
147
+ print(ds[0]["text"][:800])
148
+ print("...")
149
+
150
+ # ---- 4. SFTTrainer ----
151
+ print("\n" + "=" * 60)
152
+ print("Step 4: Configuring SFTTrainer")
153
+ print("=" * 60)
154
+ from trl import SFTConfig, SFTTrainer
155
+
156
+ sft_kwargs = dict(
157
+ per_device_train_batch_size=args.per_device_batch_size,
158
+ gradient_accumulation_steps=args.grad_accum,
159
+ learning_rate=args.learning_rate,
160
+ warmup_ratio=args.warmup_ratio,
161
+ lr_scheduler_type="cosine",
162
+ optim="adamw_8bit",
163
+ weight_decay=0.001,
164
+ logging_steps=5,
165
+ save_strategy="no",
166
+ report_to="none",
167
+ output_dir=args.output_dir,
168
+ max_seq_length=args.max_seq_length,
169
+ dataset_text_field="text",
170
+ packing=False,
171
+ )
172
+ if args.max_steps > 0:
173
+ sft_kwargs["max_steps"] = args.max_steps
174
+ else:
175
+ sft_kwargs["num_train_epochs"] = args.epochs
176
+
177
+ sft_config = SFTConfig(**sft_kwargs)
178
+
179
+ trainer = SFTTrainer(
180
+ model=model,
181
+ tokenizer=tokenizer,
182
+ train_dataset=ds,
183
+ args=sft_config,
184
+ )
185
+ print(f"Effective batch size: {args.per_device_batch_size * args.grad_accum}")
186
+ if args.max_steps > 0:
187
+ print(f"max_steps: {args.max_steps}")
188
+ else:
189
+ print(f"epochs: {args.epochs} β†’ ~{len(ds) * args.epochs // (args.per_device_batch_size * args.grad_accum)} steps")
190
+
191
+ # ---- 5. Train ----
192
+ print("\n" + "=" * 60)
193
+ print("Step 5: Training")
194
+ print("=" * 60)
195
+ trainer.train()
196
+
197
+ # ---- 6. Save ----
198
+ print("\n" + "=" * 60)
199
+ print("Step 6: Saving model")
200
+ print("=" * 60)
201
+ Path(args.output_dir).mkdir(parents=True, exist_ok=True)
202
+ if args.save_method == "lora":
203
+ model.save_pretrained(args.output_dir)
204
+ tokenizer.save_pretrained(args.output_dir)
205
+ else:
206
+ model.save_pretrained_merged(
207
+ args.output_dir,
208
+ tokenizer,
209
+ save_method=args.save_method,
210
+ )
211
+
212
+ # Save log_history for plot_from_log.py
213
+ log_path = os.path.join(args.output_dir, "log_history.json")
214
+ with open(log_path, "w") as f:
215
+ json.dump(trainer.state.log_history, f, indent=2)
216
+
217
+ # Save training config
218
+ config_path = os.path.join(args.output_dir, "training_config.json")
219
+ with open(config_path, "w") as f:
220
+ json.dump(vars(args), f, indent=2)
221
+
222
+ print(f"\nSaved SFT-primed model to: {args.output_dir}")
223
+ print(f"Log history: {log_path}")
224
+ print(f"Training config: {config_path}")
225
+ print()
226
+ print("Next: python training/inference_eval.py --model_path " + args.output_dir)
227
+
228
+
229
+ if __name__ == "__main__":
230
+ main()
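(For reference, a minimal sketch of the JSONL row shape that load_teacher_dataset consumes, assuming only the keys the function actually reads; all field values below are invented.)

# Sketch of one teacher-trajectory row (one step of one episode):
row = {
    "user_prompt": "Day 4 of 14. History: ...",                       # observation text shown to the teacher
    "teacher_response": "<reasoning>...</reasoning>\n3 2 4 ACTION_NAME",
    "parse_failed": False,                                            # rows with True are dropped
}
# load_teacher_dataset turns it into one chat-format SFT example:
# {"messages": [system(TEACHER_SYSTEM_PROMPT), user(row["user_prompt"]), assistant(row["teacher_response"])]}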
training/train.py CHANGED
@@ -13,8 +13,8 @@ Usage (Colab T4):
13
  !pip install unsloth transformers trl datasets
14
  !python training/train.py --max_steps 1500
15
 
16
- Usage (local - for setup checks; needs CUDA for actual training):
17
- python training/train.py --max_steps 50 --use_simple_reward
18
  """
19
 
20
  import argparse
@@ -45,12 +45,7 @@ def main():
45
  help="Fraction of dataset with profile hint visible. Default 0.0 (no hints) "
46
  "to eliminate train-eval distribution mismatch. Set >0 only if you ALSO "
47
  "show hints during eval.")
48
- parser.add_argument("--profile_mode", type=str, default="continuous",
49
- choices=["continuous", "discrete"],
50
- help="continuous = sampled per-episode (meta-RL); discrete = 3 hardcoded profiles")
51
  parser.add_argument("--output_dir", type=str, default="outputs/rhythmenv_meta_trained")
52
- parser.add_argument("--use_simple_reward", action="store_true",
53
- help="Use heuristic reward instead of env-replay (smoke testing)")
54
  parser.add_argument("--report_to", type=str, default="none")
55
  args = parser.parse_args()
56
 
@@ -68,7 +63,6 @@ def main():
68
  num_episodes=args.num_episodes,
69
  strategy="mixed",
70
  max_samples=args.max_samples,
71
- profile_mode=args.profile_mode,
72
  hint_fraction=args.hint_fraction,
73
  )
74
 
@@ -123,16 +117,10 @@ def main():
123
  print("Step 3: Setting up reward functions")
124
  print("=" * 60)
125
 
126
- from reward_functions import (
127
- format_valid, action_legal, env_reward, env_reward_simple, belief_accuracy
128
- )
129
 
130
- if args.use_simple_reward:
131
- reward_funcs = [format_valid, action_legal, env_reward_simple, belief_accuracy]
132
- print("Using: format_valid + action_legal + env_reward_simple + belief_accuracy")
133
- else:
134
- reward_funcs = [format_valid, action_legal, env_reward, belief_accuracy]
135
- print("Using: format_valid + action_legal + env_reward + belief_accuracy")
136
 
137
  # ---------------------------------------------------------------
138
  # 4. GRPO trainer config
@@ -146,11 +134,11 @@ def main():
146
  max_prompt_length = 600 # history + hint room
147
  max_completion_length = 32 # bumped from 20 to prevent silent truncation of belief digits
148
 
149
- # reward_weights: aggressively suppress saturated format/legal layers, amplify
150
- # the variable signals (env + belief). Iter 1 collapsed because format/legal at
151
- # +1.0/+0.5 contributed zero to GRPO advantage; iter 2 makes belief the dominant
152
- # learning signal.
153
- # Order MUST match reward_funcs in main(): format_valid, action_legal, env_reward, belief_accuracy
154
  reward_weights = [0.05, 0.05, 1.5, 3.0]
155
 
156
  training_args_kwargs = dict(
 
13
  !pip install unsloth transformers trl datasets
14
  !python training/train.py --max_steps 1500
15
 
16
+ Setup-check (no GPU): run the smoke tests instead of starting a real run:
17
+ python -m pytest tests/test_pipeline_smoke.py -q
18
  """
19
 
20
  import argparse
 
45
  help="Fraction of dataset with profile hint visible. Default 0.0 (no hints) "
46
  "to eliminate train-eval distribution mismatch. Set >0 only if you ALSO "
47
  "show hints during eval.")
 
 
 
48
  parser.add_argument("--output_dir", type=str, default="outputs/rhythmenv_meta_trained")
 
 
49
  parser.add_argument("--report_to", type=str, default="none")
50
  args = parser.parse_args()
51
 
 
63
  num_episodes=args.num_episodes,
64
  strategy="mixed",
65
  max_samples=args.max_samples,
 
66
  hint_fraction=args.hint_fraction,
67
  )
68
 
 
117
  print("Step 3: Setting up reward functions")
118
  print("=" * 60)
119
 
120
+ from reward_functions import format_valid, action_legal, env_reward, belief_accuracy
 
 
121
 
122
+ reward_funcs = [format_valid, action_legal, env_reward, belief_accuracy]
123
+ print("Using: format_valid + action_legal + env_reward + belief_accuracy")
 
 
 
 
124
 
125
  # ---------------------------------------------------------------
126
  # 4. GRPO trainer config
 
134
  max_prompt_length = 600 # history + hint room
135
  max_completion_length = 32 # bumped from 20 to prevent silent truncation of belief digits
136
 
137
+ # reward_weights: suppress the format/action_legal layers (small, low-variance
138
+ # signals, too constant across a GRPO group to contribute meaningful advantage)
139
+ # and amplify the variable signals env_reward and belief_accuracy. belief_accuracy
140
+ # at 3.0 is the dominant learning signal.
141
+ # Order MUST match reward_funcs above: format_valid, action_legal, env_reward, belief_accuracy
142
  reward_weights = [0.05, 0.05, 1.5, 3.0]
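(An illustrative, non-project sketch of why this weighting works: GRPO computes the advantage relative to the group of completions sampled for one prompt, so any reward component that is identical across the group cancels when the group mean is subtracted. The per-reward scores below are made up.)

# Illustrative only (made-up scores): constant components wash out of the advantage.
weights = [0.05, 0.05, 1.5, 3.0]   # format_valid, action_legal, env_reward, belief_accuracy
group = [                          # per-reward scores for 4 sampled completions of one prompt
    [1.0, 1.0, 0.3, 0.00],
    [1.0, 1.0, 0.5, 0.33],
    [1.0, 1.0, 0.2, 0.67],
    [1.0, 1.0, 0.6, 1.00],
]
totals = [sum(w * r for w, r in zip(weights, scores)) for scores in group]
mean = sum(totals) / len(totals)
advantages = [t - mean for t in totals]   # GRPO also divides by the group std; omitted here
# The 0.05-weighted format/legal terms add the same 0.1 to every total, so they
# contribute nothing to the advantage; env_reward and belief_accuracy drive it.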
143
 
144
  training_args_kwargs = dict(