client: surface ALL observation fields (was dropping deltas, anomalies, last_action, step_history)
Pre-fix, an external user calling the deployed HF Space via this client got
a strictly worse view than the server returned: per-meter deltas, anomalies,
last_action, and the full step_history were silently dropped during JSON
parsing. This violated the 'client/server separation done right' criterion
in the hackathon rubric (training/WhatMakesAGoodSubmission.md): the client
was pretending fields didn't exist.
Fix: _parse_result now reconstructs the full RhythmObservation including
all 5 *_delta fields, last_action, and step_history with full StepRecord
fidelity (including the new vitality/cognition/progress/serenity/connection
anomaly fields added in the iter 4 commit).
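The reconstruction pattern is the same for every history entry; a standalone sketch of it (StepRecord here is a trimmed stand-in with a few representative fields, not the full model from models.py):

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class StepRecord:
    # Stand-in: the real record in models.py carries all five meters'
    # delta and anomaly fields; only a subset is mirrored here.
    step: int = 0
    action: str = ""
    reward: float = 0.0
    vitality_delta: float = 0.0
    vitality_anomaly: float = 0.0


def parse_step_history(obs_data: Dict[str, Any]) -> List[StepRecord]:
    # A missing or null "step_history" key must not crash the client.
    raw = obs_data.get("step_history", []) or []
    return [
        StepRecord(
            step=h.get("step", 0),
            action=h.get("action", ""),
            reward=h.get("reward", 0.0),
            vitality_delta=h.get("vitality_delta", 0.0),
            vitality_anomaly=h.get("vitality_anomaly", 0.0),
        )
        for h in raw
    ]


history = parse_step_history(
    {"step_history": [{"step": 1, "action": "rest", "reward": 0.5}]}
)
print(len(history), history[0].action)  # → 1 rest
```

Every field defaults rather than raises, so older servers that omit a key still parse cleanly.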
Also adds the hackathon submission criteria doc to the repo for reference.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- client.py +39 -3
- training/WhatMakesAGoodSubmission.md +57 -0
--- a/client.py
+++ b/client.py
@@ -18,9 +18,9 @@ from openenv.core.client_types import StepResult
 from openenv.core.env_client import EnvClient
 
 try:
-    from .models import RhythmAction, RhythmObservation, RhythmState
+    from .models import RhythmAction, RhythmObservation, RhythmState, StepRecord
 except ImportError:
-    from models import RhythmAction, RhythmObservation, RhythmState
+    from models import RhythmAction, RhythmObservation, RhythmState, StepRecord
 
 
 class RhythmEnv(EnvClient[RhythmAction, RhythmObservation, RhythmState]):
@@ -38,9 +38,36 @@ class RhythmEnv(EnvClient[RhythmAction, RhythmObservation, RhythmState]):
         return {"action_type": action.action_type.value}
 
     def _parse_result(self, payload: Dict[str, Any]) -> StepResult[RhythmObservation]:
-        """Parse server response into StepResult[RhythmObservation]."""
+        """Parse server response into StepResult[RhythmObservation].
+
+        Surfaces ALL observation fields the server returns, including the
+        per-meter deltas, anomalies (in step_history), last_action, and the
+        full step history. Without these, an external agent connecting to the
+        server can't see the meta-RL signals it needs to infer the profile.
+        """
         obs_data = payload.get("observation", {})
 
+        # Reconstruct step_history with full StepRecord fidelity
+        step_history_raw = obs_data.get("step_history", []) or []
+        step_history = [
+            StepRecord(
+                step=h.get("step", 0),
+                action=h.get("action", ""),
+                reward=h.get("reward", 0.0),
+                vitality_delta=h.get("vitality_delta", 0.0),
+                cognition_delta=h.get("cognition_delta", 0.0),
+                progress_delta=h.get("progress_delta", 0.0),
+                serenity_delta=h.get("serenity_delta", 0.0),
+                connection_delta=h.get("connection_delta", 0.0),
+                vitality_anomaly=h.get("vitality_anomaly", 0.0),
+                cognition_anomaly=h.get("cognition_anomaly", 0.0),
+                progress_anomaly=h.get("progress_anomaly", 0.0),
+                serenity_anomaly=h.get("serenity_anomaly", 0.0),
+                connection_anomaly=h.get("connection_anomaly", 0.0),
+            )
+            for h in step_history_raw
+        ]
+
         observation = RhythmObservation(
             timestep=obs_data.get("timestep", 0),
             day=obs_data.get("day", 0),
@@ -56,6 +83,15 @@ class RhythmEnv(EnvClient[RhythmAction, RhythmObservation, RhythmState]):
             done=payload.get("done", False),
             reward=payload.get("reward", 0.0),
             metadata=obs_data.get("metadata", {}),
+            # Per-meter deltas from THIS step (was being silently dropped)
+            vitality_delta=obs_data.get("vitality_delta", 0.0),
+            cognition_delta=obs_data.get("cognition_delta", 0.0),
+            progress_delta=obs_data.get("progress_delta", 0.0),
+            serenity_delta=obs_data.get("serenity_delta", 0.0),
+            connection_delta=obs_data.get("connection_delta", 0.0),
+            last_action=obs_data.get("last_action"),
+            # Rolling history with anomalies (the meta-RL signal)
+            step_history=step_history,
         )
 
         return StepResult(
--- /dev/null
+++ b/training/WhatMakesAGoodSubmission.md
@@ -0,0 +1,57 @@
+# What makes a submission stand out
+Pick an ambitious, original problem
+The themes (problems) are deliberately open. Use them as launching pads, not boxes. Judges have seen a lot of chess, snake, tic-tac-toe, and grid-world clones. To score well on innovation,
+you need a genuinely fresh angle. Some questions to ask yourself:
+Does this environment exist to teach an LLM something it currently can’t do well?
+Is the domain underexplored in RL/LLM training?
+Could a researcher write a paper about training on this?
+
+Design a reward signal that actually teaches
+A great environment has a reward function that:
+Provides a rich, informative signal (not just 0/1 at the end)
+Captures something hard to measure in a clever way
+Uses OpenEnv’s Rubric system thoughtfully (composable rubrics > monolithic scoring)
+Is hard to game; an agent that exploits the reward without solving the task should not get high scores
+
+Show real training, end to end
+The bar isn’t “training script exists.” The bar is “training script runs against the environment, the
+agent learns, and you can show it.” Concretely:
+Your training loop should connect to your environment (not a static dataset)
+Train long enough that the curves mean something
+Compare a trained agent vs. a random/untrained baseline; quantitative and/or qualitative
+Include the plots and numbers in your README and writeup
+
+Make your plots readable
+Reviewers spend seconds, not minutes, on each plot. Help them out:
+Label both axes (e.g. “training step” / “episode” on x, “reward” / “loss” on y) and include units where they apply
+Save plots as .png or .jpg and commit them to the repo (don’t leave them only in a Colab cell or a deleted WandB run); if you ran via WandB, please include the link to that specific run
+Embed the key plots in your README with a one-line caption explaining what each one shows. If you have multiple runs (baseline vs. trained, ablations, etc.), put them on the same axes so the comparison is obvious
+
+
+Tell a story, not an API doc
+Your README, blog, and pitch should answer:
+Problem) what capability gap or interesting domain are you targeting?
+Environment) what does the agent see, do, and get rewarded for?
+Results) what changed after training? Show it.
+Why it matters) who would care, and why?
+
+A reviewer should be able to read your README in 3-5 minutes and want to try your
+environment.
+
+NOTE: If you have a video, HF post, or anything else interesting, please make sure that it’s linked
+from your README.
+
+
+Engineer it cleanly (table stakes)
+Engineering quality matters less than ambition, but sloppy work hurts. Make sure you:
+Use OpenEnv’s Environment / MCPEnvironment base classes properly
+Respect the client / server separation (clients should never import server internals)
+Follow the standard Gym-style API (reset, step, state)
+Have a valid openenv.yaml manifest
+Don’t use reserved tool names (reset, step, state, close) for MCP tools
+
+Final Note
+Judges are looking for environments that push the frontier of what we can train LLMs to do. Be
+ambitious. Pick a problem you find genuinely interesting; that almost always produces better
+work than chasing what you think judges want. Good luck.
+