Adhitya122 committed
Commit 5511b8f · verified · 1 Parent(s): d5ef2b7

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ assets/Logs.png filter=lfs diff=lfs merge=lfs -text
37
+ assets/molforge_architecture.png filter=lfs diff=lfs merge=lfs -text
38
+ assets/reward_curve.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,163 +1,214 @@
1
  ---
2
- title: MolForge
3
- emoji: 🧪
4
- colorFrom: green
5
- colorTo: indigo
6
- sdk: docker
7
- app_port: 8000
8
  ---
9
 
10
- # MolForge
11
 
12
- This repository implements an OpenEnv-compatible reinforcement learning environment for **medicinal chemistry lead optimization**. The agent does not directly see the true biological properties of the candidate molecule. Instead, a specialist team iteratively edits a KRAS G12C candidate under limited assay budget, partial observability, and strict safety constraints, receiving a noisy simulated output, and is rewarded for discovering a highly potent, synthesizable, and safe drug candidate.
13
 
14
- The environment is designed as a **partially observable Markov decision process (POMDP)** with:
15
- - hidden ground-truth molecular properties and scenario constraints
16
- - hidden target mutation traps (e.g. KRAS resistance panel shifts)
17
- - visible task metadata, team communication, assay results, and remaining budget
18
- - simulated `RDKit` descriptors and `TDC` (Therapeutics Data Commons) predictions (QED, SA_Score, LogP, TPSA)
19
- - dense step-wise reward (in curriculum mode) plus terminal reward for submission quality
20
 
21
- At a high level, each episode looks like this:
22
- 1. `reset()` picks a biological scenario (e.g. `level_1_medium`) and seeds the simulator.
23
- 2. The agent receives a `MolForgeObservation` describing the task, the starting molecule scaffold, and the current visible state.
24
- 3. The agent (acting as different roles) submits a `MolForgeAction` such as `edit`, `run_assay`, `propose_nomination`, or `submit`.
25
- 4. The **Governance rule engine** checks whether the action is valid, requiring multi-agent consensus for final decisions.
26
- 5. The transition engine updates the molecule, spends the assay budget, and returns oracle readings.
27
- 6. The reward computer scores the step based on whether the action was invalid, vetoed, or successful.
28
- 7. The environment returns a new observation with updated history, assay readings, and reward.
29
- 8. The episode ends when the agent successfully submits the molecule, exhausts its budget, or reaches the maximum step horizon.
30
 
31
- ---
32
 
33
- ## Hidden state vs Visible state
34
 
35
- ### Hidden state
36
- The simulator keeps ground-truth properties that the agent never directly sees. It contains:
37
- - The true underlying scoring functions for `potency`, `safety`, and `synthesizability`.
38
- - Sunk-cost traps and late-stage target mutations (e.g., in `level_2_hard`).
39
- - The strict constraints required for a valid submission.
40
- - The remaining hidden milestones for the scenario.
41
 
42
- ### Visible state
43
- The agent only sees `MolForgeObservation`, which includes:
44
- - The current `TaskSpec` and `scenario_id`.
45
- - Pipeline history and previous actions.
46
- - The current molecular scaffold (in SMILES format).
47
- - The `budget_used` and `remaining_budget`.
48
- - Responses from the `run_assay` oracle (TDC predictors and RDKit descriptors).
49
- - The `GovernanceStatus` showing which specialist agents have approved or objected.
50
- - The `step_reward_breakdown`.
51
 
52
- This separation is what makes the environment a POMDP rather than a fully observed simulator.
53
 
54
- ---
55
 
56
- ## Repository files navigation
57
-
58
- ### `models.py`
59
- Defines the Pydantic contracts that all other modules use:
60
- - `MolForgeAction`: One structured step chosen by the agent. Fields include `action_type`, `acting_role`, `tool_name`, `slot`, `fragment`, and `rationale`.
61
- - `MolForgeObservation`: What the agent can see after each step; includes `current_molecule`, `last_transition_summary`, `reward_breakdown`, and `governance_status`.
62
- - `MolForgeState`: The internal tracked state including `episode_id`, `step_count`, and `invalid_action_count`.
63
-
64
- ### `server/scenarios.py`
65
- This is where episodes come from. It defines a curated library of three biological scenarios, each bundling a starting scaffold, a budget, and a specific molecular target:
66
- - `level_0_easy`: Potency-first optimization with a generous budget and a starting scaffold that is one or two edits from success.
67
- - `level_1_medium`: Multi-objective optimization with safety as a hard constraint and moderate budget pressure.
68
- - `level_2_hard`: A sunk-cost trap plus late target mutation. The initial scaffold family has a hidden liability, and the best policy is often to restart early.
69
-
70
- ### `server/actions.py` & `server/governance.py`
71
- The rule engines enforcing scientific and procedural constraints before each action is applied:
72
- - `run_assay`: Costs budget. Assembles the fragments into a valid `SMILES` string and evaluates the current molecule using `TDC` Oracles and `RDKit` logic (e.g. `MolLogP`, `TPSA`, `NumRotatableBonds`, `QED`).
73
- - `edit`: Replaces a specific R-group slot (`warhead`, `hinge`, `solvent_tail`, `back_pocket`) with a new chemical fragment (e.g. `acrylamide`, `fluorophenyl`, `morpholine`). Clears previously gathered evidence.
74
- - `submit`: Ends the episode. Triggers the final evaluation grader against the scenario's strict hard constraints (`potency_min`, `toxicity_max`, `synth_min`).
75
- - **Governance**: Certain actions require multi-agent consensus. If the `Lead Chemist` tries to submit without the `Safety Specialist`'s approval, the action is vetoed.
76
-
77
- ### `server/molforge_environment.py`
78
- This is the orchestration layer that ties everything together.
79
- On `reset()` it:
80
- - Generates a task scenario.
81
- - Clears the message log, history, and resets the molecule to the default scaffold.
82
-
83
- On `step()` it:
84
- - Checks governance rules and validates the action.
85
- - Executes the action (e.g. replacing an R-group fragment or running an assay).
86
- - Computes reward (via Curriculum or Assay-Gated mode).
87
- - Builds the next `MolForgeObservation`.
88
 
89
  ---
90
 
91
- ## What actually happens on one step
92
- Here is the concrete order of operations for `env.step(action)`:
93
- 1. Increment the step counter.
94
- 2. Run validation checks. If the action format is invalid, return a failure report and a `-1.0` reward.
95
- 3. Assess **Governance**. If a required specialist agent vetoes the action, the action is blocked and penalized.
96
- 4. Execute the action (`edit`, `run_assay`, `submit`).
97
- 5. Deduct oracle budget if `run_assay` was called.
98
- 6. Compute decomposed reward from the state transition (e.g., getting penalized for redundant assays).
99
- 7. If the episode is ending (via `submit`, max steps, or zero budget), compute the terminal `submission_score`.
100
- 8. Return an observation that exposes the visible summary but not the hidden truth.
101
 
102
- ---
103
 
104
- ## Typical successful pipeline
105
- Most scenarios reward a sensible experiment order similar to:
106
- 1. `run_assay` (Assay potency and safety of the baseline molecule).
107
- 2. `edit` (Swap an R-group fragment to improve a weak property).
108
- 3. `run_assay` (Gather new evidence for the modified molecule).
109
- 4. `propose_nomination` (Discuss the findings with the multi-agent review board).
110
- 5. `submit` (Finalize the candidate).
111
 
112
- The exact best sequence depends on the scenario. In `level_2_hard`, the best strategy is often to `restart` the entire scaffold immediately rather than wasting budget on a doomed trajectory.
113
 
114
- ---
 
115
 
116
- ## Reward Strategy & Episode termination
 
117
 
118
- MolForge uses two distinct reward settings for different purposes:
119
 
120
- **1. Training / RL Warmup (`MOLFORGE_REWARD_MODE=curriculum`)**
121
- - Gives partial credit at the end of an episode even if the model didn't submit, provided it gathered useful evidence.
122
- - It actively prevents "reward hacking" by penalizing assay-spamming, and giving massive multipliers to successful submissions.
123
 
124
- **2. Judge-Facing Evaluation (`MOLFORGE_REWARD_MODE=assay_gated`)**
125
- - Strict OpenEnv hackathon rules.
126
- - If the agent does not formally `submit` the candidate, the final score is `0.0`.
127
- - No partial credit is given for just gathering evidence.
128
 
129
- An episode ends when one of the following happens:
130
- - The agent explicitly chooses `submit`.
131
- - Resources (oracle budget) are exhausted.
132
- - The environment reaches `MAX_STEPS`.
133
 
134
- ---
135
 
136
- ## Installation & Usage
137
- The package requires Python ≥ 3.10.
138
- ```bash
139
- pip install "openenv-core[core]>=0.2.3" pydantic transformers trl peft datasets
140
- ```
141
 
142
- ### 1. In-process environment
143
- Use `MolForgeEnvironment` when you want direct Python access with full structured observations:
144
- ```python
145
- from models import MolForgeAction
146
- from server.molforge_environment import MolForgeEnvironment
147
-
148
- env = MolForgeEnvironment()
149
- obs = env.reset()
150
-
151
- action = MolForgeAction(
152
- action_type="run_assay",
153
- acting_role="Lead Chemist",
154
- tool_name="potency_oracle",
155
- rationale="Need to gather baseline potency evidence."
156
- )
157
- obs = env.step(action)
158
- print(obs.reward)
159
- print(obs.last_transition_summary)
160
  ```
161
 
162
- ### 2. RL Training Notebook
163
- We have provided a cleanly documented `issue/molforge_grpo_official_submission.ipynb` which demonstrates exactly how to fine-tune a Qwen3.5 model using TRL's GRPO trainer natively against this OpenEnv environment.
1
  ---
2
+ title: "MolForge: Verifier-Driven RL for Drug Discovery"
3
+ emoji: 🧬
4
+ colorFrom: indigo
5
+ colorTo: blue
6
+ sdk: static
7
+ pinned: false
8
+ license: mit
9
+ tags:
10
+ - reinforcement-learning
11
+ - drug-discovery
12
+ - chemistry
13
+ - multi-agent
14
+ - oncology
15
+ - molecular-simulation
16
+ - openenv
17
  ---
18
 
19
+ # MolForge: Verifier-Driven RL for Drug Discovery
20
 
21
+ MolForge is a reinforcement learning environment that simulates a **medical oncology discovery lab**. Unlike traditional LLM tasks where the model generates a final answer in one shot, MolForge forces the model to execute the **scientific method** under real-world constraints: budget, toxicity, and synthesis complexity.
22
 
23
+ **[View the MolForge Space Deployment on Hugging Face](https://huggingface.co/spaces/Adhitya122/molforge)**
24
+ **[Try the RL Training Notebook on Google Colab](https://colab.research.google.com/drive/1c6npGkGNbbbd8XFNeS6zInBpopLnJ4W4?usp=sharing)**
25
 
26
+ ### The Scientific Method as a Workflow
27
 
28
+ Imagine a biotech team tasked with optimizing a lead candidate for **KRAS G12C** (including a high-difficulty resistance panel). The model doesn't just "write" a molecule; it controls a specialist team that must navigate a resource-constrained laboratory:
29
 
30
+ - **Lead Chemist**: Proposes molecular edits and decides when to submit.
31
+ - **Assay Planner**: Allocates limited budget to run empirical tests.
32
+ - **Toxicologist**: Reviews safety risks and can object to unsafe designs.
33
+ - **Process Chemist**: Evaluates whether the molecule is practical to synthesize.
34
 
35
+ Every action—editing a fragment, running a docking simulation, or ordering a toxicity assay—is a decision that impacts the final outcome. The model must learn to gather enough evidence to justify a submission while keeping the project within budget.
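+
+ For illustration, here is a minimal in-process sketch of one such step, based on the usage example shipped with the repository (the `MolForgeAction` fields follow `models.py`):
+
+ ```python
+ from models import MolForgeAction
+ from server.molforge_environment import MolForgeEnvironment
+
+ env = MolForgeEnvironment()
+ obs = env.reset()
+
+ # Spend budget to gather baseline potency evidence before proposing an edit.
+ action = MolForgeAction(
+     action_type="run_assay",
+     acting_role="Lead Chemist",
+     tool_name="potency_oracle",
+     rationale="Need baseline potency evidence before editing.",
+ )
+ obs = env.step(action)
+ print(obs.reward, obs.last_transition_summary)
+ ```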
36
 
37
+ > **Core Philosophy:** The LLM is not the judge. The LLM is the scientist being judged by external, verifiable reality.
38
 
 
39
 
40
+ ## What Makes MolForge Special?
41
 
42
+ MolForge is built to move beyond simple "molecule generation" into "scientific workflow optimization." Here are the seven core pillars that make it unique:
43
+
44
+ 1. [**Verifier-Based Evaluation**](#1-verifier-based-evaluation): The LLM is the scientist, not the judge. It is held accountable by real-world verifiers like **RDKit** and **TDC**.
45
+ 2. [**Chemical & Molecular Simulations**](#2-chemical--molecular-simulations): Realistic simulation of potency and existence using heuristic docking, **RDKit**, and **TDC**.
46
+ 3. [**Self-Correction & Improvement Loop**](#3-self-correction--improvement-loop): After each edit, agents receive structured feedback from verifiers, allowing them to self-correct.
47
+ 4. [**Decomposed Reward Architecture**](#4-decomposed-reward-architecture): Multi-step rewards for every action (research, edits, coordination) provide high observability.
48
+ 5. [**Scientific Model Improvement**](#5-scientific-model-improvement): Real verifier feedback (Reviews) guides the model toward scientifically sound designs.
49
+ 6. [**Strategic Training Modes**](#6-strategic-training-modes): A dual-mode system using **Curriculum mode** (partial credit) and **Assay-Gated mode** (strict).
50
+ 7. [**Multi-Agent Governance**](#7-multi-agent-governance): A specialized team that plans, executes, and shares information to coordinate the next plan of action.
51
 
52
  ---
53
 
54
 
55
+ ## Architecture
56
 
57
+ The architecture is a closed scientific feedback loop:
58
 
59
+ ![MolForge Architecture](assets/molforge_architecture.png)
60
 
61
+ ### The POMDP Framework: Hidden vs. Visible State
62
+ MolForge is designed as a **partially observable Markov decision process (POMDP)**. This separation is what makes the environment a scientific challenge rather than a simple optimization task.
63
 
64
+ - **Hidden State**: The simulator tracks the ground-truth scoring for `potency`, `safety`, and `synthesizability`. It also hides "sunk-cost traps" and late-stage target mutation shifts (e.g., in `level_2_hard`) that the agent must discover through evidence.
65
+ - **Visible State**: The agent only sees noisy `MolForgeObservation` reports: pipeline history, SMILES scaffolds, remaining budget, and the structured feedback from the verifier assays (RDKit and TDC).
66
 
67
+ ## Scientific Verifier Layers
68
 
69
+ ### RDKit: chemical plausibility
 
 
70
 
71
+ RDKit validates molecular structures and computes chemistry descriptors. In MolForge, this layer keeps molecule edits grounded in chemical reality, supporting descriptor-style reasoning about lipophilicity, polarity, synthetic tractability, and drug-likeness.
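+
+ A short sketch of the descriptor calls this layer relies on (standard RDKit API; the scaffold SMILES is illustrative):
+
+ ```python
+ from rdkit import Chem
+ from rdkit.Chem import Descriptors, QED
+
+ mol = Chem.MolFromSmiles("c1ccccc1CC(=O)N")  # illustrative scaffold
+ print(Descriptors.MolLogP(mol))              # lipophilicity
+ print(Descriptors.TPSA(mol))                 # polarity
+ print(Descriptors.NumRotatableBonds(mol))    # flexibility / tractability proxy
+ print(QED.qed(mol))                          # drug-likeness
+ ```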
72
 
73
+ ### TDC: biomedical outcome signals
74
 
75
+ Therapeutics Data Commons represents the medical-outcome side of the environment. It gives the project a path toward realistic prediction tasks such as toxicity, synthesis difficulty, and drug-likeness. In the default Docker deployment, RDKit remains active while TDC is optional, because it pulls in a heavier, platform-sensitive ML stack.
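+
+ A hedged sketch of the public TDC oracle API this layer builds on (assuming the optional `PyTDC` package is installed):
+
+ ```python
+ from tdc import Oracle
+
+ # Drug-likeness (QED) and synthetic accessibility (SA) oracles.
+ qed = Oracle(name="QED")
+ sa = Oracle(name="SA")
+ smiles = ["c1ccccc1CC(=O)N"]
+ print(qed(smiles), sa(smiles))
+ ```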
76
 
77
+ ### Heuristic docking: receptor fit
78
+
79
+ MolForge includes a docking-style surrogate that answers three fast questions:
80
+
81
+ | Check | Question | Why it matters |
82
+ | --- | --- | --- |
83
+ | Pocket matching | Does the hinge fragment fit the receptor pocket? | Better pocket complementarity improves potency. |
84
+ | Lipophilic match | Is LogP near the target pocket's hydrophobic comfort zone around `3.0`? | Too much lipophilicity can increase toxicity; too little can weaken binding. |
85
+ | Polarity match | Is TPSA near a useful range around `85.0`? | Polarity affects binding, permeability, and clash risk. |
86
+
87
+ This gives the environment fast receptor-aware feedback in milliseconds, which is important for RL.
88
+
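+ A toy sketch of such a surrogate (the weights, Gaussian widths, and pocket table below are illustrative assumptions; only the LogP `3.0` and TPSA `85.0` targets come from the table above):
+
+ ```python
+ import math
+
+ # Hypothetical pocket-complementarity table; not the repository's actual values.
+ POCKET_FIT = {"acrylamide": 1.0, "fluorophenyl": 0.7, "morpholine": 0.4}
+
+ def heuristic_docking_score(hinge_fragment: str, logp: float, tpsa: float) -> float:
+     pocket = POCKET_FIT.get(hinge_fragment, 0.2)          # pocket matching
+     lipo = math.exp(-(((logp - 3.0) / 1.5) ** 2))         # lipophilic match
+     polarity = math.exp(-(((tpsa - 85.0) / 25.0) ** 2))   # polarity match
+     return round(0.5 * pocket + 0.3 * lipo + 0.2 * polarity, 3)
+
+ print(heuristic_docking_score("acrylamide", logp=3.2, tpsa=90.0))
+ ```
+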
89
+ ## Training Story
90
+
91
+ The training pipeline has two stages:
92
+
93
+ 1. **SFT warm start**
94
+ 2. **RL with verifier rewards**
95
+
96
+ ### Base model
97
+
98
+ The model used for the main run is:
99
 
100
+ ```text
101
+ unsloth/Qwen3.5-2B
102
  ```
103
 
104
+ The raw base model was not reliable enough for the environment at first. It often failed to produce the exact structured JSON actions that MolForge expects, and it did not consistently respect the specialist-agent interaction format.
105
+
106
+ So the first step was a small SFT warm start. This stage is not meant to teach the model the optimal chemistry. It teaches the model how to speak the environment's action language:
107
+
108
+ - valid JSON actions
109
+ - correct role/action pairing
110
+ - correct molecule slots and fragments
111
+ - concise rationales
112
+ - evidence fields based only on visible observations
113
+ - expected-effect fields such as potency up/down or toxicity up/down
114
+ - valid specialist messages where needed
115
+
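+ For illustration, a compact action in that language might look like this (field names follow `MolForgeAction`; the exact compact schema may differ):
+
+ ```python
+ # Hypothetical structured action; the slot and fragment names come from the
+ # environment's R-group vocabulary, and the values here are illustrative.
+ action_json = {
+     "action_type": "edit",
+     "acting_role": "Lead Chemist",
+     "slot": "solvent_tail",
+     "fragment": "morpholine",
+     "rationale": "Lower LogP toward 3.0 to reduce toxicity risk.",
+ }
+ ```
+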
116
+ ### Training Results
117
+ After SFT, the policy is trained with GRPO-style RL against MolForge itself. During training, the model explores the 256-combination molecular edit space, receiving rewards for molecule quality, evidence coverage, and budget discipline.
118
+
119
+ ![Reward Curve](assets/reward_curve.png)
120
+ ![Training Logs](assets/Logs.png)
121
+
122
+ ### Performance Comparison: SFT vs. RL
123
+
124
+ | Difficulty | Before (SFT Model) | After RL Training | Improvement |
125
+ | :--- | :---: | :---: | :---: |
126
+ | **Easy** | 0.1167 | 0.1295 | **+10.9%** |
127
+ | **Medium** | 0.1167 | 0.1278 | **+9.5%** |
128
+ | **Hard** | 0.0800 | 0.0866 | **+8.3%** |
129
+
130
+ As shown in the reward curve and logs, the model successfully learns to navigate the scientific constraints, moving from early exploration to consistent, verifier-backed molecule submissions. For strict evaluation, the environment switches back to `assay_gated` mode.
131
+
132
+
133
+
134
+
135
+
136
+
137
+
138
+
139
+
140
+
141
+ ## Reward Design: Shaping Scientific Behavior
142
+
143
+ The reward function mixes coarse shaping with sparse terminal bonuses to promote rigorous scientific exploration:
144
+
145
+ - **Coarse Feedback**: Edit feedback avoids exposing exact hidden deltas, forcing the model to rely on assays for decision-making.
146
+ - **Information Gain**: Rewards for running useful assays that provide new, evidence-based signal.
147
+ - **Coordination & Governance**: Rewards for correct specialist reviews, proposal discipline, and multi-agent consensus.
148
+ - **Scientific Penalties**: Deductions for invalid actions, repeated states, wasteful assay repetition, and submitting without sufficient potency/safety support.
149
+ - **Terminal Scoring**: A large bonus for submitting a molecule that beats the baseline while satisfying all hard constraints.
150
+
151
+ ### Strategic Training Modes
152
+
153
+ MolForge uses two distinct reward settings to balance training and evaluation:
154
+
155
+ 1. **Curriculum Mode (Training)**: Adds bounded warmup rewards for evidence collection and "near-miss" episodes. It also adds a small **missed-nomination penalty** when a strong evidence package is ready but the agent lets the deadline pass without submitting. This acts as "breadcrumbs" for RL, helping smaller models navigate sparse reward landscapes.
156
+ 2. **Assay-Gated Mode (Evaluation)**: The strict, official hackathon mode. If the agent does not formally `submit` the candidate before the budget is exhausted, the final score is exactly `0.0`. No partial credit is given for just gathering evidence.
157
+
158
+ `final_score` remains the single headline scalar for RL/evaluation. While `candidate_score` and `progress_score` are used for diagnostic observability, the environment is designed so that evidence collection alone cannot look like success; it must lead to a valid submission.
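+
+ A minimal sketch of how the two modes differ at episode end (the `0.75` partial-credit cap is described below; everything else is an illustrative assumption):
+
+ ```python
+ def terminal_score(mode: str, submitted: bool,
+                    submission_score: float, evidence_credit: float) -> float:
+     if mode == "assay_gated":
+         # Strict evaluation: no formal submission means exactly 0.0.
+         return submission_score if submitted else 0.0
+     # Curriculum mode: bounded partial credit for strong near-miss episodes.
+     if submitted:
+         return submission_score
+     return min(evidence_credit, 0.75)
+ ```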
159
+
160
+
161
+
162
+ ## Why This Project Matters
163
+
164
+ MolForge is designed to test a deeper kind of AI capability than simple answer generation. The model must work inside a scientific feedback loop where actions are checked, evidence costs money, unsafe decisions can be blocked, and the final answer only matters if the path to that answer is experimentally justified.
165
+
166
+ The strongest part of the project is that the LLM is not trusted by default. It has to earn trust through verifier-backed decisions.
167
+
168
+ ## Deep Dive: What Makes MolForge Special?
169
+
170
+ ### 1. Verifier-Based Evaluation
171
+ In many LLM systems, the model itself is used as a judge, reviewer, or evaluator. MolForge flips that pattern. The LLM is the scientist being judged, not the judge. It is held accountable by real-world verifiers like **RDKit**, **TDC**, and molecular simulation engines. This ensures that the model's progress is grounded in chemical and biological reality, not just persuasive language.
172
+
173
+ ### 2. Chemical & Molecular Simulations
174
+ MolForge doesn't just predict outcomes; it utilizes multiple simulation layers to ground the model's decisions:
175
+ * **Chemical Plausibility (RDKit):** Decides if the molecule generated by the LLM (via edits) can actually exist in chemical reality. [Visit RDKit](https://www.rdkit.org)
176
+ * **Medical Outcomes (TDC):** Predicts the most probable medical outcomes and properties using the [Therapeutics Data Commons](https://tdcommons.ai).
177
+ * **Heuristic Docking Score:** A fast, physics-inspired simulation that updates **potency** in milliseconds based on three rules:
178
+ 1. **Pocket Matching:** Does the fragment structurally fit the target receptor pocket (e.g., KRAS G12C)?
179
+ 2. **Lipophilic Match:** Is the LogP near the ideal **3.0** to sit comfortably in the hydrophobic pocket?
180
+ 3. **Polarity Match:** Is the TPSA near the ideal **85.0** to avoid repulsive polar clashes?
181
+
182
+ ### 3. Self-Correction & Improvement Loop
183
+ MolForge is an iterative environment. After each proposed molecular modification, the model receives a structured review from the verifiers. This feedback allows the agent to recognize liabilities (like toxicity or low potency) and correct them in the next step. This creates a genuine **self-improvement loop** within each episode.
184
+
185
+ ### 4. Decomposed Reward Architecture
186
+ The reward function is not a single "black box" scalar. We use a multi-step reward system where small-scale rewards are designed for every individual action—research, molecular edits, and inter-agent coordination. While we may output a single total reward for training simplicity (especially for the hackathon), the decomposed components give deep observability into which parts of the workflow are underperforming.
187
+
188
+ ### 5. Scientific Model Improvement
189
+ We use real verifier feedback to drive model improvement. By providing constant, verifiable reviews, we train the model to improve its designs based on evidence. This moves the model away from simple pattern matching and toward a more rigorous, evidence-based design process.
190
+
191
+ ### 6. Strategic Training Modes: Curriculum vs. Assay-Gated
192
+ To solve the "sparse reward" problem common in RL, MolForge uses two distinct modes:
193
+ * **Curriculum Mode (Training):** If a model fails to submit but showed good scientific behavior, it receives "Partial Credit" (up to +0.75). It gets points for gathering evidence and designing promising molecules. These "breadcrumbs" teach the model how to explore before it discovers the terminal submission bonus.
194
+ * **Assay-Gated Mode (Evaluation):** This is the strict, official mode used for hackathon grading. There is **zero partial credit**. If the model fails to explicitly `submit` a high-potency, safe molecule before the budget runs out, its score is exactly `0.0`.
195
+
196
+ ### 7. Multi-Agent Governance
197
+ Drug discovery is a team effort. MolForge implements a multi-agent system where specialized roles (Lead Chemist, Toxicologist, Assay Planner) review each other's moves, plans, and executions. Crucially, these agents **share information and coordinate** between themselves to decide the next plan of action, ensuring that every decision undergoes a rigorous "peer review" process before execution.
198
+
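+ A minimal sketch of such a veto gate, assuming a hypothetical `governance_check` helper (the repository's actual rule engine lives in `server/governance.py`):
+
+ ```python
+ def governance_check(action_type: str, approvals: set[str]) -> tuple[bool, str]:
+     # Submission requires the safety specialist's sign-off.
+     if action_type == "submit" and "Toxicologist" not in approvals:
+         return False, "Vetoed: submit requires Toxicologist approval."
+     return True, "Approved."
+
+ ok, msg = governance_check("submit", approvals={"Lead Chemist"})
+ print(ok, msg)  # -> False Vetoed: submit requires Toxicologist approval.
+ ```
+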
199
+ ---
200
+
201
+ ## Final Takeaway
202
+
203
+ MolForge is special because it treats the LLM as a trainable research agent inside a controlled scientific environment, not as an oracle. The model is judged by chemistry and biomedical verifiers, corrected by specialist feedback, constrained by assay budget, and scored by a reward system that can explain where the policy succeeded or failed.
204
+
205
+ The important pieces work together:
206
+
207
+ - **Verifier-first evaluation:** RDKit, TDC-style signals, and docking-style simulation judge the model's actions.
208
+ - **Multi-agent review:** specialist roles create checks and balances around each decision.
209
+ - **Self-improvement loop:** every action produces feedback that the next action can respond to.
210
+ - **Decomposed rewards:** the environment tracks molecule quality, evidence, budget, coordination, and safety separately.
211
+ - **Curriculum to strict evaluation:** training can use partial-credit breadcrumbs, while final evaluation remains unforgiving.
212
+ - **Dynamic molecular search:** the model explores 256 fragment combinations across three starting scientific scenarios instead of memorizing one answer.
213
+
214
+ That is the project thesis: useful scientific agents should not merely generate plausible ideas. They should operate in a loop where the world pushes back.
assets/Logs.png ADDED

Git LFS Details

  • SHA256: a5f41fa2250acb308b5d9036fda12eda1e5dd0c98e3a52b92c1db2c6f25a1a8d
  • Pointer size: 131 Bytes
  • Size of remote file: 337 kB
assets/molforge_architecture.png ADDED

Git LFS Details

  • SHA256: 3674e13e70719039a42f7a35720e7f3f2657c9980530f22e03a63207e18d457a
  • Pointer size: 131 Bytes
  • Size of remote file: 337 kB
assets/reward_curve.png ADDED

Git LFS Details

  • SHA256: 533a0f6951602ff55c1e63b793bd86dac41e6d90f355894f6476d2a7cbc64245
  • Pointer size: 131 Bytes
  • Size of remote file: 192 kB
index.html ADDED
@@ -0,0 +1,457 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>MolForge | Verifier-Driven RL for Drug Discovery</title>
7
+ <link rel="preconnect" href="https://fonts.googleapis.com">
8
+ <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
9
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;600;700;800&family=Outfit:wght@400;600;800&display=swap" rel="stylesheet">
10
+ <style>
11
+ :root {
12
+ --primary: #6366f1;
13
+ --primary-glow: rgba(99, 102, 241, 0.5);
14
+ --secondary: #8b5cf6;
15
+ --bg: #0f172a;
16
+ --card-bg: rgba(30, 41, 59, 0.7);
17
+ --text: #f8fafc;
18
+ --text-dim: #94a3b8;
19
+ --glass: rgba(255, 255, 255, 0.03);
20
+ --glass-border: rgba(255, 255, 255, 0.1);
21
+ }
22
+
23
+ * {
24
+ margin: 0;
25
+ padding: 0;
26
+ box-sizing: border-box;
27
+ }
28
+
29
+ body {
30
+ font-family: 'Inter', sans-serif;
31
+ background-color: var(--bg);
32
+ background-image:
33
+ radial-gradient(circle at 20% 20%, rgba(99, 102, 241, 0.15) 0%, transparent 40%),
34
+ radial-gradient(circle at 80% 80%, rgba(139, 92, 246, 0.15) 0%, transparent 40%);
35
+ color: var(--text);
36
+ line-height: 1.6;
37
+ overflow-x: hidden;
38
+ }
39
+
40
+ .container {
41
+ max-width: 1100px;
42
+ margin: 0 auto;
43
+ padding: 0 2rem;
44
+ }
45
+
46
+ /* Hero Section */
47
+ header {
48
+ height: 90vh;
49
+ display: flex;
50
+ flex-direction: column;
51
+ justify-content: center;
52
+ align-items: center;
53
+ text-align: center;
54
+ position: relative;
55
+ }
56
+
57
+ .badge {
58
+ background: var(--glass);
59
+ border: 1px solid var(--glass-border);
60
+ padding: 0.5rem 1.2rem;
61
+ border-radius: 99px;
62
+ font-size: 0.85rem;
63
+ font-weight: 600;
64
+ color: var(--primary);
65
+ margin-bottom: 1.5rem;
66
+ display: inline-block;
67
+ backdrop-filter: blur(10px);
68
+ }
69
+
70
+ h1 {
71
+ font-family: 'Outfit', sans-serif;
72
+ font-size: clamp(3rem, 8vw, 5.5rem);
73
+ font-weight: 800;
74
+ line-height: 1.1;
75
+ margin-bottom: 1.5rem;
76
+ background: linear-gradient(to bottom right, #fff 30%, var(--text-dim));
77
+ -webkit-background-clip: text;
78
+ -webkit-text-fill-color: transparent;
79
+ }
80
+
81
+ .hero-tagline {
82
+ font-size: clamp(1.1rem, 3vw, 1.4rem);
83
+ color: var(--text-dim);
84
+ max-width: 700px;
85
+ margin-bottom: 3rem;
86
+ }
87
+
88
+ .cta-group {
89
+ display: flex;
90
+ gap: 1.5rem;
91
+ flex-wrap: wrap;
92
+ justify-content: center;
93
+ }
94
+
95
+ .btn {
96
+ padding: 1rem 2.5rem;
97
+ border-radius: 12px;
98
+ font-weight: 700;
99
+ text-decoration: none;
100
+ transition: all 0.3s cubic-bezier(0.4, 0, 0.2, 1);
101
+ display: inline-flex;
102
+ align-items: center;
103
+ gap: 0.5rem;
104
+ }
105
+
106
+ .btn-primary {
107
+ background: var(--primary);
108
+ color: white;
109
+ box-shadow: 0 10px 20px -10px var(--primary-glow);
110
+ }
111
+
112
+ .btn-primary:hover {
113
+ transform: translateY(-2px);
114
+ box-shadow: 0 15px 30px -10px var(--primary-glow);
115
+ filter: brightness(1.1);
116
+ }
117
+
118
+ .btn-secondary {
119
+ background: var(--glass);
120
+ border: 1px solid var(--glass-border);
121
+ color: white;
122
+ }
123
+
124
+ .btn-secondary:hover {
125
+ background: var(--glass-border);
126
+ transform: translateY(-2px);
127
+ }
128
+
129
+ /* Section Styling */
130
+ section {
131
+ padding: 8rem 0;
132
+ }
133
+
134
+ .section-header {
135
+ margin-bottom: 4rem;
136
+ text-align: center;
137
+ }
138
+
139
+ .section-header h2 {
140
+ font-family: 'Outfit', sans-serif;
141
+ font-size: 2.5rem;
142
+ margin-bottom: 1rem;
143
+ }
144
+
145
+ .section-header p {
146
+ color: var(--text-dim);
147
+ max-width: 600px;
148
+ margin: 0 auto;
149
+ }
150
+
151
+ /* Pillars Grid */
152
+ .pillars-grid {
153
+ display: grid;
154
+ grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
155
+ gap: 2rem;
156
+ }
157
+
158
+ .pillar-card {
159
+ background: var(--card-bg);
160
+ border: 1px solid var(--glass-border);
161
+ padding: 2.5rem;
162
+ border-radius: 24px;
163
+ transition: all 0.4s ease;
164
+ backdrop-filter: blur(12px);
165
+ }
166
+
167
+ .pillar-card:hover {
168
+ transform: translateY(-10px);
169
+ border-color: var(--primary);
170
+ box-shadow: 0 20px 40px -20px rgba(0,0,0,0.5);
171
+ }
172
+
173
+ .pillar-icon {
174
+ font-size: 2rem;
175
+ margin-bottom: 1.5rem;
176
+ background: var(--glass);
177
+ width: 60px;
178
+ height: 60px;
179
+ display: flex;
180
+ align-items: center;
181
+ justify-content: center;
182
+ border-radius: 16px;
183
+ }
184
+
185
+ .pillar-card h3 {
186
+ font-size: 1.4rem;
187
+ margin-bottom: 1rem;
188
+ color: var(--primary);
189
+ }
190
+
191
+ .pillar-card p {
192
+ color: var(--text-dim);
193
+ font-size: 0.95rem;
194
+ }
195
+
196
+ /* Visuals Section */
197
+ .visual-container {
198
+ background: var(--card-bg);
199
+ border: 1px solid var(--glass-border);
200
+ border-radius: 32px;
201
+ padding: 3rem;
202
+ margin-bottom: 4rem;
203
+ overflow: hidden;
204
+ }
205
+
206
+ .visual-container img {
207
+ width: 100%;
208
+ height: auto;
209
+ border-radius: 16px;
210
+ box-shadow: 0 20px 50px rgba(0,0,0,0.4);
211
+ }
212
+
213
+ .visual-label {
214
+ display: block;
215
+ text-align: center;
216
+ margin-top: 1.5rem;
217
+ color: var(--text-dim);
218
+ font-weight: 500;
219
+ }
220
+
221
+ /* Results Table */
222
+ .table-wrapper {
223
+ overflow-x: auto;
224
+ background: var(--glass);
225
+ border-radius: 20px;
226
+ border: 1px solid var(--glass-border);
227
+ }
228
+
229
+ table {
230
+ width: 100%;
231
+ border-collapse: collapse;
232
+ text-align: left;
233
+ }
234
+
235
+ th, td {
236
+ padding: 1.5rem;
237
+ border-bottom: 1px solid var(--glass-border);
238
+ }
239
+
240
+ th {
241
+ background: rgba(255,255,255,0.05);
242
+ font-weight: 700;
243
+ text-transform: uppercase;
244
+ font-size: 0.75rem;
245
+ letter-spacing: 0.1em;
246
+ color: var(--text-dim);
247
+ }
248
+
249
+ .improvement {
250
+ color: #10b981;
251
+ font-weight: 800;
252
+ }
253
+
254
+ /* POMDP Info */
255
+ .pomdp-box {
256
+ display: grid;
257
+ grid-template-columns: 1fr 1fr;
258
+ gap: 2rem;
259
+ margin-top: 3rem;
260
+ }
261
+
262
+ .state-card {
263
+ background: var(--glass);
264
+ padding: 2rem;
265
+ border-radius: 20px;
266
+ border-left: 4px solid var(--primary);
267
+ }
268
+
269
+ .state-card h4 {
270
+ margin-bottom: 1rem;
271
+ color: var(--text);
272
+ }
273
+
274
+ /* Footer */
275
+ footer {
276
+ padding: 6rem 0;
277
+ text-align: center;
278
+ border-top: 1px solid var(--glass-border);
279
+ }
280
+
281
+ .footer-links {
282
+ display: flex;
283
+ justify-content: center;
284
+ gap: 2rem;
285
+ margin-bottom: 2rem;
286
+ }
287
+
288
+ .footer-links a {
289
+ color: var(--text-dim);
290
+ text-decoration: none;
291
+ font-weight: 600;
292
+ transition: color 0.3s;
293
+ }
294
+
295
+ .footer-links a:hover {
296
+ color: var(--primary);
297
+ }
298
+
299
+ @media (max-width: 768px) {
300
+ .pomdp-box {
301
+ grid-template-columns: 1fr;
302
+ }
303
+ h1 { font-size: 3rem; }
304
+ .pillars-grid { grid-template-columns: 1fr; }
305
+ }
306
+ </style>
307
+ </head>
308
+ <body>
309
+
310
+ <div class="container">
311
+ <!-- Hero Section -->
312
+ <header>
313
+ <div class="badge">OpenEnv Hackathon 2026</div>
314
+ <h1>MolForge</h1>
315
+ <p class="hero-tagline">A verifier-driven reinforcement learning environment for oncology drug discovery, where the LLM is the scientist, not the judge.</p>
316
+ <div class="cta-group">
317
+ <a href="https://colab.research.google.com/drive/1c6npGkGNbbbd8XFNeS6zInBpopLnJ4W4?usp=sharing" target="_blank" class="btn btn-primary">
318
+ Launch Training Notebook
319
+ </a>
320
+ <a href="#pillars" class="btn btn-secondary">Explore the Pillars</a>
321
+ </div>
322
+ </header>
323
+
324
+ <!-- POMDP Section -->
325
+ <section id="architecture">
326
+ <div class="section-header">
327
+ <h2>Scientific Architecture</h2>
328
+ <p>MolForge operates as a Partially Observable Markov Decision Process (POMDP), forcing models to operate under real-world uncertainty.</p>
329
+ </div>
330
+
331
+ <div class="visual-container">
332
+ <img src="assets/molforge_architecture.png" alt="MolForge Architecture">
333
+ <span class="visual-label">Closed-loop scientific feedback architecture</span>
334
+ </div>
335
+
336
+ <div class="pomdp-box">
337
+ <div class="state-card">
338
+ <h4>Hidden Reality</h4>
339
+ <p>The ground-truth scoring for potency, safety, and synthesizability. Includes late-stage mutation traps that only evidence can reveal.</p>
340
+ </div>
341
+ <div class="state-card">
342
+ <h4>Visible Evidence</h4>
343
+ <p>Noisy assay reports from RDKit and TDC, remaining budget, and structured feedback from the governance board.</p>
344
+ </div>
345
+ </div>
346
+ </section>
347
+
348
+ <!-- Pillars Section -->
349
+ <section id="pillars">
350
+ <div class="section-header">
351
+ <h2>The Seven Pillars</h2>
352
+ <p>Beyond simple molecule generation: a complete medicinal chemistry workflow optimizer.</p>
353
+ </div>
354
+
355
+ <div class="pillars-grid">
356
+ <div class="pillar-card">
357
+ <div class="pillar-icon">🧪</div>
358
+ <h3>Verifier-First</h3>
359
+ <p>The LLM is held accountable by RDKit and TDC simulation engines. It must justify every decision with verifiable data.</p>
360
+ </div>
361
+ <div class="pillar-card">
362
+ <div class="pillar-icon">🧬</div>
363
+ <h3>Physics Grounded</h3>
364
+ <p>Heuristic docking scores simulate pocket matching, lipophilic fit, and polarity clash in milliseconds.</p>
365
+ </div>
366
+ <div class="pillar-card">
367
+ <div class="pillar-icon">🔄</div>
368
+ <h3>Self-Correction</h3>
369
+ <p>A structured loop where agents receive reviews on their edits and iteratively repair candidates.</p>
370
+ </div>
371
+ <div class="pillar-card">
372
+ <div class="pillar-icon">📊</div>
373
+ <h3>Decomposed Rewards</h3>
374
+ <p>Fine-grained observability into research, edits, and coordination—not just a single vague scalar.</p>
375
+ </div>
376
+ <div class="pillar-card">
377
+ <div class="pillar-icon">🔬</div>
378
+ <h3>Evidence-Based</h3>
379
+ <p>Constant, verifiable reviews drive the model toward sound scientific design rather than pattern matching.</p>
380
+ </div>
381
+ <div class="pillar-card">
382
+ <div class="pillar-icon">🎓</div>
383
+ <h3>Curriculum Learning</h3>
384
+ <p>Partial credit "breadcrumbs" for early RL exploration, transitioning to strict evaluation for grading.</p>
385
+ </div>
386
+ <div class="pillar-card">
387
+ <div class="pillar-icon">🤝</div>
388
+ <h3>Governance</h3>
389
+ <p>Multi-agent specialist board reviews every plan and execution to ensure rigor and safety.</p>
390
+ </div>
391
+ </div>
392
+ </section>
393
+
394
+ <!-- Results Section -->
395
+ <section id="results">
396
+ <div class="section-header">
397
+ <h2>Training & Performance</h2>
398
+ <p>Comparing the Supervised Fine-Tuning (SFT) baseline against the final GRPO-trained policy.</p>
399
+ </div>
400
+
401
+ <div class="visual-container">
402
+ <img src="assets/reward_curve.png" alt="Reward Curve">
403
+ <span class="visual-label">Learning progression from sparse rewards to consistent submissions</span>
404
+ </div>
405
+
406
+ <div class="table-wrapper">
407
+ <table>
408
+ <thead>
409
+ <tr>
410
+ <th>Scenario Difficulty</th>
411
+ <th>Before (SFT)</th>
412
+ <th>After (RL)</th>
413
+ <th>Improvement</th>
414
+ </tr>
415
+ </thead>
416
+ <tbody>
417
+ <tr>
418
+ <td><strong>Level 0: Easy</strong></td>
419
+ <td>0.1167</td>
420
+ <td>0.1295</td>
421
+ <td class="improvement">+10.9%</td>
422
+ </tr>
423
+ <tr>
424
+ <td><strong>Level 1: Medium</strong></td>
425
+ <td>0.1167</td>
426
+ <td>0.1278</td>
427
+ <td class="improvement">+9.5%</td>
428
+ </tr>
429
+ <tr>
430
+ <td><strong>Level 2: Hard</strong></td>
431
+ <td>0.0800</td>
432
+ <td>0.0866</td>
433
+ <td class="improvement">+8.3%</td>
434
+ </tr>
435
+ </tbody>
436
+ </table>
437
+ </div>
438
+
439
+ <div style="margin-top: 4rem;">
440
+ <img src="assets/Logs.png" alt="Training Logs" style="width: 100%; border-radius: 20px; border: 1px solid var(--glass-border);">
441
+ <span class="visual-label">Detailed action telemetry and governance history</span>
442
+ </div>
443
+ </section>
444
+
445
+ <!-- Footer -->
446
+ <footer>
447
+ <div class="footer-links">
448
+ <a href="https://github.com/Adhitya-Vardhan/molt_lab" target="_blank">GitHub Repository</a>
449
+ <a href="https://huggingface.co/Adhitya122/molforge-grpo-oncology" target="_blank">Model Card</a>
450
+ <a href="https://colab.research.google.com/drive/1c6npGkGNbbbd8XFNeS6zInBpopLnJ4W4?usp=sharing" target="_blank">Colab Notebook</a>
451
+ </div>
452
+ <p style="color: var(--text-dim); font-size: 0.9rem;">Built for the OpenEnv Hackathon 2026</p>
453
+ </footer>
454
+ </div>
455
+
456
+ </body>
457
+ </html>
molforge_grpo_official_submission.ipynb CHANGED
@@ -54,7 +54,7 @@
54
  "os.environ[\"MOLFORGE_REWARD_MODE\"] = \"curriculum\"\n",
55
  "os.environ[\"MOLFORGE_TRAINING_RANDOMIZATION\"] = \"1\"\n",
56
  "\n",
57
- "RL_MAX_STEPS = 80\n",
58
  "NUM_GENERATIONS = 2\n",
59
  "PER_DEVICE_BATCH = 2\n",
60
  "GRAD_ACCUM = 4\n",
@@ -69,7 +69,9 @@
69
  "PLOT_DIR = OUTPUT_DIR / \"plots\"\n",
70
  "\n",
71
  "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
72
- "PLOT_DIR.mkdir(parents=True, exist_ok=True)"
 
 
73
  ]
74
  },
75
  {
@@ -88,70 +90,77 @@
88
  "outputs": [],
89
  "source": [
90
  "import json\n",
 
91
  "from typing import Any, Dict, Tuple\n",
92
- "from inference_common import (\n",
93
- " MolForgeAction,\n",
94
- " attach_reasoning_fields,\n",
95
- " attach_team_messages,\n",
96
- " extract_json,\n",
97
- ")\n",
98
  "from server.molforge_environment import MolForgeEnvironment\n",
99
- "from models import MolForgeState\n",
 
100
  "\n",
101
  "def replay_to_state(record: dict[str, Any]) -> MolForgeEnvironment:\n",
102
  " env = MolForgeEnvironment()\n",
103
- " env._state = MolForgeState(**record[\"state\"])\n",
104
- " env._molecule = dict(record[\"molecule\"])\n",
105
- " env._scenario = [s for s in env.SCENARIOS if s.scenario_id == env._state.scenario_id][0]\n",
106
  " return env\n",
107
  "\n",
108
- "def evaluate_completion(prompt_str: str, completion_str: str, record: dict[str, Any]) -> Tuple[float, dict]:\n",
109
- " diagnostics = {\"valid_json\": False}\n",
110
  " try:\n",
111
  " action_dict = extract_json(completion_str)\n",
112
  " action = MolForgeAction(**action_dict)\n",
 
113
  " except Exception:\n",
114
- " return -1.2, diagnostics\n",
115
  "\n",
116
- " diagnostics[\"valid_json\"] = True\n",
117
  " env = replay_to_state(record)\n",
118
- " \n",
119
- " # Create empty observation and attach reasoning\n",
120
  " observation = env._build_observation(reward=0.0, done=False, reward_components=[])\n",
121
  " action = attach_team_messages(observation, attach_reasoning_fields(observation, action))\n",
122
- " \n",
123
- " # Step the OpenEnv environment\n",
124
  " next_observation = env.step(action)\n",
125
- " reward = float(next_observation.reward)\n",
126
- " grader_scores = next_observation.metadata.get(\"terminal_grader_scores\", {})\n",
127
  " \n",
128
- " # --- ANTI-REWARD-HACKING SHAPING ---\n",
129
- " if action.action_type == \"run_assay\" and reward > 0:\n",
130
- " reward *= 0.25 # Nerf assay farming\n",
131
- " elif action.action_type == \"submit\":\n",
132
- " sub_score = float(grader_scores.get(\"submission_score\", 0.0))\n",
133
- " if sub_score > 0.0:\n",
134
- " reward += sub_score * 3.0 # Massive multiplier for submissions\n",
135
- " elif action.action_type == \"edit\" and reward > 0:\n",
136
- " reward *= 1.5 # Boost edits\n",
137
  "\n",
138
- " diagnostics.update({\n",
139
- " \"action_type\": action.action_type,\n",
140
- " \"reward\": reward,\n",
141
- " \"done\": next_observation.done,\n",
142
- " })\n",
143
- " return reward, diagnostics\n",
144
  "\n",
145
  "def molforge_reward_func(prompts, completions, **kwargs) -> list[float]:\n",
146
  " rewards = []\n",
147
- " dataset_records = kwargs.get(\"record\", [])\n",
148
- " \n",
149
- " for prompt_list, completion, record in zip(prompts, completions, dataset_records):\n",
150
- " prompt_str = prompt_list[-1][\"content\"] if isinstance(prompt_list, list) else str(prompt_list)\n",
151
- " completion_str = completion[0][\"content\"] if isinstance(completion, list) else str(completion)\n",
152
- " reward, _ = evaluate_completion(prompt_str, completion_str, record)\n",
153
  " rewards.append(reward)\n",
154
- " return rewards"
 
 
155
  ]
156
  },
157
  {
@@ -169,9 +178,9 @@
169
  "source": [
170
  "from unsloth import FastLanguageModel\n",
171
  "\n",
172
- "# Set this to your SFT checkpoint\n",
173
- "# You can set this to a local path or a Hugging Face repo\n",
174
- "SFT_ADAPTER_PATH = \"/content/drive/MyDrive/Qwen_3.5_finetune/qwen3_5_2b_lora_adapters_compact_v4\" # <-- Change to your path\n",
175
  "\n",
176
  "print(\"Loading model and applying Unsloth optimizations...\")\n",
177
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
@@ -202,47 +211,91 @@
202
  "metadata": {},
203
  "outputs": [],
204
  "source": [
205
- "from trl import GRPOConfig, GRPOTrainer\n",
206
  "from datasets import Dataset\n",
207
  "from scripts.generate_sft_compact_policy_v4_dataset import compact_action_payload, COMPACT_ACTION_SYSTEM_PROMPT\n",
 
 
208
  "\n",
209
- "# Load dataset\n",
210
- "def load_prompt_dataset() -> Dataset:\n",
211
- " import json\n",
212
- " data = []\n",
213
- " with open(\"data/molforge_sft_compact_policy_v4.jsonl\", \"r\") as f:\n",
214
- " for line in f:\n",
215
- " record = json.loads(line)\n",
216
- " prompt_text = compact_action_payload(record)\n",
217
- " data.append({\n",
218
  " \"prompt\": [\n",
219
  " {\"role\": \"system\", \"content\": COMPACT_ACTION_SYSTEM_PROMPT},\n",
220
- " {\"role\": \"user\", \"content\": prompt_text}\n",
221
  " ],\n",
222
- " \"record\": record\n",
223
  " })\n",
224
- " return Dataset.from_list(data)\n",
225
  "\n",
226
- "dataset = load_prompt_dataset()\n",
 
 
227
  "\n",
228
- "# Configure GRPO\n",
229
- "training_args = GRPOConfig(\n",
230
- " output_dir=str(OUTPUT_DIR),\n",
231
- " learning_rate=LEARNING_RATE,\n",
232
- " per_device_train_batch_size=PER_DEVICE_BATCH,\n",
233
- " gradient_accumulation_steps=GRAD_ACCUM,\n",
234
- " max_prompt_length=MAX_PROMPT_LENGTH,\n",
235
- " max_completion_length=MAX_COMPLETION_LENGTH,\n",
236
- " num_generations=NUM_GENERATIONS,\n",
237
- " max_steps=RL_MAX_STEPS,\n",
238
- " logging_steps=1,\n",
239
- " save_steps=25,\n",
240
- " bf16=True,\n",
241
- " report_to=\"none\",\n",
242
- " log_completions=True,\n",
243
- ")\n",
244
  "\n",
245
- "# Initialize Trainer\n",
246
  "trainer = GRPOTrainer(\n",
247
  " model=model,\n",
248
  " reward_funcs=molforge_reward_func,\n",
@@ -256,7 +309,7 @@
256
  "\n",
257
  "print(f\"Training complete. Saving adapters to {ADAPTER_SAVE_DIR}\")\n",
258
  "trainer.save_model(str(ADAPTER_SAVE_DIR))\n",
259
- "tokenizer.save_pretrained(str(ADAPTER_SAVE_DIR))"
260
  ]
261
  }
262
  ],
 
54
  "os.environ[\"MOLFORGE_REWARD_MODE\"] = \"curriculum\"\n",
55
  "os.environ[\"MOLFORGE_TRAINING_RANDOMIZATION\"] = \"1\"\n",
56
  "\n",
57
+ "RL_MAX_STEPS = 300\n",
58
  "NUM_GENERATIONS = 2\n",
59
  "PER_DEVICE_BATCH = 2\n",
60
  "GRAD_ACCUM = 4\n",
 
69
  "PLOT_DIR = OUTPUT_DIR / \"plots\"\n",
70
  "\n",
71
  "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
72
+ "PLOT_DIR.mkdir(parents=True, exist_ok=True)\n",
73
+ "LOG_DIR = OUTPUT_DIR / \"logs\"\n",
74
+ "LOG_DIR.mkdir(parents=True, exist_ok=True)\n"
75
  ]
76
  },
77
  {
 
90
  "outputs": [],
91
  "source": [
92
  "import json\n",
93
+ "import time\n",
94
  "from typing import Any, Dict, Tuple\n",
95
+ "from inference_common import MolForgeAction, attach_reasoning_fields, attach_team_messages, extract_json\n",
96
  "from server.molforge_environment import MolForgeEnvironment\n",
97
+ "\n",
98
+ "COMPLETION_LOG = LOG_DIR / \"completion_diagnostics.jsonl\"\n",
99
  "\n",
100
  "def replay_to_state(record: dict[str, Any]) -> MolForgeEnvironment:\n",
101
  " env = MolForgeEnvironment()\n",
102
+ " if record.get(\"randomized\"): os.environ[\"MOLFORGE_TRAINING_RANDOMIZATION\"] = \"1\"\n",
103
+ " os.environ[\"MOLFORGE_RAND_SEED\"] = str(record.get(\"random_seed\", \"rl\"))\n",
104
+ " observation = env.reset()\n",
105
+ " for action_payload in record.get(\"pre_actions\", []):\n",
106
+ " action = MolForgeAction(**action_payload)\n",
107
+ " observation = env.step(attach_team_messages(observation, attach_reasoning_fields(observation, action)))\n",
108
  " return env\n",
109
  "\n",
110
+ "def evaluate_completion(prompt_str, completion_str, record) -> Tuple[float, dict]:\n",
 
111
  " try:\n",
112
  " action_dict = extract_json(completion_str)\n",
113
  " action = MolForgeAction(**action_dict)\n",
114
+ " valid_json = True\n",
115
  " except Exception:\n",
116
+ " return -1.5, {\"valid_json\": False, \"action_type\": \"invalid\"}\n",
117
  "\n",
 
118
  " env = replay_to_state(record)\n",
 
 
119
  " observation = env._build_observation(reward=0.0, done=False, reward_components=[])\n",
120
  " action = attach_team_messages(observation, attach_reasoning_fields(observation, action))\n",
 
 
121
  " next_observation = env.step(action)\n",
 
 
122
  " \n",
123
+ " # --- ANTI-REWARD HACKING FILTER ---\n",
124
+ " # We manually sum only the scientific reward components, ignoring \"chatter\" rewards\n",
125
+ " filtered_reward = 0.0\n",
126
+ " keep_components = {\n",
127
+ " \"edit_delta\", \"submission_quality\", \"hard_constraints\", \"baseline_gate\",\n",
128
+ " \"submission_evidence\", \"curriculum_terminal_progress\", \"curriculum_evidence_gate\"\n",
129
+ " }\n",
130
+ " penalties = {\"invalid_action\", \"budget_exhausted\", \"step_limit\", \"policy_veto\", \"loop_penalty\"}\n",
131
+ " \n",
132
+ " for component in next_observation.reward_components:\n",
133
+ " if component.name in keep_components:\n",
134
+ " filtered_reward += component.value\n",
135
+ " elif component.name in penalties:\n",
136
+ " filtered_reward += component.value\n",
137
  "\n",
138
+ " # Add a mandatory time pressure penalty for every step\n",
139
+ " filtered_reward -= 0.15 \n",
140
+ " \n",
141
+ " grader_scores = next_observation.metadata.get(\"terminal_grader_scores\", {})\n",
142
+ " \n",
143
+ " # Extra multipliers for reaching the goal\n",
144
+ " if action.action_type == \"submit\" and grader_scores.get(\"submission_score\", 0) > 0:\n",
145
+ " filtered_reward += float(grader_scores[\"submission_score\"]) * 4.0\n",
146
+ " \n",
147
+ " reward = round(filtered_reward, 4)\n",
148
+ " \n",
149
+ " return reward, {\n",
150
+ " \"valid_json\": True, \"action_type\": action.action_type, \"reward\": reward, \n",
151
+ " \"done\": next_observation.done, \"scores\": grader_scores, \n",
152
+ " \"raw_completion\": completion_str, \"timestamp\": time.time()\n",
153
+ " }\n",
154
  "\n",
155
  "def molforge_reward_func(prompts, completions, **kwargs) -> list[float]:\n",
156
  " rewards = []\n",
157
+ " for i in range(len(completions)):\n",
158
+ " record = {\"pre_actions\": kwargs[\"record\"][i][\"pre_actions\"] if \"record\" in kwargs else []}\n",
159
+ " reward, diagnostics = evaluate_completion(\"\", completions[i][0][\"content\"], record)\n",
160
  " rewards.append(reward)\n",
161
+ " with open(COMPLETION_LOG, \"a\") as f:\n",
162
+ " f.write(json.dumps(diagnostics) + \"\\n\")\n",
163
+ " return rewards\n"
164
  ]
165
  },
166
  {
 
178
  "source": [
179
  "from unsloth import FastLanguageModel\n",
180
  "\n",
181
+ "# Set this to your SFT checkpoint Deployed to hugging face space \n",
182
+ "# SFT trained on only to mimic the response behavioiur of the model (structured responses visit the hf blog for more detailed explanation )\n",
183
+ "SFT_ADAPTER_PATH = \"Adhitya122/qwen3_5_2b_molforge_sft_v4\"\n",
184
  "\n",
185
  "print(\"Loading model and applying Unsloth optimizations...\")\n",
186
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
 
211
  "metadata": {},
212
  "outputs": [],
213
  "source": [
 
214
  "from datasets import Dataset\n",
215
  "from scripts.generate_sft_compact_policy_v4_dataset import compact_action_payload, COMPACT_ACTION_SYSTEM_PROMPT\n",
216
+ "from inference_common import heuristic_team_action\n",
217
+ "import random\n",
218
  "\n",
219
+ "def build_dynamic_prompts(episodes=50, max_turns=5) -> Dataset:\n",
220
+ " \"\"\"Generates training prompts by playing the environment with a heuristic expert.\"\"\"\n",
221
+ " print(f\"Generating {episodes} episodes of dynamic prompts...\")\n",
222
+ " records = []\n",
223
+ " env = MolForgeEnvironment()\n",
224
+ " \n",
225
+ " for _ in range(episodes):\n",
226
+ " observation = env.reset()\n",
227
+ " pre_actions = []\n",
228
+ " \n",
229
+ " for _ in range(max_turns):\n",
230
+ " if observation.done:\n",
231
+ " break\n",
232
+ " \n",
233
+ " # Capture the current state as a prompt\n",
234
+ " prompt_payload = compact_action_payload(observation)\n",
235
+ " records.append({\n",
236
  " \"prompt\": [\n",
237
  " {\"role\": \"system\", \"content\": COMPACT_ACTION_SYSTEM_PROMPT},\n",
238
+ " {\"role\": \"user\", \"content\": json.dumps(prompt_payload)}\n",
239
  " ],\n",
240
+ " \"record\": {\n",
241
+ " \"scenario_id\": observation.scenario_id,\n",
242
+ " \"difficulty\": observation.difficulty,\n",
243
+ " \"step_index\": observation.step_index,\n",
244
+ " \"pre_actions\": list(pre_actions),\n",
245
+ " \"randomized\": True,\n",
246
+ " \"random_seed\": \"dynamic-rl\"\n",
247
+ " }\n",
248
  " })\n",
249
+ " \n",
250
+ " # Use expert to move to the next state\n",
251
+ " action = heuristic_team_action(observation)\n",
252
+ " observation = env.step(action)\n",
253
+ " pre_actions.append({\"action_type\": action.action_type, \"acting_role\": action.acting_role})\n",
254
+ " \n",
255
+ " random.shuffle(records)\n",
256
+ " return Dataset.from_list(records)\n",
257
+ "\n",
258
+ "# Generate the dataset dynamically (no .jsonl needed!)\n",
259
+ "dataset = build_dynamic_prompts(episodes=20, max_turns=6)\n",
260
+ "print(f\"Dynamic dataset created with {len(dataset)} prompt states.\")\n"
261
+ ]
262
+ },
263
+ {
264
+ "cell_type": "code",
265
+ "execution_count": null,
266
+ "metadata": {},
267
+ "outputs": [],
268
+ "source": [
269
+ "from trl import GRPOConfig, GRPOTrainer\n",
270
+ "import inspect\n",
271
+ "import torch\n",
272
  "\n",
273
+ "# Check for BF16 support (T4 does not support it, A100/L4 do)\n",
274
+ "has_bf16 = torch.cuda.is_bf16_supported()\n",
275
+ "print(f\"GPU supports BF16: {has_bf16}\")\n",
276
  "\n",
277
+ "config_kwargs = {\n",
278
+ " \"output_dir\": str(OUTPUT_DIR),\n",
279
+ " \"learning_rate\": LEARNING_RATE,\n",
280
+ " \"per_device_train_batch_size\": PER_DEVICE_BATCH,\n",
281
+ " \"gradient_accumulation_steps\": GRAD_ACCUM,\n",
282
+ " \"max_prompt_length\": MAX_PROMPT_LENGTH,\n",
283
+ " \"max_completion_length\": MAX_COMPLETION_LENGTH,\n",
284
+ " \"num_generations\": NUM_GENERATIONS,\n",
285
+ " \"max_steps\": RL_MAX_STEPS,\n",
286
+ " \"logging_steps\": 1,\n",
287
+ " \"save_steps\": 25,\n",
288
+ " \"bf16\": has_bf16,\n",
289
+ " \"fp16\": not has_bf16,\n",
290
+ " \"report_to\": \"none\",\n",
291
+ " \"log_completions\": True,\n",
292
+ "}\n",
293
+ "\n",
294
+ "supported_params = inspect.signature(GRPOConfig.__init__).parameters\n",
295
+ "filtered_kwargs = {k: v for k, v in config_kwargs.items() if k in supported_params}\n",
296
+ "\n",
297
+ "training_args = GRPOConfig(**filtered_kwargs)\n",
298
  "\n",
 
299
  "trainer = GRPOTrainer(\n",
300
  " model=model,\n",
301
  " reward_funcs=molforge_reward_func,\n",
 
309
  "\n",
310
  "print(f\"Training complete. Saving adapters to {ADAPTER_SAVE_DIR}\")\n",
311
  "trainer.save_model(str(ADAPTER_SAVE_DIR))\n",
312
+ "tokenizer.save_pretrained(str(ADAPTER_SAVE_DIR))\n"
313
  ]
314
  }
315
  ],