{ "cells": [ { "id": "intro", "cell_type": "markdown", "metadata": {}, "source": [ "# Autonomous Executive Assistant Sandbox\n", "\n", "Notebook for OpenRouter Gemma rollouts, checkpoint export, and RL training. Use the `scalerhack2-training` kernel so the environment matches the validated training pipeline." ] }, { "id": "workflow", "cell_type": "markdown", "metadata": {}, "source": [ "## Workflow\n", "\n", "1. Load `.env.training` directly from the repository root.\n", "2. Run the baseline suite to confirm the environment is stable.\n", "3. Run an OpenRouter Gemma rollout if the API key is available.\n", "4. Export traces for analysis or imitation-style warm starts.\n", "5. Train the tabular RL agent and save a checkpoint.\n", "6. Promote stable changes back into `src/` and keep tests green." ] }, { "id": "imports", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "import os\n", "from pathlib import Path\n", "\n", "from src.executive_assistant.agent import BaselineAgent, OpenRouterPolicy\n", "from src.executive_assistant.config import OpenRouterConfig, load_env_file\n", "from src.executive_assistant.runner import EpisodeRunner, export_traces_jsonl, run_policy_suite\n", "from src.executive_assistant.training import evaluate_q_policy, train_q_learning\n" ] }, { "id": "config", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ENV_FILE = Path('.env.training')\n", "ENV_LOADED = load_env_file(ENV_FILE)\n", "HAS_OPENROUTER_KEY = bool(os.environ.get('OPENROUTER_API_KEY'))\n", "\n", "TASK_NAME = 'hard_rag_reply'\n", "POLICY_PROVIDER = 'openrouter' if HAS_OPENROUTER_KEY else 'baseline'\n", "# Default to Gemma 3 27B; there is no 'gemma-4-31b' model on OpenRouter.\n", "MODEL_NAME = os.environ.get('OPENROUTER_MODEL', 'google/gemma-3-27b-it')\n", "MAX_STEPS = 12\n", "TRACE_DIR = Path('artifacts/traces')\n", "CHECKPOINT_DIR = Path('artifacts/checkpoints')\n", "TRACE_DIR.mkdir(parents=True, exist_ok=True)\n", "CHECKPOINT_DIR.mkdir(parents=True, 
exist_ok=True)\n", "\n", "{\n", "    'env_file_found': ENV_LOADED,\n", "    'has_openrouter_key': HAS_OPENROUTER_KEY,\n", "    'policy_provider': POLICY_PROVIDER,\n", "    'model_name': MODEL_NAME,\n", "}\n" ] }, { "id": "policy-builder", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def build_policy(provider: str, model_name: str):\n", "    if provider == 'baseline':\n", "        return BaselineAgent()\n", "    if provider == 'openrouter':\n", "        config = OpenRouterConfig.from_env(ENV_FILE)\n", "        # Rebuild the config so the notebook-selected model overrides the env default.\n", "        config = OpenRouterConfig(\n", "            api_key=config.api_key,\n", "            model_name=model_name,\n", "            base_url=config.base_url,\n", "            site_url=config.site_url,\n", "            app_name=config.app_name,\n", "            temperature=config.temperature,\n", "            max_tokens=config.max_tokens,\n", "        )\n", "        return OpenRouterPolicy(config=config)\n", "    raise ValueError(f'Unsupported provider: {provider}')\n" ] }, { "id": "baseline-note", "cell_type": "markdown", "metadata": {}, "source": [ "## Baseline validation\n", "\n", "Run this first. If the baseline no longer solves the seeded tasks, stop and fix the environment before trusting any LLM or RL results." ] }, { "id": "baseline-run", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "baseline_traces = run_policy_suite(\n", "    policy=BaselineAgent(),\n", "    task_names=[\n", "        'easy_deadline_extraction',\n", "        'medium_triage_and_negotiation',\n", "        'hard_rag_reply',\n", "    ],\n", "    max_steps=MAX_STEPS,\n", ")\n", "\n", "{name: {'completed': trace.completed, 'score': trace.final_score, 'steps': len(trace.steps)} for name, trace in baseline_traces.items()}\n" ] }, { "id": "rollout-note", "cell_type": "markdown", "metadata": {}, "source": [ "## Policy rollout\n", "\n", "This uses OpenRouter Gemma automatically when `.env.training` provides the key. Otherwise it falls back to the baseline policy." 
] }, { "id": "rollout-run", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "policy = build_policy(POLICY_PROVIDER, MODEL_NAME)\n", "runner = EpisodeRunner(policy=policy, max_steps=MAX_STEPS)\n", "trace = runner.run(TASK_NAME)\n", "\n", "print(json.dumps(trace.to_dict(), indent=2))\n" ] }, { "id": "rollout-snapshot", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "trace.steps[-1].snapshot\n" ] }, { "id": "export-note", "cell_type": "markdown", "metadata": {}, "source": [ "## Export traces\n", "\n", "These JSONL traces are the main interface between rollout collection and downstream training or regression analysis." ] }, { "id": "export-run", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "suite_traces = run_policy_suite(\n", "    policy=build_policy(POLICY_PROVIDER, MODEL_NAME),\n", "    task_names=[TASK_NAME],\n", "    max_steps=MAX_STEPS,\n", ")\n", "\n", "output_path = export_traces_jsonl(\n", "    list(suite_traces.values()),\n", "    TRACE_DIR / f'{POLICY_PROVIDER}_{TASK_NAME}_traces.jsonl',\n", ")\n", "\n", "print(output_path)\n" ] }, { "id": "train-note", "cell_type": "markdown", "metadata": {}, "source": [ "## RL training\n", "\n", "This trains the tabular Q-learning policy with a baseline-teacher warm start, saves a checkpoint, and evaluates the trained policy on all seeded tasks." 
] }, { "id": "train-run", "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "q_policy, training_scores = train_q_learning(\n", "    episodes=300,\n", "    epsilon=0.15,\n", "    teacher=BaselineAgent(),\n", ")\n", "checkpoint_path = q_policy.save(CHECKPOINT_DIR / 'q_policy_notebook.json')\n", "evaluation = evaluate_q_policy(q_policy)\n", "\n", "{\n", "    'checkpoint': str(checkpoint_path),\n", "    'training_scores': training_scores,\n", "    'evaluation': evaluation,\n", "}\n" ] }, { "id": "env-note", "cell_type": "markdown", "metadata": {}, "source": [ "## Environment note\n", "\n", "The notebook loads `.env.training` directly from the repo root. That keeps CLI runs, notebook runs, and Jupyter-launched kernels aligned without requiring manual exports in the shell." ] } ], "metadata": { "kernelspec": { "display_name": "Python (scalerhack2-training)", "language": "python", "name": "scalerhack2-training" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.14" } }, "nbformat": 4, "nbformat_minor": 5 }