Imsachin010 committed
Commit 1af4cba · 1 Parent(s): 5edec00

HF Spaces GPU training pipeline

.dockerignore ADDED
@@ -0,0 +1,28 @@
+ # Git
+ .git/
+ .gitignore
+ 
+ # Python cache
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.egg-info/
+ 
+ # Training outputs (too large for Docker context)
+ salespath_out/
+ 
+ # IDE
+ .idea/
+ .vscode/
+ *.swp
+ *.swo
+ 
+ # OS
+ .DS_Store
+ Thumbs.db
+ 
+ # Notebook checkpoints
+ .ipynb_checkpoints/
+ 
+ # HF Space specific
+ push_to_hub.py
Dockerfile CHANGED
@@ -1,32 +1,43 @@
- FROM python:3.11-slim
  
  # HuggingFace Spaces runs on port 7860 by default
  ENV PORT=7860
  ENV PYTHONUNBUFFERED=1
  ENV PYTHONDONTWRITEBYTECODE=1
  
  WORKDIR /app
  
  # Install system dependencies
  RUN apt-get update && apt-get install -y --no-install-recommends \
-     curl \
      && rm -rf /var/lib/apt/lists/*
  
- # Install Python dependencies
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  
  # Copy the salespath_env package and training scripts
  COPY salespath_env/ ./salespath_env/
  COPY training/ ./training/
  
- # Copy and set permissions for the training script
- COPY run_hf_training.sh ./run_hf_training.sh
- RUN sed -i 's/\r$//' ./run_hf_training.sh && chmod +x ./run_hf_training.sh
- 
- # Health check
- HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
-     CMD curl -f http://localhost:${PORT}/health || exit 1
- 
- # Start the FastAPI server on HF Spaces port
- CMD ["sh", "-c", "uvicorn salespath_env.server.app:app --host 0.0.0.0 --port ${PORT}"]

+ FROM nvidia/cuda:12.1.1-cudnn8-devel-ubuntu22.04
  
  # HuggingFace Spaces runs on port 7860 by default
  ENV PORT=7860
  ENV PYTHONUNBUFFERED=1
  ENV PYTHONDONTWRITEBYTECODE=1
+ ENV DEBIAN_FRONTEND=noninteractive
  
  WORKDIR /app
  
  # Install system dependencies
  RUN apt-get update && apt-get install -y --no-install-recommends \
+     python3 python3-pip python3-dev git curl \
+     && ln -sf /usr/bin/python3 /usr/bin/python \
      && rm -rf /var/lib/apt/lists/*
  
+ # Pin NumPy to avoid breakage
+ RUN pip install --no-cache-dir --upgrade pip && \
+     pip install "numpy<2"
+ 
+ # Install PyTorch with CUDA 12.1 support
+ RUN pip install torch==2.4.0 torchvision==0.19.0 --index-url https://download.pytorch.org/whl/cu121
+ 
+ # Copy and install Python dependencies
  COPY requirements.txt .
  RUN pip install --no-cache-dir -r requirements.txt
  
  # Copy the salespath_env package and training scripts
  COPY salespath_env/ ./salespath_env/
  COPY training/ ./training/
+ COPY scripts/ ./scripts/
+ 
+ # Install the salespath_env package
+ RUN pip install -e . --no-deps || true
  
+ # Copy and set permissions for the training entrypoint
+ COPY run_hf_training.sh ./run_training.sh
+ RUN sed -i 's/\r$//' ./run_training.sh && chmod +x ./run_training.sh
+ 
+ # NO HEALTHCHECK — the entrypoint script starts a background health server
+ # to keep HF Spaces alive during long training runs
+ 
+ CMD ["/bin/bash", "./run_training.sh"]
RULES.md ADDED
@@ -0,0 +1,79 @@
+ # SalesPath — Business Rules (R01–R09)
+ 
+ The environment enforces these 9 business rules at every step.
+ Three violations → episode terminates with a heavy penalty.
+ 
+ ---
+ 
+ ## R01 — Qualify Before Present
+ 
+ > **Must QUALIFY before PRESENT**
+ 
+ The agent cannot pitch the product until it has asked qualifying questions about the prospect's needs, budget, and situation.
+ 
+ ## R02 — Demo Before Negotiate
+ 
+ > **Must OFFER_DEMO before NEGOTIATE**
+ 
+ No discount or price negotiation is allowed unless a product demo has been offered and scheduled.
+ 
+ ## R03 — Budget Known Before Negotiate
+ 
+ > **Budget must be known before NEGOTIATE**
+ 
+ The prospect's budget must be revealed (via the QUALIFY action) before the agent can enter negotiations.
+ 
+ ## R04 — Discount After Objections
+ 
+ > **Discount only after 2 objections handled**
+ 
+ If the agent mentions a discount during NEGOTIATE, at least 2 prospect objections must have been successfully handled first.
+ 
+ ## R05 — No Repeat Action
+ 
+ > **Cannot repeat the same action consecutively**
+ 
+ The agent cannot use the same action type twice in a row: a QUALIFY cannot follow a QUALIFY, a PRESENT cannot follow a PRESENT, and so on.
+ 
+ ## R06 — First Action Must Be PROSPECT
+ 
+ > **First action must always be PROSPECT**
+ 
+ Every episode must begin with the PROSPECT action. Any other first action is invalid.
+ 
+ ## R07 — Follow-Up Only After Silence
+ 
+ > **FOLLOW_UP only after the prospect goes silent**
+ 
+ FOLLOW_UP is only valid when the prospect has disengaged (returned a `silence` response). If the prospect just responded with actual content, FOLLOW_UP is a violation.
+ 
+ ## R08 — Disqualify Logic
+ 
+ > **DISQUALIFY only if the prospect is genuinely unqualified**
+ 
+ DISQUALIFY is a violation if the prospect is actually closable (true budget ≥ close threshold AND a decision maker is present). Use it only when the deal is truly unwinnable.
+ 
+ ## R09 — Close Requires Demo
+ 
+ > **Must OFFER_DEMO before CLOSE (difficulty 2+)**
+ 
+ On difficulty 2 and above, the agent must have completed OFFER_DEMO before attempting to CLOSE the deal.
+ 
+ ---
+ 
+ ## How Rules Are Enforced
+ 
+ Rules are checked **before** the prospect responds to an action. Violations are accumulated in `constraints_violated` and returned in the observation:
+ 
+ ```python
+ # Observation schema
+ {
+     "constraints_violated": ["R01", "R05"],   # New violations this turn
+     "steps_completed": ["PROSPECT", "QUALIFY"],
+     ...
+ }
+ ```
+ 
+ When `len(constraints_violated) >= 3`, the episode terminates with:
+ - `r_outcome = -0.5` (terminal penalty)
+ - `r_compliance = -0.2 × violations` (per-turn penalty)
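The three-strike enforcement described above can be sketched as a small accumulator. The class and method names here (`ViolationTracker`, `record`) are illustrative only, not the project's actual API, which lives in `salespath_env/server/rules.py`:

```python
# Hedged sketch of the documented three-strike penalty scheme.
# Names are hypothetical; constants mirror the values stated in RULES.md.

class ViolationTracker:
    """Accumulates rule violations and applies the documented penalties."""

    MAX_VIOLATIONS = 3
    R_OUTCOME_PENALTY = -0.5       # terminal penalty on the third strike
    R_COMPLIANCE_PER_HIT = -0.2    # per-turn penalty per new violation

    def __init__(self):
        self.violated: list[str] = []

    def record(self, new_violations: list[str]) -> dict:
        """Record this turn's violations; return reward components and done flag."""
        self.violated.extend(new_violations)
        terminated = len(self.violated) >= self.MAX_VIOLATIONS
        return {
            "r_compliance": self.R_COMPLIANCE_PER_HIT * len(new_violations),
            "r_outcome": self.R_OUTCOME_PENALTY if terminated else 0.0,
            "terminated": terminated,
        }

tracker = ViolationTracker()
first = tracker.record(["R01"])         # one strike: episode continues
third = tracker.record(["R05", "R07"])  # strikes 2 and 3: terminal penalty applies
```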
push_to_hub.py CHANGED
@@ -13,7 +13,8 @@ IGNORE_PATTERNS = [
  "*.egg-info/**",
  "push_to_hub.py",
  "salespath_env/server/Dockerfile",  # root Dockerfile is used instead
- "training/**",  # exclude training scripts from Space
  ]

  def main():

  "*.egg-info/**",
  "push_to_hub.py",
  "salespath_env/server/Dockerfile",  # root Dockerfile is used instead
+ # Training scripts ARE included for HF Spaces GPU training
+ # "training/**",
  ]

  def main():
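With `"training/**"` commented out, the training scripts now pass the ignore filter and get pushed to the Space. A quick sanity check of that filtering, sketched with the stdlib `fnmatch` (the Hub library's actual matcher may differ in edge cases):

```python
# Illustrative check that training/ files survive the updated ignore list.
from fnmatch import fnmatch

IGNORE_PATTERNS = [
    "*.egg-info/**",
    "push_to_hub.py",
    "salespath_env/server/Dockerfile",
    # "training/**",  # no longer ignored
]

def is_ignored(path: str) -> bool:
    """True if the repo-relative path matches any ignore pattern."""
    return any(fnmatch(path, pat) for pat in IGNORE_PATTERNS)

assert not is_ignored("training/grpo_train.py")  # now uploaded to the Space
assert is_ignored("push_to_hub.py")
```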
pyproject.toml CHANGED
@@ -11,10 +11,12 @@ dependencies = [
  "fastapi",
  "uvicorn",
  "pydantic>=2.0",
- "trl>=0.8.0",
- "unsloth",
  "torch",
- "transformers",
  ]

  [tool.setuptools.packages.find]

  "fastapi",
  "uvicorn",
  "pydantic>=2.0",
+ "trl>=0.11.0",
+ "peft>=0.11.0",
  "torch",
+ "transformers>=4.44.0",
+ "accelerate>=0.33.0",
+ "bitsandbytes>=0.43.0",
  ]

  [tool.setuptools.packages.find]
run_hf_training.sh CHANGED
@@ -1,28 +1,174 @@
- #!/bin/bash
- 
- # Start the environment server in the background (HF Spaces default port 7860)
- echo "Starting SalesPath environment server..."
- uvicorn salespath_env.server.app:app --host 0.0.0.0 --port 7860 &
- 
- # Give the server a few seconds to start up completely
- sleep 5
- 
- # Start the GRPO Training using standard HuggingFace (PEFT)
- echo "Starting 0.5B GRPO Training..."
- PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python -u -m training.grpo_train \
      --mode grpo \
-     --env-url http://127.0.0.1:7860 \
-     --model-name Qwen/Qwen2.5-0.5B-Instruct \
-     --grpo-steps 150 \
      --grpo-dataset-size 128 \
-     --num-generations 4 \
-     --max-completion-length 256 \
-     --per-device-train-batch-size 4 \
-     --gradient-accumulation-steps 8 \
-     --output-dir ./salespath_out \
-     --logging-steps 10 \
-     --push-to-hub \
-     --hub-repo Imsachin010/salespath-qwen25-0.5b
- 
- echo "Training complete and pushed to hub! Keeping container alive for logs..."
- tail -f /dev/null

+ #!/usr/bin/env bash
+ set -euo pipefail
+ cd /app
+ 
+ # ====================================================================
+ # SalesPath Training Pipeline — Configuration
+ # ====================================================================
+ # Override any of these via HF Space "Variables and secrets" settings.
+ #
+ # GPU VRAM Guide (for GRPO with LoRA 4-bit):
+ #   T4 (16GB)   → 0.5B-3B models  → num_generations=2-4, batch=1-2
+ #   L4 (24GB)   → 7B models       → num_generations=2, batch=1
+ #   A10G (24GB) → 7B models       → num_generations=4, batch=2
+ #   A100 (40GB) → 14B-32B models  → num_generations=4, batch=4
+ #
+ # Example for 7B on L4:
+ #   MODEL_NAME=Qwen/Qwen2.5-7B-Instruct
+ #   NUM_GENERATIONS=2
+ #   PER_DEVICE_BATCH=1
+ #   MAX_SEQ_LEN=512
+ # ====================================================================
+ 
+ export PORT="${PORT:-7860}"
+ export HF_MODEL_REPO="${HF_MODEL_REPO:-Imsachin010/salespath-qwen25-0.5b}"
+ export MODEL_NAME="${MODEL_NAME:-Qwen/Qwen2.5-0.5B-Instruct}"
+ export OUTPUT_DIR="${OUTPUT_DIR:-/app/salespath_out}"
+ export GRPO_STEPS="${GRPO_STEPS:-100}"
+ export NUM_GENERATIONS="${NUM_GENERATIONS:-4}"
+ export PER_DEVICE_BATCH="${PER_DEVICE_BATCH:-2}"
+ export GRAD_ACCUM="${GRAD_ACCUM:-8}"
+ export MAX_SEQ_LEN="${MAX_SEQ_LEN:-1024}"
+ export LOGGING_STEPS="${LOGGING_STEPS:-10}"
+ export EVAL_EPISODES="${EVAL_EPISODES:-4}"
+ 
+ echo "========================================"
+ echo " SalesPath Training Pipeline"
+ echo " Model:   $MODEL_NAME"
+ echo " HF Repo: $HF_MODEL_REPO"
+ echo " GPU:     $(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null || echo 'N/A')"
+ echo " Port:    $PORT"
+ echo "========================================"
+ 
+ # ------------------------------------------------------------------
+ # 1. Background health server (keeps HF Spaces happy during training)
+ # ------------------------------------------------------------------
+ echo "Starting background health server on port $PORT..."
+ python3 -c "
+ import http.server, socketserver, os
+ PORT = int(os.environ.get('PORT', 7860))
+ class H(http.server.SimpleHTTPRequestHandler):
+     def do_GET(self):
+         if self.path == '/health':
+             self.send_response(200); self.end_headers(); self.wfile.write(b'OK')
+         else:
+             self.send_response(404); self.end_headers()
+     def log_message(self, *a): pass
+ with socketserver.TCPServer(('', PORT), H) as httpd:
+     httpd.serve_forever()
+ " &
+ HEALTH_PID=$!
+ echo "Health server PID: $HEALTH_PID"
+ sleep 2
+ 
+ # ------------------------------------------------------------------
+ # 2. HF login (if token is set as a secret)
+ # ------------------------------------------------------------------
+ if [[ -n "${HF_TOKEN:-}" ]]; then
+     echo "Logging in to Hugging Face Hub..."
+     huggingface-cli login --token "$HF_TOKEN" --add-to-git-credential
+ fi
+ 
+ # ------------------------------------------------------------------
+ # 3. Pre-flight check
+ # ------------------------------------------------------------------
+ echo "=== Pre-flight check ==="
+ python training/preflight_check.py || echo "Pre-flight warning (non-fatal)"
+ 
+ # ------------------------------------------------------------------
+ # 4. Start environment server (needed for rollout-based training)
+ # ------------------------------------------------------------------
+ echo "Starting SalesPath environment server on port 8000..."
+ uvicorn salespath_env.server.app:app --host 0.0.0.0 --port 8000 &
+ ENV_PID=$!
+ sleep 3
+ 
+ # Verify the environment server is healthy
+ python3 -c "
+ import httpx, time
+ for i in range(10):
+     try:
+         r = httpx.get('http://127.0.0.1:8000/health', timeout=5)
+         if r.status_code == 200:
+             print('Environment server OK'); break
+     except Exception:
+         pass
+     time.sleep(2)
+ "
+ 
+ # ------------------------------------------------------------------
+ # 5. GRPO Training
+ # ------------------------------------------------------------------
+ echo ""
+ echo "=== GRPO Training with $MODEL_NAME ==="
+ echo "Steps: $GRPO_STEPS | Generations: $NUM_GENERATIONS | Batch: $PER_DEVICE_BATCH"
+ 
+ # The || clause captures a non-zero exit code without tripping `set -e`
+ TRAINING_EXIT=0
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
+ python -u -m training.grpo_train \
+     --mode grpo \
+     --env-url http://127.0.0.1:8000 \
+     --model-name "$MODEL_NAME" \
+     --grpo-steps "$GRPO_STEPS" \
+     --grpo-dataset-size 128 \
+     --num-generations "$NUM_GENERATIONS" \
+     --max-completion-length "$MAX_SEQ_LEN" \
+     --per-device-train-batch-size "$PER_DEVICE_BATCH" \
+     --gradient-accumulation-steps "$GRAD_ACCUM" \
+     --output-dir "$OUTPUT_DIR" \
+     --logging-steps "$LOGGING_STEPS" || TRAINING_EXIT=$?
+ 
+ echo "GRPO training exit code: $TRAINING_EXIT"
+ 
+ # ------------------------------------------------------------------
+ # 6. Evaluation: baseline vs trained
+ # ------------------------------------------------------------------
+ if [[ $TRAINING_EXIT -eq 0 ]]; then
+     echo ""
+     echo "=== Evaluation: Baseline vs Trained ==="
+     python training/eval_baseline_vs_trained.py \
+         --base "$MODEL_NAME" \
+         --trained "$OUTPUT_DIR/grpo_final" \
+         --env-url http://127.0.0.1:8000 \
+         --episodes-per-level "$EVAL_EPISODES" \
+         --output "$OUTPUT_DIR/eval_results.json"
+ 
+     echo ""
+     echo "=== Generating reward plots ==="
+     python training/plot_rewards.py \
+         --input "$OUTPUT_DIR/reward_history.txt" \
+         --output "$OUTPUT_DIR/reward_graph.png" || echo "Plotting skipped"
+ fi
+ 
+ # ------------------------------------------------------------------
+ # 7. Upload artifacts to Hugging Face Hub
+ # ------------------------------------------------------------------
+ if [[ $TRAINING_EXIT -eq 0 && -n "${HF_TOKEN:-}" ]]; then
+     echo ""
+     echo "=== Uploading to $HF_MODEL_REPO ==="
+ 
+     # Upload GRPO adapters
+     huggingface-cli upload "$HF_MODEL_REPO" "$OUTPUT_DIR/grpo_final" . --repo-type model || true
+ 
+     # Upload logs and plots
+     for f in reward_history.txt eval_results.json reward_graph.png; do
+         if [[ -f "$OUTPUT_DIR/$f" ]]; then
+             huggingface-cli upload "$HF_MODEL_REPO" "$OUTPUT_DIR/$f" "training_artifacts/$f" --repo-type model || true
+         fi
+     done
+ 
+     echo "Upload complete!"
+ fi
+ 
+ # ------------------------------------------------------------------
+ # 8. Keep container alive for log inspection
+ # ------------------------------------------------------------------
+ echo ""
+ echo "Training pipeline complete."
+ echo "Container will stay alive. Check logs via the HF Spaces dashboard."
+ echo "Stop the Space manually when done to avoid further billing."
+ 
+ # Kill background servers
+ kill $HEALTH_PID 2>/dev/null || true
+ kill $ENV_PID 2>/dev/null || true
+ 
+ # Start keepalive app
+ exec uvicorn training.hf_keepalive_app:app --host 0.0.0.0 --port "$PORT"
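Every setting in the script uses the `${VAR:-default}` parameter expansion, so a Space variable silently overrides the default while an unset variable falls back safely. A minimal demonstration of that pattern:

```shell
# Demonstrates the ${VAR:-default} expansion used for every setting above.
unset GRPO_STEPS
echo "${GRPO_STEPS:-100}"   # unset, so the default is used

GRPO_STEPS=250
echo "${GRPO_STEPS:-100}"   # set, so the override wins
```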
salespath_env/server/rules.py CHANGED
@@ -104,13 +104,22 @@ def _followup_timing(
  ) -> bool:
      """
      R07:
-     FOLLOW_UP only valid after silence.
-     If prospect just responded last turn, violation.
      """
      if action.action_type == "FOLLOW_UP":
          if state.conversation_history:
-             last_speaker = state.conversation_history[-1].get("speaker", "agent")
-             return last_speaker == "prospect"
      return False

  ) -> bool:
      """
      R07:
+     FOLLOW_UP only valid after prospect silence.
+     Violation if the prospect's last response had actual content
+     (i.e., the prospect is still engaged and waiting for a reply).
      """
      if action.action_type == "FOLLOW_UP":
          if state.conversation_history:
+             # Walk backwards to find the last prospect message
+             for entry in reversed(state.conversation_history):
+                 if entry.get("speaker") == "prospect":
+                     response_token = entry.get("response_token", "")
+                     # FOLLOW_UP is only valid if the prospect went silent
+                     return response_token != "silence"
+             # No prospect message found — first turn, so violation
+             return True
+         # No history at all — first turn, can't FOLLOW_UP yet
+         return True
      return False
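The new walk-backwards check can be exercised in isolation. This sketch substitutes plain dicts and strings for the project's `state` and `action` objects (a simplification for illustration, not the real signatures):

```python
# Standalone sketch of the R07 check above; history entries are plain dicts.

def followup_violates(action_type: str, history: list[dict]) -> bool:
    """True if a FOLLOW_UP would violate R07 given the conversation history."""
    if action_type != "FOLLOW_UP":
        return False
    # Walk backwards to find the last prospect message
    for entry in reversed(history):
        if entry.get("speaker") == "prospect":
            # Valid only if the prospect went silent
            return entry.get("response_token", "") != "silence"
    # No prospect message yet: FOLLOW_UP is premature, so violation
    return True

# Engaged prospect -> violation; silent prospect -> allowed
assert followup_violates("FOLLOW_UP", [{"speaker": "prospect", "response_token": "interested"}])
assert not followup_violates("FOLLOW_UP", [{"speaker": "prospect", "response_token": "silence"}])
```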
scripts/run_training.sh ADDED
@@ -0,0 +1,4 @@
+ #!/usr/bin/env bash
+ # Simple alias for the main entrypoint
+ cd /app
+ exec ./run_training.sh "$@"
training/eval_baseline_vs_trained.py ADDED
@@ -0,0 +1,251 @@
+ """
+ SalesPath — Evaluate Baseline vs Trained Model
+ 
+ Runs episodes at each difficulty level with both the base model
+ and the trained (GRPO) model, then compares performance.
+ 
+ Usage:
+     python training/eval_baseline_vs_trained.py \
+         --base Qwen/Qwen2.5-0.5B-Instruct \
+         --trained ./salespath_out/grpo_final \
+         --env-url http://127.0.0.1:8000 \
+         --episodes-per-level 4 \
+         --output ./salespath_out/eval_results.json
+ """
+ import argparse
+ import asyncio
+ import json
+ import os
+ import sys
+ import time
+ from pathlib import Path
+ 
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ # Ensure project root is on path
+ _ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+ if _ROOT not in sys.path:
+     sys.path.insert(0, _ROOT)
+ 
+ from training.rollout import run_episode
+ 
+ DIFFICULTIES = [1, 2, 3, 4]
+ 
+ 
+ async def eval_model(
+     model,
+     tokenizer,
+     env_url: str,
+     episodes_per_level: int = 4,
+     label: str = "model",
+ ) -> dict:
+     """Evaluate a model across all difficulty levels."""
+     results = {
+         "label": label,
+         "per_difficulty": {},
+         "overall": {},
+     }
+ 
+     all_rewards = []
+     all_violations = []
+     all_closes = []
+     all_lengths = []
+ 
+     for difficulty in DIFFICULTIES:
+         diff_rewards = []
+         diff_violations = []
+         diff_closes = []
+         diff_lengths = []
+ 
+         print(f"  Difficulty {difficulty}...")
+         for ep in range(episodes_per_level):
+             result = await run_episode(
+                 model=model,
+                 tokenizer=tokenizer,
+                 env_url=env_url,
+                 difficulty=difficulty,
+                 message_timeout_s=120.0,
+             )
+ 
+             trajectory = result["trajectory"]
+             reward = result["total_reward"]
+             violations = result["violations"]
+             steps = result["steps_completed"]
+             length = len(trajectory)
+ 
+             # Did we close successfully?
+             last_action = trajectory[-1]["action_type"] if trajectory else ""
+             last_traj = trajectory[-1] if trajectory else {}
+             components = last_traj.get("components", {})
+             r_outcome = components.get("r_outcome", 0.0)
+             closed = last_action == "CLOSE" and r_outcome > 0
+ 
+             diff_rewards.append(reward)
+             diff_violations.append(len(violations))
+             diff_closes.append(1 if closed else 0)
+             diff_lengths.append(length)
+ 
+             all_rewards.append(reward)
+             all_violations.append(len(violations))
+             all_closes.append(1 if closed else 0)
+             all_lengths.append(length)
+ 
+         results["per_difficulty"][difficulty] = {
+             "mean_reward": sum(diff_rewards) / len(diff_rewards) if diff_rewards else 0,
+             "mean_violations": sum(diff_violations) / len(diff_violations) if diff_violations else 0,
+             "close_rate": sum(diff_closes) / len(diff_closes) if diff_closes else 0,
+             "mean_episode_length": sum(diff_lengths) / len(diff_lengths) if diff_lengths else 0,
+             "num_episodes": len(diff_rewards),
+         }
+ 
+     results["overall"] = {
+         "mean_reward": sum(all_rewards) / len(all_rewards) if all_rewards else 0,
+         "mean_violations": sum(all_violations) / len(all_violations) if all_violations else 0,
+         "close_rate": sum(all_closes) / len(all_closes) if all_closes else 0,
+         "mean_episode_length": sum(all_lengths) / len(all_lengths) if all_lengths else 0,
+         "num_episodes": len(all_rewards),
+     }
+ 
+     return results
+ 
+ 
+ def load_model(model_name_or_path: str):
+     """Load model, detecting if it's a PEFT adapter."""
+     print(f"Loading model: {model_name_or_path}")
+ 
+     tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+ 
+     # Check if this is a PEFT adapter directory
+     adapter_path = Path(model_name_or_path)
+     is_adapter = (adapter_path / "adapter_config.json").exists()
+ 
+     if is_adapter:
+         print("  Detected PEFT adapter — loading base model + adapter...")
+         from peft import PeftModel
+ 
+         # Find the base model name from the adapter config
+         import json as _json
+         with open(adapter_path / "adapter_config.json") as f:
+             adapter_cfg = _json.load(f)
+         base_model_name = adapter_cfg.get("base_model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct")
+         print(f"  Base model: {base_model_name}")
+ 
+         bf16_supported = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
+         base_model = AutoModelForCausalLM.from_pretrained(
+             base_model_name,
+             torch_dtype=torch.bfloat16 if bf16_supported else torch.float32,
+             device_map="auto",
+         )
+         model = PeftModel.from_pretrained(base_model, model_name_or_path)
+         # Merge adapter for faster inference
+         model = model.merge_and_unload()
+         print("  Adapter loaded and merged ✅")
+     else:
+         bf16_supported = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
+         model = AutoModelForCausalLM.from_pretrained(
+             model_name_or_path,
+             torch_dtype=torch.bfloat16 if bf16_supported else torch.float32,
+             device_map="auto",
+         )
+ 
+     model.eval()
+     print(f"  Model on: {next(model.parameters()).device}")
+     return model, tokenizer
+ 
+ 
+ async def main():
+     parser = argparse.ArgumentParser(description="Evaluate baseline vs trained model")
+     parser.add_argument("--base", default="Qwen/Qwen2.5-0.5B-Instruct")
+     parser.add_argument("--trained", default="./salespath_out/grpo_final")
+     parser.add_argument("--env-url", default="http://127.0.0.1:8000")
+     parser.add_argument("--episodes-per-level", type=int, default=4)
+     parser.add_argument("--output", default="./salespath_out/eval_results.json")
+     args = parser.parse_args()
+ 
+     print("=" * 60)
+     print("SalesPath — Model Evaluation")
+     print("=" * 60)
+     print(f"Base model:     {args.base}")
+     print(f"Trained model:  {args.trained}")
+     print(f"Episodes/level: {args.episodes_per_level}")
+     print()
+ 
+     # Evaluate base model
+     print("Loading base model...")
+     base_model, base_tokenizer = load_model(args.base)
+     print("\nEvaluating base model...")
+     base_results = await eval_model(
+         base_model, base_tokenizer, args.env_url,
+         episodes_per_level=args.episodes_per_level,
+         label="baseline",
+     )
+ 
+     # Clean up
+     del base_model
+     if torch.cuda.is_available():
+         torch.cuda.empty_cache()
+ 
+     # Evaluate trained model
+     print("\nLoading trained model...")
+     trained_model, trained_tokenizer = load_model(args.trained)
+     print("\nEvaluating trained model...")
+     trained_results = await eval_model(
+         trained_model, trained_tokenizer, args.env_url,
+         episodes_per_level=args.episodes_per_level,
+         label="trained",
+     )
+ 
+     # Print comparison
+     print("\n" + "=" * 60)
+     print("RESULTS COMPARISON")
+     print("=" * 60)
+ 
+     for model_results in [base_results, trained_results]:
+         label = model_results["label"]
+         overall = model_results["overall"]
+         print(f"\n--- {label.upper()} ---")
+         print(f"  Mean reward:     {overall['mean_reward']:.4f}")
+         print(f"  Mean violations: {overall['mean_violations']:.2f}")
+         print(f"  Close rate:      {overall['close_rate']:.2%}")
+         print(f"  Mean ep. length: {overall['mean_episode_length']:.1f}")
+ 
+         for diff, metrics in model_results["per_difficulty"].items():
+             print(f"    Difficulty {diff}: reward={metrics['mean_reward']:.3f}, "
+                   f"violations={metrics['mean_violations']:.1f}, "
+                   f"close={metrics['close_rate']:.0%}")
+ 
+     # Save results
+     output = {
+         "base": base_results,
+         "trained": trained_results,
+         "comparison": {
+             "reward_delta": trained_results["overall"]["mean_reward"] - base_results["overall"]["mean_reward"],
+             "violation_reduction": base_results["overall"]["mean_violations"] - trained_results["overall"]["mean_violations"],
+             "close_rate_improvement": trained_results["overall"]["close_rate"] - base_results["overall"]["close_rate"],
+         },
+         "config": {
+             "base_model": args.base,
+             "trained_model": args.trained,
+             "episodes_per_level": args.episodes_per_level,
+             "difficulties": DIFFICULTIES,
+         },
+     }
+ 
+     os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
+     with open(args.output, "w") as f:
+         json.dump(output, f, indent=2)
+     print(f"\nResults saved to {args.output}")
+ 
+     # Print comparison summary
+     print("\n=== KEY METRICS ===")
+     c = output["comparison"]
+     print(f"  Reward delta:        {c['reward_delta']:+.4f}")
+     print(f"  Violation reduction: {c['violation_reduction']:+.2f}")
+     print(f"  Close rate change:   {c['close_rate_improvement']:+.2%}")
+ 
+ 
+ if __name__ == "__main__":
+     asyncio.run(main())
training/grpo_train.py CHANGED
@@ -39,13 +39,12 @@ def _load_model_and_tokenizer(model_name: str, use_unsloth: bool = False):
  try:
      from unsloth import FastLanguageModel
      print("Loading with Unsloth in 4-bit + LoRA...")
-     model, tokenizer = FastLanguageModel.from_pretrained(
          model_name=model_name,
          max_seq_length=2048,
          load_in_4bit=True,
          fast_inference=True,
          max_lora_rank=16,
-         max_lora_rank_type="lora",
      )
      # Inject LoRA adapters to drastically reduce VRAM
      model = FastLanguageModel.get_peft_model(

  try:
      from unsloth import FastLanguageModel
      print("Loading with Unsloth in 4-bit + LoRA...")
+     model, tokenizer = FastLanguageModel.from_pretrained(
          model_name=model_name,
          max_seq_length=2048,
          load_in_4bit=True,
          fast_inference=True,
          max_lora_rank=16,
      )
      # Inject LoRA adapters to drastically reduce VRAM
      model = FastLanguageModel.get_peft_model(

@@ -292,7 +291,9 @@ def run_grpo(args):
          "or fix local pyarrow/datasets installation first."
      ) from exc

-     model, tokenizer = _load_model_and_tokenizer(args.model_name, use_unsloth=True)
      rows = _build_grpo_dataset_rows(args.grpo_dataset_size)
      train_dataset = Dataset.from_list(rows)

          "or fix local pyarrow/datasets installation first."
      ) from exc

+     # Use Unsloth only for unsloth/ models (4-bit saves VRAM); fall back to standard HF otherwise
+     use_unsloth = args.model_name.startswith("unsloth/")
+     model, tokenizer = _load_model_and_tokenizer(args.model_name, use_unsloth=use_unsloth)
      rows = _build_grpo_dataset_rows(args.grpo_dataset_size)
      train_dataset = Dataset.from_list(rows)
training/hf_keepalive_app.py ADDED
@@ -0,0 +1,131 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ SalesPath — HF Spaces Keepalive App
3
+
4
+ Serves a simple FastAPI app after training completes to keep
5
+ the HF Space alive and display training results.
6
+ """
7
+ import os
8
+ import json
9
+ from pathlib import Path
10
+ from fastapi import FastAPI
11
+ from fastapi.responses import HTMLResponse, JSONResponse
12
+
13
+ app = FastAPI(title="SalesPath — Training Complete")
14
+
15
+ OUTPUT_DIR = Path(os.environ.get("OUTPUT_DIR", "/app/salespath_out"))
16
+
17
+
18
+ @app.get("/health")
19
+ async def health():
20
+ return {"status": "ok", "service": "SalesPath Training"}
21
+
22
+
23
+ @app.get("/")
24
+ async def root():
25
+ """Display training results page."""
26
+ html = """
27
+ <!DOCTYPE html>
28
+ <html>
29
+ <head>
30
+ <title>SalesPath Training Complete</title>
31
+ <meta charset="utf-8">
32
+ <style>
33
+ body { font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
34
+ max-width: 800px; margin: 40px auto; padding: 20px;
35
+ background: #0f172a; color: #e2e8f0; }
36
+ h1 { color: #38bdf8; }
37
+     h2 { color: #94a3b8; margin-top: 30px; }
+     .card { background: #1e293b; border-radius: 12px; padding: 20px; margin: 16px 0; }
+     .metric { display: inline-block; margin: 8px 16px 8px 0; }
+     .metric .value { font-size: 24px; font-weight: bold; color: #4ade80; }
+     .metric .label { font-size: 12px; color: #64748b; text-transform: uppercase; }
+     img { max-width: 100%; border-radius: 8px; margin: 16px 0; }
+     .badge { display: inline-block; padding: 4px 12px; border-radius: 20px;
+              font-size: 12px; font-weight: bold; }
+     .badge.success { background: #166534; color: #4ade80; }
+     .badge.info { background: #1e3a5f; color: #38bdf8; }
+     pre { background: #0f172a; padding: 12px; border-radius: 8px; overflow-x: auto; }
+     </style>
+     </head>
+     <body>
+     <h1>🏆 SalesPath Training Complete</h1>
+     <p>Trained model has been uploaded to Hugging Face Hub.</p>
+     """
+
+     # Load eval results
+     eval_path = OUTPUT_DIR / "eval_results.json"
+     if eval_path.exists():
+         try:
+             eval_data = json.loads(eval_path.read_text())
+             html += '<div class="card"><h2>Evaluation Results</h2>'
+             for key, value in eval_data.items():
+                 if isinstance(value, (int, float)):
+                     html += f'<div class="metric"><div class="value">{value:.3f}</div><div class="label">{key}</div></div>'
+                 else:
+                     html += f'<pre>{json.dumps(value, indent=2)}</pre>'
+             html += "</div>"
+         except Exception:
+             pass
+
+     # Show reward graph
+     graph_path = OUTPUT_DIR / "reward_graph.png"
+     if graph_path.exists():
+         import base64
+         img_b64 = base64.b64encode(graph_path.read_bytes()).decode()
+         html += f'<div class="card"><h2>Reward Curve</h2><img src="data:image/png;base64,{img_b64}" alt="Reward Graph"/></div>'
+
+     # Show reward history stats
+     history_path = OUTPUT_DIR / "reward_history.txt"
+     if history_path.exists():
+         lines = history_path.read_text().strip().splitlines()
+         rewards = [float(line.split("\t")[-1]) for line in lines if line.strip()]
+         if rewards:
+             html += f"""
+             <div class="card">
+             <h2>Training Stats</h2>
+             <div class="metric"><div class="value">{len(rewards)}</div><div class="label">Episodes</div></div>
+             <div class="metric"><div class="value">{sum(rewards)/len(rewards):.4f}</div><div class="label">Mean Reward</div></div>
+             <div class="metric"><div class="value">{max(rewards):.4f}</div><div class="label">Max Reward</div></div>
+             <div class="metric"><div class="value">{min(rewards):.4f}</div><div class="label">Min Reward</div></div>
+             </div>
+             """
+
+     html += """
+     <div class="card">
+     <h2>Next Steps</h2>
+     <p>1. View model on Hugging Face Hub</p>
+     <p>2. Run inference with the trained model</p>
+     <p>3. Stop this Space to avoid billing</p>
+     </div>
+     </body>
+     </html>
+     """
+     return HTMLResponse(html)
+
+
+ @app.get("/api/results")
+ async def api_results():
+     """Return training results as JSON."""
+     results = {}
+
+     eval_path = OUTPUT_DIR / "eval_results.json"
+     if eval_path.exists():
+         try:
+             results["eval"] = json.loads(eval_path.read_text())
+         except Exception:
+             pass
+
+     history_path = OUTPUT_DIR / "reward_history.txt"
+     if history_path.exists():
+         lines = history_path.read_text().strip().splitlines()
+         rewards = [float(line.split("\t")[-1]) for line in lines if line.strip()]
+         if rewards:
+             results["training"] = {
+                 "episodes": len(rewards),
+                 "mean_reward": sum(rewards) / len(rewards),
+                 "max_reward": max(rewards),
+                 "min_reward": min(rewards),
+                 "std_reward": __import__("statistics").stdev(rewards) if len(rewards) > 1 else 0,
+             }
+
+     return JSONResponse(results)
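Both endpoints above re-parse `reward_history.txt` with the same `split("\t")[-1]` logic. That parsing can be factored into a standalone helper; the following is a minimal sketch (the tab-separated `step<TAB>reward` line format and the `summarize_rewards` name are assumptions, not part of the actual codebase):

```python
import statistics
import tempfile
from pathlib import Path


def summarize_rewards(history_path: Path) -> dict:
    """Parse a reward history file (one 'step<TAB>reward' pair per line)
    and return the same summary stats the /api/results endpoint reports."""
    lines = history_path.read_text().strip().splitlines()
    rewards = [float(line.split("\t")[-1]) for line in lines if line.strip()]
    if not rewards:
        return {}
    return {
        "episodes": len(rewards),
        "mean_reward": sum(rewards) / len(rewards),
        "max_reward": max(rewards),
        "min_reward": min(rewards),
        # stdev needs at least two samples
        "std_reward": statistics.stdev(rewards) if len(rewards) > 1 else 0,
    }


# Exercise the helper against a synthetic history file:
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "reward_history.txt"
    path.write_text("0\t0.1\n1\t0.4\n2\t0.7\n")
    print(summarize_rewards(path))
```

Sharing one helper between the HTML dashboard and the JSON endpoint would also keep the two views from drifting apart.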
training/preflight_check.py ADDED
@@ -0,0 +1,75 @@
+ """
+ SalesPath — Pre-flight Dependency Check
+ Run at the start of training to catch version mismatches early.
+ """
+ import sys
+ import importlib
+
+ REQUIRED_PACKAGES = {
+     "torch": "2.0.0",
+     "transformers": "4.44.2",
+     "trl": "0.11.0",
+     "peft": "0.11.1",
+     "datasets": "2.0.0",
+     "fastapi": "0.100.0",
+     "httpx": "0.24.0",
+     "openenv": None,
+     "accelerate": "0.25.0",
+ }
+
+ all_ok = True
+
+ print("=" * 60)
+ print("SalesPath Pre-flight Check")
+ print("=" * 60)
+
+ # Python version
+ print(f"Python: {sys.version}")
+ if sys.version_info < (3, 10):
+     print("  WARNING: Python >= 3.10 recommended")
+     all_ok = False
+
+ # CUDA availability
+ try:
+     import torch
+     print(f"PyTorch: {torch.__version__}")
+     print(f"CUDA available: {torch.cuda.is_available()}")
+     if torch.cuda.is_available():
+         print(f"CUDA version: {torch.version.cuda}")
+         print(f"GPU: {torch.cuda.get_device_name(0)}")
+         print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
+ except Exception as e:
+     print(f"PyTorch: ERROR — {e}")
+     all_ok = False
+
+ # Check each package
+ for pkg_name, min_version in REQUIRED_PACKAGES.items():
+     try:
+         mod = importlib.import_module(pkg_name)
+         ver = getattr(mod, "__version__", "unknown")
+         status = ver
+         if min_version:
+             from packaging import version
+             if version.parse(ver) < version.parse(min_version):
+                 status += f" (needs >= {min_version}) ⚠️"
+                 all_ok = False
+             else:
+                 status += " ✅"
+         else:
+             status += " ✅"
+         print(f"{pkg_name}: {status}")
+     except ImportError:
+         print(f"{pkg_name}: NOT FOUND ❌")
+         all_ok = False
+     except Exception as e:
+         print(f"{pkg_name}: ERROR — {e} ❌")
+         all_ok = False
+
+ print("=" * 60)
+ if all_ok:
+     print("All checks passed ✅")
+ else:
+     print("Some checks failed ⚠️ — training may still work")
+ print("=" * 60)
+
+ sys.exit(0 if all_ok else 1)
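The per-package check above leans on `packaging.version` rather than comparing version strings directly. That matters because lexicographic string comparison misorders multi-digit components, e.g. `transformers` 4.44.2 vs. 4.5.0. A stdlib-only sketch of the difference, valid for plain numeric versions only (`parse_ver` is a hypothetical helper and would choke on pre-release tags like `rc1`, which `packaging.version` does handle):

```python
def parse_ver(s: str) -> tuple:
    """Split a dotted numeric version string into an int tuple."""
    return tuple(int(part) for part in s.split("."))


# Character-by-character string comparison stops at '4' < '5':
print("4.44.2" < "4.5.0")                        # True (wrong ordering)

# Tuple comparison compares components numerically, and 44 > 5:
print(parse_ver("4.44.2") < parse_ver("4.5.0"))  # False (correct ordering)
```

The script is right to keep `packaging.version` for the real check; the sketch only shows why the naive alternative would silently pass outdated installs.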