πŸ€– Career OS β€” SOTA Multi-Agent Career Assistant

A production-quality, research-grade personal career agent built with 3-stage iterative training (SFT β†’ DPO β†’ GRPO), multi-agent orchestration, and Karpathy-style training visualization. Designed for publication and real-world deployment.


πŸ—οΈ Architecture

```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Career OS Orchestrator                    β”‚
β”‚    (routes requests, manages context, synthesizes output)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β–Ό             β–Ό               β–Ό             β–Ό
  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
  β”‚ Resume  β”‚  β”‚  Job     β”‚  β”‚  Career    β”‚  β”‚   Salary     β”‚
  β”‚ Parser  β”‚  β”‚  Matcher β”‚  β”‚  Advisor   β”‚  β”‚ Negotiator   β”‚
  β”‚  Agent  β”‚  β”‚  Agent   β”‚  β”‚   Agent    β”‚  β”‚    Agent     β”‚
  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
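The routing logic itself lives in career_os_orchestrator.py and isn't reproduced on this card. As a rough illustration only, a minimal keyword-based router could look like the sketch below; the `AGENT_KEYWORDS` table and `route` helper are hypothetical, not the actual API:

```python
# Hypothetical sketch of keyword-based request routing (not the actual
# career_os_orchestrator.py implementation).
AGENT_KEYWORDS = {
    "resume_parser": ["parse", "extract", "resume text"],
    "job_matcher": ["job description", "fit", "match"],
    "career_advisor": ["career path", "interview", "skill gap"],
    "salary_negotiator": ["salary", "compensation", "negotiate"],
}

def route(request: str) -> str:
    """Pick the agent whose keywords best match the request."""
    request_lower = request.lower()
    scores = {
        agent: sum(kw in request_lower for kw in keywords)
        for agent, keywords in AGENT_KEYWORDS.items()
    }
    best_agent = max(scores, key=scores.get)
    # Fall back to the general advisor when nothing matches.
    return best_agent if scores[best_agent] > 0 else "career_advisor"
```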

🧬 3-Stage Training Pipeline

Inspired by Self-Rewarding LLMs, iGRPO, and LoRA Without Regret.

Stage 1: SFT β€” High-Rank LoRA (r=256)

  • Method: Supervised fine-tuning on ~16K multi-turn career conversations
  • Data: Resume reviews, job-fit assessments, JSON parsing, coaching dialogues, reasoning chains
  • Config: LoRA r=256, Ξ±=512, target_modules="all-linear", use_rslora=True
  • Time: ~60 min on A100 40GB
  • LR: 2e-4, cosine schedule, 2 epochs

Stage 2: DPO β€” Preference Optimization (r=64)

  • Method: Direct Preference Optimization on model-generated preference pairs
  • Innovation: Self-generated pairs using the Stage 1 model β€” quality scores for structure, actionability, career relevance
  • Config: LoRA r=64, Ξ±=128, Ξ²=0.1
  • Time: ~30 min on A100 40GB
  • LR: 5e-7, 1 epoch

Stage 3: GRPO β€” Multi-Component Reward (r=32)

  • Method: Group Relative Policy Optimization with custom career reward function
  • Reward components:
    • Structure (25%): Headers, bullet points, sections
    • JSON correctness (25%): Valid JSON when requested
    • Actionability (25%): Action verbs and concrete steps
    • Career relevance (25%): Domain-specific terminology
  • Config: LoRA r=32, Ξ±=64, num_generations=4
  • Time: ~40 min on A100 40GB
  • LR: 1e-6, 1 epoch

Total Pipeline Time: ~2.5–3 hours on A100 40GB


πŸ“Š Training Dashboard

Karpathy-style visualization tracks all training stages:

| Metric | What It Shows |
|---|---|
| Loss curve | Raw + moving average, best-loss point marked |
| Learning rate | Cosine schedule with warmup |
| Reward trajectory | Mean reward + trend line across DPO/GRPO |
| Response length | Histogram + time series (detects collapse/explosion) |
| Gradient norm | Training stability monitoring |
| Career quality | JSON correctness + actionability scores |

All plots are auto-generated during training and saved to ./training_dashboard/.
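
training_dashboard.py itself isn't reproduced here; a minimal sketch of the loss-curve panel (raw loss, moving average, best-loss marker) might look like this, assuming matplotlib and a list of per-step losses:

```python
# Sketch: loss curve with moving average and best-loss marker,
# saved under ./training_dashboard/ as the dashboard does.
import os
import matplotlib.pyplot as plt
import numpy as np

def plot_loss(losses, window=25, out_dir="./training_dashboard"):
    os.makedirs(out_dir, exist_ok=True)
    losses = np.asarray(losses, dtype=float)
    steps = np.arange(len(losses))
    moving = np.convolve(losses, np.ones(window) / window, mode="valid")
    best = losses.argmin()

    plt.figure(figsize=(8, 4))
    plt.plot(steps, losses, alpha=0.3, label="raw loss")
    plt.plot(steps[window - 1:], moving, label=f"moving avg (w={window})")
    plt.scatter([best], [losses[best]], marker="*", s=120, label="best loss")
    plt.xlabel("step")
    plt.ylabel("loss")
    plt.legend()
    plt.savefig(os.path.join(out_dir, "loss_curve.png"), dpi=150)
    plt.close()
```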


πŸš€ Quick Start (Google Colab Pro A100)

Step 1: Set up Colab

  1. Open a new notebook at Google Colab
  2. Runtime β†’ Change runtime type β†’ GPU β†’ A100 (requires Colab Pro/Pro+)
  3. Secrets (left sidebar) β†’ Add HF_TOKEN with write permission
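
Inside the notebook, the one-cell script would typically read this secret via the Colab secrets API, roughly like this (the actual script's login code may differ):

```python
# Read the HF token from Colab's Secrets panel and log in to the Hub.
from google.colab import userdata
from huggingface_hub import login

login(token=userdata.get("HF_TOKEN"))
```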

Step 2: Run everything

Copy the entire contents of career_os_complete_colab.py into one code cell and press Shift+Enter.

That's it. The script runs all stages end-to-end and pushes the final model to the Hub.

What you get after ~3 hours: the final model pushed to the Hugging Face Hub and the full set of training-dashboard plots in ./training_dashboard/.


🎯 Agent Capabilities

| Agent | What It Does | Output Format |
|---|---|---|
| Resume Parser | Extracts structured data from raw resume text | JSON (name, skills, experience, education) |
| Job Matcher | Scores resume against job description | JSON (score 0-100, strengths, gaps, suggestions) |
| Career Advisor | Career path planning, skill gaps, interview prep | Markdown with structured headers + action items |
| Salary Negotiator | Compensation strategies, market research, scripts | Markdown with data-backed scripts |
| Orchestrator | Routes requests, chains agents, synthesizes output | Unified career report |

πŸ§ͺ Running Individual Agents

```python
from career_os_orchestrator import CareerOS

cos = CareerOS(agent_model="Builder-Neekhil/career-agent-v1")

# Single task: the orchestrator routes the request to the right agent
result = cos.process("Review my resume", resume_text="...")
print(result["synthesized"])

# Full pipeline: run all agents and synthesize a unified career report
results = cos.full_pipeline(
    resume_text="...",
    job_description="...",
    target_role="Senior Software Engineer",
)
```

πŸ“ File Structure

| File | Purpose | When to Use |
|---|---|---|
| career_os_complete_colab.py ⭐ | ONE-COPY COLAB SCRIPT β€” everything in one cell | Copy into Colab, run once |
| career_os_sota.py | Standalone 3-stage pipeline script | Local GPU or scripted training |
| career_os_orchestrator.py | Multi-agent Career OS with 4 specialized agents + orchestrator | After training, for inference |
| training_dashboard.py | Karpathy-style visualization dashboard | Import during training or for analysis |
| train_colab.py | Original single-stage SFT script | Simple SFT only |
| train.py | Original standalone training script | Simple SFT only |
| inference.py | Basic inference script | Quick testing |
| demo.py | Gradio web UI | Interactive demo |
| requirements.txt | Pinned dependencies | Setup |

πŸ”¬ Research Foundation

| Technique | Paper | arXiv ID | What We Used |
|---|---|---|---|
| Self-Rewarding LLMs | Iterative DPO for instruction following | 2401.10020 | Stage 2: self-generated preference pairs |
| iGRPO | Iterative GRPO with self-feedback | 2602.09000 | Stage 3: multi-component reward function |
| LoRA Without Regret | High-rank LoRA for SFT, low-rank for RL | Blog post | r=256β†’64β†’32 pipeline |
| AgentOrchestra | Multi-agent orchestration framework | 2506.12508 | Career OS agent architecture |
| ResumeFlow | LLM resume generation pipeline | 2402.06221 | Structured output patterns |
| TALLRec | Lightweight LoRA tuning for recommendation | 2305.00447 | SFT data efficiency |

πŸ“Š Dataset Composition

| Source | Size | Purpose |
|---|---|---|
| cnamuangtoun/resume-job-description-fit | ~4.6K | Job-fit assessment with JSON output |
| opensporks/resumes | ~7.2K | Resume review + interview + career path |
| sandeeppanem/resume-json-extraction-5k | ~4.9K | Structured resume parsing |
| Synthetic coaching dialogues | 800 | Salary, pivot, networking, gaps, promotion, interview prep |
| Reasoning boost | 400 | Step-by-step career reasoning chains |

Total: ~17K multi-turn conversations with system prompts.
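
As a rough sketch of how the three public sources could be pulled and combined into one SFT mix (dataset IDs from the table above; per-source column harmonization is elided and the `to_messages` converter is hypothetical):

```python
# Sketch: load the public sources and concatenate them into one training mix.
# `to_messages` (per-source conversion to chat format) is hypothetical.
from datasets import concatenate_datasets, load_dataset

sources = [
    "cnamuangtoun/resume-job-description-fit",
    "opensporks/resumes",
    "sandeeppanem/resume-json-extraction-5k",
]
parts = [load_dataset(name, split="train") for name in sources]
parts = [ds.map(to_messages, remove_columns=ds.column_names) for ds in parts]
mix = concatenate_datasets(parts).shuffle(seed=42)
```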


πŸ“ˆ Evaluation Benchmark

Since no standard career agent benchmark exists, we propose the Career Agent Evaluation Suite (CAES):

| Task | Metric | What We Measure |
|---|---|---|
| Resume parsing | F1 on extracted entities | Name, skills, experience, education |
| Job fit assessment | Accuracy vs ground truth | Match/no-match classification |
| Career advice quality | Human evaluation (1-5) | Helpfulness, specificity, actionability |
| JSON correctness | Valid JSON rate | % of responses with parseable JSON |
| Response structure | Section headers, lists | Markdown structure score |
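
Two of the automatic metrics are simple enough to sketch directly (a minimal version, assuming responses are plain strings and entities are compared as sets):

```python
# Sketch: valid-JSON rate and set-based entity F1 for the CAES metrics.
import json

def valid_json_rate(responses):
    """Fraction of responses that parse as JSON."""
    ok = 0
    for r in responses:
        try:
            json.loads(r)
            ok += 1
        except ValueError:
            pass
    return ok / len(responses)

def entity_f1(predicted, gold):
    """Set-based F1 over extracted entities (e.g., skill lists)."""
    pred, gold = set(predicted), set(gold)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```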

πŸ”§ Hyperparameters

Stage 1: SFT

| Param | Value |
|---|---|
| Model | Qwen/Qwen2.5-1.5B-Instruct |
| LoRA rank | 256 |
| LoRA alpha | 512 |
| Target modules | all-linear |
| rsLoRA | True |
| Epochs | 2 |
| Learning rate | 2e-4 |
| Batch size (effective) | 8 (2 Γ— 4) |
| Max length | 2048 |
| Precision | bf16 |
| Gradient checkpointing | True |
| Assistant-only loss | True |

Stage 2: DPO

| Param | Value |
|---|---|
| LoRA rank | 64 |
| LoRA alpha | 128 |
| Epochs | 1 |
| Learning rate | 5e-7 |
| Batch size (effective) | 8 (1 Γ— 8) |
| Beta (DPO temperature) | 0.1 |
| Max length | 2048 |
| Precision | bf16 |

Stage 3: GRPO

| Param | Value |
|---|---|
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Epochs | 1 |
| Learning rate | 1e-6 |
| Batch size (effective) | 4 (1 Γ— 4) |
| Completions per prompt | 4 |
| Max completion length | 512 |
| Precision | bf16 |

πŸ“€ Publishing Checklist

To make this publication-ready:

  • Run full 3-stage training and push weights
  • Generate training dashboard plots
  • Run CAES evaluation benchmark
  • Compare against baseline (untrained Qwen2.5-1.5B-Instruct)
  • Ablation: SFT-only vs SFT+DPO vs SFT+DPO+GRPO
  • Human evaluation with 5 annotators on 50 samples
  • Write arXiv paper with methodology + results

πŸ“„ License

Apache-2.0 (inherits from Qwen/Qwen2.5-1.5B-Instruct)


Built for those who take their career seriously. πŸš€
