@startuml environment_overview
!theme plain
top to bottom direction

skinparam backgroundColor #FEFEFE
skinparam defaultFontName Arial
skinparam defaultFontSize 14
skinparam ArrowColor #334155
skinparam RectangleBorderColor #64748B
skinparam RectangleFontColor #0F172A
skinparam roundcorner 10
skinparam linetype ortho
skinparam packageStyle rectangle
skinparam nodesep 42
skinparam ranksep 42

title AxiomForgeAI - Phase-Controlled Math Reasoning Loop

rectangle "Small Math Model\n1.5B parameters" as MODEL #DBEAFE
rectangle "Phase Controller\nwarmup: grounded only\nramp: gradual self-play\ncontinuous: capped mix + fallback" as PHASE #E2E8F0
rectangle "Task Source\nfor each GRPO group" as SELECT #E2E8F0

rectangle "Grounded Source\nKnown-answer practice" as GLANE #ECFDF5 {
  rectangle "Dataset problem\nGSM8K / MATH" as GQ #CCFBF1
  rectangle "Gold answer\navailable" as GOLD #CCFBF1
  rectangle "Model samples\nK solutions" as GSOL #CCFBF1
}

rectangle "Self-Play Source\nModel-made challenges" as SLANE #EEF2FF {
  rectangle "Curriculum picks\nskill + difficulty" as CURRIC #E0E7FF
  rectangle "Model writes\na new question" as SQ #E0E7FF
  rectangle "Model samples\nK solutions" as SSOL #E0E7FF
}

rectangle "Shared Grading\nanswer, steps, arithmetic, format\n+ question quality for self-play" as GRADERS #F1F5F9
rectangle "Group Comparison\nWhich attempts worked best?" as COMPARE #EDE9FE
rectangle "GRPO Update\nReinforce stronger reasoning" as GRPO #DDD6FE
rectangle "Improved Model\nfor the next round" as NEXT #DBEAFE

MODEL -down-> PHASE
PHASE -down-> SELECT

note right of PHASE
  sets mix
end note

SELECT -left-> GQ : grounded slot
GQ --> GOLD
GOLD --> GSOL

SELECT -right-> CURRIC : self-play slot
CURRIC --> SQ
SQ --> SSOL

GSOL -down-> GRADERS
SSOL -down-> GRADERS
GRADERS -right-> COMPARE
COMPARE -right-> GRPO
GRPO -right-> NEXT
NEXT -up-> MODEL : repeat

note bottom of SELECT
  Each batch is randomly interleaved.
  Phase 1 uses grounded only.
  Later phases add self-play slots by ratio.
end note
@enduml