@startuml environment_overview
!theme plain
top to bottom direction
skinparam backgroundColor #FEFEFE
skinparam defaultFontName Arial
skinparam defaultFontSize 14
skinparam ArrowColor #334155
skinparam RectangleBorderColor #64748B
skinparam RectangleFontColor #0F172A
skinparam roundcorner 10
skinparam linetype ortho
skinparam packageStyle rectangle
skinparam nodesep 42
skinparam ranksep 42

title AxiomForgeAI - Phase-Controlled Math Reasoning Loop

rectangle "Small Math Model\n1.5B parameters" as MODEL #DBEAFE

rectangle "Phase Controller\nwarmup: grounded only\nramp: gradual self-play\ncontinuous: capped mix + fallback" as PHASE #E2E8F0

rectangle "Task Source\nfor each GRPO group" as SELECT #E2E8F0

rectangle "Grounded Source\nKnown-answer practice" as GLANE #ECFDF5 {
    rectangle "Dataset problem\nGSM8K / MATH" as GQ #CCFBF1
    rectangle "Gold answer\navailable" as GOLD #CCFBF1
    rectangle "Model samples\nK solutions" as GSOL #CCFBF1
}

rectangle "Self-Play Source\nModel-made challenges" as SLANE #EEF2FF {
    rectangle "Curriculum picks\nskill + difficulty" as CURRIC #E0E7FF
    rectangle "Model writes\na new question" as SQ #E0E7FF
    rectangle "Model samples\nK solutions" as SSOL #E0E7FF
}

rectangle "Shared Grading\nanswer, steps, arithmetic, format\n+ question quality for self-play" as GRADERS #F1F5F9

rectangle "Group Comparison\nWhich attempts worked best?" as COMPARE #EDE9FE
rectangle "GRPO Update\nReinforce stronger reasoning" as GRPO #DDD6FE
rectangle "Improved Model\nfor the next round" as NEXT #DBEAFE

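' Main loop entry: the current model feeds the phase controller,
' which decides the task mix handed to the task source.
MODEL -down-> PHASE
PHASE -down-> SELECT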

note right of PHASE
  Sets the grounded / self-play mix
end note

SELECT -left-> GQ : grounded slot
GQ --> GOLD
GOLD --> GSOL

SELECT -right-> CURRIC : self-play slot
CURRIC --> SQ
SQ --> SSOL

' Both sources share one grading path, then feed the GRPO update.
GSOL -down-> GRADERS
SSOL -down-> GRADERS
GRADERS -right-> COMPARE
COMPARE -right-> GRPO
GRPO -right-> NEXT
NEXT -up-> MODEL : repeat

note bottom of SELECT
  Each batch is randomly interleaved.
  Warmup uses grounded tasks only.
  Ramp and continuous phases add self-play slots by ratio.
end note
@enduml