"""
// Testing, conceptually

Test                What it verifies
ask_info            info collection logic
resolve (after)     success path
resolve (before)    penalty logic
reward values       correctness of shaping
done flag           termination logic

> Detailed test flow

- ask_info

Conceptually, checks whether the agent can reduce uncertainty by asking the correct question.

-- The environment is partially observable: the agent doesn’t know everything upfront --

Real-world analogy:
A support agent asking the client for their email


- resolve (after)

Conceptually, checks:

“Can the agent complete the task after gathering the required info?”

This is goal completion.

- resolve (before)

Conceptually, checks:

“Does the system penalize shortcut / lazy behavior?”

Without this penalty:
The agent would always jump straight to resolve

- Reward values

Conceptually, checks:

“Is the agent receiving useful learning signals?”

With the reward mechanism implemented:

Behavior               Reward
correct info           +0.3
correct resolution     +1.0
final score            +0.0 → +1.0
wrong action           negative

Technically, we validate:

- reward accumulation works
- no random jumps
- consistent scaling

This is critical, because:

- Bad reward = bad agent/system
- Good reward = learnable system

- done flag

Conceptually, checks:

“Does the environment know when the episode ends?”

- No score field in /reset, since at reset:

The episode has not happened yet
→ No performance → No score


These tests collectively validate:

MDP (Markov Decision Process) -> (State, Action, Reward, Transition, Termination) -> a thorough RL environment

Component      Verified by
State          reset
Action         ask_info / resolve
Reward         reward tests
Transition     state updates
Termination    done flag



// Expected behavior

Good Agent Flow:
Reset
→ ask_info (+0.3)
→ resolve (+1.0 + bonus)

Bad Agent Flow:
Reset
→ resolve (-0.3)
→ ask random info (-0.1)
→ timeout (-1.0)
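
Cumulative reward, assuming per-step rewards simply accumulate (the resolve
bonus and any final-score normalization are not pinned down here):

Good flow:  +0.3 + 1.0 + bonus  ≥  +1.3
Bad flow:   -0.3 - 0.1 - 1.0    =  -1.4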







"""

import requests

BASE = "http://127.0.0.1:8001"

# Reset
r = requests.get(f"{BASE}/reset")
print(f"\nRESET: \n\n{r.json()}")
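
# The notes above say the /reset response carries no score field (the episode
# has not happened yet). A minimal, hedged check of that claim; assumes the
# reset payload is a flat JSON object.
reset_state = r.json()
assert "score" not in reset_state, "reset response should not carry a score yet"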


# Ask info
r = requests.post(f"{BASE}/step", json={
    "type": "ask_info",
    "field": "account_email"
})
print(f"\nASK INFO: \n\n{r.json()}")
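
# Hedged check of the reward shaping described above: a correct ask_info step
# is expected to pay roughly +0.3 and should not end the episode. The "reward"
# and "done" keys are assumed response fields, not confirmed by the notes.
step = r.json()
if "reward" in step:
    assert abs(step["reward"] - 0.3) < 1e-6, "unexpected reward for correct info"
if "done" in step:
    assert not step["done"], "episode should not end after ask_info"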

# Resolve
r = requests.post(f"{BASE}/step", json={
    "type": "resolve"
})
print(f"\nRESOLVE: \n\n{r.json()}")
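
# Hedged check of the success path: resolving after gathering info should pay
# at least the +1.0 base reward and terminate the episode. Again, "reward" and
# "done" are assumed field names.
final = r.json()
if "reward" in final:
    assert final["reward"] >= 1.0, "resolve after ask_info should pay at least +1.0"
if "done" in final:
    assert final["done"], "episode should terminate after a successful resolve"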
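
# Sketch of the "Bad Agent Flow" from the notes: resolving before gathering
# info should be penalized (-0.3 in the table above). This restarts the episode
# and only checks that the reward is negative; the exact value and the "reward"
# key name are assumptions.
requests.get(f"{BASE}/reset")
r = requests.post(f"{BASE}/step", json={"type": "resolve"})
premature = r.json()
print(f"\nPREMATURE RESOLVE: \n\n{premature}")
if "reward" in premature:
    assert premature["reward"] < 0, "resolving before ask_info should be penalized"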