kushalExplores committed on
Commit
05f9611
·
verified ·
1 Parent(s): ca1ea5b

Upload 2 files

Files changed (2)
  1. README.md +222 -247
  2. flatmate_rl.md +268 -0
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
-title: Flatmate RL
-emoji: 🏠
-colorFrom: yellow
 colorTo: green
 sdk: docker
 pinned: false
@@ -9,61 +9,103 @@ app_port: 8000
 base_path: /web
 tags:
 - openenv
-- flatmate
-- scheduling
 - reinforcement-learning
 ---

 # Flatmate RL

-An OpenEnv environment for training and evaluating agents on broker-style flatmate visit scheduling.
-
-This environment converts the `broker_app` visit-scheduling scenarios into a deterministic RL task where an agent must:
-
-- gather missing buyer or seller details through conversation
-- call the right environment tools in the right order
-- respect scheduling and confirmation guardrails
-- book valid visits only after the required checks and confirmations succeed
-
-It also includes a custom Gradio UI mounted at `/web`.
-
-## What This Environment Does
-
-`flatmate_rl` simulates a housing broker workflow around flatmate-share listings in Mumbai. The agent interacts with the environment one step at a time using either:
-
-- an `assistant_message` action to talk to the active user
-- a `tool_call` action to use broker tools such as profile storage, listing search, slot checks, poster contact, and booking
-
-The environment tracks:
-
-- which required fields have been gathered
-- which fields are still missing
-- which posts were selected
-- whether buyer or seller data was stored
-- whether tool-order rules were violated
-- whether visits were successfully booked
-
-The simulator is deterministic by design, so it is easier to use for RL training, regression testing, and reward iteration.
-
-## Included Scenarios
-
-The environment mirrors the main `broker_app` scenarios:
-
-- `task_visit_single`
-  One valid visit must be booked.
-- `task_visit_single_hidden_flex`
-  The buyer initially exposes only one slot, but hidden flexibility can be unlocked if the agent offers concrete alternatives.
-- `task_visit_multi`
-  At least two valid non-overlapping visits must be booked.
-- `task_visit_single_seller_followup`
-  The first buyer flow cannot book a visit; a seller follow-up then creates a new listing that can be matched and scheduled.
-
-Scenario declarations live in:
-
-- [server/scenario_factory.py](server/scenario_factory.py)
-- [server/scenarios.py](server/scenarios.py)
-
-## Action Format
-
 `FlatmateRlAction` supports two action types:

@@ -93,52 +135,11 @@ FlatmateRlAction(
 )
 ```

-## Observation Format
-
-Each reset or step returns a `FlatmateRlObservation` with fields such as:
-
-- `status`
-- `scenario_id`
-- `phase`
-- `conversation_history`
-- `last_tool_result`
-- `available_tools`
-- `gathered_fields`
-- `remaining_required_fields`
-- `selected_posts`
-- `booked_visits`
-- `violations`
-- `message`
-
-This gives an RL policy enough structured state to learn the broker workflow while still preserving the conversation transcript.
-
-## Tooling Model
-
-The broker-side tool space includes these buyer-phase tools:
-
-- `store_user_details`
-- `search_posts`
-- `close_buyer_conversation`
-- `match_location_preference`
-- `get_commute_time`
-- `check_calendar_slots`
-- `shortlist`
-- `contact_poster`
-- `book_viewing`
-
-The seller-follow-up phase adds:
-
-- `store_seller_details`
-- `check_table_slot_matches`
-- `confirm_seller_match`
-- `offer_matched_listing_to_buyer`
-- `schedule_table_visit`
-
-The environment enforces sequencing constraints. For example:
-
-- searching before `store_user_details` fails
-- seller follow-up tools cannot be used before `store_seller_details`
-- bookings fail if the required confirmations are missing

  ## Quick Start

@@ -170,29 +171,27 @@ obs = env.step(
 print(obs.last_tool_result)
 ```

-## Training an RL Agent
-
-The action space is mixed discrete-plus-structured:
-
-- choose whether to send a message or call a tool
-- if sending a message, generate natural language
-- if calling a tool, choose the tool and valid JSON arguments
-
-In practice, the easiest setup is usually:
-
-1. use an LLM or seq2seq policy that emits a structured action object
-2. compute reward from `done`, `violations`, `booked_visits`, and `last_tool_result`
-3. train with policy gradient, GRPO, PPO, or offline imitation plus RL fine-tuning
-
-### Example Training Loop
-
-The example below is a minimal policy-gradient-style skeleton. It is intentionally simple and shows how to interact with the environment; it is not a production trainer.

 ```python
-from __future__ import annotations
-
 import random
-from dataclasses import dataclass

 from flatmate_rl import FlatmateRlAction
 from flatmate_rl.server.flatmate_rl_environment import FlatmateRlEnvironment
@@ -205,180 +204,156 @@ SCENARIOS = [
 "task_visit_single_seller_followup",
 ]


-@dataclass
-class Transition:
-    observation_text: str
-    action: FlatmateRlAction
-    reward: float
-    done: bool
-
-
-def flatten_observation(obs) -> str:
-    return (
-        f"scenario={obs.scenario_id}\n"
-        f"phase={obs.phase}\n"
-        f"status={obs.status}\n"
-        f"remaining={obs.remaining_required_fields}\n"
-        f"available_tools={obs.available_tools}\n"
-        f"selected_posts={obs.selected_posts}\n"
-        f"booked_visits={obs.booked_visits}\n"
-        f"violations={obs.violations}\n"
-        f"message={obs.message}\n"
-        f"last_user_message={obs.last_user_message}\n"
-    )
-
-
-class DummyPolicy:
-    def act(self, obs) -> FlatmateRlAction:
-        remaining = set(obs.remaining_required_fields)
-
-        if "diet" in remaining or "visit_availability" in remaining:
-            return FlatmateRlAction(
-                action_type="assistant_message",
-                assistant_message="Please share your dietary preference and visit availability.",
-            )
-
-        if not obs.buyer_profile_stored and obs.phase == "buyer":
-            return FlatmateRlAction(
-                action_type="tool_call",
-                tool_name="store_user_details",
-                tool_arguments={},
-            )
-
-        if obs.phase == "buyer" and "search_posts" in obs.available_tools:
-            return FlatmateRlAction(
-                action_type="tool_call",
-                tool_name="search_posts",
-                tool_arguments={},
-            )
-
-        available_tools = obs.available_tools
-        fallback_tool = available_tools[0] if available_tools else "store_user_details"
-        return FlatmateRlAction(
-            action_type="tool_call",
-            tool_name=fallback_tool,
-            tool_arguments={},
-        )
-
-    def update(self, trajectory: list[Transition]) -> None:
-        # Replace with PPO / GRPO / REINFORCE / DPO / imitation loss, etc.
-        pass
-
-
-def compute_reward(obs) -> float:
-    reward = 0.0
-    reward += 5.0 * len(obs.booked_visits)
-    reward -= 1.0 * len(obs.violations)
-
-    last_tool = obs.last_tool_result or {}
-    if last_tool.get("success") is True:
-        reward += 0.2
-    if last_tool.get("success") is False:
-        reward -= 0.5
-
-    if obs.done:
-        reward += 10.0
-    return reward
-
-
-def train(num_episodes: int = 1000, max_steps: int = 20) -> None:
-    env = FlatmateRlEnvironment()
-    policy = DummyPolicy()
-
-    for episode_idx in range(num_episodes):
-        scenario_id = random.choice(SCENARIOS)
-        obs = env.reset(scenario_id=scenario_id)
-        trajectory: list[Transition] = []
-
-        for _ in range(max_steps):
-            action = policy.act(obs)
-            next_obs = env.step(action)
-            reward = compute_reward(next_obs)
-
-            trajectory.append(
-                Transition(
-                    observation_text=flatten_observation(obs),
-                    action=action,
-                    reward=reward,
-                    done=next_obs.done,
-                )
-            )
-
-            obs = next_obs
-            if obs.done:
-                break
-
-        policy.update(trajectory)
-
-        if episode_idx % 50 == 0:
-            print(
-                f"episode={episode_idx} "
-                f"scenario={scenario_id} "
-                f"done={obs.done} "
-                f"bookings={len(obs.booked_visits)} "
-                f"violations={len(obs.violations)}"
-            )
-
-
-if __name__ == "__main__":
-    train()
 ```

-### Recommended Training Strategy
-
-For serious training, a better progression is:
-
-1. Start with supervised trajectories for correct broker flows.
-2. Fine-tune with RL on sparse success reward plus shaping reward.
-3. Penalize `violations`, failed tool calls, missing storage steps, and invalid booking attempts.
-4. Reward correct information gathering, correct tool order, valid slot coordination, and successful booking completion.
-
-## Web UI
-
-The environment exposes a custom Gradio UI at `/web`.
-
-It includes:
-
-- scenario selector
-- transcript viewer
-- assistant-message controls
-- tool-call runner with JSON arguments
-- live gathered/remaining field panels
-- selected posts, booked visits, violations
-- request/response payload panes
-
-Run locally:

 ```bash
 cd flatmate_rl
-uv run --project . server
 ```

-Then open:

 ```text
 http://127.0.0.1:8000/web
 ```

-## Docker

-This repo includes a Dockerfile similar to `sudoku_rl`.

-It enables the web interface by default:

-```dockerfile
-ENV ENABLE_WEB_INTERFACE=true
 ```

-Build and run:

 ```bash
 cd flatmate_rl
-docker build -t flatmate_rl .
-docker run -p 8000:8000 flatmate_rl
 ```

 Then open:
@@ -439,6 +414,7 @@ flatmate_rl/
 │   ├── flatmate_rl_environment.py
 │   ├── gradio_ui.py
 │   ├── scenario_factory.py
 │   └── scenarios.py
 └── tests/
     └── test_flatmate_rl.py
@@ -448,4 +424,3 @@ flatmate_rl/

 - The environment is deterministic and designed for RL experimentation, not as a drop-in replacement for the original multi-LLM broker simulator.
 - The current Python 3.13 Anaconda runtime in this workspace can crash when importing parts of `openenv`; using the local Python 3.12 virtualenv is the safer path for testing here.
-# flatmate_rl
 
 ---
+title: Flatmate RL Broker Environment
+emoji: 🏘️
+colorFrom: indigo
 colorTo: green
 sdk: docker
 pinned: false
 base_path: /web
 tags:
 - openenv
 - reinforcement-learning
+- agents
+- tool-use
+- flatmate-search
+- housing
+- scheduling
+- fastapi
+- docker
 ---

 # Flatmate RL

+Flatmate RL is a deterministic OpenEnv reinforcement-learning environment for broker agents. It models flatmate-share search as a multi-step workflow in which the policy must gather details, inspect listings, check slots, coordinate buyer/seller confirmations, and schedule visits only when the guardrails are satisfied.
+
+![Flatmate RL app screenshot](screenshot.png)
+
+Read the full project writeup: [Flatmate RL: Training Agents for Real Flatmate Search](flatmate_rl.md).
+
+## Environment Flow
+
+```mermaid
+flowchart LR
+    O["Observation<br/>conversation, tools, fields,<br/>posts, visits, rewards"] --> P["Policy / RL Agent"]
+    P --> A{"Action"}
+    A --> M["assistant_message"]
+    A --> T["tool_call<br/>tool + JSON args"]
+    M --> E["Flatmate RL<br/>OpenEnv Environment"]
+    T --> E
+    E --> G["Guardrails<br/>tool order, arguments,<br/>slot conflicts, consent"]
+    G --> R["Reward + done<br/>next observation"]
+    R --> O
+```
+
+```mermaid
+flowchart TD
+    U["Buyer / seller request"] --> D["Gather required details"]
+    D --> S["Search and filter posts"]
+    S --> C["Check location, commute,<br/>calendar slots, conflicts"]
+    C --> K["Shortlist / negotiate / waitlist"]
+    K --> Q{"Buyer and poster confirmed?"}
+    Q -- "yes" --> B["Book visit or close deal"]
+    Q -- "no" --> F["Ask follow-up or call next tool"]
+    F --> D
+```
+
+## At A Glance
+
+| Area | Details |
+| --- | --- |
+| Runtime | OpenEnv environment served through FastAPI |
+| Domain | flatmate-share search and visit scheduling |
+| Policy output | `assistant_message` or structured `tool_call` |
+| Observation | transcript, phase, tools, fields, posts, bookings, violations, reward |
+| Reward signal | positive for workflow progress; penalties for invalid order, hallucinated tools, bad bookings |
+| UI | custom Gradio app at `/web` |
+| Deployment | local Docker or Hugging Face Docker Space |
+
+## Scenario Types
+
+| Scenario | What the agent must learn |
+| --- | --- |
+| `task_visit_single` | book one valid visit |
+| `task_visit_single_hidden_flex` | recover when the buyer reveals only one bad slot |
+| `task_visit_multi` | schedule multiple non-overlapping visits |
+| `task_visit_single_seller_followup` | switch from a failed buyer flow to seller follow-up |
+| `task_negotiation_hidden_budget` | discover buyer/seller price overlap |
+| `task_slot_cancellation_waitlist` | waitlist, react to a cancellation, then book |
+| `task_multi_visit_preference_evolution` | update preferences after visits and new listings |
+| `task_visit_conflict_check` | avoid pre-booked slots and propose only open times |
+
+Scenario declarations live in [server/scenarios.py](server/scenarios.py) and are built with helpers from [server/scenario_factory.py](server/scenario_factory.py).
+
+## Synthetic Data And No-Leakage Design
+
+```mermaid
+flowchart LR
+    F["scenario_factory.py<br/>synthetic profiles, posts,<br/>ground truth"] --> S["scenarios.py"]
+    Seed["seed"] --> V["scenario_variants.py<br/>safe value shifts"]
+    S --> V
+    V --> E["Episode"]
+    E --> Obs["Observation"]
+    Strict["STRICT_EVAL_MODE=true"] --> Obs
+```
+
+All scenarios are synthetic. Seeded variants use `random.Random(f"{task_id}:{seed}")` to vary safe surface values such as occupation, rent, budget, and opening messages while preserving the task id, post ids, required tools, feasible slots, required bookings, phase transitions, and the canonical success path.
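
As a hedged illustration of that seeding scheme (the helper and value pools below are hypothetical, not the actual `scenario_variants.py` API), a seeded variant generator can look like:

```python
import random

# Hypothetical sketch of seeded surface variation; the real logic lives in
# server/scenario_variants.py and may differ in names and fields.
OCCUPATIONS = ["media", "finance", "software", "design"]

def vary_scenario(task_id: str, seed: int, base_rent: int) -> dict:
    # The same (task_id, seed) pair always produces the same variant.
    rng = random.Random(f"{task_id}:{seed}")
    return {
        "task_id": task_id,  # structural fields are never varied
        "occupation": rng.choice(OCCUPATIONS),  # safe surface value
        "rent": base_rent + rng.randrange(-2000, 2001, 500),
    }

variant = vary_scenario("task_visit_single", 7, 22000)
assert variant == vary_scenario("task_visit_single", 7, 22000)  # deterministic
```

Because the RNG is keyed on `task_id` and `seed` together, the same seed yields different surface values across tasks while any single episode stays reproducible.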

+The environment should not contain real names, phone numbers, emails, addresses, scraped listings, or private housing records. If names or richer details are added later, generate them only inside [server/scenario_variants.py](server/scenario_variants.py) as synthetic seeded values.
+
+For stricter evaluation, set:
+
+```bash
+STRICT_EVAL_MODE=true
+```
+
+Strict eval mode hides direct scenario labels, difficulty, gathered/remaining fields, violations, tool traces, and rewards from the observation while still allowing sanitized tool results. Use it when you want to reduce prompt leakage during model evaluation.
+
+## Action, Observation, And Tools

 `FlatmateRlAction` supports two action types:

 )
 ```

+Each `reset` or `step` returns a `FlatmateRlObservation` with transcript state, the active phase, available tools, gathered and remaining fields, selected posts, booked visits, violations, `step_reward`, and `total_reward`.
+
+Main buyer tools include `store_user_details`, `search_posts`, `match_location_preference`, `get_commute_time`, `check_calendar_slots`, `shortlist`, `contact_poster`, and `book_viewing`. Scenario-specific tools add negotiation, waitlist, debrief, new-arrival filtering, and seller-follow-up workflows.
+
+Guardrails penalize searching before storing user details, using seller tools before storing seller details, booking before slot checks and confirmations, unknown tools, missing arguments, repeated successful calls, and non-canonical ordering.
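
A minimal sketch of how such ordering guardrails can be expressed (the function and state keys here are illustrative, not the environment's internals):

```python
# Illustrative tool-order guardrail; the environment's real checks are
# richer (argument validation, repeats, canonical ordering) but have
# this general shape: inspect state, return violation labels.
def tool_order_violations(tool_name: str, state: dict) -> list[str]:
    violations = []
    if tool_name == "search_posts" and not state.get("buyer_profile_stored"):
        violations.append("search_before_store_user_details")
    if tool_name == "confirm_seller_match" and not state.get("seller_details_stored"):
        violations.append("seller_tool_before_store_seller_details")
    if tool_name == "book_viewing" and not (
        state.get("slots_checked") and state.get("buyer_confirmed")
    ):
        violations.append("booking_before_checks_and_confirmations")
    return violations

print(tool_order_violations("search_posts", {}))
# ['search_before_store_user_details']
```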
  ## Quick Start

 print(obs.last_tool_result)
 ```

+## Training An RL Agent
+
+Use the environment as a reward source for an LLM or seq2seq policy that emits JSON actions.
+
+```mermaid
+flowchart LR
+    Reset["reset(scenario_id, seed)"] --> Prompt["serialize observation"]
+    Prompt --> Model["policy model"]
+    Model --> Parse["parse JSON action"]
+    Parse --> Step["env.step(action)"]
+    Step --> Reward["step_reward / total_reward"]
+    Reward --> Update["SFT, GRPO, PPO,<br/>REINFORCE, eval"]
+    Step --> Prompt
+```
+
+Recommended path: start with SFT/imitation on valid trajectories, then fine-tune with GRPO/PPO/REINFORCE on endpoint reward. Evaluate on held-out seeds with `STRICT_EVAL_MODE=true`.
+
+Minimal local loop:

 ```python
 import random

 from flatmate_rl import FlatmateRlAction
 from flatmate_rl.server.flatmate_rl_environment import FlatmateRlEnvironment

 "task_visit_single_seller_followup",
 ]

+env = FlatmateRlEnvironment()
+
+for episode_idx in range(100):
+    obs = env.reset(scenario_id=random.choice(SCENARIOS), seed=episode_idx)
+
+    while not obs.done:
+        prompt = obs.model_dump()
+        action_json = policy_generate_json(prompt)  # your model
+        action = FlatmateRlAction.model_validate(action_json)
+        obs = env.step(action)
+        update_policy(obs.step_reward, obs.total_reward, obs.done)
 ```

+When training against Docker or the Hugging Face Space, use `/ws`; a websocket session keeps one environment instance alive across `reset` and `step`.

+```python
+import asyncio
+import json
+
+import websockets
+
+
+async def rollout(ws_url: str) -> None:
+    async with websockets.connect(ws_url, open_timeout=120, ping_timeout=120) as ws:
+        await ws.send(json.dumps({"type": "reset", "data": {"scenario_id": "task_visit_single", "seed": 7}}))
+        reset_payload = json.loads(await ws.recv())
+
+        action = {
+            "action_type": "assistant_message",
+            "assistant_message": "Please share your dietary preference and visit availability.",
+        }
+        await ws.send(json.dumps({"type": "step", "data": action}))
+        step_payload = json.loads(await ws.recv())
+
+        print(reset_payload["observation"]["status"])
+        print(step_payload["reward"], step_payload["done"])
+        await ws.send(json.dumps({"type": "close"}))
+
+
+asyncio.run(rollout("ws://127.0.0.1:8000/ws"))
+# Hosted Space: wss://kushalexplores-flatmate-rl.hf.space/ws
+```
+
+## Running With Docker
+
+```mermaid
+flowchart LR
+    Repo["flatmate_rl repo"] --> Docker["Dockerfile<br/>OpenEnv base + uv sync"]
+    Docker --> Server["uvicorn server.app:app<br/>port 8000"]
+    Server --> UI["/web Gradio UI"]
+    Server --> WS["/ws training endpoint"]
+    Server --> Health["/health"]
+```
+
+Build and run locally:

 ```bash
 cd flatmate_rl
+docker build -t flatmate_rl .
+docker run --rm -p 8000:8000 flatmate_rl
 ```

+Open the UI:

 ```text
 http://127.0.0.1:8000/web
 ```

+Use the websocket endpoint for training:

+```text
+ws://127.0.0.1:8000/ws
+```
+
+The Dockerfile uses the OpenEnv base image, installs dependencies with `uv`, sets `ENABLE_WEB_INTERFACE=true`, exposes the app on port `8000`, and starts:
+
+```bash
+uvicorn server.app:app --host 0.0.0.0 --port 8000
+```
+## Hugging Face Space Deployment
+
+```mermaid
+flowchart LR
+    HF["Hugging Face Space<br/>kushalExplores/flatmate_rl"] --> Build["Docker build"]
+    Build --> App["FastAPI OpenEnv app"]
+    App --> Web["/web"]
+    App --> Train["wss://.../ws"]
+```
+
+The deployed Space is:
+
+```text
+https://huggingface.co/spaces/kushalExplores/flatmate_rl
+```
+
+The Space is configured as a Docker/FastAPI app:
+
+```yaml
+sdk: docker
+app_port: 8000
+base_path: /web
+```
+
+The OpenEnv deployment config is in [openenv.yaml](openenv.yaml):
+
+```yaml
+spec_version: 1
+name: flatmate_rl
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000
+```
+
+Programmatic training endpoint:

+```text
+wss://kushalexplores-flatmate-rl.hf.space/ws
+```
+
+For the browser UI, open:

+```text
+https://kushalexplores-flatmate-rl.hf.space/web
 ```

+If Hugging Face changes the direct app subdomain, open the Space page and use the app link shown there.
+
+The server is configured with `max_concurrent_envs=4`, so keep GRPO/PPO reward workers conservative at first, and increase rollout concurrency only after the endpoint is stable.
+
+## Web UI
+
+The environment exposes a custom Gradio UI at `/web`.
+
+It includes:
+
+- scenario selector
+- transcript viewer
+- assistant-message controls
+- tool-call runner with JSON arguments
+- live gathered/remaining field panels
+- selected posts, booked visits, and violations
+- request/response payload panes
+
+Run locally:

 ```bash
 cd flatmate_rl
+uv run --project . server
 ```

 Then open:
 
 │   ├── flatmate_rl_environment.py
 │   ├── gradio_ui.py
 │   ├── scenario_factory.py
+│   ├── scenario_variants.py
 │   └── scenarios.py
 └── tests/
     └── test_flatmate_rl.py

 - The environment is deterministic and designed for RL experimentation, not as a drop-in replacement for the original multi-LLM broker simulator.
 - The current Python 3.13 Anaconda runtime in this workspace can crash when importing parts of `openenv`; using the local Python 3.12 virtualenv is the safer path for testing here.
flatmate_rl.md ADDED
@@ -0,0 +1,268 @@
+# Flatmate RL: Training Agents For Real Flatmate Search
+
+Finding a flatmate sounds simple until you actually do it.
+
+You open a Facebook group, WhatsApp group, Telegram channel, or listing board. You see hundreds of posts. Some are repeated. Some are old. Some do not mention rent. Some do not mention diet, gender preference, deposit, move-in date, or whether visits are even possible.
+
+Then the real work begins.
+
+You message one person. They do not reply.
+
+You message another. The room is already taken.
+
+You find a good place, but the visit slot clashes with work.
+
+You visit one flat and realize the area is too noisy.
+
+You change your preferences, and now you have to start searching again.
+
+This is why flatmate search is not just a search problem. It is a coordination problem.
+
+## The Real Problem
+
+Most flatmate platforms and groups are built like feeds.
+
+They show posts, but they do not manage the search.
+
+```text
++----------------------+      +----------------------+
+| Flatmate group/feed  | ---> | User does all work   |
++----------------------+      +----------------------+
+| Repeated posts       |      | Filter listings      |
+| Missing details      |      | Message posters      |
+| Outdated listings    |      | Check availability   |
+| Unclear visit slots  |      | Remember preferences |
++----------------------+      +----------------------+
+```
+
+The interface gives you information, but the burden stays on you.
+
+You have to remember which post matched your budget, which poster replied, which room was vegetarian-only, which visit was possible on Saturday, and which place you rejected after seeing it.
+
+That is a lot of hidden work.
+
+## What A Broker Actually Does
+
+A good human broker does not just show listings.
+
+They manage the process.
+
+They ask for missing preferences. They filter bad leads. They check whether a flat is still available. They coordinate with owners or current flatmates. They schedule visits. They remember feedback after each visit.
+
+In other words, a broker turns a messy feed into a workflow.
+
+```text
++-------------+     +-------------+     +-------------+
+| Understand  | --> | Match       | --> | Coordinate  |
+| the buyer   |     | listings    |     | visits      |
++-------------+     +-------------+     +-------------+
+      ^                                       |
+      |                                       v
++-------------+                         +-------------+
+| Update      | <---------------------- | Learn from  |
+| preferences |                         | feedback    |
++-------------+                         +-------------+
+```
+
+Flatmate RL is built around this idea: train an agent to behave more like a reliable broker than a passive search box.
+
+## Feed Search vs Agent Search
+
+Here is the difference in concrete terms.
+
+| Situation | Feed-based search | Agent-led search |
+| --- | --- | --- |
+| Repeated listing | You see the same Andheri West room posted multiple times. | The agent treats duplicates as one lead and checks whether it is still active. |
+| Missing details | A post says "DM for rent" and gives no diet or move-in details. | The agent asks for the missing fields before considering it a serious option. |
+| Budget mismatch | You skip a Rs. 24,000 room because your budget is Rs. 20,000. | The agent can check whether negotiation is possible. |
+| Visit timing | You ask "Can I visit Saturday?" and wait for a reply. | The agent checks open slots and proposes only valid times. |
+| Preference change | After a visit, you realize you need a quieter area. | The agent updates your profile and changes future recommendations. |
+| New listing later | A good seller appears two days after you stopped checking. | The agent can match new supply back to your saved preferences. |
+
+That is the product gap Flatmate RL is trying to model.
+
+## What Flatmate RL Is
+
+Flatmate RL is an OpenEnv reinforcement-learning environment for training broker-style agents.
+
+The agent is placed inside a simulated flatmate-search workflow. It sees the current search state: preferences, listings, messages, calendar slots, and previous actions.
+
+At every step, it must choose one action:
+
+- send a message when more information or confirmation is needed
+- call a structured broker tool
+
+The goal is not just to recommend a post. The goal is to complete the workflow correctly.
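
Concretely, the two action shapes are small JSON payloads (field names follow the `FlatmateRlAction` examples in the README; the values here are illustrative):

```python
# The two payload shapes a policy must emit each step; values are examples.
message_action = {
    "action_type": "assistant_message",
    "assistant_message": "What is your diet preference, and when can you visit?",
}

tool_action = {
    "action_type": "tool_call",
    "tool_name": "search_posts",  # should be one of the currently available tools
    "tool_arguments": {},         # JSON arguments for the chosen tool
}
```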
+
+## The Environment In One Diagram
+
+```text
++-----------------------------+
+| Current housing situation   |
++-----------------------------+
+| Buyer preferences           |
+| Seller/listing details      |
+| Chat history                |
+| Calendar slots              |
++-------------+---------------+
+              |
+              v
++-----------------------------+
+| Broker agent decides        |
++-----------------------------+
+| Ask a question?             |
+| Search posts?               |
+| Check slots?                |
+| Contact poster?             |
+| Book visit?                 |
++-------------+---------------+
+              |
+              v
++-----------------------------+
+| Environment checks action   |
++-----------------------------+
+| Did it follow the rules?    |
+| Did it move the task ahead? |
+| Should it get reward?       |
++-------------+---------------+
+              |
+              v
++-----------------------------+
+| Updated state               |
++-----------------------------+
+| New facts are stored        |
+| Bad actions are penalized   |
+| Good progress is rewarded   |
++-----------------------------+
+```
+
+This loop teaches the model when to talk, when to use tools, what arguments to pass, and when it is safe to book a visit.
+
+## The Tools
+
+The agent has tools that mirror the work a broker would do.
+
+Buyer-side tools include:
+
+- `store_user_details`
+- `search_posts`
+- `match_location_preference`
+- `get_commute_time`
+- `check_calendar_slots`
+- `shortlist`
+- `contact_poster`
+- `book_viewing`
+
+Advanced tools handle seller follow-up, negotiation, waitlists, cancellations, and feedback after visits.
+
+The important part is sequencing.
+
+The agent cannot just book a visit because a listing looks good. It first needs enough buyer details, a matching listing, available calendar slots, buyer confirmation, and poster confirmation.
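
That precondition list reads as a single conjunction; a minimal sketch (the state keys are illustrative, not the environment's internals):

```python
# Sketch of the booking precondition above: every requirement must hold
# before book_viewing is a valid action. Keys are illustrative.
REQUIRED_FOR_BOOKING = (
    "buyer_details",
    "matched_listing",
    "open_slot",
    "buyer_confirmed",
    "poster_confirmed",
)

def can_book(state: dict) -> bool:
    return all(state.get(key) for key in REQUIRED_FOR_BOOKING)

state = {key: True for key in REQUIRED_FOR_BOOKING}
state["poster_confirmed"] = False
print(can_book(state))  # False: the poster has not confirmed yet
```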
+
+## Example: Booking One Visit
+
+Suppose a buyer says:
+
+> I want a room near Andheri West, budget around Rs. 22,000. I work in media.
+
+That is not enough to book a visit.
+
+The agent should first ask for the missing details, such as diet preference and visit availability.
+
+Then it can search posts, match the area, check commute, inspect calendar slots, contact the poster, and book only after both sides agree.
+
+```text
+Buyer request
+      |
+      v
+Ask missing details
+      |
+      v
+Search and match listings
+      |
+      v
+Check visit slots
+      |
+      v
+Confirm buyer + poster
+      |
+      v
+Book viewing
+```
+
+This is a small example, but it captures the main difference.
+
+A search engine can return a listing. A broker agent has to finish the job.
+
+## Example: Preferences Change After A Visit
+
+Flatmate search changes after real visits.
+
+A buyer may start by saying they want Andheri West. After visiting one flat, they may realize the building is too noisy. After another visit, they may decide they need a gym nearby.
+
+A normal feed does not remember that.
+
+Flatmate RL rewards an agent that can debrief the visit, update the buyer profile, and continue the search with the new information.
+
+```text
+Visit flat
+     |
+     v
+Buyer gives feedback
+     |
+     v
+Agent updates preferences
+     |
+     v
+New search is more specific
+```
+
+This matters because real users do not know every preference upfront. A useful agent must learn during the process.
+
+## Example: Negotiation
+
+Sometimes a listing looks too expensive, but the deal is still possible.
+
+Imagine a room listed at Rs. 24,000. The buyer says their budget is Rs. 20,000 but they can stretch to Rs. 22,000. The seller wants Rs. 24,000 but would accept Rs. 21,000.
+
+A static filter may reject the listing.
+
+A broker agent can check whether there is overlap and propose a price that both sides accept.
+
+That is another reason this is not just retrieval. The best outcome may require several careful steps.
+
+## Why Reinforcement Learning Fits
+
+Flatmate search has delayed outcomes.
+
+An early mistake can break the whole workflow. If the agent books before checking calendar slots, the booking is invalid. If it contacts a poster before collecting buyer details, the poster may not have enough information. If it ignores feedback after a visit, it keeps recommending the wrong places.
+
+Reinforcement learning is useful because the model can learn from the result of the whole sequence, not just from one message.
+
+Good behavior gets rewarded:
+
+- collecting missing details
+- using tools in the right order
+- avoiding unavailable slots
+- getting both sides to confirm
+- adapting after feedback
+
+Bad behavior gets penalized:
+
+- calling tools too early
+- repeating actions
+- booking without consent
+- ignoring calendar conflicts
+- getting stuck in loops
256
+ - getting stuck in loops
257
+
258
+ Over time, the overall loss should go down because the agent makes fewer workflow errors. In this environment, lower error means fewer invalid tool calls, fewer missed confirmations, fewer bad slot choices, and more completed bookings or deals.
259
+
260
+ ## What Success Looks Like
261
+
262
+ A strong Flatmate RL agent should feel less like a chatbot and more like an operational assistant.
263
+
264
+ It should remember what the buyer wants. It should search continuously. It should filter noisy supply. It should coordinate with sellers. It should handle calendar conflicts. It should negotiate when there is room. It should adapt after visits.
265
+
266
+ Most importantly, it should only book when the workflow is actually valid.
267
+
268
+ That is the real-world problem Flatmate RL tries to make trainable: turning messy flatmate search into a repeatable agent task with state, tools, feedback, and measurable progress.