kushalExplores committed on
Commit
ca1ea5b
·
verified ·
1 Parent(s): 06a5966

Delete HF_BLOG.md

Files changed (1)
  1. HF_BLOG.md +0 -259
HF_BLOG.md DELETED
@@ -1,259 +0,0 @@
1
- # Flatmate RL: Training Broker Agents for Real Flatmate Search
2
-
3
- Finding a flatmate-share is rarely a clean search problem. A person who has just moved to a city has to scan flat and flatmate groups, sort through repeated posts and spam, identify listings that match their budget and preferences, message owners or current flatmates, check whether the place is still available, coordinate visit slots, and then repeat the whole process when one detail does not line up.
4
-
5
- The hard part is not only "find a relevant listing." The hard part is the loop:
6
-
7
- 1. Find posts that might match.
8
- 2. Filter out bad or unavailable listings.
9
- 3. Ask the buyer for missing preferences.
10
- 4. Contact the listing owner or flatmate.
11
- 5. Check whether the flat is still available.
12
- 6. Match both sides on budget, lifestyle, location, commute, and visit time.
13
- 7. Schedule visits without double-booking or assuming consent.
14
- 8. Keep going when preferences change after real visits.
15
-
16
- Even at the end of this process, many people still do not have a clean path forward. They end up contacting multiple brokers, repeating the same preferences, re-checking the same details, and hoping one broker happens to have the right listing at the right time. The search becomes painful, slow, and unreliable.
17
-
18
- This is the problem Flatmate RL models. A human broker is essentially an agent operating in a messy housing environment: they have access to listings, talk to owners and buyers, check calendars, negotiate on behalf of both sides, and carry context across days or weeks. Flatmate RL turns that workflow into an OpenEnv reinforcement-learning environment.
19
-
20
- ## The Broker Loop At A Glance
21
-
22
- ```mermaid
23
- flowchart TD
24
- A[Buyer shares initial preferences] --> B{Required details complete?}
25
- B -- No --> C[Ask for missing fields]
26
- C --> B
27
- B -- Yes --> D[Store buyer profile]
28
- D --> E[Search and filter posts]
29
- E --> F[Match area, budget, lifestyle, commute]
30
- F --> G{Good listing found?}
31
- G -- No --> H[Waitlist or monitor new arrivals]
32
- H --> E
33
- G -- Yes --> I[Check calendar slots]
34
- I --> J[Contact poster or flatmate]
35
- J --> K{Buyer and poster confirm same slot?}
36
- K -- No --> L[Ask, adjust, or try next option]
37
- L --> F
38
- K -- Yes --> M[Book viewing]
39
- M --> N{Visit outcome final?}
40
- N -- No --> O[Debrief and update preferences]
41
- O --> E
42
- N -- Yes --> P[Close conversation or deal]
43
- ```
44
-
45
- This loop is what the environment asks a model to learn. The agent is rewarded for moving the workflow forward, not for jumping straight from a rough buyer request to an unsupported booking.
46
-
47
- ## What Flatmate RL Models
48
-
49
- Flatmate RL simulates a broker for flatmate-share discovery and visit scheduling. The agent does not just rank listings. It must make step-by-step decisions under constraints.
50
-
51
- At each step, the policy can either:
52
-
53
- - send an `assistant_message` to the active buyer or seller
54
- - call a broker tool with structured JSON arguments
55
-
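As a minimal sketch, the two action shapes can be built with small helpers. The field names (`action_type`, `assistant_message`, `tool_name`, `tool_arguments`) are taken from the JSON examples later in this post; the helper names `message_action` and `tool_action` are hypothetical and not part of the environment.

```python
import json
from typing import Any


def message_action(text: str) -> dict[str, Any]:
    """Build a conversational action addressed to the active buyer or seller."""
    return {"action_type": "assistant_message", "assistant_message": text}


def tool_action(tool_name: str, **tool_arguments: Any) -> dict[str, Any]:
    """Build a structured tool call; keyword arguments become the JSON arguments."""
    return {
        "action_type": "tool_call",
        "tool_name": tool_name,
        "tool_arguments": tool_arguments,
    }


ask = message_action("Please share your dietary preference and visit availability.")
check = tool_action("check_calendar_slots", post_ids=["post_023"])
print(json.dumps(check))
```

Keeping both shapes behind plain dictionaries means the same object can be serialized for the environment and logged for later supervised training.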
56
- The environment tracks the state that matters in a real broker workflow:
57
-
58
- - buyer and seller conversation history
59
- - missing required fields such as diet, areas, occupation, budget, and visit availability
60
- - selected posts and matched listings
61
- - available broker tools for the current phase
62
- - calendar slots, pre-booked slots, and visit confirmations
63
- - booked visits
64
- - tool trace and policy violations
65
- - step reward and total reward
66
-
67
- That makes the task more useful than a static recommendation benchmark. The model has to learn the process that converts an uncertain lead into a confirmed visit or deal.
68
-
69
- ```mermaid
70
- flowchart LR
71
- O[Observation] --> P[Policy model]
72
- P --> A{Action}
73
- A -- assistant_message --> U[Buyer or seller conversation]
74
- A -- tool_call JSON --> T[Broker tools]
75
- T --> S[Environment state]
76
- U --> S
77
- S --> R[Reward and validation]
78
- R --> O
79
-
80
- S -. tracks .-> C[Conversations]
81
- S -. tracks .-> L[Listings and matches]
82
- S -. tracks .-> K[Calendar slots]
83
- S -. tracks .-> V[Consent and violations]
84
- ```
85
-
86
- The state is not just a chat transcript. It carries operational facts like selected posts, available slots, already-booked times, confirmations, and failed tool calls. That is why the policy has to learn both conversation and execution.
87
-
88
- ## Why This Needs an Agent
89
-
90
- Flat hunting through social posts breaks down because relevant information is scattered and time-sensitive. A post may be good but buried in spam. A listing may look relevant but already be booked. A buyer may say they are only available Tuesday, but become flexible when shown strong Saturday or Sunday options. A seller may arrive later with a better listing after the first buyer flow fails.
91
-
92
- A useful broker agent needs to handle those cases without losing state. It should know when to ask a question, when to search, when to match by area, when to check commute, when to ask the buyer to choose between options, when to contact a poster, and when it is finally safe to book.
93
-
94
- Flatmate RL makes those decisions trainable and measurable.
95
-
96
- ## Tools The Agent Learns To Use
97
-
98
- The broker action space is intentionally mixed: natural language plus structured tools.
99
-
100
- Buyer-side tools include:
101
-
102
- - `store_user_details`
103
- - `search_posts`
104
- - `match_location_preference`
105
- - `get_commute_time`
106
- - `check_calendar_slots`
107
- - `shortlist`
108
- - `contact_poster`
109
- - `book_viewing`
110
- - `close_buyer_conversation`
111
-
112
- Seller and advanced-flow tools include:
113
-
114
- - `store_seller_details`
115
- - `check_table_slot_matches`
116
- - `confirm_seller_match`
117
- - `offer_matched_listing_to_buyer`
118
- - `schedule_table_visit`
119
- - `propose_price_to_buyer`
120
- - `propose_price_to_seller`
121
- - `confirm_negotiated_deal`
122
- - `add_to_waitlist`
123
- - `notify_buyer_slot_freed`
124
- - `debrief_visit`
125
- - `filter_new_arrivals`
126
-
127
- The environment enforces sequencing. For example, searching before storing user details fails, as does booking before checking calendar slots or booking without confirmation from both the buyer and the poster. Repeating the same successful tool call is penalized. These constraints force the model to learn a broker workflow instead of calling tools at random.
128
-
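The ordering rules named above can be sketched as a precondition check. This is an illustration under assumptions, not the environment's actual validator: the `PREREQUISITES` map, the `validate_call` function, and the two confirmation flags are hypothetical and cover only the rules this post states.

```python
# Hypothetical prerequisite map covering only the ordering rules stated in the post.
PREREQUISITES = {
    "search_posts": {"store_user_details"},
    "book_viewing": {"check_calendar_slots"},
}


def validate_call(tool_name, succeeded, buyer_confirmed=False, poster_confirmed=False):
    """Return (ok, reason) for a proposed call, given the tools that already succeeded."""
    missing = PREREQUISITES.get(tool_name, set()) - set(succeeded)
    if missing:
        return False, f"missing prerequisite: {sorted(missing)[0]}"
    if tool_name == "book_viewing" and not (buyer_confirmed and poster_confirmed):
        return False, "booking needs buyer and poster confirmation"
    return True, "ok"


print(validate_call("search_posts", []))
print(validate_call("book_viewing", ["check_calendar_slots"], True, True))
```

Returning a reason string alongside the boolean makes it possible to surface the failure to the policy as part of the next observation, which is one plausible way to let it recover.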
129
- ## Scenarios From The Code
130
-
131
- The scenario definitions live in [`server/scenarios.py`](server/scenarios.py). They are built to represent common real flat-search problems rather than toy tasks.
132
-
133
- | Scenario | Real-world problem | What the agent must learn |
134
- | --- | --- | --- |
135
- | `task_visit_single` | A buyer wants one suitable flatmate-share near Andheri West or Jogeshwari. | Ask for missing diet and availability, search, match listings, check slots, contact the poster, and book only after both sides confirm. |
136
- | `task_visit_single_hidden_flex` | The buyer first shares only Tuesday availability, but good listings have Saturday or Sunday slots. | Surface concrete alternative slots and unlock hidden flexibility instead of giving up too early. |
137
- | `task_visit_multi` | The buyer wants to compare multiple options before deciding. | Shortlist several matching listings, ask which ones to pursue, and book at least two non-overlapping visits. |
138
- | `task_visit_single_seller_followup` | No current listing fits, then a new seller contacts the broker with a relevant property. | Close the first buyer flow, gather seller details, create a new listing, match it to the saved buyer, and schedule the visit. |
139
- | `task_negotiation_hidden_budget` | A listing is above the buyer's stated budget, but both sides have hidden negotiation room. | Probe buyer and seller price limits, discover overlap, and close a negotiated deal. |
140
- | `task_slot_cancellation_waitlist` | A desired flat has all slots pre-booked, but a cancellation opens one later. | Add the buyer to a waitlist, react when the slot opens, notify the buyer, and complete the booking. |
141
- | `task_multi_visit_preference_evolution` | The buyer only discovers true preferences after visiting places. | Debrief after visits, update the buyer profile, filter new arrivals, and keep searching until the final match is found. |
142
- | `task_visit_conflict_check` | A listing shows several slots, but some are already taken by other buyers. | Read `pre_booked_slots`, avoid unavailable times, and propose only the actually open slot. |
143
-
144
- These examples mirror the lifecycle of a real broker conversation. The best action is often not a single recommendation. It is the next correct operational step.
145
-
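For `task_visit_conflict_check`, the step of reading `pre_booked_slots` might look like the sketch below. Only the `pre_booked_slots` name comes from the scenario description; the `calendar_slots` key, the listing layout, the slot values, and the `open_slots` helper are assumptions made for illustration.

```python
def open_slots(listing: dict) -> list[str]:
    """Return advertised slots that are not already taken, preserving listing order."""
    taken = set(listing.get("pre_booked_slots", []))
    return [slot for slot in listing.get("calendar_slots", []) if slot not in taken]


listing = {
    "post_id": "post_example",  # hypothetical ID, not from the scenario data
    "calendar_slots": ["Saturday 11am", "Saturday 4pm", "Sunday 10am"],
    "pre_booked_slots": ["Saturday 11am", "Sunday 10am"],
}
print(open_slots(listing))  # ['Saturday 4pm']
```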
146
- ## Example: A Single Confirmed Visit
147
-
148
- In `task_visit_single`, the buyer starts with budget, area, and occupation. The agent still needs diet and visit availability. A successful policy learns a sequence like this:
149
-
150
- ```mermaid
151
- sequenceDiagram
152
- participant B as Buyer
153
- participant A as Broker agent
154
- participant E as Flatmate RL env
155
- participant P as Poster
156
-
157
- B->>A: Budget, area, occupation
158
- A->>B: Ask diet and visit availability
159
- A->>E: store_user_details
160
- A->>E: search_posts
161
- A->>E: match_location_preference
162
- A->>E: get_commute_time
163
- A->>E: check_calendar_slots
164
- E-->>A: Saturday 11am is available
165
- A->>B: Offer matched listing and slot
166
- B-->>A: Confirms Saturday 11am
167
- A->>P: Contact poster with buyer profile and slot
168
- P-->>A: Confirms same slot
169
- A->>E: book_viewing
170
- E-->>A: Booking accepted and reward issued
171
- ```
172
-
173
- ```json
174
- {"action_type": "assistant_message", "assistant_message": "Please share your dietary preference and visit availability."}
175
- ```
176
-
177
- The message above is the agent's opening question to the buyer. Once the buyer answers, the agent can store the details, search posts, match locations, check commute, and inspect calendar slots:
178
-
179
- ```json
180
- {"action_type": "tool_call", "tool_name": "check_calendar_slots", "tool_arguments": {"post_ids": ["post_023"]}}
181
- ```
182
-
183
- Only after the buyer confirms the slot and the poster confirms both the buyer profile and the same time should the agent book:
184
-
185
- ```json
186
- {"action_type": "tool_call", "tool_name": "book_viewing", "tool_arguments": {"post_id": "post_023", "time_text": "Saturday 11am"}}
187
- ```
188
-
189
- That final booking is not just a search result. It is the result of a valid chain of information gathering, matching, calendar checking, and consent.
190
-
191
- ## Example: Preferences Change After Visits
192
-
193
- The `task_multi_visit_preference_evolution` scenario captures a problem brokers see often: buyers do not know every preference upfront.
194
-
195
- The buyer first visits `post_023` and realizes the area is too noisy. The agent must debrief the visit and update the profile. Then new listings arrive. Some are relevant, some are not. After another visit, the buyer realizes they also want a nearby gym. The agent has to filter new arrivals again and keep searching until it finds `post_067`, which is quiet and has gym access nearby.
196
-
197
- ```mermaid
198
- flowchart TD
199
- A[Initial profile] --> B[Book first visit: post_023]
200
- B --> C[Debrief: area is too noisy]
201
- C --> D[Update profile: prefers quieter area]
202
- D --> E[Filter new arrivals]
203
- E --> F[Book next relevant visit]
204
- F --> G[Debrief: wants nearby gym too]
205
- G --> H[Update profile again]
206
- H --> I[Filter new arrivals with quiet + gym constraints]
207
- I --> J[Find post_067]
208
- J --> K[Confirm final match]
209
- ```
210
-
211
- This matters for model training because the target changes during the episode: the model must revise its plan as it gathers feedback from real visits.
212
-
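One way to picture the evolving target is a profile that accumulates constraints after each debrief and is reapplied to new arrivals. The sketch below is an assumption-heavy illustration: the tag names, the `filter_arrivals` helper, and the `post_a` and `post_b` listings are hypothetical; only `post_067` and its quiet-plus-gym traits come from the scenario description.

```python
def filter_arrivals(arrivals, required_tags):
    """Keep listings whose tags include every preference learned so far."""
    return [post for post in arrivals if required_tags <= post["tags"]]


# Hypothetical arrivals; only post_067 is named in the scenario text.
arrivals = [
    {"post_id": "post_a", "tags": {"quiet"}},
    {"post_id": "post_b", "tags": {"gym_nearby"}},
    {"post_id": "post_067", "tags": {"quiet", "gym_nearby"}},
]

profile = set()
profile.add("quiet")  # learned from the first debrief
after_first = filter_arrivals(arrivals, profile)
profile.add("gym_nearby")  # learned from the second debrief
after_second = filter_arrivals(arrivals, profile)
print([post["post_id"] for post in after_second])  # ['post_067']
```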
213
- ## Example: Negotiation Is A Sequential Skill
214
-
215
- In `task_negotiation_hidden_budget`, the listed rent is Rs. 24,000. The buyer's stated budget is Rs. 20,000, but their hidden ceiling is Rs. 22,000. The seller's hidden floor is Rs. 21,000.
216
-
217
- A static matcher might reject the listing as too expensive. A broker agent should recognize that the listing is negotiable, probe both sides, discover the overlap, and close at an accepted rent. The scenario rewards the model for using `propose_price_to_buyer`, `propose_price_to_seller`, and `confirm_negotiated_deal` in the right order.
218
-
219
- That makes the environment closer to real housing workflows, where a deal can take multiple rounds rather than one retrieval call.
220
-
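The arithmetic behind the overlap is small enough to write out. The rent figures below are the ones stated for this scenario; the `negotiation_range` helper is a hypothetical name used only to show why a static budget filter and a negotiating broker reach different conclusions.

```python
def negotiation_range(buyer_ceiling: int, seller_floor: int):
    """Return the (low, high) rent range both sides accept, or None if there is no overlap."""
    if seller_floor > buyer_ceiling:
        return None
    return seller_floor, buyer_ceiling


listed_rent, stated_budget = 24_000, 20_000
buyer_ceiling, seller_floor = 22_000, 21_000

print(listed_rent <= stated_budget)  # False: a static budget filter rejects the listing
print(negotiation_range(buyer_ceiling, seller_floor))  # (21000, 22000)
```

Any rent between Rs. 21,000 and Rs. 22,000 satisfies both hidden limits, but the agent can only find that range by probing each side in turn.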
221
- ## Reward And Evaluation
222
-
223
- The reward design encourages operational correctness:
224
-
225
- - small positive reward for successful tool progress
226
- - penalties for failed tools, invalid order, redundant calls, calendar mismatches, missing consent, and loops
227
- - completion reward when the scenario's required booking or deal condition is satisfied
228
-
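The three reward components above could combine per step roughly as follows. The magnitudes and the `step_reward` function are illustrative assumptions; the post does not state the environment's actual values or formula.

```python
# Illustrative magnitudes only; the real environment's values are not given in this post.
STEP_BONUS = 0.1
PENALTY = -0.2
COMPLETION_BONUS = 1.0


def step_reward(tool_succeeded: bool, is_repeat: bool, scenario_complete: bool) -> float:
    """Combine progress, penalty, and completion components for a single step."""
    if not tool_succeeded or is_repeat:
        return PENALTY
    reward = STEP_BONUS
    if scenario_complete:
        reward += COMPLETION_BONUS
    return reward


# (tool_succeeded, is_repeat, scenario_complete) for a short hypothetical episode
episode = [(True, False, False), (True, True, False), (False, False, False), (True, False, True)]
total = sum(step_reward(*step) for step in episode)
print(round(total, 2))
```

Because the completion bonus dominates the per-step terms in this sketch, a policy cannot score well by making safe tool calls that never reach a booking or deal.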
229
- Regression tests in [`tests/test_flatmate_rl.py`](tests/test_flatmate_rl.py) and [`tests/test_reward_regression.py`](tests/test_reward_regression.py) check core behavior such as single-visit booking, hidden flexibility, multi-booking, seller follow-up, strict evaluation hiding, and reward stability.
230
-
231
- The result is an environment where model performance can improve in a measurable way: fewer invalid tool calls, better ordering, more valid bookings, and stronger handling of edge cases like pre-booked slots or changed preferences.
232
-
233
- ## Training Shape
234
-
235
- The notebooks in this repository train policies that emit Flatmate RL JSON actions against the live Space endpoint. The practical training path is:
236
-
237
- 1. Collect valid broker trajectories from heuristic or scripted policies.
238
- 2. Supervise a model to emit well-formed action JSON.
239
- 3. Fine-tune with RL so the model learns from delayed success, failed tools, and environment feedback.
240
- 4. Evaluate on the same scenario IDs with fixed seeds and reward regression checks.
241
-
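Step 2 depends on telling well-formed action JSON apart from malformed model output. A minimal sketch of such a check is shown below, assuming the action fields used in this post's JSON examples; `parse_action` is a hypothetical helper and the real environment may validate more than this.

```python
import json


def parse_action(raw: str):
    """Parse model output into an action dict, or return None if it is malformed."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict):
        return None
    kind = action.get("action_type")
    if kind == "assistant_message":
        ok = isinstance(action.get("assistant_message"), str)
    elif kind == "tool_call":
        ok = isinstance(action.get("tool_name"), str) and isinstance(
            action.get("tool_arguments"), dict
        )
    else:
        ok = False
    return action if ok else None


raw = '{"action_type": "tool_call", "tool_name": "book_viewing", "tool_arguments": {"post_id": "post_023", "time_text": "Saturday 11am"}}'
print(parse_action(raw) is not None)
```

A check like this can filter collected trajectories before supervised training and can also gate model outputs during RL so that malformed actions are penalized rather than crashing the rollout.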
242
- The important part is that the model is not only learning what to say. It is learning when to talk, when to use tools, what arguments to pass, and how to recover when the housing workflow changes.
243
-
244
- ## Why This Matters
245
-
246
- A better flatmate search system should not behave like a passive search box. It should behave like a reliable broker that works for the user:
247
-
248
- - it remembers preferences
249
- - it searches and filters continuously
250
- - it checks whether listings are still available
251
- - it coordinates with owners and flatmates
252
- - it handles calendar conflicts
253
- - it negotiates when there is room
254
- - it follows up when new supply appears
255
- - it books only after both sides consent
256
-
257
- Flatmate RL provides a controlled environment for training that kind of agent. It turns the messy broker workflow into repeatable episodes with structured state, realistic scenarios, strict tool contracts, and measurable rewards.
258
-
259
- The end goal is not just to find a flat post. The goal is to train models that can manage the whole housing search loop until a buyer and a compatible flatmate or owner reach a real conclusion.