kushalExplores committed
Commit 06a5966 · verified · 1 Parent(s): ffd37a5

Upload HF_BLOG.md with huggingface_hub

Files changed (1): HF_BLOG.md +83 -0
HF_BLOG.md CHANGED
@@ -17,6 +17,33 @@ At the end of this process, most people still do not have a clean path forward.
@@ -39,6 +66,25 @@ The environment tracks the state that matters in a real broker workflow:
@@ -101,6 +147,29 @@ These examples mirror the lifecycle of a real broker conversation. The best acti
@@ -125,6 +194,20 @@ The `task_multi_visit_preference_evolution` scenario captures a problem brokers
 
 This is the problem Flatmate RL models. A human broker is essentially an agent operating in a messy housing environment: they have access to listings, talk to owners and buyers, check calendars, negotiate on behalf of both sides, and carry context across days or weeks. Flatmate RL turns that workflow into an OpenEnv reinforcement-learning environment.

+ ## The Broker Loop At A Glance
+
+ ```mermaid
+ flowchart TD
+ A[Buyer shares initial preferences] --> B{Required details complete?}
+ B -- No --> C[Ask for missing fields]
+ C --> B
+ B -- Yes --> D[Store buyer profile]
+ D --> E[Search and filter posts]
+ E --> F[Match area, budget, lifestyle, commute]
+ F --> G{Good listing found?}
+ G -- No --> H[Waitlist or monitor new arrivals]
+ H --> E
+ G -- Yes --> I[Check calendar slots]
+ I --> J[Contact poster or flatmate]
+ J --> K{Buyer and poster confirm same slot?}
+ K -- No --> L[Ask, adjust, or try next option]
+ L --> F
+ K -- Yes --> M[Book viewing]
+ M --> N{Visit outcome final?}
+ N -- No --> O[Debrief and update preferences]
+ O --> E
+ N -- Yes --> P[Close conversation or deal]
+ ```
+
+ This loop is what the environment asks a model to learn. The agent is rewarded for moving the workflow forward, not for jumping straight from a rough buyer request to an unsupported booking.
+
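The "reward forward progress" idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stage names and reward values are invented for the example, not Flatmate RL's actual reward code.

```python
# Minimal sketch of a stage-gated reward. Stage names and reward values
# are illustrative assumptions, not the environment's actual code.

WORKFLOW_STAGES = [
    "profile_complete",  # required buyer details collected
    "profile_stored",    # buyer profile saved via a tool call
    "match_found",       # listing passes area/budget/lifestyle filters
    "slot_confirmed",    # buyer and poster agree on the same slot
    "viewing_booked",    # booking accepted by the environment
]

def stage_reward(completed: set, new_stage: str) -> float:
    """Reward a stage only if every earlier stage is already done."""
    prerequisites = WORKFLOW_STAGES[:WORKFLOW_STAGES.index(new_stage)]
    if all(s in completed for s in prerequisites):
        return 1.0   # legitimate forward progress
    return -1.0      # e.g. a booking attempted straight from a rough request

# Jumping straight to a booking is penalized...
print(stage_reward(set(), "viewing_booked"))                      # -1.0
# ...while booking after the full workflow is rewarded.
print(stage_reward(set(WORKFLOW_STAGES[:-1]), "viewing_booked"))  # 1.0
```

The gating is the point: a policy that skips straight to the terminal action scores worse than one that completes the intermediate broker steps in order.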
 ## What Flatmate RL Models

 Flatmate RL simulates a broker for flatmate-share discovery and visit scheduling. The agent does not just rank listings. It must make step-by-step decisions under constraints.
 
 That makes the task more useful than a static recommendation benchmark. The model has to learn the process that converts an uncertain lead into a confirmed visit or deal.

+ ```mermaid
+ flowchart LR
+ O[Observation] --> P[Policy model]
+ P --> A{Action}
+ A -- assistant_message --> U[Buyer or seller conversation]
+ A -- tool_call JSON --> T[Broker tools]
+ T --> S[Environment state]
+ U --> S
+ S --> R[Reward and validation]
+ R --> O
+
+ S -. tracks .-> C[Conversations]
+ S -. tracks .-> L[Listings and matches]
+ S -. tracks .-> K[Calendar slots]
+ S -. tracks .-> V[Consent and violations]
+ ```
+
+ The state is not just a chat transcript. It carries operational facts like selected posts, available slots, already-booked times, confirmations, and failed tool calls. That is why the policy has to learn both conversation and execution.
+
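To make "operational facts beyond the transcript" concrete, here is a hypothetical state container. The field names and the booking rule are assumptions for illustration, not the actual Flatmate RL schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of environment state that carries operational facts
# alongside the transcript. Field names are assumptions, not the real schema.

@dataclass
class BrokerState:
    transcript: list = field(default_factory=list)
    selected_posts: list = field(default_factory=list)   # e.g. "post_023"
    available_slots: set = field(default_factory=set)    # e.g. "Sat 11am"
    booked_slots: set = field(default_factory=set)
    confirmations: dict = field(default_factory=dict)    # party -> confirmed?
    failed_tool_calls: list = field(default_factory=list)

    def can_book(self, slot: str) -> bool:
        """A booking is valid only if the slot is free and both sides confirmed."""
        return (
            slot in self.available_slots
            and slot not in self.booked_slots
            and self.confirmations.get("buyer", False)
            and self.confirmations.get("poster", False)
        )

state = BrokerState(available_slots={"Sat 11am"})
print(state.can_book("Sat 11am"))  # False: nobody has confirmed yet

state.confirmations = {"buyer": True, "poster": True}
print(state.can_book("Sat 11am"))  # True: slot free, both sides confirmed
```

A chat-only policy cannot satisfy a check like `can_book`; it has to have executed the right tool calls first, which is exactly the "conversation and execution" point above.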
 ## Why This Needs an Agent

 Flat hunting through social posts breaks down because relevant information is scattered and time-sensitive. A post may be good but buried in spam. A listing may look relevant but already be booked. A buyer may say they are only available Tuesday, but become flexible when shown strong Saturday or Sunday options. A seller may arrive later with a better listing after the first buyer flow fails.
 
 In `task_visit_single`, the buyer starts with budget, area, and occupation. The agent still needs diet and visit availability. A successful policy learns a sequence like this:

+ ```mermaid
+ sequenceDiagram
+ participant B as Buyer
+ participant A as Broker agent
+ participant E as Flatmate RL env
+ participant P as Poster
+
+ B->>A: Budget, area, occupation
+ A->>B: Ask diet and visit availability
+ A->>E: store_user_details
+ A->>E: search_posts
+ A->>E: match_location_preference
+ A->>E: get_commute_time
+ A->>E: check_calendar_slots
+ E-->>A: Saturday 11am is available
+ A->>B: Offer matched listing and slot
+ B-->>A: Confirms Saturday 11am
+ A->>P: Contact poster with buyer profile and slot
+ P-->>A: Confirms same slot
+ A->>E: book_viewing
+ E-->>A: Booking accepted and reward issued
+ ```
+
 ```json
 {"action_type": "assistant_message", "assistant_message": "Please share your dietary preference and visit availability."}
 ```
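Assuming a `tool_call` action type shaped like the `assistant_message` JSON above (the exact schema is an assumption on my part), the successful sequence could serialize as a list of action objects. The tool names come from the post; the argument payloads are illustrative.

```python
import json

# Hypothetical serialization of the successful task_visit_single sequence.
# Tool names are from the post; the "tool"/"args" fields and their values
# are assumptions about the action schema, shown for illustration only.

actions = [
    {"action_type": "assistant_message",
     "assistant_message": "Please share your dietary preference and visit availability."},
    {"action_type": "tool_call", "tool": "store_user_details", "args": {}},
    {"action_type": "tool_call", "tool": "search_posts", "args": {}},
    {"action_type": "tool_call", "tool": "match_location_preference", "args": {}},
    {"action_type": "tool_call", "tool": "get_commute_time", "args": {}},
    {"action_type": "tool_call", "tool": "check_calendar_slots", "args": {}},
    {"action_type": "tool_call", "tool": "book_viewing", "args": {"slot": "Saturday 11am"}},
]

# Each step is emitted as one JSON object, like the example above.
for a in actions:
    print(json.dumps(a))
```

The ordering matters: `book_viewing` only succeeds because the profile, match, and calendar checks happened first.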
 
 The buyer first visits `post_023` and realizes the area is too noisy. The agent must debrief the visit and update the profile. Then new listings arrive. Some are relevant, some are not. After another visit, the buyer realizes they also want a nearby gym. The agent has to filter new arrivals again and keep searching until it finds `post_067`, which is quiet and has gym access nearby.

+ ```mermaid
+ flowchart TD
+ A[Initial profile] --> B[Book first visit: post_023]
+ B --> C[Debrief: area is too noisy]
+ C --> D[Update profile: prefers quieter area]
+ D --> E[Filter new arrivals]
+ E --> F[Book next relevant visit]
+ F --> G[Debrief: wants nearby gym too]
+ G --> H[Update profile again]
+ H --> I[Filter new arrivals with quiet + gym constraints]
+ I --> J[Find post_067]
+ J --> K[Confirm final match]
+ ```
+
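The evolving-preference loop above amounts to re-filtering arrivals under a growing constraint set. Here is a minimal sketch of that idea: `post_023` and `post_067` come from the post, while `post_041`, the attribute names, and their values are illustrative assumptions.

```python
# Sketch of preference evolution: each visit debrief adds a constraint,
# and new arrivals are re-filtered against everything learned so far.
# Listing attributes and post_041 are illustrative, not environment data.

listings = {
    "post_023": {"quiet": False, "gym_nearby": False},
    "post_041": {"quiet": True,  "gym_nearby": False},
    "post_067": {"quiet": True,  "gym_nearby": True},
}

def filter_listings(constraints: dict) -> list:
    """Keep listings that satisfy every constraint learned so far."""
    return [pid for pid, attrs in listings.items()
            if all(attrs.get(k) == v for k, v in constraints.items())]

constraints = {}
print(filter_listings(constraints))   # no constraints yet: all three pass

constraints["quiet"] = True           # debrief 1: area was too noisy
print(filter_listings(constraints))   # ['post_041', 'post_067']

constraints["gym_nearby"] = True      # debrief 2: wants a nearby gym too
print(filter_listings(constraints))   # ['post_067']
```

The target set shrinks as feedback arrives, which is why a static one-shot ranking of the initial listings cannot solve this task.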
 This is important for model training because the target changes during the episode. The model must improve its plan as it gathers real feedback.

 ## Example: Negotiation Is A Sequential Skill