kushalExplores committed
Commit 06a5966 · verified · 1 Parent(s): ffd37a5

Upload HF_BLOG.md with huggingface_hub

Files changed (1): HF_BLOG.md +83 -0
HF_BLOG.md CHANGED
@@ -17,6 +17,33 @@ At the end of this process, most people still do not have a clean path forward.
@@ -39,6 +66,25 @@ The environment tracks the state that matters in a real broker workflow:
@@ -101,6 +147,29 @@ These examples mirror the lifecycle of a real broker conversation. The best acti
@@ -125,6 +194,20 @@ The `task_multi_visit_preference_evolution` scenario captures a problem brokers
 
 This is the problem Flatmate RL models. A human broker is essentially an agent operating in a messy housing environment: they have access to listings, talk to owners and buyers, check calendars, negotiate on behalf of both sides, and carry context across days or weeks. Flatmate RL turns that workflow into an OpenEnv reinforcement-learning environment.

+ ## The Broker Loop At A Glance
+
+ ```mermaid
+ flowchart TD
+ A[Buyer shares initial preferences] --> B{Required details complete?}
+ B -- No --> C[Ask for missing fields]
+ C --> B
+ B -- Yes --> D[Store buyer profile]
+ D --> E[Search and filter posts]
+ E --> F[Match area, budget, lifestyle, commute]
+ F --> G{Good listing found?}
+ G -- No --> H[Waitlist or monitor new arrivals]
+ H --> E
+ G -- Yes --> I[Check calendar slots]
+ I --> J[Contact poster or flatmate]
+ J --> K{Buyer and poster confirm same slot?}
+ K -- No --> L[Ask, adjust, or try next option]
+ L --> F
+ K -- Yes --> M[Book viewing]
+ M --> N{Visit outcome final?}
+ N -- No --> O[Debrief and update preferences]
+ O --> E
+ N -- Yes --> P[Close conversation or deal]
+ ```
+
+ This loop is what the environment asks a model to learn. The agent is rewarded for moving the workflow forward, not for jumping straight from a rough buyer request to an unsupported booking.
+
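The "reward forward progress" idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the stage names and reward values are invented for the example, not Flatmate RL's actual reward code.

```python
# Minimal sketch of a stage-gated reward. Stage names and reward values
# are illustrative assumptions, not the environment's actual code.

WORKFLOW_STAGES = [
    "profile_complete",  # required buyer details collected
    "profile_stored",    # buyer profile saved via a tool call
    "match_found",       # listing passes area/budget/lifestyle filters
    "slot_confirmed",    # buyer and poster agree on the same slot
    "viewing_booked",    # booking accepted by the environment
]

def stage_reward(completed: set, new_stage: str) -> float:
    """Reward a stage only if every earlier stage is already done."""
    prerequisites = WORKFLOW_STAGES[:WORKFLOW_STAGES.index(new_stage)]
    if all(s in completed for s in prerequisites):
        return 1.0   # legitimate forward progress
    return -1.0      # e.g. a booking attempted straight from a rough request

# Jumping straight to a booking is penalized...
print(stage_reward(set(), "viewing_booked"))                      # -1.0
# ...while booking after the full workflow is rewarded.
print(stage_reward(set(WORKFLOW_STAGES[:-1]), "viewing_booked"))  # 1.0
```

The gating is the point: a policy that skips straight to the terminal action scores worse than one that completes the intermediate broker steps in order.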
 ## What Flatmate RL Models

 Flatmate RL simulates a broker for flatmate-share discovery and visit scheduling. The agent does not just rank listings. It must make step-by-step decisions under constraints.
 
 That makes the task more useful than a static recommendation benchmark. The model has to learn the process that converts an uncertain lead into a confirmed visit or deal.

+ ```mermaid
+ flowchart LR
+ O[Observation] --> P[Policy model]
+ P --> A{Action}
+ A -- assistant_message --> U[Buyer or seller conversation]
+ A -- tool_call JSON --> T[Broker tools]
+ T --> S[Environment state]
+ U --> S
+ S --> R[Reward and validation]
+ R --> O
+
+ S -. tracks .-> C[Conversations]
+ S -. tracks .-> L[Listings and matches]
+ S -. tracks .-> K[Calendar slots]
+ S -. tracks .-> V[Consent and violations]
+ ```
+
+ The state is not just a chat transcript. It carries operational facts like selected posts, available slots, already-booked times, confirmations, and failed tool calls. That is why the policy has to learn both conversation and execution.
+
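To make "operational facts beyond the transcript" concrete, here is a hypothetical state container. The field names and the booking rule are assumptions for illustration, not the actual Flatmate RL schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of environment state that carries operational facts
# alongside the transcript. Field names are assumptions, not the real schema.

@dataclass
class BrokerState:
    transcript: list = field(default_factory=list)
    selected_posts: list = field(default_factory=list)   # e.g. "post_023"
    available_slots: set = field(default_factory=set)    # e.g. "Sat 11am"
    booked_slots: set = field(default_factory=set)
    confirmations: dict = field(default_factory=dict)    # party -> confirmed?
    failed_tool_calls: list = field(default_factory=list)

    def can_book(self, slot: str) -> bool:
        """A booking is valid only if the slot is free and both sides confirmed."""
        return (
            slot in self.available_slots
            and slot not in self.booked_slots
            and self.confirmations.get("buyer", False)
            and self.confirmations.get("poster", False)
        )

state = BrokerState(available_slots={"Sat 11am"})
print(state.can_book("Sat 11am"))  # False: nobody has confirmed yet

state.confirmations = {"buyer": True, "poster": True}
print(state.can_book("Sat 11am"))  # True: slot free, both sides confirmed
```

A chat-only policy cannot satisfy a check like `can_book`; it has to have executed the right tool calls first, which is exactly the "conversation and execution" point above.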
 ## Why This Needs an Agent

 Flat hunting through social posts breaks down because relevant information is scattered and time-sensitive. A post may be good but buried in spam. A listing may look relevant but already be booked. A buyer may say they are only available Tuesday, but become flexible when shown strong Saturday or Sunday options. A seller may arrive later with a better listing after the first buyer flow fails.
 
 In `task_visit_single`, the buyer starts with budget, area, and occupation. The agent still needs diet and visit availability. A successful policy learns a sequence like this:

+ ```mermaid
+ sequenceDiagram
+ participant B as Buyer
+ participant A as Broker agent
+ participant E as Flatmate RL env
+ participant P as Poster
+
+ B->>A: Budget, area, occupation
+ A->>B: Ask diet and visit availability
+ A->>E: store_user_details
+ A->>E: search_posts
+ A->>E: match_location_preference
+ A->>E: get_commute_time
+ A->>E: check_calendar_slots
+ E-->>A: Saturday 11am is available
+ A->>B: Offer matched listing and slot
+ B-->>A: Confirms Saturday 11am
+ A->>P: Contact poster with buyer profile and slot
+ P-->>A: Confirms same slot
+ A->>E: book_viewing
+ E-->>A: Booking accepted and reward issued
+ ```
+
 ```json
 {"action_type": "assistant_message", "assistant_message": "Please share your dietary preference and visit availability."}
 ```
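Assuming a `tool_call` action type shaped like the `assistant_message` JSON above (the exact schema is an assumption on my part), the successful sequence could serialize as a list of action objects. The tool names come from the post; the argument payloads are illustrative.

```python
import json

# Hypothetical serialization of the successful task_visit_single sequence.
# Tool names are from the post; the "tool"/"args" fields and their values
# are assumptions about the action schema, shown for illustration only.

actions = [
    {"action_type": "assistant_message",
     "assistant_message": "Please share your dietary preference and visit availability."},
    {"action_type": "tool_call", "tool": "store_user_details", "args": {}},
    {"action_type": "tool_call", "tool": "search_posts", "args": {}},
    {"action_type": "tool_call", "tool": "match_location_preference", "args": {}},
    {"action_type": "tool_call", "tool": "get_commute_time", "args": {}},
    {"action_type": "tool_call", "tool": "check_calendar_slots", "args": {}},
    {"action_type": "tool_call", "tool": "book_viewing", "args": {"slot": "Saturday 11am"}},
]

# Each step is emitted as one JSON object, like the example above.
for a in actions:
    print(json.dumps(a))
```

The ordering matters: `book_viewing` only succeeds because the profile, match, and calendar checks happened first.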
 
 The buyer first visits `post_023` and realizes the area is too noisy. The agent must debrief the visit and update the profile. Then new listings arrive. Some are relevant, some are not. After another visit, the buyer realizes they also want a nearby gym. The agent has to filter new arrivals again and keep searching until it finds `post_067`, which is quiet and has gym access nearby.

+ ```mermaid
+ flowchart TD
+ A[Initial profile] --> B[Book first visit: post_023]
+ B --> C[Debrief: area is too noisy]
+ C --> D[Update profile: prefers quieter area]
+ D --> E[Filter new arrivals]
+ E --> F[Book next relevant visit]
+ F --> G[Debrief: wants nearby gym too]
+ G --> H[Update profile again]
+ H --> I[Filter new arrivals with quiet + gym constraints]
+ I --> J[Find post_067]
+ J --> K[Confirm final match]
+ ```
+
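The evolving-preference loop above amounts to re-filtering arrivals under a growing constraint set. Here is a minimal sketch of that idea: `post_023` and `post_067` come from the post, while `post_041`, the attribute names, and their values are illustrative assumptions.

```python
# Sketch of preference evolution: each visit debrief adds a constraint,
# and new arrivals are re-filtered against everything learned so far.
# Listing attributes and post_041 are illustrative, not environment data.

listings = {
    "post_023": {"quiet": False, "gym_nearby": False},
    "post_041": {"quiet": True,  "gym_nearby": False},
    "post_067": {"quiet": True,  "gym_nearby": True},
}

def filter_listings(constraints: dict) -> list:
    """Keep listings that satisfy every constraint learned so far."""
    return [pid for pid, attrs in listings.items()
            if all(attrs.get(k) == v for k, v in constraints.items())]

constraints = {}
print(filter_listings(constraints))   # no constraints yet: all three pass

constraints["quiet"] = True           # debrief 1: area was too noisy
print(filter_listings(constraints))   # ['post_041', 'post_067']

constraints["gym_nearby"] = True      # debrief 2: wants a nearby gym too
print(filter_listings(constraints))   # ['post_067']
```

The target set shrinks as feedback arrives, which is why a static one-shot ranking of the initial listings cannot solve this task.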
 This is important for model training because the target changes during the episode. The model must improve its plan as it gathers real feedback.

 ## Example: Negotiation Is A Sequential Skill