kushalExplores committed on
Commit
05f9611
·
verified ·
1 Parent(s): ca1ea5b

Upload 2 files

Files changed (2)
  1. README.md +222 -247
  2. flatmate_rl.md +268 -0
README.md CHANGED
@@ -1,7 +1,7 @@
 ---
-title: Flatmate RL
-emoji: 🏠
-colorFrom: yellow
 colorTo: green
 sdk: docker
 pinned: false
@@ -9,61 +9,103 @@ app_port: 8000
 base_path: /web
 tags:
 - openenv
-- flatmate
-- scheduling
 - reinforcement-learning
 ---

 # Flatmate RL

-An OpenEnv environment for training and evaluating agents on broker-style flatmate visit scheduling.
-
-This environment converts the `broker_app` visit-scheduling scenarios into a deterministic RL task where an agent must:
-
-- gather missing buyer or seller details through conversation
-- call the right environment tools in the right order
-- respect scheduling and confirmation guardrails
-- book valid visits only after the required checks and confirmations succeed
-
-It also includes a custom Gradio UI mounted at `/web`.
-
-## What This Environment Does
-
-`flatmate_rl` simulates a housing broker workflow around flatmate-share listings in Mumbai. The agent interacts with the environment one step at a time using either:
-
-- an `assistant_message` action to talk to the active user
-- a `tool_call` action to use broker tools such as profile storage, listing search, slot checks, poster contact, and booking
-
-The environment tracks:
-
-- which required fields have been gathered
-- which fields are still missing
-- which posts were selected
-- whether buyer or seller data was stored
-- whether tool-order rules were violated
-- whether visits were successfully booked
-
-The simulator is deterministic by design, so it is easier to use for RL training, regression testing, and reward iteration.
-
-## Included Scenarios
-
-The environment mirrors the main `broker_app` scenarios:
-
-- `task_visit_single`
-  One valid visit must be booked.
-- `task_visit_single_hidden_flex`
-  The buyer initially exposes only one slot, but hidden flexibility can be unlocked if the agent offers concrete alternatives.
-- `task_visit_multi`
-  At least two valid non-overlapping visits must be booked.
-- `task_visit_single_seller_followup`
-  The first buyer flow cannot book a visit; a seller follow-up then creates a new listing that can be matched and scheduled.
-
-Scenario declarations live in:
-
-- [server/scenario_factory.py](server/scenario_factory.py)
-- [server/scenarios.py](server/scenarios.py)
-
-## Action Format
-
 `FlatmateRlAction` supports two action types:

@@ -93,52 +135,11 @@ FlatmateRlAction(
 )
 ```

-## Observation Format
-
-Each reset or step returns a `FlatmateRlObservation` with fields such as:
-
-- `status`
-- `scenario_id`
-- `phase`
-- `conversation_history`
-- `last_tool_result`
-- `available_tools`
-- `gathered_fields`
-- `remaining_required_fields`
-- `selected_posts`
-- `booked_visits`
-- `violations`
-- `message`
-
-This gives an RL policy enough structured state to learn the broker workflow while still preserving the conversation transcript.
-
-## Tooling Model
-
-The broker-side tool space includes these buyer-phase tools:
-
-- `store_user_details`
-- `search_posts`
-- `close_buyer_conversation`
-- `match_location_preference`
-- `get_commute_time`
-- `check_calendar_slots`
-- `shortlist`
-- `contact_poster`
-- `book_viewing`
-
-The seller-follow-up phase adds:
-
-- `store_seller_details`
-- `check_table_slot_matches`
-- `confirm_seller_match`
-- `offer_matched_listing_to_buyer`
-- `schedule_table_visit`
-
-The environment enforces sequencing constraints. For example:
-
-- searching before `store_user_details` fails
-- seller follow-up tools cannot be used before `store_seller_details`
-- bookings fail if the required confirmations are missing

  ## Quick Start

@@ -170,29 +171,27 @@ obs = env.step(
 print(obs.last_tool_result)
 ```

-## Training an RL Agent
-
-The action space is mixed discrete-plus-structured:
-
-- choose whether to send a message or call a tool
-- if sending a message, generate natural language
-- if calling a tool, choose the tool and valid JSON arguments
-
-In practice, the easiest setup is usually:
-
-1. use an LLM or seq2seq policy that emits a structured action object
-2. compute reward from `done`, `violations`, `booked_visits`, and `last_tool_result`
-3. train with policy gradient, GRPO, PPO, or offline imitation plus RL fine-tuning
-
-### Example Training Loop
-
-The example below is a minimal policy-gradient-style skeleton. It is intentionally simple and shows how to interact with the environment; it is not a production trainer.

 ```python
-from __future__ import annotations
-
 import random
-from dataclasses import dataclass

 from flatmate_rl import FlatmateRlAction
 from flatmate_rl.server.flatmate_rl_environment import FlatmateRlEnvironment
@@ -205,180 +204,156 @@ SCENARIOS = [
 "task_visit_single_seller_followup",
 ]


-@dataclass
-class Transition:
-    observation_text: str
-    action: FlatmateRlAction
-    reward: float
-    done: bool
-
-
-def flatten_observation(obs) -> str:
-    return (
-        f"scenario={obs.scenario_id}\n"
-        f"phase={obs.phase}\n"
-        f"status={obs.status}\n"
-        f"remaining={obs.remaining_required_fields}\n"
-        f"available_tools={obs.available_tools}\n"
-        f"selected_posts={obs.selected_posts}\n"
-        f"booked_visits={obs.booked_visits}\n"
-        f"violations={obs.violations}\n"
-        f"message={obs.message}\n"
-        f"last_user_message={obs.last_user_message}\n"
-    )
-
-
-class DummyPolicy:
-    def act(self, obs) -> FlatmateRlAction:
-        remaining = set(obs.remaining_required_fields)
-
-        if "diet" in remaining or "visit_availability" in remaining:
-            return FlatmateRlAction(
-                action_type="assistant_message",
-                assistant_message="Please share your dietary preference and visit availability.",
-            )
-
-        if not obs.buyer_profile_stored and obs.phase == "buyer":
-            return FlatmateRlAction(
-                action_type="tool_call",
-                tool_name="store_user_details",
-                tool_arguments={},
-            )
-
-        if obs.phase == "buyer" and "search_posts" in obs.available_tools:
-            return FlatmateRlAction(
-                action_type="tool_call",
-                tool_name="search_posts",
-                tool_arguments={},
-            )
-
-        available_tools = obs.available_tools
-        fallback_tool = available_tools[0] if available_tools else "store_user_details"
-        return FlatmateRlAction(
-            action_type="tool_call",
-            tool_name=fallback_tool,
-            tool_arguments={},
-        )
-
-    def update(self, trajectory: list[Transition]) -> None:
-        # Replace with PPO / GRPO / REINFORCE / DPO / imitation loss, etc.
-        pass
-
-
-def compute_reward(obs) -> float:
-    reward = 0.0
-    reward += 5.0 * len(obs.booked_visits)
-    reward -= 1.0 * len(obs.violations)
-
-    last_tool = obs.last_tool_result or {}
-    if last_tool.get("success") is True:
-        reward += 0.2
-    if last_tool.get("success") is False:
-        reward -= 0.5
-
-    if obs.done:
-        reward += 10.0
-    return reward
-
-
-def train(num_episodes: int = 1000, max_steps: int = 20) -> None:
-    env = FlatmateRlEnvironment()
-    policy = DummyPolicy()
-
-    for episode_idx in range(num_episodes):
-        scenario_id = random.choice(SCENARIOS)
-        obs = env.reset(scenario_id=scenario_id)
-        trajectory: list[Transition] = []
-
-        for _ in range(max_steps):
-            action = policy.act(obs)
-            next_obs = env.step(action)
-            reward = compute_reward(next_obs)
-
-            trajectory.append(
-                Transition(
-                    observation_text=flatten_observation(obs),
-                    action=action,
-                    reward=reward,
-                    done=next_obs.done,
-                )
-            )
-
-            obs = next_obs
-            if obs.done:
-                break
-
-        policy.update(trajectory)
-
-        if episode_idx % 50 == 0:
-            print(
-                f"episode={episode_idx} "
-                f"scenario={scenario_id} "
-                f"done={obs.done} "
-                f"bookings={len(obs.booked_visits)} "
-                f"violations={len(obs.violations)}"
-            )
-
-
-if __name__ == "__main__":
-    train()
 ```

-### Recommended Training Strategy
-
-For serious training, a better progression is:
-
-1. Start with supervised trajectories for correct broker flows.
-2. Fine-tune with RL on sparse success reward plus shaping reward.
-3. Penalize `violations`, failed tool calls, missing storage steps, and invalid booking attempts.
-4. Reward correct information gathering, correct tool order, valid slot coordination, and successful booking completion.
-
-## Web UI
-
-The environment exposes a custom Gradio UI at `/web`.
-
-It includes:
-
-- scenario selector
-- transcript viewer
-- assistant-message controls
-- tool-call runner with JSON arguments
-- live gathered/remaining field panels
-- selected posts, booked visits, violations
-- request/response payload panes
-
-Run locally:

 ```bash
 cd flatmate_rl
-uv run --project . server
 ```

-Then open:

 ```text
 http://127.0.0.1:8000/web
 ```

-## Docker

-This repo includes a Dockerfile similar to `sudoku_rl`.

-It enables the web interface by default:

-```dockerfile
-ENV ENABLE_WEB_INTERFACE=true
 ```

-Build and run:

 ```bash
 cd flatmate_rl
-docker build -t flatmate_rl .
-docker run -p 8000:8000 flatmate_rl
 ```

 Then open:
@@ -439,6 +414,7 @@ flatmate_rl/
 │   ├── flatmate_rl_environment.py
 │   ├── gradio_ui.py
 │   ├── scenario_factory.py
 │   └── scenarios.py
 └── tests/
     └── test_flatmate_rl.py
@@ -448,4 +424,3 @@ flatmate_rl/

 - The environment is deterministic and designed for RL experimentation, not as a drop-in replacement for the original multi-LLM broker simulator.
 - The current Python 3.13 Anaconda runtime in this workspace can crash when importing parts of `openenv`; using the local Python 3.12 virtualenv is the safer path for testing here.
-# flatmate_rl
 
 ---
+title: Flatmate RL Broker Environment
+emoji: 🏘️
+colorFrom: indigo
 colorTo: green
 sdk: docker
 pinned: false
 base_path: /web
 tags:
 - openenv
 - reinforcement-learning
+- agents
+- tool-use
+- flatmate-search
+- housing
+- scheduling
+- fastapi
+- docker
 ---

 # Flatmate RL

+Flatmate RL is a deterministic OpenEnv reinforcement-learning environment for broker agents. It models flatmate-share search as a multi-step workflow in which the policy must gather details, inspect listings, check slots, coordinate buyer/seller confirmations, and schedule visits only when the guardrails are satisfied.
+
+![Flatmate RL app screenshot](screenshot.png)
+
+Read the full project writeup: [Flatmate RL: Training Agents for Real Flatmate Search](flatmate_rl.md).
+
+## Environment Flow
+
+```mermaid
+flowchart LR
+    O["Observation<br/>conversation, tools, fields,<br/>posts, visits, rewards"] --> P["Policy / RL Agent"]
+    P --> A{"Action"}
+    A --> M["assistant_message"]
+    A --> T["tool_call<br/>tool + JSON args"]
+    M --> E["Flatmate RL<br/>OpenEnv Environment"]
+    T --> E
+    E --> G["Guardrails<br/>tool order, arguments,<br/>slot conflicts, consent"]
+    G --> R["Reward + done<br/>next observation"]
+    R --> O
+```
+
+```mermaid
+flowchart TD
+    U["Buyer / seller request"] --> D["Gather required details"]
+    D --> S["Search and filter posts"]
+    S --> C["Check location, commute,<br/>calendar slots, conflicts"]
+    C --> K["Shortlist / negotiate / waitlist"]
+    K --> Q{"Buyer and poster confirmed?"}
+    Q -- "yes" --> B["Book visit or close deal"]
+    Q -- "no" --> F["Ask follow-up or call next tool"]
+    F --> D
+```
+
+## At A Glance
+
+| Area | Details |
+| --- | --- |
+| Runtime | OpenEnv environment served through FastAPI |
+| Domain | flatmate-share search and visit scheduling |
+| Policy output | `assistant_message` or structured `tool_call` |
+| Observation | transcript, phase, tools, fields, posts, bookings, violations, reward |
+| Reward signal | positive for workflow progress; penalties for invalid order, hallucinated tools, bad bookings |
+| UI | custom Gradio app at `/web` |
+| Deployment | local Docker or Hugging Face Docker Space |
+
+## Scenario Types
+
+| Scenario | What the agent must learn |
+| --- | --- |
+| `task_visit_single` | book one valid visit |
+| `task_visit_single_hidden_flex` | recover when the buyer reveals only one bad slot |
+| `task_visit_multi` | schedule multiple non-overlapping visits |
+| `task_visit_single_seller_followup` | switch from a failed buyer flow to seller follow-up |
+| `task_negotiation_hidden_budget` | discover buyer/seller price overlap |
+| `task_slot_cancellation_waitlist` | waitlist, react to a cancellation, then book |
+| `task_multi_visit_preference_evolution` | update preferences after visits and new listings |
+| `task_visit_conflict_check` | avoid pre-booked slots and propose only open times |
+
+Scenario declarations live in [server/scenarios.py](server/scenarios.py) and are built with helpers from [server/scenario_factory.py](server/scenario_factory.py).
+
+## Synthetic Data And No-Leakage Design
+
+```mermaid
+flowchart LR
+    F["scenario_factory.py<br/>synthetic profiles, posts,<br/>ground truth"] --> S["scenarios.py"]
+    Seed["seed"] --> V["scenario_variants.py<br/>safe value shifts"]
+    S --> V
+    V --> E["Episode"]
+    E --> Obs["Observation"]
+    Strict["STRICT_EVAL_MODE=true"] --> Obs
+```
+
+All scenarios are synthetic. Seeded variants use `random.Random(f"{task_id}:{seed}")` to vary safe surface values such as occupation, rent, budget, and opening messages while preserving the task id, post ids, required tools, feasible slots, required bookings, phase transitions, and the canonical success path.
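
As a hedged illustration of that seeding scheme (the helper and value pools below are hypothetical, not the actual `scenario_variants.py` API), a seeded variant generator can look like:

```python
import random

# Hypothetical sketch of seeded surface variation; the real logic lives in
# server/scenario_variants.py and may differ in names and fields.
OCCUPATIONS = ["media", "finance", "software", "design"]

def vary_scenario(task_id: str, seed: int, base_rent: int) -> dict:
    # The same (task_id, seed) pair always produces the same variant.
    rng = random.Random(f"{task_id}:{seed}")
    return {
        "task_id": task_id,  # structural fields are never varied
        "occupation": rng.choice(OCCUPATIONS),  # safe surface value
        "rent": base_rent + rng.randrange(-2000, 2001, 500),
    }

variant = vary_scenario("task_visit_single", 7, 22000)
assert variant == vary_scenario("task_visit_single", 7, 22000)  # deterministic
```

Because the RNG is keyed on `task_id` and `seed` together, the same seed yields different surface values across tasks while any single episode stays reproducible.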

+The environment should not contain real names, phone numbers, emails, addresses, scraped listings, or private housing records. If names or richer details are added later, generate them only inside [server/scenario_variants.py](server/scenario_variants.py) as synthetic seeded values.
+
+For stricter evaluation, set:
+
+```bash
+STRICT_EVAL_MODE=true
+```
+
+Strict eval mode hides direct scenario labels, difficulty, gathered/remaining fields, violations, tool traces, and rewards from the observation while still allowing sanitized tool results. Use it when you want to reduce prompt leakage during model evaluation.
+
+## Action, Observation, And Tools

 `FlatmateRlAction` supports two action types:

 )
 ```

+Each `reset` or `step` returns a `FlatmateRlObservation` with transcript state, the active phase, available tools, gathered and remaining fields, selected posts, booked visits, violations, `step_reward`, and `total_reward`.
+
+Main buyer tools include `store_user_details`, `search_posts`, `match_location_preference`, `get_commute_time`, `check_calendar_slots`, `shortlist`, `contact_poster`, and `book_viewing`. Scenario-specific tools add negotiation, waitlist, debrief, new-arrival filtering, and seller-follow-up workflows.
+
+Guardrails penalize searching before storing user details, using seller tools before storing seller details, booking before slot checks and confirmations, unknown tools, missing arguments, repeated successful calls, and non-canonical ordering.
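
A minimal sketch of how such ordering guardrails can be expressed (the function and state keys here are illustrative, not the environment's internals):

```python
# Illustrative tool-order guardrail; the environment's real checks are
# richer (argument validation, repeats, canonical ordering) but have
# this general shape: inspect state, return violation labels.
def tool_order_violations(tool_name: str, state: dict) -> list[str]:
    violations = []
    if tool_name == "search_posts" and not state.get("buyer_profile_stored"):
        violations.append("search_before_store_user_details")
    if tool_name == "confirm_seller_match" and not state.get("seller_details_stored"):
        violations.append("seller_tool_before_store_seller_details")
    if tool_name == "book_viewing" and not (
        state.get("slots_checked") and state.get("buyer_confirmed")
    ):
        violations.append("booking_before_checks_and_confirmations")
    return violations

print(tool_order_violations("search_posts", {}))
# ['search_before_store_user_details']
```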
  ## Quick Start

 print(obs.last_tool_result)
 ```

+## Training An RL Agent
+
+Use the environment as a reward source for an LLM or seq2seq policy that emits JSON actions.
+
+```mermaid
+flowchart LR
+    Reset["reset(scenario_id, seed)"] --> Prompt["serialize observation"]
+    Prompt --> Model["policy model"]
+    Model --> Parse["parse JSON action"]
+    Parse --> Step["env.step(action)"]
+    Step --> Reward["step_reward / total_reward"]
+    Reward --> Update["SFT, GRPO, PPO,<br/>REINFORCE, eval"]
+    Step --> Prompt
+```
+
+Recommended path: start with SFT/imitation on valid trajectories, then fine-tune with GRPO/PPO/REINFORCE on endpoint reward. Evaluate on held-out seeds with `STRICT_EVAL_MODE=true`.
+
+Minimal local loop:

 ```python
 import random

 from flatmate_rl import FlatmateRlAction
 from flatmate_rl.server.flatmate_rl_environment import FlatmateRlEnvironment

 "task_visit_single_seller_followup",
 ]

+env = FlatmateRlEnvironment()
+
+for episode_idx in range(100):
+    obs = env.reset(scenario_id=random.choice(SCENARIOS), seed=episode_idx)
+
+    while not obs.done:
+        prompt = obs.model_dump()
+        action_json = policy_generate_json(prompt)  # your model
+        action = FlatmateRlAction.model_validate(action_json)
+        obs = env.step(action)
+        update_policy(obs.step_reward, obs.total_reward, obs.done)
 ```

+When training against Docker or the Hugging Face Space, use `/ws`; a websocket session keeps one environment instance alive across `reset` and `step`.

+```python
+import asyncio
+import json
+
+import websockets
+
+
+async def rollout(ws_url: str) -> None:
+    async with websockets.connect(ws_url, open_timeout=120, ping_timeout=120) as ws:
+        await ws.send(json.dumps({"type": "reset", "data": {"scenario_id": "task_visit_single", "seed": 7}}))
+        reset_payload = json.loads(await ws.recv())
+
+        action = {
+            "action_type": "assistant_message",
+            "assistant_message": "Please share your dietary preference and visit availability.",
+        }
+        await ws.send(json.dumps({"type": "step", "data": action}))
+        step_payload = json.loads(await ws.recv())
+
+        print(reset_payload["observation"]["status"])
+        print(step_payload["reward"], step_payload["done"])
+        await ws.send(json.dumps({"type": "close"}))
+
+
+asyncio.run(rollout("ws://127.0.0.1:8000/ws"))
+# Hosted Space: wss://kushalexplores-flatmate-rl.hf.space/ws
+```
+
+## Running With Docker
+
+```mermaid
+flowchart LR
+    Repo["flatmate_rl repo"] --> Docker["Dockerfile<br/>OpenEnv base + uv sync"]
+    Docker --> Server["uvicorn server.app:app<br/>port 8000"]
+    Server --> UI["/web Gradio UI"]
+    Server --> WS["/ws training endpoint"]
+    Server --> Health["/health"]
+```
+
+Build and run locally:

 ```bash
 cd flatmate_rl
+docker build -t flatmate_rl .
+docker run --rm -p 8000:8000 flatmate_rl
 ```

+Open the UI:

 ```text
 http://127.0.0.1:8000/web
 ```

+Use the websocket endpoint for training:

+```text
+ws://127.0.0.1:8000/ws
+```
+
+The Dockerfile uses the OpenEnv base image, installs dependencies with `uv`, sets `ENABLE_WEB_INTERFACE=true`, exposes the app on port `8000`, and starts:
+
+```bash
+uvicorn server.app:app --host 0.0.0.0 --port 8000
+```
+## Hugging Face Space Deployment
+
+```mermaid
+flowchart LR
+    HF["Hugging Face Space<br/>kushalExplores/flatmate_rl"] --> Build["Docker build"]
+    Build --> App["FastAPI OpenEnv app"]
+    App --> Web["/web"]
+    App --> Train["wss://.../ws"]
+```
+
+The deployed Space is:
+
+```text
+https://huggingface.co/spaces/kushalExplores/flatmate_rl
+```
+
+The Space is configured as a Docker/FastAPI app:
+
+```yaml
+sdk: docker
+app_port: 8000
+base_path: /web
+```
+
+The OpenEnv deployment config is in [openenv.yaml](openenv.yaml):
+
+```yaml
+spec_version: 1
+name: flatmate_rl
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000
+```
+
+Programmatic training endpoint:

+```text
+wss://kushalexplores-flatmate-rl.hf.space/ws
+```
+
+For the browser UI, open:

+```text
+https://kushalexplores-flatmate-rl.hf.space/web
 ```

+If Hugging Face changes the direct app subdomain, open the Space page and use the app link shown there.
+
+The server is configured with `max_concurrent_envs=4`, so keep GRPO/PPO reward workers conservative at first, and increase rollout concurrency only after the endpoint is stable.
+
+## Web UI
+
+The environment exposes a custom Gradio UI at `/web`.
+
+It includes:
+
+- scenario selector
+- transcript viewer
+- assistant-message controls
+- tool-call runner with JSON arguments
+- live gathered/remaining field panels
+- selected posts, booked visits, and violations
+- request/response payload panes
+
+Run locally:

 ```bash
 cd flatmate_rl
+uv run --project . server
 ```

 Then open:
 
 │   ├── flatmate_rl_environment.py
 │   ├── gradio_ui.py
 │   ├── scenario_factory.py
+│   ├── scenario_variants.py
 │   └── scenarios.py
 └── tests/
     └── test_flatmate_rl.py

 - The environment is deterministic and designed for RL experimentation, not as a drop-in replacement for the original multi-LLM broker simulator.
 - The current Python 3.13 Anaconda runtime in this workspace can crash when importing parts of `openenv`; using the local Python 3.12 virtualenv is the safer path for testing here.
flatmate_rl.md ADDED
@@ -0,0 +1,268 @@
+# Flatmate RL: Training Agents For Real Flatmate Search
+
+Finding a flatmate sounds simple until you actually do it.
+
+You open a Facebook group, WhatsApp group, Telegram channel, or listing board. You see hundreds of posts. Some are repeated. Some are old. Some do not mention rent. Some do not mention diet, gender preference, deposit, move-in date, or whether visits are even possible.
+
+Then the real work begins.
+
+You message one person. They do not reply.
+
+You message another. The room is already taken.
+
+You find a good place, but the visit slot clashes with work.
+
+You visit one flat and realize the area is too noisy.
+
+You change your preferences, and now you have to start searching again.
+
+This is why flatmate search is not just a search problem. It is a coordination problem.
+
+## The Real Problem
+
+Most flatmate platforms and groups are built like feeds.
+
+They show posts, but they do not manage the search.
+
+```text
++----------------------+      +----------------------+
+| Flatmate group/feed  | ---> | User does all work   |
++----------------------+      +----------------------+
+| Repeated posts       |      | Filter listings      |
+| Missing details      |      | Message posters      |
+| Outdated listings    |      | Check availability   |
+| Unclear visit slots  |      | Remember preferences |
++----------------------+      +----------------------+
+```
+
+The interface gives you information, but the burden stays on you.
+
+You have to remember which post matched your budget, which poster replied, which room was vegetarian-only, which visit was possible on Saturday, and which place you rejected after seeing it.
+
+That is a lot of hidden work.
+
+## What A Broker Actually Does
+
+A good human broker does not just show listings.
+
+They manage the process.
+
+They ask for missing preferences. They filter bad leads. They check whether a flat is still available. They coordinate with owners or current flatmates. They schedule visits. They remember feedback after each visit.
+
+In other words, a broker turns a messy feed into a workflow.
+
+```text
++-------------+     +-------------+     +-------------+
+| Understand  | --> | Match       | --> | Coordinate  |
+| the buyer   |     | listings    |     | visits      |
++-------------+     +-------------+     +-------------+
+      ^                                       |
+      |                                       v
++-------------+                         +-------------+
+| Update      | <---------------------- | Learn from  |
+| preferences |                         | feedback    |
++-------------+                         +-------------+
+```
+
+Flatmate RL is built around this idea: train an agent to behave more like a reliable broker than a passive search box.
+
+## Feed Search vs Agent Search
+
+Here is the difference in concrete terms.
+
+| Situation | Feed-based search | Agent-led search |
+| --- | --- | --- |
+| Repeated listing | You see the same Andheri West room posted multiple times. | The agent treats duplicates as one lead and checks whether it is still active. |
+| Missing details | A post says "DM for rent" and gives no diet or move-in details. | The agent asks for the missing fields before considering it a serious option. |
+| Budget mismatch | You skip a Rs. 24,000 room because your budget is Rs. 20,000. | The agent can check whether negotiation is possible. |
+| Visit timing | You ask "Can I visit Saturday?" and wait for a reply. | The agent checks open slots and proposes only valid times. |
+| Preference change | After a visit, you realize you need a quieter area. | The agent updates your profile and changes future recommendations. |
+| New listing later | A good seller appears two days after you stopped checking. | The agent can match new supply back to your saved preferences. |
+
+That is the product gap Flatmate RL is trying to model.
+
+## What Flatmate RL Is
+
+Flatmate RL is an OpenEnv reinforcement-learning environment for training broker-style agents.
+
+The agent is placed inside a simulated flatmate-search workflow. It sees the current search state: preferences, listings, messages, calendar slots, and previous actions.
+
+At every step, it must choose one action:
+
+- send a message when more information or confirmation is needed
+- call a structured broker tool
+
+The goal is not just to recommend a post. The goal is to complete the workflow correctly.
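
Concretely, the two action shapes are small JSON payloads (field names follow the `FlatmateRlAction` examples in the README; the values here are illustrative):

```python
# The two payload shapes a policy must emit each step; values are examples.
message_action = {
    "action_type": "assistant_message",
    "assistant_message": "What is your diet preference, and when can you visit?",
}

tool_action = {
    "action_type": "tool_call",
    "tool_name": "search_posts",  # should be one of the currently available tools
    "tool_arguments": {},         # JSON arguments for the chosen tool
}
```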
+
+## The Environment In One Diagram
+
+```text
++-----------------------------+
+| Current housing situation   |
++-----------------------------+
+| Buyer preferences           |
+| Seller/listing details      |
+| Chat history                |
+| Calendar slots              |
++-------------+---------------+
+              |
+              v
++-----------------------------+
+| Broker agent decides        |
++-----------------------------+
+| Ask a question?             |
+| Search posts?               |
+| Check slots?                |
+| Contact poster?             |
+| Book visit?                 |
++-------------+---------------+
+              |
+              v
++-----------------------------+
+| Environment checks action   |
++-----------------------------+
+| Did it follow the rules?    |
+| Did it move the task ahead? |
+| Should it get reward?       |
++-------------+---------------+
+              |
+              v
++-----------------------------+
+| Updated state               |
++-----------------------------+
+| New facts are stored        |
+| Bad actions are penalized   |
+| Good progress is rewarded   |
++-----------------------------+
+```
+
+This loop teaches the model when to talk, when to use tools, what arguments to pass, and when it is safe to book a visit.
+
+## The Tools
+
+The agent has tools that mirror the work a broker would do.
+
+Buyer-side tools include:
+
+- `store_user_details`
+- `search_posts`
+- `match_location_preference`
+- `get_commute_time`
+- `check_calendar_slots`
+- `shortlist`
+- `contact_poster`
+- `book_viewing`
+
+Advanced tools handle seller follow-up, negotiation, waitlists, cancellations, and feedback after visits.
+
+The important part is sequencing.
+
+The agent cannot just book a visit because a listing looks good. It first needs enough buyer details, a matching listing, available calendar slots, buyer confirmation, and poster confirmation.
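
That precondition list reads as a single conjunction; a minimal sketch (the state keys are illustrative, not the environment's internals):

```python
# Sketch of the booking precondition above: every requirement must hold
# before book_viewing is a valid action. Keys are illustrative.
REQUIRED_FOR_BOOKING = (
    "buyer_details",
    "matched_listing",
    "open_slot",
    "buyer_confirmed",
    "poster_confirmed",
)

def can_book(state: dict) -> bool:
    return all(state.get(key) for key in REQUIRED_FOR_BOOKING)

state = {key: True for key in REQUIRED_FOR_BOOKING}
state["poster_confirmed"] = False
print(can_book(state))  # False: the poster has not confirmed yet
```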
+
+## Example: Booking One Visit
+
+Suppose a buyer says:
+
+> I want a room near Andheri West, budget around Rs. 22,000. I work in media.
+
+That is not enough to book a visit.
+
+The agent should first ask for the missing details, such as diet preference and visit availability.
+
+Then it can search posts, match the area, check commute, inspect calendar slots, contact the poster, and book only after both sides agree.
+
+```text
+Buyer request
+      |
+      v
+Ask missing details
+      |
+      v
+Search and match listings
+      |
+      v
+Check visit slots
+      |
+      v
+Confirm buyer + poster
+      |
+      v
+Book viewing
+```
+
+This is a small example, but it captures the main difference.
+
+A search engine can return a listing. A broker agent has to finish the job.
+
+## Example: Preferences Change After A Visit
+
+Flatmate search changes after real visits.
+
+A buyer may start by saying they want Andheri West. After visiting one flat, they may realize the building is too noisy. After another visit, they may decide they need a gym nearby.
+
+A normal feed does not remember that.
+
+Flatmate RL rewards an agent that can debrief the visit, update the buyer profile, and continue the search with the new information.
+
+```text
+Visit flat
+     |
+     v
+Buyer gives feedback
+     |
+     v
+Agent updates preferences
+     |
+     v
+New search is more specific
+```
+
+This matters because real users do not know every preference upfront. A useful agent must learn during the process.
+
+## Example: Negotiation
+
+Sometimes a listing looks too expensive, but the deal is still possible.
+
+Imagine a room listed at Rs. 24,000. The buyer says their budget is Rs. 20,000 but they can stretch to Rs. 22,000. The seller wants Rs. 24,000 but would accept Rs. 21,000.
+
+A static filter may reject the listing.
+
+A broker agent can check whether there is overlap and propose a price that both sides accept.
+
+That is another reason this is not just retrieval. The best outcome may require several careful steps.
+
+## Why Reinforcement Learning Fits
+
+Flatmate search has delayed outcomes.
+
+An early mistake can break the whole workflow. If the agent books before checking calendar slots, the booking is invalid. If it contacts a poster before collecting buyer details, the poster may not have enough information. If it ignores feedback after a visit, it keeps recommending the wrong places.
+
+Reinforcement learning is useful because the model can learn from the result of the whole sequence, not just from one message.
+
+Good behavior gets rewarded:
+
+- collecting missing details
+- using tools in the right order
+- avoiding unavailable slots
+- getting both sides to confirm
+- adapting after feedback
+
+Bad behavior gets penalized:
+
+- calling tools too early
+- repeating actions
+- booking without consent
+- ignoring calendar conflicts
+- getting stuck in loops
256
+ - getting stuck in loops
257
+
258
+ Over time, the overall loss should go down because the agent makes fewer workflow errors. In this environment, lower error means fewer invalid tool calls, fewer missed confirmations, fewer bad slot choices, and more completed bookings or deals.
259
+
260
+ ## What Success Looks Like
261
+
262
+ A strong Flatmate RL agent should feel less like a chatbot and more like an operational assistant.
263
+
264
+ It should remember what the buyer wants. It should search continuously. It should filter noisy supply. It should coordinate with sellers. It should handle calendar conflicts. It should negotiate when there is room. It should adapt after visits.
265
+
266
+ Most importantly, it should only book when the workflow is actually valid.
267
+
268
+ That is the real-world problem Flatmate RL tries to make trainable: turning messy flatmate search into a repeatable agent task with state, tools, feedback, and measurable progress.