hannan2859r committed
Commit a5ae22e · verified · 1 Parent(s): 168fef1

Update README.md

Files changed (1):
  1. README.md +69 -40
README.md CHANGED
@@ -9,66 +9,43 @@ pinned: true
 short_description: LLM-hard OpenEnv RL env for student focus management
 ---

-# FocusFlow RL Environment v2.0
-### Meta × Scaler OpenEnv Hackathon 2026

-> An LLM-hard RL environment where an AI agent manages a student's real cognitive world —
-> navigating natural language distractions, shifting deadlines, and multi-day energy dynamics.

-[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/your-colab-link)
-[![HuggingFace Space](https://img.shields.io/badge/🤗-HuggingFace%20Space-yellow)](https://huggingface.co/spaces/your-space)

 ---

-## Why This Environment Is LLM-Hard

-Unlike toy RL environments solvable by a simple rule-based policy, FocusFlow requires genuine LLM reasoning:

-| Challenge | Why It Needs an LLM |
-|---|---|
-| Natural language distraction events | Agent must read and interpret messages to judge urgency |
-| Mandatory `reasoning` field (graded) | Empty reasoning = reward penalty. LLMs must justify decisions |
-| Cognitive load dynamics | Overworking degrades future rewards — requires adaptive strategy |
-| Multi-day deadline tracking | Planning today affects energy and deadlines tomorrow |
-| Deferred events expire | Agent must track time-sensitive commitments across steps |
-| Urgency vs. deferability trade-off | "Mom called twice" ≠ "Friend wants to play BGMI" |

 ---

-## Environment Design

-### Action Space (8 actions)
-
-| Action | When to Use | Reward |
-|---|---|---|
-| `focus` | Stay on task | +0.05 × (1 − cognitive_load) |
-| `block_app` | Block a distracting app | +0.20 × temptation_level |
-| `take_break` | Rest at session boundary or when load > 0.75 | +0.20 to +0.30 |
-| `defer_event` | Postpone a low-urgency event | +0.15 if correct, −0.05 if wrong |
-| `respond_to_event` | Handle urgent events immediately | +0.20 if correct |
-| `plan_day` | Set a study schedule at day start | +0.00 to +0.30 based on quality |
-| `adjust_energy` | Recover from fatigue/environmental noise | +0.10 |
-| `check_app` | **(BAD)** Give in to distraction | −0.50 |
-
-### Reasoning Quality Reward (Universal)
-
-Every action carries a **reasoning bonus/penalty** (±0.10) based on:
-- Mentions of relevant concepts (urgency, priority, focus, deadlines)
-- Use of causal language ("because", "therefore", "in order to")
-- Whether the action matches the correct response for the active event
-
-### Observation Space

 ```json
 {
   "time_remaining_seconds": 1140,
   "current_phase": "focus",
   "sessions_completed": 1,
   "focus_score": 0.923,
-  "active_distractions": ["Instagram", "BGMI"],
-  "blocked_apps": ["YouTube", "Netflix"],
   "cognitive_load": 0.62,
   "deadline_pressure": 0.45,
   "pending_event": {
     "type": "social_message",
     "description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
@@ -87,3 +64,55 @@ short_description: LLM-hard OpenEnv RL env for student focus management
   "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
   "reasoning_quality_score": 0.82
 }
 short_description: LLM-hard OpenEnv RL env for student focus management
 ---

+# 🧠 FocusFlow: LLM-Hard RL Environment for Cognitive Management
+### Meta × Scaler OpenEnv Hackathon 2026 — Grand Finale Submission

+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abdulhannan-18/Focus_Flow_env/blob/main/training_colab.py)
+[![HuggingFace Space](https://img.shields.io/badge/🤗-HuggingFace%20Live%20API-yellow)](https://huggingface.co/spaces/hannan2859r/focusflow_env)
+[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

+> **Executive Summary:** FocusFlow is an OpenEnv-compliant reinforcement learning environment that simulates the cognitive friction of modern digital life. It abandons traditional spatial tasks (such as moving a robot arm) in favor of **LLM-hard cognitive tasks**: managing mental energy, tracking shifting deadlines, and using natural-language comprehension to separate informal social distractions from urgent professional tasks.
 ---

+## 🎯 Hackathon Theme Alignment
+
+**Core Themes Addressed:** Long-Horizon Planning & Instruction Following | World Modeling across Professional/Personal Tasks
+
+* **The Problem Statement:** Modern digital workspaces cause catastrophic context-switching, and traditional RL bots fail here because evaluating a distraction requires contextual language understanding. The task is to design an environment that forces an AI agent to manage time, mental energy, and dynamic deadlines while processing rich natural-language interruptions.
+* **The Environment:** A fully Dockerized, RESTful API environment. The world state dynamically models time progression, cognitive load (rising with work, decaying with breaks), and an event engine that injects multi-tiered distractions.
+* **Agent Capabilities Required:** Agents must possess reading comprehension (urgency evaluation), multi-day memory (tracking deferred events before they expire), and chain-of-thought (CoT) reasoning to justify scheduling decisions.
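The load dynamics described above (load rising with work, decaying with breaks) can be sketched in a few lines. The update rates `WORK_RATE` and `BREAK_DECAY` are illustrative assumptions for this sketch, not the environment's actual constants:

```python
# Illustrative sketch of the cognitive-load dynamics: load rises while
# working, decays during breaks, and stays clamped to [0, 1].
# WORK_RATE and BREAK_DECAY are assumed values, not the repo's constants.

WORK_RATE = 0.04    # load gained per focused work step (assumed)
BREAK_DECAY = 0.10  # load shed per break step (assumed)

def update_cognitive_load(load: float, action: str) -> float:
    """Raise load while working, decay it during breaks, clamp to [0, 1]."""
    if action == "focus":
        load += WORK_RATE
    elif action == "take_break":
        load -= BREAK_DECAY
    return min(1.0, max(0.0, load))

load = 0.0
for _ in range(10):                # ten consecutive work steps
    load = update_cognitive_load(load, "focus")
print(round(load, 2))              # 0.4
load = update_cognitive_load(load, "take_break")
print(round(load, 2))              # 0.3
```

Because `focus` rewards scale with `(1 − cognitive_load)`, an agent that never breaks sees its per-step return shrink, which is what makes a rule-free "always work" policy suboptimal.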
 
 
 
 
 
 ---

+## 🏗️ System Architecture & Observation Space

+The environment operates via a FastAPI backend, serving strictly typed JSON payloads. The observation space is designed to be highly complex, forcing the LLM to synthesize multiple data streams.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+### Example Observation Payload
 ```json
 {
   "time_remaining_seconds": 1140,
   "current_phase": "focus",
   "sessions_completed": 1,
   "focus_score": 0.923,
   "cognitive_load": 0.62,
   "deadline_pressure": 0.45,
+  "active_distractions": ["Instagram", "BGMI"],
+  "blocked_apps": ["YouTube"],
   "pending_event": {
     "type": "social_message",
     "description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
   ...
   "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
   "reasoning_quality_score": 0.82
 }
+```
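An agent consuming this payload has to map the fields to one of the eight actions. The sketch below shows one plausible mapping; the numeric `urgency` field and the decision thresholds are illustrative assumptions, not the environment's actual grading logic:

```python
import json

# Minimal agent-side policy consuming an observation payload like the one
# above. The "urgency" field and the thresholds are assumptions for this
# sketch, not the environment's published schema or rules.

observation = json.loads("""
{
  "current_phase": "focus",
  "cognitive_load": 0.62,
  "active_distractions": ["Instagram", "BGMI"],
  "blocked_apps": ["YouTube"],
  "pending_event": {"type": "social_message", "urgency": 0.2}
}
""")

def choose_action(obs: dict) -> str:
    event = obs.get("pending_event")
    if obs["cognitive_load"] > 0.75:           # break pays off above 0.75
        return "take_break"
    if event and event.get("urgency", 0.0) >= 0.5:
        return "respond_to_event"              # urgent events handled now
    if event:
        return "defer_event"                   # low-urgency: postpone it
    if obs["active_distractions"]:
        return "block_app"                     # shut down live temptations
    return "focus"

print(choose_action(observation))  # defer_event
```

A real LLM agent would of course read the free-text `description` rather than a precomputed urgency score; that comprehension step is exactly what makes the environment LLM-hard.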
+
+---
+
+## ⚖️ Dual-Layer Reward Model & Evaluation Logic
+
+FocusFlow implements a hybrid objective/subjective reward function.
+
+### 1. Objective Mechanical Rewards
+| Action | Environmental Trigger | Reward / Penalty |
+|---|---|---|
+| `focus` | Executed during work phase | `+0.05 × (1 − cognitive_load)` |
+| `block_app` | Targets an active high-temptation app | `+0.20 × temptation_level` |
+| `take_break` | Executed when `cognitive_load > 0.75` | `+0.20` to `+0.30` |
+| `defer_event` | Postpones a low-urgency social text | `+0.15` (correct) / `-0.05` (wrong) |
+| `respond_to_event` | Handles urgent/hard deadlines | `+0.20` (correct) / `-0.10` (wrong) |
+| `plan_day` | Sets a schedule aligned with deadlines | `+0.00` to `+0.30` (quality-scaled) |
+| `check_app` | **(BAD)** Agent gives in to temptation | **`-0.50` hard penalty** |
+
+### 2. Subjective Reasoning Grader
+To prevent random action-spamming, the `grade_reasoning()` heuristic parses the agent's mandatory reasoning field.
+* It applies a `±0.10` bonus/penalty based on the use of causal language, task awareness, and logical alignment with the current `pending_event`.
+* Empty or repetitive reasoning results in immediate reward degradation.
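The two layers combine additively. Here is a minimal sketch under the assumption of a simple keyword heuristic for the grader; the keyword lists and weights are illustrative, not the repository's actual `grade_reasoning()` implementation:

```python
# Hedged sketch of the dual-layer reward: an objective mechanical reward from
# the action table plus a ±0.10 reasoning bonus/penalty. The keyword lists and
# weights below are assumptions, not the repo's actual grader.

CAUSAL_MARKERS = ("because", "therefore", "in order to")
CONCEPTS = ("urgency", "priority", "focus", "deadline")

def grade_reasoning(reasoning: str) -> float:
    """Return a bonus in [-0.10, +0.10] from simple lexical checks."""
    if not reasoning.strip():
        return -0.10                       # empty reasoning penalized outright
    text = reasoning.lower()
    hits = sum(m in text for m in CAUSAL_MARKERS) + sum(c in text for c in CONCEPTS)
    return min(0.10, 0.05 * hits) if hits else -0.05

def mechanical_reward(action: str, obs: dict) -> float:
    if action == "focus":
        return 0.05 * (1 - obs["cognitive_load"])   # row 1 of the table
    if action == "check_app":
        return -0.50                                # hard penalty row
    return 0.0            # remaining actions omitted in this sketch

def total_reward(action: str, reasoning: str, obs: dict) -> float:
    return mechanical_reward(action, obs) + grade_reasoning(reasoning)

obs = {"cognitive_load": 0.62}
print(round(total_reward("focus", "Deferring chat because the deadline has priority.", obs), 3))  # 0.119
```

Note how the reasoning layer dominates the small per-step `focus` reward, which is what forces the agent to actually justify its decisions rather than spam actions.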
+
+---
+
+## 📋 Task Progressions
+
+| Task ID | Challenge Pillar | Success Criteria | Horizon |
+|---|---|---|---|
+| `task_1` | **Execution** | Complete a 25-min session with 0 app checks. Handle basic distractions logically. | 60 steps |
+| `task_2` | **Load Management** | Complete a multi-session day. Keep `cognitive_load < 0.85` via strategic breaks. | 120 steps |
+| `task_3` | **Long-Horizon** | Execute a 3-day plan, manage energy decay, and maintain a perfect focus streak. | 240 steps |
+
+---
+
+## 🚀 Post-Training & Self-Improvement Strategy (GRPO)
+
+A baseline LLM will struggle with FocusFlow's delayed rewards (e.g., deferring an event now to save energy for a deadline 50 steps later).
+
+To achieve an optimal policy, the project includes a **Group Relative Policy Optimization (GRPO)** pipeline:
+1. **Framework:** Uses `TRL` (Transformer Reinforcement Learning) and `Unsloth` for efficient 4-bit quantization on consumer hardware (T4 GPUs).
+2. **Data Generation:** The baseline agent explores the live FastAPI environment, collecting trajectories of observations, actions, and rewards.
+3. **Optimization:** GRPO updates the LLM weights directly from the environment's trajectory rewards, teaching the model that managing cognitive load and providing high-quality reasoning yields the highest cumulative return.
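The group-relative update in step 3 boils down to normalizing each sampled trajectory's reward against its own group's mean and standard deviation, so no learned value model is needed. A plain-Python sketch of that core (not TRL's actual implementation):

```python
from statistics import mean, pstdev

# Core of the group-relative advantage used by GRPO, sketched in plain
# Python. TRL's GRPOTrainer performs this normalization internally; this
# is an illustration, not its actual code.

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                 # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled rollouts for the same observation; above-average rollouts get
# positive advantages and are reinforced during the policy update.
rewards = [0.30, 0.10, -0.50, 0.30]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])  # [0.76, 0.15, -1.68, 0.76]
```

The rollout that gave in to a distraction (reward −0.50) receives a strongly negative advantage relative to its group, which is how the delayed-reward credit assignment is propagated into the weights.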
+
+---
+
+## 💻 Technical Setup & Quick Start
+
+### Local Installation
+```bash
+# Clone the repository