hannan2859r committed
Commit a5ae22e · verified · 1 Parent(s): 168fef1

Update README.md

Files changed (1):
  1. README.md +69 -40
README.md CHANGED
@@ -9,66 +9,43 @@ pinned: true
 short_description: LLM-hard OpenEnv RL env for student focus management
 ---

-# FocusFlow RL Environment v2.0
-### Meta × Scaler OpenEnv Hackathon 2026

-> An LLM-hard RL environment where an AI agent manages a student's real cognitive world —
-> navigating natural language distractions, shifting deadlines, and multi-day energy dynamics.

-[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/your-colab-link)
-[![HuggingFace Space](https://img.shields.io/badge/🤗-HuggingFace%20Space-yellow)](https://huggingface.co/spaces/your-space)

 ---

-## Why This Environment Is LLM-Hard

-Unlike toy RL environments solvable by a simple rule-based policy, FocusFlow requires genuine LLM reasoning:

-| Challenge | Why It Needs an LLM |
-|---|---|
-| Natural language distraction events | Agent must read and interpret messages to judge urgency |
-| Mandatory `reasoning` field (graded) | Empty reasoning = reward penalty. LLMs must justify decisions |
-| Cognitive load dynamics | Overworking degrades future rewards — requires adaptive strategy |
-| Multi-day deadline tracking | Planning today affects energy and deadlines tomorrow |
-| Deferred events expire | Agent must track time-sensitive commitments across steps |
-| Urgency vs. deferability trade-off | "Mom called twice" ≠ "Friend wants to play BGMI" |

 ---

-## Environment Design

-### Action Space (8 actions)
-
-| Action | When to Use | Reward |
-|---|---|---|
-| `focus` | Stay on task | +0.05 × (1 − cognitive_load) |
-| `block_app` | Block a distracting app | +0.20 × temptation_level |
-| `take_break` | Rest at session boundary or when load > 0.75 | +0.20 to +0.30 |
-| `defer_event` | Postpone a low-urgency event | +0.15 if correct, −0.05 if wrong |
-| `respond_to_event` | Handle urgent events immediately | +0.20 if correct |
-| `plan_day` | Set a study schedule at day start | +0.00 to +0.30 based on quality |
-| `adjust_energy` | Recover from fatigue/environmental noise | +0.10 |
-| `check_app` | **(BAD)** Give in to distraction | −0.50 |
-
-### Reasoning Quality Reward (Universal)
-
-Every action carries a **reasoning bonus/penalty** (±0.10) based on:
-- Mentions of relevant concepts (urgency, priority, focus, deadlines)
-- Use of causal language ("because", "therefore", "in order to")
-- Whether the action matches the correct response for the active event
-
-### Observation Space

 ```json
 {
   "time_remaining_seconds": 1140,
   "current_phase": "focus",
   "sessions_completed": 1,
   "focus_score": 0.923,
-  "active_distractions": ["Instagram", "BGMI"],
-  "blocked_apps": ["YouTube", "Netflix"],
   "cognitive_load": 0.62,
   "deadline_pressure": 0.45,
   "pending_event": {
     "type": "social_message",
     "description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
@@ -87,3 +64,55 @@ short_description: LLM-hard OpenEnv RL env for student focus management
   "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
   "reasoning_quality_score": 0.82
 }
 short_description: LLM-hard OpenEnv RL env for student focus management
 ---

+# 🧠 FocusFlow: LLM-Hard RL Environment for Cognitive Management
+### Meta × Scaler OpenEnv Hackathon 2026 — Grand Finale Submission

+[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/abdulhannan-18/Focus_Flow_env/blob/main/training_colab.py)
+[![HuggingFace Space](https://img.shields.io/badge/🤗-HuggingFace%20Live%20API-yellow)](https://huggingface.co/spaces/hannan2859r/focusflow_env)
+[![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3110/)
+[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

+> **Executive Summary:** FocusFlow is an OpenEnv-compliant reinforcement learning environment that simulates the cognitive friction of modern digital life. It abandons traditional spatial tasks (such as moving a robot arm) in favor of **LLM-hard cognitive tasks**: managing mental energy, tracking shifting deadlines, and using natural-language comprehension to separate informal social distractions from urgent professional tasks.
 ---

+## 🎯 Hackathon Theme Alignment
+
+**Core Themes Addressed:** Long-Horizon Planning & Instruction Following | World Modeling across Professional/Personal Tasks
+
+* **The Problem Statement:** Modern digital workspaces cause catastrophic context-switching, and traditional RL bots fail here because evaluating a distraction requires contextual language understanding. The task is to design an environment that forces an AI agent to manage time, mental energy, and dynamic deadlines while processing rich natural-language interruptions.
+* **The Environment:** A fully Dockerized, RESTful API environment. The world state dynamically models time progression, cognitive load (rising with work, decaying with breaks), and an event engine that injects multi-tiered distractions.
+* **Agent Capabilities Required:** Agents must possess reading comprehension (urgency evaluation), multi-day memory (tracking deferred events before they expire), and chain-of-thought (CoT) reasoning to justify scheduling decisions.
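The load dynamics described above (load rising with work, decaying with breaks) can be sketched in a few lines. The update rates `WORK_RATE` and `BREAK_DECAY` are illustrative assumptions for this sketch, not the environment's actual constants:

```python
# Illustrative sketch of the cognitive-load dynamics: load rises while
# working, decays during breaks, and stays clamped to [0, 1].
# WORK_RATE and BREAK_DECAY are assumed values, not the repo's constants.

WORK_RATE = 0.04    # load gained per focused work step (assumed)
BREAK_DECAY = 0.10  # load shed per break step (assumed)

def update_cognitive_load(load: float, action: str) -> float:
    """Raise load while working, decay it during breaks, clamp to [0, 1]."""
    if action == "focus":
        load += WORK_RATE
    elif action == "take_break":
        load -= BREAK_DECAY
    return min(1.0, max(0.0, load))

load = 0.0
for _ in range(10):                # ten consecutive work steps
    load = update_cognitive_load(load, "focus")
print(round(load, 2))              # 0.4
load = update_cognitive_load(load, "take_break")
print(round(load, 2))              # 0.3
```

Because `focus` rewards scale with `(1 − cognitive_load)`, an agent that never breaks sees its per-step return shrink, which is what makes a rule-free "always work" policy suboptimal.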
 
 
 
 
 
 ---

+## 🏗️ System Architecture & Observation Space

+The environment operates via a FastAPI backend, serving strictly typed JSON payloads. The observation space is designed to be highly complex, forcing the LLM to synthesize multiple data streams.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+### Example Observation Payload
 ```json
 {
   "time_remaining_seconds": 1140,
   "current_phase": "focus",
   "sessions_completed": 1,
   "focus_score": 0.923,
   "cognitive_load": 0.62,
   "deadline_pressure": 0.45,
+  "active_distractions": ["Instagram", "BGMI"],
+  "blocked_apps": ["YouTube"],
   "pending_event": {
     "type": "social_message",
     "description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
   ...
   "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
   "reasoning_quality_score": 0.82
 }
+```
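An agent consuming this payload has to map the fields to one of the eight actions. The sketch below shows one plausible mapping; the numeric `urgency` field and the decision thresholds are illustrative assumptions, not the environment's actual grading logic:

```python
import json

# Minimal agent-side policy consuming an observation payload like the one
# above. The "urgency" field and the thresholds are assumptions for this
# sketch, not the environment's published schema or rules.

observation = json.loads("""
{
  "current_phase": "focus",
  "cognitive_load": 0.62,
  "active_distractions": ["Instagram", "BGMI"],
  "blocked_apps": ["YouTube"],
  "pending_event": {"type": "social_message", "urgency": 0.2}
}
""")

def choose_action(obs: dict) -> str:
    event = obs.get("pending_event")
    if obs["cognitive_load"] > 0.75:           # break pays off above 0.75
        return "take_break"
    if event and event.get("urgency", 0.0) >= 0.5:
        return "respond_to_event"              # urgent events handled now
    if event:
        return "defer_event"                   # low-urgency: postpone it
    if obs["active_distractions"]:
        return "block_app"                     # shut down live temptations
    return "focus"

print(choose_action(observation))  # defer_event
```

A real LLM agent would of course read the free-text `description` rather than a precomputed urgency score; that comprehension step is exactly what makes the environment LLM-hard.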
+
+---
+
+## ⚖️ Dual-Layer Reward Model & Evaluation Logic
+
+FocusFlow implements a hybrid objective/subjective reward function.
+
+### 1. Objective Mechanical Rewards
+| Action | Environmental Trigger | Reward / Penalty |
+|---|---|---|
+| `focus` | Executed during work phase | `+0.05 × (1 − cognitive_load)` |
+| `block_app` | Targets an active high-temptation app | `+0.20 × temptation_level` |
+| `take_break` | Executed when `cognitive_load > 0.75` | `+0.20` to `+0.30` |
+| `defer_event` | Postpones a low-urgency social text | `+0.15` (correct) / `-0.05` (wrong) |
+| `respond_to_event` | Handles urgent/hard deadlines | `+0.20` (correct) / `-0.10` (wrong) |
+| `plan_day` | Sets a schedule aligned with deadlines | `+0.00` to `+0.30` (quality-scaled) |
+| `check_app` | **(BAD)** Agent gives in to temptation | **`-0.50` hard penalty** |
+
+### 2. Subjective Reasoning Grader
+To prevent random action-spamming, the `grade_reasoning()` heuristic parses the agent's mandatory reasoning field.
+* It applies a `±0.10` bonus/penalty based on the use of causal language, task awareness, and logical alignment with the current `pending_event`.
+* Empty or repetitive reasoning results in immediate reward degradation.
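The two layers combine additively. Here is a minimal sketch under the assumption of a simple keyword heuristic for the grader; the keyword lists and weights are illustrative, not the repository's actual `grade_reasoning()` implementation:

```python
# Hedged sketch of the dual-layer reward: an objective mechanical reward from
# the action table plus a ±0.10 reasoning bonus/penalty. The keyword lists and
# weights below are assumptions, not the repo's actual grader.

CAUSAL_MARKERS = ("because", "therefore", "in order to")
CONCEPTS = ("urgency", "priority", "focus", "deadline")

def grade_reasoning(reasoning: str) -> float:
    """Return a bonus in [-0.10, +0.10] from simple lexical checks."""
    if not reasoning.strip():
        return -0.10                       # empty reasoning penalized outright
    text = reasoning.lower()
    hits = sum(m in text for m in CAUSAL_MARKERS) + sum(c in text for c in CONCEPTS)
    return min(0.10, 0.05 * hits) if hits else -0.05

def mechanical_reward(action: str, obs: dict) -> float:
    if action == "focus":
        return 0.05 * (1 - obs["cognitive_load"])   # row 1 of the table
    if action == "check_app":
        return -0.50                                # hard penalty row
    return 0.0            # remaining actions omitted in this sketch

def total_reward(action: str, reasoning: str, obs: dict) -> float:
    return mechanical_reward(action, obs) + grade_reasoning(reasoning)

obs = {"cognitive_load": 0.62}
print(round(total_reward("focus", "Deferring chat because the deadline has priority.", obs), 3))  # 0.119
```

Note how the reasoning layer dominates the small per-step `focus` reward, which is what forces the agent to actually justify its decisions rather than spam actions.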
+
+---
+
+## 📋 Task Progressions
+
+| Task ID | Challenge Pillar | Success Criteria | Horizon |
+|---|---|---|---|
+| `task_1` | **Execution** | Complete a 25-min session with 0 app checks. Handle basic distractions logically. | 60 steps |
+| `task_2` | **Load Management** | Complete a multi-session day. Keep `cognitive_load < 0.85` via strategic breaks. | 120 steps |
+| `task_3` | **Long-Horizon** | Execute a 3-day plan, manage energy decay, and maintain a perfect focus streak. | 240 steps |
+
+---
+
+## 🚀 Post-Training & Self-Improvement Strategy (GRPO)
+
+A baseline LLM will struggle with FocusFlow's delayed rewards (e.g., deferring an event now to save energy for a deadline 50 steps later).
+
+To achieve an optimal policy, the project includes a **Group Relative Policy Optimization (GRPO)** pipeline:
+1. **Framework:** Uses `TRL` (Transformer Reinforcement Learning) and `Unsloth` for efficient 4-bit quantization on consumer hardware (T4 GPUs).
+2. **Data Generation:** The baseline agent explores the live FastAPI environment, collecting trajectories of observations, actions, and rewards.
+3. **Optimization:** GRPO updates the LLM weights directly from the environment's trajectory rewards, teaching the model that managing cognitive load and providing high-quality reasoning yields the highest cumulative return.
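The group-relative update in step 3 boils down to normalizing each sampled trajectory's reward against its own group's mean and standard deviation, so no learned value model is needed. A plain-Python sketch of that core (not TRL's actual implementation):

```python
from statistics import mean, pstdev

# Core of the group-relative advantage used by GRPO, sketched in plain
# Python. TRL's GRPOTrainer performs this normalization internally; this
# is an illustration, not its actual code.

def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against its group's mean and std deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)                 # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled rollouts for the same observation; above-average rollouts get
# positive advantages and are reinforced during the policy update.
rewards = [0.30, 0.10, -0.50, 0.30]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])  # [0.76, 0.15, -1.68, 0.76]
```

The rollout that gave in to a distraction (reward −0.50) receives a strongly negative advantage relative to its group, which is how the delayed-reward credit assignment is propagated into the weights.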
+
+---
+
+## 💻 Technical Setup & Quick Start
+
+### Local Installation
+```bash
+# Clone the repository