short_description: LLM-hard OpenEnv RL env for student focus management
---

# FocusFlow: LLM-Hard RL Environment for Cognitive Management

### Meta × Scaler OpenEnv Hackathon 2026 – Grand Finale Submission

[](https://colab.research.google.com/github/abdulhannan-18/Focus_Flow_env/blob/main/training_colab.py)
[](https://huggingface.co/spaces/hannan2859r/focusflow_env)
[](https://www.python.org/downloads/release/python-3110/)
[](https://opensource.org/licenses/MIT)

> **Executive Summary:** FocusFlow is an OpenEnv-compliant reinforcement learning environment that simulates the cognitive friction of modern digital life. It abandons traditional spatial tasks (such as moving a robot arm) in favor of **LLM-hard cognitive tasks**: managing mental energy, tracking shifting deadlines, and using natural-language comprehension to filter informal social distractions from urgent professional tasks.
---

## Hackathon Theme Alignment

**Core Themes Addressed:** Long-Horizon Planning & Instruction Following | World Modeling across Professional/Personal Tasks

* **The Problem Statement:** Modern digital workspaces cause catastrophic context-switching. Traditional RL bots fail here because evaluating a distraction requires contextual language understanding. The challenge is to design an environment that forces an AI agent to manage time, mental energy, and dynamic deadlines while processing rich natural-language interruptions.
* **The Environment:** A fully Dockerized, RESTful API environment. The world state dynamically models time progression, cognitive load (rising with work, decaying with breaks), and an event engine that injects multi-tiered distractions.
* **Agent Capabilities Required:** Agents must possess reading comprehension (urgency evaluation), multi-day memory (tracking deferred events before they expire), and chain-of-thought (CoT) reasoning to justify scheduling decisions.
---

## System Architecture & Observation Space

The environment operates via a FastAPI backend serving strictly typed JSON payloads. The observation space is designed to be highly complex, forcing the LLM to synthesize multiple data streams.
### Example Observation Payload

```json
{
  "time_remaining_seconds": 1140,
  "current_phase": "focus",
  "sessions_completed": 1,
  "focus_score": 0.923,
  "cognitive_load": 0.62,
  "deadline_pressure": 0.45,
  "active_distractions": ["Instagram", "BGMI"],
  "blocked_apps": ["YouTube"],
  "pending_event": {
    "type": "social_message",
    "description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'"
  },
  "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
  "reasoning_quality_score": 0.82
}
```
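To make the payload concrete, here is a minimal sketch of how an agent loop might map such an observation to an action plus the mandatory `reasoning` field. The `choose_action` helper and its thresholds are illustrative assumptions, not the environment's actual policy or API:

```python
def choose_action(obs: dict) -> dict:
    """Illustrative policy: map an observation payload to an action with reasoning."""
    event = obs.get("pending_event")
    if obs["cognitive_load"] > 0.75:
        action = "take_break"
        reasoning = "Cognitive load is above 0.75, so resting now protects future focus rewards."
    elif event and event["type"] == "social_message":
        action = "defer_event"
        reasoning = "A casual social message is low-urgency; deferring it preserves the focus session."
    elif obs["active_distractions"]:
        action = "block_app"
        reasoning = f"Blocking {obs['active_distractions'][0]} because its temptation threatens focus."
    else:
        action = "focus"
        reasoning = "No urgent event is pending, therefore continuing the session maximizes reward."
    return {"action": action, "reasoning": reasoning}

example_obs = {
    "cognitive_load": 0.62,
    "active_distractions": ["Instagram", "BGMI"],
    "pending_event": {"type": "social_message", "description": "..."},
}
print(choose_action(example_obs)["action"])  # defer_event
```

Note that every branch emits a justification: because the `reasoning` field is graded, even a hand-written baseline must explain itself.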
---

## Dual-Layer Reward Model & Evaluation Logic

FocusFlow implements a hybrid objective/subjective reward function.

### 1. Objective Mechanical Rewards

| Action | Environmental Trigger | Reward / Penalty |
|---|---|---|
| `focus` | Executed during a work phase | `+0.05 × (1 − cognitive_load)` |
| `block_app` | Targets an active high-temptation app | `+0.20 × temptation_level` |
| `take_break` | Executed when `cognitive_load > 0.75` | `+0.20` to `+0.30` |
| `defer_event` | Postpones a low-urgency social text | `+0.15` (correct) / `−0.05` (wrong) |
| `respond_to_event` | Handles urgent/hard deadlines | `+0.20` (correct) / `−0.10` (wrong) |
| `plan_day` | Sets a schedule aligned with deadlines | `+0.00` to `+0.30` (quality-scaled) |
| `check_app` | **(BAD)** Agent gives in to temptation | **`−0.50` hard penalty** |
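The mechanical layer of the table above can be sketched as a simple lookup. This is an illustrative reconstruction from the documented values, not the environment's actual source; the state keys (`temptation_level`, `event_is_low_urgency`, `event_is_urgent`) and the break-scaling rule are assumptions:

```python
def mechanical_reward(action: str, state: dict) -> float:
    """Illustrative reward lookup mirroring the README table (not the env's code)."""
    if action == "focus":
        return 0.05 * (1 - state["cognitive_load"])
    if action == "block_app":
        return 0.20 * state["temptation_level"]
    if action == "take_break":
        # Assumed: scaled within the +0.20..+0.30 band by current overload.
        return 0.20 + 0.10 * min(1.0, state["cognitive_load"])
    if action == "defer_event":
        return 0.15 if state["event_is_low_urgency"] else -0.05
    if action == "respond_to_event":
        return 0.20 if state["event_is_urgent"] else -0.10
    if action == "check_app":
        return -0.50  # hard penalty for giving in to distraction
    return 0.0

print(round(mechanical_reward("focus", {"cognitive_load": 0.62}), 3))  # 0.019
```

At the example observation's load of 0.62, focusing earns only `0.05 × 0.38 ≈ 0.019` per step, which is why a well-timed break (worth up to `+0.30`) can dominate greedily continuing to work.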
### 2. Subjective Reasoning Grader

To prevent random action spamming, the `grade_reasoning()` heuristic parses the agent's mandatory `reasoning` field.

* It applies a `±0.10` bonus/penalty based on the use of causal language, task awareness, and logical alignment with the current `pending_event`.
* Empty or repetitive reasoning results in immediate reward degradation.
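A toy version of such a heuristic is sketched below. The actual `grade_reasoning()` implementation is not shown in this README; the marker lists and scoring weights here are assumptions chosen only to illustrate the bonus/penalty mechanism:

```python
CAUSAL_MARKERS = ("because", "therefore", "in order to", "so that")
TASK_CONCEPTS = ("urgency", "priority", "focus", "deadline", "load")

def grade_reasoning(reasoning: str) -> float:
    """Toy heuristic returning a reasoning bonus/penalty in [-0.10, +0.10]."""
    text = reasoning.lower().strip()
    if not text:
        return -0.10  # empty reasoning is penalized outright
    score = 0.0
    if any(marker in text for marker in CAUSAL_MARKERS):
        score += 0.05  # reward explicit causal language
    if any(concept in text for concept in TASK_CONCEPTS):
        score += 0.05  # reward task-relevant vocabulary
    return score if score > 0 else -0.10

print(grade_reasoning("Deferring because the deadline has higher priority."))  # 0.1
```

Because the bonus is additive on top of the mechanical reward, an agent that spams plausible actions with empty justifications steadily bleeds reward relative to one that explains itself.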
---

## Task Progressions

| Task ID | Challenge Pillar | Success Criteria | Horizon |
|---|---|---|---|
| `task_1` | **Execution** | Complete a 25-minute session with zero app checks. Handle basic distractions logically. | 60 steps |
| `task_2` | **Load Management** | Complete a multi-session day. Keep `cognitive_load < 0.85` via strategic breaks. | 120 steps |
| `task_3` | **Long-Horizon** | Execute a 3-day plan, manage energy decay, and maintain a perfect focus streak. | 240 steps |

---
## Post-Training & Self-Improvement Strategy (GRPO)

A baseline LLM will struggle with FocusFlow's delayed rewards (e.g., deferring an event now to save energy for a deadline 50 steps later).

To reach an optimal policy, the project includes a **Group Relative Policy Optimization (GRPO)** pipeline:

1. **Framework:** Uses `TRL` (Transformer Reinforcement Learning) and `Unsloth` for efficient 4-bit quantization on consumer hardware (T4 GPUs).
2. **Data Generation:** The baseline agent explores the live FastAPI environment, collecting trajectories of observations, actions, and rewards.
3. **Optimization:** GRPO updates the LLM weights directly from the environment's trajectory rewards, teaching the model that managing cognitive load and providing high-quality reasoning yields the highest cumulative return.
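The core GRPO idea, scoring each sampled trajectory relative to its own group rather than against a learned value baseline, can be sketched in a few lines. This is a conceptual illustration of the advantage computation, not the TRL implementation:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each rollout's reward by its group's mean/std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero on uniform groups
    return [(r - mean) / std for r in rewards]

# Four sampled rollouts for the same observation: the high-reward trajectory
# (e.g., one that took a well-timed break) gets a positive advantage, and the
# low-reward one (e.g., check_app spam) gets a negative advantage.
print(group_relative_advantages([0.8, 0.2, 0.5, 0.5]))
```

Trajectories with above-group-average cumulative reward push the policy's probabilities up; below-average ones push them down, which is what lets delayed payoffs like a deferred event eventually shape behavior.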
---
## Technical Setup & Quick Start

### Local Installation

```bash
# Clone the repository
git clone https://github.com/abdulhannan-18/Focus_Flow_env.git
```
|