---
title: FocusFlow RL Environment
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
short_description: LLM-hard OpenEnv RL env for student focus management
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/68f093da561f15826cc8ad59/y40SmMZCx-xgI4v4wH3pS.png
---
# 🧠 FocusFlow: LLM-Hard RL Environment for Cognitive Management
### Meta × Scaler OpenEnv Hackathon 2026 – Grand Finale Submission
[Hugging Face Space](https://huggingface.co/spaces/hannan2859r/focusflow_env) · [Python 3.11](https://www.python.org/downloads/release/python-3110/) · [License: MIT](https://opensource.org/licenses/MIT)
**Links:**
- Google Colab: https://colab.research.google.com/drive/16wJ4mw6sdcTuOYABpdoV2AuO6_KYnc4Q?usp=sharing
- GitHub: https://github.com/abdulhannan-18/Focus_Flow_env
> **Executive Summary:** FocusFlow is an OpenEnv-compliant reinforcement learning environment that simulates the cognitive friction of modern digital life. It abandons traditional spatial tasks (like moving a robot arm) in favor of **LLM-hard cognitive tasks**: managing mental energy, tracking shifting deadlines, and utilizing natural language comprehension to filter informal social distractions from urgent professional tasks.
---
## 🎯 Hackathon Theme Alignment
**Core Themes Addressed:** Long-Horizon Planning & Instruction Following | World Modeling across Professional/Personal Tasks
* **The Problem Statement:** Modern digital workspaces cause catastrophic context-switching. Traditional RL bots fail here because evaluating a distraction requires contextual language understanding. The problem is designing an environment that forces an AI agent to manage time, mental energy, and dynamic deadlines while processing rich natural-language interruptions.
* **The Environment:** A fully Dockerized, RESTful API environment. The world state dynamically models time progression, cognitive load (rising with work, decaying with breaks), and an event engine that injects multi-tiered distractions.
* **Agent Capabilities Required:** Agents must possess reading comprehension (urgency evaluation), multi-day memory (tracking deferred events before they expire), and Chain-of-Thought (CoT) reasoning to justify scheduling decisions.
---
## 🏗️ System Architecture & Observation Space
The environment operates via a FastAPI backend, serving strictly typed JSON payloads. The observation space is designed to be highly complex, forcing the LLM to synthesize multiple data streams.
### Example Observation Payload
```json
{
"time_remaining_seconds": 1140,
"current_phase": "focus",
"sessions_completed": 1,
"focus_score": 0.923,
"cognitive_load": 0.62,
"deadline_pressure": 0.45,
"active_distractions": ["Instagram", "BGMI"],
"blocked_apps": ["YouTube"],
"pending_event": {
"type": "social_message",
"description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
"urgency": 0.30,
"can_defer": true,
"deadline_steps": 8,
"correct_action": "defer_event"
},
"day_context": {
"day_number": 1,
"energy_level": 0.84,
"pending_deadlines": [
{"task": "Math Assignment", "due_step": 45, "completed": false}
]
},
"last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
"reasoning_quality_score": 0.82
}
```
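The payload above can be consumed client-side as a typed structure. The sketch below is an illustrative parser, not part of the environment's published client: the dataclass names are assumptions, and only a subset of the observation fields is modeled.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical client-side model of the observation payload shown above;
# field names mirror the example JSON, but the real schema may differ.

@dataclass
class PendingEvent:
    type: str
    description: str
    urgency: float
    can_defer: bool
    deadline_steps: int
    correct_action: str

@dataclass
class Observation:
    time_remaining_seconds: int
    current_phase: str
    cognitive_load: float
    focus_score: float
    pending_event: Optional[PendingEvent]

def parse_observation(raw: str) -> Observation:
    """Parse a raw JSON payload into a typed Observation."""
    data = json.loads(raw)
    event = data.get("pending_event")
    return Observation(
        time_remaining_seconds=data["time_remaining_seconds"],
        current_phase=data["current_phase"],
        cognitive_load=data["cognitive_load"],
        focus_score=data["focus_score"],
        pending_event=PendingEvent(**event) if event else None,
    )
```

Typed parsing like this lets an agent harness fail fast on malformed payloads instead of silently feeding bad state into the policy.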
---
## ⚖️ Dual-Layer Reward Model & Evaluation Logic
FocusFlow implements a hybrid objective/subjective reward function.
### 1. Objective Mechanical Rewards
| Action | Environmental Trigger | Reward / Penalty |
|---|---|---|
| `focus` | Executed during work phase | `+0.05 × (1 − cognitive_load)` |
| `block_app` | Targets an active high-temptation app | `+0.20 × temptation_level` |
| `take_break` | Executed when `cognitive_load > 0.75` | `+0.20` to `+0.30` |
| `defer_event` | Postpones a low-urgency social text | `+0.15` (Correct) / `-0.05` (Wrong) |
| `respond_to_event` | Handles urgent/hard deadlines | `+0.20` (Correct) / `-0.10` (Wrong) |
| `plan_day` | Sets schedule aligning with deadlines | `+0.00` to `+0.30` (Quality scaled) |
| `check_app` | **(BAD)** Agent gives in to temptation | **`-0.50` Hard Penalty** |
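The table above can be sketched as a single dispatch function. This is a simplified illustration, not the environment's actual code: state keys such as `phase`, `temptation_level`, and `plan_quality` are assumed names, and the break-reward scaling formula is an invented example of how a `+0.20` to `+0.30` range might be implemented.

```python
# Sketch of the objective reward rules from the table above.
# Thresholds follow the README; the real environment may differ in detail.

def objective_reward(action: str, state: dict) -> float:
    if action == "focus" and state.get("phase") == "work":
        return 0.05 * (1.0 - state["cognitive_load"])
    if action == "block_app":
        return 0.20 * state.get("temptation_level", 0.0)
    if action == "take_break" and state["cognitive_load"] > 0.75:
        # Illustrative scaling from +0.20 toward +0.30 as load approaches 1.0
        return 0.20 + 0.10 * min(1.0, (state["cognitive_load"] - 0.75) / 0.25)
    if action == "defer_event":
        return 0.15 if state.get("correct_action") == "defer_event" else -0.05
    if action == "respond_to_event":
        return 0.20 if state.get("correct_action") == "respond_to_event" else -0.10
    if action == "plan_day":
        return 0.30 * state.get("plan_quality", 0.0)  # quality-scaled, 0.0 to 0.30
    if action == "check_app":
        return -0.50  # hard penalty: agent gave in to temptation
    return 0.0
```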
### 2. Subjective Reasoning Grader
To prevent random action-spamming, the `grade_reasoning()` heuristic parses the agent's mandatory reasoning field.
* It applies a `±0.10` multiplier based on the use of causal language, task-awareness, and logical alignment with the current `pending_event`.
* Empty or repetitive reasoning results in immediate reward degradation.
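A grader in this style might look like the sketch below. The keyword lists and weights are illustrative assumptions only; the actual `grade_reasoning()` heuristic is defined in the environment code.

```python
# Minimal sketch of a grade_reasoning()-style heuristic returning a
# modifier in [-0.10, +0.10]. Marker lists and weights are assumptions.

CAUSAL_MARKERS = ("because", "so that", "in order to", "since")
TASK_MARKERS = ("deadline", "exam", "assignment", "cognitive load", "focus")

def grade_reasoning(reasoning: str, last_reasoning: str = "") -> float:
    text = reasoning.strip().lower()
    if not text or text == last_reasoning.strip().lower():
        return -0.10  # empty or repeated reasoning degrades reward
    score = 0.0
    if any(m in text for m in CAUSAL_MARKERS):
        score += 0.05  # causal language
    if any(m in text for m in TASK_MARKERS):
        score += 0.05  # task-awareness
    return score
```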
---
## 📈 Task Progressions
| Task ID | Challenge Pillar | Success Criteria | Horizon |
|---|---|---|---|
| `task_1` | **Execution** | Complete a 25-min session with 0 app checks. Handle basic distractions logically. | 60 Steps |
| `task_2` | **Load Management** | Complete a multi-session day. Keep `cognitive_load < 0.85` via strategic breaks. | 120 Steps |
| `task_3` | **Long-Horizon** | Execute a 3-day plan, manage energy decay, and maintain a perfect focus streak. | 240 Steps |
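The progression above could be encoded as a simple configuration mapping. The key names below are illustrative assumptions; the environment's real task definitions may be structured differently.

```python
# Illustrative task configuration mirroring the progression table.
TASKS = {
    "task_1": {"pillar": "execution", "horizon_steps": 60,
               "success": "finish one 25-min session with zero app checks"},
    "task_2": {"pillar": "load_management", "horizon_steps": 120,
               "success": "keep cognitive_load < 0.85 across a multi-session day"},
    "task_3": {"pillar": "long_horizon", "horizon_steps": 240,
               "success": "execute a 3-day plan with a perfect focus streak"},
}
```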
---
## 🚀 Post-Training & Self-Improvement Strategy (GRPO)
A baseline LLM will struggle with FocusFlow's delayed rewards (e.g., deferring an event now to save energy for a deadline 50 steps later).
To achieve an optimal policy, the project includes a **Group Relative Policy Optimization (GRPO)** pipeline:
1. **Framework:** Uses `TRL` (Transformer Reinforcement Learning) and `Unsloth` for efficient 4-bit quantization on consumer hardware (T4 GPUs).
2. **Data Generation:** The baseline agent explores the live FastAPI environment, collecting trajectories of observations, actions, and rewards.
3. **Optimization:** GRPO updates the LLM weights directly based on the environment's trajectory rewards, teaching the model that maintaining cognitive load and providing high-quality reasoning yields the highest cumulative return.
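The core of GRPO's update signal is group-relative advantage: for a group of rollouts from the same prompt, each trajectory's reward is normalized against the group's mean and standard deviation. The sketch below shows just that normalization step in isolation; in practice, TRL's trainer computes this internally.

```python
import statistics

# Group-relative advantage: normalize each rollout's reward against the
# mean and population std of its group (a simplified GRPO building block).

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed per group rather than from a learned value function, GRPO needs no separate critic, which is what makes it practical on the consumer-grade hardware mentioned above.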
---
## 💻 Technical Setup & Quick Start
### Local Installation
```bash
# Clone the repository
git clone https://github.com/abdulhannan-18/Focus_Flow_env