---
title: FocusFlow RL Environment
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
short_description: LLM-hard OpenEnv RL env for student focus management
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68f093da561f15826cc8ad59/y40SmMZCx-xgI4v4wH3pS.png
---
# 🧠 FocusFlow: LLM-Hard RL Environment for Cognitive Management

### Meta × Scaler OpenEnv Hackathon 2026 – Grand Finale Submission

[Hugging Face Space](https://huggingface.co/spaces/hannan2859r/focusflow_env)
[Python 3.11](https://www.python.org/downloads/release/python-3110/)
[License: MIT](https://opensource.org/licenses/MIT)

**Links:**
* Google Colab: https://colab.research.google.com/drive/16wJ4mw6sdcTuOYABpdoV2AuO6_KYnc4Q?usp=sharing
* GitHub: https://github.com/abdulhannan-18/Focus_Flow_env
> **Executive Summary:** FocusFlow is an OpenEnv-compliant reinforcement learning environment that simulates the cognitive friction of modern digital life. It abandons traditional spatial tasks (like moving a robot arm) in favor of **LLM-hard cognitive tasks**: managing mental energy, tracking shifting deadlines, and using natural-language comprehension to separate informal social distractions from urgent professional tasks.

---
| ## π― Hackathon Theme Alignment | |
| **Core Themes Addressed:** Long-Horizon Planning & Instruction Following | World Modeling across Professional/Personal Tasks | |
| * **The Problem Statement:** Modern digital workspaces cause catastrophic context-switching. Traditional RL bots fail here because evaluating a distraction requires contextual language understanding. The problem is designing an environment that forces an AI agent to manage time, mental energy, and dynamic deadlines while processing rich natural-language interruptions. | |
| * **The Environment:** A fully Dockerized, RESTful API environment. The world state dynamically models time progression, cognitive load (rising with work, decaying with breaks), and an event engine that injects multi-tiered distractions. | |
| * **Agent Capabilities Required:** Agents must possess reading comprehension (urgency evaluation), multi-day memory (tracking deferred events before they expire), and Chain-of-Thought (CoT) reasoning to justify scheduling decisions. | |
| --- | |
| ## ποΈ System Architecture & Observation Space | |
| The environment operates via a FastAPI backend, serving strictly typed JSON payloads. The observation space is designed to be highly complex, forcing the LLM to synthesize multiple data streams. | |
| ### Example Observation Payload | |
```json
{
  "time_remaining_seconds": 1140,
  "current_phase": "focus",
  "sessions_completed": 1,
  "focus_score": 0.923,
  "cognitive_load": 0.62,
  "deadline_pressure": 0.45,
  "active_distractions": ["Instagram", "BGMI"],
  "blocked_apps": ["YouTube"],
  "pending_event": {
    "type": "social_message",
    "description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
    "urgency": 0.30,
    "can_defer": true,
    "deadline_steps": 8,
    "correct_action": "defer_event"
  },
  "day_context": {
    "day_number": 1,
    "energy_level": 0.84,
    "pending_deadlines": [
      {"task": "Math Assignment", "due_step": 45, "completed": false}
    ]
  },
  "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
  "reasoning_quality_score": 0.82
}
```
| --- | |
| ## βοΈ Dual-Layer Reward Model & Evaluation Logic | |
| FocusFlow implements a hybrid objective/subjective reward function. | |
| ### 1. Objective Mechanical Rewards | |
| | Action | Environmental Trigger | Reward / Penalty | | |
| |---|---|---| | |
| | `focus` | Executed during work phase | `+0.05 Γ (1 β cognitive_load)` | | |
| | `block_app` | Targets an active high-temptation app | `+0.20 Γ temptation_level` | | |
| | `take_break` | Executed when `cognitive_load > 0.75` | `+0.20` to `+0.30` | | |
| | `defer_event` | Postpones a low-urgency social text | `+0.15` (Correct) / `-0.05` (Wrong) | | |
| | `respond_to_event` | Handles urgent/hard deadlines | `+0.20` (Correct) / `-0.10` (Wrong) | | |
| | `plan_day` | Sets schedule aligning with deadlines | `+0.00` to `+0.30` (Quality scaled) | | |
| | `check_app` | **(BAD)** Agent gives in to temptation | **`-0.50` Hard Penalty** | | |
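Read as code, the table translates into something like the following sketch. The state fields mirror the observation payload above; this is an illustration, not the environment's actual source.

```python
# Illustrative re-implementation of the mechanical reward table; a sketch,
# not the shipped code. `temptation_level` and `plan_quality` are assumed
# internal state fields.
def mechanical_reward(action: str, state: dict) -> float:
    event = state.get("pending_event") or {}
    correct = event.get("correct_action") == action

    if action == "focus" and state["current_phase"] == "focus":
        return 0.05 * (1.0 - state["cognitive_load"])
    if action == "block_app":
        return 0.20 * state.get("temptation_level", 0.0)
    if action == "take_break" and state["cognitive_load"] > 0.75:
        # Scales from +0.20 to +0.30 as overload approaches the maximum.
        overload = (state["cognitive_load"] - 0.75) / 0.25
        return 0.20 + 0.10 * min(1.0, overload)
    if action == "defer_event":
        return 0.15 if correct else -0.05
    if action == "respond_to_event":
        return 0.20 if correct else -0.10
    if action == "plan_day":
        # Quality-scaled between +0.00 and +0.30.
        return 0.30 * state.get("plan_quality", 0.0)
    if action == "check_app":
        return -0.50  # hard penalty: the agent gave in to temptation
    return 0.0
```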
| ### 2. Subjective Reasoning Grader | |
| To prevent random action-spamming, the `grade_reasoning()` heuristic parses the agent's mandatory reasoning field. | |
| * It applies a `Β±0.10` multiplier based on the use of causal language, task-awareness, and logical alignment with the current `pending_event`. | |
| * Empty or repetitive reasoning results in immediate reward degradation. | |
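The keyword lists and weights below are illustrative assumptions, not the shipped heuristic; they only show the shape of the check described above.

```python
# Sketch of a reasoning grader; markers and weights are assumptions.
CAUSAL_MARKERS = ("because", "so that", "in order to", "therefore")

def grade_reasoning(reasoning: str, state: dict, last_reasoning: str = "") -> float:
    """Return an adjustment in [-0.10, +0.10] applied on top of the base reward."""
    text = reasoning.strip().lower()
    if not text or text == last_reasoning.strip().lower():
        return -0.10  # empty or repeated reasoning degrades the reward immediately

    score = 0.0
    if any(marker in text for marker in CAUSAL_MARKERS):
        score += 0.04  # causal language
    deadlines = state.get("day_context", {}).get("pending_deadlines", [])
    if any(d["task"].lower() in text for d in deadlines):
        score += 0.03  # task-awareness: mentions a concrete pending deadline
    event = state.get("pending_event")
    if event and event["type"].replace("_", " ") in text:
        score += 0.03  # logical alignment with the current pending_event
    return score
```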
| --- | |
| ## π Task Progressions | |
| | Task ID | Challenge Pillar | Success Criteria | Horizon | | |
| |---|---|---|---| | |
| | `task_1` | **Execution** | Complete a 25-min session with 0 app checks. Handle basic distractions logically. | 60 Steps | | |
| | `task_2` | **Load Management** | Complete a multi-session day. Keep `cognitive_load < 0.85` via strategic breaks. | 120 Steps | | |
| | `task_3` | **Long-Horizon** | Execute a 3-day plan, manage energy decay, and maintain a perfect focus streak. | 240 Steps | | |
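Selecting a rung of the ladder presumably happens at reset time. The `task_id` field below is a plausible shape for that call, not a documented schema; check the Space's interactive `/docs` page for the real one.

```python
# Hypothetical task selection at reset; "task_id" is an assumed field name.
import requests

obs = requests.post(
    "http://localhost:7860/reset",
    json={"task_id": "task_3"},  # the 3-day, 240-step long-horizon variant
).json()
print(obs["day_context"]["day_number"])  # multi-day state starts at day 1
```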
| --- | |
| ## π Post-Training & Self-Improvement Strategy (GRPO) | |
| A baseline LLM will struggle with FocusFlow's delayed rewards (e.g., deferring an event now to save energy for a deadline 50 steps later). | |
| To achieve an optimal policy, the project includes a **Group Relative Policy Optimization (GRPO)** pipeline: | |
| 1. **Framework:** Uses `TRL` (Transformer Reinforcement Learning) and `Unsloth` for efficient 4-bit quantization on consumer hardware (T4 GPUs). | |
| 2. **Data Generation:** The baseline agent explores the live FastAPI environment, collecting trajectories of observations, actions, and rewards. | |
| 3. **Optimization:** GRPO updates the LLM weights directly based on the environment's trajectory rewards, teaching the model that maintaining cognitive load and providing high-quality reasoning yields the highest cumulative return. | |
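A condensed sketch of this loop using TRL's `GRPOTrainer`. The prompt data, model checkpoint, and toy reward below are placeholders: the real pipeline scores completions by replaying them against the live environment.

```python
# GRPO fine-tuning sketch with TRL. ASSUMPTIONS: placeholder prompts, a small
# stand-in model, and a toy reward instead of true environment replay.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Step 2's collected observations, serialized into prompts (placeholder data).
train_dataset = Dataset.from_dict({
    "prompt": [
        "Observation: cognitive_load=0.82, phase=focus, pending urgent deadline. "
        "Choose one action and justify it.",
    ] * 32
})

def environment_reward(completions, **kwargs):
    # Toy stand-in: reward completions that pick the overload-appropriate action.
    return [0.3 if "take_break" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # any small causal LM for a smoke test
    reward_funcs=environment_reward,
    args=GRPOConfig(output_dir="grpo-focusflow", num_generations=4),
    train_dataset=train_dataset,
)
trainer.train()
```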
| --- | |
| ## π» Technical Setup & Quick Start | |
| ### Local Installation | |
| ```bash | |
| # Clone the repository |