---
title: FocusFlow RL Environment
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
short_description: LLM-hard OpenEnv RL env for student focus management
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/68f093da561f15826cc8ad59/y40SmMZCx-xgI4v4wH3pS.png
---
# 🧠 FocusFlow: LLM-Hard RL Environment for Cognitive Management

**Meta × Scaler OpenEnv Hackathon 2026 – Grand Finale Submission**

Links:
- Google Colab: https://colab.research.google.com/drive/16wJ4mw6sdcTuOYABpdoV2AuO6_KYnc4Q?usp=sharing
- GitHub: https://github.com/abdulhannan-18/Focus_Flow_env
## Executive Summary

FocusFlow is an OpenEnv-compliant reinforcement learning environment that simulates the cognitive friction of modern digital life. It abandons traditional spatial tasks (like moving a robot arm) in favor of LLM-hard cognitive tasks: managing mental energy, tracking shifting deadlines, and using natural language comprehension to filter informal social distractions from urgent professional tasks.
## 🎯 Hackathon Theme Alignment

**Core Themes Addressed:** Long-Horizon Planning & Instruction Following | World Modeling across Professional/Personal Tasks
- The Problem Statement: Modern digital workspaces cause catastrophic context-switching. Traditional RL bots fail here because evaluating a distraction requires contextual language understanding. The problem is designing an environment that forces an AI agent to manage time, mental energy, and dynamic deadlines while processing rich natural-language interruptions.
- The Environment: A fully Dockerized, RESTful API environment. The world state dynamically models time progression, cognitive load (rising with work, decaying with breaks), and an event engine that injects multi-tiered distractions.
- Agent Capabilities Required: Agents must possess reading comprehension (urgency evaluation), multi-day memory (tracking deferred events before they expire), and Chain-of-Thought (CoT) reasoning to justify scheduling decisions.
## 🏗️ System Architecture & Observation Space
The environment operates via a FastAPI backend, serving strictly typed JSON payloads. The observation space is designed to be highly complex, forcing the LLM to synthesize multiple data streams.
### Example Observation Payload

```json
{
  "time_remaining_seconds": 1140,
  "current_phase": "focus",
  "sessions_completed": 1,
  "focus_score": 0.923,
  "cognitive_load": 0.62,
  "deadline_pressure": 0.45,
  "active_distractions": ["Instagram", "BGMI"],
  "blocked_apps": ["YouTube"],
  "pending_event": {
    "type": "social_message",
    "description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
    "urgency": 0.30,
    "can_defer": true,
    "deadline_steps": 8,
    "correct_action": "defer_event"
  },
  "day_context": {
    "day_number": 1,
    "energy_level": 0.84,
    "pending_deadlines": [
      {"task": "Math Assignment", "due_step": 45, "completed": false}
    ]
  },
  "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
  "reasoning_quality_score": 0.82
}
```
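To make the payload's role concrete, here is a minimal, hypothetical client-side helper that distills the raw JSON observation into the few fields a policy prompt actually needs. The field names match the example payload above; the `ObsSummary` class and `summarize` function are illustrative, not part of FocusFlow's API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ObsSummary:
    """Condensed view of a FocusFlow observation (hypothetical helper)."""
    phase: str
    cognitive_load: float
    urgency: Optional[float]  # None when no event is pending
    can_defer: bool

def summarize(obs: dict) -> ObsSummary:
    # pending_event may be null in the raw payload
    event = obs.get("pending_event") or {}
    return ObsSummary(
        phase=obs["current_phase"],
        cognitive_load=obs["cognitive_load"],
        urgency=event.get("urgency"),
        can_defer=bool(event.get("can_defer", False)),
    )

if __name__ == "__main__":
    obs = {
        "current_phase": "focus",
        "cognitive_load": 0.62,
        "pending_event": {"urgency": 0.30, "can_defer": True},
    }
    print(summarize(obs))
```

A summary like this can be serialized straight into the LLM's context window instead of the full payload, keeping prompts short on long horizons.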
## ⚖️ Dual-Layer Reward Model & Evaluation Logic

FocusFlow implements a hybrid objective/subjective reward function.

### 1. Objective Mechanical Rewards
| Action | Environmental Trigger | Reward / Penalty |
|---|---|---|
| `focus` | Executed during work phase | +0.05 × (1 − `cognitive_load`) |
| `block_app` | Targets an active high-temptation app | +0.20 × `temptation_level` |
| `take_break` | Executed when `cognitive_load` > 0.75 | +0.20 to +0.30 |
| `defer_event` | Postpones a low-urgency social text | +0.15 (correct) / −0.05 (wrong) |
| `respond_to_event` | Handles urgent/hard deadlines | +0.20 (correct) / −0.10 (wrong) |
| `plan_day` | Sets a schedule aligned with deadlines | +0.00 to +0.30 (quality-scaled) |
| `check_app` | (BAD) Agent gives in to temptation | −0.50 hard penalty |
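The mechanical rules in the table above can be sketched as a single dispatch function. This is an illustrative reconstruction, not the environment's actual implementation: the numbers come from the table, while the function name, the `correct` flag, and the break-bonus scaling within the +0.20 to +0.30 band are assumptions.

```python
def mechanical_reward(action: str, obs: dict, correct: bool = True) -> float:
    """Sketch of the objective reward rules (hypothetical helper)."""
    load = obs.get("cognitive_load", 0.0)
    if action == "focus":
        # Focusing pays less as the mind gets tired
        return 0.05 * (1.0 - load)
    if action == "block_app":
        # Scaled by how tempting the blocked app is
        return 0.20 * obs.get("temptation_level", 1.0)
    if action == "take_break":
        # Only rewarded when load is genuinely high; scale within the band
        return 0.20 + 0.10 * load if load > 0.75 else 0.0
    if action == "defer_event":
        return 0.15 if correct else -0.05
    if action == "respond_to_event":
        return 0.20 if correct else -0.10
    if action == "check_app":
        return -0.50  # hard penalty for giving in to temptation
    return 0.0

# e.g. mechanical_reward("focus", {"cognitive_load": 0.62}) == 0.05 * 0.38
```

Note the asymmetry built into the table: wrong deferrals cost less (−0.05) than wrong responses (−0.10), so a cautious agent should defer when uncertain.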
### 2. Subjective Reasoning Grader

To prevent random action-spamming, the `grade_reasoning()` heuristic parses the agent's mandatory `reasoning` field.

- It applies a ±0.10 multiplier based on the use of causal language, task-awareness, and logical alignment with the current `pending_event`.
- Empty or repetitive reasoning results in immediate reward degradation.
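A toy stand-in for the grader illustrates the contract: the real `grade_reasoning()` lives inside the environment, and the lexical signals below (causal markers, mention of the pending event type, a crude repetition check) are assumptions about how such a heuristic could work, not its actual logic.

```python
CAUSAL_MARKERS = ("because", "so that", "in order to", "therefore")

def grade_reasoning(reasoning: str, pending_event_type: str = "") -> float:
    """Toy reasoning grader returning an adjustment in [-0.10, +0.10].

    Illustrative only; the environment's real heuristic differs.
    """
    text = reasoning.strip().lower()
    # Empty or highly repetitive reasoning is penalized immediately
    if not text or len(set(text.split())) < 3:
        return -0.10
    score = 0.0
    # Reward causal language ("because", "therefore", ...)
    if any(marker in text for marker in CAUSAL_MARKERS):
        score += 0.05
    # Reward alignment with the current pending event
    if pending_event_type and pending_event_type.replace("_", " ") in text:
        score += 0.05
    return min(score, 0.10)
```

The key design point survives even in this sketch: the grader multiplies through the reward, so action-spamming with empty justifications degrades an otherwise correct policy.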
## 📈 Task Progressions

| Task ID | Challenge Pillar | Success Criteria | Horizon |
|---|---|---|---|
| `task_1` | Execution | Complete a 25-min session with 0 app checks. Handle basic distractions logically. | 60 steps |
| `task_2` | Load Management | Complete a multi-session day. Keep `cognitive_load` < 0.85 via strategic breaks. | 120 steps |
| `task_3` | Long-Horizon | Execute a 3-day plan, manage energy decay, and maintain a perfect focus streak. | 240 steps |
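As an example of how one of these criteria could be checked against an episode trace, here is a hypothetical evaluator for `task_2`. The function name, the trace format, and the required session count are assumptions; only the < 0.85 load threshold comes from the table.

```python
def task2_success(load_trace: list, sessions_done: int, required: int = 2) -> bool:
    """Hypothetical task_2 checker: the multi-session day was completed
    and cognitive_load stayed below 0.85 at every step."""
    return sessions_done >= required and all(l < 0.85 for l in load_trace)
```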
## 🚀 Post-Training & Self-Improvement Strategy (GRPO)
A baseline LLM will struggle with FocusFlow's delayed rewards (e.g., deferring an event now to save energy for a deadline 50 steps later).
To achieve an optimal policy, the project includes a Group Relative Policy Optimization (GRPO) pipeline:
- **Framework:** Uses TRL (Transformer Reinforcement Learning) and Unsloth for efficient 4-bit quantization on consumer hardware (T4 GPUs).
- **Data Generation:** The baseline agent explores the live FastAPI environment, collecting trajectories of observations, actions, and rewards.
- **Optimization:** GRPO updates the LLM weights directly from the environment's trajectory rewards, teaching the model that managing cognitive load and providing high-quality reasoning yields the highest cumulative return.
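The core GRPO idea behind the pipeline steps above can be shown in a few lines: for a group of trajectories sampled from the same prompt/state, each trajectory's advantage is its reward standardized against the group mean and standard deviation, with no learned value critic. This is a minimal sketch of that computation, not the TRL implementation.

```python
import statistics

def group_relative_advantages(group_rewards):
    """Standardize each reward against its sampling group (GRPO core idea).

    Trajectories that beat the group average get positive advantage and are
    reinforced; below-average ones are suppressed.
    """
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in group_rewards]

# Example: the best of three sampled trajectories gets the only
# positive advantage, regardless of the absolute reward scale.
print(group_relative_advantages([0.1, 0.4, 1.2]))
```

Because advantages are relative within each group, the scheme is robust to FocusFlow's shifting reward scale across tasks and days, which is one reason GRPO suits this environment.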
## 💻 Technical Setup & Quick Start

### Local Installation

```bash
# Clone the repository
git clone https://github.com/abdulhannan-18/Focus_Flow_env.git
cd Focus_Flow_env
```