---
title: FocusFlow RL Environment
emoji: 🎯
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: true
short_description: LLM-hard OpenEnv RL env for student focus management
thumbnail: >-
https://cdn-uploads.huggingface.co/production/uploads/68f093da561f15826cc8ad59/y40SmMZCx-xgI4v4wH3pS.png
---
# 🧠 FocusFlow: LLM-Hard RL Environment for Cognitive Management
### Meta × Scaler OpenEnv Hackathon 2026 – Grand Finale Submission
[Hugging Face Space](https://huggingface.co/spaces/hannan2859r/focusflow_env) · [Python 3.11](https://www.python.org/downloads/release/python-3110/) · [License: MIT](https://opensource.org/licenses/MIT)
**Links:**
- Google Colab: https://colab.research.google.com/drive/16wJ4mw6sdcTuOYABpdoV2AuO6_KYnc4Q?usp=sharing
- GitHub: https://github.com/abdulhannan-18/Focus_Flow_env
> **Executive Summary:** FocusFlow is an OpenEnv-compliant reinforcement learning environment that simulates the cognitive friction of modern digital life. It abandons traditional spatial tasks (like moving a robot arm) in favor of **LLM-hard cognitive tasks**: managing mental energy, tracking shifting deadlines, and utilizing natural language comprehension to filter informal social distractions from urgent professional tasks.
---
## 🎯 Hackathon Theme Alignment
**Core Themes Addressed:** Long-Horizon Planning & Instruction Following | World Modeling across Professional/Personal Tasks
* **The Problem Statement:** Modern digital workspaces cause catastrophic context-switching. Traditional RL bots fail here because evaluating a distraction requires contextual language understanding. The problem is designing an environment that forces an AI agent to manage time, mental energy, and dynamic deadlines while processing rich natural-language interruptions.
* **The Environment:** A fully Dockerized, RESTful API environment. The world state dynamically models time progression, cognitive load (rising with work, decaying with breaks), and an event engine that injects multi-tiered distractions.
* **Agent Capabilities Required:** Agents must possess reading comprehension (urgency evaluation), multi-day memory (tracking deferred events before they expire), and Chain-of-Thought (CoT) reasoning to justify scheduling decisions.
---
## 🏗️ System Architecture & Observation Space
The environment operates via a FastAPI backend, serving strictly typed JSON payloads. The observation space is designed to be highly complex, forcing the LLM to synthesize multiple data streams.
### Example Observation Payload
```json
{
"time_remaining_seconds": 1140,
"current_phase": "focus",
"sessions_completed": 1,
"focus_score": 0.923,
"cognitive_load": 0.62,
"deadline_pressure": 0.45,
"active_distractions": ["Instagram", "BGMI"],
"blocked_apps": ["YouTube"],
"pending_event": {
"type": "social_message",
"description": "Rahul texted: 'bhai BGMI chalate hain, sirf 1 ghanta, kal exam nahi hai'",
"urgency": 0.30,
"can_defer": true,
"deadline_steps": 8,
"correct_action": "defer_event"
},
"day_context": {
"day_number": 1,
"energy_level": 0.84,
"pending_deadlines": [
{"task": "Math Assignment", "due_step": 45, "completed": false}
]
},
"last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
"reasoning_quality_score": 0.82
}
```
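The payload above can be consumed client-side as a typed structure. The sketch below is an illustrative parser, not part of the environment's published client: the dataclass names are assumptions, and only a subset of the observation fields is modeled.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical client-side model of the observation payload shown above;
# field names mirror the example JSON, but the real schema may differ.

@dataclass
class PendingEvent:
    type: str
    description: str
    urgency: float
    can_defer: bool
    deadline_steps: int
    correct_action: str

@dataclass
class Observation:
    time_remaining_seconds: int
    current_phase: str
    cognitive_load: float
    focus_score: float
    pending_event: Optional[PendingEvent]

def parse_observation(raw: str) -> Observation:
    """Parse a raw JSON payload into a typed Observation."""
    data = json.loads(raw)
    event = data.get("pending_event")
    return Observation(
        time_remaining_seconds=data["time_remaining_seconds"],
        current_phase=data["current_phase"],
        cognitive_load=data["cognitive_load"],
        focus_score=data["focus_score"],
        pending_event=PendingEvent(**event) if event else None,
    )
```

Typed parsing like this lets an agent harness fail fast on malformed payloads instead of silently feeding bad state into the policy.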
---
## ⚖️ Dual-Layer Reward Model & Evaluation Logic
FocusFlow implements a hybrid objective/subjective reward function.
### 1. Objective Mechanical Rewards
| Action | Environmental Trigger | Reward / Penalty |
|---|---|---|
| `focus` | Executed during work phase | `+0.05 × (1 − cognitive_load)` |
| `block_app` | Targets an active high-temptation app | `+0.20 × temptation_level` |
| `take_break` | Executed when `cognitive_load > 0.75` | `+0.20` to `+0.30` |
| `defer_event` | Postpones a low-urgency social text | `+0.15` (Correct) / `-0.05` (Wrong) |
| `respond_to_event` | Handles urgent/hard deadlines | `+0.20` (Correct) / `-0.10` (Wrong) |
| `plan_day` | Sets schedule aligning with deadlines | `+0.00` to `+0.30` (Quality scaled) |
| `check_app` | **(BAD)** Agent gives in to temptation | **`-0.50` Hard Penalty** |
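The table above can be sketched as a single dispatch function. This is a simplified illustration, not the environment's actual code: state keys such as `phase`, `temptation_level`, and `plan_quality` are assumed names, and the break-reward scaling formula is an invented example of how a `+0.20` to `+0.30` range might be implemented.

```python
# Sketch of the objective reward rules from the table above.
# Thresholds follow the README; the real environment may differ in detail.

def objective_reward(action: str, state: dict) -> float:
    if action == "focus" and state.get("phase") == "work":
        return 0.05 * (1.0 - state["cognitive_load"])
    if action == "block_app":
        return 0.20 * state.get("temptation_level", 0.0)
    if action == "take_break" and state["cognitive_load"] > 0.75:
        # Illustrative scaling from +0.20 toward +0.30 as load approaches 1.0
        return 0.20 + 0.10 * min(1.0, (state["cognitive_load"] - 0.75) / 0.25)
    if action == "defer_event":
        return 0.15 if state.get("correct_action") == "defer_event" else -0.05
    if action == "respond_to_event":
        return 0.20 if state.get("correct_action") == "respond_to_event" else -0.10
    if action == "plan_day":
        return 0.30 * state.get("plan_quality", 0.0)  # quality-scaled, 0.0 to 0.30
    if action == "check_app":
        return -0.50  # hard penalty: agent gave in to temptation
    return 0.0
```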
### 2. Subjective Reasoning Grader
To prevent random action-spamming, the `grade_reasoning()` heuristic parses the agent's mandatory reasoning field.
* It applies a `±0.10` multiplier based on the use of causal language, task-awareness, and logical alignment with the current `pending_event`.
* Empty or repetitive reasoning results in immediate reward degradation.
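A grader in this style might look like the sketch below. The keyword lists and weights are illustrative assumptions only; the actual `grade_reasoning()` heuristic is defined in the environment code.

```python
# Minimal sketch of a grade_reasoning()-style heuristic returning a
# modifier in [-0.10, +0.10]. Marker lists and weights are assumptions.

CAUSAL_MARKERS = ("because", "so that", "in order to", "since")
TASK_MARKERS = ("deadline", "exam", "assignment", "cognitive load", "focus")

def grade_reasoning(reasoning: str, last_reasoning: str = "") -> float:
    text = reasoning.strip().lower()
    if not text or text == last_reasoning.strip().lower():
        return -0.10  # empty or repeated reasoning degrades reward
    score = 0.0
    if any(m in text for m in CAUSAL_MARKERS):
        score += 0.05  # causal language
    if any(m in text for m in TASK_MARKERS):
        score += 0.05  # task-awareness
    return score
```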
---
## 📈 Task Progressions
| Task ID | Challenge Pillar | Success Criteria | Horizon |
|---|---|---|---|
| `task_1` | **Execution** | Complete a 25-min session with 0 app checks. Handle basic distractions logically. | 60 Steps |
| `task_2` | **Load Management** | Complete a multi-session day. Keep `cognitive_load < 0.85` via strategic breaks. | 120 Steps |
| `task_3` | **Long-Horizon** | Execute a 3-day plan, manage energy decay, and maintain a perfect focus streak. | 240 Steps |
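The progression above could be encoded as a simple configuration mapping. The key names below are illustrative assumptions; the environment's real task definitions may be structured differently.

```python
# Illustrative task configuration mirroring the progression table.
TASKS = {
    "task_1": {"pillar": "execution", "horizon_steps": 60,
               "success": "finish one 25-min session with zero app checks"},
    "task_2": {"pillar": "load_management", "horizon_steps": 120,
               "success": "keep cognitive_load < 0.85 across a multi-session day"},
    "task_3": {"pillar": "long_horizon", "horizon_steps": 240,
               "success": "execute a 3-day plan with a perfect focus streak"},
}
```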
---
## 🚀 Post-Training & Self-Improvement Strategy (GRPO)
A baseline LLM will struggle with FocusFlow's delayed rewards (e.g., deferring an event now to save energy for a deadline 50 steps later).
To achieve an optimal policy, the project includes a **Group Relative Policy Optimization (GRPO)** pipeline:
1. **Framework:** Uses `TRL` (Transformer Reinforcement Learning) and `Unsloth` for efficient 4-bit quantization on consumer hardware (T4 GPUs).
2. **Data Generation:** The baseline agent explores the live FastAPI environment, collecting trajectories of observations, actions, and rewards.
3. **Optimization:** GRPO updates the LLM weights directly based on the environment's trajectory rewards, teaching the model that maintaining cognitive load and providing high-quality reasoning yields the highest cumulative return.
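The core of GRPO's update signal is group-relative advantage: for a group of rollouts from the same prompt, each trajectory's reward is normalized against the group's mean and standard deviation. The sketch below shows just that normalization step in isolation; in practice, TRL's trainer computes this internally.

```python
import statistics

# Group-relative advantage: normalize each rollout's reward against the
# mean and population std of its group (a simplified GRPO building block).

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no signal
    return [(r - mean) / std for r in rewards]
```

Because advantages are computed per group rather than from a learned value function, GRPO needs no separate critic, which is what makes it practical on the consumer-grade hardware mentioned above.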
---
## 💻 Technical Setup & Quick Start
### Local Installation
```bash
# Clone the repository
git clone https://github.com/abdulhannan-18/Focus_Flow_env