Update README.md
README.md
CHANGED
|
@@ -1,41 +1,65 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
|
| 5 |
-
|
|
|
|
| 6 |
|
| 7 |
-
|
| 8 |
-
|
| 9 |
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
Mandatory reasoning field (graded) Empty reasoning = reward penalty. LLMs must justify decisions
|
| 13 |
-
Cognitive load dynamics Overworking degrades future rewards; requires adaptive strategy
|
| 14 |
-
Multi-day deadline tracking Planning today affects energy and deadlines tomorrow
|
| 15 |
-
Deferred events expire Agent must track time-sensitive commitments across steps
|
| 16 |
-
Urgency vs. deferability trade-off "Mom called twice" ≠ "Friend wants to play BGMI"
|
| 17 |
-
A simple if-else policy cannot pass Task 2 or Task 3 without understanding language.
|
| 18 |
|
| 19 |
-
|
| 20 |
-
Action Space (8 actions)
|
| 21 |
-
Action When to Use Reward
|
| 22 |
-
focus Stay on task +0.05 × (1 − cognitive_load)
|
| 23 |
-
block_app Block a distracting app +0.20 × temptation_level
|
| 24 |
-
take_break Rest at session boundary or when load > 0.75 +0.20 to +0.30
|
| 25 |
-
defer_event Postpone a low-urgency event +0.15 if correct, −0.05 if wrong
|
| 26 |
-
respond_to_event Handle urgent events immediately +0.20 if correct
|
| 27 |
-
plan_day Set a study schedule at day start +0.00 to +0.30 based on quality
|
| 28 |
-
adjust_energy Recover from fatigue/environmental noise +0.10
|
| 29 |
-
check_app (BAD) Give in to distraction −0.50
|
| 30 |
-
Reasoning Quality Reward (Universal)
|
| 31 |
-
Every action carries a reasoning bonus/penalty (±0.10) based on:
|
| 32 |
|
| 33 |
-
|
| 34 |
-
Use of causal language ("because", "therefore", "in order to")
|
| 35 |
-
Whether the action matches the correct response for the active event
|
| 36 |
-
This is what separates an LLM from a rule-based bot.
|
| 37 |
|
| 38 |
-
|
| 39 |
{
|
| 40 |
"time_remaining_seconds": 1140,
|
| 41 |
"current_phase": "focus",
|
|
@@ -63,84 +87,3 @@ Observation Space
|
|
| 63 |
"last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
|
| 64 |
"reasoning_quality_score": 0.82
|
| 65 |
}
|
| 66 |
-
Tasks
|
| 67 |
-
Task Description Max Steps Key Challenge
|
| 68 |
-
task_1 One session, zero distractions 60 Reasoning quality + event handling
|
| 69 |
-
task_2 Two sessions, manage cognitive load 120 Break timing + multi-event judgment
|
| 70 |
-
task_3 3-day week plan, maintain streak 240 Long-horizon planning + energy decay
|
| 71 |
-
Reward Function Summary
|
| 72 |
-
Universal (every step):
|
| 73 |
-
reasoning_quality ∈ [-0.10, +0.10] scored by heuristic grader
|
| 74 |
-
|
| 75 |
-
Action-specific:
|
| 76 |
-
focus → +0.05 × (1 - cognitive_load)
|
| 77 |
-
block_app → +0.20 × temptation_level
|
| 78 |
-
take_break → +0.20 to +0.30 (well-timed) or -0.10 (premature)
|
| 79 |
-
defer_event → +0.15 (correct) / -0.05 (wrong) / -0.20 (non-deferrable)
|
| 80 |
-
respond_to_event → +0.20 (correct) / -0.10 (wrong)
|
| 81 |
-
plan_day → +0.00 to +0.30 (based on plan quality scoring)
|
| 82 |
-
adjust_energy → +0.10 (when needed) / +0.01 (unnecessary)
|
| 83 |
-
check_app → -0.50 (hard penalty)
|
| 84 |
-
|
| 85 |
-
Episode bonuses:
|
| 86 |
-
task_1: +0.25 if avg reasoning quality > 70%
|
| 87 |
-
task_2: +0.30 for zero app checks across both sessions
|
| 88 |
-
task_3: +0.40 for 3-day perfect focus streak
|
| 89 |
-
Quick Start
|
| 90 |
-
# Install
|
| 91 |
-
pip install -r requirements.txt
|
| 92 |
-
|
| 93 |
-
# Start environment server
|
| 94 |
-
uvicorn app:app --host 0.0.0.0 --port 7860 --reload
|
| 95 |
-
|
| 96 |
-
# Reset and step (in another terminal)
|
| 97 |
-
curl -X POST "http://localhost:7860/reset?task_id=task_1"
|
| 98 |
-
|
| 99 |
-
curl -X POST http://localhost:7860/step \
|
| 100 |
-
-H "Content-Type: application/json" \
|
| 101 |
-
-d '{
|
| 102 |
-
"action_type": "defer_event",
|
| 103 |
-
"event_id": "evt_3",
|
| 104 |
-
"reasoning": "This is a low urgency social message from a friend asking to play games. Since I have a Math Assignment due at step 45 and I am currently in focus phase, I should defer this and stay focused. The friend can wait.",
|
| 105 |
-
"response_text": "bhai abhi padh raha hoon, baad mein baat karte hain"
|
| 106 |
-
}'
|
| 107 |
-
Run LLM Agent
|
| 108 |
-
export API_BASE_URL=https://api.groq.com/openai/v1
|
| 109 |
-
export GROQ_API_KEY=your_key_here
|
| 110 |
-
export MODEL_NAME=llama-3.1-8b-instant
|
| 111 |
-
export ENV_BASE_URL=http://localhost:7860
|
| 112 |
-
export TASK_ID=task_2
|
| 113 |
-
export MAX_EPISODES=5
|
| 114 |
-
|
| 115 |
-
python inference.py
|
| 116 |
-
Train with GRPO (Google Colab T4)
|
| 117 |
-
Open training_colab.py in Colab. It will:
|
| 118 |
-
|
| 119 |
-
Load Llama-3.2-1B-Instruct with Unsloth 4-bit quantisation
|
| 120 |
-
Collect environment episodes as training data
|
| 121 |
-
Fine-tune with GRPO using environment rewards
|
| 122 |
-
Plot reward curves
|
| 123 |
-
Push trained model to HuggingFace Hub
|
| 124 |
-
Project Structure
|
| 125 |
-
focusflow_rl_env/
|
| 126 |
-
├── models.py # Pydantic: FocusAction, FocusObservation, FocusState, DistractionEvent
|
| 127 |
-
├── environment.py # Core RL logic: step(), reset(), reward, NL event grading
|
| 128 |
-
├── app.py # FastAPI server (OpenEnv HTTP API + /metrics endpoint)
|
| 129 |
-
├── inference.py # LLM baseline agent with chain-of-thought prompting
|
| 130 |
-
├── training_colab.py # GRPO training script (Unsloth + HF TRL)
|
| 131 |
-
├── openenv.yaml # OpenEnv metadata
|
| 132 |
-
├── Dockerfile
|
| 133 |
-
├── requirements.txt
|
| 134 |
-
└── README.md
|
| 135 |
-
What Was Upgraded in v2.0
|
| 136 |
-
Feature v1 (original) v2 (this submission)
|
| 137 |
-
Distraction events App names only Rich NL messages with urgency & deferability
|
| 138 |
-
Reasoning Not required Mandatory, graded, rewarded
|
| 139 |
-
Action space 5 simple actions 8 actions including plan_day, defer, respond
|
| 140 |
-
Cognitive load Not modelled Dynamic: rises with focus, falls with breaks
|
| 141 |
-
Multi-day context Single session 3-day week with energy decay & deadlines
|
| 142 |
-
Training script Missing Full GRPO Colab notebook with reward curves
|
| 143 |
-
Success criteria Fixed string eval() Type-safe lambda functions
|
| 144 |
-
Metrics endpoint None /metrics for reward curve plotting
|
| 145 |
-
Submitted by
|
| 146 |
-
Abdul Hannan, Meta × Scaler OpenEnv Hackathon 2026
|
|
|
|
| 1 |
+
---
|
| 2 |
+
title: FocusFlow RL Environment
|
| 3 |
+
emoji: 🎯
|
| 4 |
+
colorFrom: blue
|
| 5 |
+
colorTo: purple
|
| 6 |
+
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
+
pinned: true
|
| 9 |
+
short_description: LLM-hard OpenEnv RL env for student focus management
|
| 10 |
+
---
|
| 11 |
|
| 12 |
+
# FocusFlow RL Environment v2.0
|
| 13 |
+
### Meta × Scaler OpenEnv Hackathon 2026
|
| 14 |
|
| 15 |
+
> An LLM-hard RL environment where an AI agent manages a student's real cognitive world,
|
| 16 |
+
> navigating natural language distractions, shifting deadlines, and multi-day energy dynamics.
|
| 17 |
|
| 18 |
+
[](https://colab.research.google.com/your-colab-link)
|
| 19 |
+
[](https://huggingface.co/spaces/your-space)
|
| 20 |
|
| 21 |
+
---
|
| 22 |
|
| 23 |
+
## Why This Environment Is LLM-Hard
|
| 24 |
|
| 25 |
+
Unlike toy RL environments solvable by a simple rule-based policy, FocusFlow requires genuine LLM reasoning:
|
| 26 |
+
|
| 27 |
+
| Challenge | Why It Needs an LLM |
|
| 28 |
+
|---|---|
|
| 29 |
+
| Natural language distraction events | Agent must read and interpret messages to judge urgency |
|
| 30 |
+
| Mandatory `reasoning` field (graded) | Empty reasoning = reward penalty. LLMs must justify decisions |
|
| 31 |
+
| Cognitive load dynamics | Overworking degrades future rewards; requires adaptive strategy |
|
| 32 |
+
| Multi-day deadline tracking | Planning today affects energy and deadlines tomorrow |
|
| 33 |
+
| Deferred events expire | Agent must track time-sensitive commitments across steps |
|
| 34 |
+
| Urgency vs. deferability trade-off | "Mom called twice" ≠ "Friend wants to play BGMI" |
|
| 35 |
+
|
| 36 |
+
---
|
| 37 |
+
|
| 38 |
+
## Environment Design
|
| 39 |
+
|
| 40 |
+
### Action Space (8 actions)
|
| 41 |
+
|
| 42 |
+
| Action | When to Use | Reward |
|
| 43 |
+
|---|---|---|
|
| 44 |
+
| `focus` | Stay on task | +0.05 × (1 − cognitive_load) |
|
| 45 |
+
| `block_app` | Block a distracting app | +0.20 × temptation_level |
|
| 46 |
+
| `take_break` | Rest at session boundary or when load > 0.75 | +0.20 to +0.30 |
|
| 47 |
+
| `defer_event` | Postpone a low-urgency event | +0.15 if correct, −0.05 if wrong |
|
| 48 |
+
| `respond_to_event` | Handle urgent events immediately | +0.20 if correct |
|
| 49 |
+
| `plan_day` | Set a study schedule at day start | +0.00 to +0.30 based on quality |
|
| 50 |
+
| `adjust_energy` | Recover from fatigue/environmental noise | +0.10 |
|
| 51 |
+
| `check_app` | **(BAD)** Give in to distraction | −0.50 |
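
The rows of the table with a fixed value or a closed-form formula can be sketched as a small function. This is an illustrative sketch only, not the code in `environment.py`: the function name `base_reward` and its parameters are hypothetical, and the rows whose reward is a range (`take_break`, `plan_day`) or depends on unstated conditions (`respond_to_event`) are deliberately left out.

```python
def base_reward(action: str,
                cognitive_load: float = 0.0,
                temptation_level: float = 0.0,
                correct: bool = True) -> float:
    """Return the table's base reward for a subset of actions (hypothetical helper)."""
    if action == "focus":
        # Focus pays less as cognitive load rises.
        return 0.05 * (1.0 - cognitive_load)
    if action == "block_app":
        # Blocking pays more when the temptation is stronger.
        return 0.20 * temptation_level
    if action == "defer_event":
        # +0.15 for a correct deferral, -0.05 for a wrong one.
        return 0.15 if correct else -0.05
    if action == "adjust_energy":
        return 0.10
    if action == "check_app":
        # Giving in to distraction is the one hard penalty.
        return -0.50
    raise ValueError(f"reward for {action!r} is not modelled in this sketch")


print(base_reward("focus", cognitive_load=0.5))
print(base_reward("check_app"))
```

With these numbers, a single `check_app` costs as much as ten `focus` steps at zero cognitive load can earn.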
|
| 52 |
+
|
| 53 |
+
### Reasoning Quality Reward (Universal)
|
| 54 |
+
|
| 55 |
+
Every action carries a **reasoning bonus/penalty** (±0.10) based on:
|
| 56 |
+
- Mentions of relevant concepts (urgency, priority, focus, deadlines)
|
| 57 |
+
- Use of causal language ("because", "therefore", "in order to")
|
| 58 |
+
- Whether the action matches the correct response for the active event
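
A heuristic grader over these three criteria might look roughly like the sketch below. Everything in it is an assumption made for illustration: the keyword lists, the 0.4/0.3/0.3 weights, and the function name `reasoning_score` are hypothetical, and the README does not say how the resulting score is converted into the ±0.10 bonus, so that step is omitted.

```python
# Hypothetical keyword lists, taken from the examples in the bullets above.
CONCEPT_TERMS = ("urgency", "priority", "focus", "deadline")
CAUSAL_TERMS = ("because", "therefore", "in order to")


def reasoning_score(reasoning: str, action_matches_event: bool) -> float:
    """Score a reasoning string in [0, 1] using three simple checks (assumed weights)."""
    text = reasoning.lower()
    if not text.strip():
        # An empty reasoning field earns nothing.
        return 0.0
    # Criterion 1: mentions of relevant concepts, saturating at two distinct terms.
    concept_hits = sum(term in text for term in CONCEPT_TERMS)
    concept_part = min(concept_hits, 2) / 2
    # Criterion 2: at least one causal connective.
    causal_part = 1.0 if any(term in text for term in CAUSAL_TERMS) else 0.0
    # Criterion 3: the chosen action is the correct response to the active event.
    match_part = 1.0 if action_matches_event else 0.0
    return 0.4 * concept_part + 0.3 * causal_part + 0.3 * match_part


print(reasoning_score("I defer because the deadline has priority.", True))
print(reasoning_score("ok", False))
```

A keyword-based check like this is cheap to compute on every step, but it rewards surface features, which is why the third criterion ties the score to whether the action was actually correct.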
|
| 59 |
+
|
| 60 |
+
### Observation Space
|
| 61 |
+
|
| 62 |
+
```json
|
| 63 |
{
|
| 64 |
"time_remaining_seconds": 1140,
|
| 65 |
"current_phase": "focus",
|
|
|
|
| 87 |
"last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
|
| 88 |
"reasoning_quality_score": 0.82
|
| 89 |
}