hannan2859r committed on
Commit bb019be · verified · 1 Parent(s): 928d775

Update README.md

Files changed (1)
  1. README.md +56 -113
README.md CHANGED
@@ -1,41 +1,65 @@
1
- FocusFlow RL Environment v2.0
2
- Meta × Scaler OpenEnv Hackathon 2026
3
- An LLM-hard RL environment where an AI agent manages a student's real cognitive world: navigating natural language distractions, shifting deadlines, and multi-day energy dynamics.
4
 
5
- Open in Colab HuggingFace Space
 
6
 
7
- Why This Environment Is LLM-Hard
8
- Unlike toy RL environments solvable by a simple rule-based policy, FocusFlow requires genuine LLM reasoning:
9
 
10
- Challenge | Why It Needs an LLM
11
- Natural language distraction events | Agent must read and interpret messages to judge urgency
12
- Mandatory reasoning field (graded) | Empty reasoning = reward penalty. LLMs must justify decisions
13
- Cognitive load dynamics | Overworking degrades future rewards; requires adaptive strategy
14
- Multi-day deadline tracking | Planning today affects energy and deadlines tomorrow
15
- Deferred events expire | Agent must track time-sensitive commitments across steps
16
- Urgency vs. deferability trade-off | "Mom called twice" ≠ "Friend wants to play BGMI"
17
- A simple if-else policy cannot pass Task 2 or Task 3 without understanding language.
18
 
19
- Environment Design
20
- Action Space (8 actions)
21
- Action | When to Use | Reward
22
- focus | Stay on task | +0.05 × (1 − cognitive_load)
23
- block_app | Block a distracting app | +0.20 × temptation_level
24
- take_break | Rest at session boundary or when load > 0.75 | +0.20 to +0.30
25
- defer_event | Postpone a low-urgency event | +0.15 if correct, −0.05 if wrong
26
- respond_to_event | Handle urgent events immediately | +0.20 if correct
27
- plan_day | Set a study schedule at day start | +0.00 to +0.30 based on quality
28
- adjust_energy | Recover from fatigue/environmental noise | +0.10
29
- check_app | (BAD) Give in to distraction | −0.50
30
- Reasoning Quality Reward (Universal)
31
- Every action carries a reasoning bonus/penalty (±0.10) based on:
32
 
33
- Mentions of relevant concepts (urgency, priority, focus, deadlines)
34
- Use of causal language ("because", "therefore", "in order to")
35
- Whether the action matches the correct response for the active event
36
- This is what separates an LLM from a rule-based bot.
37
 
38
- Observation Space
39
  {
40
  "time_remaining_seconds": 1140,
41
  "current_phase": "focus",
@@ -63,84 +87,3 @@ Observation Space
63
  "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
64
  "reasoning_quality_score": 0.82
65
  }
66
- Tasks
67
- Task | Description | Max Steps | Key Challenge
68
- task_1 | One session, zero distractions | 60 | Reasoning quality + event handling
69
- task_2 | Two sessions, manage cognitive load | 120 | Break timing + multi-event judgment
70
- task_3 | 3-day week plan, maintain streak | 240 | Long-horizon planning + energy decay
71
- Reward Function Summary
72
- Universal (every step):
73
- reasoning_quality ∈ [-0.10, +0.10] scored by heuristic grader
74
-
75
- Action-specific:
76
- focus → +0.05 × (1 - cognitive_load)
77
- block_app → +0.20 × temptation_level
78
- take_break → +0.20 to +0.30 (well-timed) or -0.10 (premature)
79
- defer_event → +0.15 (correct) / -0.05 (wrong) / -0.20 (non-deferrable)
80
- respond_to_event → +0.20 (correct) / -0.10 (wrong)
81
- plan_day → +0.00 to +0.30 (based on plan quality scoring)
82
- adjust_energy → +0.10 (when needed) / +0.01 (unnecessary)
83
- check_app → -0.50 (hard penalty)
84
-
85
- Episode bonuses:
86
- task_1: +0.25 if avg reasoning quality > 70%
87
- task_2: +0.30 for zero app checks across both sessions
88
- task_3: +0.40 for 3-day perfect focus streak
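
The action-specific rules listed above can be sketched as one dispatch function. This is a minimal illustration, not the code in environment.py: the name `action_reward`, the `outcome` labels, and the `quality` parameter (a value in [0, 1]) are assumptions, as is the linear interpolation across the +0.20 to +0.30 break range and the +0.00 to +0.30 plan range. The universal reasoning bonus and the episode bonuses are left out.

```python
def action_reward(
    action: str,
    *,
    cognitive_load: float = 0.0,
    temptation_level: float = 0.0,
    outcome: str = "correct",
    quality: float = 0.0,
) -> float:
    """Return the action-specific reward for one step (hypothetical helper)."""
    if action == "focus":
        return 0.05 * (1.0 - cognitive_load)
    if action == "block_app":
        return 0.20 * temptation_level
    if action == "take_break":
        # Assumption: quality in [0, 1] interpolates between +0.20 and +0.30.
        return -0.10 if outcome == "premature" else 0.20 + 0.10 * quality
    if action == "defer_event":
        return {"correct": 0.15, "wrong": -0.05, "non_deferrable": -0.20}[outcome]
    if action == "respond_to_event":
        return {"correct": 0.20, "wrong": -0.10}[outcome]
    if action == "plan_day":
        # Assumption: plan quality in [0, 1] scales linearly up to +0.30.
        return 0.30 * quality
    if action == "adjust_energy":
        return 0.10 if outcome == "needed" else 0.01
    if action == "check_app":
        return -0.50  # hard penalty regardless of state
    raise ValueError(f"unknown action: {action}")
```

Keeping the rules in one function makes the trade-off visible: a single `check_app` costs more than two well-timed breaks can earn back.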
89
- Quick Start
90
- # Install
91
- pip install -r requirements.txt
92
-
93
- # Start environment server
94
- uvicorn app:app --host 0.0.0.0 --port 7860 --reload
95
-
96
- # Reset and step (in another terminal)
97
- curl -X POST "http://localhost:7860/reset?task_id=task_1"
98
-
99
- curl -X POST http://localhost:7860/step \
100
- -H "Content-Type: application/json" \
101
- -d '{
102
- "action_type": "defer_event",
103
- "event_id": "evt_3",
104
- "reasoning": "This is a low urgency social message from a friend asking to play games. Since I have a Math Assignment due at step 45 and I am currently in focus phase, I should defer this and stay focused. The friend can wait.",
105
- "response_text": "bhai abhi padh raha hoon, baad mein baat karte hain"
106
- }'
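
The two curl calls above can also be issued from Python with only the standard library. This is a hedged sketch: the helper names `build_reset_request`, `build_step_request`, and `send` are hypothetical, and it assumes the server answers with a JSON body, which the README does not state explicitly.

```python
import json
import urllib.request

ENV_BASE_URL = "http://localhost:7860"  # same address the curl examples use


def build_reset_request(task_id: str, base_url: str = ENV_BASE_URL) -> urllib.request.Request:
    """Build the POST /reset?task_id=... request without sending it."""
    return urllib.request.Request(f"{base_url}/reset?task_id={task_id}", method="POST")


def build_step_request(action: dict, base_url: str = ENV_BASE_URL) -> urllib.request.Request:
    """Build the POST /step request carrying the action as a JSON body."""
    body = json.dumps(action).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/step",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the response (assumed JSON; needs the server running)."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_reset_request("task_1"))` followed by `send(build_step_request(action))`, where `action` is a dict with the same keys as the curl payload, mirrors the two commands.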
107
- Run LLM Agent
108
- export API_BASE_URL=https://api.groq.com/openai/v1
109
- export GROQ_API_KEY=your_key_here
110
- export MODEL_NAME=llama-3.1-8b-instant
111
- export ENV_BASE_URL=http://localhost:7860
112
- export TASK_ID=task_2
113
- export MAX_EPISODES=5
114
-
115
- python inference.py
116
- Train with GRPO (Google Colab T4)
117
- Open training_colab.py in Colab. It will:
118
-
119
- Load Llama-3.2-1B-Instruct with Unsloth 4-bit quantisation
120
- Collect environment episodes as training data
121
- Fine-tune with GRPO using environment rewards
122
- Plot reward curves
123
- Push trained model to HuggingFace Hub
124
- Project Structure
125
- focusflow_rl_env/
126
- ├── models.py # Pydantic: FocusAction, FocusObservation, FocusState, DistractionEvent
127
- ├── environment.py # Core RL logic: step(), reset(), reward, NL event grading
128
- ├── app.py # FastAPI server (OpenEnv HTTP API + /metrics endpoint)
129
- ├── inference.py # LLM baseline agent with chain-of-thought prompting
130
- ├── training_colab.py # GRPO training script (Unsloth + HF TRL)
131
- ├── openenv.yaml # OpenEnv metadata
132
- ├── Dockerfile
133
- ├── requirements.txt
134
- └── README.md
135
- What Was Upgraded in v2.0
136
- Feature | v1 (original) | v2 (this submission)
137
- Distraction events | App names only | Rich NL messages with urgency & deferability
138
- Reasoning | Not required | Mandatory, graded, rewarded
139
- Action space | 5 simple actions | 8 actions including plan_day, defer, respond
140
- Cognitive load | Not modelled | Dynamic: rises with focus, falls with breaks
141
- Multi-day context | Single session | 3-day week with energy decay & deadlines
142
- Training script | Missing | Full GRPO Colab notebook with reward curves
143
- Success criteria | Fixed string eval() | Type-safe lambda functions
144
- Metrics endpoint | None | /metrics for reward curve plotting
145
- Submitted by
146
- Abdul Hannan, Meta × Scaler OpenEnv Hackathon 2026
 
1
+ ---
2
+ title: FocusFlow RL Environment
3
+ emoji: 🎯
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: docker
7
+ app_port: 7860
8
+ pinned: true
9
+ short_description: LLM-hard OpenEnv RL env for student focus management
10
+ ---
11
 
12
+ # FocusFlow RL Environment v2.0
13
+ ### Meta × Scaler OpenEnv Hackathon 2026
14
 
15
+ > An LLM-hard RL environment where an AI agent manages a student's real cognitive world:
16
+ > navigating natural language distractions, shifting deadlines, and multi-day energy dynamics.
17
 
18
+ [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/your-colab-link)
19
+ [![HuggingFace Space](https://img.shields.io/badge/🤗-HuggingFace%20Space-yellow)](https://huggingface.co/spaces/your-space)
20
 
21
+ ---
22
 
23
+ ## Why This Environment Is LLM-Hard
24
 
25
+ Unlike toy RL environments solvable by a simple rule-based policy, FocusFlow requires genuine LLM reasoning:
26
+
27
+ | Challenge | Why It Needs an LLM |
28
+ |---|---|
29
+ | Natural language distraction events | Agent must read and interpret messages to judge urgency |
30
+ | Mandatory `reasoning` field (graded) | Empty reasoning = reward penalty. LLMs must justify decisions |
31
+ | Cognitive load dynamics | Overworking degrades future rewards; requires adaptive strategy |
32
+ | Multi-day deadline tracking | Planning today affects energy and deadlines tomorrow |
33
+ | Deferred events expire | Agent must track time-sensitive commitments across steps |
34
+ | Urgency vs. deferability trade-off | "Mom called twice" ≠ "Friend wants to play BGMI" |
35
+
36
+ ---
37
+
38
+ ## Environment Design
39
+
40
+ ### Action Space (8 actions)
41
+
42
+ | Action | When to Use | Reward |
43
+ |---|---|---|
44
+ | `focus` | Stay on task | +0.05 × (1 − cognitive_load) |
45
+ | `block_app` | Block a distracting app | +0.20 × temptation_level |
46
+ | `take_break` | Rest at session boundary or when load > 0.75 | +0.20 to +0.30 |
47
+ | `defer_event` | Postpone a low-urgency event | +0.15 if correct, −0.05 if wrong |
48
+ | `respond_to_event` | Handle urgent events immediately | +0.20 if correct |
49
+ | `plan_day` | Set a study schedule at day start | +0.00 to +0.30 based on quality |
50
+ | `adjust_energy` | Recover from fatigue/environmental noise | +0.10 |
51
+ | `check_app` | **(BAD)** Give in to distraction | −0.50 |
52
+
53
+ ### Reasoning Quality Reward (Universal)
54
+
55
+ Every action carries a **reasoning bonus/penalty** (±0.10) based on:
56
+ - Mentions of relevant concepts (urgency, priority, focus, deadlines)
57
+ - Use of causal language ("because", "therefore", "in order to")
58
+ - Whether the action matches the correct response for the active event
59
+
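
The three signals listed above could be combined by a heuristic grader along these lines. This is only a sketch under stated assumptions: the README does not give the grader's weights or thresholds, so the 0.4/0.3/0.3 weights, the two-concept saturation, and the mapping in `reasoning_bonus` (neutral at 0.35, full bonus from 0.70) are hypothetical, as are both function names.

```python
CONCEPT_TERMS = ("urgency", "priority", "focus", "deadline")
CAUSAL_MARKERS = ("because", "therefore", "in order to")


def reasoning_score(reasoning: str, action_matches_event: bool) -> float:
    """Score a reasoning string in [0, 1] from the three listed signals."""
    text = reasoning.lower().strip()
    if not text:
        return 0.0  # empty reasoning earns no credit
    concept_hits = sum(term in text for term in CONCEPT_TERMS)
    concept_part = min(concept_hits, 2) / 2  # assumption: two concepts saturate
    causal_part = 1.0 if any(marker in text for marker in CAUSAL_MARKERS) else 0.0
    match_part = 1.0 if action_matches_event else 0.0
    # Assumption: hypothetical weights, chosen only for illustration.
    return 0.4 * concept_part + 0.3 * causal_part + 0.3 * match_part


def reasoning_bonus(score: float) -> float:
    """Map a [0, 1] score onto the ±0.10 range, saturating at both ends."""
    scaled = (score - 0.35) / 0.35  # assumption: 0.35 is neutral, 0.70 saturates
    return 0.10 * max(-1.0, min(1.0, scaled))
```

Under these assumed thresholds an empty `reasoning` field maps to the full −0.10 penalty, and a score of 0.82 maps to the full +0.10 bonus.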
60
+ ### Observation Space
61
+
62
+ ```json
63
  {
64
  "time_remaining_seconds": 1140,
65
  "current_phase": "focus",
 
87
  "last_action_feedback": "Well-timed break: +0.30 | Good reasoning (0.82): +0.10",
88
  "reasoning_quality_score": 0.82
89
  }