Manas Mehta committed on
Commit fd61835 · 1 Parent(s): 0b81240

Update README

Files changed (1)
  1. README.md +0 -216
README.md CHANGED
@@ -1,217 +1 @@
1
  # DataSelectEnv — OpenEnv Environment for Data Curation in ML Training
2
-
3
- 🚀 **Overview**
4
-
5
- DataSelectEnv is a reinforcement learning environment that simulates a real-world machine learning workflow:
6
- selecting high-quality training data under constraints.
7
-
8
- Agents must decide:
9
- - which data to select
10
- - how much to select
11
- - and which strategy to prioritize
12
-
13
- while balancing:
14
- - data quality
15
- - diversity
16
- - noise
17
- - and budget
18
-
19
- ---
20
-
21
- 🎯 **Motivation**
22
-
23
- In real-world ML systems (e.g., at companies like Meta or Hugging Face), performance is not just about model architecture —
24
- it heavily depends on which data you train on.
25
-
26
- This environment models:
27
- - active learning
28
- - data filtering
29
- - noisy dataset handling
30
- - budget-constrained training
31
-
32
- ---
33
-
34
- 🧠 **Core Idea**
35
-
36
- Instead of training a model once, the agent interacts step-by-step:
37
- 1. Observe current model performance and dataset state
38
- 2. Select a batch of data using different strategies
39
- 3. Incrementally train the model (`partial_fit`)
40
- 4. Receive reward based on improvement and data quality
41
-
42
- ---
43
-
44
- ⚙️ **Environment Design**
45
-
46
- 📦 **Dataset**
47
- - Generated using `sklearn.datasets.make_classification`
48
- - 1500 samples, 20 features
49
- - Structured clusters with controlled noise
50
- - Split:
51
- - Seed: 100 samples (initial training)
52
- - Validation: 200 samples
53
- - Pool: remaining samples
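The split described above can be sketched as a simple index partition. This is a minimal illustration using only the standard library: the function name `split_indices`, the shuffle, and the `rng_seed` value are assumptions, and the feature and label generation with scikit-learn is not reproduced here.

```python
import random

def split_indices(n_samples=1500, n_seed=100, n_val=200, rng_seed=0):
    """Partition sample indices into seed, validation, and pool sets.

    Hypothetical helper: the sizes come from the README, while the
    shuffling scheme and rng_seed are assumptions for illustration.
    """
    if n_seed + n_val > n_samples:
        raise ValueError("seed + validation sizes exceed the dataset size")
    indices = list(range(n_samples))
    random.Random(rng_seed).shuffle(indices)  # deterministic shuffle
    seed_idx = indices[:n_seed]
    val_idx = indices[n_seed:n_seed + n_val]
    pool_idx = indices[n_seed + n_val:]  # everything left is selectable
    return seed_idx, val_idx, pool_idx

seed_idx, val_idx, pool_idx = split_indices()
print(len(seed_idx), len(val_idx), len(pool_idx))  # 100 200 1200
```

With the sizes stated in the README, the pool ends up with 1200 candidate samples for the agent to choose from.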
54
-
55
- 🔥 **Noise Simulation**
56
- - Some samples have corrupted labels
57
- - Noisy samples also have distorted features
58
- - They appear highly uncertain but are harmful to train on, which creates a realistic trap
59
-
60
- ---
61
-
62
- 🔁 **Interaction Loop**
63
-
64
- `reset()`
65
- - Initializes dataset and model
66
- - Trains model on seed data
67
- - Returns initial observation
68
-
69
- `step(action)`
70
- 1. Normalize strategy weights
71
- 2. Sample batch using:
72
- - uncertainty
73
- - diversity
74
- - random
75
- 3. Train model incrementally (`SGDClassifier.partial_fit`)
76
- 4. Compute reward
77
- 5. Update state
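Steps 1 and 2 of `step(action)` might look like the following sketch. How `sampling.py` actually combines the three strategies is not stated here, so the weighted-sum scoring, the function names `normalize_weights` and `select_batch`, and the per-sample score inputs are all assumptions.

```python
import random

def normalize_weights(weights):
    """Scale non-negative strategy weights so they sum to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        # Assumed fallback: treat all strategies equally.
        return {name: 1.0 / len(weights) for name in weights}
    return {name: w / total for name, w in weights.items()}

def select_batch(uncertainty, diversity, weights, batch_size, rng=None):
    """Return indices of the batch_size highest blended scores.

    uncertainty and diversity are hypothetical per-sample scores in [0, 1];
    the random strategy contributes a uniform random score per sample.
    """
    rng = rng or random.Random(0)
    w = normalize_weights(weights)
    blended = [
        w["uncertainty"] * u + w["diversity"] * d + w["random"] * rng.random()
        for u, d in zip(uncertainty, diversity)
    ]
    ranked = sorted(range(len(blended)), key=lambda i: blended[i], reverse=True)
    return ranked[:batch_size]

weights = {"uncertainty": 2.0, "diversity": 2.0, "random": 0.0}
batch = select_batch([0.9, 0.1, 0.5], [0.2, 0.3, 0.9], weights, batch_size=2)
print(batch)  # [2, 0]
```

A blended score is only one way to mix strategies; drawing a per-strategy share of the batch in proportion to the weights would be an equally plausible reading.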
78
-
79
- ---
80
-
81
- 📊 **Observation Space**
82
-
83
- ```json
84
- {
85
- "remaining_budget": int,
86
- "diversity_score": float,
87
- "noise_estimate": float,
88
- "current_performance": float,
89
- "samples_available": int
90
- }
91
- ```
92
-
93
- ---
94
-
95
- 🎮 **Action Space**
96
-
97
- ```json
98
- {
99
- "action_type": "select_batch | stop",
100
- "batch_size": int,
101
- "strategy_weights": {
102
- "uncertainty": float,
103
- "diversity": float,
104
- "random": float
105
- }
106
- }
107
- ```
108
-
109
- - Weights are normalized internally
110
- - Enables continuous trade-offs, not discrete actions
111
-
112
- ---
113
-
114
- 🏆 **Reward Function**
115
-
116
- The reward combines learning progress with data-quality trade-offs:
117
- - **Positive signal**:
118
- - improvement in model performance
119
- - diverse data selection
120
- - **Negative signal**:
121
- - noisy data
122
- - redundant samples
123
- - excessive cost
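The signs of the terms listed above can be made concrete with a toy formula. This is only a sketch under stated assumptions: the function name `compute_reward`, its inputs, and every coefficient are illustrative placeholders, not the tuned values in `reward.py`.

```python
def compute_reward(perf_before, perf_after, diversity_score,
                   noise_fraction, redundancy_fraction, batch_size,
                   cost_per_sample=0.001):
    """Toy reward: gains minus penalties.

    All coefficients below are illustrative placeholders, not the
    tuned values used by the project.
    """
    improvement = perf_after - perf_before
    bonus = 0.1 * diversity_score
    penalty = (0.2 * noise_fraction
               + 0.1 * redundancy_fraction
               + cost_per_sample * batch_size)
    return improvement + bonus - penalty

clean = compute_reward(0.70, 0.74, diversity_score=0.6,
                       noise_fraction=0.0, redundancy_fraction=0.1,
                       batch_size=20)
noisy = compute_reward(0.70, 0.74, diversity_score=0.6,
                       noise_fraction=0.5, redundancy_fraction=0.1,
                       batch_size=20)
print(clean > noisy)  # True
```

With identical performance gains, the batch containing noisy samples scores lower, which is the trade-off the reward is meant to expose.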
124
-
125
- ---
126
-
127
- 🧪 **Key Learning Dynamics**
128
-
129
- This environment models real ML behaviors:
130
- - 📉 **Diminishing returns** — repeated data gives less benefit
131
- - ⚠️ **Noise trap** — uncertain samples can be misleading
132
- - 🧩 **Diversity importance** — covering more data space improves learning
133
- - 💸 **Budget constraint** — forces efficient decisions
134
-
135
- ---
136
-
137
- 🔥 **What Makes This Challenging**
138
- - No single strategy works
139
- - Uncertainty alone fails due to noise
140
- - Random is safe but suboptimal
141
- - Best performance requires balancing multiple signals
142
-
143
- ---
144
-
145
- 📈 **Expected Behavior**
146
-
147
- | Strategy | Outcome |
148
- | :--- | :--- |
149
- | Balanced | Best performance |
150
- | Random | Moderate |
151
- | Uncertainty-only | Worst (fails on noisy data) |
152
-
153
- ---
154
-
155
- 🛠️ **Tech Stack**
156
- - Python
157
- - NumPy
158
- - scikit-learn (`SGDClassifier`)
159
- - Pydantic (for typed models)
160
-
161
- ---
162
-
163
- 📁 **Project Structure**
164
-
165
- ```
166
- DataSelectEnv/
167
- ├── env.py
168
- ├── models.py
169
- ├── sampling.py
170
- ├── reward.py
171
- ├── test_env.py
172
- ```
173
-
174
- ---
175
-
176
- ▶️ **How to Run**
177
-
178
- ```bash
179
- python test_env.py
180
- ```
181
-
182
- ---
183
-
184
- 📌 **Current Status**
185
- - ✅ Core environment implemented
186
- - ✅ Stable training loop
187
- - ✅ Realistic reward dynamics
188
- - 🔄 Next: tasks + graders + OpenEnv API
189
-
190
- ---
191
-
192
- 🧠 **Key Insight**
193
-
194
- > “We simulate data curation as a sequential decision-making problem where agents must balance uncertainty, diversity, and noise under budget constraints, using real incremental model training.”
195
-
196
- ---
197
-
198
- 👥 **Team Notes**
199
- - Do NOT modify reward aggressively — current balance is tuned
200
- - Focus next on:
201
- - tasks (easy/medium/hard)
202
- - graders
203
- - API + deployment
204
-
205
- ---
206
-
207
- 📌 **Future Work**
208
- - OpenEnv API integration
209
- - Hugging Face Spaces deployment
210
- - Baseline agent evaluation
211
- - Advanced dataset scenarios
212
-
213
- ---
214
-
215
- 🏁 **Goal**
216
-
217
- Build a realistic, non-trivial environment that can be used to evaluate intelligent data selection strategies in ML systems.
 
1
  # DataSelectEnv — OpenEnv Environment for Data Curation in ML Training