Commit a05239b (verified) by gaurv007 · 1 Parent(s): e95d8e3

Upload README.md

Files changed (1)
  1. README.md +162 -33
README.md CHANGED
@@ -6,7 +6,7 @@ tags:
  - worldquant-brain
  ---

- # Alpha Factory

  **LLM-assisted alpha expression generator for WorldQuant BRAIN.**

@@ -14,36 +14,34 @@ tags:

  ## What It Actually Does

- 1. **Picks a theme** from 12 data-driven domains (deterministic gap analysis)
- 2. **Generates a hypothesis** using an LLM (Qwen via HuggingFace Inference API)
- 3. **Compiles to BRAIN expression** (Jinja templates for proven archetypes, LLM fallback for novel)
- 4. **Lints the expression** (validates 71 operators, checks arity, parentheses, look-ahead, coverage)
- 5. **Stores in DuckDB** for review

- ## What Does NOT Work Yet

- - BRAIN API submission (no client connected; manual paste required)
- - Crowd Scout / Performance Surgeon / Gatekeeper personas (stub only)
- - RAG over arXiv papers (stub only)
- - Feedback loop / evolutionary improvement
-
- ## Quickstart

  ```bash
  git clone https://huggingface.co/gaurv007/alpha-factory
  cd alpha-factory
- uv sync
- ```
-
- Create `.env`:
- ```
- HF_TOKEN=hf_your_token_here
  ```

  ### Proven Templates (RECOMMENDED: no LLM, guaranteed valid)

  ```bash
- uv run python -m alpha_factory.run --proven --batch-size 10
  ```

  Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.
@@ -51,28 +49,159 @@ Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is sy
  ### LLM-Assisted Generation

  ```bash
- uv run python -m alpha_factory.run --dry-run --batch-size 5
  ```

- ### Gradio UI

  ```bash
- uv run python -m alpha_factory.ui
  ```

- ## BRAIN Submission Settings

- When pasting expressions into BRAIN:
- - Region: USA | Universe: TOP3000 | Delay: 1 | Decay: 5 | Truncation: 0.08

- ## Field Strategy

- | Tier | Dataset | Density | Notes |
- |------|---------|---------|-------|
- | 1 | model77 | 24 α/field | 5 fields with AC=0 globally |
- | 2 | model16, news12 | 192-385 | Score derivatives |
- | 3 | analyst4, option9, pv13 | 656-822 | Supply chain, PCR |

  ## License

- MIT
 
  - worldquant-brain
  ---

+ # Alpha Factory v0.2.0

  **LLM-assisted alpha expression generator for WorldQuant BRAIN.**

 

  ## What It Actually Does

+ | Mode | What it does | Needs LLM? | Submits to BRAIN? |
+ |------|-------------|-----------|-------------------|
+ | **Proven Templates** (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
+ | **LLM Mode** (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |

+ The pipeline runs:
+ 1. **Theme sampling**: picks under-explored themes from field registry
+ 2. **Expression generation**: deterministic templates OR LLM hypothesis + compile
+ 3. **Static lint**: validates BRAIN syntax (operator arity, look-ahead, parentheses)
+ 4. **Deduplication**: SHA256 hash to avoid duplicates
+ 5. **Store**: persists to DuckDB for review
+ 6. **Crowd Scout**: novelty assessment (LLM or heuristic)
+ 7. **Performance Surgeon**: diagnoses weak alphas, suggests mutations
+ 8. **Gatekeeper**: final go/no-go (LLM)
+ 9. **BRAIN submission**: live submission (requires `BRAIN_SESSION_TOKEN`)
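
The deduplication step in the list above can be sketched as follows. This is a minimal illustration, not the project's code: the function names `expression_hash` and `deduplicate` are hypothetical, and stripping all whitespace before hashing is an assumption about how equivalent expressions might be normalized.

```python
import hashlib
import re


def expression_hash(expression: str) -> str:
    """Return the SHA256 hex digest of an expression with whitespace removed."""
    # Assumption: whitespace is not significant in an expression string.
    normalized = re.sub(r"\s+", "", expression)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(expressions):
    """Keep the first occurrence of each distinct expression hash."""
    seen = set()
    unique = []
    for expr in expressions:
        digest = expression_hash(expr)
        if digest not in seen:
            seen.add(digest)
            unique.append(expr)
    return unique


candidates = [
    "rank(ts_mean(close, 5))",
    "rank( ts_mean(close,5) )",   # same expression, different spacing
    "rank(ts_delta(volume, 10))",
]
unique_candidates = deduplicate(candidates)
print(unique_candidates)
```

Hashing a normalized string keeps the check cheap, but it only catches textual duplicates; two differently written expressions that compute the same signal would still both pass.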

+ ## Quick Start

  ```bash
  git clone https://huggingface.co/gaurv007/alpha-factory
  cd alpha-factory
+ pip install -e ".[all]"
  ```

  ### Proven Templates (RECOMMENDED: no LLM, guaranteed valid)

  ```bash
+ python -m alpha_factory.run --proven --batch-size 10
  ```

  Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.
 
  ### LLM-Assisted Generation

  ```bash
+ # Needs Ollama or HF token
+ python -m alpha_factory.run --batch-size 5 --dry-run
+ ```
+
+ ### Standalone Proven Generator
+
+ ```bash
+ python -m alpha_factory.generate_proven
+ ```
+
+ ### Run Tests
+
+ ```bash
+ pytest tests/ -v
  ```

+ ## BRAIN Setup

+ BRAIN uses **session-based authentication** (browser cookies), not API keys. The session token expires quickly.
+
+ 1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
+ 2. Open browser DevTools → Network tab
+ 3. Run any simulation, look for request to `api.worldquantbrain.com`
+ 4. Copy the `Cookie: session=...` header value
+ 5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`
+
+ **No token = dry-run mode only.** The pipeline will generate expressions but not submit them.
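
The "no token means dry-run" rule can be pictured with a short sketch. The helper `resolve_run_mode` is hypothetical and only shows one way to read `BRAIN_SESSION_TOKEN` from the environment; how the project's client actually consumes the token is not shown in this commit.

```python
import os


def resolve_run_mode(env=None):
    """Return ("submit", token) when a session token is set, else ("dry-run", None)."""
    env = os.environ if env is None else env
    token = env.get("BRAIN_SESSION_TOKEN", "").strip()
    if token:
        return ("submit", token)
    return ("dry-run", None)


mode, _token = resolve_run_mode()
print(f"pipeline mode: {mode}")
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.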
+
+ ## Architecture
+
+ ```
+ Theme Sampler → Expression Generation → Static Lint → Dedup → Store
+ ↓ (Templates or LLM) ↓ ↓
+ Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
+ ↓ (iteration queue)
+ Winner Memory ← Mutator ← Performance Surgeon
+ ```
+
+ ### Components
+
+ | Module | Purpose | Status |
+ |--------|---------|--------|
+ | `proven_templates.py` | Deterministic expression generation | ✅ Working |
+ | `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
+ | `pipeline.py` | Orchestrates all stages | ✅ Working |
+ | `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
+ | `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
+ | `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
+ | `gatekeeper.py` | Final go/no-go memo | ✅ Working |
+ | `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
+ | `brain_sim.py` | Local numpy backtest | ⚠️ Not wired to pipeline |
+ | `regime_tagger.py` | Vol/trend/rate/style regimes | ⚠️ Not wired to pipeline |
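
To make the `lint.py` row concrete, here is a rough sketch of two of the checks it names: balanced parentheses and operator arity. It is not the project's linter. The `ARITY` table is an illustrative subset (the 2-argument entries follow the changelog's note on `ts_mean`, `ts_std` and `ts_delta`), and the helper names are hypothetical.

```python
import re

# Illustrative subset only; the real operator list is much larger.
ARITY = {"rank": 1, "ts_mean": 2, "ts_std": 2, "ts_delta": 2}


def balanced(expr):
    """True when every ')' closes an earlier '(' and none are left open."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def split_args(body):
    """Split a call's argument string on top-level commas."""
    args, depth, current = [], 0, []
    for ch in body:
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        current.append(ch)
    tail = "".join(current).strip()
    if tail:
        args.append(tail)
    return args


def lint(expr):
    """Return a list of error strings; an empty list means no issue was found."""
    if not balanced(expr):
        return ["unbalanced parentheses"]
    errors = []
    for match in re.finditer(r"([A-Za-z_]\w*)\s*\(", expr):
        name, start = match.group(1), match.end()
        depth, i = 1, start
        while depth:  # walk to the matching ')'; safe because expr is balanced
            if expr[i] == "(":
                depth += 1
            elif expr[i] == ")":
                depth -= 1
            i += 1
        args = split_args(expr[start:i - 1])
        expected = ARITY.get(name)
        if expected is None:
            errors.append(f"unknown operator: {name}")
        elif len(args) != expected:
            errors.append(f"{name} expects {expected} args, got {len(args)}")
    return errors


print(lint("rank(ts_mean(close, 5))"))
print(lint("rank(ts_mean(close))"))
```

A check like this only validates shape. It cannot tell whether a field name exists on BRAIN, which is why hallucinated fields can still get through.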
+
+ ## Key Features
+
+ - **Proven template mode**: No LLM required. Generates valid BRAIN expressions instantly.
+ - **Field registry**: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
+ - **Novel group keys**: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
+ - **Static lint**: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
+ - **Kill switches**: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
+ - **Winner memory**: Tracks which field/archetype combinations work, feeds back to generation.
+ - **Expression mutator**: Auto-generates decay, horizon, neutralization, and sign-flip variants.
+ - **DuckDB store**: Persistent history of all alphas, metrics, and verdicts.
+ - **Retry logic**: LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff.
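
The retry behaviour in the last bullet can be sketched roughly like this. The exception class `LLMRequestError`, the helper `call_with_retry`, and the delay schedule (1s, 2s, 4s) are assumptions made for illustration; only the set of retryable status codes comes from the README.

```python
import time

RETRYABLE_STATUS = {429, 502, 503, 504}  # status codes named in the README


class LLMRequestError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"request failed with status {status}")
        self.status = status


def call_with_retry(request, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request(); retry transient failures, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return request()
        except (LLMRequestError, TimeoutError) as exc:
            status = getattr(exc, "status", None)
            retryable = isinstance(exc, TimeoutError) or status in RETRYABLE_STATUS
            # Non-transient errors (for example 401) and the final attempt re-raise.
            if not retryable or attempt == max_attempts - 1:
                raise
        sleep(base_delay * 2 ** attempt)


attempts = {"count": 0}


def flaky_request():
    """Fail twice with a 503, then succeed."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LLMRequestError(503)
    return "ok"


delays = []
result = call_with_retry(flaky_request, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` lets the example record the delays instead of waiting. Raising immediately on non-retryable statuses is what avoids repeated calls on auth failures.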
+
+ ## Known Limitations
+
+ 1. **BRAIN auth is session-based**: Token expires. No automatic refresh. You must re-copy from browser.
+ 2. **Local simulation is not wired**: `brain_sim.py` exists but is not integrated into the pipeline. It needs price data (yfinance) and produces approximate results.
+ 3. **Regime tagger not wired**: `regime_tagger.py` exists but is not used by the Performance Surgeon.
+ 4. **LLM generation can hallucinate fields**: Static lint catches most errors, but field names from LLMs may not exist on BRAIN.
+ 5. **Weights inside `rank()` do not carry through linearly**: `rank(0.6*a + 0.4*b)` keeps only the cross-sectional ordering of the weighted sum, so the output is not a 60/40 blend of the inputs. The weights matter only where they change that ordering; most of the signal comes from which fields are combined.
+ 6. **Not a guarantee of profitable alphas**: This generates candidates. BRAIN's simulation is the ground truth.
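
A small numeric check helps show how far limitation 5 goes. The `rank` function below is a plain ordinal rank written for this example (BRAIN's own `rank` operator scales its output differently), and the input values are made up. It shows that a common positive scale factor on the weights leaves the ranks unchanged, while changing the ratio between the weights can still reorder them.

```python
def rank(values):
    """Ordinal rank, 1 = smallest. A stand-in for a cross-sectional rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


def blend(weight_a, weight_b, a, b):
    """Element-wise weighted sum of two equally long lists."""
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


a = [1, 2, 3]  # made-up values for three instruments
b = [3, 1, 2]

base = rank(blend(0.6, 0.4, a, b))
scaled = rank(blend(6, 4, a, b))        # same ratio, ten times larger
tilted = rank(blend(0.9, 0.1, a, b))    # different ratio

print(base, scaled, tilted)
```

So only the ratio of the coefficients can matter, and only when it flips the order of at least one pair of instruments.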
+
+ ## Configuration
+
+ All settings in `alpha_factory/config.py`. Key ones:
+
+ ```python
+ batch_size = 10 # Candidates per run
+ use_proven_templates = False # Set True for deterministic mode
+ enable_brain_client = False # Set True for live BRAIN submission
+ max_parallel_candidates = 3 # Concurrent LLM calls
+ ```
+
+ Or via CLI:
  ```bash
+ python -m alpha_factory.run --proven --batch-size 10 --enable-brain
  ```
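
As an illustration of what a `max_parallel_candidates` limit means in practice, the sketch below caps concurrent work with an `asyncio.Semaphore`. The names `process_batch` and `fake_llm_call` are hypothetical, and the worker only sleeps briefly in place of a real LLM call; the project's pipeline may be organized differently.

```python
import asyncio


async def process_batch(candidates, worker, max_parallel_candidates=3):
    """Run worker(candidate) for every candidate, at most N at a time."""
    semaphore = asyncio.Semaphore(max_parallel_candidates)

    async def guarded(candidate):
        async with semaphore:  # blocks while N workers are already running
            return await worker(candidate)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(c) for c in candidates))


async def demo():
    state = {"active": 0, "peak": 0}

    async def fake_llm_call(candidate):
        # Stand-in for an LLM request; tracks how many calls overlap.
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)
        state["active"] -= 1
        return f"processed:{candidate}"

    results = await process_batch(list(range(7)), fake_llm_call)
    return results, state["peak"]


results, peak = asyncio.run(demo())
print(results[:2], "peak concurrency:", peak)
```

A semaphore keeps the batch code simple: every candidate is scheduled up front, and the limit is enforced at the point where the expensive call happens.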

+ ## Cost

+ | Item | Cost |
+ |------|------|
+ | Proven template mode | $0 (no LLM) |
+ | Local Ollama (7B) | $0 (your GPU) |
+ | HuggingFace Inference API | Free tier / rate-limited |
+ | BRAIN submissions | $0 (uses your existing BRAIN credits) |

+ ## File Structure
+
+ ```
+ alpha_factory/
+ ├── config.py                  # All settings (Pydantic v2)
+ ├── run.py                     # Entry point
+ ├── schemas/                   # Typed Pydantic contracts
+ ├── deterministic/
+ │   ├── lint.py                # Static pre-flight (Layer 2)
+ │   ├── theme_sampler.py       # Gap analysis (Layer 1)
+ │   ├── fitness.py             # Composite scoring
+ │   ├── proven_templates.py    # Deterministic generation
+ │   ├── expression_mutator.py  # Evolutionary variants
+ │   └── regime_tagger.py       # Vol/trend/rate/style regimes (not wired)
+ ├── infra/
+ │   ├── model_manager.py       # Ollama + HF auto-detection
+ │   ├── llm_client.py          # Unified LLM interface (token budget + retry)
+ │   ├── factor_store.py        # DuckDB persistence
+ │   ├── wq_client.py           # BRAIN API wrapper (session auth)
+ │   └── winner_memory.py       # Feedback loop
+ ├── local/
+ │   └── brain_sim.py           # Local BRAIN simulator (not wired)
+ ├── personas/
+ │   ├── hypothesis_hunter.py   # Persona 1
+ │   ├── expression_compiler.py # Persona 2
+ │   ├── crowd_scout.py         # Persona 4
+ │   ├── performance_surgeon.py # Persona 5
+ │   └── gatekeeper.py          # Persona 6
+ └── orchestration/
+     └── pipeline.py            # Full DAG
+ ```

+ ## Changelog v0.2.0
+
+ - **Fixed**: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
+ - **Fixed**: `operators.csv` arity mismatches (ts_mean, ts_std, ts_delta now correctly listed as 2-arg)
+ - **Fixed**: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
+ - **Fixed**: `personas/__init__.py` imports real modules instead of stubs
+ - **Fixed**: `infra/__init__.py` imports real `BrainClient` instead of stub class
+ - **Fixed**: Expression compiler sign logic: now per-component, no global blind negation
+ - **Fixed**: LLM client stops error amplification (no more 3x API calls on auth/network failures)
+ - **Fixed**: LLM client enforces token budget (was declared but never checked)
+ - **Fixed**: LLM client adds retry logic with exponential backoff for transient failures (429, 502, 503, 504, timeout)
+ - **Fixed**: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag (local sim exists but is not wired)
+ - **Fixed**: Removed orphan `rag.py` (arXiv retrieval not wired, will be re-added when integrated)
+ - **Fixed**: Added missing `local/__init__.py` for proper package structure
+ - **Fixed**: Added GitHub Actions CI workflow (`.github/workflows/ci.yml`)
+ - **New**: Proven template mode (`--proven`) generates expressions without any LLM
+ - **New**: Winner memory integration in pipeline (records winners/failures/iterations)
+ - **New**: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
+ - **New**: Parallel batch processing with `max_parallel_candidates` semaphore
+ - **New**: 32 comprehensive tests covering templates, lint, mutations, config, fitness, fields, groups
+ - **Updated**: Honest README that accurately describes what works and what doesn't

  ## License

+ MIT. Use at your own risk. This is not financial advice. BRAIN simulations are the ground truth.