- worldquant-brain
---

# Alpha Factory v0.2.0

**LLM-assisted alpha expression generator for WorldQuant BRAIN.**

## What It Actually Does

| Mode | What it does | Needs LLM? | Submits to BRAIN? |
|------|-------------|-----------|-------------------|
| **Proven Templates** (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
| **LLM Mode** (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |

The pipeline runs:

1. **Theme sampling**: picks under-explored themes from the field registry
2. **Expression generation**: deterministic templates OR LLM hypothesis + compile
3. **Static lint**: validates BRAIN syntax (operator arity, look-ahead, parentheses)
4. **Deduplication**: SHA256 hash to avoid duplicates
5. **Store**: persists to DuckDB for review
6. **Crowd Scout**: novelty assessment (LLM or heuristic)
7. **Performance Surgeon**: diagnoses weak alphas, suggests mutations
8. **Gatekeeper**: final go/no-go (LLM)
9. **BRAIN submission**: live submission (requires `BRAIN_SESSION_TOKEN`)
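
Steps 3 and 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's `lint.py`: `ARITY`, `split_args`, `lint`, and `expression_hash` are hypothetical names, the 2-argument entries follow the changelog note on `ts_mean`, `ts_std`, and `ts_delta`, the 1-argument `rank` is assumed, and stripping whitespace before hashing is an assumed normalization. The look-ahead check is not shown.

```python
import hashlib
import re

# Hypothetical arity table: 2-arg entries follow the changelog, rank is assumed 1-arg.
ARITY = {"ts_mean": 2, "ts_std": 2, "ts_delta": 2, "rank": 1}


def split_args(arg_text: str) -> list[str]:
    """Split a call's argument text on top-level commas only."""
    args, depth, current = [], 0, []
    for ch in arg_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    args.append("".join(current).strip())
    return [a for a in args if a]


def lint(expr: str) -> list[str]:
    """Return a list of problems; an empty list means both checks passed."""
    depth = 0
    for ch in expr:
        depth += ch == "("
        depth -= ch == ")"
        if depth < 0:  # a ')' appeared before its '('
            break
    if depth != 0:
        return ["unbalanced parentheses"]

    errors = []
    for match in re.finditer(r"([A-Za-z_]\w*)\(", expr):
        name = match.group(1)
        if name not in ARITY:
            continue
        # Walk forward to the close paren that matches this call's open paren.
        start, end, level = match.end(), match.end(), 1
        while level:
            level += expr[end] == "("
            level -= expr[end] == ")"
            end += 1
        n_args = len(split_args(expr[start:end - 1]))
        if n_args != ARITY[name]:
            errors.append(f"{name} expects {ARITY[name]} args, got {n_args}")
    return errors


def expression_hash(expr: str) -> str:
    """SHA256 of the whitespace-stripped expression, used as a dedup key."""
    normalized = re.sub(r"\s+", "", expr)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Keep only candidates that pass lint and have not been seen before.
seen = set()
for candidate in ["rank(ts_mean(close, 5))", "rank( ts_mean(close,5) )", "ts_delta(close)"]:
    key = expression_hash(candidate)
    if lint(candidate) or key in seen:
        continue
    seen.add(key)
```

Here the second candidate is dropped as a duplicate of the first and the third fails the arity check, so only one expression survives.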

## Quick Start

```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
pip install -e ".[all]"
```

### Proven Templates (RECOMMENDED: no LLM, guaranteed valid)

```bash
python -m alpha_factory.run --proven --batch-size 10
```

Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.

### LLM-Assisted Generation

```bash
# Needs Ollama or HF token
python -m alpha_factory.run --batch-size 5 --dry-run
```

### Standalone Proven Generator

```bash
python -m alpha_factory.generate_proven
```

### Run Tests

```bash
pytest tests/ -v
```

## BRAIN Setup

BRAIN uses **session-based authentication** (browser cookies), not API keys. The session token expires quickly.

1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
2. Open browser DevTools → Network tab
3. Run any simulation and look for a request to `api.worldquantbrain.com`
4. Copy the `Cookie: session=...` header value
5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`

**No token = dry-run mode only.** The pipeline will generate expressions but not submit them.
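
The fallback rule can be illustrated with a small check on the environment. This is only a sketch: `resolve_mode` is a hypothetical helper, not a function from the repository, and the only name taken from this README is the `BRAIN_SESSION_TOKEN` variable.

```python
import os


def resolve_mode(env=None) -> str:
    """Return "live" when a session token is set, otherwise "dry-run"."""
    env = os.environ if env is None else env
    token = env.get("BRAIN_SESSION_TOKEN", "").strip()
    return "live" if token else "dry-run"


if __name__ == "__main__":
    print(f"BRAIN submission mode: {resolve_mode()}")
```

Passing a plain dict instead of `os.environ` makes the rule easy to test without touching the real environment.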

## Architecture

```
Theme Sampler → Expression Generation → Static Lint → Dedup → Store
      ↓ (Templates or LLM)                  ↓                    ↓
Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
      ↓ (iteration queue)
Winner Memory → Mutator → Performance Surgeon
```
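
To give a feel for what the Mutator stage produces, here is a rough sketch of two variant types, sign-flip and horizon. It is an illustration under assumptions, not the code in `expression_mutator.py`: `sign_flip`, `horizon_variants`, and `mutate` are hypothetical names, the horizons `(5, 20, 60)` are arbitrary, and the regex assumes the look-back window is an integer written as the last argument of a call. Decay and neutralization variants are not shown.

```python
import re

# Matches an integer that is the final argument of a call, e.g. ", 5)".
WINDOW = re.compile(r",\s*(\d+)\s*\)")


def sign_flip(expr: str) -> str:
    """Negate the whole expression."""
    return f"-({expr})"


def horizon_variants(expr: str, horizons=(5, 20, 60)) -> list[str]:
    """Rewrite every trailing integer window to each candidate horizon."""
    variants = []
    for h in horizons:
        candidate = WINDOW.sub(f", {h})", expr)
        if candidate != expr and candidate not in variants:
            variants.append(candidate)
    return variants


def mutate(expr: str) -> list[str]:
    """Return the sign-flipped expression followed by its horizon variants."""
    return [sign_flip(expr)] + horizon_variants(expr)


variants = mutate("rank(ts_mean(close, 5))")
```

Note that this simple version sets all windows in an expression to the same horizon; varying them independently would need a per-call rewrite.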

### Components

| Module | Purpose | Status |
|--------|---------|--------|
| `proven_templates.py` | Deterministic expression generation | ✅ Working |
| `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
| `pipeline.py` | Orchestrates all stages | ✅ Working |
| `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
| `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
| `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
| `gatekeeper.py` | Final go/no-go memo | ✅ Working |
| `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
| `brain_sim.py` | Local numpy backtest | ⚠️ Not wired to pipeline |
| `regime_tagger.py` | Vol/trend/rate/style regimes | ⚠️ Not wired to pipeline |

## Key Features

- **Proven template mode**: No LLM required. Generates valid BRAIN expressions instantly.
- **Field registry**: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
- **Novel group keys**: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
- **Static lint**: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
- **Kill switches**: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
- **Winner memory**: Tracks which field/archetype combinations work, feeds back to generation.
- **Expression mutator**: Auto-generates decay, horizon, neutralization, and sign-flip variants.
- **DuckDB store**: Persistent history of all alphas, metrics, and verdicts.
- **Retry logic**: LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff.
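
The retry behaviour in the last bullet can be sketched as follows. This is a generic illustration of exponential backoff, not the code in `llm_client.py`: `HTTPStatusError` and `call_with_retry` are hypothetical names, and the attempt count and base delay are assumed values. Only the list of retryable status codes comes from this README.

```python
import time

RETRYABLE_STATUS = {429, 502, 503, 504}


class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), waiting base_delay * 2**attempt between transient failures."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except (HTTPStatusError, TimeoutError) as exc:
            transient = isinstance(exc, TimeoutError) or exc.status in RETRYABLE_STATUS
            # Non-transient errors (e.g. 401) and the final attempt are re-raised.
            if not transient or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Usage: two transient failures, then success. Delays are recorded, not slept.
delays = []
responses = iter([HTTPStatusError(503), TimeoutError(), "ok"])


def flaky():
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_retry(flaky, sleep=delays.append)
```

Re-raising non-transient errors immediately keeps an authentication failure from being retried, which is the behaviour the changelog describes.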

## Known Limitations

1. **BRAIN auth is session-based**: Token expires. No automatic refresh. You must re-copy from browser.
2. **Local simulation is not wired**: `brain_sim.py` exists but is not integrated into the pipeline. It needs price data (yfinance) and produces approximate results.
3. **Regime tagger not wired**: `regime_tagger.py` exists but is not used by the Performance Surgeon.
4. **LLM generation can hallucinate fields**: Static lint catches most errors, but field names from LLMs may not exist on BRAIN.
5. **Weights inside `rank()` are mostly decorative**: `rank()` keeps only the ordering, so `rank(0.6*a + 0.4*b)` equals `rank(6*a + 4*b)`. Only the ratio of the coefficients can change the ordering, not their scale. The signal comes mainly from which fields are combined.
6. **Not a guarantee of profitable alphas**: This generates candidates. BRAIN's simulation is the ground truth.
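
The point about weights inside `rank()` can be checked numerically. The sketch below uses a plain-Python `rank` that maps values to [0, 1] by sorted position; this is an assumed stand-in for BRAIN's operator, and the sample values are made up. It shows that rescaling the weights leaves the ranks unchanged, while changing their ratio can reorder them.

```python
def rank(values):
    """Rank by sorted position, scaled to [0, 1]; ties are not handled specially."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position / (len(values) - 1)
    return ranks


# Made-up field values for four instruments.
a = [0.1, 0.9, 0.4, 0.7]
b = [0.8, 0.2, 0.5, 0.1]

blend = [0.6 * x + 0.4 * y for x, y in zip(a, b)]
scaled = [6 * x + 4 * y for x, y in zip(a, b)]        # same 3:2 ratio, 10x the scale
swapped = [0.4 * x + 0.6 * y for x, y in zip(a, b)]   # different ratio

same_after_scaling = rank(blend) == rank(scaled)
same_after_swapping = rank(blend) == rank(swapped)
```

With these inputs `same_after_scaling` is `True` and `same_after_swapping` is `False`.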

## Configuration

All settings live in `alpha_factory/config.py`. Key ones:

```python
batch_size = 10                    # Candidates per run
use_proven_templates = False       # Set True for deterministic mode
enable_brain_client = False        # Set True for live BRAIN submission
max_parallel_candidates = 3        # Concurrent LLM calls
```

Or via CLI:

```bash
python -m alpha_factory.run --proven --batch-size 10 --enable-brain
```
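
One plausible way the CLI flags could map onto the config fields is sketched below. This is an assumption-laden illustration, not the contents of `run.py`: the `Settings` dataclass stands in for the project's Pydantic model, `parse_settings` is a hypothetical helper, and the flag-to-field mapping is inferred from the names used in this README.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Settings:
    """Stand-in for the Pydantic settings model; defaults mirror the README."""
    batch_size: int = 10
    use_proven_templates: bool = False
    enable_brain_client: bool = False
    max_parallel_candidates: int = 3


def parse_settings(argv=None) -> Settings:
    """Translate CLI flags into a Settings object (assumed mapping)."""
    parser = argparse.ArgumentParser(prog="alpha_factory.run")
    parser.add_argument("--proven", action="store_true",
                        help="use deterministic proven templates")
    parser.add_argument("--batch-size", type=int, default=Settings.batch_size,
                        help="candidates per run")
    parser.add_argument("--enable-brain", action="store_true",
                        help="submit to BRAIN instead of dry-run")
    args = parser.parse_args(argv)
    return Settings(
        batch_size=args.batch_size,
        use_proven_templates=args.proven,
        enable_brain_client=args.enable_brain,
    )


settings = parse_settings(["--proven", "--batch-size", "10", "--enable-brain"])
```

Flags that are omitted fall back to the dataclass defaults, so running with no arguments yields LLM mode without BRAIN submission.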

## Cost

| Item | Cost |
|------|------|
| Proven template mode | $0 (no LLM) |
| Local Ollama (7B) | $0 (your GPU) |
| HuggingFace Inference API | Free tier / rate-limited |
| BRAIN submissions | $0 (uses your existing BRAIN credits) |

## File Structure

```
alpha_factory/
├── config.py                    # All settings (Pydantic v2)
├── run.py                       # Entry point
├── schemas/                     # Typed Pydantic contracts
├── deterministic/
│   ├── lint.py                  # Static pre-flight (Layer 2)
│   ├── theme_sampler.py         # Gap analysis (Layer 1)
│   ├── fitness.py               # Composite scoring
│   ├── proven_templates.py      # Deterministic generation
│   ├── expression_mutator.py    # Evolutionary variants
│   └── regime_tagger.py         # Vol/trend/rate/style regimes (not wired)
├── infra/
│   ├── model_manager.py         # Ollama + HF auto-detection
│   ├── llm_client.py            # Unified LLM interface (token budget + retry)
│   ├── factor_store.py          # DuckDB persistence
│   ├── wq_client.py             # BRAIN API wrapper (session auth)
│   └── winner_memory.py         # Feedback loop
├── local/
│   └── brain_sim.py             # Local BRAIN simulator (not wired)
├── personas/
│   ├── hypothesis_hunter.py     # Persona 1
│   ├── expression_compiler.py   # Persona 2
│   ├── crowd_scout.py           # Persona 4
│   ├── performance_surgeon.py   # Persona 5
│   └── gatekeeper.py            # Persona 6
└── orchestration/
    └── pipeline.py              # Full DAG
```

## Changelog v0.2.0

- **Fixed**: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
- **Fixed**: `operators.csv` arity mismatches (ts_mean, ts_std, ts_delta now correctly listed as 2-arg)
- **Fixed**: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
- **Fixed**: `personas/__init__.py` imports real modules instead of stubs
- **Fixed**: `infra/__init__.py` imports real `BrainClient` instead of stub class
- **Fixed**: Expression compiler sign logic is now per-component, with no global blind negation
- **Fixed**: LLM client stops error amplification (no more 3x API calls on auth/network failures)
- **Fixed**: LLM client enforces token budget (was declared but never checked)
- **Fixed**: LLM client adds retry logic with exponential backoff for transient failures (429, 502, 503, 504, timeout)
- **Fixed**: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag (local sim exists but is not wired)
- **Fixed**: Removed orphan `rag.py` (arXiv retrieval not wired, will be re-added when integrated)
- **Fixed**: Added missing `local/__init__.py` for proper package structure
- **Fixed**: Added GitHub Actions CI workflow (`.github/workflows/ci.yml`)
- **New**: Proven template mode (`--proven`) generates expressions without any LLM
- **New**: Winner memory integration in pipeline (records winners/failures/iterations)
- **New**: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
- **New**: Parallel batch processing with `max_parallel_candidates` semaphore
- **New**: 32 comprehensive tests covering templates, lint, mutations, config, fitness, fields, groups
- **Updated**: Honest README that accurately describes what works and what doesn't

## License

MIT. Use at your own risk. This is not financial advice. BRAIN simulations are the ground truth.