---
license: mit
tags:
- quantitative-finance
- alpha-generation
- worldquant-brain
- ml-intern
---

# Alpha Factory v0.2.0

**LLM-assisted alpha expression generator for WorldQuant BRAIN.**

> This is a Python application, not a model. It generates candidate BRAIN expressions for manual review and submission.

## What It Actually Does

| Mode | What it does | Needs LLM? | Submits to BRAIN? |
|------|-------------|-----------|-------------------|
| **Proven Templates** (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
| **LLM Mode** (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |

The pipeline runs:
1. **Theme sampling** — picks under-explored themes from field registry
2. **Expression generation** — deterministic templates OR LLM hypothesis + compile
3. **Static lint** — validates BRAIN syntax (operator arity, look-ahead, parentheses)
4. **Deduplication** — SHA256 hash to avoid duplicates
5. **Store** — persists to DuckDB for review
6. **Crowd Scout** — novelty assessment (LLM or heuristic)
7. **Performance Surgeon** — diagnoses weak alphas, suggests mutations
8. **Gatekeeper** — final go/no-go (LLM)
9. **BRAIN submission** — live submission (requires `BRAIN_SESSION_TOKEN`)
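
A minimal sketch of the lint (step 3) and dedup (step 4) checks. The parenthesis-balance rule and whitespace normalization here are illustrative assumptions, not the actual logic in `lint.py`:

```python
import hashlib

def balanced_parens(expr: str) -> bool:
    # One of the static lint checks: parentheses must balance and
    # never close below depth zero.
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def dedup_key(expr: str) -> str:
    # SHA-256 of a whitespace-normalized expression, so trivially
    # reformatted duplicates hash identically (normalization rule assumed).
    normalized = "".join(expr.split())
    return hashlib.sha256(normalized.encode()).hexdigest()
```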

## Quick Start

```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
pip install -e ".[all]"
```

### Proven Templates (RECOMMENDED — no LLM, guaranteed valid)

```bash
python -m alpha_factory.run --proven --batch-size 10
```

Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.
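
The idea can be sketched as filling a fixed, pre-validated skeleton with a novel field. The template string and field name below are hypothetical placeholders, not the actual Alpha 15/6 structures shipped in `proven_templates.py`:

```python
# Hypothetical skeleton; the real templates differ.
TEMPLATE = "rank(ts_delta({field}, {horizon}))"

def fill(field: str, horizon: int) -> str:
    # Every output is syntactically valid because the skeleton is fixed;
    # only the field and horizon vary.
    return TEMPLATE.format(field=field, horizon=horizon)
```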

### LLM-Assisted Generation

```bash
# Needs Ollama or HF token
python -m alpha_factory.run --batch-size 5 --dry-run
```

### Standalone Proven Generator

```bash
python -m alpha_factory.generate_proven
```

### Run Tests

```bash
pytest tests/ -v
```

## BRAIN Setup

BRAIN uses **session-based authentication** (browser cookies), not API keys. The session token expires quickly.

1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
2. Open browser DevTools → Network tab
3. Run any simulation, look for request to `api.worldquantbrain.com`
4. Copy the `Cookie: session=...` header value
5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`

**No token = dry-run mode only.** The pipeline will generate expressions but not submit them.
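
How the token might be attached to outgoing requests (a sketch only; the exact header handling in `wq_client.py` may differ):

```python
import os

def brain_headers() -> dict:
    # Read the session token from the environment; without it the
    # pipeline can only run in dry-run mode.
    token = os.environ.get("BRAIN_SESSION_TOKEN")
    if token is None:
        raise RuntimeError("No BRAIN_SESSION_TOKEN set: dry-run mode only")
    return {"Cookie": f"session={token}"}
```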

## Architecture

```
Theme Sampler → Expression Generation → Static Lint → Dedup → Store
      ↓          (Templates or LLM)          ↓          ↓
Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
                    ↓ (iteration queue)
Winner Memory ← Mutator ← Performance Surgeon
```

### Components

| Module | Purpose | Status |
|--------|---------|--------|
| `proven_templates.py` | Deterministic expression generation | ✅ Working |
| `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
| `pipeline.py` | Orchestrates all stages | ✅ Working |
| `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
| `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
| `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
| `gatekeeper.py` | Final go/no-go memo | ✅ Working |
| `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
| `brain_sim.py` | Local numpy backtest | ⚠️ Not wired to pipeline |
| `regime_tagger.py` | Vol/trend/rate/style regimes | ⚠️ Not wired to pipeline |

## Key Features

- **Proven template mode**: No LLM required. Generates valid BRAIN expressions instantly.
- **Field registry**: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
- **Novel group keys**: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
- **Static lint**: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
- **Kill switches**: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
- **Winner memory**: Tracks which field/archetype combinations work, feeds back to generation.
- **Expression mutator**: Auto-generates decay, horizon, neutralization, and sign-flip variants.
- **DuckDB store**: Persistent history of all alphas, metrics, and verdicts.
- **Retry logic**: LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff.
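
The retry behavior described above can be sketched like this; the status-code set matches the list above, but the function shape is illustrative, not the actual `llm_client.py` implementation:

```python
import time

# Transient HTTP failures worth retrying (from the feature list above).
TRANSIENT = {429, 502, 503, 504}

def call_with_retry(request_fn, max_retries: int = 3, base_delay: float = 1.0):
    """Retry transient failures with exponential backoff (sketch)."""
    for attempt in range(max_retries + 1):
        status, body = request_fn()
        if status not in TRANSIENT:
            return status, body
        if attempt < max_retries:
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    return status, body  # out of retries; surface the last response
```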

## Known Limitations

1. **BRAIN auth is session-based**: Token expires. No automatic refresh. You must re-copy from browser.
2. **Local simulation is not wired**: `brain_sim.py` exists but is not integrated into the pipeline. It needs price data (yfinance) and produces approximate results.
3. **Regime tagger not wired**: `regime_tagger.py` exists but is not used by the Performance Surgeon.
4. **LLM generation can hallucinate fields**: Static lint catches most errors, but field names from LLMs may not exist on BRAIN.
5. **Weights inside `rank()` are decorative**: `rank()` is monotonic, so the coefficients in `rank(0.6*a + 0.4*b)` don't carry through linearly. The signal comes from which fields are combined.
6. **Not a guarantee of profitable alphas**: This generates candidates. BRAIN's simulation is the ground truth.

## Configuration

All settings in `alpha_factory/config.py`. Key ones:

```python
batch_size = 10                     # Candidates per run
use_proven_templates = False        # Set True for deterministic mode
enable_brain_client = False         # Set True for live BRAIN submission
max_parallel_candidates = 3         # Concurrent LLM calls
```

Or via CLI:
```bash
python -m alpha_factory.run --proven --batch-size 10 --enable-brain
```

## Cost

| Item | Cost |
|------|------|
| Proven template mode | $0 (no LLM) |
| Local Ollama (7B) | $0 (your GPU) |
| HuggingFace Inference API | Free tier / rate-limited |
| BRAIN submissions | $0 (uses your existing BRAIN credits) |

## File Structure

```
alpha_factory/
├── config.py                  # All settings (Pydantic v2)
├── run.py                     # Entry point
├── schemas/                   # Typed Pydantic contracts
├── deterministic/
│   ├── lint.py                # Static pre-flight (Layer 2)
│   ├── theme_sampler.py       # Gap analysis (Layer 1)
│   ├── fitness.py             # Composite scoring
│   ├── proven_templates.py    # Deterministic generation
│   ├── expression_mutator.py  # Evolutionary variants
│   └── regime_tagger.py       # Vol/trend/rate/style regimes (not wired)
├── infra/
│   ├── model_manager.py       # Ollama + HF auto-detection
│   ├── llm_client.py          # Unified LLM interface (token budget + retry)
│   ├── factor_store.py        # DuckDB persistence
│   ├── wq_client.py           # BRAIN API wrapper (session auth)
│   └── winner_memory.py       # Feedback loop
├── local/
│   └── brain_sim.py           # Local BRAIN simulator (not wired)
├── personas/
│   ├── hypothesis_hunter.py   # Persona 1
│   ├── expression_compiler.py # Persona 2
│   ├── crowd_scout.py         # Persona 4
│   ├── performance_surgeon.py # Persona 5
│   └── gatekeeper.py          # Persona 6
└── orchestration/
    └── pipeline.py            # Full DAG
```

## Changelog v0.2.0

- **Fixed**: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
- **Fixed**: `operators.csv` arity mismatches (ts_mean, ts_std, ts_delta now correctly listed as 2-arg)
- **Fixed**: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
- **Fixed**: `personas/__init__.py` imports real modules instead of stubs
- **Fixed**: `infra/__init__.py` imports real `BrainClient` instead of stub class
- **Fixed**: Expression compiler sign logic — now per-component, no global blind negation
- **Fixed**: LLM client stops error amplification (no more 3x API calls on auth/network failures)
- **Fixed**: LLM client enforces token budget (was declared but never checked)
- **Fixed**: LLM client adds retry logic with exponential backoff for transient failures (429, 502, 503, 504, timeout)
- **Fixed**: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag (local sim exists but is not wired)
- **Fixed**: Removed orphan `rag.py` (arXiv retrieval not wired, will be re-added when integrated)
- **Fixed**: Added missing `local/__init__.py` for proper package structure
- **Fixed**: Added GitHub Actions CI workflow (`.github/workflows/ci.yml`)
- **New**: Proven template mode (`--proven`) generates expressions without any LLM
- **New**: Winner memory integration in pipeline (records winners/failures/iterations)
- **New**: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
- **New**: Parallel batch processing with `max_parallel_candidates` semaphore
- **New**: 32 comprehensive tests covering templates, lint, mutations, config, fitness, fields, groups
- **Updated**: Honest README that accurately describes what works and what doesn't

## License

MIT — use at your own risk. This is not financial advice. BRAIN simulations are the ground truth.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
