---
license: mit
tags:
- quantitative-finance
- alpha-generation
- worldquant-brain
- ml-intern
---
# Alpha Factory v0.2.0
**LLM-assisted alpha expression generator for WorldQuant BRAIN.**
> This is a Python application, not a model. It generates candidate BRAIN expressions for manual review and submission.
## What It Actually Does
| Mode | What it does | Needs LLM? | Submits to BRAIN? |
|------|-------------|-----------|-------------------|
| **Proven Templates** (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
| **LLM Mode** (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |
The pipeline runs:
1. **Theme sampling**: picks under-explored themes from the field registry
2. **Expression generation**: deterministic templates OR LLM hypothesis + compile
3. **Static lint**: validates BRAIN syntax (operator arity, look-ahead, parentheses)
4. **Deduplication**: SHA256 hash to avoid duplicates
5. **Store**: persists to DuckDB for review
6. **Local sim**: lightweight numpy backtest as triage (lenient thresholds, never blocks)
7. **Acceptance checklist**: 14-point pre-submission gate
8. **Crowd Scout**: novelty assessment (LLM or heuristic)
9. **BRAIN submission**: live submission (requires `BRAIN_SESSION_TOKEN`)
10. **Performance Surgeon**: diagnoses weak alphas, suggests mutations
11. **Gatekeeper**: final go/no-go (LLM)
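The deduplication step (stage 4) can be pictured with a minimal sketch. The helper names `expression_hash` and `deduplicate` are hypothetical, and stripping whitespace before hashing is an assumption made for illustration; the project's own normalization may differ.

```python
import hashlib
import re


def expression_hash(expression: str) -> str:
    """Return the SHA-256 hex digest of a whitespace-normalized expression."""
    # Assumption: whitespace is not significant, so remove it before hashing.
    normalized = re.sub(r"\s+", "", expression)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(expressions):
    """Keep the first occurrence of each distinct expression, preserving order."""
    seen = set()
    unique = []
    for expr in expressions:
        digest = expression_hash(expr)
        if digest not in seen:
            seen.add(digest)
            unique.append(expr)
    return unique


candidates = [
    "rank(ts_mean(close, 5))",
    "rank( ts_mean(close,5) )",    # same expression, different spacing
    "rank(ts_delta(volume, 10))",
]
unique_candidates = deduplicate(candidates)
print(unique_candidates)
```

Hashing a normalized string only catches textual duplicates; two expressions that are mathematically equivalent but written differently would still both pass.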
## Quick Start
```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
pip install -e ".[all]"
```
### Proven Templates (RECOMMENDED: no LLM, guaranteed valid)
```bash
python -m alpha_factory.run --proven --batch-size 10
```
Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.
### LLM-Assisted Generation
```bash
# Needs Ollama or HF token
python -m alpha_factory.run --batch-size 5 --dry-run
```
### Standalone Proven Generator
```bash
python -m alpha_factory.generate_proven
```
### Run Tests
```bash
pytest tests/ -v
```
## BRAIN Setup
BRAIN uses **session-based authentication** (browser cookies), not API keys. The session token expires quickly.
1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
2. Open browser DevTools → Network tab
3. Run any simulation, look for request to `api.worldquantbrain.com`
4. Copy the `Cookie: session=...` header value
5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`
**No token = dry-run mode only.** The pipeline will generate expressions but not submit them.
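As a rough illustration of the "no token means dry-run" rule, a check along the following lines would do. `resolve_mode` is a hypothetical helper, not the project's actual code; only the `BRAIN_SESSION_TOKEN` variable name comes from this README.

```python
import os


def resolve_mode(env=None):
    """Return ("live", token) when a session token is set, else ("dry-run", None)."""
    env = os.environ if env is None else env
    # Treat a missing, empty, or whitespace-only value as "no token".
    token = (env.get("BRAIN_SESSION_TOKEN") or "").strip()
    if token:
        return "live", token
    return "dry-run", None


mode, _token = resolve_mode()
print(f"BRAIN mode: {mode}")
```

Because the token expires, a non-empty value does not prove the session is still valid; a real client would also need to handle an authentication failure at request time.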
## Architecture
```
Theme Sampler → Expression Generation → Static Lint → Dedup → Store → Local Sim → Checklist
↓ (Templates or LLM) ↓ ↓ ↓ ↓
Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
↓ (iteration queue)
Winner Memory ← Mutator ← Performance Surgeon
```
### Components
| Module | Purpose | Status |
|--------|---------|--------|
| `proven_templates.py` | Deterministic expression generation | ✅ Working |
| `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
| `pipeline.py` | Orchestrates all stages | ✅ Working |
| `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
| `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
| `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
| `gatekeeper.py` | Final go/no-go memo | ✅ Working |
| `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
| `brain_sim.py` | Local numpy backtest (triage, lenient) | ✅ Wired (never blocks) |
| `regime_tagger.py` | Vol/trend/rate/style regimes | ✅ Wired via Performance Surgeon |
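To make the lint row more concrete, here is a minimal sketch of two of its checks: balanced parentheses and operator arity. It is not the code in `lint.py`. The `ARITY` table is illustrative and contains only the three operators the changelog describes as 2-arg; operators not in the table are skipped, and look-ahead detection is not covered.

```python
import re

# Illustrative arity table (assumption): only the operators the changelog
# lists as taking two arguments.
ARITY = {"ts_mean": 2, "ts_std": 2, "ts_delta": 2}


def split_args(arg_string):
    """Split a call's argument string on top-level commas only."""
    args, depth, current = [], 0, []
    for ch in arg_string:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail or args:
        args.append(tail)
    return args


def lint(expression):
    """Return a list of issues; an empty list means both checks passed."""
    depth = 0
    for ch in expression:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if depth < 0:          # a ")" appeared before its "("
            break
    if depth != 0:
        return ["unbalanced parentheses"]

    issues = []
    for match in re.finditer(r"([A-Za-z_]\w*)\s*\(", expression):
        name = match.group(1)
        if name not in ARITY:
            continue
        # Walk forward to the parenthesis that closes this call.
        start = end = match.end()
        level = 1
        while level:
            if expression[end] == "(":
                level += 1
            elif expression[end] == ")":
                level -= 1
            end += 1
        n_args = len(split_args(expression[start:end - 1]))
        if n_args != ARITY[name]:
            issues.append(f"{name} expects {ARITY[name]} args, got {n_args}")
    return issues


print(lint("rank(ts_mean(close, 5))"))
print(lint("ts_mean(close)"))
```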
## Key Features
- **Proven template mode**: No LLM required. Generates valid BRAIN expressions instantly.
- **Field registry**: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
- **Novel group keys**: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
- **Static lint**: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
- **Kill switches**: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
- **Winner memory**: Tracks which field/archetype combinations work, feeds back to generation.
- **Expression mutator**: Auto-generates decay, horizon, neutralization, and sign-flip variants.
- **DuckDB store**: Persistent history of all alphas, metrics, and verdicts.
- **Retry logic**: LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff. Non-retryable errors (401, 400, OOM) abort immediately.
- **Unified pipeline**: Both proven and LLM paths flow through `_process_candidate()`, with no code duplication.
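The retry behaviour described in the list above can be sketched as follows. This is an assumption-laden illustration, not the project's `llm_client.py`: `HTTPStatusError` and `call_with_retry` are hypothetical names, and the delay schedule (1s, 2s, 4s plus a little jitter) is chosen for the example.

```python
import random
import time

RETRYABLE_STATUS = {429, 502, 503, 504}   # transient statuses named above


class HTTPStatusError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(func, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        last_attempt = attempt == max_attempts - 1
        try:
            return func()
        except HTTPStatusError as exc:
            # Non-retryable statuses (e.g. 400, 401) propagate immediately.
            if exc.status not in RETRYABLE_STATUS or last_attempt:
                raise
        except TimeoutError:
            if last_attempt:
                raise
        # Delay doubles each attempt; jitter spreads out concurrent retries.
        sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# Demo with a fake call that fails twice with 503, then succeeds.
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise HTTPStatusError(503)
    return "ok"


print(call_with_retry(flaky_call, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the demo and any tests fast; production code would leave the default `time.sleep`.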
## Known Limitations
1. **BRAIN auth is session-based**: Token expires. No automatic refresh. You must re-copy from browser.
2. **Local simulation is triage-only**: `brain_sim.py` runs with lenient thresholds (min_sharpe=0.3) and prints warnings but **never blocks** a candidate. It's for sanity checking, not filtering.
3. **LLM generation can hallucinate fields**: Static lint catches most errors, but field names from LLMs may not exist on BRAIN.
4. **Weights inside `rank()` are decorative**: `rank(0.6*a + 0.4*b)` is monotonic; coefficients don't linearly combine. The signal comes from which fields are combined.
5. **Not a guarantee of profitable alphas**: This generates candidates. BRAIN's simulation is the ground truth.
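Limitation 4 rests on the fact that a rank depends only on the ordering of its input, so any strictly increasing transform of the weighted sum yields the same output. The sketch below uses a simplified pure-Python `rank` as a stand-in for BRAIN's operator (an assumption: it scales to [0, 1] and ignores ties) with made-up sample values.

```python
def rank(values):
    """Cross-sectional rank scaled to [0, 1]; assumes no ties for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position / (len(values) - 1)
    return ranks


# Made-up field values for four instruments.
a = [0.2, -1.0, 0.7, 0.1]
b = [1.5, 0.3, -0.4, 0.9]

weighted = [0.6 * x + 0.4 * y for x, y in zip(a, b)]
rescaled = [6 * x + 4 * y for x, y in zip(a, b)]    # same sum, 10x larger
cubed = [w ** 3 for w in weighted]                  # strictly increasing transform

print(rank(weighted) == rank(rescaled) == rank(cubed))
```

The magnitudes of the weighted sum never reach the output, which is why the coefficients should not be read as a linear blend of the two signals.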
## Configuration
All settings in `alpha_factory/config.py`. Key ones:
```python
batch_size = 10 # Candidates per run
use_proven_templates = False # Set True for deterministic mode
enable_brain_client = False # Set True for live BRAIN submission
max_parallel_candidates = 3 # Concurrent LLM calls
```
Or via CLI:
```bash
python -m alpha_factory.run --proven --batch-size 10 --enable-brain
```
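One plausible way the CLI flags could map onto the config fields is sketched below with `argparse`. The flag names and field names come from this README; the mapping itself, and the helpers `build_parser` and `to_config`, are assumptions for illustration rather than the contents of `run.py`.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the command above."""
    parser = argparse.ArgumentParser(prog="alpha_factory.run")
    parser.add_argument("--proven", action="store_true",
                        help="use deterministic proven templates (no LLM)")
    parser.add_argument("--batch-size", type=int, default=10,
                        help="candidates per run")
    parser.add_argument("--enable-brain", action="store_true",
                        help="enable live BRAIN submission")
    return parser


def to_config(args):
    """Translate parsed flags into the config field names listed above."""
    return {
        "batch_size": args.batch_size,
        "use_proven_templates": args.proven,
        "enable_brain_client": args.enable_brain,
    }


args = build_parser().parse_args(["--proven", "--batch-size", "10", "--enable-brain"])
config = to_config(args)
print(config)
```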
## Cost
| Item | Cost |
|------|------|
| Proven template mode | $0 (no LLM) |
| Local Ollama (7B) | $0 (your GPU) |
| HuggingFace Inference API | Free tier / rate-limited |
| BRAIN submissions | $0 (uses your existing BRAIN credits) |
## File Structure
```
alpha_factory/
├── config.py                      # All settings (Pydantic v2)
├── run.py                         # Entry point (single asyncio.run)
├── schemas/                       # Typed Pydantic contracts
├── deterministic/
│   ├── lint.py                    # Static pre-flight (Layer 2)
│   ├── theme_sampler.py           # Gap analysis (Layer 1)
│   ├── fitness.py                 # Composite scoring
│   ├── proven_templates.py        # Deterministic generation
│   ├── expression_mutator.py      # Evolutionary variants
│   ├── acceptance_checklist.py    # 14-point pre-submission gate
│   ├── brain_sim.py               # Local numpy backtest (triage)
│   └── regime_tagger.py           # IQR-based regime detection
├── infra/
│   ├── model_manager.py           # Ollama + HF auto-detection
│   ├── llm_client.py              # Unified LLM interface (token budget + retry)
│   ├── factor_store.py            # DuckDB persistence (parameterized SQL)
│   ├── wq_client.py               # BRAIN API wrapper (session auth, circuit breaker)
│   └── winner_memory.py           # Feedback loop
├── local/
│   └── brain_sim.py               # (identical, part of deterministic)
├── personas/
│   ├── hypothesis_hunter.py       # Persona 1 (LLM)
│   ├── expression_compiler.py     # Persona 2 (templates + LLM fallback)
│   ├── crowd_scout.py             # Persona 4 (heuristic + LLM)
│   ├── performance_surgeon.py     # Persona 5 (heuristic + LLM)
│   └── gatekeeper.py              # Persona 6 (LLM)
└── orchestration/
    └── pipeline.py                # Full DAG (unified _process_candidate)
```
## Changelog v0.2.0
- **Fixed**: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
- **Fixed**: `operators.csv` arity mismatches (ts_mean, ts_std, ts_delta now correctly listed as 2-arg)
- **Fixed**: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
- **Fixed**: `personas/__init__.py` imports real modules instead of stubs
- **Fixed**: `infra/__init__.py` imports real `BrainClient` instead of stub class
- **Fixed**: Expression compiler sign logic is now per-component, with no global blind negation
- **Fixed**: LLM client stops error amplification (no more 3x API calls on auth/network failures)
- **Fixed**: LLM client enforces token budget (was declared but never checked)
- **Fixed**: LLM client adds retry logic with exponential backoff for transient failures
- **Fixed**: LLM client JSON parsing regex no longer strips all whitespace (was mangling responses)
- **Fixed**: `pipeline.py` `NameError: max_corr` β€” correlation is now computed before checklist call
- **Fixed**: `pipeline.py` `_submit_or_dryrun` reuses `self.brain` instead of creating new clients
- **Fixed**: `run.py` uses a single `asyncio.run()`, so sessions no longer leak
- **Fixed**: `acceptance_checklist.py` RETURNS-CORR check no longer always fails (threshold raised from 0.05 to 0.95)
- **Fixed**: `factor_store.py` uses DuckDB transaction context manager instead of string-based BEGIN/COMMIT
- **Fixed**: `ui.py` SQL uses parameterized LIMIT instead of f-string injection
- **Fixed**: `expression_compiler.py` `_validate_expression` is now called, issues logged
- **Fixed**: `expression_mutator.py` regex now handles uppercase field IDs (e.g., `mdl77_2GlobalDev...`)
- **Fixed**: `proven_templates.py` decay parameter is now passed through (was hardcoded to 5)
- **Fixed**: `theme_sampler.py` `pick_theme()` has alive-theme fallback when all themes exhausted
- **Fixed**: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag
- **Fixed**: Removed orphan `rag.py` (arXiv retrieval not wired, will be re-added when integrated)
- **Fixed**: Added missing `local/__init__.py` and `orchestration/__init__.py`
- **Fixed**: `pyproject.toml` version bumped to 0.2.0, removed unused `scipy` dependency
- **New**: Proven template mode (`--proven`) generates expressions without any LLM
- **New**: Winner memory integration in pipeline (records winners/failures/iterations)
- **New**: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
- **New**: Acceptance checklist (14 checks, wired before BRAIN submission)
- **New**: Parallel batch processing with `max_parallel_candidates` semaphore
- **New**: 64+ comprehensive tests covering templates, lint, mutations, config, fitness, fields, groups
- **New**: `_process_candidate()` unified path: both proven and LLM candidates flow through the same pipeline
- **Updated**: Honest README that accurately describes what works and what doesn't
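The parallel batch processing entry above relies on a semaphore to cap concurrency. A minimal sketch of that pattern with `asyncio.Semaphore` follows; `process_candidate` and `run_batch` are hypothetical stand-ins, and the short sleep takes the place of a real LLM call.

```python
import asyncio

MAX_PARALLEL_CANDIDATES = 3   # mirrors the max_parallel_candidates setting


async def process_candidate(candidate_id, semaphore, state):
    """Process one candidate while holding a semaphore slot."""
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)   # stand-in for an LLM call
        state["active"] -= 1
        return candidate_id


async def run_batch(n_candidates):
    """Run all candidates concurrently, at most three inside the semaphore."""
    semaphore = asyncio.Semaphore(MAX_PARALLEL_CANDIDATES)
    state = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(process_candidate(i, semaphore, state) for i in range(n_candidates))
    )
    return results, state["peak"]


results, peak = asyncio.run(run_batch(10))
print(results, peak)
```

All ten tasks are scheduled at once, but the `peak` counter shows that no more than `MAX_PARALLEL_CANDIDATES` are ever inside the guarded section together.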
## License
MIT. Use at your own risk. This is not financial advice. BRAIN simulations are the ground truth.
<!-- ml-intern-provenance -->
## Generated by ML Intern
This repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern