---
license: mit
tags:
- quantitative-finance
- alpha-generation
- worldquant-brain
- ml-intern
---

# Alpha Factory v0.2.0

**LLM-assisted alpha expression generator for WorldQuant BRAIN.**

> This is a Python application, not a model. It generates candidate BRAIN expressions for manual review and submission.

## What It Actually Does

| Mode | What it does | Needs LLM? | Submits to BRAIN? |
|------|-------------|-----------|-------------------|
| **Proven Templates** (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
| **LLM Mode** (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |

The pipeline runs:
1. **Theme sampling** → picks under-explored themes from the field registry
2. **Expression generation** → deterministic templates OR LLM hypothesis + compile
3. **Static lint** → validates BRAIN syntax (operator arity, look-ahead, parentheses)
4. **Deduplication** → SHA256 hash to avoid duplicates
5. **Store** → persists to DuckDB for review
6. **Crowd Scout** → novelty assessment (LLM or heuristic)
7. **Performance Surgeon** → diagnoses weak alphas, suggests mutations
8. **Gatekeeper** → final go/no-go (LLM)
9. **BRAIN submission** → live submission (requires `BRAIN_SESSION_TOKEN`)
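The dedup step (4) can be sketched as a content hash over a normalized expression. The whitespace normalization shown here is an assumption; the real pipeline may canonicalize differently.

```python
import hashlib

def expression_key(expr: str) -> str:
    """Stable dedup key for a BRAIN expression (illustrative sketch)."""
    # Collapse whitespace so trivially reformatted copies hash identically.
    canonical = " ".join(expr.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A new candidate is stored only if its key is not already present in DuckDB.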

## Quick Start

```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
pip install -e ".[all]"
```

### Proven Templates (RECOMMENDED: no LLM, guaranteed valid)

```bash
python -m alpha_factory.run --proven --batch-size 10
```

Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.

### LLM-Assisted Generation

```bash
# Needs Ollama or HF token
python -m alpha_factory.run --batch-size 5 --dry-run
```

### Standalone Proven Generator

```bash
python -m alpha_factory.generate_proven
```

### Run Tests

```bash
pytest tests/ -v
```

## BRAIN Setup

BRAIN uses **session-based authentication** (browser cookies), not API keys. The session token expires quickly.

1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
2. Open browser DevTools → Network tab
3. Run any simulation and look for a request to `api.worldquantbrain.com`
4. Copy the `Cookie: session=...` header value
5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`

**No token = dry-run mode only.** The pipeline will generate expressions but not submit them.

## Architecture

```
Theme Sampler → Expression Generation → Static Lint → Dedup → Store
                 ↓ (Templates or LLM)       ↓            ↓
Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
       ↑ (iteration queue)
Winner Memory → Mutator → Performance Surgeon
```

### Components

| Module | Purpose | Status |
|--------|---------|--------|
| `proven_templates.py` | Deterministic expression generation | ✅ Working |
| `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
| `pipeline.py` | Orchestrates all stages | ✅ Working |
| `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
| `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
| `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
| `gatekeeper.py` | Final go/no-go memo | ✅ Working |
| `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
| `brain_sim.py` | Local numpy backtest | ⚠️ Not wired to pipeline |
| `regime_tagger.py` | Vol/trend/rate/style regimes | ⚠️ Not wired to pipeline |

## Key Features

- **Proven template mode**: No LLM required. Generates valid BRAIN expressions instantly.
- **Field registry**: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
- **Novel group keys**: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
- **Static lint**: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
- **Kill switches**: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
- **Winner memory**: Tracks which field/archetype combinations work, feeds back to generation.
- **Expression mutator**: Auto-generates decay, horizon, neutralization, and sign-flip variants.
- **DuckDB store**: Persistent history of all alphas, metrics, and verdicts.
- **Retry logic**: LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff.
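The retry behavior can be sketched as follows. The retryable status set matches the list above; the attempt count and delays are illustrative defaults, not the actual `llm_client.py` values:

```python
import random
import time

RETRYABLE = {429, 502, 503, 504}  # transient HTTP statuses listed above

def with_retries(call, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry a callable returning (status, payload) with exponential backoff."""
    for attempt in range(max_attempts):
        status, payload = call()
        if status not in RETRYABLE:
            return status, payload  # success or non-retryable failure
        if attempt < max_attempts - 1:
            # base, 2*base, 4*base, ... plus jitter to avoid thundering herds
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return status, payload  # budget exhausted; surface the last failure
```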

## Known Limitations

1. **BRAIN auth is session-based**: The token expires and there is no automatic refresh; you must re-copy it from the browser.
2. **Local simulation is not wired**: `brain_sim.py` exists but is not integrated into the pipeline. It needs price data (yfinance) and produces approximate results.
3. **Regime tagger not wired**: `regime_tagger.py` exists but is not used by the Performance Surgeon.
4. **LLM generation can hallucinate fields**: Static lint catches most errors, but field names from LLMs may not exist on BRAIN.
5. **Weights inside `rank()` are decorative**: `rank(0.6*a + 0.4*b)` passes through a monotonic rank transform, so the coefficients don't combine linearly; the signal comes from which fields are combined.
6. **Not a guarantee of profitable alphas**: This generates candidates. BRAIN's simulation is the ground truth.
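Limitation 4 can be mitigated by checking identifiers in a generated expression against the field registry. This is an illustrative sketch, not the actual `lint.py` logic; the sample registry and operator sets below are placeholders:

```python
import re

# Placeholder registry entries; the real registry has 40+ fields.
KNOWN_FIELDS = {"close", "volume", "vwap", "pv13_customergraphrank"}
OPERATORS = {"rank", "ts_mean", "ts_std", "ts_delta"}

def unknown_fields(expr: str) -> set:
    """Return identifiers that are neither known operators nor registry fields."""
    idents = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr))
    return idents - OPERATORS - KNOWN_FIELDS
```

Any non-empty result flags a likely hallucinated field before the expression reaches BRAIN.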

## Configuration

All settings live in `alpha_factory/config.py`. Key ones:

```python
batch_size = 10               # Candidates per run
use_proven_templates = False  # Set True for deterministic mode
enable_brain_client = False   # Set True for live BRAIN submission
max_parallel_candidates = 3   # Concurrent LLM calls
```

Or via CLI:

```bash
python -m alpha_factory.run --proven --batch-size 10 --enable-brain
```
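Since `config.py` uses Pydantic v2 (see File Structure), the settings above can be modeled roughly like this. The `Settings` class name is an assumption; only the field names and defaults come from the snippet above:

```python
from pydantic import BaseModel

class Settings(BaseModel):
    """Hypothetical shape of the alpha_factory/config.py settings."""
    batch_size: int = 10                # candidates per run
    use_proven_templates: bool = False  # deterministic mode
    enable_brain_client: bool = False   # live BRAIN submission
    max_parallel_candidates: int = 3    # concurrent LLM calls

# Override per run, mirroring `--proven --batch-size 10`:
settings = Settings(batch_size=10, use_proven_templates=True)
```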

## Cost

| Item | Cost |
|------|------|
| Proven template mode | $0 (no LLM) |
| Local Ollama (7B) | $0 (your GPU) |
| HuggingFace Inference API | Free tier / rate-limited |
| BRAIN submissions | $0 (uses your existing BRAIN credits) |

## File Structure

```
alpha_factory/
├── config.py                  # All settings (Pydantic v2)
├── run.py                     # Entry point
├── schemas/                   # Typed Pydantic contracts
├── deterministic/
│   ├── lint.py                # Static pre-flight (Layer 2)
│   ├── theme_sampler.py       # Gap analysis (Layer 1)
│   ├── fitness.py             # Composite scoring
│   ├── proven_templates.py    # Deterministic generation
│   ├── expression_mutator.py  # Evolutionary variants
│   └── regime_tagger.py       # Vol/trend/rate/style regimes (not wired)
├── infra/
│   ├── model_manager.py       # Ollama + HF auto-detection
│   ├── llm_client.py          # Unified LLM interface (token budget + retry)
│   ├── factor_store.py        # DuckDB persistence
│   ├── wq_client.py           # BRAIN API wrapper (session auth)
│   └── winner_memory.py       # Feedback loop
├── local/
│   └── brain_sim.py           # Local BRAIN simulator (not wired)
├── personas/
│   ├── hypothesis_hunter.py   # Persona 1
│   ├── expression_compiler.py # Persona 2
│   ├── crowd_scout.py         # Persona 4
│   ├── performance_surgeon.py # Persona 5
│   └── gatekeeper.py          # Persona 6
└── orchestration/
    └── pipeline.py            # Full DAG
```

## Changelog v0.2.0

- **Fixed**: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
- **Fixed**: `operators.csv` arity mismatches (ts_mean, ts_std, ts_delta now correctly listed as 2-arg)
- **Fixed**: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
- **Fixed**: `personas/__init__.py` imports real modules instead of stubs
- **Fixed**: `infra/__init__.py` imports the real `BrainClient` instead of a stub class
- **Fixed**: Expression compiler sign logic is now per-component, with no global blind negation
- **Fixed**: LLM client stops error amplification (no more 3x API calls on auth/network failures)
- **Fixed**: LLM client enforces the token budget (was declared but never checked)
- **Fixed**: LLM client adds retry logic with exponential backoff for transient failures (429, 502, 503, 504, timeout)
- **Fixed**: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag (local sim exists but is not wired)
- **Fixed**: Removed orphan `rag.py` (arXiv retrieval not wired; will be re-added when integrated)
- **Fixed**: Added missing `local/__init__.py` for proper package structure
- **Fixed**: Added GitHub Actions CI workflow (`.github/workflows/ci.yml`)
- **New**: Proven template mode (`--proven`) generates expressions without any LLM
- **New**: Winner memory integration in pipeline (records winners/failures/iterations)
- **New**: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
- **New**: Parallel batch processing with `max_parallel_candidates` semaphore
- **New**: 32 comprehensive tests covering templates, lint, mutations, config, fitness, fields, groups
- **Updated**: Honest README that accurately describes what works and what doesn't

## License

MIT. Use at your own risk; this is not financial advice, and BRAIN simulations are the ground truth.

<!-- ml-intern-provenance -->
## Generated by ML Intern

This repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

This is a Python application, not a loadable model, so the standard `transformers.AutoModel` loaders do not apply here. Clone the repository and run it directly:

```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
pip install -e ".[all]"
python -m alpha_factory.run --proven --batch-size 10
```