---
license: mit
tags:
- quantitative-finance
- alpha-generation
- worldquant-brain
- ml-intern
---

# Alpha Factory v0.2.0

**LLM-assisted alpha expression generator for WorldQuant BRAIN.**

> This is a Python application, not a model. It generates candidate BRAIN expressions for manual review and submission.

## What It Actually Does

| Mode | What it does | Needs LLM? | Submits to BRAIN? |
|------|-------------|-----------|-------------------|
| **Proven Templates** (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
| **LLM Mode** (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |

The pipeline runs:

1. **Theme sampling** — picks under-explored themes from the field registry
2. **Expression generation** — deterministic templates OR LLM hypothesis + compile
3. **Static lint** — validates BRAIN syntax (operator arity, look-ahead, parentheses)
4. **Deduplication** — SHA256 hash to avoid duplicates
5. **Store** — persists to DuckDB for review
6. **Crowd Scout** — novelty assessment (LLM or heuristic)
7. **Performance Surgeon** — diagnoses weak alphas, suggests mutations
8. **Gatekeeper** — final go/no-go (LLM)
9. **BRAIN submission** — live submission (requires `BRAIN_SESSION_TOKEN`)

## Quick Start

```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
pip install -e ".[all]"
```

### Proven Templates (RECOMMENDED — no LLM, guaranteed valid)

```bash
python -m alpha_factory.run --proven --batch-size 10
```

Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.
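To illustrate the kind of checks the static lint stage (step 3) performs, here is a rough sketch. The `lint` function and the tiny arity table are illustrative only, not the actual `lint.py` API; the real pipeline reads a fuller operator table from `operators.csv`.

```python
import re

# Illustrative subset of a BRAIN operator arity table (assumption);
# the real pipeline loads a fuller table from operators.csv.
ARITY = {"rank": 1, "ts_mean": 2, "ts_std": 2, "ts_delta": 2}

def lint(expr: str) -> list[str]:
    """Return lint errors for a candidate expression (sketch, not the real API)."""
    errors = []
    # 1. Balanced parentheses.
    depth = 0
    for ch in expr:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:
            errors.append("unbalanced ')'")
            break
    if depth > 0:
        errors.append("unbalanced '('")
    # 2. Operator arity: count top-level commas inside each known call.
    for match in re.finditer(r"(\w+)\(", expr):
        name, start = match.group(1), match.end()
        if name not in ARITY:
            continue  # unknown names are handled elsewhere (field registry)
        call_depth, args = 1, 1
        for ch in expr[start:]:
            call_depth += (ch == "(") - (ch == ")")
            if call_depth == 0:
                break
            if call_depth == 1 and ch == ",":
                args += 1
        if args != ARITY[name]:
            errors.append(f"{name} expects {ARITY[name]} args, got {args}")
    return errors
```

A valid expression like `ts_mean(close, 20)` passes cleanly, while `ts_mean(close)` trips the arity check and `rank(close` trips the parenthesis check.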
### LLM-Assisted Generation

```bash
# Needs Ollama or HF token
python -m alpha_factory.run --batch-size 5 --dry-run
```

### Standalone Proven Generator

```bash
python -m alpha_factory.generate_proven
```

### Run Tests

```bash
pytest tests/ -v
```

## BRAIN Setup

BRAIN uses **session-based authentication** (browser cookies), not API keys. The session token expires quickly.

1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
2. Open browser DevTools → Network tab
3. Run any simulation and look for a request to `api.worldquantbrain.com`
4. Copy the `Cookie: session=...` header value
5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`

**No token = dry-run mode only.** The pipeline will generate expressions but not submit them.

## Architecture

```
Theme Sampler → Expression Generation → Static Lint → Dedup → Store
                ↓ (Templates or LLM)                            ↓
                                                                ↓
Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
      ↓ (iteration queue)
Winner Memory ← Mutator ← Performance Surgeon
```

### Components

| Module | Purpose | Status |
|--------|---------|--------|
| `proven_templates.py` | Deterministic expression generation | ✅ Working |
| `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
| `pipeline.py` | Orchestrates all stages | ✅ Working |
| `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
| `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
| `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
| `gatekeeper.py` | Final go/no-go memo | ✅ Working |
| `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
| `brain_sim.py` | Local numpy backtest | ⚠️ Not wired to pipeline |
| `regime_tagger.py` | Vol/trend/rate/style regimes | ⚠️ Not wired to pipeline |

## Key Features

- **Proven template mode**: No LLM required. Generates valid BRAIN expressions instantly.
- **Field registry**: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
- **Novel group keys**: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
- **Static lint**: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
- **Kill switches**: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
- **Winner memory**: Tracks which field/archetype combinations work and feeds that back into generation.
- **Expression mutator**: Auto-generates decay, horizon, neutralization, and sign-flip variants.
- **DuckDB store**: Persistent history of all alphas, metrics, and verdicts.
- **Retry logic**: The LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff.

## Known Limitations

1. **BRAIN auth is session-based**: The token expires and there is no automatic refresh; you must re-copy it from the browser.
2. **Local simulation is not wired**: `brain_sim.py` exists but is not integrated into the pipeline. It needs price data (yfinance) and produces approximate results.
3. **Regime tagger not wired**: `regime_tagger.py` exists but is not used by the Performance Surgeon.
4. **LLM generation can hallucinate fields**: Static lint catches most syntax errors, but field names invented by LLMs may not exist on BRAIN.
5. **Weights inside `rank()` are decorative**: `rank()` is monotonic, so in `rank(0.6*a + 0.4*b)` the coefficients do not linearly combine — the signal comes from which fields are combined.
6. **Not a guarantee of profitable alphas**: This generates candidates. BRAIN's simulation is the ground truth.

## Configuration

All settings live in `alpha_factory/config.py`.
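For orientation, the settings class can be sketched as follows. This is a stdlib-dataclass stand-in, not the actual Pydantic v2 model, and only the four documented fields are shown.

```python
from dataclasses import dataclass

# Illustrative stand-in for alpha_factory/config.py: the real class is a
# Pydantic v2 model, but the documented fields and defaults are these.
@dataclass
class AlphaFactoryConfig:
    batch_size: int = 10                # candidates per run
    use_proven_templates: bool = False  # True = deterministic mode
    enable_brain_client: bool = False   # True = live BRAIN submission
    max_parallel_candidates: int = 3    # concurrent LLM calls

# CLI flags such as --proven and --batch-size override these defaults:
cfg = AlphaFactoryConfig(use_proven_templates=True)
```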
Key ones:

```python
batch_size = 10               # Candidates per run
use_proven_templates = False  # Set True for deterministic mode
enable_brain_client = False   # Set True for live BRAIN submission
max_parallel_candidates = 3   # Concurrent LLM calls
```

Or via CLI:

```bash
python -m alpha_factory.run --proven --batch-size 10 --enable-brain
```

## Cost

| Item | Cost |
|------|------|
| Proven template mode | $0 (no LLM) |
| Local Ollama (7B) | $0 (your GPU) |
| HuggingFace Inference API | Free tier / rate-limited |
| BRAIN submissions | $0 (uses your existing BRAIN credits) |

## File Structure

```
alpha_factory/
├── config.py                  # All settings (Pydantic v2)
├── run.py                     # Entry point
├── schemas/                   # Typed Pydantic contracts
├── deterministic/
│   ├── lint.py                # Static pre-flight (Layer 2)
│   ├── theme_sampler.py       # Gap analysis (Layer 1)
│   ├── fitness.py             # Composite scoring
│   ├── proven_templates.py    # Deterministic generation
│   ├── expression_mutator.py  # Evolutionary variants
│   └── regime_tagger.py       # Vol/trend/rate/style regimes (not wired)
├── infra/
│   ├── model_manager.py       # Ollama + HF auto-detection
│   ├── llm_client.py          # Unified LLM interface (token budget + retry)
│   ├── factor_store.py        # DuckDB persistence
│   ├── wq_client.py           # BRAIN API wrapper (session auth)
│   └── winner_memory.py       # Feedback loop
├── local/
│   └── brain_sim.py           # Local BRAIN simulator (not wired)
├── personas/
│   ├── hypothesis_hunter.py   # Persona 1
│   ├── expression_compiler.py # Persona 2
│   ├── crowd_scout.py         # Persona 4
│   ├── performance_surgeon.py # Persona 5
│   └── gatekeeper.py          # Persona 6
└── orchestration/
    └── pipeline.py            # Full DAG
```

## Changelog v0.2.0

- **Fixed**: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
- **Fixed**: `operators.csv` arity mismatches (`ts_mean`, `ts_std`, `ts_delta` now correctly listed as 2-arg)
- **Fixed**: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
- **Fixed**: `personas/__init__.py` imports real modules instead of stubs
- **Fixed**: `infra/__init__.py` imports the real `BrainClient` instead of a stub class
- **Fixed**: Expression compiler sign logic — now per-component, no global blind negation
- **Fixed**: LLM client stops error amplification (no more 3x API calls on auth/network failures)
- **Fixed**: LLM client enforces the token budget (it was declared but never checked)
- **Fixed**: LLM client adds retry logic with exponential backoff for transient failures (429, 502, 503, 504, timeout)
- **Fixed**: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag (local sim exists but is not wired)
- **Fixed**: Removed orphan `rag.py` (arXiv retrieval not wired; will be re-added when integrated)
- **Fixed**: Added missing `local/__init__.py` for proper package structure
- **Fixed**: Added GitHub Actions CI workflow (`.github/workflows/ci.yml`)
- **New**: Proven template mode (`--proven`) generates expressions without any LLM
- **New**: Winner memory integration in the pipeline (records winners/failures/iterations)
- **New**: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
- **New**: Parallel batch processing with a `max_parallel_candidates` semaphore
- **New**: 32 tests covering templates, lint, mutations, config, fitness, fields, and groups
- **Updated**: Honest README that accurately describes what works and what doesn't

## License

MIT — use at your own risk. This is not financial advice. BRAIN simulations are the ground truth.

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

This repository is a Python application, not model weights, so it cannot be loaded with `transformers` (`AutoModel`/`AutoTokenizer` will not find a model here). Install and run it as shown in the Quick Start above.