	---
	license: mit
	tags:
	- quantitative-finance
	- alpha-generation
	- worldquant-brain
	- ml-intern
	---

	# Alpha Factory v0.2.0

	LLM-assisted alpha expression generator for WorldQuant BRAIN.

	> This is a Python application, not a model. It generates candidate BRAIN expressions for manual review and submission.

	## What It Actually Does

	| Mode | What it does | Needs LLM? | Submits to BRAIN? |
	|------|--------------|------------|-------------------|
	| Proven Templates (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
	| LLM Mode (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |

	The pipeline runs:
	1. Theme sampling: picks under-explored themes from the field registry
	2. Expression generation: deterministic templates or LLM hypothesis + compile
	3. Static lint: validates BRAIN syntax (operator arity, look-ahead, parentheses)
	4. Deduplication: SHA-256 hash to skip duplicates
	5. Store: persists candidates to DuckDB for review
	6. Local sim: lightweight numpy backtest as triage (lenient thresholds, never blocks)
	7. Acceptance checklist: 14-point pre-submission gate
	8. Crowd Scout: novelty assessment (LLM or heuristic)
	9. BRAIN submission: live submission (requires `BRAIN_SESSION_TOKEN`)
	10. Performance Surgeon: diagnoses weak alphas, suggests mutations
	11. Gatekeeper: final go/no-go (LLM)
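
The deduplication stage can be illustrated with a short sketch. It assumes the hash is taken over the expression text with whitespace removed; the function names `expression_key` and `dedupe` are hypothetical and are not taken from the repository.

```python
import hashlib


def expression_key(expression: str) -> str:
    """Return the SHA-256 hex digest of an expression with whitespace removed."""
    # Assumption: whitespace is not significant in an expression.
    normalized = "".join(expression.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(expressions):
    """Keep the first occurrence of each distinct expression, preserving order."""
    seen = set()
    unique = []
    for expression in expressions:
        key = expression_key(expression)
        if key not in seen:
            seen.add(key)
            unique.append(expression)
    return unique


candidates = [
    "rank(ts_mean(close, 20))",
    "rank( ts_mean(close,20) )",  # same expression, different spacing
    "rank(ts_delta(volume, 5))",
]
unique_candidates = dedupe(candidates)
```

Hashing a normalized form means cosmetic differences do not produce a second stored candidate; whether the real pipeline normalizes before hashing is not stated in this README.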

	## Quick Start

	```bash
	git clone https://huggingface.co/gaurv007/alpha-factory
	cd alpha-factory
	pip install -e ".[all]"
	```

	### Proven Templates (RECOMMENDED — no LLM, guaranteed valid)

	```bash
	python -m alpha_factory.run --proven --batch-size 10
	```

	Uses proven Alpha 15/6 structures with novel fields whose alpha count (AC) is 0. Every expression is syntactically valid.
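
As a rough illustration of template-based generation, the sketch below fills fixed expression skeletons with field names and windows. The two templates, the function `generate`, and the chosen fields are hypothetical placeholders; the actual structures live in `proven_templates.py` and are not reproduced here.

```python
from itertools import product

# Hypothetical skeletons for illustration only.
TEMPLATES = [
    "rank(ts_mean({field}, {window}))",
    "-rank(ts_delta({field}, {window}))",
]


def generate(fields, windows, limit):
    """Fill each template with every field/window pair, up to `limit` results."""
    expressions = []
    for template, field, window in product(TEMPLATES, fields, windows):
        if len(expressions) >= limit:
            break
        expressions.append(template.format(field=field, window=window))
    return expressions


batch = generate(["close", "volume"], [5, 20], limit=3)
```

Because the skeleton is fixed and only the slots vary, every output has the same shape as its template, which is why this mode needs no LLM.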

	### LLM-Assisted Generation

	```bash
	# Needs Ollama or HF token
	python -m alpha_factory.run --batch-size 5 --dry-run
	```

	### Standalone Proven Generator

	```bash
	python -m alpha_factory.generate_proven
	```

	### Run Tests

	```bash
	pytest tests/ -v
	```

	## BRAIN Setup

	BRAIN uses session-based authentication (browser cookies), not API keys. The session token expires quickly.

	1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
	2. Open browser DevTools → Network tab
	3. Run any simulation and look for a request to `api.worldquantbrain.com`
	4. Copy the `Cookie: session=...` header value
	5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`

	Without a token, the pipeline runs in dry-run mode only: it generates expressions but does not submit them.
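
The fallback to dry-run can be sketched as below. This is a minimal illustration and not the code in `wq_client.py`; the function `resolve_mode` and its return values are hypothetical, and only the `BRAIN_SESSION_TOKEN` variable name comes from this README.

```python
import os


def resolve_mode(enable_brain, environ=None):
    """Return "live" only when submission is enabled and a token is present."""
    env = os.environ if environ is None else environ
    token = env.get("BRAIN_SESSION_TOKEN", "").strip()
    if enable_brain and token:
        return "live"
    return "dry-run"


# Passing a plain dict keeps the example independent of the real environment.
mode_without_token = resolve_mode(True, {})
mode_with_token = resolve_mode(True, {"BRAIN_SESSION_TOKEN": "example-token"})
```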

	## Architecture

	```
	Theme Sampler → Expression Generation → Static Lint → Dedup → Store → Local Sim → Checklist
	↓ (Templates or LLM) ↓ ↓ ↓ ↓
	Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
	↓ (iteration queue)
	Winner Memory ← Mutator ← Performance Surgeon
	```

	### Components

	| Module | Purpose | Status |
	|--------|---------|--------|
	| `proven_templates.py` | Deterministic expression generation | ✅ Working |
	| `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
	| `pipeline.py` | Orchestrates all stages | ✅ Working |
	| `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
	| `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
	| `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
	| `gatekeeper.py` | Final go/no-go memo | ✅ Working |
	| `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
	| `brain_sim.py` | Local numpy backtest (triage, lenient) | ✅ Wired (never blocks) |
	| `regime_tagger.py` | Vol/trend/rate/style regimes | ✅ Wired via Performance Surgeon |
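
To give a feel for what the static lint checks, here is a small sketch covering two of them: balanced parentheses and operator arity. It is not the code in `lint.py`. The `ARITY` table is a hypothetical subset (the 2-argument `ts_*` entries follow the changelog, and `rank` is assumed to take one argument), and the look-ahead check is omitted.

```python
import re

# Hypothetical subset of an operator arity table.
ARITY = {"rank": 1, "ts_mean": 2, "ts_std": 2, "ts_delta": 2}


def balanced(expr):
    """True if every ')' closes an earlier '(' and none are left open."""
    depth = 0
    for ch in expr:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def split_args(body):
    """Split a call body on commas that are not nested inside parentheses."""
    args, current, depth = [], [], 0
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            args.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        args.append(tail)
    return args


def lint(expr):
    """Return a list of human-readable issues; an empty list means no findings."""
    if not balanced(expr):
        return ["unbalanced parentheses"]
    issues = []
    for match in re.finditer(r"([A-Za-z_]\w*)\s*\(", expr):
        name, start = match.group(1), match.end()
        depth, i = 1, start
        while depth:  # walk to the matching ')'; safe because expr is balanced
            if expr[i] == "(":
                depth += 1
            elif expr[i] == ")":
                depth -= 1
            i += 1
        count = len(split_args(expr[start:i - 1]))
        expected = ARITY.get(name)
        if expected is not None and count != expected:
            issues.append(f"{name} expects {expected} args, got {count}")
    return issues


clean = lint("rank(ts_mean(close, 20))")
wrong_arity = lint("rank(ts_mean(close))")
unbalanced = lint("rank(ts_mean(close, 20)")
```

Operators missing from the table are skipped rather than rejected, so this sketch would not catch a hallucinated operator name.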

	## Key Features

	- Proven template mode: No LLM required. Generates valid BRAIN expressions instantly.
	- Field registry: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
	- Novel group keys: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
	- Static lint: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
	- Kill switches: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
	- Winner memory: Tracks which field/archetype combinations work, feeds back to generation.
	- Expression mutator: Auto-generates decay, horizon, neutralization, and sign-flip variants.
	- DuckDB store: Persistent history of all alphas, metrics, and verdicts.
	- Retry logic: LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff. Non-retryable errors (401, 400, OOM) abort immediately.
	- Unified pipeline: Both proven and LLM paths flow through `_process_candidate()` — no code duplication.
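
The retry behaviour listed above can be sketched as follows. This is an illustration under assumptions, not the code in `llm_client.py`: the exception class `LLMError`, the function `call_with_retry`, the attempt count, and the base delay are all hypothetical. Only the set of retryable status codes and the use of exponential backoff come from this README.

```python
import time

RETRYABLE_STATUS = {429, 502, 503, 504}


class LLMError(Exception):
    """Hypothetical error type carrying an HTTP-style status code."""

    def __init__(self, status):
        super().__init__(f"LLM call failed with status {status}")
        self.status = status


def call_with_retry(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying transient failures with delays of 0.5s, 1s, 2s, ..."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except (LLMError, TimeoutError) as exc:
            status = getattr(exc, "status", None)
            retryable = isinstance(exc, TimeoutError) or status in RETRYABLE_STATUS
            if not retryable or attempt == max_attempts:
                raise  # non-retryable error, or retries exhausted
            sleep(base_delay * 2 ** (attempt - 1))


# Demo: a call that fails twice with 503 and then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise LLMError(503)
    return "ok"


delays = []
result = call_with_retry(flaky, sleep=delays.append)  # record delays instead of sleeping
```

Injecting `sleep` keeps the example fast to run and makes the backoff schedule easy to inspect.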

	## Known Limitations

	1. BRAIN auth is session-based: Token expires. No automatic refresh. You must re-copy from browser.
	2. Local simulation is triage-only: `brain_sim.py` runs with lenient thresholds (min_sharpe=0.3) and prints warnings but never blocks a candidate. It's for sanity checking, not filtering.
	3. LLM generation can hallucinate fields: Static lint catches most errors, but field names from LLMs may not exist on BRAIN.
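	4. Weights inside `rank()` are not blend weights: `rank()` is monotonic, so `rank(0.6a + 0.4b)` is not a 0.6/0.4 mix of `rank(a)` and `rank(b)`, and scaling every coefficient by the same positive factor leaves the output unchanged. The signal comes mainly from which fields are combined.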
	5. Not a guarantee of profitable alphas: This generates candidates. BRAIN's simulation is the ground truth.
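
The point about coefficients inside `rank()` can be checked numerically. The sketch below uses a simple pure-Python stand-in for a cross-sectional rank scaled to [0, 1]; that scaling and the sample values are assumptions made for illustration, not BRAIN's implementation.

```python
def rank(values):
    """Rank values from 0 (smallest) to 1 (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position / (len(values) - 1)
    return ranks


a = [0.1, 0.5, 0.3, 0.9]
b = [0.7, 0.2, 0.6, 0.4]

# rank(0.6a + 0.4b) and the same expression with all weights scaled by 10.
combined = rank([0.6 * x + 0.4 * y for x, y in zip(a, b)])
scaled = rank([6 * x + 4 * y for x, y in zip(a, b)])

# A true 0.6/0.4 blend of the two individual ranks, for comparison.
blend = [0.6 * x + 0.4 * y for x, y in zip(rank(a), rank(b))]

same_after_scaling = combined == scaled
differs_from_blend = combined != blend
```

Here `combined` and `scaled` are identical, while `blend` is a different vector: the coefficients inside `rank()` do not act as mixing weights on the ranked outputs.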

	## Configuration

	All settings live in `alpha_factory/config.py`. The key ones:

	```python
	batch_size = 10 # Candidates per run
	use_proven_templates = False # Set True for deterministic mode
	enable_brain_client = False # Set True for live BRAIN submission
	max_parallel_candidates = 3 # Concurrent LLM calls
	```

	Or via CLI:
	```bash
	python -m alpha_factory.run --proven --batch-size 10 --enable-brain
	```
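
One way the CLI flags could map onto the settings above is sketched below. The repository's config uses Pydantic v2; a standard-library dataclass stands in for it here so the example has no dependencies. The names `Settings` and `parse_settings` are hypothetical, and only the flag and field names come from this README.

```python
import argparse
from dataclasses import dataclass


@dataclass
class Settings:
    batch_size: int = 10
    use_proven_templates: bool = False
    enable_brain_client: bool = False
    max_parallel_candidates: int = 3


def parse_settings(argv):
    """Translate command-line flags into a Settings instance."""
    parser = argparse.ArgumentParser(prog="alpha_factory.run")
    parser.add_argument("--proven", action="store_true")
    parser.add_argument("--batch-size", type=int, default=Settings.batch_size)
    parser.add_argument("--enable-brain", action="store_true")
    args = parser.parse_args(argv)
    return Settings(
        batch_size=args.batch_size,
        use_proven_templates=args.proven,
        enable_brain_client=args.enable_brain,
    )


settings = parse_settings(["--proven", "--batch-size", "10", "--enable-brain"])
defaults = parse_settings([])
```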

	## Cost

	| Item | Cost |
	|------|------|
	| Proven template mode | $0 (no LLM) |
	| Local Ollama (7B) | $0 (your GPU) |
	| HuggingFace Inference API | Free tier / rate-limited |
	| BRAIN submissions | $0 (uses your existing BRAIN credits) |

	## File Structure

	```
	alpha_factory/
	├── config.py # All settings (Pydantic v2)
	├── run.py # Entry point (single asyncio.run)
	├── schemas/ # Typed Pydantic contracts
	├── deterministic/
	│ ├── lint.py # Static pre-flight (Layer 2)
	│ ├── theme_sampler.py # Gap analysis (Layer 1)
	│ ├── fitness.py # Composite scoring
	│ ├── proven_templates.py # Deterministic generation
	│ ├── expression_mutator.py # Evolutionary variants
	│ ├── acceptance_checklist.py # 14-point pre-submission gate
	│ ├── brain_sim.py # Local numpy backtest (triage)
	│ └── regime_tagger.py # IQR-based regime detection
	├── infra/
	│ ├── model_manager.py # Ollama + HF auto-detection
	│ ├── llm_client.py # Unified LLM interface (token budget + retry)
	│ ├── factor_store.py # DuckDB persistence (parameterized SQL)
	│ ├── wq_client.py # BRAIN API wrapper (session auth, circuit breaker)
	│ └── winner_memory.py # Feedback loop
	├── local/
	│ └── brain_sim.py # (identical, part of deterministic)
	├── personas/
	│ ├── hypothesis_hunter.py # Persona 1 (LLM)
	│ ├── expression_compiler.py # Persona 2 (templates + LLM fallback)
	│ ├── crowd_scout.py # Persona 4 (heuristic + LLM)
	│ ├── performance_surgeon.py # Persona 5 (heuristic + LLM)
	│ └── gatekeeper.py # Persona 6 (LLM)
	└── orchestration/
	  └── pipeline.py # Full DAG (unified _process_candidate)
	```

	## Changelog v0.2.0

	- Fixed: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
	- Fixed: `operators.csv` arity mismatches (`ts_mean`, `ts_std`, `ts_delta` now correctly listed as 2-arg)
	- Fixed: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
	- Fixed: `personas/__init__.py` imports real modules instead of stubs
	- Fixed: `infra/__init__.py` imports real `BrainClient` instead of stub class
	- Fixed: Expression compiler sign logic — now per-component, no global blind negation
	- Fixed: LLM client stops error amplification (no more 3x API calls on auth/network failures)
	- Fixed: LLM client enforces token budget (was declared but never checked)
	- Fixed: LLM client adds retry logic with exponential backoff for transient failures
	- Fixed: LLM client JSON parsing regex no longer strips all whitespace (was mangling responses)
	- Fixed: `pipeline.py` `NameError: max_corr` — correlation is now computed before checklist call
	- Fixed: `pipeline.py` `_submit_or_dryrun` reuses `self.brain` instead of creating new clients
	- Fixed: `run.py` uses single `asyncio.run()` — no more session leak
	- Fixed: `acceptance_checklist.py` RETURNS-CORR check no longer always fails (threshold raised from 0.05 to 0.95)
	- Fixed: `factor_store.py` uses DuckDB transaction context manager instead of string-based BEGIN/COMMIT
	- Fixed: `ui.py` SQL uses parameterized LIMIT instead of f-string injection
	- Fixed: `expression_compiler.py` `_validate_expression` is now called, issues logged
	- Fixed: `expression_mutator.py` regex now handles uppercase field IDs (e.g., `mdl77_2GlobalDev...`)
	- Fixed: `proven_templates.py` decay parameter is now passed through (was hardcoded to 5)
	- Fixed: `theme_sampler.py` `pick_theme()` has alive-theme fallback when all themes exhausted
	- Fixed: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag
	- Fixed: Removed orphan `rag.py` (arXiv retrieval not wired, will be re-added when integrated)
	- Fixed: Added missing `local/__init__.py` and `orchestration/__init__.py`
	- Fixed: `pyproject.toml` version bumped to 0.2.0, removed unused `scipy` dependency
	- New: Proven template mode (`--proven`) generates expressions without any LLM
	- New: Winner memory integration in pipeline (records winners/failures/iterations)
	- New: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
	- New: Acceptance checklist (14 checks, wired before BRAIN submission)
	- New: Parallel batch processing with `max_parallel_candidates` semaphore
	- New: 64+ comprehensive tests covering templates, lint, mutations, config, fitness, fields, groups
	- New: `_process_candidate()` unified path — both proven and LLM candidates flow through same pipeline
	- Updated: Honest README that accurately describes what works and what doesn't
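
The semaphore-bounded batch processing mentioned in this changelog can be sketched with `asyncio`. This is an illustration, not the code in `pipeline.py`: the functions `process_batch` and `fake_worker` are hypothetical, and only the `max_parallel_candidates` name comes from this README.

```python
import asyncio


async def process_batch(candidates, worker, max_parallel_candidates=3):
    """Run worker(candidate) for every candidate, at most N at a time."""
    semaphore = asyncio.Semaphore(max_parallel_candidates)

    async def guarded(candidate):
        async with semaphore:  # wait here while N workers are already running
            return await worker(candidate)

    # gather returns results in the same order as the input candidates.
    return await asyncio.gather(*(guarded(c) for c in candidates))


async def demo():
    running = 0
    peak = 0

    async def fake_worker(candidate):
        nonlocal running, peak
        running += 1
        peak = max(peak, running)
        await asyncio.sleep(0.01)  # stand-in for an LLM call
        running -= 1
        return candidate.upper()

    results = await process_batch(["a", "b", "c", "d", "e"], fake_worker, 2)
    return results, peak


results, peak = asyncio.run(demo())
```

The semaphore caps concurrency without changing result order, so downstream stages can still match each result to its candidate by position.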

	## License

	MIT — use at your own risk. This is not financial advice. BRAIN simulations are the ground truth.

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern