---
license: mit
tags:
- quantitative-finance
- alpha-generation
- worldquant-brain
- ml-intern
---
# Alpha Factory v0.2.0
**LLM-assisted alpha expression generator for WorldQuant BRAIN.**
> This is a Python application, not a model. It generates candidate BRAIN expressions for manual review and submission.
## What It Actually Does
| Mode | What it does | Needs LLM? | Submits to BRAIN? |
|------|-------------|-----------|-------------------|
| **Proven Templates** (`--proven`) | Generates valid BRAIN expressions using hardcoded templates with novel fields | No | Optional (`--enable-brain`) |
| **LLM Mode** (default) | Uses 1.5B-72B LLMs to generate hypothesis → expression → evaluation | Yes (local/cloud) | Optional |
The pipeline runs:
1. **Theme sampling** — picks under-explored themes from the field registry
2. **Expression generation** — deterministic templates OR LLM hypothesis + compile
3. **Static lint** — validates BRAIN syntax (operator arity, look-ahead, parentheses)
4. **Deduplication** — SHA256 hash to avoid duplicates
5. **Store** — persists to DuckDB for review
6. **Crowd Scout** — novelty assessment (LLM or heuristic)
7. **Performance Surgeon** — diagnoses weak alphas, suggests mutations
8. **Gatekeeper** — final go/no-go (LLM)
9. **BRAIN submission** — live submission (requires `BRAIN_SESSION_TOKEN`)
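The deduplication step (4) can be sketched as hashing a whitespace-normalized expression. The function name and normalization rule below are illustrative assumptions, not the repository's actual code:

```python
import hashlib

def expression_fingerprint(expr: str) -> str:
    """Hypothetical dedup key: strip all whitespace so trivially
    reformatted duplicates collapse, then SHA256 the result."""
    normalized = "".join(expr.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

a = expression_fingerprint("rank(ts_delta(close, 5))")
b = expression_fingerprint("rank( ts_delta(close,  5) )")
# a == b: both normalize to the same string before hashing
```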
## Quick Start
```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
pip install -e ".[all]"
```
### Proven Templates (RECOMMENDED — no LLM, guaranteed valid)
```bash
python -m alpha_factory.run --proven --batch-size 10
```
Uses proven Alpha 15/6 structures with novel AC=0 fields. Every expression is syntactically valid.
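The template mechanism reduces to string substitution over a hardcoded skeleton. A minimal sketch, with a skeleton loosely in the spirit of the Alpha 6 structure (illustrative only; the real templates live in `proven_templates.py`):

```python
# Hypothetical skeleton; {field} and {window} are filled from the
# field registry with novel entries.
TEMPLATE = "-1 * correlation(rank({field}), rank(volume), {window})"

def fill_template(field: str, window: int) -> str:
    """Fill the skeleton; output is syntactically valid by construction."""
    return TEMPLATE.format(field=field, window=window)

expr = fill_template("pv13_customergraphrank", 6)
# → "-1 * correlation(rank(pv13_customergraphrank), rank(volume), 6)"
```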
### LLM-Assisted Generation
```bash
# Needs Ollama or HF token
python -m alpha_factory.run --batch-size 5 --dry-run
```
### Standalone Proven Generator
```bash
python -m alpha_factory.generate_proven
```
### Run Tests
```bash
pytest tests/ -v
```
## BRAIN Setup
BRAIN uses **session-based authentication** (browser cookies), not API keys. The session token expires quickly.
1. Log in to [brain.worldquant.com](https://brain.worldquant.com)
2. Open browser DevTools → Network tab
3. Run any simulation, look for request to `api.worldquantbrain.com`
4. Copy the `Cookie: session=...` header value
5. Set: `export BRAIN_SESSION_TOKEN=your_token_here`
**No token = dry-run mode only.** The pipeline will generate expressions but not submit them.
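A minimal sketch of how a client might attach the token (the helper name is hypothetical; the real logic lives in `wq_client.py`):

```python
import os

def brain_headers() -> dict:
    """Build the session Cookie header from the environment.
    Raises instead of submitting when no token is set (dry-run)."""
    token = os.environ.get("BRAIN_SESSION_TOKEN")
    if not token:
        raise RuntimeError("BRAIN_SESSION_TOKEN not set -- dry-run only")
    return {"Cookie": f"session={token}"}

os.environ["BRAIN_SESSION_TOKEN"] = "example-token"  # demo value only
headers = brain_headers()
```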
## Architecture
```
Theme Sampler → Expression Generation → Static Lint → Dedup → Store
      ↓           (Templates or LLM)                    ↓       ↓
Crowd Scout → Performance Surgeon → Gatekeeper → BRAIN Submit
                      ↓ (iteration queue)
Winner Memory ← Mutator ← Performance Surgeon
```
### Components
| Module | Purpose | Status |
|--------|---------|--------|
| `proven_templates.py` | Deterministic expression generation | ✅ Working |
| `lint.py` | BRAIN syntax validation (arity, lookahead, parens) | ✅ Working |
| `pipeline.py` | Orchestrates all stages | ✅ Working |
| `expression_compiler.py` | Jinja2 templates + LLM fallback | ✅ Working |
| `crowd_scout.py` | Novelty / correlation assessment | ✅ Working |
| `performance_surgeon.py` | Diagnose failures, suggest mutations | ✅ Working |
| `gatekeeper.py` | Final go/no-go memo | ✅ Working |
| `wq_client.py` | BRAIN API submission | ⚠️ Needs `BRAIN_SESSION_TOKEN` |
| `brain_sim.py` | Local numpy backtest | ⚠️ Not wired to pipeline |
| `regime_tagger.py` | Vol/trend/rate/style regimes | ⚠️ Not wired to pipeline |
## Key Features
- **Proven template mode**: No LLM required. Generates valid BRAIN expressions instantly.
- **Field registry**: 40+ real BRAIN fields with coverage, alpha counts, and sign conventions.
- **Novel group keys**: Uses under-explored neutralization groups (AC ≤ 30) for lower correlation.
- **Static lint**: Catches syntax errors before submission (operator arity, look-ahead, unbalanced parens).
- **Kill switches**: Circuit breakers for runaway pipelines (consecutive fails, daily limits, token budget).
- **Winner memory**: Tracks which field/archetype combinations work, feeds back to generation.
- **Expression mutator**: Auto-generates decay, horizon, neutralization, and sign-flip variants.
- **DuckDB store**: Persistent history of all alphas, metrics, and verdicts.
- **Retry logic**: LLM client retries transient failures (429, 502, 503, 504, timeout) with exponential backoff.
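The static-lint feature boils down to cheap string checks before any network call. A simplified sketch with an abridged operator table (the real rules live in `lint.py` and `operators.csv`):

```python
# Abridged arity table; the repository's operators.csv is the source of truth.
ARITY = {"ts_mean": 2, "ts_std": 2, "ts_delta": 2}

def arg_count(expr: str, start: int) -> int:
    """Count top-level commas in the parenthesized span opening at `start`."""
    depth, commas = 0, 0
    for ch in expr[start:]:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                break
        elif ch == "," and depth == 1:
            commas += 1
    return commas + 1

def lint(expr: str) -> list:
    errors = []
    if expr.count("(") != expr.count(")"):
        errors.append("unbalanced parentheses")
        return errors
    # Naive substring search; a real linter would tokenize first.
    for op, want in ARITY.items():
        idx = expr.find(op + "(")
        if idx != -1:
            got = arg_count(expr, idx + len(op))
            if got != want:
                errors.append(f"{op} expects {want} args, got {got}")
    return errors
```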
## Known Limitations
1. **BRAIN auth is session-based**: The token expires and there is no automatic refresh; you must re-copy it from the browser.
2. **Local simulation is not wired**: `brain_sim.py` exists but is not integrated into the pipeline. It needs price data (yfinance) and produces approximate results.
3. **Regime tagger not wired**: `regime_tagger.py` exists but is not used by the Performance Surgeon.
4. **LLM generation can hallucinate fields**: Static lint catches most errors, but field names from LLMs may not exist on BRAIN.
5. **Weights inside `rank()` are decorative**: `rank(0.6*a + 0.4*b)` is monotonic, so the coefficients do not linearly combine; the signal comes from which fields are combined.
6. **Not a guarantee of profitable alphas**: This generates candidates. BRAIN's simulation is the ground truth.
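Limitation 5 rests on rank being invariant under positive monotonic transforms of its argument. A toy demonstration, with a plain argsort-based cross-sectional rank standing in for BRAIN's `rank()` operator:

```python
def rank(values):
    """Cross-sectional rank mapped to [0, 1]; stand-in for BRAIN's rank()."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    last = len(values) - 1
    for pos, i in enumerate(order):
        ranks[i] = pos / last
    return ranks

x = [3.0, -1.0, 2.0, 0.5]
# Any positive scaling of the whole argument leaves the ranks unchanged,
# so such a coefficient carries no signal on its own.
scaled_same = rank([0.6 * v for v in x]) == rank(x)
```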
## Configuration
All settings in `alpha_factory/config.py`. Key ones:
```python
batch_size = 10 # Candidates per run
use_proven_templates = False # Set True for deterministic mode
enable_brain_client = False # Set True for live BRAIN submission
max_parallel_candidates = 3 # Concurrent LLM calls
```
Or via CLI:
```bash
python -m alpha_factory.run --proven --batch-size 10 --enable-brain
```
## Cost
| Item | Cost |
|------|------|
| Proven template mode | $0 (no LLM) |
| Local Ollama (7B) | $0 (your GPU) |
| HuggingFace Inference API | Free tier / rate-limited |
| BRAIN submissions | $0 (uses your existing BRAIN credits) |
## File Structure
```
alpha_factory/
├── config.py                  # All settings (Pydantic v2)
├── run.py                     # Entry point
├── schemas/                   # Typed Pydantic contracts
├── deterministic/
│   ├── lint.py                # Static pre-flight (Layer 2)
│   ├── theme_sampler.py       # Gap analysis (Layer 1)
│   ├── fitness.py             # Composite scoring
│   ├── proven_templates.py    # Deterministic generation
│   ├── expression_mutator.py  # Evolutionary variants
│   └── regime_tagger.py       # Vol/trend/rate/style regimes (not wired)
├── infra/
│   ├── model_manager.py       # Ollama + HF auto-detection
│   ├── llm_client.py          # Unified LLM interface (token budget + retry)
│   ├── factor_store.py        # DuckDB persistence
│   ├── wq_client.py           # BRAIN API wrapper (session auth)
│   └── winner_memory.py       # Feedback loop
├── local/
│   └── brain_sim.py           # Local BRAIN simulator (not wired)
├── personas/
│   ├── hypothesis_hunter.py   # Persona 1
│   ├── expression_compiler.py # Persona 2
│   ├── crowd_scout.py         # Persona 4
│   ├── performance_surgeon.py # Persona 5
│   └── gatekeeper.py          # Persona 6
└── orchestration/
    └── pipeline.py            # Full DAG
```
## Changelog v0.2.0
- **Fixed**: `pv13_ustomergraphrank` → `pv13_customergraphrank` typos
- **Fixed**: `operators.csv` arity mismatches (ts_mean, ts_std, ts_delta now correctly listed as 2-arg)
- **Fixed**: `cleanup.py` no longer blacklists valid BRAIN fields (`vwap`, `close`, `volume`, etc.)
- **Fixed**: `personas/__init__.py` imports real modules instead of stubs
- **Fixed**: `infra/__init__.py` imports real `BrainClient` instead of stub class
- **Fixed**: Expression compiler sign logic — now per-component, no global blind negation
- **Fixed**: LLM client stops error amplification (no more 3x API calls on auth/network failures)
- **Fixed**: LLM client enforces token budget (was declared but never checked)
- **Fixed**: LLM client adds retry logic with exponential backoff for transient failures (429, 502, 503, 504, timeout)
- **Fixed**: Removed dead `enable_local_sim` config field and `--local-sim` CLI flag (local sim exists but is not wired)
- **Fixed**: Removed orphan `rag.py` (arXiv retrieval not wired, will be re-added when integrated)
- **Fixed**: Added missing `local/__init__.py` for proper package structure
- **Fixed**: Added GitHub Actions CI workflow (`.github/workflows/ci.yml`)
- **New**: Proven template mode (`--proven`) generates expressions without any LLM
- **New**: Winner memory integration in pipeline (records winners/failures/iterations)
- **New**: Expression mutator integration (auto-generates decay/horizon/group/sign variants)
- **New**: Parallel batch processing with `max_parallel_candidates` semaphore
- **New**: 32 comprehensive tests covering templates, lint, mutations, config, fitness, fields, groups
- **Updated**: Honest README that accurately describes what works and what doesn't
## License
MIT — use at your own risk. This is not financial advice. BRAIN simulations are the ground truth.
<!-- ml-intern-provenance -->
## Generated by ML Intern
This repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern