fix: honest README — removes false claims, documents what actually works, adds proper usage instructions
README.md CHANGED

`@@ -1,154 +1,126 @@`
Removed (recoverable portions of the old README):

```diff
- tags:
-   - ml-intern
- ---
- # Alpha Factory — Open-Source LLM-Driven Pipeline for WorldQuant BRAIN
- # Install uv (if not already installed)
- # Windows:
- powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
- # macOS/Linux:
- curl -LsSf https://astral.sh/uv/install.sh | sh
- uv sync
- uv sync --extra rag
- ollama serve
- # Dry run (no BRAIN credits spent)
- uv run python -m alpha_factory.run --dry-run --batch-size 5
- uv run python -m alpha_factory.run --interactive --dry-run
- → Static Lint → Dedup → BRAIN Submit → Crowd Scout (Mediumfish)
- → Performance Surgeon (Mediumfish) → Gatekeeper (Bigfish) → Portfolio
- | 1 | Hypothesis Hunter | Microfish (1.5B) | Generate novel factor blueprints |
- | 2 | Expression Compiler | Tinyfish (3B) / Jinja | Convert blueprint to BRAIN expression |
- | 3 | Look-Ahead Sniffer | Deterministic | Static analysis for future leakage |
- | 4 | Crowd Scout | Mediumfish (7B) | Novelty + correlation check |
- | 5 | Performance Surgeon | Mediumfish (7B) | Diagnose failures, suggest fixes |
- | 6 | Production Gatekeeper | Bigfish (14-72B) | Final go/no-go memo |
- **Ollama (local)** — auto-detected at localhost:11434
- **HuggingFace Inference API (cloud)** — set HF_TOKEN env var
- **vLLM (local/remote)** — any OpenAI-compatible endpoint
- Schema-constrained generation — no hallucinated operators
- 7-layer acceptance engineering — saves 60%+ BRAIN credits
- Deterministic kill switches — circuit breakers for runaway pipelines
- Factor store — DuckDB persistence for all alpha history
- Dead theme registry — avoids re-exploring failed themes
- Local BRAIN simulator — triage alphas before spending credits
- 2. `uv sync`
- 3. Install Ollama: https://ollama.ai
- 4. Pull models: `ollama pull qwen2.5:1.5b && ollama pull qwen2.5:7b`
- 5. Place your `operators.csv` and `fields_USA_TOP3000_D1.csv` in `data/`
- 6. Run: `uv run python -m alpha_factory.run --dry-run --interactive`
- | Local GPU (RTX 3090/4090) | $0 (already owned) |
- | BRAIN account | $0 (existing) |
- | uv + Ollama + all deps | $0 |
- | Monthly running cost | **$0** |
- from transformers import AutoModelForCausalLM, AutoTokenizer
- model_id = "gaurv007/alpha-factory"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
```
# Alpha Factory

**LLM-assisted alpha expression generator for WorldQuant BRAIN.**

> ⚠️ This is a **prototype tool**, not a production system. It generates candidate expressions for manual review and BRAIN submission.
## What It Actually Does

1. **Picks a theme** from 12 data-driven domains (deterministic gap analysis)
2. **Generates a hypothesis** using an LLM (Qwen via HuggingFace Inference API)
3. **Compiles to BRAIN expression** (Jinja templates for proven archetypes, LLM fallback for novel)
4. **Lints the expression** (validates 71 operators, checks arity, parentheses, look-ahead, coverage)
5. **Stores in DuckDB** for review

That's it. Steps 1-5 work. Everything below is scaffolding for future development.
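The lint step can be pictured as a whitelist-plus-structure check. A heavily simplified sketch — the real `lint.py` validates 71 operators with arity checks, while the tiny registry and `lint()` helper below are purely illustrative:

```python
import re

# Illustrative operator whitelist — NOT the project's actual 71-operator registry.
OPERATOR_ARITY = {"rank": 1, "ts_rank": 2, "ts_delta": 2, "subtract": 2}

def lint(expression: str) -> list[str]:
    """Return a list of problems found in a candidate expression."""
    problems = []
    # Structural check: parentheses must balance.
    if expression.count("(") != expression.count(")"):
        problems.append("unbalanced parentheses")
    # Whitelist check: every name followed by "(" must be a known operator.
    for name in re.findall(r"[a-z_]+(?=\()", expression):
        if name not in OPERATOR_ARITY:
            problems.append(f"unknown operator: {name}")
    return problems

print(lint("rank(ts_delta(close, 5))"))  # []
print(lint("foo(rank(close)"))           # flags both problems
```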
## What Does NOT Work Yet

- ❌ BRAIN API submission (no client connected)
- ❌ Crowd Scout / Performance Surgeon / Gatekeeper personas (imported but never called)
- ❌ RAG over arXiv papers (stub only)
- ❌ Local BRAIN simulator (exists but not wired into pipeline)
- ❌ Feedback loop / evolutionary improvement
- ❌ Automatic iteration on near-misses
## Quickstart

```bash
git clone https://huggingface.co/gaurv007/alpha-factory
cd alpha-factory
uv sync
```

Create `.env`:

```
HF_TOKEN=hf_your_token_here
```
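The token can then be read from this file at startup. A dependency-free sketch — the project may well use `python-dotenv` or plain `os.environ` instead, and `load_env` here is a hypothetical helper, not a project API:

```python
import os

def load_env(path: str = ".env") -> None:
    """Load KEY=VALUE lines from a .env file into os.environ (existing vars win)."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            # Skip blanks and comments; split on the first "=".
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
token = os.environ.get("HF_TOKEN")  # None if .env or the key is missing
```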
### Option 1: Proven Templates (RECOMMENDED — no LLM, guaranteed valid)

```bash
uv run python alpha_factory/generate_proven.py
```

This uses your proven Alpha 15/6 structures with novel AC=0 fields. **Every expression is syntactically valid and ready to paste into BRAIN.**
### Option 2: LLM-Assisted Generation

```bash
uv run python -m alpha_factory.run --dry-run --batch-size 5
```

Uses HuggingFace Inference API (Qwen 7B) to generate novel hypotheses. Quality varies — always lint-check before submitting to BRAIN.
### Option 3: Gradio UI

```bash
uv run python -m alpha_factory.ui
```

View generated alphas with timestamps, copy expressions, and generate new batches from the browser.
## Architecture

```
alpha_factory/
├── data/                      # BRAIN field registry (3,447 candidates), operators, groups
│   ├── brain_fields.py        # 35 highest-EV fields with AC, coverage, sign metadata
│   ├── brain_groups.py        # 15 novel neutralization keys (AC 3-20)
│   └── __init__.py
├── deterministic/             # No LLM required
│   ├── lint.py                # 71-operator validation + arity checks
│   ├── theme_sampler.py       # Gap analysis across 12 themes
│   ├── proven_templates.py    # Alpha 15/6 structure with field swaps
│   ├── expression_mutator.py  # 5 mutation operators for iteration
│   └── fitness.py             # Composite fitness scoring
├── personas/                  # LLM-powered agents
│   ├── hypothesis_hunter.py   # Generates blueprints (ACTIVE)
│   ├── expression_compiler.py # Blueprint → BRAIN expression (ACTIVE)
│   ├── crowd_scout.py         # Novelty check (NOT WIRED)
│   ├── performance_surgeon.py # Failure diagnosis (NOT WIRED)
│   └── gatekeeper.py          # Final go/no-go (NOT WIRED)
├── infra/                     # Infrastructure
│   ├── llm_client.py          # Unified Ollama/HF client
│   ├── factor_store.py        # DuckDB persistence
│   ├── model_manager.py       # Auto-discovers available models
│   ├── winner_memory.py       # Feedback loop storage (NOT WIRED)
│   └── wq_client.py           # BRAIN API wrapper (NOT CONNECTED)
├── orchestration/
│   └── pipeline.py            # Main pipeline (steps 1-5 only)
├── run.py                     # CLI entry point
├── ui.py                      # Gradio dashboard
└── generate_proven.py         # Standalone proven template generator
```
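The `factor_store.py` idea boils down to an insert-and-query table. A sketch using `sqlite3` as a dependency-free stand-in (the project persists to DuckDB, whose Python API is very similar); the `alphas` table name and columns are assumptions, not the project's actual schema:

```python
import sqlite3

# Stand-in for DuckDB persistence; hypothetical schema for illustration only.
con = sqlite3.connect(":memory:")  # the real store would use a file path
con.execute("""
    CREATE TABLE IF NOT EXISTS alphas (
        id INTEGER PRIMARY KEY,
        theme TEXT,
        expression TEXT,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
con.execute(
    "INSERT INTO alphas (theme, expression) VALUES (?, ?)",
    ("analyst-revisions", "rank(ts_delta(close, 5))"),
)
rows = con.execute("SELECT theme, expression FROM alphas").fetchall()
print(rows)  # [('analyst-revisions', 'rank(ts_delta(close, 5))')]
```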
## Field Strategy

The pipeline prioritizes fields by expected value:

| Tier | Dataset | Density (α/field) | Strategy |
|------|---------|-------------------|----------|
| 1 | model77 | 24 | Primary target — 5 fields with AC=0 globally |
| 2 | model16, news12 | 192-385 | Secondary — score derivatives |
| 3 | analyst4, option9, pv13 | 656-822 | Tertiary — supply chain, PCR |
| 4 | pv1, socialmedia | 2500-64350 | Avoid — over-mined |
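The tiering amounts to sorting datasets by alpha density and skipping the over-mined tail. A sketch — the densities come from the table above, while the `prioritize()` helper and its threshold are purely illustrative:

```python
# Alphas already mined per field, from the tier table above (subset).
datasets = {
    "model77": 24,
    "model16": 192,
    "news12": 385,
    "analyst4": 656,
    "pv1": 2500,
    "socialmedia": 64350,
}

def prioritize(datasets: dict[str, int], avoid_above: int = 1000) -> list[str]:
    """Order datasets least-mined first, dropping over-mined ones entirely."""
    kept = {name: density for name, density in datasets.items() if density <= avoid_above}
    return sorted(kept, key=kept.get)

print(prioritize(datasets))  # ['model77', 'model16', 'news12', 'analyst4']
```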
## BRAIN Submission Settings

When pasting expressions into BRAIN manually:
- Region: USA
- Universe: TOP3000
- Delay: 1
- Decay: 5
- Truncation: 0.08
- Pasteurization: ON
- NaN Handling: OFF
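If these settings were ever encoded for the not-yet-connected `wq_client.py`, they might look like the dict below — the key names are assumptions; only the values come from the list above:

```python
# Hypothetical encoding of the manual simulation settings; key names assumed.
SIM_SETTINGS = {
    "region": "USA",
    "universe": "TOP3000",
    "delay": 1,
    "decay": 5,
    "truncation": 0.08,
    "pasteurization": "ON",
    "nanHandling": "OFF",
}
```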
## Requirements

- Python 3.11+
- HuggingFace token (free tier works for Qwen 7B)
- Optional: Ollama for local inference
## License

MIT