gaurv007 committed on
Commit 74675f0 · verified · 1 Parent(s): c92d64f

fix: honest README — removes false claims, documents what actually works, adds proper usage instructions

Files changed (1):
  1. README.md +91 -119

README.md CHANGED
@@ -1,154 +1,126 @@
- ---
- tags:
- - ml-intern
- ---
- # Alpha Factory — Open-Source LLM-Driven Pipeline for WorldQuant BRAIN

- Autonomous alpha generation system using multi-LLM agents with 7-layer acceptance engineering.

- ## Quick Start

- ```bash
- # Install uv (if not already installed)
- # Windows:
- powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
- # macOS/Linux:
- curl -LsSf https://astral.sh/uv/install.sh | sh

- # Clone
- git clone https://huggingface.co/gaurv007/alpha-factory
- cd alpha-factory

- # Install (uv handles everything: venv, deps, lockfile)
- uv sync

- # With optional RAG support
- uv sync --extra rag

- # With all optional deps
- uv sync --extra all

- # Start Ollama (local LLM server)
- ollama pull qwen2.5:1.5b
- ollama pull qwen2.5:7b
- ollama serve

- # Dry run (no BRAIN credits spent)
- uv run python -m alpha_factory.run --dry-run --batch-size 5

- # Interactive model selection
- uv run python -m alpha_factory.run --interactive --dry-run

- # With HuggingFace cloud models
- uv run python -m alpha_factory.run --hf-token hf_your_token --batch-size 10

- # Run tests
- uv run pytest tests/ -v
  ```

- ## Architecture

- ```
- Theme Sampler → Hypothesis Hunter (Microfish) → Expression Compiler (Jinja/Tinyfish)
- → Static Lint → Dedup → BRAIN Submit → Crowd Scout (Mediumfish)
- → Performance Surgeon (Mediumfish) → Gatekeeper (Bigfish) → Portfolio
  ```

- ## 6 LLM Personas

- | # | Persona | Model Tier | Job |
- |---|---------|------------|-----|
- | 1 | Hypothesis Hunter | Microfish (1.5B) | Generate novel factor blueprints |
- | 2 | Expression Compiler | Tinyfish (3B) / Jinja | Convert blueprint to BRAIN expression |
- | 3 | Look-Ahead Sniffer | Deterministic | Static analysis for future leakage |
- | 4 | Crowd Scout | Mediumfish (7B) | Novelty + correlation check |
- | 5 | Performance Surgeon | Mediumfish (7B) | Diagnose failures, suggest fixes |
- | 6 | Production Gatekeeper | Bigfish (14-72B) | Final go/no-go memo |

- ## Model Support

- Automatically detects and uses:
- - **Ollama (local)** — auto-detected at localhost:11434
- - **HuggingFace Inference API (cloud)** — set HF_TOKEN env var
- - **vLLM (local/remote)** — any OpenAI-compatible endpoint

- Use `--interactive` flag to manually pick models for each tier from a dropdown.

- ## Key Features

- - Zero recurring cost: all LLMs run locally via Ollama
- - Schema-constrained generation — no hallucinated operators
- - 7-layer acceptance engineering — saves 60%+ BRAIN credits
- - Deterministic kill switches — circuit breakers for runaway pipelines
- - Factor store — DuckDB persistence for all alpha history
- - Dead theme registry — avoids re-exploring failed themes
- - Local BRAIN simulator — triage alphas before spending credits

- ## File Structure

  ```
90
  alpha_factory/
- ├── config.py                  # All settings (Pydantic)
- ├── run.py                     # Entry point
- ├── schemas/                   # Typed contracts
- ├── deterministic/
- │   ├── lint.py                # Static pre-flight (Layer 2)
- │   ├── theme_sampler.py       # Gap analysis (Layer 1)
- │   ├── fitness.py             # Composite scoring
- │   ├── regime_tagger.py       # Vol/trend/rate/style regimes
- │   └── acceptance_checklist.py # 14-point checklist
- ├── infra/
- │   ├── model_manager.py       # Ollama + HF auto-detection
- │   ├── llm_client.py          # Unified LLM interface
- │   ├── factor_store.py        # DuckDB persistence
- │   ├── wq_client.py           # BRAIN API wrapper
- │   └── rag.py                 # ChromaDB + arXiv
- ├── local/
- │   └── brain_sim.py           # Local BRAIN simulator (Layer 4)
- ├── personas/
- │   ├── hypothesis_hunter.py   # Persona 1
- │   ├── expression_compiler.py # Persona 2
- │   ├── crowd_scout.py         # Persona 4
- │   ├── performance_surgeon.py # Persona 5
- │   └── gatekeeper.py          # Persona 6
- └── orchestration/
-     └── pipeline.py            # Full DAG
  ```

- ## Setup

- 1. Install uv: https://docs.astral.sh/uv/getting-started/installation/
- 2. `uv sync`
- 3. Install Ollama: https://ollama.ai
- 4. Pull models: `ollama pull qwen2.5:1.5b && ollama pull qwen2.5:7b`
- 5. Place your `operators.csv` and `fields_USA_TOP3000_D1.csv` in `data/`
- 6. Run: `uv run python -m alpha_factory.run --dry-run --interactive`

- ## Cost

- | Item | Cost |
- |------|------|
- | Local GPU (RTX 3090/4090) | $0 (already owned) |
- | BRAIN account | $0 (existing) |
- | uv + Ollama + all deps | $0 |
- | Monthly running cost | **$0** |

- <!-- ml-intern-provenance -->
- ## Generated by ML Intern

- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- - Try ML Intern: https://smolagents-ml-intern.hf.space
- - Source code: https://github.com/huggingface/ml-intern

- ## Usage

- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer

- model_id = "gaurv007/alpha-factory"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
- ```

- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
 
+ # Alpha Factory

+ **LLM-assisted alpha expression generator for WorldQuant BRAIN.**

+ > ⚠️ This is a **prototype tool**, not a production system. It generates candidate expressions for manual review and BRAIN submission.

+ ## What It Actually Does

+ 1. **Picks a theme** from 12 data-driven domains (deterministic gap analysis)
+ 2. **Generates a hypothesis** using an LLM (Qwen via HuggingFace Inference API)
+ 3. **Compiles to a BRAIN expression** (Jinja templates for proven archetypes, LLM fallback for novel ones)
+ 4. **Lints the expression** (validates 71 operators, checks arity, parentheses, look-ahead, coverage)
+ 5. **Stores the result in DuckDB** for review

+ That's it. Steps 1-5 work. Everything below is scaffolding for future development.
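The deterministic gap analysis in step 1 can be sketched roughly as follows. The theme names, counts, and function shape below are illustrative assumptions, not the repo's actual API; the real logic lives in `deterministic/theme_sampler.py` and reads history from the DuckDB factor store:

```python
from collections import Counter

# Hypothetical theme names; the real sampler covers 12 domains.
THEMES = ["analyst_revisions", "news_sentiment", "options_flow", "volatility"]

def pick_theme(history: list[str]) -> str:
    """Deterministic gap analysis: pick the least-explored theme.

    `history` is the list of themes already used for past alphas.
    Themes never tried count as 0, so they are sampled first.
    """
    counts = Counter(history)
    return min(THEMES, key=lambda t: counts[t])

print(pick_theme(["news_sentiment", "news_sentiment", "volatility"]))
# → analyst_revisions (never explored, so it wins)
```

Because the choice is a pure function of the stored history, re-running the sampler on the same store yields the same theme, which keeps the pipeline reproducible.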
 

+ ## What Does NOT Work Yet

+ - ❌ BRAIN API submission (no client connected)
+ - ❌ Crowd Scout / Performance Surgeon / Gatekeeper personas (imported but never called)
+ - ❌ RAG over arXiv papers (stub only)
+ - ❌ Local BRAIN simulator (exists but not wired into pipeline)
+ - ❌ Feedback loop / evolutionary improvement
+ - ❌ Automatic iteration on near-misses

+ ## Quickstart

+ ```bash
+ git clone https://huggingface.co/gaurv007/alpha-factory
+ cd alpha-factory
+ uv sync
+ ```

+ Create `.env`:
+ ```
+ HF_TOKEN=hf_your_token_here
  ```

+ ### Option 1: Proven Templates (RECOMMENDED — no LLM, guaranteed valid)

+ ```bash
+ uv run python alpha_factory/generate_proven.py
  ```

+ This uses your proven Alpha 15/6 structures with novel AC=0 fields. **Every expression is syntactically valid and ready to paste into BRAIN.**
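The field-swap idea behind the proven templates can be illustrated with a minimal sketch. The archetype string, operator names, and field names below are placeholders (not the repo's actual Alpha 15/6 structures, which live in `deterministic/proven_templates.py`); the repo uses Jinja, while this sketch uses stdlib `string.Template` to stay dependency-free:

```python
from string import Template

# Placeholder archetype in the spirit of a proven structure; not
# guaranteed BRAIN syntax, purely for illustration.
ARCHETYPE = Template("group_rank(ts_zscore(${field}, ${window}), ${group})")

# Hypothetical novel AC=0 fields to swap into the fixed structure.
NOVEL_FIELDS = ["mdl77_alpha_score", "mdl77_risk_score"]

def generate_candidates(window: int = 22, group: str = "sector") -> list[str]:
    """One candidate expression per field swap; structure never changes."""
    return [
        ARCHETYPE.substitute(field=f, window=window, group=group)
        for f in NOVEL_FIELDS
    ]

for expr in generate_candidates():
    print(expr)
```

Because the structure is fixed and only validated field names are substituted, every output is syntactically valid by construction, which is the property the option above relies on.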

+ ### Option 2: LLM-Assisted Generation

+ ```bash
+ uv run python -m alpha_factory.run --dry-run --batch-size 5
+ ```

+ Uses HuggingFace Inference API (Qwen 7B) to generate novel hypotheses. Quality varies — always lint-check before submitting to BRAIN.
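A minimal illustration of the kind of lint check being recommended here: a parenthesis-balance pass plus an operator whitelist. The operator names are assumptions for the sketch; the repo's linter (`deterministic/lint.py`) covers all 71 BRAIN operators plus arity, look-ahead, and coverage checks:

```python
import re

# Tiny whitelist for illustration; the real linter knows all 71 operators.
KNOWN_OPERATORS = {"rank", "ts_delta", "ts_mean"}

def lint(expr: str) -> list[str]:
    """Return a list of problems; an empty list means the expression passes."""
    errors = []
    # 1. Parentheses must balance and never close before opening.
    depth = 0
    for ch in expr:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:  # a ")" with no matching "("
            errors.append("unbalanced parentheses")
            break
    if depth > 0:      # a "(" that is never closed
        errors.append("unbalanced parentheses")
    # 2. Any identifier followed by "(" is treated as an operator call.
    for op in re.findall(r"([A-Za-z_]\w*)\s*\(", expr):
        if op not in KNOWN_OPERATORS:
            errors.append(f"unknown operator: {op}")
    return errors

print(lint("rank(ts_delta(close, 5))"))  # → []
print(lint("foo(close)"))                # → ['unknown operator: foo']
```

Running LLM output through a check like this before pasting anything into BRAIN is what keeps "quality varies" from costing simulation credits.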
 
 
 

+ ### Option 3: Gradio UI

+ ```bash
+ uv run python -m alpha_factory.ui
+ ```

+ View generated alphas with timestamps, copy expressions, and generate new batches from the browser.

+ ## Architecture

  ```
  alpha_factory/
+ ├── data/                     # BRAIN field registry (3,447 candidates), operators, groups
+ ├── brain_fields.py           # 35 highest-EV fields with AC, coverage, sign metadata
+ ├── brain_groups.py           # 15 novel neutralization keys (AC 3-20)
+ ├── __init__.py
+ ├── deterministic/            # No LLM required
+ │   ├── lint.py               # 71-operator validation + arity checks
+ │   ├── theme_sampler.py      # Gap analysis across 12 themes
+ │   ├── proven_templates.py   # Alpha 15/6 structure with field swaps
+ │   ├── expression_mutator.py # 5 mutation operators for iteration
+ │   └── fitness.py            # Composite fitness scoring
+ ├── personas/                 # LLM-powered agents
+ │   ├── hypothesis_hunter.py    # Generates blueprints (ACTIVE)
+ │   ├── expression_compiler.py  # Blueprint → BRAIN expression (ACTIVE)
+ │   ├── crowd_scout.py          # Novelty check (NOT WIRED)
+ │   ├── performance_surgeon.py  # Failure diagnosis (NOT WIRED)
+ │   └── gatekeeper.py           # Final go/no-go (NOT WIRED)
+ ├── infra/                    # Infrastructure
+ │   ├── llm_client.py         # Unified Ollama/HF client
+ │   ├── factor_store.py       # DuckDB persistence
+ │   ├── model_manager.py      # Auto-discovers available models
+ │   ├── winner_memory.py      # Feedback loop storage (NOT WIRED)
+ │   └── wq_client.py          # BRAIN API wrapper (NOT CONNECTED)
+ ├── orchestration/
+ │   └── pipeline.py           # Main pipeline (steps 1-5 only)
+ ├── run.py                    # CLI entry point
+ ├── ui.py                     # Gradio dashboard
+ └── generate_proven.py        # Standalone proven template generator
  ```

+ ## Field Strategy

+ The pipeline prioritizes fields by expected value:

+ | Tier | Dataset | Density (α/field) | Strategy |
+ |------|---------|-------------------|----------|
+ | 1 | model77 | 24 | Primary target — 5 fields with AC=0 globally |
+ | 2 | model16, news12 | 192-385 | Secondary — score derivatives |
+ | 3 | analyst4, option9, pv13 | 656-822 | Tertiary — supply chain, PCR |
+ | 4 | pv1, socialmedia | 2500-64350 | Avoid — over-mined |
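The tier ordering above amounts to a density sort with an over-mining cutoff. A sketch using a subset of the table's own datasets and density figures (the cutoff value and function shape are illustrative assumptions):

```python
# Dataset names and densities (alphas per field) taken from the table;
# lower density means more unexplored headroom, hence higher priority.
DATASETS = {
    "model77": 24,
    "model16": 192,
    "news12": 385,
    "analyst4": 656,
    "pv1": 2500,
    "socialmedia": 64350,
}

def prioritize(max_density: int = 1000) -> list[str]:
    """Rank datasets by ascending density, dropping over-mined ones."""
    eligible = [d for d, n in DATASETS.items() if n <= max_density]
    return sorted(eligible, key=DATASETS.__getitem__)

print(prioritize())
# → ['model77', 'model16', 'news12', 'analyst4']
```

With the default cutoff, tier-4 datasets are excluded entirely rather than merely deprioritized, matching the "Avoid" strategy in the table.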
 

+ ## BRAIN Submission Settings

+ When pasting expressions into BRAIN manually:
+ - Region: USA
+ - Universe: TOP3000
+ - Delay: 1
+ - Decay: 5
+ - Truncation: 0.08
+ - Pasteurization: ON
+ - NaN Handling: OFF

+ ## Requirements

+ - Python 3.11+
+ - HuggingFace token (free tier works for Qwen 7B)
+ - Optional: Ollama for local inference

+ ## License

+ MIT