gaurv007 committed on
Commit 74675f0 · verified · 1 Parent(s): c92d64f

fix: honest README — removes false claims, documents what actually works, adds proper usage instructions

Files changed (1):
  1. README.md +91 -119

README.md CHANGED
@@ -1,154 +1,126 @@
- ---
- tags:
- - ml-intern
- ---
- # Alpha Factory — Open-Source LLM-Driven Pipeline for WorldQuant BRAIN

- Autonomous alpha generation system using multi-LLM agents with 7-layer acceptance engineering.

- ## Quick Start

- ```bash
- # Install uv (if not already installed)
- # Windows:
- powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
- # macOS/Linux:
- curl -LsSf https://astral.sh/uv/install.sh | sh

- # Clone
- git clone https://huggingface.co/gaurv007/alpha-factory
- cd alpha-factory

- # Install (uv handles everything: venv, deps, lockfile)
- uv sync

- # With optional RAG support
- uv sync --extra rag

- # With all optional deps
- uv sync --extra all

- # Start Ollama (local LLM server)
- ollama pull qwen2.5:1.5b
- ollama pull qwen2.5:7b
- ollama serve

- # Dry run (no BRAIN credits spent)
- uv run python -m alpha_factory.run --dry-run --batch-size 5

- # Interactive model selection
- uv run python -m alpha_factory.run --interactive --dry-run

- # With HuggingFace cloud models
- uv run python -m alpha_factory.run --hf-token hf_your_token --batch-size 10

- # Run tests
- uv run pytest tests/ -v
  ```

- ## Architecture

- ```
- Theme Sampler → Hypothesis Hunter (Microfish) → Expression Compiler (Jinja/Tinyfish)
- → Static Lint → Dedup → BRAIN Submit → Crowd Scout (Mediumfish)
- → Performance Surgeon (Mediumfish) → Gatekeeper (Bigfish) → Portfolio
  ```

- ## 6 LLM Personas

- | # | Persona | Model Tier | Job |
- |---|---------|------------|-----|
- | 1 | Hypothesis Hunter | Microfish (1.5B) | Generate novel factor blueprints |
- | 2 | Expression Compiler | Tinyfish (3B) / Jinja | Convert blueprint to BRAIN expression |
- | 3 | Look-Ahead Sniffer | Deterministic | Static analysis for future leakage |
- | 4 | Crowd Scout | Mediumfish (7B) | Novelty + correlation check |
- | 5 | Performance Surgeon | Mediumfish (7B) | Diagnose failures, suggest fixes |
- | 6 | Production Gatekeeper | Bigfish (14-72B) | Final go/no-go memo |

- ## Model Support

- Automatically detects and uses:
- - **Ollama (local)** — auto-detected at localhost:11434
- - **HuggingFace Inference API (cloud)** — set HF_TOKEN env var
- - **vLLM (local/remote)** — any OpenAI-compatible endpoint

- Use `--interactive` flag to manually pick models for each tier from a dropdown.

- ## Key Features

- - Zero recurring cost: all LLMs run locally via Ollama
- - Schema-constrained generation — no hallucinated operators
- - 7-layer acceptance engineering — saves 60%+ BRAIN credits
- - Deterministic kill switches — circuit breakers for runaway pipelines
- - Factor store — DuckDB persistence for all alpha history
- - Dead theme registry — avoids re-exploring failed themes
- - Local BRAIN simulator — triage alphas before spending credits

- ## File Structure

  ```
90
  alpha_factory/
- ├── config.py                  # All settings (Pydantic)
- ├── run.py                     # Entry point
- ├── schemas/                   # Typed contracts
- ├── deterministic/
- │   ├── lint.py                # Static pre-flight (Layer 2)
- │   ├── theme_sampler.py       # Gap analysis (Layer 1)
- │   ├── fitness.py             # Composite scoring
- │   ├── regime_tagger.py       # Vol/trend/rate/style regimes
- │   └── acceptance_checklist.py # 14-point checklist
- ├── infra/
- │   ├── model_manager.py       # Ollama + HF auto-detection
- │   ├── llm_client.py          # Unified LLM interface
- │   ├── factor_store.py        # DuckDB persistence
- │   ├── wq_client.py           # BRAIN API wrapper
- │   └── rag.py                 # ChromaDB + arXiv
- ├── local/
- │   └── brain_sim.py           # Local BRAIN simulator (Layer 4)
- ├── personas/
- │   ├── hypothesis_hunter.py   # Persona 1
- │   ├── expression_compiler.py # Persona 2
- │   ├── crowd_scout.py         # Persona 4
- │   ├── performance_surgeon.py # Persona 5
- │   └── gatekeeper.py          # Persona 6
- └── orchestration/
-     └── pipeline.py            # Full DAG
  ```

- ## Setup

- 1. Install uv: https://docs.astral.sh/uv/getting-started/installation/
- 2. `uv sync`
- 3. Install Ollama: https://ollama.ai
- 4. Pull models: `ollama pull qwen2.5:1.5b && ollama pull qwen2.5:7b`
- 5. Place your `operators.csv` and `fields_USA_TOP3000_D1.csv` in `data/`
- 6. Run: `uv run python -m alpha_factory.run --dry-run --interactive`

- ## Cost

- | Item | Cost |
- |------|------|
- | Local GPU (RTX 3090/4090) | $0 (already owned) |
- | BRAIN account | $0 (existing) |
- | uv + Ollama + all deps | $0 |
- | Monthly running cost | **$0** |

- <!-- ml-intern-provenance -->
- ## Generated by ML Intern

- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- - Try ML Intern: https://smolagents-ml-intern.hf.space
- - Source code: https://github.com/huggingface/ml-intern

- ## Usage

- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer

- model_id = "gaurv007/alpha-factory"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
- ```

- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
 
+ # Alpha Factory

+ **LLM-assisted alpha expression generator for WorldQuant BRAIN.**

+ > ⚠️ This is a **prototype tool**, not a production system. It generates candidate expressions for manual review and BRAIN submission.

+ ## What It Actually Does

+ 1. **Picks a theme** from 12 data-driven domains (deterministic gap analysis)
+ 2. **Generates a hypothesis** using an LLM (Qwen via HuggingFace Inference API)
+ 3. **Compiles to a BRAIN expression** (Jinja templates for proven archetypes, LLM fallback for novel ones)
+ 4. **Lints the expression** (validates 71 operators, checks arity, parentheses, look-ahead, coverage)
+ 5. **Stores the result in DuckDB** for review

+ That's it. Steps 1-5 work. Everything below is scaffolding for future development.
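The deterministic gap analysis in step 1 can be sketched roughly as follows. The theme names, counts, and function shape below are illustrative assumptions, not the repo's actual API; the real logic lives in `deterministic/theme_sampler.py` and reads history from the DuckDB factor store:

```python
from collections import Counter

# Hypothetical theme names; the real sampler covers 12 domains.
THEMES = ["analyst_revisions", "news_sentiment", "options_flow", "volatility"]

def pick_theme(history: list[str]) -> str:
    """Deterministic gap analysis: pick the least-explored theme.

    `history` is the list of themes already used for past alphas.
    Themes never tried count as 0, so they are sampled first.
    """
    counts = Counter(history)
    return min(THEMES, key=lambda t: counts[t])

print(pick_theme(["news_sentiment", "news_sentiment", "volatility"]))
# → analyst_revisions (never explored, so it wins)
```

Because the choice is a pure function of the stored history, re-running the sampler on the same store yields the same theme, which keeps the pipeline reproducible.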
 

+ ## What Does NOT Work Yet

+ - ❌ BRAIN API submission (no client connected)
+ - ❌ Crowd Scout / Performance Surgeon / Gatekeeper personas (imported but never called)
+ - ❌ RAG over arXiv papers (stub only)
+ - ❌ Local BRAIN simulator (exists but not wired into pipeline)
+ - ❌ Feedback loop / evolutionary improvement
+ - ❌ Automatic iteration on near-misses

+ ## Quickstart

+ ```bash
+ git clone https://huggingface.co/gaurv007/alpha-factory
+ cd alpha-factory
+ uv sync
+ ```

+ Create `.env`:
+ ```
+ HF_TOKEN=hf_your_token_here
  ```

+ ### Option 1: Proven Templates (RECOMMENDED — no LLM, guaranteed valid)

+ ```bash
+ uv run python alpha_factory/generate_proven.py
  ```

+ This uses your proven Alpha 15/6 structures with novel AC=0 fields. **Every expression is syntactically valid and ready to paste into BRAIN.**
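The field-swap idea behind the proven templates can be illustrated with a minimal sketch. The archetype string, operator names, and field names below are placeholders (not the repo's actual Alpha 15/6 structures, which live in `deterministic/proven_templates.py`); the repo uses Jinja, while this sketch uses stdlib `string.Template` to stay dependency-free:

```python
from string import Template

# Placeholder archetype in the spirit of a proven structure; not
# guaranteed BRAIN syntax, purely for illustration.
ARCHETYPE = Template("group_rank(ts_zscore(${field}, ${window}), ${group})")

# Hypothetical novel AC=0 fields to swap into the fixed structure.
NOVEL_FIELDS = ["mdl77_alpha_score", "mdl77_risk_score"]

def generate_candidates(window: int = 22, group: str = "sector") -> list[str]:
    """One candidate expression per field swap; structure never changes."""
    return [
        ARCHETYPE.substitute(field=f, window=window, group=group)
        for f in NOVEL_FIELDS
    ]

for expr in generate_candidates():
    print(expr)
```

Because the structure is fixed and only validated field names are substituted, every output is syntactically valid by construction, which is the property the option above relies on.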

+ ### Option 2: LLM-Assisted Generation

+ ```bash
+ uv run python -m alpha_factory.run --dry-run --batch-size 5
+ ```

+ Uses HuggingFace Inference API (Qwen 7B) to generate novel hypotheses. Quality varies — always lint-check before submitting to BRAIN.
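A minimal illustration of the kind of lint check being recommended here: a parenthesis-balance pass plus an operator whitelist. The operator names are assumptions for the sketch; the repo's linter (`deterministic/lint.py`) covers all 71 BRAIN operators plus arity, look-ahead, and coverage checks:

```python
import re

# Tiny whitelist for illustration; the real linter knows all 71 operators.
KNOWN_OPERATORS = {"rank", "ts_delta", "ts_mean"}

def lint(expr: str) -> list[str]:
    """Return a list of problems; an empty list means the expression passes."""
    errors = []
    # 1. Parentheses must balance and never close before opening.
    depth = 0
    for ch in expr:
        depth += (ch == "(") - (ch == ")")
        if depth < 0:  # a ")" with no matching "("
            errors.append("unbalanced parentheses")
            break
    if depth > 0:      # a "(" that is never closed
        errors.append("unbalanced parentheses")
    # 2. Any identifier followed by "(" is treated as an operator call.
    for op in re.findall(r"([A-Za-z_]\w*)\s*\(", expr):
        if op not in KNOWN_OPERATORS:
            errors.append(f"unknown operator: {op}")
    return errors

print(lint("rank(ts_delta(close, 5))"))  # → []
print(lint("foo(close)"))                # → ['unknown operator: foo']
```

Running LLM output through a check like this before pasting anything into BRAIN is what keeps "quality varies" from costing simulation credits.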
 
 
 

+ ### Option 3: Gradio UI

+ ```bash
+ uv run python -m alpha_factory.ui
+ ```

+ View generated alphas with timestamps, copy expressions, and generate new batches from the browser.

+ ## Architecture

  ```
  alpha_factory/
+ ├── data/                     # BRAIN field registry (3,447 candidates), operators, groups
+ ├── brain_fields.py           # 35 highest-EV fields with AC, coverage, sign metadata
+ ├── brain_groups.py           # 15 novel neutralization keys (AC 3-20)
+ ├── __init__.py
+ ├── deterministic/            # No LLM required
+ │   ├── lint.py               # 71-operator validation + arity checks
+ │   ├── theme_sampler.py      # Gap analysis across 12 themes
+ │   ├── proven_templates.py   # Alpha 15/6 structure with field swaps
+ │   ├── expression_mutator.py # 5 mutation operators for iteration
+ │   └── fitness.py            # Composite fitness scoring
+ ├── personas/                 # LLM-powered agents
+ │   ├── hypothesis_hunter.py    # Generates blueprints (ACTIVE)
+ │   ├── expression_compiler.py  # Blueprint → BRAIN expression (ACTIVE)
+ │   ├── crowd_scout.py          # Novelty check (NOT WIRED)
+ │   ├── performance_surgeon.py  # Failure diagnosis (NOT WIRED)
+ │   └── gatekeeper.py           # Final go/no-go (NOT WIRED)
+ ├── infra/                    # Infrastructure
+ │   ├── llm_client.py         # Unified Ollama/HF client
+ │   ├── factor_store.py       # DuckDB persistence
+ │   ├── model_manager.py      # Auto-discovers available models
+ │   ├── winner_memory.py      # Feedback loop storage (NOT WIRED)
+ │   └── wq_client.py          # BRAIN API wrapper (NOT CONNECTED)
+ ├── orchestration/
+ │   └── pipeline.py           # Main pipeline (steps 1-5 only)
+ ├── run.py                    # CLI entry point
+ ├── ui.py                     # Gradio dashboard
+ └── generate_proven.py        # Standalone proven template generator
  ```

+ ## Field Strategy

+ The pipeline prioritizes fields by expected value:

+ | Tier | Dataset | Density (α/field) | Strategy |
+ |------|---------|-------------------|----------|
+ | 1 | model77 | 24 | Primary target — 5 fields with AC=0 globally |
+ | 2 | model16, news12 | 192-385 | Secondary — score derivatives |
+ | 3 | analyst4, option9, pv13 | 656-822 | Tertiary — supply chain, PCR |
+ | 4 | pv1, socialmedia | 2500-64350 | Avoid — over-mined |
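The tier ordering above amounts to a density sort with an over-mining cutoff. A sketch using a subset of the table's own datasets and density figures (the cutoff value and function shape are illustrative assumptions):

```python
# Dataset names and densities (alphas per field) taken from the table;
# lower density means more unexplored headroom, hence higher priority.
DATASETS = {
    "model77": 24,
    "model16": 192,
    "news12": 385,
    "analyst4": 656,
    "pv1": 2500,
    "socialmedia": 64350,
}

def prioritize(max_density: int = 1000) -> list[str]:
    """Rank datasets by ascending density, dropping over-mined ones."""
    eligible = [d for d, n in DATASETS.items() if n <= max_density]
    return sorted(eligible, key=DATASETS.__getitem__)

print(prioritize())
# → ['model77', 'model16', 'news12', 'analyst4']
```

With the default cutoff, tier-4 datasets are excluded entirely rather than merely deprioritized, matching the "Avoid" strategy in the table.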
 

+ ## BRAIN Submission Settings

+ When pasting expressions into BRAIN manually:
+ - Region: USA
+ - Universe: TOP3000
+ - Delay: 1
+ - Decay: 5
+ - Truncation: 0.08
+ - Pasteurization: ON
+ - NaN Handling: OFF

+ ## Requirements

+ - Python 3.11+
+ - HuggingFace token (free tier works for Qwen 7B)
+ - Optional: Ollama for local inference

+ ## License

+ MIT