ASTERIZER/LUNA-Training

ASTERIZER committed 18 days ago

Commit 4b4cd1e

1 Parent(s): 95e6f4e

Upload README.md with huggingface_hub

README.md +311 -0

	@@ -0,0 +1,311 @@

+# LUNA - 100M Parameter LLM from Scratch
+Custom ~100M parameter GPT model (Pythia-like architecture) pretrained on 4.5B tokens of clean English text.
+## Quick Start (RunPod / Cloud GPU)
+### 1. Clone & Install (one command)
+```bash
+git clone https://huggingface.co/spaces/ASTERIZER/LUNA /workspace/LUNA && \
+cd /workspace/LUNA && \
+pip install -q -r requirements.txt
+```
+### 2. Get Dataset + Train (one command)
+The dataset (~4.5B tokens) is hosted as a zip at [ASTERIZER/Luna_Dataset](https://huggingface.co/datasets/ASTERIZER/Luna_Dataset). The script downloads, extracts, and starts training automatically.
+**From HuggingFace (recommended):**
+```bash
+bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset
+```
+**From Google Drive:**
+```bash
+bash setup_and_train.sh gdrive YOUR_GDRIVE_FOLDER_ID
+```
+**Smoke test (10M tokens only):**
+```bash
+bash setup_and_train.sh huggingface ASTERIZER/Luna_Dataset 10000000
+```
+That's it. The script auto-detects your GPU, VRAM, RAM, CPU cores and configures everything for maximum utilization.
+---
+## How It Works
+### Auto vs Manual Config
+All hyperparameters live in `train_config.yaml`:
+```yaml
+auto_config: true     # auto-detect everything from hardware (default)
+# auto_config: false  # alternative: use exact values below, no overrides
+```
+When `auto_config: true` (default), the trainer:
+- **Probes VRAM** via binary search to find max micro_batch_size (82% safety)
+- **Sets grad_accum** to hit the target global_batch_size
+- **Picks precision** (bf16 on Ampere+, fp16 otherwise)
+- **Scales workers** to half your CPU cores, capped by RAM
+- **Enables torch.compile** if Triton is available (Linux)
+When `auto_config: false`, every value in the YAML is used exactly as-is.
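
The VRAM probe and the gradient-accumulation step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `train.py`: `fits_in_vram` is a hypothetical callback standing in for a real trial forward/backward pass, and the upper bound of 64 is an arbitrary assumption.

```python
from typing import Callable


def max_micro_batch(fits_in_vram: Callable[[int], bool], upper: int = 64) -> int:
    """Binary-search the largest batch size in [1, upper] that fits.

    Assumes memory use grows monotonically with batch size, so once a
    size fails every larger size fails too. Returns 0 if nothing fits.
    """
    lo, hi, best = 1, upper, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        if fits_in_vram(mid):
            best = mid      # mid fits; try something larger
            lo = mid + 1
        else:
            hi = mid - 1    # mid is too large; shrink the range
    return best


def grad_accum_steps(global_batch: int, micro_batch: int) -> int:
    """Accumulation steps needed to reach at least the target global batch."""
    return -(-global_batch // micro_batch)  # ceiling division


# Toy stand-in for a real probe: pretend anything above 12 runs out of memory.
micro = max_micro_batch(lambda b: b <= 12)
accum = grad_accum_steps(120, micro)
print(micro, accum)  # 12 10
```

With the toy probe, the result reproduces the `micro_batch=12 × grad_accum=10` split listed in the verified configs.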
+### CLI Overrides
+Any config value can be overridden from the command line:
+```bash
+python train.py --config train_config.yaml --data_path /data/litdata --max_tokens 100000000
+```
+Priority: CLI args > train_config.yaml > auto-detection
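
One simple way to realise this priority order is a layered dictionary merge in which later layers win. The sketch below is an assumption about the general approach, not the logic in `train.py`; `resolve_config` and the example keys are hypothetical.

```python
def resolve_config(auto: dict, yaml_cfg: dict, cli: dict) -> dict:
    """Merge three config layers; later layers override earlier ones.

    CLI values of None are treated as "flag not given" and skipped.
    """
    merged = {**auto, **yaml_cfg}
    merged.update({k: v for k, v in cli.items() if v is not None})
    return merged


auto = {"micro_batch_size": 12, "precision": "bf16"}   # detected from hardware
yaml_cfg = {"max_tokens": 4_515_286_950, "lr": 6e-4}   # read from train_config.yaml
cli = {"max_tokens": 100_000_000, "data_path": "/data/litdata", "lr": None}

cfg = resolve_config(auto, yaml_cfg, cli)
print(cfg["max_tokens"], cfg["lr"], cfg["micro_batch_size"])
```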
+---
+## Dataset
+- **4,515,286,950 tokens** (4.5B) in 270 binary chunks
+- Sources: Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned)
+- Format: LitData binary (int32, block_size=1025, TokensLoader)
+- Tokenizer: EleutherAI/pythia-160m (50,254 vocab)
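
The block size of 1025 is one more than the 1,024-token context length. A common reason for this layout (assumed here, not confirmed by the source) is that each block yields an input sequence and a target sequence shifted by one token. The sketch uses placeholder token ids; `split_block` is a hypothetical helper.

```python
BLOCK_SIZE = 1025               # tokens stored per block (from the dataset format)
TOTAL_TOKENS = 4_515_286_950    # total tokens (from the dataset description)


def split_block(block: list[int]) -> tuple[list[int], list[int]]:
    """Return (inputs, targets), where targets are the inputs shifted by one."""
    return block[:-1], block[1:]


block = list(range(BLOCK_SIZE))          # placeholder token ids 0..1024
inputs, targets = split_block(block)
print(len(inputs), len(targets))         # 1024 1024
print(targets[0] == block[1])            # True

# The stated token count divides evenly into 1025-token blocks.
print(divmod(TOTAL_TOKENS, BLOCK_SIZE))  # (4405158, 0)
```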
+## Model Architecture
+| Parameter | Value |
+|-----------|-------|
+| Layers | 10 |
+| Hidden dim | 768 |
+| Attention heads | 12 |
+| Vocab size | 50,304 (padded) |
+| Context length | 1,024 |
+| Total params | ~109.5M (~70.9M in the transformer blocks plus ~38.6M in the tied embedding) |
+| Rotary % | 25% |
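
The parameter total of 109,513,728 reported for this architecture can be reproduced from the table with a short calculation. The layer layout below is an assumption (GPT-NeoX-style blocks with biased linear layers, LayerNorm with weight and bias, a 4x MLP, and a single embedding matrix shared with the output head); it is offered because it matches the reported figure, not because the README spells it out.

```python
def padded_vocab(vocab: int, multiple: int = 128) -> int:
    """Round the vocabulary size up to the next multiple."""
    return -(-vocab // multiple) * multiple


def count_params(n_layer: int, d: int, vocab: int) -> dict:
    """Parameter count for a GPT-NeoX-style decoder with tied embeddings."""
    attn = (d * 3 * d + 3 * d) + (d * d + d)      # fused QKV + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # up and down projections
    norms = 2 * (2 * d)                           # two LayerNorms per block
    blocks = n_layer * (attn + mlp + norms)
    embedding = vocab * d                         # counted once because it is tied
    final_norm = 2 * d
    return {"blocks": blocks, "embedding": embedding,
            "total": blocks + embedding + final_norm}


vocab = padded_vocab(50_254)
counts = count_params(n_layer=10, d=768, vocab=vocab)
print(vocab)    # 50304
print(counts)   # {'blocks': 70878720, 'embedding': 38633472, 'total': 109513728}
```

Rotary position embeddings add no learned parameters, so the rotary percentage does not affect the count.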
+## File Structure
+```
+LUNA/
+  train.py              # Main training script (config-driven, auto-detects hardware)
+  train_config.yaml     # All hyperparameters (auto_config: true/false)
+  fetch_data.py         # Downloads dataset from HuggingFace / GDrive
+  setup_and_train.sh    # One-command cloud entrypoint
+  benchmark_runpod.py   # Local benchmark + RunPod cost calculator
+  requirements.txt      # Python dependencies
+  Base/
+    checkpoints/EleutherAI/pythia-160m/   # Tokenizer files
+    configs/             # Legacy litgpt YAML configs (reference only)
+    scripts/             # Data preprocessing scripts
+```
+## Estimated Training Times (RunPod)
+| GPU | $/hr | tok/s | Hours | Cost USD | Cost INR |
+|-----|------|-------|-------|----------|----------|
+| RTX A5000 | $0.16 | ~6,400 | ~196h | ~$31 | ~2,700 |
+| RTX 3090 | $0.22 | ~7,600 | ~165h | ~$36 | ~3,100 |
+| RTX 4090 | $0.34 | ~10,000 | ~125h | ~$42 | ~3,600 |
+| RTX 5090 | $0.69 | ~16,000 | ~78h | ~$54 | ~4,600 |
+| H100 NVL | $2.59 | ~43,000 | ~29h | ~$75 | ~6,400 |
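
The hours and cost columns follow from the dataset size, the throughput, and the hourly rate. A small sketch of that arithmetic, assuming a single pass over the 4,515,286,950-token dataset and constant throughput (`estimate_run` is a hypothetical helper, not part of `benchmark_runpod.py`):

```python
TOTAL_TOKENS = 4_515_286_950


def estimate_run(tokens_per_sec: float, usd_per_hour: float,
                 total_tokens: int = TOTAL_TOKENS) -> tuple[float, float]:
    """Return (hours, cost in USD) for one pass over the dataset."""
    hours = total_tokens / tokens_per_sec / 3600
    return hours, hours * usd_per_hour


for gpu, tok_s, rate in [("RTX A5000", 6_400, 0.16),
                         ("RTX 4090", 10_000, 0.34),
                         ("H100 NVL", 43_000, 2.59)]:
    hours, cost = estimate_run(tok_s, rate)
    print(f"{gpu}: ~{hours:.0f}h, ~${cost:.2f}")
```

The results agree with the table to within rounding; real runs will differ with checkpointing overhead, evaluation pauses, and throughput changes.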
+## Resume Training
+Training auto-saves `latest.pt` every `save_interval` steps. If training is interrupted, re-run the same command and it resumes from that checkpoint.
+---
+## Verified Configs (What Worked)
+These are the exact configurations that produced the current LUNA 100M model.
+Do NOT change them unless you know what you're doing — they are proven and validated.
+---
+### 1. Pretraining — 4.5 Billion Tokens
+The pretraining ran in two phases on an RTX 4060 Ti 16GB.
+**Phase 1: Bulk pretraining on 3B general web tokens**
+| Parameter | Value |
+|-----------|-------|
+| Dataset | `litdata_3b` — deduplicated, quality-filtered (score ≥ 0.96) general web |
+| Total tokens | 3,000,000,000 (3B) |
+| Precision | bf16-mixed |
+| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
+| Sequence length | 1024 |
+| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
+| LR schedule | Cosine decay with 500-step warmup |
+| Gradient clip | max_norm=1.0 |
+| Checkpoints | Every 1000 steps |
+| Seed | 1337 |
+| Tokenizer | EleutherAI/pythia-160m (vocab 50,254) |
+**Phase 2: Continued pretraining on clean English (Wikipedia + FineWeb-Edu)**
+| Parameter | Value |
+|-----------|-------|
+| Dataset | `litdata_english` — ultra-clean Wikipedia + FineWeb-Edu |
+| Total tokens | 150,000,000 (150M) — ~3 epochs over ~50M unique tokens |
+| Init weights | Phase 1 checkpoint (`custom-100m-3b-full/final_raw`) |
+| Precision | bf16-mixed |
+| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
+| Sequence length | 1024 |
+| Optimizer | AdamW (lr=1e-4, min_lr=1e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
+| LR schedule | Cosine decay with 200-step warmup |
+| Gradient clip | max_norm=1.0 |
+| Checkpoints | Every 500 steps |
+**Final combined dataset used for the production run:**
+| Parameter | Value |
+|-----------|-------|
+| Dataset | `litdata_pretrain_final` — all sources merged |
+| Total tokens | 4,515,286,950 (~4.5B) in 270 chunks |
+| Sources | Wikipedia, FineWeb-Edu, OpenWebText (deduplicated, cleaned pure English) |
+| Format | LitData binary (int32, block_size=1025, EOS=0) |
+| Config file | `train_config.yaml` |
+| Precision | bf16 |
+| Global batch size | 120 (micro_batch=12 × grad_accum=10) |
+| Sequence length | 1024 |
+| Optimizer | AdamW (lr=6e-4, min_lr=6e-5, weight_decay=0.1, betas=[0.9, 0.95]) |
+| LR schedule | Cosine with 500-step warmup (5% of total steps when auto) |
+| Gradient clip | max_norm=1.0 |
+| torch.compile | true (Linux/cloud with Triton) |
+| auto_config | true (probes VRAM, CPU, RAM at runtime) |
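
For reference, a warmup-plus-cosine schedule with the values in this table can be sketched as below. The exact formula used by `train.py` is not shown in this README, so the linear warmup shape and the step count (which assumes each step consumes global batch × 1,024 tokens) are assumptions.

```python
import math


def lr_at(step: int, total_steps: int, max_lr: float = 6e-4,
          min_lr: float = 6e-5, warmup: int = 500) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(progress, 1.0)             # hold at min_lr after the last step
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


tokens_per_step = 120 * 1024                  # global batch x sequence length
total_steps = 4_515_286_950 // tokens_per_step
print(tokens_per_step, total_steps)           # 122880 36745
for s in (0, 499, 500, total_steps // 2, total_steps):
    print(s, f"{lr_at(s, total_steps):.2e}")
```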
+---
+### 2. SFT Fine-Tuning — ~145 Million Tokens
+Supervised fine-tuning on the pretrained LUNA 100M checkpoint.
+| Parameter | Value |
+|-----------|-------|
+| Dataset | `Base/Datasets/sft_clean/` — 574,996 train + 5,808 val samples |
+| Format | Alpaca JSON (instruction / input / output) |
+| Estimated tokens | ~145M per epoch (574,996 samples × ~250 tokens avg), ~290M over 2 epochs |
+| Epochs | 2 |
+| Config file | `sft_config.yaml` |
+**Model (frozen architecture — matches pretrain exactly):**
+| Parameter | Value |
+|-----------|-------|
+| vocab_size | 50,304 (padded to 128 multiple) |
+| seq_len | 1024 |
+| n_layer | 10 |
+| n_embd | 768 |
+| n_head | 12 |
+| Rotary % | 25% |
+| Total params | 109,513,728 |
+**Training hyperparameters:**
+| Parameter | Value |
+|-----------|-------|
+| Optimizer | AdamW (lr=1.5e-5, min_lr=1e-6, weight_decay=0.01, betas=[0.9, 0.95]) |
+| Precision | bf16 |
+| Global batch size | 64 (micro_batch=8 × grad_accum=8) |
+| LR warmup | 200 steps |
+| Gradient clip | max_norm=1.0 |
+| Save interval | Every 500 steps |
+| Eval interval | Every 500 steps (runs val loss + eval prompts) |
+| DataLoader | 4 workers, pin_memory=true |
+| torch.compile | false |
+**Prompt format (used during training — must be matched at inference):**
+```
+### Instruction:
+{instruction}
+
+### Response:
+```
+With optional input field:
+```
+### Instruction:
+{instruction}
+
+### Input:
+{input}
+
+### Response:
+```
+**Loss masking:** Only the response tokens (after `### Response:\n`) contribute to the loss.
+The prompt tokens are masked out (loss_mask=0). EOS token (id=0) is appended to every response.
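
The masking rule above can be illustrated with plain lists. This is a sketch under assumptions, not the code in `sft_train.py`: the token ids are made up, `build_example` and `masked_mean` are hypothetical helpers, and the one-position shift between inputs and targets used in next-token training is left out for brevity.

```python
EOS_ID = 0  # EOS id stated in this README


def build_example(prompt_ids: list[int], response_ids: list[int]):
    """Concatenate prompt + response + EOS and mask out the prompt.

    Returns (input_ids, loss_mask); loss_mask is 1 where the token
    counts toward the loss and 0 elsewhere.
    """
    response = response_ids + [EOS_ID]
    input_ids = prompt_ids + response
    loss_mask = [0] * len(prompt_ids) + [1] * len(response)
    return input_ids, loss_mask


def masked_mean(per_token_loss: list[float], loss_mask: list[int]) -> float:
    """Average the loss over unmasked positions only."""
    kept = sum(loss_mask)
    return sum(l * m for l, m in zip(per_token_loss, loss_mask)) / max(kept, 1)


# Made-up token ids, for illustration only.
ids, mask = build_example(prompt_ids=[11, 12, 13], response_ids=[21, 22])
print(ids)   # [11, 12, 13, 21, 22, 0]
print(mask)  # [0, 0, 0, 1, 1, 1]
print(masked_mean([5.0, 5.0, 5.0, 1.0, 2.0, 3.0], mask))  # 2.0
```

Because the prompt positions carry zero weight, the high losses on the first three tokens do not affect the average.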
+---
+### 3. SFT Inference / Chat — Loaded Configs
+These are the exact generation parameters loaded when running `chat.py` or `validate_sft.py`.
+They match the training eval config from `sft_train.py`.
+```bash
+python chat.py --ckpt "Base/out/sft/model.pth"
+```
+**Model loading:**
+| Parameter | Value |
+|-----------|-------|
+| Checkpoint | `Base/out/sft/model.pth` (419 MB, raw state_dict, 154 keys) |
+| Checkpoint format | Raw `state_dict` — NOT wrapped in `{"model": ...}` dict |
+| Tokenizer | `Base/checkpoints/EleutherAI/pythia-160m` (vocab 50,254) |
+| EOS token ID | 0 (pythia tokenizer — NOT 50276) |
+| Device | auto (CUDA if available, else CPU) |
+| Precision | float32 at inference (weights loaded as-is from bf16-trained ckpt) |
+**Generation parameters:**
+| Parameter | Value | Why |
+|-----------|-------|-----|
+| temperature | 0.7 | Balanced creativity vs coherence |
+| top_k | 40 | Matches training eval (NOT 50) |
+| top_p | 0.9 | Nucleus sampling cutoff |
+| repetition_penalty | 1.0 | No penalty — matches training (NOT 1.1) |
+| max_new_tokens | 150 | Matches training eval (NOT 256) |
+**Prompt template (must match training exactly):**
+```python
+def format_prompt(instruction, context=""):
+    """Build the SFT prompt; a non-empty `context` adds the optional Input section."""
+    if instruction and context:
+        return f"### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n### Response:\n"
+    else:
+        return f"### Instruction:\n{instruction}\n\n### Response:\n"
+```
+**Critical notes:**
+- There is NO Alpaca preamble text (e.g., "Below is an instruction...") — the model was never trained with one
+- EOS token is id=0 (pythia), not 50276 (GPT-NeoX) — using the wrong EOS causes the model to never stop
+- Generation stops when EOS is produced OR max_new_tokens is reached
+- For longer responses in chat, you can override: `--max_new 512`
+- For less repetition in production, add: `--rep_pen 1.05`
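
To make the generation parameters and the stopping rule concrete, here is a dependency-free sketch of temperature, top-k, and top-p sampling with an EOS check. It is not the implementation in `chat.py`: the order of the filters is an assumption, `next_logits` is a hypothetical callback standing in for a model forward pass, and the repetition penalty is omitted because the default of 1.0 leaves logits unchanged.

```python
import math
import random
from typing import Optional


def sample_token(logits: list[float], temperature: float = 0.7, top_k: int = 40,
                 top_p: float = 0.9, rng: Optional[random.Random] = None) -> int:
    """Sample one token id using temperature, then top-k, then top-p."""
    rng = rng or random.Random()
    scaled = [x / temperature for x in logits]

    # Keep the top_k highest-scoring token ids, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Nucleus cut: keep the smallest prefix whose probability mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in zip(order, probs):
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break

    ids, kept_probs = zip(*kept)
    return rng.choices(ids, weights=kept_probs, k=1)[0]


def generate(next_logits, prompt_ids: list[int], max_new_tokens: int = 150,
             eos_id: int = 0) -> list[int]:
    """Append sampled tokens until EOS appears or the token budget runs out."""
    out = list(prompt_ids)
    for _ in range(max_new_tokens):
        token = sample_token(next_logits(out))
        if token == eos_id:
            break
        out.append(token)
    return out
```

If `eos_id` were set to an id the model never emits, the loop would only end at `max_new_tokens`, which is the failure mode the notes above warn about.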
+**Validation results with these configs (100 complex examples):**
+| Metric | Value |
+|--------|-------|
+| Overall Grade | A |
+| Avg Loss (CE) | 1.9167 |
+| Avg Perplexity | 7.45 |
+| Token Accuracy | 58.6% |
+| BLEU-1 | 0.589 |
+| BLEU-2 | 0.219 |
+| Empty responses | 0/100 |
+| Repetitive responses | 5/100 |
+---
+## License
+Private / ASTERIZER 2026