# Sky v2.0-Lite — Lightweight CREST for Everyone

**Built by 0labs — Atharvsinh Jadav, Gujarat, India**

*Adaptive-depth AI that runs on any laptop.*
## What is This Model?
Sky v2.0-Lite is the smallest and fastest member of the Sky v2.0 family. It combines K=2 step reduction (keeping only CREST Steps 1 and 2) with optional 4-bit quantization at load time, making it possible to run a CREST adaptive-depth model on hardware as modest as an 8GB laptop GPU or a free-tier Google Colab T4.
Despite being the lightest variant, it retains adaptive depth: the model decides, per token, how much computation to spend. It is not a standard transformer; it is a full CREST model.
## Who is This For?
- 🎓 Students exploring adaptive-depth LLMs on a budget
- 💻 Developers building AI-powered applications on consumer hardware
- 📱 Edge/on-device deployment where every GB matters
- 🧪 Researchers who want to experiment with CREST without cloud GPUs
## Model Details
| Property | Value |
|---|---|
| Architecture | CREST (0labs proprietary) |
| Total Parameters | 6.47B |
| CREST Steps (K) | 2 |
| Hidden Dimension | 2,560 |
| Layers | 32 |
| Attention Heads | 32 |
| Context Window | 32,768 tokens |
| Disk Size | ~13GB (BFloat16) |
| VRAM (BF16) | ~13GB |
| VRAM (INT4) | ~4GB |
| Vocabulary Size | 248,320 |
| License | Apache 2.0 |
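The VRAM figures above follow directly from the parameter count and bytes per weight. A rough back-of-envelope check (weights only; real usage adds activations, KV cache, and quantization metadata, which is why the table's numbers are a bit higher):

```python
params = 6.47e9

bf16_gib = params * 2 / 1024**3    # 2 bytes/weight  -> ~12.1 GiB
int4_gib = params * 0.5 / 1024**3  # 0.5 bytes/weight -> ~3.0 GiB

print(f"BF16 weights ≈ {bf16_gib:.1f} GiB, INT4 weights ≈ {int4_gib:.1f} GiB")
```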
## Quick Start
### Recommended: 4-Bit Loading (~4GB VRAM) ⭐
This is the recommended way to run Sky v2.0-Lite on consumer hardware:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization config — uses only ~4GB VRAM
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "0labs-in/Sky-v2.0-Lite",
    trust_remote_code=True,  # Required — CREST is a custom architecture
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("0labs-in/Sky-v2.0-Lite")

# Chat
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
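To see tokens as they are produced instead of waiting for the full completion, transformers' `TextStreamer` can be passed to `generate` (reusing the `model`, `tokenizer`, and `inputs` from above):

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        streamer=streamer,
    )
```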
### Full-Precision Loading (~13GB VRAM)
```python
model = AutoModelForCausalLM.from_pretrained(
    "0labs-in/Sky-v2.0-Lite",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
> **Note:** `trust_remote_code=True` is required — CREST uses custom architecture code included in this repository.
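This card does not state whether the tokenizer ships a chat template, so treat the following as an assumption: if one is present, multi-turn conversations can be formatted with `apply_chat_template` instead of passing raw text.

```python
# Assumes the tokenizer defines a chat template; check before relying on it.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]

if tokenizer.chat_template is not None:
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(
        input_ids, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```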
## Hardware Compatibility
| Device | Loading Mode | VRAM Used | Works? |
|---|---|---|---|
| Laptop RTX 4060 (8GB) | INT4 | ~4GB | ✅ |
| Google Colab T4 (15GB) | INT4 | ~4GB | ✅ |
| Apple M1/M2 (8GB) | INT4 | ~4GB | ⚠️ (bitsandbytes 4-bit requires a CUDA GPU; not supported on Apple Silicon) |
| MacBook Pro M3 (18GB) | BF16 | ~13GB | ✅ |
| RTX 3090 (24GB) | BF16 | ~13GB | ✅ |
| RTX 4090 (24GB) | BF16 | ~13GB | ✅ |
| A100 (40/80GB) | BF16 | ~13GB | ✅ |
| MI300X (192GB) | BF16 | ~13GB | ✅ |
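A quick way to pick a loading mode at runtime is to inspect the available GPU memory. A minimal sketch (CUDA only; the 16 GiB cutoff is a judgment call based on the ~13GB BF16 footprint plus headroom):

```python
import torch

if torch.cuda.is_available():
    total_gib = torch.cuda.get_device_properties(0).total_memory / 1024**3
    use_4bit = total_gib < 16  # BF16 needs ~13GB plus activation headroom
    print(f"GPU: {total_gib:.0f} GiB total -> load in {'INT4' if use_4bit else 'BF16'}")
```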
## Google Colab (Free Tier)
```python
# Run this in the first cell of a Colab notebook
!pip install -q transformers accelerate bitsandbytes safetensors sentencepiece

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "0labs-in/Sky-v2.0-Lite",
    trust_remote_code=True,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4",
    ),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("0labs-in/Sky-v2.0-Lite")

# You're ready to go!
```
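Then, in a second cell, a quick smoke test to confirm the model loaded correctly:

```python
inputs = tokenizer("Hello from Colab!", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```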
## Architecture: CREST K=2
```
Input from Attention
        ↓
  ┌──────────┐
  │   MLP₁   │ ← Step 1: Core knowledge (always runs)
  └──────────┘
        ↓
  Halting Gate: "Is this token done?"
        │
        ├── h₁ ≈ 1.0 → HALT (easy tokens: "the", "is", "and")
        │              output = state₁
        │
        ↓ h₁ < threshold
  ┌──────────┐
  │   MLP₂   │ ← Step 2: Deeper reasoning (only when needed)
  └──────────┘
        ↓
  Output = p₁ · state₁ + p₂ · state₂
```
**What this means in practice:**
- Simple/common tokens → processed by just MLP₁ → fast
- Complex/rare tokens → processed by MLP₁ then MLP₂ → more accurate
- The model automatically decides which path to take for each token (see the sketch below)
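The following is a minimal, illustrative sketch of the K=2 halting logic described above, with `p₁ = h₁` and `p₂ = 1 − h₁` as one consistent reading of the diagram's mixture formula. All names (`CRESTBlockSketch`, the linear halting gate, the residual MLPs) are hypothetical; the actual CREST implementation lives in the repository's custom modeling code and may differ.

```python
import torch
import torch.nn as nn

class CRESTBlockSketch(nn.Module):
    """Illustrative K=2 adaptive-depth block — NOT the real CREST implementation."""

    def __init__(self, d_model: int, threshold: float = 0.5):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model))
        self.mlp2 = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                  nn.Linear(4 * d_model, d_model))
        self.halt = nn.Linear(d_model, 1)  # hypothetical per-token halting gate
        self.threshold = threshold

    def forward(self, x):                        # x: (batch, seq, d_model)
        state1 = x + self.mlp1(x)                # Step 1: always runs
        h1 = torch.sigmoid(self.halt(state1))    # per-token halting probability
        state2 = state1 + self.mlp2(state1)      # Step 2: deeper reasoning
        # Mix step outputs; tokens with h1 near 1.0 are dominated by state1.
        # At inference, tokens with h1 >= threshold could skip mlp2 entirely.
        return h1 * state1 + (1.0 - h1) * state2
```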
## Comparison: Lite vs Other Variants
| Feature | Sky v2.0-11B | Sky v2.0-6.5B | Sky v2.0-11B-INT4 | Sky v2.0-Lite |
|---|---|---|---|---|
| Parameters | 11.2B | 6.47B | 11.2B | 6.47B |
| CREST Steps | 4 | 2 | 4 | 2 |
| Min VRAM (INT4) | ~6GB | ~4GB | ~6GB | ~4GB |
| Adaptive Depth | ✅ | ✅ | ✅ | ✅ |
| Best For | Max quality | Balance | Full CREST | Everyone |
## Model Family
| Model | Parameters | CREST K | Best For |
|---|---|---|---|
| Sky v2.0-11B | 11.2B | 4 | Maximum capability |
| Sky v2.0-6.5B | 6.47B | 2 | Balanced performance |
| Sky v2.0-11B-INT4 | 11.2B | 4 | Full CREST, memory-efficient |
| Sky v2.0-Lite | 6.47B | 2 | Laptops, Colab, edge devices |
## Citation
```bibtex
@article{jadav2026crest,
  title={CREST: Cognitively Recurrent Estimation of Step Termination for Adaptive-Depth Language Modeling},
  author={Jadav, Atharvsinh},
  year={2026},
  url={https://huggingface.co/0labs-in/Sky-v2.0-11B}
}
```
## About 0labs
0labs is an independent AI research lab founded by Atharvsinh Jadav in Gujarat, India. We build adaptive-depth LLMs that think harder on hard problems — trained on a single GPU.
- 🌐 Website: 0labs.in
- 🤗 HuggingFace: huggingface.co/0labs-in
*AI should be accessible to everyone. Sky v2.0-Lite makes adaptive-depth intelligence available on any hardware.*