# Sky v2.0-11B-INT4 — Full CREST, Memory-Efficient

**Built by 0labs — Atharvsinh Jadav, Gujarat, India**

*Full adaptive-depth architecture. Fits on consumer GPUs.*
## What is This Model?
Sky v2.0-11B-INT4 is the complete 11.2-billion-parameter CREST model with all K=4 adaptive-depth steps preserved. The weights are stored in BFloat16 on disk and can be loaded with 4-bit quantization at inference time, reducing VRAM usage from ~22GB to **~6GB**.
This gives you the full power of CREST — all 4 computational steps, all halting gates, complete adaptive-depth behavior — on hardware as modest as an NVIDIA RTX 3060.
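These figures are easy to sanity-check from the parameter count. The quick estimate below counts weights only; activations, KV cache, and framework overhead add a bit more, which is why the rounded figures above are slightly higher:

```python
params = 11.2e9  # total parameter count

# Weights-only memory estimates; real usage adds activations, KV cache, and overhead.
bf16_gib = params * 2 / 1024**3    # 2 bytes/param   -> ~20.9 GiB (the ~22GB figure)
int4_gib = params * 0.5 / 1024**3  # 0.5 bytes/param -> ~5.2 GiB (the ~6GB figure)
print(f"BF16: {bf16_gib:.1f} GiB, INT4: {int4_gib:.1f} GiB")
```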
## How CREST Works
Standard transformers give every token the same amount of computation. CREST changes this by replacing each FFN with 4 independent MLPs and a learned halting gate:

- Easy token ("the") → 1 step (fast exit)
- Medium token ("fibonacci") → 2 steps (moderate processing)
- Hard token (algorithm logic) → 3–4 steps (deep reasoning)
The model learns which tokens need more thought. This is analogous to Kahneman's System 1 (fast, automatic) vs System 2 (slow, deliberate) thinking.
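For intuition, here is a minimal, self-contained PyTorch sketch of this kind of adaptive-depth FFN. It is an illustration only, not the 0labs implementation: the residual update, sigmoid gates, and probability-weighted mixture are assumptions, and a production kernel would skip computation for tokens that have already halted instead of always running all K steps.

```python
import torch
import torch.nn as nn

class AdaptiveDepthFFN(nn.Module):
    """Illustrative adaptive-depth FFN: K independent MLPs, each followed by a
    learned halting gate. A sketch only, not the actual CREST code."""

    def __init__(self, d_model: int, d_ff: int, k_steps: int = 4):
        super().__init__()
        self.mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(k_steps)
        )
        # One scalar halting score per token per step: h_k = sigmoid(w_k . state)
        self.gates = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(k_steps))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        state = x
        remaining = x.new_ones(*x.shape[:-1], 1)  # probability mass not yet halted
        output = torch.zeros_like(x)
        for mlp, gate in zip(self.mlps, self.gates):
            state = state + mlp(state)               # step output s_k (residual form assumed)
            h = torch.sigmoid(gate(state))           # halting score h_k in (0, 1)
            output = output + remaining * h * state  # weight s_k by p_k = h_k * prod(1 - h_j)
            remaining = remaining * (1.0 - h)
        return output + remaining * state            # leftover mass assigned to the final step
```

During training, ACT-style schemes usually add a ponder cost that penalizes the expected number of steps, which is what pushes the gates to halt early on easy tokens; the exact objective CREST uses is not reproduced here.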
## Model Details
| Property | Value |
|---|---|
| Architecture | CREST (0labs proprietary) |
| Total Parameters | 11.2B |
| CREST Steps (K) | 4 (all preserved) |
| Hidden Dimension | 2,560 |
| Layers | 32 |
| Attention Heads | 32 |
| Context Window | 32,768 tokens |
| Disk Size | ~22GB (BFloat16) |
| VRAM (BF16) | ~22GB |
| VRAM (INT4) | ~6GB |
| Vocabulary Size | 248,320 |
| License | Apache 2.0 |
## Quick Start

### Standard Loading (needs ~22GB VRAM)

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "0labs-in/Sky-v2.0-11B-INT4",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("0labs-in/Sky-v2.0-11B-INT4")
```
### 4-Bit Loading (needs only ~6GB VRAM) ⭐

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)

model = AutoModelForCausalLM.from_pretrained(
    "0labs-in/Sky-v2.0-11B-INT4",
    trust_remote_code=True,
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("0labs-in/Sky-v2.0-11B-INT4")
```
### Generation

```python
prompt = "Implement a binary search algorithm in Python with detailed comments."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
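For interactive use you can stream tokens as they are generated rather than waiting for the full completion. This uses the stock `TextStreamer` helper from `transformers` (nothing CREST-specific) and works the same in BF16 or INT4 mode:

```python
from transformers import TextStreamer

# Prints decoded tokens to stdout as generation proceeds.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1024, do_sample=True,
                   temperature=0.7, top_p=0.9, streamer=streamer)
```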
**Note:** `trust_remote_code=True` is required — CREST uses custom architecture code included in this repository.
## Hardware Requirements
| Setup | Loading Mode | VRAM | Works? |
|---|---|---|---|
| RTX 4090 (24GB) | BF16 | ~22GB | ✅ |
| RTX 3090 (24GB) | BF16 | ~22GB | ✅ |
| A100 (40/80GB) | BF16 | ~22GB | ✅ |
| MI300X (192GB) | BF16 | ~22GB | ✅ |
| RTX 3060 (12GB) | INT4 | ~6GB | ✅ |
| RTX 4060 (8GB) | INT4 | ~6GB | ✅ |
| Google Colab T4 (15GB) | INT4 | ~6GB | ✅ |
| Apple M1/M2 (8GB+) | INT4 | ~6GB | ✅ |
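If you want one script that runs across all of these setups, a simple heuristic is to choose the loading mode from the detected GPU memory. This is a sketch, not official guidance: the 24GB cutoff is an assumption based on the ~22GB BF16 footprint, and bitsandbytes 4-bit loading targets CUDA GPUs, so Apple Silicon needs a different quantization path.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_sky(model_id: str = "0labs-in/Sky-v2.0-11B-INT4", bf16_cutoff_gib: float = 24.0):
    """Load in BF16 when the GPU can hold the ~22GB weights, else fall back to 4-bit NF4."""
    kwargs = {"trust_remote_code": True, "device_map": "auto"}
    total_gib = (
        torch.cuda.get_device_properties(0).total_memory / 1024**3
        if torch.cuda.is_available()
        else 0.0
    )
    if total_gib >= bf16_cutoff_gib:
        kwargs["torch_dtype"] = torch.bfloat16  # full-precision weights fit
    else:
        kwargs["quantization_config"] = BitsAndBytesConfig(  # ~6GB 4-bit fallback
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
        )
    return AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
```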
## Architecture: Full CREST (K=4)

```
Input from Attention
      ↓
┌──────────┐
│   MLP₁   │ ← Step 1: Core knowledge retrieval
└──────────┘
      ↓
Halting Gate → h₁ ≈ 0.99? → HALT (easy tokens stop here)
      ↓
┌──────────┐
│   MLP₂   │ ← Step 2: Verification & refinement
└──────────┘
      ↓
Halting Gate → h₂ high? → HALT (medium tokens stop here)
      ↓
┌──────────┐
│   MLP₃   │ ← Step 3: Logical reasoning
└──────────┘
      ↓
Halting Gate → h₃ high? → HALT
      ↓
┌──────────┐
│   MLP₄   │ ← Step 4: Deep computation (hardest tokens only)
└──────────┘
      ↓
Output = p₁·s₁ + p₂·s₂ + p₃·s₃ + p₄·s₄
```
All 4 steps are preserved in this model. The halting gates dynamically allocate computation where it's needed.
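The weights in the diagram's last line follow the usual adaptive-computation-time bookkeeping (an assumption here; the paper may define them differently): p_k is the probability that a token halts exactly at step k, with all leftover mass assigned to the final step.

```latex
% Assumed ACT-style halting weights; h_k is the gate score after step k.
p_k = h_k \prod_{j=1}^{k-1} (1 - h_j) \quad (k < K),
\qquad
p_K = \prod_{j=1}^{K-1} (1 - h_j)
```

These weights sum to 1, so the output is a convex combination of the step outputs: a token with h₁ ≈ 0.99 puts almost all of its mass on s₁ and effectively exits after one step.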
## Model Family
| Model | Parameters | CREST K | Best For |
|---|---|---|---|
| Sky v2.0-11B | 11.2B | 4 | Maximum capability |
| Sky v2.0-6.5B | 6.47B | 2 | Balanced performance |
| Sky v2.0-11B-INT4 | 11.2B | 4 | Full CREST on consumer GPUs |
| Sky v2.0-Lite | 6.47B | 2 | Laptops & edge devices |
## Citation

```bibtex
@article{jadav2026crest,
  title={CREST: Cognitively Recurrent Estimation of Step Termination for Adaptive-Depth Language Modeling},
  author={Jadav, Atharvsinh},
  year={2026},
  url={https://huggingface.co/0labs-in/Sky-v2.0-11B}
}
```
## About 0labs
0labs is an independent AI research lab founded by Atharvsinh Jadav in Gujarat, India. We build adaptive-depth LLMs that think harder on hard problems — trained on a single GPU.
- 🌐 Website: 0labs.in
- 🤗 HuggingFace: huggingface.co/0labs-in