---
license: mit
library_name: transformers
tags:
- deepseek-v4
- mixture-of-experts
- moe
- mhc
- csa
- hca
- scaffold
- random-init
pipeline_tag: text-generation
---

# DeepSeek-V4 Mini (3B): randomly initialized architecture replica

A faithful small-scale replica (~3.2B total / ~1.10B activated per token)
of the DeepSeek-V4 architecture, sized to be trainable on rented GPUs
and to map cleanly onto the full-scale V4-Flash dimensions for weight slicing.

This is a **randomly initialized** scaffold, so its outputs are noise. It is intended as:

- a reference architecture for ablation and hyperparameter-search experiments
- a target for weight transfer or slicing from the real V4-Pro / V4-Flash checkpoints

## Architecture summary

| Parameter | Value |
|---|---|
| hidden_size | 1536 |
| num_hidden_layers | 28 |
| num_attention_heads | 24 |
| num_key_value_heads | 1 (MQA) |
| head_dim | 64 |
| q_lora_rank / o_lora_rank | 512 / 512 |
| qk_rope_head_dim | 32 |
| o_groups | 4 |
| n_routed_experts | 24 |
| n_shared_experts | 1 |
| num_experts_per_tok | 4 |
| num_hash_layers | 2 |
| moe_intermediate_size | 768 |
| | compress_ratios | [0, 0, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 112, 4, 0] | | |
| index_topk / heads / head_dim | 192 / 16 / 96 |
| sliding_window | 64 |
| max_position_embeddings | 1,048,576 (YaRN factor=16) |
| vocab_size | 129280 (real V4-Flash tokenizer) |
| num_nextn_predict_layers | 1 (V3-style MTP) |
| hc_mult (n_hc) | 4 |
| Storage dtype | bfloat16 |
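
A few of the table entries are related by simple arithmetic, and working it out makes the layout easier to picture. The sketch below only restates numbers from the table; the dictionary `cfg`, the key `yarn_factor`, and the derived variable names are illustrative and are not taken from the repository's config file. The pre-YaRN context length is an inference from dividing the two table values, not a documented setting.

```python
# Values copied from the architecture table; key names are illustrative.
cfg = {
    "hidden_size": 1536,
    "num_attention_heads": 24,
    "head_dim": 64,
    "qk_rope_head_dim": 32,
    "o_groups": 4,
    "n_routed_experts": 24,
    "n_shared_experts": 1,
    "num_experts_per_tok": 4,
    "max_position_embeddings": 1_048_576,
    "yarn_factor": 16,  # hypothetical key name for "YaRN factor=16"
}

# Total attention width: heads * head_dim (happens to equal hidden_size here).
attn_width = cfg["num_attention_heads"] * cfg["head_dim"]

# Heads covered by each grouped output projection.
heads_per_o_group = cfg["num_attention_heads"] // cfg["o_groups"]

# Share of each head's dimensions that carry rotary position information.
rope_fraction = cfg["qk_rope_head_dim"] / cfg["head_dim"]

# Experts evaluated per token: routed top-k plus the shared expert.
active_experts = cfg["num_experts_per_tok"] + cfg["n_shared_experts"]

# Assumption: context length before YaRN scaling is max_position / factor.
base_context = cfg["max_position_embeddings"] // cfg["yarn_factor"]

print(attn_width, heads_per_o_group, rope_fraction, active_experts, base_context)
```

So each token touches 5 of the 25 experts in a layer, and half of every 64-dimensional head is rotated.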

## Quick start

```python
import os
import sys

import torch
from huggingface_hub import login, snapshot_download
from transformers import AutoModelForCausalLM, AutoTokenizer

login()  # private repo
local = snapshot_download(repo_id="kshitijthakkar/deepseek-v4-mini-3B-init")

# The modeling code ships inside the repo; put it on the path before loading.
sys.path.insert(0, os.path.join(local, "code"))
import deepseek_v4  # registers DeepseekV4{Config,ForCausalLM} with HF auto classes

tok = AutoTokenizer.from_pretrained(local)
model = AutoModelForCausalLM.from_pretrained(local, torch_dtype=torch.bfloat16)
model.eval()

ids = tok.apply_chat_template(
    [{"role": "user", "content": "Hello"}],
    return_tensors="pt", add_generation_prompt=True, return_dict=True,
)
with torch.no_grad():
    out = model(input_ids=ids["input_ids"])
print(out.logits.shape)  # (batch, sequence_length, vocab_size)
```

## Components implemented

mHC (Sinkhorn-Knopp) · CSA + Lightning Indexer · HCA · pure sliding-window ·
Shared-KV MQA + grouped output projection (per-group `wo_a`) · partial RoPE +
output `-i` rotation · attention sink · DeepseekMoE with `sqrt(softplus)`
routing · hash-routed early layers · clamped SwiGLU · MTP head · YaRN.
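
To make the `sqrt(softplus)` routing item concrete, here is a minimal pure-Python sketch of top-k expert selection with that scoring function, using the table's 24 routed experts and 4 experts per token. It is not the repository's implementation: the function names `softplus` and `route` are hypothetical, and renormalizing the selected scores so they sum to 1 is an assumption about how the gate weights are formed.

```python
import math


def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route(logits, top_k=4):
    """Pick top_k experts by sqrt(softplus(logit)) and return (index, weight) pairs."""
    scores = [math.sqrt(softplus(z)) for z in logits]
    # Indices of the top_k largest scores, highest first.
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    if total == 0.0:
        # Degenerate case (all scores underflow to 0): fall back to uniform weights.
        return [(i, 1.0 / top_k) for i in chosen]
    # Assumption: selected scores are renormalized to sum to 1.
    return [(i, scores[i] / total) for i in chosen]


# Deterministic example logits for 24 routed experts.
example_logits = [0.1 * (i % 7) - 0.3 for i in range(24)]
selection = route(example_logits, top_k=4)
print(selection)
```

Because softplus is strictly positive for finite inputs, every expert gets a positive score, which is one practical difference from softmax-then-top-k gating.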

Every component matches the math of the official `inference/model.py` and
`kernel.py:hc_split_sinkhorn`. FP4/FP8 quantization and Hadamard rotation
are omitted because they are inference optimizations, not part of the architecture.
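
For readers unfamiliar with the Sinkhorn-Knopp step named in the mHC component, the sketch below shows the generic algorithm: alternately normalizing the rows and columns of a positive matrix until it is approximately doubly stochastic. It is a textbook version in pure Python, not a port of `hc_split_sinkhorn`; the function name `sinkhorn_knopp`, the iteration count, and the 4×4 example (chosen to mirror `hc_mult = 4`) are assumptions made for illustration.

```python
def sinkhorn_knopp(matrix, iterations=50):
    """Return an approximately doubly stochastic copy of a positive square matrix."""
    m = [row[:] for row in matrix]
    n = len(m)
    for _ in range(iterations):
        # Row step: make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: make every column sum to 1.
        col_sums = [sum(m[r][c] for r in range(n)) for c in range(n)]
        m = [[m[r][c] / col_sums[c] for c in range(n)] for r in range(n)]
    return m


# Hypothetical 4x4 positive mixing matrix (all entries must be > 0).
raw = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 1.0, 4.0, 3.0],
    [3.0, 4.0, 1.0, 2.0],
    [4.0, 3.0, 2.0, 1.5],
]
balanced = sinkhorn_knopp(raw)
row_sums = [sum(row) for row in balanced]
col_sums = [sum(balanced[r][c] for r in range(4)) for c in range(4)]
print(row_sums, col_sums)
```

After enough iterations both the row sums and the column sums are close to 1, so mixing with such a matrix neither amplifies nor shrinks the total signal across the parallel streams.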

## Citation

```bibtex
@misc{deepseek_v4_2026,
  author = {DeepSeek-AI},
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  year   = {2026},
  url    = {https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash}
}
```