Add gpu-space-cpu SKILL
gpu-space-cpu/SKILL.md (+49 −0)
---
name: gpu-space-cpu
description: Convert GPU HuggingFace Spaces to CPU-only for free tier deployment. Use when user says "make this run on CPU", "convert to CPU Space", "remove GPU dependencies", "deploy without GPU", "free HuggingFace Space", or "port CUDA app to CPU". Also use when analyzing repos with bitsandbytes, flash-attn, or CUDA dependencies for CPU compatibility.
---

# GPU → CPU Space Demo Conversion

- Free CPU Space tier ("CPU basic") = 2 vCPUs + 16 GB RAM + 50 GB non-persistent disk

## Workflow

1. Grep for GPU deps: `@spaces.GPU|bitsandbytes|flash-attn|triton|xformers|auto-gptq|exllama|apex|\.cuda\(|device.*cuda`
2. Remove GPU packages and rewrite code: `cuda` → `cpu`; remove `.half()`, `.cuda()`, and `device_map="auto"`; load weights in `float32` (quantize to INT8 only if the model is too large, see Quantization below)
3. Create: `app.py` (all logic; prefer recent built-in Gradio components, [v6+](https://www.gradio.app/guides/gradio-6-migration-guide)), `README.md`, `requirements.txt`
4. Test locally: `pip install -r requirements.txt && python app.py`, and test the API by enabling the MCP server (https://www.gradio.app/guides/building-mcp-server-with-gradio)
5. Deploy only after the local test passes
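Step 1 can also be scripted instead of run as a shell grep. A minimal stdlib sketch (the helper name `find_gpu_deps` is hypothetical; the pattern is the one from step 1):

```python
import re

# Pattern from step 1: any match signals a GPU-only dependency or code path.
GPU_PATTERNS = re.compile(
    r"@spaces\.GPU|bitsandbytes|flash-attn|triton|xformers"
    r"|auto-gptq|exllama|apex|\.cuda\(|device.*cuda"
)

def find_gpu_deps(source: str) -> list[str]:
    """Return every GPU-dependent snippet found in a file's text."""
    return GPU_PATTERNS.findall(source)

# Two hits, one clean line.
sample = 'model.cuda()\ndevice = torch.device("cuda")\nprint("ok")\n'
print(find_gpu_deps(sample))
```

Run it over each `*.py` file and `requirements.txt` in the repo; an empty result list means step 2 is a no-op for that file.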

## requirements.txt

```
--extra-index-url https://download.pytorch.org/whl/cpu
torch
```

Never pin transitive deps. No packages.txt needed (ffmpeg, git, and cmake are pre-installed).
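To confirm pip actually resolved the CPU wheel (a local sanity check, not part of the Space itself), note that CPU-only torch builds ship without a bundled CUDA runtime:

```python
import torch

# CPU-only wheels report no CUDA version and typically carry a "+cpu"
# suffix in the version string; CUDA wheels report e.g. "12.x".
print(torch.__version__)
print(torch.version.cuda)
print(torch.cuda.is_available())
```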

## README.md

```yaml
---
title: Name
emoji: X
sdk: gradio
sdk_version: 6.3.0
app_file: app.py
---
```

## Quantization (model too large)

1. ONNX Runtime INT8 via Optimum with `OMP_NUM_THREADS=2` (preferred: onnxruntime is fast on CPU)
2. TorchAO
3. `torch.quantization.quantize_dynamic`

Note: ONNX FP32 is often faster than FP16 on CPU.
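Option 3 in a minimal, runnable form on a toy model (not a real Space workload): `quantize_dynamic` converts `nn.Linear` weights to INT8 while activations stay float32.

```python
import torch
import torch.nn as nn

# Toy float32 model standing in for a real network.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 8))

# Dynamically quantize all Linear layers to INT8 weights;
# activations are quantized/dequantized on the fly at inference.
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
out = qmodel(x)
print(out.shape)  # torch.Size([1, 8])
```

Dynamic quantization needs no calibration data, which is why it is the quick fallback here; for best CPU latency the ONNX Runtime INT8 path above is still preferred.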

## Stop

- Model >12 GB even with INT8 → needs GPU
- Unacceptable INT8 quality loss → needs GPU
- 3 failed approaches → tell the user honestly
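The first stop rule can be checked before any conversion work. A rough sketch assuming ~1 byte per parameter for INT8 weights and ignoring activation/overhead memory (`needs_gpu` is a hypothetical helper; the 12 GB budget is this skill's threshold, leaving headroom inside 16 GB RAM):

```python
GB = 1024 ** 3

def int8_size_gb(num_params: int) -> float:
    """Approximate INT8 weight footprint: ~1 byte per parameter."""
    return num_params / GB

def needs_gpu(num_params: int, limit_gb: float = 12.0) -> bool:
    """Stop condition: model still over budget even after INT8."""
    return int8_size_gb(num_params) > limit_gb

print(needs_gpu(7_000_000_000))   # 7B params ≈ 6.5 GB → False
print(needs_gpu(14_000_000_000))  # 14B params ≈ 13 GB → True
```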