Alibrown committed (verified)
Commit e872955 · 1 Parent(s): 6549bb3

Update README.md

Files changed (1): README.md +39 -1
README.md CHANGED
@@ -10,5 +10,43 @@ pinned: false
  license: apache-2.0
  short_description: 'SmolLM2-360M-Instruct '
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # SmolLM2 360M Instruct Demo
+
+ This Space demonstrates the SmolLM2-360M-Instruct model with a CPU fallback mechanism. It is designed to run efficiently even on the Hugging Face Free Tier (2 vCPUs).
+
+ ## Overview
+
+ A minimal but production-ready LLM service built on:
+ * **Model:** SmolLM2-360M-Instruct (approx. 269 MB, Apache 2.0).
+ * **Efficiency:** Optimized to run on 2 vCPUs and a minimum of 2 GB RAM (the HF Free Tier supports up to 16 GB).
+ * **Scalability:** Well suited for local training and testing.
+
+ ## Related Project: SmolLM2-customs
+
+ If you are interested in training small LLMs the lazy way, check out:
+ [https://github.com/VolkanSah/SmolLM2-customs](https://github.com/VolkanSah/SmolLM2-customs)
+
+ **Features of the custom implementation:**
+ * **FastAPI:** OpenAI-compatible `/v1/chat/completions` endpoint.
+ * **ADI (Anti-Dump Index):** Filters low-quality requests before they hit the model.
+ * **HF Dataset Integration:** Logs every request for later analysis and fine-tuning.
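Because the endpoint follows the OpenAI chat-completions schema, any OpenAI-style client can talk to it. A minimal sketch of a request payload, assuming a hypothetical Space URL (the model name and generation parameters are illustrative, not the project's actual defaults):

```python
import json

# Hypothetical base URL -- replace with your own Space's URL.
BASE_URL = "https://your-space.hf.space"

# Payload in the OpenAI chat-completions format expected by
# the /v1/chat/completions endpoint.
payload = {
    "model": "SmolLM2-360M-Instruct",
    "messages": [
        {"role": "user", "content": "Explain CPU fallback in one sentence."}
    ],
    "max_tokens": 64,
}

# To actually send the request (requires the `requests` package):
#   import requests
#   r = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload)
#   print(r.json()["choices"][0]["message"]["content"])

print(json.dumps(payload, indent=2))
```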
+
+ ---
+
+ ## Deployment & Usage
+
+ You do not need an API key for this public demo, but rate limits apply.
+
+ ### How to run your own instance
+ 1. **Duplicate/Clone** this Space.
+ 2. **Environment Variables:** To use your own model access or private weights, add one of the following keys to your **Secrets**:
+    * `HF_TOKEN`
+    * `TEST_TOKEN`
+    * `HUGGINGFACE_TOKEN`
+    * `HF_API_TOKEN`
+
+ The code uses flexible token-resolution logic to ensure compatibility with older or custom key names.
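The token resolution can be pictured as a first-match lookup over the key names listed above; the resolver function below is an illustrative sketch, not the Space's actual code:

```python
import os

# Candidate environment variable names, checked in order of precedence.
# The names come from the README; the ordering here is an assumption.
TOKEN_KEYS = ("HF_TOKEN", "TEST_TOKEN", "HUGGINGFACE_TOKEN", "HF_API_TOKEN")


def resolve_token():
    """Return the first non-empty token found among the known key names."""
    for key in TOKEN_KEYS:
        value = os.environ.get(key)
        if value:
            return value
    return None  # anonymous access: fine for public models
```

With this pattern, a Space duplicated from an older setup that only defines `HUGGINGFACE_TOKEN` keeps working without any code changes.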
+
+ ## Technical Details
+
+ The inference pipeline uses `transformers` with `torch`. It automatically detects whether a GPU is available; otherwise it falls back to CPU execution without breaking the Gradio interface.
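A minimal sketch of this detect-GPU-or-fall-back pattern, assuming `torch` may or may not be installed (the function name and structure are illustrative, not the Space's actual code):

```python
def select_device():
    """Detect an available accelerator, falling back to CPU.

    If `torch` is missing or no GPU is visible, inference still
    runs on CPU instead of crashing.
    """
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"


# The selected device string is then passed to the model-loading code,
# e.g. pipeline("text-generation", model=..., device=select_device()).
```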