qubitpage committed on Apr 22

Commit 3ba2208 · verified · 1 Parent(s): 90885ba

Upload hf_model.py with huggingface_hub

Files changed (1)

hf_model.py +29 -11

@@ -357,18 +357,36 @@ class SentinelBrainForCausalLM(PreTrainedModel, GenerationMixin):
         B, T = input_ids.shape
         x = self.tok_emb(input_ids)
-        # Determine if we have valid past KV caches
+        # Determine if we have valid past KV caches.
+        # Support: list-of-tuples (legacy), tuple-of-tuples, and DynamicCache (new transformers).
         has_past = False
         past_len = 0
-        if past_key_values is not None and len(past_key_values) > 0:
-            first = past_key_values[0]
-            if first is not None:
-                if isinstance(first, (tuple, list)) and len(first) > 0 and first[0] is not None:
-                    has_past = True
-                    past_len = first[0].shape[2]
-                elif hasattr(first, 'shape'):
-                    has_past = True
-                    past_len = first.shape[2]
+        _legacy_past = None  # normalized to list-of-tuples form
+        if past_key_values is not None:
+            # New API: DynamicCache or similar Cache object
+            if hasattr(past_key_values, "to_legacy_cache"):
+                try:
+                    legacy = past_key_values.to_legacy_cache()
+                    if legacy is not None and len(legacy) > 0:
+                        _legacy_past = list(legacy)
+                        first = _legacy_past[0]
+                        if first is not None and len(first) > 0 and first[0] is not None:
+                            has_past = True
+                            past_len = first[0].shape[2]
+                except Exception:
+                    pass
+            # Legacy API: list/tuple of (k, v) tuples
+            elif isinstance(past_key_values, (list, tuple)) and len(past_key_values) > 0:
+                _legacy_past = list(past_key_values)
+                first = _legacy_past[0]
+                if first is not None:
+                    if isinstance(first, (tuple, list)) and len(first) > 0 and first[0] is not None:
+                        has_past = True
+                        past_len = first[0].shape[2]
+                    elif hasattr(first, "shape"):
+                        has_past = True
+                        past_len = first.shape[2]
         rope_cos, rope_sin = self.rope(past_len + T)
         rope_cos = rope_cos[:, :, past_len:past_len + T].to(x.device)
@@ -379,7 +397,7 @@ class SentinelBrainForCausalLM(PreTrainedModel, GenerationMixin):
         total_z = 0.0
         for i, layer in enumerate(self.layers):
-            kv_cache = past_key_values[i] if has_past else None
+            kv_cache = _legacy_past[i] if (has_past and _legacy_past is not None and i < len(_legacy_past)) else None
             x, new_kv, aux, z = layer(x, rope_cos, rope_sin, kv_cache=kv_cache)
             new_kv_caches.append(new_kv)
             total_aux += aux
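
The cache handling in this commit can be read as one normalization step: whatever form `past_key_values` arrives in, reduce it to a list of per-layer entries and a cached sequence length. Below is a minimal standalone sketch of that idea, not the code in `hf_model.py`. The names `normalize_past`, `FakeTensor`, and `FakeCache` are hypothetical and exist only for illustration. The sketch assumes, as the `shape[2]` reads in the diff suggest, that cached tensors are laid out as (batch, heads, seq_len, head_dim). It deliberately differs from the diff in two ways: it returns `None` whenever no usable past is found, and it lets an exception from `to_legacy_cache()` propagate rather than suppressing it.

```python
from typing import Any, List, Optional, Tuple


def normalize_past(past_key_values: Any) -> Tuple[Optional[List[Any]], int]:
    """Return (legacy_past, past_len) for the cache layouts the diff handles.

    Hypothetical helper: legacy_past is a list of per-layer entries, or None
    when there is no usable cache; past_len is the cached sequence length.
    """
    if past_key_values is None:
        return None, 0

    # Cache-object path: same duck-typed check as the diff.
    if hasattr(past_key_values, "to_legacy_cache"):
        legacy = past_key_values.to_legacy_cache()
    # Legacy path: list or tuple of per-layer entries.
    elif isinstance(past_key_values, (list, tuple)):
        legacy = past_key_values
    else:
        return None, 0

    if not legacy:
        return None, 0

    legacy = list(legacy)
    first = legacy[0]
    if first is None:
        return None, 0

    # A per-layer entry is either a (key, value) pair or a bare tensor.
    if isinstance(first, (tuple, list)):
        if len(first) == 0 or first[0] is None:
            return None, 0
        return legacy, first[0].shape[2]  # assumed seq_len axis
    if hasattr(first, "shape"):
        return legacy, first.shape[2]
    return None, 0


class FakeTensor:
    """Hypothetical stand-in for a tensor; only .shape is used here."""

    def __init__(self, *shape: int) -> None:
        self.shape = shape


class FakeCache:
    """Hypothetical stand-in for a cache object exposing to_legacy_cache()."""

    def __init__(self, layers: List[Any]) -> None:
        self._layers = layers

    def to_legacy_cache(self) -> tuple:
        return tuple(self._layers)


# Two layers, each caching 7 positions.
kv = (FakeTensor(1, 4, 7, 16), FakeTensor(1, 4, 7, 16))
legacy, past_len = normalize_past([kv, kv])
cache_legacy, cache_len = normalize_past(FakeCache([kv, kv]))
print(past_len, cache_len, normalize_past(None))  # prints: 7 7 (None, 0)
```

Pulling the logic into a pure function like this makes each cache layout testable without building a model. Whether a failure in `to_legacy_cache()` should raise, or fall back silently to a no-cache forward pass as the commit does, is a design choice.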