	---
	license: apache-2.0
	language:
	- en
	library_name: gguf
	pipeline_tag: text-generation
	base_model: sapientinc/HRM-Text-1B
	base_model_relation: quantized
	tags:
	- gguf
	- bf16
	- q8_0
	- q6_k
	- q5_k_m
	- quantized
	- llama.cpp
	- hrm
	- hierarchical-reasoning
	- prefix-lm
	- pre-alignment
	- non-chat
	- non-instruction-tuned
	---

	# HRM-Text-1B GGUF

	This repository contains a BF16 GGUF conversion of [`sapientinc/HRM-Text-1B`](https://huggingface.co/sapientinc/HRM-Text-1B) and validated `Q8_0`, `Q6_K`, and `Q5_K_M` quantizations derived from that BF16 GGUF.

	The GGUF files use:

	- `general.architecture = hrm_text`
	- BF16 source tensor storage or standard `llama.cpp` quantized tensor storage
	- the original tokenizer from `tokenizer.json`
	- no injected chat template
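
One way to confirm the `general.architecture` value listed above is to read the GGUF key/value header directly. The following is a minimal sketch, not the tooling used for this repository: it assumes a little-endian GGUF version 2 or 3 file (64-bit counts and string lengths), and `read_architecture` is a hypothetical helper name.

```python
import struct

# GGUF scalar value type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size little-endian value from the stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """Read a GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    data = f.read(length)
    if len(data) != length:
        raise ValueError("unexpected end of file")
    return data.decode("utf-8")


def _read_value(f, value_type):
    """Read one metadata value of the given GGUF type id."""
    if value_type in _SCALARS:
        return _read(f, _SCALARS[value_type])
    if value_type == _STRING:
        return _read_string(f)
    if value_type == _ARRAY:
        item_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, item_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {value_type}")


def read_architecture(f):
    """Return the general.architecture string, or None if the key is absent."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    _read(f, "<I")             # format version (assumed 2 or 3)
    _read(f, "<Q")             # tensor count (unused here)
    kv_count = _read(f, "<Q")  # number of metadata key/value pairs
    for _ in range(kv_count):
        key = _read_string(f)
        value = _read_value(f, _read(f, "<I"))
        if key == "general.architecture":
            return value
    return None
```

Opening `HRM-Text-1B-BF16.gguf` in binary mode and passing the file object to `read_architecture` should return `hrm_text` if the file matches the description above.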

	This is not a chat model and is not instruction tuned. "Useful output" for this repository means alignment with the original Transformers model on the same prompt, not chat-assistant behavior.

	## Compatibility Notice

	Standard upstream `llama.cpp`, Ollama, LM Studio, and `llama-cpp-python` are not expected to load these files until `hrm_text` is supported upstream.

	Use the included patch:

	```text
	runtime/llama.cpp-hrm_text.patch
	```

	The patch was built against:

	```text
	ggml-org/llama.cpp commit 6a257d44633d4a752183ed778b88d2924d0a6b9d
	```

	Only the normal causal generation path is implemented in the patched runtime. Prefix-LM bidirectional `token_type_ids` are not supported by the `llama.cpp` path in this release.

	## Files

	| File | Description |
	| --- | --- |
	| `HRM-Text-1B-BF16.gguf` | BF16 GGUF conversion of `sapientinc/HRM-Text-1B` |
	| `HRM-Text-1B-Q8_0.gguf` | Validated `Q8_0` quantization from BF16 |
	| `HRM-Text-1B-Q6_K.gguf` | Validated `Q6_K` quantization from BF16 |
	| `HRM-Text-1B-Q5_K_M.gguf` | Validated `Q5_K_M` quantization from BF16 |
	| `runtime/llama.cpp-hrm_text.patch` | Patch adding `hrm_text` conversion and runtime support to the clean `llama.cpp` base commit |
	| `reports/validation/final_report.md` | Human-readable conversion and validation report |
	| `reports/validation/quantization_report.md` | Quantization report, hashes, and pass/fail summary |
	| `reports/validation/baseline_transformers.json` | Transformers baseline prompts, logits, and continuations |
	| `reports/validation/bf16_tensor_validation.json` | Tensor-level GGUF validation |
	| `reports/validation/bf16_vs_hf.json` | Runtime logit and text validation |
	| `reports/validation/q8_0_vs_bf16.json` | `Q8_0` vs BF16 runtime validation |
	| `reports/validation/q6_k_vs_bf16.json` | `Q6_K` vs BF16 runtime validation |
	| `reports/validation/q5_k_m_vs_bf16.json` | `Q5_K_M` vs BF16 runtime validation |

	## Provenance

	| Item | Value |
	| --- | --- |
	| Source model | `sapientinc/HRM-Text-1B` |
	| Source snapshot SHA | `2285b999f6fb8a5b16e0cc313a9e8e4fe447140d` |
	| Source `model.safetensors` SHA256 | `F8FE2B2BF6948414E8E8D6538659198726D98F967C55B533B7AABE8A1FA9A584` |
	| BF16 GGUF SHA256 | `2DD5E2EF55E40C46DB0D0CB4CF1427A4E72DA34FEE36F0D2B73D081D0E1C2010` |
	| BF16 GGUF size | `2,367,995,648` bytes |
	| llama.cpp base commit | `6a257d44633d4a752183ed778b88d2924d0a6b9d` |

	## Available GGUF Files

	| Variant | File | Size (bytes) | SHA256 |
	| --- | --- | ---: | --- |
	| BF16 | `HRM-Text-1B-BF16.gguf` | `2367995648` | `2DD5E2EF55E40C46DB0D0CB4CF1427A4E72DA34FEE36F0D2B73D081D0E1C2010` |
	| Q8_0 | `HRM-Text-1B-Q8_0.gguf` | `1259126560` | `C0729C267C3421E1F6DE0488AC5448E98EA30E56514DAF210596B70AC3F9786D` |
	| Q6_K | `HRM-Text-1B-Q6_K.gguf` | `972668704` | `24D93CA4EF4A02CFE415E3EA56A78AD65198A165A4157B928004B58DBDA2D93C` |
	| Q5_K_M | `HRM-Text-1B-Q5_K_M.gguf` | `851509024` | `F6CE71A076EC897174C555D810ED6E379767D52F9396D485B42E42BF8DB1D0B7` |
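
After downloading, the files can be checked against the SHA256 values in the table above. This is a minimal sketch using only the Python standard library; the hashes are copied from the table, while `sha256_of` and `verify` are hypothetical helper names.

```python
import hashlib
from pathlib import Path

# Expected digests, copied from the "Available GGUF Files" table.
EXPECTED_SHA256 = {
    "HRM-Text-1B-BF16.gguf": "2DD5E2EF55E40C46DB0D0CB4CF1427A4E72DA34FEE36F0D2B73D081D0E1C2010",
    "HRM-Text-1B-Q8_0.gguf": "C0729C267C3421E1F6DE0488AC5448E98EA30E56514DAF210596B70AC3F9786D",
    "HRM-Text-1B-Q6_K.gguf": "24D93CA4EF4A02CFE415E3EA56A78AD65198A165A4157B928004B58DBDA2D93C",
    "HRM-Text-1B-Q5_K_M.gguf": "F6CE71A076EC897174C555D810ED6E379767D52F9396D485B42E42BF8DB1D0B7",
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so multi-gigabyte GGUFs are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def verify(directory):
    """Return {file name: digest matches} for each listed file present in directory."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = sha256_of(path) == expected
    return results
```

Calling `verify("HRM-Text-1B-GGUF")` reports only the variants that were actually downloaded; a `False` entry suggests a truncated or modified file.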

	## Validation Summary

	Validation was performed from a clean source snapshot and a clean `llama.cpp` base checkout.

	| Check | Result |
	| --- | --- |
	| Tensor validation | Pass, `259/259` tensors found and compared |
	| Tensor values | BF16 tensor bits match HF after expected BF16 conversion |
	| Prompt token IDs | Match for all validation prompts |
	| Next-token top-1 | Match on `4/4` prompts |
	| Top-10 overlap | `10/10` for all prompts |
	| Text validation | BF16 GGUF continuations are aligned with Transformers baseline |

	Quantized variants were validated against the BF16 GGUF:

	| Variant | Token IDs | Top-1 matches | Min top-10 overlap | New loop check | Result |
	| --- | --- | ---: | ---: | --- | --- |
	| Q8_0 | Pass | `4/4` | `9/10` | Pass | Pass |
	| Q6_K | Pass | `4/4` | `9/10` | Pass | Pass |
	| Q5_K_M | Pass | `4/4` | `9/10` | Pass | Pass |
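
The top-1 and top-10 columns can be read as comparisons of next-token logits from two runtimes on the same prompt. The sketch below shows one plausible way to compute them; it is not the validation script used for the reports, and `top_k_ids` and `compare_next_token` are hypothetical names.

```python
def top_k_ids(logits, k):
    """Indices of the k largest logits; ties are broken by the lower index."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]


def compare_next_token(reference, candidate, k=10):
    """Compare two next-token logit vectors over the same vocabulary."""
    ref_top = top_k_ids(reference, k)
    cand_top = top_k_ids(candidate, k)
    return {
        # True when both runtimes pick the same most likely token.
        "top1_match": ref_top[0] == cand_top[0],
        # Number of token ids shared by the two top-k sets (order ignored).
        "top_k_overlap": len(set(ref_top) & set(cand_top)),
    }
```

Under this reading, a `9/10` entry means that one of the ten highest-scoring tokens differs between the quantized file and the BF16 GGUF on at least one prompt, while the top-1 token still matches.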

	Full-vocab mean absolute logit error:

	| Prompt | MAE |
	| --- | ---: |
	| `The quick brown fox` | `0.0199148655` |
	| `In a distant future, humanity` | `0.0051696529` |
	| `Question: What is 2+2?\nAnswer:` | `0.0076530445` |
	| `def fibonacci(n):` | `0.0045031775` |
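
For reference, a full-vocabulary mean absolute error is conventionally the average of the absolute per-token logit differences. The sketch below assumes that definition; the README does not spell out the exact formula, and `mean_absolute_error` is a hypothetical helper name.

```python
def mean_absolute_error(reference, candidate):
    """Average of |reference[i] - candidate[i]| over every vocabulary entry."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    if not reference:
        raise ValueError("logit vectors must not be empty")
    total = sum(abs(r - c) for r, c in zip(reference, candidate))
    return total / len(reference)
```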

	The original model already repeats on some prompts. Repetition by itself is not treated as a conversion failure unless it is newly introduced by the GGUF runtime. The BF16 GGUF validation did not reproduce the unrelated garbage pattern seen in a previous broken conversion attempt.
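
The idea of flagging only newly introduced repetition can be sketched as follows. This is an illustrative heuristic, not the check used in the validation reports: the function names, the maximum period of 8 tokens, and the threshold of 3 consecutive repeats are all assumptions.

```python
def has_repeating_loop(tokens, max_period=8, min_repeats=3):
    """True if some block of up to max_period tokens repeats min_repeats times in a row."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        for start in range(0, n - span + 1):
            block = tokens[start:start + period]
            if all(
                tokens[start + r * period:start + (r + 1) * period] == block
                for r in range(1, min_repeats)
            ):
                return True
    return False


def introduces_new_loop(baseline_tokens, candidate_tokens, **kwargs):
    """True only when the candidate loops and the baseline continuation does not."""
    return (has_repeating_loop(candidate_tokens, **kwargs)
            and not has_repeating_loop(baseline_tokens, **kwargs))
```

With this rule, a continuation that loops in both the Transformers baseline and the GGUF runtime is not counted as a failure, which matches the policy described above.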

	## Example Runtime Setup

	Download this repository:

	```powershell
	pip install -U huggingface_hub
	hf download sinimiini/HRM-Text-1B-GGUF --local-dir HRM-Text-1B-GGUF
	```

	Patch and build `llama.cpp`:

	```powershell
	git clone https://github.com/ggml-org/llama.cpp
	cd llama.cpp
	git checkout 6a257d44633d4a752183ed778b88d2924d0a6b9d
	git apply ..\HRM-Text-1B-GGUF\runtime\llama.cpp-hrm_text.patch
	cmake -B build -S . -DGGML_NATIVE=OFF
	cmake --build build --config Release --target llama-cli llama-completion llama-results
	```

	Run a short causal-generation smoke test:

	```powershell
	.\build\bin\Release\llama-cli.exe -m ..\HRM-Text-1B-GGUF\HRM-Text-1B-BF16.gguf -p "The quick brown fox" -n 32 --temp 0 --no-conversation
	```

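	Depending on the CMake generator and build type, the executable may be under `build\bin\llama-cli.exe` instead of `build\bin\Release\llama-cli.exe`. Multi-config generators such as Visual Studio add the `Release` subdirectory; single-config generators such as Ninja do not.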
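
A script that drives the smoke test can probe both locations instead of hard-coding one. This is a small sketch; `find_llama_cli` is a hypothetical helper, and the two candidate paths are the ones named above.

```python
from pathlib import Path


def find_llama_cli(build_dir, name="llama-cli.exe"):
    """Return the first existing llama-cli path under build_dir, or None."""
    candidates = [
        Path(build_dir) / "bin" / "Release" / name,  # build\bin\Release layout
        Path(build_dir) / "bin" / name,              # build\bin layout
    ]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None
```

On Linux or macOS the same function can be called with `name="llama-cli"`.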

	## Limitations

	- `hrm_text` is a custom GGUF architecture in this conversion.
	- Generic GGUF runners will not work until they implement the HRM runtime graph.
	- Prefix-LM bidirectional attention with `token_type_ids` is not implemented in the patched `llama.cpp` path.

	## License

	The source model is released under the Apache 2.0 license. See [`LICENSE`](./LICENSE).