Commit 7e9f789 by root · 1 Parent(s): 60a6bce

Move GGUF to dedicated repos, add GGUF section linking to collection
README.md CHANGED
@@ -7,7 +7,6 @@ tags:
  - peft
  - safetensors
  - lora
- - gguf
  - complexity-classification
  - llm-routing
  - query-difficulty
@@ -42,11 +41,11 @@ model-index:
 
  <div align="center">
 
- # Brick Complexity Extractor
 
  ### A lightweight LoRA adapter for real-time query complexity classification
 
- **[Regolo.ai](https://regolo.ai) | [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) | [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) | [API Docs](https://docs.regolo.ai)**
 
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
@@ -84,7 +83,7 @@ The adapter adds only **~2M trainable parameters** on top of the 0.8B base model
 
  ## The Problem: Why LLM Routing Needs Complexity Classification
 
- Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints...") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
 
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
 
@@ -112,26 +111,26 @@ The adapter applies LoRA to the query and value projection matrices (`q_proj`, `
 
  ```
  Qwen3.5-0.8B (frozen)
- +-- Attention Layers x 24
-     |-- q_proj <- LoRA(r=16, alpha=32)
-     +-- v_proj <- LoRA(r=16, alpha=32)
- +-- Last Hidden State
- +-- Classification Head (3 classes)
  ```
 
  ## Label Definitions
 
  | Label | Reasoning Steps | Description | Example |
  |---|---|---|---|
- | **easy** | 1-2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
- | **medium** | 3-5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
 
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
 
  ## Performance
 
- ### Classification Metrics (Test Set -- 3,841 samples)
 
  | Metric | Value |
  |---|---|
@@ -199,106 +198,17 @@ print(f"Complexity: {predicted}")
  # https://github.com/regolo-ai/brick-SR1
  ```
 
- ---
-
  ## GGUF Quantized Models
 
- Pre-built GGUF files are available for inference with [llama.cpp](https://github.com/ggml-org/llama.cpp), [Ollama](https://ollama.com), [LM Studio](https://lmstudio.ai), [vLLM](https://github.com/vllm-project/vllm), and other GGUF-compatible runtimes.
-
- These files contain the **full merged model** (base Qwen3.5-0.8B + LoRA adapter merged), so no separate adapter loading is needed.
-
- ### Available Quantizations
 
- | File | Quant | Size | BPW | Notes |
  |---|---|---|---|---|
- | `brick-complexity-extractor-BF16.gguf` | BF16 | 1.5 GB | 16.0 | Full precision, no quality loss |
- | `brick-complexity-extractor-Q8_0.gguf` | Q8_0 | 775 MB | 8.0 | Near-lossless, recommended for accuracy |
- | `brick-complexity-extractor-Q4_K_M.gguf` | Q4_K_M | 494 MB | 5.5 | Best quality/size ratio |
-
- ### Usage with llama.cpp
-
- ```bash
- # Download a quantized model
- huggingface-cli download regolo/brick-complexity-extractor \
-   brick-complexity-extractor-Q8_0.gguf \
-   --local-dir ./models
-
- # Run inference
- ./llama-cli -m ./models/brick-complexity-extractor-Q8_0.gguf \
-   -p "<|im_start|>system
- You are a query difficulty classifier for an LLM routing system.
- Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
- Respond with ONLY one word: easy, medium, or hard.<|im_end|>
- <|im_start|>user
- Classify: What is the capital of France?<|im_end|>
- <|im_start|>assistant
- " \
-   -n 5 --temp 0
- ```
-
- ### Usage with Ollama
-
- ```bash
- # Create a Modelfile
- cat > Modelfile <<EOF
- FROM ./brick-complexity-extractor-Q8_0.gguf
-
- SYSTEM """You are a query difficulty classifier for an LLM routing system.
- Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
- Respond with ONLY one word: easy, medium, or hard."""
-
- TEMPLATE """<|im_start|>system
- {{ .System }}<|im_end|>
- <|im_start|>user
- Classify: {{ .Prompt }}<|im_end|>
- <|im_start|>assistant
- """
-
- PARAMETER temperature 0
- PARAMETER num_predict 5
- EOF
-
- ollama create brick-complexity -f Modelfile
- ollama run brick-complexity "Design a distributed consensus algorithm"
- # Output: hard
- ```
-
- ### Usage with vLLM
 
- ```python
- from vllm import LLM, SamplingParams
-
- llm = LLM(
-     model="regolo/brick-complexity-extractor",
-     quantization="gguf",
-     # Point to a specific GGUF file:
-     # model="./brick-complexity-extractor-Q8_0.gguf"
- )
-
- sampling_params = SamplingParams(temperature=0, max_tokens=5)
-
- prompt = """<|im_start|>system
- You are a query difficulty classifier for an LLM routing system.
- Classify each query as easy, medium, or hard.
- Respond with ONLY one word: easy, medium, or hard.<|im_end|>
- <|im_start|>user
- Classify: Explain the rendering equation from radiometric first principles<|im_end|>
- <|im_start|>assistant
- """
-
- output = llm.generate([prompt], sampling_params)
- print(output[0].outputs[0].text.strip())
- # Output: hard
- ```
-
- ### Important Note on GGUF Inference
-
- The GGUF models use **generative text output** (the model generates the word "easy", "medium", or "hard") rather than the logit-based classification used by the LoRA adapter. This means:
-
- - **LoRA adapter (recommended for production)**: Uses logit extraction at the last token position for the three label tokens. Faster and more reliable.
- - **GGUF (recommended for local/edge deployment)**: Generates the classification label as text. Slightly lower accuracy but works with any GGUF runtime without Python dependencies.
-
- ---
 
  ## Integration with Brick Semantic Router
 
@@ -339,14 +249,14 @@ model_pools:
 
  ## Intended Uses
 
- ### Primary Use Cases
- - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30-60% compared to always-frontier routing
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
 
- ### Out-of-Scope Uses
- - **Content moderation or safety filtering** -- this model classifies cognitive difficulty, not content safety
  - **Non-English queries** – trained on English data only; accuracy degrades significantly on other languages
  - **Direct use as a chatbot or generative model** – this is a classification adapter, not a generative model
 
@@ -364,7 +274,7 @@ model_pools:
  |---|---|
  | **Base model** | Qwen/Qwen3.5-0.8B |
  | **LoRA rank (r)** | 16 |
- | **LoRA alpha** | 32 |
  | **LoRA dropout** | 0.05 |
  | **Target modules** | q_proj, v_proj |
  | **Learning rate** | 2e-4 |
@@ -376,7 +286,7 @@ model_pools:
  | **Training samples** | 65,307 |
  | **Validation samples** | 7,683 |
  | **Test samples** | 3,841 |
- | **Training hardware** | 1x NVIDIA A100 80GB |
  | **Training time** | ~2 hours |
  | **Framework** | PyTorch + HuggingFace PEFT |
 
@@ -386,9 +296,9 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
 
  | Metric | Value |
  |---|---|
- | **Hardware** | 1x NVIDIA A100 80GB |
  | **Training duration** | ~2 hours |
- | **Estimated CO2** | < 0.5 kg CO2eq |
  | **Energy source** | Renewable (certified) |
  | **Location** | Italy (EU) |
 
@@ -411,6 +321,6 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
 
  <div align="center">
 
- **[Website](https://regolo.ai) | [Docs](https://docs.regolo.ai) | [Discord](https://discord.gg/myuuVFcfJw) | [GitHub](https://github.com/regolo-ai) | [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
 
  </div>
 
 
  <div align="center">
 
+ # 🧱 Brick Complexity Extractor
 
  ### A lightweight LoRA adapter for real-time query complexity classification
 
+ **[Regolo.ai](https://regolo.ai) · [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) · [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) · [API Docs](https://docs.regolo.ai)**
 
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
 
 
  ## The Problem: Why LLM Routing Needs Complexity Classification
 
+ Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
 
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
 
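The tier-routing idea described above can be sketched in a few lines. This is a hypothetical illustration only: the pool names are invented here, and Brick's real router combines the complexity label with the other signals listed.

```python
# Hypothetical sketch of complexity-based tier routing.
# Pool names are illustrative, not Brick's actual configuration.
POOL_BY_LABEL = {
    "easy": "small-pool",       # cheap, fast models for factual recall
    "medium": "mid-pool",       # general-purpose models
    "hard": "frontier-pool",    # reasoning-heavy frontier models
}

def route(label: str) -> str:
    """Map a predicted complexity label to a model pool, defaulting to mid."""
    return POOL_BY_LABEL.get(label, "mid-pool")
```

Defaulting unknown labels to the middle tier is a conservative choice: it bounds both the quality loss of under-routing and the cost of over-routing when the classifier output is malformed.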
 
 
  ```
  Qwen3.5-0.8B (frozen)
+ └── Attention Layers × 24
+      ├── q_proj ← LoRA(r=16, α=32)
+      └── v_proj ← LoRA(r=16, α=32)
+ └── Last Hidden State
+ └── Classification Head (3 classes)
  ```
 
  ## Label Definitions
 
  | Label | Reasoning Steps | Description | Example |
  |---|---|---|---|
+ | **easy** | 1–2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
+ | **medium** | 3–5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
 
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
 
  ## Performance
 
+ ### Classification Metrics (Test Set – 3,841 samples)
 
  | Metric | Value |
  |---|---|
 
  # https://github.com/regolo-ai/brick-SR1
  ```
 
  ## GGUF Quantized Models
 
+ Pre-built GGUF files are available for inference with llama.cpp, Ollama, LM Studio, vLLM, and other GGUF-compatible runtimes. Each quantization is published as a separate model:
 
+ | Model | Quant | Size | BPW | Notes |
  |---|---|---|---|---|
+ | [brick-complexity-extractor-BF16-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-BF16-GGUF) | BF16 | 1.5 GB | 16.0 | Full precision |
+ | [brick-complexity-extractor-Q8_0-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-Q8_0-GGUF) | Q8_0 | 775 MB | 8.0 | Recommended |
+ | [brick-complexity-extractor-Q4_K_M-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-Q4_K_M-GGUF) | Q4_K_M | 494 MB | 5.5 | Best size/quality ratio |
 
+ See the [brick-complexity-extractor collection](https://huggingface.co/collections/regolo/brick-complexity-extractor-69dcc2dec2fe3b54a70b3415) for all available formats.
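The GGUF builds are prompted with the same ChatML template as the Qwen base model; the system prompt below is the one used in this model's earlier usage examples. The commented download/run lines are a sketch that assumes the repo and file names from the table above:

```shell
# ChatML classification prompt for the merged GGUF model.
SYSTEM="You are a query difficulty classifier for an LLM routing system.
Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
Respond with ONLY one word: easy, medium, or hard."
QUERY="What is the capital of France?"
PROMPT="<|im_start|>system
${SYSTEM}<|im_end|>
<|im_start|>user
Classify: ${QUERY}<|im_end|>
<|im_start|>assistant
"

# Sketch -- repo and file names assumed from the table above:
# huggingface-cli download regolo/brick-complexity-extractor-Q8_0-GGUF \
#   brick-complexity-extractor-Q8_0.gguf --local-dir ./models
# ./llama-cli -m ./models/brick-complexity-extractor-Q8_0.gguf \
#   -p "$PROMPT" -n 5 --temp 0
```

Greedy decoding (`--temp 0`) with a small token budget is enough here, since the expected completion is a single label word.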
 
  ## Integration with Brick Semantic Router
 
 
 
  ## Intended Uses
 
+ ### Primary Use Cases
+ - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30–60% compared to always-frontier routing
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
 
+ ### ⚠️ Out-of-Scope Uses
+ - **Content moderation or safety filtering** – this model classifies cognitive difficulty, not content safety
  - **Non-English queries** – trained on English data only; accuracy degrades significantly on other languages
  - **Direct use as a chatbot or generative model** – this is a classification adapter, not a generative model
 
 
  |---|---|
  | **Base model** | Qwen/Qwen3.5-0.8B |
  | **LoRA rank (r)** | 16 |
+ | **LoRA alpha (α)** | 32 |
  | **LoRA dropout** | 0.05 |
  | **Target modules** | q_proj, v_proj |
  | **Learning rate** | 2e-4 |
 
  | **Training samples** | 65,307 |
  | **Validation samples** | 7,683 |
  | **Test samples** | 3,841 |
+ | **Training hardware** | NVIDIA A100 80GB |
  | **Training time** | ~2 hours |
  | **Framework** | PyTorch + HuggingFace PEFT |
 
 
 
  | Metric | Value |
  |---|---|
+ | **Hardware** | NVIDIA A100 80GB |
  | **Training duration** | ~2 hours |
+ | **Estimated CO₂** | < 0.5 kg CO₂eq |
  | **Energy source** | Renewable (certified) |
  | **Location** | Italy (EU) |
 
 
 
  <div align="center">
 
+ **[Website](https://regolo.ai) · [Docs](https://docs.regolo.ai) · [Discord](https://discord.gg/myuuVFcfJw) · [GitHub](https://github.com/regolo-ai) · [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
 
  </div>
brick-complexity-extractor-BF16.gguf DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:6fc8392a811ff1b3dbdb7348110893bac25f912540a58ae7ff4e1cb96ceced92
- size 1516736384

brick-complexity-extractor-Q4_K_M.gguf DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:8bb38e63a7eeabddd729f2cdadfc7bd04b82aea413778e77bd4dee2b03a5489e
- size 529289088

brick-complexity-extractor-Q8_0.gguf DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1f74b88a1b7149dd9074eed60cadfc7555fca227ddbc1c71ec30a635f7cd3913
- size 811835264