Commit 7e9f789 by root · 1 Parent(s): 60a6bce

Move GGUF to dedicated repos, add GGUF section linking to collection
README.md CHANGED
@@ -7,7 +7,6 @@ tags:
  - peft
  - safetensors
  - lora
- - gguf
  - complexity-classification
  - llm-routing
  - query-difficulty
@@ -42,11 +41,11 @@ model-index:
 
  <div align="center">
 
- # Brick Complexity Extractor
 
  ### A lightweight LoRA adapter for real-time query complexity classification
 
- **[Regolo.ai](https://regolo.ai) | [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) | [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) | [API Docs](https://docs.regolo.ai)**
 
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
@@ -84,7 +83,7 @@ The adapter adds only **~2M trainable parameters** on top of the 0.8B base model
 
  ## The Problem: Why LLM Routing Needs Complexity Classification
 
- Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints...") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
 
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
 
@@ -112,26 +111,26 @@ The adapter applies LoRA to the query and value projection matrices (`q_proj`, `
 
  ```
  Qwen3.5-0.8B (frozen)
- +-- Attention Layers x 24
-     |-- q_proj <- LoRA(r=16, alpha=32)
-     +-- v_proj <- LoRA(r=16, alpha=32)
- +-- Last Hidden State
- +-- Classification Head (3 classes)
  ```
 
  ## Label Definitions
 
  | Label | Reasoning Steps | Description | Example |
  |---|---|---|---|
- | **easy** | 1-2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
- | **medium** | 3-5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
 
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
 
  ## Performance
 
- ### Classification Metrics (Test Set -- 3,841 samples)
 
  | Metric | Value |
  |---|---|
@@ -199,106 +198,17 @@ print(f"Complexity: {predicted}")
  # https://github.com/regolo-ai/brick-SR1
  ```
 
- ---
-
  ## GGUF Quantized Models
 
- Pre-built GGUF files are available for inference with [llama.cpp](https://github.com/ggml-org/llama.cpp), [Ollama](https://ollama.com), [LM Studio](https://lmstudio.ai), [vLLM](https://github.com/vllm-project/vllm), and other GGUF-compatible runtimes.
-
- These files contain the **full merged model** (base Qwen3.5-0.8B + LoRA adapter merged), so no separate adapter loading is needed.
-
- ### Available Quantizations
 
- | File | Quant | Size | BPW | Notes |
  |---|---|---|---|---|
- | `brick-complexity-extractor-BF16.gguf` | BF16 | 1.5 GB | 16.0 | Full precision, no quality loss |
- | `brick-complexity-extractor-Q8_0.gguf` | Q8_0 | 775 MB | 8.0 | Near-lossless, recommended for accuracy |
- | `brick-complexity-extractor-Q4_K_M.gguf` | Q4_K_M | 494 MB | 5.5 | Best quality/size ratio |
-
- ### Usage with llama.cpp
-
- ```bash
- # Download a quantized model
- huggingface-cli download regolo/brick-complexity-extractor \
-   brick-complexity-extractor-Q8_0.gguf \
-   --local-dir ./models
-
- # Run inference
- ./llama-cli -m ./models/brick-complexity-extractor-Q8_0.gguf \
-   -p "<|im_start|>system
- You are a query difficulty classifier for an LLM routing system.
- Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
- Respond with ONLY one word: easy, medium, or hard.<|im_end|>
- <|im_start|>user
- Classify: What is the capital of France?<|im_end|>
- <|im_start|>assistant
- " \
-   -n 5 --temp 0
- ```
-
- ### Usage with Ollama
-
- ```bash
- # Create a Modelfile
- cat > Modelfile <<EOF
- FROM ./brick-complexity-extractor-Q8_0.gguf
-
- SYSTEM """You are a query difficulty classifier for an LLM routing system.
- Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
- Respond with ONLY one word: easy, medium, or hard."""
-
- TEMPLATE """<|im_start|>system
- {{ .System }}<|im_end|>
- <|im_start|>user
- Classify: {{ .Prompt }}<|im_end|>
- <|im_start|>assistant
- """
-
- PARAMETER temperature 0
- PARAMETER num_predict 5
- EOF
-
- ollama create brick-complexity -f Modelfile
- ollama run brick-complexity "Design a distributed consensus algorithm"
- # Output: hard
- ```
-
- ### Usage with vLLM
 
- ```python
- from vllm import LLM, SamplingParams
-
- llm = LLM(
-     model="regolo/brick-complexity-extractor",
-     quantization="gguf",
-     # Point to a specific GGUF file:
-     # model="./brick-complexity-extractor-Q8_0.gguf"
- )
-
- sampling_params = SamplingParams(temperature=0, max_tokens=5)
-
- prompt = """<|im_start|>system
- You are a query difficulty classifier for an LLM routing system.
- Classify each query as easy, medium, or hard.
- Respond with ONLY one word: easy, medium, or hard.<|im_end|>
- <|im_start|>user
- Classify: Explain the rendering equation from radiometric first principles<|im_end|>
- <|im_start|>assistant
- """
-
- output = llm.generate([prompt], sampling_params)
- print(output[0].outputs[0].text.strip())
- # Output: hard
- ```
-
- ### Important Note on GGUF Inference
-
- The GGUF models use **generative text output** (the model generates the word "easy", "medium", or "hard") rather than the logit-based classification used by the LoRA adapter. This means:
-
- - **LoRA adapter (recommended for production)**: Uses logit extraction at the last token position for the three label tokens. Faster and more reliable.
- - **GGUF (recommended for local/edge deployment)**: Generates the classification label as text. Slightly lower accuracy but works with any GGUF runtime without Python dependencies.
-
- ---
 
  ## Integration with Brick Semantic Router
 
@@ -339,14 +249,14 @@ model_pools:
 
  ## Intended Uses
 
- ### Primary Use Cases
- - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30-60% compared to always-frontier routing
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
 
- ### Out-of-Scope Uses
- - **Content moderation or safety filtering** -- this model classifies cognitive difficulty, not content safety
  - **Non-English queries** – trained on English data only; accuracy degrades significantly on other languages
  - **Direct use as a chatbot or generative model** – this is a classification adapter, not a generative model
 
@@ -364,7 +274,7 @@ model_pools:
  |---|---|
  | **Base model** | Qwen/Qwen3.5-0.8B |
  | **LoRA rank (r)** | 16 |
- | **LoRA alpha** | 32 |
  | **LoRA dropout** | 0.05 |
  | **Target modules** | q_proj, v_proj |
  | **Learning rate** | 2e-4 |
@@ -376,7 +286,7 @@ model_pools:
  | **Training samples** | 65,307 |
  | **Validation samples** | 7,683 |
  | **Test samples** | 3,841 |
- | **Training hardware** | 1x NVIDIA A100 80GB |
  | **Training time** | ~2 hours |
  | **Framework** | PyTorch + HuggingFace PEFT |
 
@@ -386,9 +296,9 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
 
  | Metric | Value |
  |---|---|
- | **Hardware** | 1x NVIDIA A100 80GB |
  | **Training duration** | ~2 hours |
- | **Estimated CO2** | < 0.5 kg CO2eq |
  | **Energy source** | Renewable (certified) |
  | **Location** | Italy (EU) |
 
@@ -411,6 +321,6 @@ Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastr
 
  <div align="center">
 
- **[Website](https://regolo.ai) | [Docs](https://docs.regolo.ai) | [Discord](https://discord.gg/myuuVFcfJw) | [GitHub](https://github.com/regolo-ai) | [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
 
  </div>
 
 
  <div align="center">
 
+ # 🧱 Brick Complexity Extractor
 
  ### A lightweight LoRA adapter for real-time query complexity classification
 
+ **[Regolo.ai](https://regolo.ai) · [Dataset](https://huggingface.co/datasets/regolo/brick-complexity-extractor) · [Brick SR1 on GitHub](https://github.com/regolo-ai/brick-SR1) · [API Docs](https://docs.regolo.ai)**
 
  [![License: CC BY-NC 4.0](https://img.shields.io/badge/License-CC%20BY--NC%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by-nc/4.0/)
  [![Base Model](https://img.shields.io/badge/Base-Qwen3.5--0.8B-blue)](https://huggingface.co/Qwen/Qwen3.5-0.8B)
 
 
  ## The Problem: Why LLM Routing Needs Complexity Classification
 
+ Not all prompts are equal. A factual recall question ("What is the capital of France?") and a multi-step reasoning task ("Derive the optimal portfolio allocation given these constraints") require fundamentally different compute budgets. Sending every query to a frontier reasoning model wastes resources; sending hard queries to a lightweight model degrades quality.
 
  **Brick** solves this by routing each query to the right model tier in real time. Complexity classification is one of several routing signals (alongside keyword matching, domain detection, and reasoning-depth estimation) that Brick uses to make sub-50ms routing decisions.
 
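The tier-routing idea described above can be sketched in a few lines. This is a hypothetical illustration only: the pool names are invented here, and Brick's real router combines the complexity label with the other signals listed.

```python
# Hypothetical sketch of complexity-based tier routing.
# Pool names are illustrative, not Brick's actual configuration.
POOL_BY_LABEL = {
    "easy": "small-pool",       # cheap, fast models for factual recall
    "medium": "mid-pool",       # general-purpose models
    "hard": "frontier-pool",    # reasoning-heavy frontier models
}

def route(label: str) -> str:
    """Map a predicted complexity label to a model pool, defaulting to mid."""
    return POOL_BY_LABEL.get(label, "mid-pool")
```

Defaulting unknown labels to the middle tier is a conservative choice: it bounds both the quality loss of under-routing and the cost of over-routing when the classifier output is malformed.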
 
 
  ```
  Qwen3.5-0.8B (frozen)
+ └── Attention Layers × 24
+      ├── q_proj ← LoRA(r=16, α=32)
+      └── v_proj ← LoRA(r=16, α=32)
+ └── Last Hidden State
+ └── Classification Head (3 classes)
  ```
 
  ## Label Definitions
 
  | Label | Reasoning Steps | Description | Example |
  |---|---|---|---|
+ | **easy** | 1–2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
+ | **medium** | 3–5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
  | **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
 
  Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
 
  ## Performance
 
+ ### Classification Metrics (Test Set – 3,841 samples)
 
  | Metric | Value |
  |---|---|
 
  # https://github.com/regolo-ai/brick-SR1
  ```
 
  ## GGUF Quantized Models
 
+ Pre-built GGUF files are available for inference with llama.cpp, Ollama, LM Studio, vLLM, and other GGUF-compatible runtimes. Each quantization is published as a separate model:
 
+ | Model | Quant | Size | BPW | Notes |
  |---|---|---|---|---|
+ | [brick-complexity-extractor-BF16-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-BF16-GGUF) | BF16 | 1.5 GB | 16.0 | Full precision |
+ | [brick-complexity-extractor-Q8_0-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-Q8_0-GGUF) | Q8_0 | 775 MB | 8.0 | Recommended |
+ | [brick-complexity-extractor-Q4_K_M-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-Q4_K_M-GGUF) | Q4_K_M | 494 MB | 5.5 | Best size/quality ratio |
 
+ See the [brick-complexity-extractor collection](https://huggingface.co/collections/regolo/brick-complexity-extractor-69dcc2dec2fe3b54a70b3415) for all available formats.
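The GGUF builds are prompted with the same ChatML template as the Qwen base model; the system prompt below is the one used in this model's earlier usage examples. The commented download/run lines are a sketch that assumes the repo and file names from the table above:

```shell
# ChatML classification prompt for the merged GGUF model.
SYSTEM="You are a query difficulty classifier for an LLM routing system.
Classify each query as easy, medium, or hard based on the cognitive depth and domain expertise required to answer correctly.
Respond with ONLY one word: easy, medium, or hard."
QUERY="What is the capital of France?"
PROMPT="<|im_start|>system
${SYSTEM}<|im_end|>
<|im_start|>user
Classify: ${QUERY}<|im_end|>
<|im_start|>assistant
"

# Sketch -- repo and file names assumed from the table above:
# huggingface-cli download regolo/brick-complexity-extractor-Q8_0-GGUF \
#   brick-complexity-extractor-Q8_0.gguf --local-dir ./models
# ./llama-cli -m ./models/brick-complexity-extractor-Q8_0.gguf \
#   -p "$PROMPT" -n 5 --temp 0
```

Greedy decoding (`--temp 0`) with a small token budget is enough here, since the expected completion is a single label word.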
 
  ## Integration with Brick Semantic Router
 
 
 
  ## Intended Uses
 
+ ### Primary Use Cases
+ - **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30–60% compared to always-frontier routing
  - **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
  - **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
  - **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
 
+ ### ⚠️ Out-of-Scope Uses
+ - **Content moderation or safety filtering** – this model classifies cognitive difficulty, not content safety
  - **Non-English queries** – trained on English data only; accuracy degrades significantly on other languages
  - **Direct use as a chatbot or generative model** – this is a classification adapter, not a generative model
 
 
  |---|---|
  | **Base model** | Qwen/Qwen3.5-0.8B |
  | **LoRA rank (r)** | 16 |
+ | **LoRA alpha (α)** | 32 |
  | **LoRA dropout** | 0.05 |
  | **Target modules** | q_proj, v_proj |
  | **Learning rate** | 2e-4 |
 
  | **Training samples** | 65,307 |
  | **Validation samples** | 7,683 |
  | **Test samples** | 3,841 |
+ | **Training hardware** | NVIDIA A100 80GB |
  | **Training time** | ~2 hours |
  | **Framework** | PyTorch + HuggingFace PEFT |
 
 
 
  | Metric | Value |
  |---|---|
+ | **Hardware** | NVIDIA A100 80GB |
  | **Training duration** | ~2 hours |
+ | **Estimated CO₂** | < 0.5 kg CO₂eq |
  | **Energy source** | Renewable (certified) |
  | **Location** | Italy (EU) |
 
 
 
  <div align="center">
 
+ **[Website](https://regolo.ai) · [Docs](https://docs.regolo.ai) · [Discord](https://discord.gg/myuuVFcfJw) · [GitHub](https://github.com/regolo-ai) · [LinkedIn](https://www.linkedin.com/company/regolo-ai/)**
 
  </div>
brick-complexity-extractor-BF16.gguf DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:6fc8392a811ff1b3dbdb7348110893bac25f912540a58ae7ff4e1cb96ceced92
- size 1516736384

brick-complexity-extractor-Q4_K_M.gguf DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:8bb38e63a7eeabddd729f2cdadfc7bd04b82aea413778e77bd4dee2b03a5489e
- size 529289088

brick-complexity-extractor-Q8_0.gguf DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:1f74b88a1b7149dd9074eed60cadfc7555fca227ddbc1c71ec30a635f7cd3913
- size 811835264