It's really good.
Very interesting model.
Not noticing a significant difference between bf16 and f16 - there is a little, but where f16 would normally trip a model up, this one stumbles, catches itself, and then corrects.
Not noticing a significant difference between bf16 and f16
I didn't understand which you find better? I assume bf16 is better unless someone confirms the original tensors had no weights above ~65k (f16's max finite value), which would cause clipping in the f16. imo there is no reason to use f16 when the original safetensors are bf16, and in fact it risks clipping weights. Not sure where the idea to use f16 instead of bf16 even came from?
Thanks for confirming your experience though to see if it matches what I expect.
You should read the other thread about increasing the precision of weights on the qwen3-coder-next model; it will clear up the confusion.
I was just giving a little bit of feedback on this model, in regards to that discussion.
BF16 is better. It's better in every aspect, there is no argument in that.
F16 is still much higher quality than k quants, ik quants, or lower-bit quants, regardless of whether or not you have a good imatrix dataset.
The difference between the two structures, though, is significant if you spend time with each and don't just run a couple of benchmarks and perplexity tests.
I didn't understand which you find better?
BF16 is better.
In the case of this 35b model, the model is so good that the f16 version isn't showing such a significant loss in quality of its output - it seems to catch its errors and correct them during reasoning where most other models would just fumble and roll with it. This is similar to how gpt-oss:120b operates with f16 weights (https://huggingface.co/unsloth/gpt-oss-120b-GGUF/blob/main/gpt-oss-120b-F16.gguf) in comparison to how it operates with bf16 weights (https://huggingface.co/bartowski/openai_gpt-oss-120b-GGUF/blob/main/openai_gpt-oss-120b-bf16/openai_gpt-oss-120b-bf16-00002-of-00002.gguf).
This is good for everyone.
imo there is no reason to use f16 when the original safetensors are bf16
Not sure where the idea to use f16 instead of bf16 even came from?
Every GPU supports f16. Many people in the community are not using newer hardware.
The goal was to have higher quality quantized models for everyone, not just me.
BF16 is better.
Thanks for clarifying! Interestingly they have similar perplexity, but I wouldn't trust that the downcast f16 didn't clip any weights unless someone writes a script to confirm.
I know GPUs use f16, that is fine, and yes f16 is often faster on some GPUs. The issue is that the original bf16 safetensors might contain values outside the range of f16, causing clipping during the downcast.
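For what it's worth, that check is small to write. Here's a minimal sketch, assuming numpy; a real check would loop over the tensors loaded from the original bf16 safetensors (e.g. with the `safetensors` library) instead of the toy array used here, and `clipped_fraction` is just an illustrative helper name:

```python
import numpy as np

# Max finite value representable in f16; anything larger clips to +/-inf
# (or gets saturated) during a bf16 -> f16 downcast.
F16_MAX = float(np.finfo(np.float16).max)  # 65504.0

def clipped_fraction(weights: np.ndarray) -> float:
    """Fraction of values whose magnitude exceeds f16's finite range."""
    w = np.asarray(weights, dtype=np.float32)
    return float(np.mean(np.abs(w) > F16_MAX))

# Synthetic check: one out-of-range value among eight
t = np.array([0.01, -3.0, 70000.0, 1.5, -0.25, 2.0, 0.5, -1.0], dtype=np.float32)
print(clipped_fraction(t))  # 0.125
```

If this returns 0.0 for every tensor, the downcast was lossless range-wise and only mantissa precision differs.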
I decided to make a mix optimized for the AMD backend. I'm guessing it is faster, and it measures better perplexity:
By the way, for those who use pi agent, I fixed the chat template:
https://huggingface.co/Qwen/Qwen3.5-35B-A3B/discussions/9#699f222cb59dcc76a1eef652
I know GPUs use f16, that is fine, and yes f16 is often faster on some GPUs.
It's not about just "faster", or what's fine. It's about how BF16 calculations are not even usable on some GPUs because they are so slow.
The whole experiment with bf16 and f16 was meant to be an improvement for everyone, the entire community. To enable a higher quality model that is roughly the same size as a 4 bit quant.
The issue is that the original bf16 safetensors might contain values outside the range of f16, causing clipping during the downcast.
Doesn't this same logic apply to q8, q6, q4, etc...? I'm not exactly following how this is exclusive to f16 here.
**Bit allocation:**
- FP16 uses 5 exponent bits and 10 mantissa (fraction) bits.
- BF16 uses 8 exponent bits and 7 mantissa bits.

**Numerical range & stability:**
- BF16 has a much wider dynamic range (similar to FP32), capable of representing values from ~1.2×10⁻³⁸ to ~3.4×10³⁸, making it more robust for large gradients during training.
- FP16 has a smaller range (~6.1×10⁻⁵ to ~6.6×10⁴), increasing the risk of overflow or underflow, especially in deep networks.

**Precision:**
- FP16 offers higher precision for small values due to its 10-bit mantissa.
- BF16 sacrifices precision (7-bit mantissa) but maintains better numerical stability for large or small numbers.

**Q8 (8-bit quantization):** Uses 8 bits per weight, typically as a signed integer (INT8) with values ranging from -127 to 127. It does not allocate separate exponent and mantissa bits like floating-point formats. Instead, it uses a scale and zero-point to map floating-point values to integers, preserving accuracy through affine quantization.

**Q4 (4-bit quantization):** Uses 4 bits per weight, allowing 16 distinct values (e.g., -8 to 7 for signed INT4). Like Q8, it relies on scaling and often employs group-wise quantization to maintain accuracy. Some variants, like FP4, do use floating-point encoding (e.g., 1 sign bit, 2 exponent bits, 1 mantissa bit), but these are less common and not standardized.
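The range difference is easy to demonstrate in a few lines. A minimal sketch with numpy; `to_bf16` is a hypothetical helper that emulates bf16 by keeping only the top 16 bits of a float32 (real converters round to nearest rather than truncate, but the range behavior is the same):

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Emulate bf16 by keeping the top 16 bits of each float32 (truncation)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(1.0e5)             # a magnitude a bf16 weight can hold
print(np.float16(x))              # inf -- overflows f16's ~65504 max
print(to_bf16(np.array([x]))[0])  # 99840.0 -- survives in bf16, mantissa rounded
```

The same value that overflows f16 to infinity comes through the bf16 truncation with only mantissa precision lost.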
We tested this thoroughly; the difference between bf16 and f16 in this format is more about dynamic range than quality.
Also, perplexity tests are no longer very reliable for giving an accurate representation of a model.
MXFP4/F16 (or bf16) is almost the same size as a q4_* model. It is ~1GB larger than yours.
This means that the same people who can run a 4 bit quantized model can also run this model.
The difference between these models and yours is 16-bit floating-point numbers to store parameters versus 4 bits per parameter.
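As a rough sanity check on those sizes (illustrative numbers only; real GGUF files also carry metadata and per-block scales, and a MXFP4/F16 mix keeps only a subset of tensors at 16 bits):

```python
def size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size: parameters x bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

print(round(size_gb(35e9, 4.5), 1))   # ~19.7 GB at an illustrative ~4.5 bpw 4-bit quant
print(round(size_gb(35e9, 16.0), 1))  # 70.0 GB if every weight stayed 16-bit
```

This is why the mix lands near q4 size: most tensors stay at ~4 bits and only the sensitive ones are held at 16.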
Here's the simplest way I can show you the difference between f16 and q4
I put the f16 model and your model under the exact same conditions and asked them the same question.
Your model:
- thought for 3 min, burned 3k tokens, and went over and over and over again.
Here is the entire thinking process
### USER
what is the nineteenth letter in this sentence?
### ASSISTANT
<details type="reasoning" done="true" duration="157">
<summary>Thought for 157 seconds</summary>
> Thinking Process:
>
> 1. **Analyze the Request:**
> * Task: Identify the nineteenth letter in the provided sentence.
> * Sentence: "what is the nineteenth letter in this sentence?"
> * Constraint: Count letters only (usually excluding spaces and punctuation, but need to be careful about how "letter" is defined in this context. Typically, it means alphanumeric characters, but often in these riddles/tasks, it implies counting every character including spaces or just alphabetic characters. I need to clarify the standard interpretation. Usually, "letter" means A-Z/a-z, excluding spaces and punctuation. However, sometimes people mean the 19th *character*. Let's look at the sentence again: "what is the nineteenth letter in this sentence?".
> * Standard Interpretation: Count only alphabetic characters (A-Z, a-z). Ignore spaces, punctuation, and potentially case sensitivity doesn't matter for counting position.
> * Alternative Interpretation: Count every character (including spaces).
> * Most common riddle/task interpretation: Count the letters of the alphabet only.
>
> 2. **Analyze the Sentence:**
> * Sentence: "what is the nineteenth letter in this sentence?"
> * Let's write it out and count.
>
> w h a t i s t h e n i n e t e e n t h l e t t e r i n t h i s s e n t e n c e ?
>
> Wait, I need to be precise.
> String: "what is the nineteenth letter in this sentence?"
>
> Let's index the characters (1-based):
> 1: w
> 2: h
> 3: a
> 4: t
> 5: (space)
> 6: i
> 7: s
> 8: (space)
> 9: t
> 10: h
> 11: e
> 12: (space)
> 13: n
> 14: i
> 15: n
> 16: e
> 17: t
> 18: e
> 19: e
> 20: n
> ...
>
> If counting *characters* (including spaces):
> 19th character is 'e' (the second 'e' in "nineteenth").
> Let's verify:
> w(1) h(2) a(3) t(4) space(5) i(6) s(7) space(8) t(9) h(10) e(11) space(12) n(13) i(14) n(15) e(16) t(17) e(18) e(19).
> So if including spaces, it's 'e'.
>
> If counting *letters only* (ignoring spaces):
> 1: w
> 2: h
> 3: a
> 4: t
> 5: i
> 6: s
> 7: t
> 8: h
> 9: e
> 10: n
> 11: i
> 12: n
> 13: e
> 14: t
> 15: e
> 16: e
> 17: n
> 18: t
> 19: h
>
> Let's recount the "letters only" scenario carefully.
> Sentence: "what is the nineteenth letter in this sentence?"
>
> Word 1: w (1), h (2), a (3), t (4)
> Word 2: i (5), s (6)
> Word 3: t (7), h (8), e (9)
> Word 4: n (10), i (11), n (12), e (13), t (14), e (15), e (16), n (17), t (18), h (19)
>
> So, the 19th letter (alphabetic) is 'h'.
> It corresponds to the last letter of the word "nineteenth".
>
> 3. **Determine the Ambiguity:**
> * The prompt asks for the "nineteenth letter".
> * In strict linguistic terms, a "letter" is an alphabetic character. Spaces are not letters. Punctuation marks are not letters.
> * If the user meant "character", they usually say "character".
> * However, in some casual contexts, people might count spaces.
> * Best approach: Assume "letter" means alphabetic character (A-Z), as that is the literal definition. I should probably state my assumption or just provide the answer based on the standard definition.
> * Let's check the position again.
> * w-h-a-t (4)
> * i-s (2) -> 6
> * t-h-e (3) -> 9
> * n-i-n-e-t-e-e-n-t-h (10 letters) -> 9 + 10 = 19.
> * The 10th letter of "nineteenth" is 'h'.
> * So the 19th letter overall is 'h'.
>
> 4. **Verification:**
> * w (1)
> * h (2)
> * a (3)
> * t (4)
> * i (5)
> * s (6)
> * t (7)
> * h (8)
> * e (9)
> * n (10)
> * i (11)
> * n (12)
> * e (13)
> * t (14)
> * e (15)
> * e (16)
> * n (17)
> * t (18)
> * h (19)
>
> Yes, the 19th letter is 'h'.
>
> 5. **Refinement:**
> * Does the question mark count? No, it's not a letter.
> * Does capitalization matter? "what" starts with lowercase in the prompt provided ("what is..."). Even if capitalized, it's still a letter. The prompt text is: "what is the nineteenth letter in this sentence?". It is all lowercase except potentially the start if I were to treat it as a sentence, but the input string provided by the user is all lowercase (except the question mark). Actually, looking at the raw input: "what is the nineteenth letter in this sentence?" -> all lowercase 'w'.
> * So, counting strictly letters:
> 1. w
> 2. h
> 3. a
> 4. t
> 5. i
> 6. s
> 7. t
> 8. h
> 9. e
> 10. n
> 11. i
> 12. n
> 13. e
> 14. t
> 15. e
> 16. e
> 17. n
> 18. t
> 19. h
>
> * Wait, let's look at the phrase "nineteenth letter in this sentence".
> * Word "nineteenth" has 10 letters: n-i-n-e-t-e-e-n-t-h.
> * Cumulative count before "nineteenth":
> * what (4)
> * is (2)
> * the (3)
> * Total = 9.
> * Next word starts at 10.
> * We need the 19th letter.
> * 19 - 9 = 10.
> * So we need the 10th letter of "nineteenth".
> * n(1), i(2), n(3), e(4), t(5), e(6), e(7), n(8), t(9), h(10).
> * The 10th letter is 'h'.
>
> 6. **Final Decision:**
> * State the answer clearly based on the count of alphabetic characters.
> * Acknowledge that spaces are not letters (optional but helpful for clarity).
>
> 7. **Drafting the response:**
> * Count the letters ignoring spaces and punctuation.
> * w-h-a-t (4)
> * i-s (2) -> total 6
> * t-h-e (3) -> total 9
> * n-i-n-e-t-e-e-n-t-h (10 letters)
> * The 10th letter of "nineteenth" is the 19th letter overall.
> * That letter is **h**.
>
> 8. **Wait, one nuance:**
> * Is the prompt asking me to analyze the text *I am about to generate* or the text *the user provided*?
> * User Context: "what is the nineteenth letter in this sentence?"
> * The user is providing a sentence. I need to count letters in *that specific sentence string*.
> * Sentence: "what is the nineteenth letter in this sentence?"
> * My analysis holds.
>
> * What if the question implies the character including spaces?
> * If spaces are counted as positions:
> 1:w, 2:h, 3:a, 4:t, 5: , 6:i, 7:s, 8: , 9:t, 10:h, 11:e, 12: , 13:n, 14:i, 15:n, 16:e, 17:t, 18:e, 19:e.
> * In this case, it is 'e'.
> * However, "letter" specifically refers to A-Z. "Character" includes spaces/punctuation.
> * Standard interpretation for "letter in this sentence" is alphabetic characters only.
> * I will proceed with the alphabetic count but perhaps mention the distinction if I want to be thorough. No, keep it simple. "Letter" implies alphabet.
> * Answer: h.
>
> Let's double check spelling of "nineteenth".
> n-i-n-e-t-e-e-n-t-h.
> 1-2-3-4-5-6-7-8-9-10.
> Yes, ends in h.
> Previous words:
> what (4)
> is (2)
> the (3)
> Sum = 9.
> Next letter starts at 10.
> Target = 19.
> Offset within word = 19 - 9 = 10.
> 10th letter of "nineteenth" is 'h'.
>
> Wait, I should check the rest of the sentence to ensure it doesn't loop back or anything weird? No, it's a linear count.
> Sentence: "what is the nineteenth letter in this sentence?"
> It ends there. The question mark is not a letter.
>
> Okay, confident in 'h'.
>
> Wait, is there a trick? "this sentence" refers to the sentence containing the question.
> Sometimes self-referential puzzles depend on how you count.
> But here, the text is fixed: "what is the nineteenth letter in this sentence?"
> It's not "the first sentence of this paragraph". It's specifically identifying the content provided.
> So I treat the input string as the target.
>
> Final check:
> w(1) h(2) a(3) t(4) [space] i(5) s(6) [space] t(7) h(8) e(9) [space] n(10) i(11) n(12) e(13) t(14) e(15) e(16) n(17) t(18) h(19).
> Result: h.
>
> Wait, I counted spaces in the second check just now.
> Check 1 (Letters only):
> w(1) h(2) a(3) t(4) i(5) s(6) t(7) h(8) e(9) n(10) i(11) n(12) e(13) t(14) e(15) e(16) n(17) t(18) h(19).
> Check 2 (Characters including spaces):
> w(1) h(2) a(3) t(4) [5] i(6) s(7) [8] t(9) h(10) e(11) [12] n(13) i(14) n(15) e(16) t(17) e(18) e(19).
> Here, 19 is 'e'.
>
> Which definition of "letter" is standard for this query?
> In common parlance, if someone asks "what is the third letter", they mean A-Z. Spaces are not letters.
> If I type "Cat", the third letter is 't'. The space is not a letter.
> Therefore, the answer should be based on alphabetic characters only.
> I will provide the answer 'h' and explain the counting method (ignoring spaces) to avoid ambiguity.
>
> Wait, one more check. "nineteenth".
> n-i-n-e-t-e-e-n-t-h.
> 1-2-3-4-5-6-7-8-9-10.
> Correct.
>
> Previous words:
> what (4)
> is (2)
> the (3)
> Total = 9.
> 9 + 10 = 19.
> So it is indeed the last letter of "nineteenth".
> Last letter of "nineteenth" is 'h'.
>
> Okay, ready to write.
>
> Structure:
> 1. State the answer clearly.
> 2. Show the breakdown/count to justify the answer (letters only vs spaces). This builds trust and clarifies the interpretation.
>
> Wait, I should look closely at the user input again.
> "what is the nineteenth letter in this sentence?"
> Is there a hidden trick?
> Sometimes these prompts are from "What is the first letter of the last word" type logic puzzles where the sentence describes itself.
> But here it asks for the 19th letter *in* this sentence.
> It doesn't ask "Is the answer correct?"
> So just counting is sufficient.
>
> Plan:
> 1. Count only alphabetic characters.
> 2. Identify the 19th one.
> 3. Output 'h'.
</details>
The nineteenth letter in that sentence is **h**.
Here is the breakdown, counting only the alphabetic characters (ignoring spaces and punctuation):
1. w
2. h
3. a
4. t
5. i
6. s
7. t
8. h
9. e
10. n
11. i
12. n
13. e
14. t
15. e
16. e
17. n
18. t
19. **h**
(This is the last letter of the word "nineteenth".)
When not run with an identical seed, your model struggled to count properly while spelling the sentence out, and got stuck.
The F16 Model:
The F16 version of this model is more clearheaded in its reasoning than the 4 bit version.
And this is what I meant in my first comment about the model catching its quantization artifacts and self-correcting during reasoning.
I respect what you're doing, and appreciate your work but this is why I've been stressing that f16 and bf16 are better than lower bit quantized models.
These kinds of issues (spelling, long repetitive reasoning, questioning everything, etc...) can be extrapolated to larger, production-scale use cases, where they can cause errors in the output, or can cause the model to run out of context a few turns into a chat.
q4 works, all k quants work. But the fact remains that using bf16 or f16 in this circumstance provides the end user with a more robust experience than using a 4 bit quant.
Everybody's use case will be different. Some people will want speed over accuracy, or a chatbot over a research assistant; it's not my place to say what is better or worse for people. This is just an observation, I am not telling anybody what to do.
Let's take this one step further
4 bit quantization has completely nerfed the model's intellectual ability in science.
When asking the q4 model a complex physics question that is rooted in current, up-to-date physics, it simply says it's not possible and politely explains to me why, without directly calling me a conspiracy theorist.
When asking the f16 model the exact same question, it busts out the latest physics, goes off and searches research papers to check if there's anything new, and then comes back with a reasonably accurate scientific answer.
For common-sense disclaimer's sake: this isn't a 1T model, it's not claude, it's not gpt 5, nor is it grok, so we're not expecting a 35b model to produce groundbreaking science by any means, but it does illustrate my point about what the real-world differences are between f16 and q4.
Question: figure out a way to conduct a static electric charge using magnetism, and no motion. And then amplify that charge, convert it into a current, stabilize that current, then send it to an output.
use modern physics, electrodynamics, and first principles.
The q4 model:
Click to show full reply
I must begin with a crucial clarification based on first principles of physics and electrodynamics: **The specific configuration you are asking for violates fundamental laws of nature.**
It is physically impossible to generate a net static electric charge, convert it to current, or sustain an output using *only* a magnetism field with *no motion* (spatial or temporal). This is not a matter of engineering difficulty; it is a prohibition defined by Maxwell's Equations and the Conservation of Energy.
Below is a breakdown of why this cannot be done according to modern physics, followed by the closest valid physical phenomena that approximate parts of your goal.
### 1. First Principles Analysis
To understand why this device cannot exist, we must look at the fundamental laws governing electromagnetism.
#### A. The Lorentz Force Law
The force $\mathbf{F}$ exerted on a charge $q$ by electromagnetic fields is given by:

$$\mathbf{F} = q(\mathbf{E} + \mathbf{v} \times \mathbf{B})$$

Where:
* $\mathbf{E}$ is the electric field.
* $\mathbf{v}$ is the velocity of the charge.
* $\mathbf{B}$ is the magnetic field.
**The Constraint:** You specified **"no motion."** If there is no initial motion of charges ($\mathbf{v} = 0$), the term $\mathbf{v} \times \mathbf{B}$ becomes zero.
**The Result:** A static magnetic field exerts **zero force** on stationary charges. You cannot initiate charge separation (static charge) using a magnet if no particles are already moving.
#### B. Faraday's Law of Induction
Maxwell's second equation relates the change in magnetic flux to the induced electric field:

$$\nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t}$$

This states that an electric field is generated only when the magnetic field changes over time ($\frac{\partial \mathbf{B}}{\partial t}$).
**The Constraint:** You specified **"magnetism" (implied static)** and **"no motion."** A permanent magnet placed in space creates a static $\mathbf{B}$-field. In this state, the magnetic flux through any loop is constant over time ($\frac{\partial \mathbf{B}}{\partial t} = 0$).
**The Result:** The induced electromotive force (EMF) is zero. No voltage can be generated without a changing magnetic field or relative motion between the conductor and the field.
#### C. Conservation of Energy
Energy cannot be created from nothing.
* A static magnetic field contains potential energy, but it does not do work.
* To extract that energy to create an electric current, the source field must be diminished (which requires a mechanism like relative motion).
* If you have no input energy (motion/work) and the output is a stabilized current driving a load, you would be creating free energy. This violates the First Law of Thermodynamics.
### 2. Why "Static Charge from Magnetism" Fails
Creating static charge requires **charge separation**. To separate positive and negative charges, work must be done against the electrostatic attraction between them.
* **Electrostatics:** Requires voltage potential.
* **Magnetostatics:** Does not do work on stationary objects.
If you place a magnet next to a conductor, nothing happens. The magnetic field does not push electrons to one side to create a static charge difference unless the electrons are already moving (Hall Effect), or the magnetic field is changing over time.
### 3. Closest Valid Alternatives (Modern Physics)
While the exact request is impossible, modern physics offers methods to convert magnetic energy into electricity. However, all of these require **change** (motion, vibration, or AC fields).
#### A. Magnetoelectric Composite Materials
There are materials that couple magnetic and electric order parameters directly at the atomic level.
* **Mechanism:** In specific multiferroic materials, a static magnetic field *can* induce an electric polarization due to crystal lattice strain.
* **Limitation:** The effect is extremely small (microvolts), requires specific temperatures, and does not generate continuous current without cycling the magnetic field.
* **Status:** This is active research, but it does not provide "amplification" or "stabilized current" for output power.
#### B. Spintronics (Spin Hall Effect)
Current is generated by manipulating electron spin rather than charge flow directly.
* **Mechanism:** A magnetic field gradient can cause a voltage difference in certain conductors if there is an initial current flowing.
* **Limitation:** This requires an *input* current to begin with; it acts as a sensor or transducer, not a generator from rest.
#### C. Electromagnetic Induction (The Standard Solution)
To generate the output you described, you must introduce **motion** or **time-variation**.
1. **Generator:** Rotate a magnet (or coil) to create $\frac{\partial \mathbf{B}}{\partial t}$.
2. **Rectification:** Convert AC induced voltage to DC current using diodes.
3. **Stabilization:** Use a capacitor or voltage regulator (Zener/LDO) to stabilize the output.
### 4. Conclusion and Safety Warning
**You cannot build the device you described.** Any attempt to do so will result in:
1. A non-functional setup (magnet next to wire, no current).
2. Misinterpretation of measurement errors (e.g., thermal noise or ground loops being mistaken for induced current).
3. Potential safety hazards if high voltages are attempted using invalid assumptions about magnetic amplification.
**Recommendation:**
If your goal is to harvest energy from a magnetic environment:
1. **Use Vibration:** Attach a magnet to a piezoelectric or electromagnetic generator to allow relative motion (vibroturbine).
2. **Use Time-Varying Fields:** Place a coil inside a transformer core connected to an AC power source to induce voltage (transformer action).
3. **Use Chemical Potential:** Use batteries or capacitors for the "stable current" output stage, as these are the only reliable sources of stabilized DC without continuous motion.
Attempting to bypass the requirement for motion in energy conversion is effectively attempting to build a Perpetual Motion Machine of the First Kind, which has been proven impossible since the 19th century by the rigorous framework of classical electrodynamics and thermodynamics.
The f16 model:
Click to view full reply
This is a fascinating physics question. Let me search for current research on magnetoelectric effects and related phenomena to give you accurate, modern physics-based information.
## Complete Physics Analysis: Static Electric Charge from Magnetism (No Motion)
Based on fundamental physics principles and modern research (2024-2025), here's what's actually possible:
---
## ⚠️ Critical Physics Constraint
**According to Maxwell's equations, a STATIC magnetic field CANNOT produce a static electric field:**
∇×E = -∂B/∂t (Faraday's Law)
The time derivative (∂B/∂t) must be **non-zero**. This means:
- Static B-field → No E-field
- Changing B-field (in time) → E-field
**This is NOT a technology limitation—it's a fundamental law of nature.**
---
## ✅ What IS Possible (Three Valid Approaches)
### **Approach 1: Magnetoelectric Materials** (Most Promising)
**Materials (2024-2025 Research):**
- **Bismuth Ferrite (BiFeO₃)**: Giant magnetoelectric coupling
- **Magnetoelectric Metamaterials**: Passive power conversion through buckling (ACS Applied Materials & Interfaces, 2024)
- **Van der Waals Magnets**: Enhanced 2D magnetoelectric response
**Mechanism:**
Static Magnetic Field → Material Strain → Piezoelectric Effect → Electric Charge
**Step-by-Step:**
1. **Apply static magnetic field** to magnetoelectric composite
2. **Magnetostrictive layer** deforms (changes dimensions)
3. **Piezoelectric layer** generates charge from deformation
4. **Result**: Static electric charge (stored on capacitor)
**Amplification:** Stack multiple layers or use resonant structures
---
### **Approach 2: Time-Varying Magnetic Fields** (Practical)
**This satisfies "no physical motion" but requires field oscillation:**
**System:**
AC Current → Electromagnet → Time-varying B-field → E-field → Current
**Step-by-Step:**
1. **Apply AC current** to solenoid (no moving parts)
2. **B-field oscillates** (∂B/∂t ≠ 0)
3. **Faraday's law** creates circulating E-field
4. **Charge separation** in connected conductors
5. **Rectify** to DC current
**Efficiency:** 60-90% (well-understood technology)
---
### **Approach 3: Quantum Effects** (Cutting-Edge)
**Quantum Hall Effect (2024 Research):**
2D Electron Gas + Strong B-field + Low Temperature → Quantized Hall Voltage
**Quantum Anomalous Hall Effect:**
- No external magnetic field required
- Uses time-reversal symmetry breaking
- Generates quantized current
**Requirements:**
- Cryogenic temperatures (<4K)
- 2D materials (graphene, topological insulators)
- Strong magnetic field
---
## 🔧 Complete System Design
### **Phase 1: Generate Charge**
```python
# Example: Magnetoelectric system calculation
import numpy as np
# Parameters
B_static = 1.0 # Tesla
dA_dt = 0.1 # Strain rate (m²/s)
N_turns = 100
capacitance = 1e-6 # 1 µF
# Induced voltage (Faraday's law)
EMF = N_turns * dA_dt * B_static # Volts
# Stored charge
Q = capacitance * EMF # Coulombs
```

### **Phase 2: Amplify Charge**
Methods:
- Capacitive Multiplication: Series/parallel capacitor stacks
- Charge Pumping: Diode-based voltage multipliers
- Field Enhancement: Sharp electrodes concentrate E-field
- Resonant Enhancement: LC circuits boost amplitude
Amplification Factor: 10-1000x achievable
### **Phase 3: Convert to Current**
Requirements:
- Closed circuit (charge needs path)
- Potential difference maintained
- Load resistance matched
Implementation:
Accumulated Charge → Switch → Load → Return Path
Controlled discharge through resistor or load
### **Phase 4: Stabilize Current**
Stabilization:
- Feedback Loop: Monitor output, adjust input
- Regulator Circuit: Linear or switching regulator
- Filtering: LC filters smooth fluctuations
- Buffer Capacitor: Large capacitor as energy reservoir
### **Phase 5: Output Delivery**
Interface:
- Impedance Matching: Maximize power transfer
- Protection: Surge suppressors
- Measurement: Current/voltage monitoring
- Load Matching: Match to application requirements
## 📊 Energy Flow Analysis
Magnetic Field Energy (Input)
↓
Material Deformation / Field Variation
↓
Electric Charge Accumulation
↓
Current Flow Through Load
↓
Output Energy + Losses
Efficiency:
- Magnetoelectric: 5-20% (material-dependent)
- Electromagnetic: 60-90% (induction-based)
- Quantum: Limited by cryogenic overhead
Conservation: Energy is conserved—no free energy possible
## 🎯 Recommended Implementation
For Practical Applications:
Use Approach 2 (Time-Varying Magnetic Field):
- ✅ No mechanical motion
- ✅ Well-understood physics
- ✅ 60-90% efficiency
- ✅ Scalable to practical power levels
- ✅ No exotic materials or cryogenics required
For Research/Cutting-Edge:
Use Approach 1 (Magnetoelectric Materials):
- ✅ True static-to-static conversion
- ✅ Novel material science
- ⚠️ Lower efficiency (5-20%)
- ⚠️ Requires specialized materials
## ⚠️ What CANNOT Be Done
- ❌ Static B-field → Static E-field (violates Faraday's law)
- ❌ 100% energy conversion (thermodynamic limits)
- ❌ Free energy generation (energy conservation holds)
- ❌ Violating Maxwell's equations or special relativity
## Summary
Physics-Compliant Solution:
- Use time-varying magnetic fields (no mechanical motion, satisfies ∂B/∂t ≠ 0)
- Apply Faraday's law to generate electric field
- Amplify via capacitive/charge pumping methods
- Convert to current through closed circuit
- Stabilize with feedback/regulation
- Deliver via impedance-matched output
This approach uses modern physics correctly while respecting fundamental laws—no violations, no free energy, just clever engineering within physical constraints.
So again, I'm not telling people what to use. And I don't care what's better for person X over person Y - as I clearly stated in the other thread, everybody's use case is different. This was just an experiment done publicly to try to maximize the model's capabilities in the smallest format possible, and to make it accessible to everyone regardless of their hardware. That's it. It's all in the other thread.
If you think q4 is better than f16, good.
If you think bf16 is the only way to go and we shouldn't care about the entire community, fine. Not interested, but fine.
Either way, this isn't a competition, it's a community effort.
With all due respect, I believe that two examples are simply not enough for such a comparison. Neither ten nor twenty would be sufficient. Your approach is simply unsound from the point of view of conducting statistical research. But indicators like KLD, Perplexity and similar ones have very specific mathematical and statistical justifications.
There's also something you are not taking into account comparing tightly compressed models: the imatrix and the calibration dataset used for weight quantization. In my case, for example, when working in a trilingual environment, with both technical and literary texts, models with static quantization perform subjectively better. The same may be true for specialized cases, such as scientific questions, which are simply not covered by the calibration dataset.
In any case, there are objective, mathematically sound metrics that generally demonstrate the overall correctness of a given model optimization approach. These metrics, on average, produce predictable results across large numbers of experiments. And then there are our subjective cases based on isolated measurements. After all, a quality analysis requires far more than just two or three results, across different fields of knowledge and under different conditions.
It's not about just "faster", or what's fine.
I don't ever use bf16 or f16 in my quants, I always quantize them. So no need to worry about whether end-user GPUs are faster with one or the other. bf16 really is mainly for training and for releasing full-quality quants, and I don't recommend it for efficient inference. It'll make more sense in a sec.
Doesn't this same logic apply to q8, q6, q4, etc...? I'm not exactly following how this is exclusive to f16 here.
No, it doesn't apply, because q8_0, q6_0, q4_K etc. are quantization types that have a scale component, so they have sufficient dynamic range (the biggest and smallest possible numbers) to handle bf16's range.
bf16 and fp16 are not quantization types, they are data types with a fixed min and max dynamic range. bf16 can hold numbers that are larger than what fp16 can hold.
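To make the scale argument concrete, here's a toy sketch loosely modeled on q8_0-style block quantization (one scale per block, int8 values in [-127, 127]); in llama.cpp the per-block scale is itself stored in half precision, but since the scale is max/127 it stays well inside f16's range:

```python
import numpy as np

def q8_0_roundtrip(block: np.ndarray):
    """Toy q8_0-style block quantization: a single float scale per block
    maps values to int8 in [-127, 127], then dequantizes back."""
    scale = np.abs(block).max() / 127.0
    q = np.clip(np.round(block / scale), -127, 127).astype(np.int8)
    return q, scale, q.astype(np.float32) * scale

block = np.array([1.0e5, -2.0, 0.5, 3.0], dtype=np.float32)  # 1e5 overflows f16
_, scale, deq = q8_0_roundtrip(block)
print(scale < 65504)  # True: the scale itself fits in f16's range
print(deq[0])         # ~1e5: the big value survives the round trip
```

The float scale is what buys the dynamic range; the cost is relative precision within the block, which is a separate trade-off from the f16 clipping issue.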
Sorry, I'm going to stop; it looks like you pasted some AI-generated stuff and I don't have time to read through it all. Sorry if that's an incorrect assumption.
In my case, for example, when working in a trilingual environment, with both technical and literary texts, models with static quantization perform subjectively better. The same may be true for specialized cases, such as scientific questions, which are simply not covered by the calibration dataset.
Interesting, I know there are others who prefer quants without imatrix. I do use a mix of languages in my imatrix corpus (linked on my model cards). Curious whether you've generated quants specific to your target languages, using an imatrix corpus consisting mostly of those languages?
Curious whether you've generated quants specific to your target languages, using an imatrix corpus consisting mostly of those languages?
No, for now I'm just using statically quantized models. Someday I hope to find the strength and time to experiment with imatrix quantization for my specific needs. Someday.
I respect that. But I'm not trying to prove anything. My profile picture is a dog.
My whole interest in this is based on this discussion, where I was curious about what would happen if the weights were increased in mxfp4 models: https://huggingface.co/noctrex/Qwen3-Coder-Next-MXFP4_MOE-GGUF/discussions/2 For the record, I started that discussion because I couldn't find anywhere else that such a discussion was taking place. No other quantizers seemed to be experimenting with weights in the mxfp4 format, and I was curious. That's it.
My original comment in this discussion we're in was simply an interesting observation I had about the model self correcting because I expected to not find that.
How it got blown out of proportion into this quant battle is beyond me.
I am an anonymous user who's trying to simply contribute to the open source community, not win some scientific award.
Respectfully, and sincerely, I believe that just running a perplexity test and immediately jumping to how a 4 bit model is better because of that test is super ignorant of the bigger picture here. And I think all arguments which clearly show little to no interest in community contributions are no better.
I took a moment to try to articulate my interest in this whole subject with a couple of small examples, because it's something I'm curious about, and I think using 16-bit weights does increase the quality of the output of these models.
But aside from that, it's important to note that I don't care about the cock measuring contest. This was in the interest of contributing, not competing.
Also, the f32/bf16/mxfp4 model is better - which came from our other discussion. So... do what you want with that.
Cool, thanks for writing as a human. I love dogs. It's great you're interested in optimizing quants for fun; I too enjoy it! I see you do have a thread discussing ideas, yes.
No other quantizers seemed to be experimenting with weights in the mxfp4 format and I was curious. That's it.
A lot of folks are using mxfp4 since it came out with gpt-oss. But yeah, feel free to experiment with it, obvi. There are also some other folks who believe bf16 token_embd and output (head) are superior to quantized weights. It's an exciting time to experiment.
Respectfully, and sincerely, I believe that just running a perplexity test and immediately jumping to how a 4 bit model is better because of that test is super ignorant of the bigger picture here. And I think all arguments which clearly show little to no interest in community contributions are no better.
Both perplexity and KLD are the academic standards for comparing the full-size original bf16 model against its various quantized versions. Are they perfect? No! No benchmarks are perfect that I've found.
However, they are good enough for some things. There are multiple perplexity and KLD benchmarks for some quants over here right now: https://www.reddit.com/r/LocalLLaMA/comments/1resggh/ and in fact it turned up an issue that at least some unsloth quants have an error in the recipe. So they are not useless.
jumping to how a 4 bit model is better
Did I do this? I don't think I did. If I did, I didn't intend to do this. All the quants in question are roughly 4ish BPW overall size.
And I think all arguments which clearly show little to no interest in community contributions are no better.
I think we're all contributing here in some ways? My goal is to identify what new things people are trying and if they look promising to encourage it and if it is something I've seen others try already or does not look optimal to gently guide them to consider other alternatives.
16-bit weights does increase the quality of the output of these models.
Technically yes, using bf16 (not f16) weights is the best possible choice. In fact, running the original bf16 GGUF would be the best quant possible. But there are trade-offs, and most people don't have enough RAM to run the full bf16. Don't conflate bf16 and fp16 is all I've been trying to argue here.
Anyway, see you round! Cheers!
The only thing that was AI-written was the part about bits, because Brave explains it in better detail. The rest was me.
The whole interest I had in this was simply to see whether it was possible to squeeze the most quality out of a quantized mxfp4 model. Which resulted in the F32/BF16/MXFP4 version: it reduces the size of the model by over 1/3, and I was able to show in the other discussion that I could run it on an RTX 2060 6GB GPU (which does not support bf16) with 32GB RAM, at 32k context, with f16 K and V cache, and get an output of 16 tokens per second.
The F16/MXFP4 version was never about reaching that level of quality (which was why I referenced the other discussion); it was simply an attempt to bring it as close to the BF16/MXFP4 quality as possible in a compatible format, since it runs faster on everyone's GPUs (which is theoretical, not concrete). That part of the idea was specifically for the benefit of others, after we settled on BF16/MXFP4 being superior.
For quality, it's now verifiable that we can reduce the file size of large models significantly while still maintaining an extremely high level of quality by using the F32/BF16/MXFP4 format. For the qwen3-coder-next model, it's incredible. For this model, whether it's incredible is still up for debate.
This discussion about the F16 version was never meant to be an argument about what is better or to present f16 as equal quality to bf16. And I'm sorry that it was misinterpreted to be that way.
This was just a thing that I was curious about and it happened to work out and i wanted to follow up on this model.
For quality, it's now verifiable that we can reduce the file size of large models significantly while still maintaining an extremely high level of quality by using the F32/BF16/MXFP4 format.
Yes, this is true and a good experiment imo. f32 is required for those vectors; you can't quantize those anyway, so it just copies over the original values. bf16 is fine to use, as that is the original full-quality data type. And mxfp4, sure, go for it; it is one of many fine quantization choices.
There are also other mixes that might offer similar or even better perplexity/KLD while also being smaller, so they run faster and fit more KV cache in under 24GB VRAM. It's all trade-offs.
It sounds like you're happy with the trade-offs on your mix, so enjoy it!
I'm specifically working on one that will be fast and good quality on, e.g., AMD 7900XTX GPUs with the Vulkan backend. Each backend's code affects the speed and quality too.
Designing quant recipes is crazy complex with so many variables. You and @noctrex are doing great. My only advice is don't hitch your wagon to one specific quantization type as different models will likely respond better/worse to different quantization types and mixes. Keep it up and have fun!
This discussion about the F16 version was never meant to be an argument about what is better or to present f16 as equal quality to bf16. And I'm sorry that it was misinterpreted to be that way.
Got it. It kinda makes sense to me now: your goal was to use the original bf16, but since fp16 runs faster on GPUs you decided to try it, and it's an honest mistake given both bf16 and fp16 are 16 bits. But they are not equivalent data types, as we discussed.
You could write a script that checks all the bf16 tensor values to make sure they don't exceed ±65504 (the min/max of fp16). If they don't, then it is fine to use fp16, because you've shown there is no clipping.
Also, the fp16 perplexity comes in pretty much identical to the bf16, so it is possible there was no clipping!
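A minimal sketch of that check, assuming you can get each tensor as a float32 numpy array (e.g. upcast from the original bf16 safetensors; the loading step is left out here and the tensor dict is hypothetical):

```python
import numpy as np

FP16_MAX = float(np.finfo(np.float16).max)  # 65504.0, fp16's largest finite value

def fp16_safe(tensor: np.ndarray) -> bool:
    """True if every value fits in fp16's finite range, i.e. no clipping to inf."""
    return float(np.abs(tensor).max()) <= FP16_MAX

# Hypothetical usage over a {name: array} dict of upcast model tensors:
# for name, t in tensors.items():
#     if not fp16_safe(t):
#         print(f"{name} would clip in fp16")

print(fp16_safe(np.array([1.0, -65504.0], dtype=np.float32)))  # True
print(fp16_safe(np.array([1.0, 70000.0], dtype=np.float32)))   # False
```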
This was just a thing that I was curious about and it happened to work out and i wanted to follow up on this model.
Thanks for sharing your work openly so others like myself can attempt to reproduce and benchmark against other quants. I don't see it as a competition directly, but more as getting nuanced information out there in a way that the masses of users can decide what will suit their specific hardware and workload the best. Variety is the spice, eh!
Cheers!
Damn, when did you have this whole conversation? lol, didn't even see it.
So we can essentially say that, except for speed, f16 does not provide that much value.
@ubergarm thank you for taking the time for testing. Do you use the standard wiki.test.raw for PPL tests? maybe I should also look into KLD for these.
Do you use the standard wiki.test.raw for PPL tests?
Yes, the exact file including sha1sum to confirm in my methodology here: https://huggingface.co/ubergarm/GLM-5-GGUF/discussions/1#699f2259f13e9365be9d8fbb
maybe I should also look into KLD
Yes; when the perplexity data look suspicious (e.g. there are dips at lower-BPW quantizations), looking at KLD can give more of the story.
Good quick descriptions of the two methods here: https://www.reddit.com/r/LocalLLaMA/comments/1resggh/comment/o7gaent/
Okay have fun y'all and keep on cookin'!
Geez I can't turn my head for a moment and there's new developments...
Speaking of KLD there was this thread on reddit I just read last night: https://old.reddit.com/r/LocalLLaMA/comments/1rfds1h/qwen3535ba3b_q4_quantization_comparison/ with details on how to calculate KLD.
oh man unsloth went nuts redoing all their recipes after the recent bug: https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwen3535ba3b_unsloth_dynamic_ggufs_benchmarks/ including @noctrex quants!
omg my quants are mentioned in the test, I'm getting a little anxious 😅
To me unsloth did the responsible thing by taking ownership and making the corrections/improvements while furthering our knowledge in the field even more. Kudos to them.
Reading that reddit thread has me thinking that this might be setting a new direction, or a new level of "standard", that the community quants will have going forward. Definitely upping the ante on the "big boys" to take open source models and community quants even more seriously. As we have seen in the past weeks, not everything needs to be nailed down with a datacenter size hammer to stay up on the wall. 😄
I'll dare make a prediction that this shortage of GPU/RAM/Storage on the market is actually going to benefit us all in the long run, driving the innovation and creativity forward. Community > Corporate ... always.
Well, they are an excellent team, so I had no doubt that they would find a solution. This was a good exploration of quants. And as they say, KLD and PPL are not very good statistics for MoE models. mxfp4 always seems worse in these statistics, but for me it performs better.
I fear we are beginning to rely too much on the math to prove one thing is better than the other. I recall from middle-school math a novelty "proof" that showed 1=2. It all looked legit, yet everyone knew it wasn't right, because from the real world we know that 1 and 2 are empirically different.
As you and Shuasimodo said, there's a factor of perception involved that can't be ignored. AI is still in its infancy, yet we are trying to mathematically design intelligence based on our own without understanding it, or without being willing to accept that intelligence is not bit-based. All of this so far is cheap parlour tricks (done for the wrong reasons), and eventually we'll have to change course if we are to create real intelligence (artificial or not).
I think it's great for anyone trying to improve the quality of their models. So I'm rooting for them.
But these tests are for old models. Not for newer dynamic models, definitely not for advanced reasoning models, and absolutely not for new model architectures that nobody but Qwen understands yet.
So let me try to give a layman's explanation of why each of those tests isn't beneficial to anyone except unsloth.
TLDR: If a model sticks to conventional, high-probability token sequences (the "modal" path most humans would take), it will score better (lower perplexity / lower KLD from data) on these tests.
If a model reasons dynamically, explores alternative paths, thinks creatively, or generates less typical / more diverse replies, it will usually score worse (higher perplexity / higher divergence) — because those paths assign lower probability to the exact reference wording that appears in standard benchmarks.
KLD is used to preserve behavioural consistency — so excessive dynamism, even if correct, is treated as divergence.
perplexity measures average token-level surprise, so a model that is confidently wrong or dynamically correct will score similarly — it rewards predictability, not reasoning quality.
PPL Tests
The dumb version:
Perplexity is fundamentally a measure of how well a language model predicts the next word in a sequence:
- A sequence of text is fed to the model
- The model predicts the probability distribution over all possible next tokens at each position
- The system compares those probabilities against what actually appeared in the ground truth text
So, what this means is that if the model acknowledges or questions what it's supposed to be doing, or if it thinks about it, or tries to understand the situation, then the test will score it worse.
On an older model, one that's just literally regurgitating text verbatim, this test would be a good unit of measurement because it didn't know shit and was just predicting the next token to finish a sentence.
But when the model is adding in extra tokens for its reasoning, these tests misrepresent the important aspects of the model. The new models predict the next token in a different way, which allows them to think and comprehend the nuance of what they're being prompted with. So they'll ultimately score poorer on these tests.
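The mechanics above fit in a few lines: perplexity is just the exponential of the mean negative log-likelihood the model assigned to the exact reference tokens, so lower probability on the reference wording means a worse score, regardless of whether the alternative wording was valid. The probabilities below are made up for illustration:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over the reference tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Made-up per-token probabilities assigned to the ground-truth text:
sticks_to_reference = [0.90, 0.80, 0.95, 0.85]   # model tracks the reference closely
prefers_other_words = [0.40, 0.30, 0.50, 0.35]   # valid but different phrasing
print(perplexity(sticks_to_reference))   # ~1.15 (lower = "better" on the test)
print(perplexity(prefers_other_words))   # ~2.63
```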
KL Divergence tests
dumb version, and TLDR:
This test measures the difference between two probability distributions: specifically, the output probability distribution of the full-precision model versus that of the quantized model.
If a model takes a different reasoning path (different token probabilities/order), KLD will show divergence, even if the final answer is correct or more robust. And as I mentioned at the beginning of this, excessive dynamism will score poorly.
These new models generate a different set of tokens each time you ask them the same question. Especially when getting up into 16 bits, the likelihood of the model (in a higher-quality format) generating the exact same reply multiple times in a row is extremely low.
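For reference, the underlying math is simple. A sketch with made-up next-token distributions shows what "divergence" means here:

```python
import math

def kl_divergence(p, q):
    """KL(P||Q) in nats between two next-token probability distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full_precision = [0.70, 0.20, 0.10]   # made-up probs from the bf16 model
quantized      = [0.65, 0.23, 0.12]   # quantized model, slightly shifted
different_path = [0.10, 0.20, 0.70]   # same vocabulary, very different preference

print(kl_divergence(full_precision, quantized))       # small: distributions agree
print(kl_divergence(full_precision, different_path))  # large: scored as heavy divergence
```

Note the test only sees that the distributions differ; it can't tell whether the "divergent" path was wrong or just a different valid continuation.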
A good example of this is Unsloth's new mxfp4. It scores as well as, or better than, the noctrex mxfp4 models. But in practice (testing, and throwing it out into the wild for real-world use) these models perform poorly. They are fast, but the test results and their position on these tests do not accurately reflect the models' lacking qualities.
These models:
- lack dynamic range, even in comparison to the f16 (same with their q4_k_xl)
- this is why they score higher on these particular tests.
- hallucinate into another dimension sometimes
- struggle to accept that we're in 2026 and not 2025 even when given all the tools and realtime data they'll still search up 2025 information.
This example is given with their new UD-MXFP4_MOE model. In case anyone missed what they meant about retiring mxfp4: they have been using the mxfp4 weights across their UD quants. You can look at the weights of the prior models and verify this. This is not about the MXFP4 model alone. But for argument's sake and fairness, their new UD-MXFP4_MOE model is scoring at the same level as the noctrex f16/mxfp4 and bf16/mxfp4. This is a perfect illustration of the points I made about accuracy, dynamism, and confidence.
Those metrics are great for catching quantization artifacts. They are terrible at measuring intelligence, creativity, or usefulness once you leave the narrow “mimic the training data” regime.
Conclusion
We need to be realistic and actually take three things into consideration:
- What the model can actually do (Aider, LiveCodeBench, real tool use, creative writing, long reasoning chains, etc.)
- The underlying mechanics (is it doing search/reasoning or just regurgitating?)
- How the test actually works (local next-token matching vs global capability)
I'm not trying to be challenging. Again, for clarity, I don't give a shit about these stupid, fucking, ego-driven quant wars, but there is a very important aspect of where we are in the state of LLMs that people are dismissing for the sake of some test scores which they don't understand. The goofy argument that better KLD and/or perplexity scores mean a better model is ignorant, silly, and intellectually embarrassing.
The new noctrex models are perfect for my use cases.
edit: This comment is about the test type, not the models or unsloth. Are their new models with increased precision in their weights better? Yes, absolutely. Are some of them even great? Yes. Is this a win for the community? Yes. Is this a good thing? Yes; in my opinion it's the best direction for the community. Is their effort and contribution commendable? Absolfuckinglutely. My point here is that these tests are not as accurate on today's latest models as they were on the models we used last year.
Alright, fuckit, I'm out!
Noctrex, I have a request addressed to you: please analyze my llama-server configuration for launching Qwen3.5-35B-A3B on the Windows 11 Pro operating system. I have created a PowerShell script as follows:
$llamaServerPath = "..\llama-server.exe"
$modelPath = Join-Path -Path "$env:USERPROFILE\llama.cpp-runtime\model-garden" -ChildPath "qwen3.5-35b-a3b-MXFP4_MOE_BF16.gguf"
$apiKey = "<api_key_placeholder>"

Write-Host "Select the backend for inference:"
Write-Host "1. CUDA"
Write-Host "2. Vulkan"
Write-Host ""

do {
    $selection = Read-Host "Enter the option (1 or 2)"
    switch ($selection) {
        '1' { $device = "CUDA0"; break }
        '2' { $device = "Vulkan0"; break }
        default { Write-Host "Invalid option! Please select 1 or 2." -ForegroundColor Red }
    }
} until ($selection -eq '1' -or $selection -eq '2')

Write-Host ""
Write-Host "Select the type of reasoning effort:"
Write-Host "1. Enabled"
Write-Host "2. Disabled"
Write-Host ""

do {
    $reasoningSelection = Read-Host "Enter the option (1 or 2)"
    switch ($reasoningSelection) {
        '1' {
            $enableThinking = $true
            $modelName = "Qwen3.5-Thinking"
            break
        }
        '2' {
            $enableThinking = $false
            $modelName = "Qwen3.5-Instruct"
            break
        }
        default { Write-Host "Invalid option! Please select 1 or 2." -ForegroundColor Red }
    }
} until ($reasoningSelection -eq '1' -or $reasoningSelection -eq '2')

$chatTemplateKwargs = @{enable_thinking = $enableThinking} | ConvertTo-Json -Compress
$chatTemplateKwargs = $chatTemplateKwargs -replace '"', '\"'

$arguments = @(
    "--model", $modelPath,
    "--alias", $modelName,
    "--offline",
    "--host", "127.0.0.1",
    "--port", "3333",
    "--api-key", $apiKey,
    "--chat-template-kwargs", $chatTemplateKwargs,
    "--threads", "26",
    "--threads-batch", "26",
    "--cpu-moe",
    "--device", $device,
    "--gpu-layers", "99",
    "--fit", "on",
    "--batch-size", "2048",
    "--ubatch-size", "1024",
    "--keep", "-1",
    "--flash-attn", "on",
    "--no-mmap",
    "--ctx-size", "32768",
    "--predict", "-1",
    "--seed", "3407",
    "--temp", "0.75",
    "--top-p", "0.95",
    "--top-k", "20",
    "--top-nsigma", "-1.00",
    "--min-p", "0.0",
    "--xtc-probability", "0.00",
    "--xtc-threshold", "0.10",
    "--typical", "1.00",
    "--repeat-last-n", "64",
    "--repeat-penalty", "1.0",
    "--presence-penalty", "0.7",
    "--frequency-penalty", "0.0",
    "--mirostat", "0",
    "--mirostat-lr", "0.10",
    "--mirostat-ent", "5.00",
    "--cache-type-k", "q8_0",
    "--cache-type-k-draft", "q8_0",
    "--cache-type-v", "q8_0",
    "--cache-type-v-draft", "q8_0",
    "--kv-unified",
    "--parallel", "1"
)

Start-Process -FilePath $llamaServerPath -ArgumentList $arguments
# Notes: Official values for Qwen3.5 Thinking - general tasks
# "--temp", "1.0",
# "--ctx-size", "32768",
# "--top-p", "0.95",
# "--top-k", "20",
# "--min-p", "0.0",
# "--repeat-penalty", "1.0",
# "--presence-penalty", "1.5"
# Notes: Official values for Qwen3.5 Thinking - programming tasks
# "--temp", "0.6",
# "--ctx-size", "32768",
# "--top-p", "0.95",
# "--top-k", "20",
# "--min-p", "0.0",
# "--repeat-penalty", "1.0",
# "--presence-penalty", "0.0"
# Notes: Official values for Qwen3.5 Instruct - general tasks
# "--temp", "0.7",
# "--ctx-size", "32768",
# "--top-p", "0.8",
# "--top-k", "20",
# "--min-p", "0.0",
# "--repeat-penalty", "1.0",
# "--presence-penalty", "1.5"
# Notes: Official values for Qwen3.5 Instruct - reasoning tasks
# "--temp", "1.0",
# "--ctx-size", "32768",
# "--top-p", "0.95",
# "--top-k", "20",
# "--min-p", "0.0",
# "--repeat-penalty", "1.0",
# "--presence-penalty", "1.5"
# Notes: My use case is programming and more deterministic, coherent outputs
# I prefer average values as (n) - 0.05
# "--temp", "0.75",
# "--ctx-size", "32768",
# "--top-p", "0.95",
# "--top-k", "20",
# "--min-p", "0.0",
# "--repeat-penalty", "1.0",
# "--presence-penalty", "0.7"
# Notes: Hardware
# AMD Ryzen 9 7945HX (16c/32t)
# NVIDIA RTX 4070 Mobile (8GB)
# 32GB RAM (DDR5-5200)
Given the knowledge you've accumulated through experience, please evaluate the correctness of the arguments and their values. I would particularly like your expert opinion on which arguments and values you consider most suitable for the optimal model usage experience.
By optimal experience, I mean output quality > token generation performance. Please review my hardware configuration for personalized recommendations. Thank you for your awesome work!
I would suggest using llama-swap.
That's what I use.
It's a very versatile proxy to use with many backends.
It lets you dynamically load any executable with any parameters you like.
Here's a snippet from my llama-swap config:
healthCheckTimeout: 500
logLevel: info
startPort: 10001
sendLoadingState: true

macros:
  llamacpp-options: --flash-attn on --metrics --jinja --port ${PORT} --repeat-last-n 256 --slot-prompt-similarity 0.2 --context-shift --ctx-checkpoints 128 --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
  llamacpp-cpu: C:/Programs/AI/llamacpp-cpu/llama-server.exe ${llamacpp-options}
  llamacpp-rocm: C:/Programs/AI/llamacpp-rocm/llama-server.exe ${llamacpp-options}
  llamacpp-vulkan: C:/Programs/AI/llamacpp-vulkan/llama-server.exe ${llamacpp-options}
  context-32k: --ctx-size 32768
  context-64k: --ctx-size 65536
  context-256k: --ctx-size 262144
  kv-cache-q8: --cache-type-k q8_0 --cache-type-v q8_0

models:
  local-Qwen3.5-27B:
    description: "27B Dense, Q4_K_M, 15.59GB, Context 32k"
    ttl: 1200
    aliases:
      - Qwen3.5-27B
    cmd: |
      ${llamacpp-rocm}
      --model G:/Models/Qwen3.5-27B-Q4_K_M.gguf
      --mmproj G:/Models/Qwen3.5-27B-mmproj-F32.gguf
      ${context-32k}
      ${kv-cache-q8}
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min-p 0.00
  local-Qwen3.5-35B-A3B:
    description: "35B MoE A3B, MXFP4_MOE_BF16, 20.55GB, Context 32k"
    ttl: 1200
    aliases:
      - Qwen3.5-35B-A3B
    cmd: |
      ${llamacpp-rocm}
      --model G:/Models/Qwen3.5-35B-A3B-MXFP4_MOE_BF16.gguf
      --mmproj G:/Models/Qwen3.5-35B-A3B-mmproj-F32.gguf
      ${context-32k}
      ${kv-cache-q8}
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min-p 0.00
  local-Qwen3.5-122B-A10B-Thinking:
    description: "122B MoE A10B, MXFP4_MOE_BF16, 68.91GB, Context 64k"
    ttl: 1200
    aliases:
      - Qwen3.5-122B-A10B-Thinking
    cmd: |
      ${llamacpp-rocm}
      --model G:/Models/Qwen3.5-122B-A10B-MXFP4_MOE_BF16.gguf
      --mmproj G:/Models/Qwen3.5-122B-A10B-mmproj-F32.gguf
      ${context-64k}
      ${kv-cache-q8}
      --temp 0.6
      --top-p 0.95
      --top-k 20
      --min-p 0.00
  local-Qwen3.5-122B-A10B-Instruct:
    description: "122B MoE A10B, MXFP4_MOE_BF16, 68.91GB, Context 64k"
    ttl: 1200
    aliases:
      - Qwen3.5-122B-A10B-Instruct
    cmd: |
      ${llamacpp-rocm}
      --model G:/Models/Qwen3.5-122B-A10B-MXFP4_MOE_BF16.gguf
      --mmproj G:/Models/Qwen3.5-122B-A10B-mmproj-F32.gguf
      ${context-64k}
      ${kv-cache-q8}
      --temp 1.0
      --top-p 0.95
      --top-k 20
      --min-p 0.00
      --chat-template-kwargs "{\"enable_thinking\": false}"
  local-Qwen3-Coder-Next:
    description: "80B MoE A3B, MXFP4_MOE, 42.74GB, Context 256k, ROCm: ~20 tps, Vulkan ~20 tps"
    ttl: 1200
    aliases:
      - Qwen3-Coder-Next
    cmd: |
      ${llamacpp-rocm}
      --model Q:/Models/LLM/Qwen3-Coder-Next-MXFP4_MOE_BF16.gguf
      ${context-256k}
      ${kv-cache-q8}
      --temp 1.0
      --top-p 0.95
      --min-p 0.01
      --top-k 40
Adjust to your taste.
There was also this thread on reddit about quantizing KV cache: https://old.reddit.com/r/LocalLLaMA/comments/1rhvi09/psa_if_your_local_coding_agent_feels_dumb_at_30k/
tldr; don't quantize K (leave it at full precision) and only quantize V, because K is more sensitive to precision loss.
Granted I kinda understand the reasoning, but I'll have to simply test to see the implications in real life (if at all possible).
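With llama-server, that advice maps onto the existing cache-type flags; a sketch (model path is a placeholder, and note that a quantized V cache requires flash attention):

```shell
# Keep K at full precision, quantize only V, per the linked thread's advice.
llama-server \
  --model model.gguf \
  --flash-attn on \
  --cache-type-k f16 \
  --cache-type-v q8_0 \
  --ctx-size 32768
```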





