When I saw this level of quantization I had to give it a try
Generally it works, but words spoken around short pauses (even 500 ms of no speech) get aligned really badly, sometimes by several hundred ms.
I was able to get good results with int8 quantization of the audio tower while leaving the decoder at bf16 (PyTorch), but that still leaves me at around 1.5 GB. I would love to get it smaller, though it seems you have been a little too aggressive.
OK, I see now that you have made a couple of alternative quantizations:
https://huggingface.co/OpenVoiceOS/qwen3-forced-aligner-0.6b-q5-k-m
https://huggingface.co/OpenVoiceOS/qwen3-forced-aligner-0.6b-q8-0
Going to give these a try and will let you know how it goes.
OK, so both the q5 and q8 models suffer from the same problem.
I uploaded some result files here, including my queen29s.wav test file:
https://drive.google.com/drive/folders/1vUplezsSdDHD0zXCqwMd9qtJiiaXq3w0
If you use https://huggingface.co/spaces/hlevring/Timestamps-Tester you can see the problem around the silence areas. For example, note how the word "country" gets aligned totally wrong.
OK, now I just tried the f16 and I am getting the same issue, so that points to a problem in the py-qwen3-asr-cpp / GGML inference implementation rather than a quantization artifact. My own int8-bf16 results work perfectly in PyTorch.
Will dig some more into this and comment in the appropriate repo. I was pretty sure it was a quantization issue, which is why I continued the conversation here.
Thanks for the feedback, @hlevring. Indeed, the problem may be related to the way GGML maps the tensors into memory and to the order of operands in matrix multiplications. GGML's design targets consumer hardware, while PyTorch int8-bf16 is optimized for data-center hardware (e.g., NVIDIA H100, Intel Xeon). I am not the author of the original GGML implementation for this aligner; I wrote the mixed-quantization script to enable quantization down to Q4_K_M, plus the Python bindings. I will try to investigate what is going on with the GGML implementation. If I had to guess, I'd say it is related to the non-associativity of sequential quantized multiplications, exacerbated by the fact that in GGML the order of multiplication operands is inverted with respect to PyTorch's.
Looks like it is a tokenization issue. When I strip punctuation from the transcript before passing it to the GGUF pipeline I get good results.
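For reference, the workaround is just a pre-processing step on the transcript before it reaches the aligner. A minimal sketch of what I mean (the actual pipeline call is omitted; this only shows the cleaning step):

```python
import string

def strip_punctuation(transcript: str) -> str:
    """Remove all ASCII punctuation from the transcript and collapse
    any doubled whitespace left behind, so the GGUF tokenizer only
    sees plain words. Workaround sketch, not the pipeline API itself."""
    cleaned = transcript.translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

text = "Well, the country - as you know - changed."
print(strip_punctuation(text))  # -> Well the country as you know changed
```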
Thanks for the investigation! I will check what is happening with the tokenization.
I added some _clean files where I stripped commas and other punctuation, then compared your f16 against your lower quantizations.
The q8 performs almost identically to f16. There is a small difference to q4, as expected, but I checked some of the individual differences and they were on unusual words like "H1" etc. All in all this indicates that q8 might be the perfect sweet spot, but I am really surprised that q4 performs so impressively.
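Roughly how I compared them, as a sketch; the result-file structure below is simplified and hypothetical, so adapt it to whatever format your pipeline emits:

```python
# Sketch: compare word-level timestamps from two alignment runs.
# Assumes both runs aligned the same transcript, so words line up 1:1;
# the dict layout below is hypothetical, not the pipeline's real format.

def timing_deltas(ref, other):
    """Return (word, absolute start-time difference in seconds) pairs."""
    return [
        (r["word"], abs(r["start"] - o["start"]))
        for r, o in zip(ref, other)
    ]

f16 = [{"word": "the", "start": 0.12}, {"word": "country", "start": 0.48}]
q4 = [{"word": "the", "start": 0.12}, {"word": "country", "start": 0.53}]

for word, delta in timing_deltas(f16, q4):
    print(f"{word}: {delta * 1000:.0f} ms")
```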
I still see some minor differences from the PyTorch pipeline, but at least we have a clear indication that this model survives even brutal quantization, which I did not manage myself when I tried to quantize the PyTorch model beyond what I mentioned before.
Thanks a lot for your work.
@hlevring, can you test the following branch while keeping the punctuation in the input text? https://github.com/femelo/py-qwen3-asr-cpp.git@fix-forced-aligner-tokenizer. I will try it on my side with your timestamp tester too. Thanks a lot!
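In case it helps, something like this should pull the branch directly, assuming a standard pip-installable setup (adjust if the project uses a different build flow):

```shell
pip install "git+https://github.com/femelo/py-qwen3-asr-cpp.git@fix-forced-aligner-tokenizer"
```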
I am not home at the moment, will check on Sunday when I get back.
I just gave the PR a quick try and compared a couple of my files with clean stripped text vs. punctuated text, and the results are now the same. ;) I did not look into the PR itself, but from a couple of super quick tests it looks good.
I am still seeing a few differences/outliers when comparing to the official PyTorch implementation; I plan to have a closer look at that next week.