Experimenting with dynamic quantization
#1
pinned
by Lunzima - opened
https://gist.github.com/lunzima/dbca4281acf7c6bb0100e26a0a51de06
This patch modifies the `llama_tensor_get_type` function to optimize the quantization strategy for the different tensor types in the model. The main changes are:
FFN Layer Quantization:
- Added specific quantization types for the `ffn_down`, `ffn_gate`, and `ffn_up` tensors depending on their layer position: the first few layers use higher precision, while the rest use lower precision.
- Introduced quantization types for the shared-expert tensors (`ffn_down_shexp` and `ffn_gate_shexp`) with higher efficiency.
Attention Layer Quantization:
- Improved quantization type allocation for `attn_v.weight` based on the model architecture and parameters.
- Specified quantization types for the MLA projection matrices (`attn_kv_a_mqa.weight`, `attn_kv_b.weight`, `attn_q_a.weight`, and `attn_q_b.weight`).
Model Architecture and Parameter Configuration:
- Adjusted quantization types for `attn_output.weight` based on the model architecture and parameters.
- Included a check for `is_one_bit` to differentiate quantization strategies between model types.
Code Structure:
- Added more `else if` conditions and function calls to clarify the quantization logic and improve maintainability.
Overall, the patch refines the quantization process to improve model quality and efficiency, especially for complex architectures.