Experimenting with dynamic quantization
#1
pinned
by Lunzima - opened
https://gist.github.com/lunzima/dbca4281acf7c6bb0100e26a0a51de06
This patch modifies the `llama_tensor_get_type` function to optimize the quantization strategy for the different tensor types in the model. The main changes are:
FFN Layer Quantization:
- Added specific quantization types for the `ffn_down`, `ffn_gate`, and `ffn_up` tensors depending on their layer position: the first few layers use higher precision, while the rest use lower precision.
- Introduced quantization types for the shared-expert tensors (`ffn_down_shexp` and `ffn_gate_shexp`) with higher efficiency.
Attention Layer Quantization:
- Improved quantization type allocation for `attn_v.weight` based on the model architecture and parameters.
- Specified quantization types for the MLA projection matrices (`attn_kv_a_mqa.weight`, `attn_kv_b.weight`, `attn_q_a.weight`, and `attn_q_b.weight`).
Model Architecture and Parameter Configuration:
- Adjusted quantization types for `attn_output.weight` based on the model architecture and parameters.
- Included a check for `is_one_bit` to differentiate quantization strategies between model types.
Code Structure:
- Added more `else if` conditions and function calls to clarify the quantization logic and improve maintainability.
Overall, the patch refines the quantization process to improve model quality and efficiency, especially for complex architectures.