Hy-MT2-30B-A3B-MLX-4bit
This repository contains a community-created regular MLX affine 4-bit conversion of tencent/Hy-MT2-30B-A3B, optimized for Apple Silicon Macs through MLX-LM.
It exists for Mac users who want to run the 30B-A3B Hy-MT2 translation model locally with lower memory usage than the 8-bit MLX conversion.
Important Notice
This is not an official Tencent release. It is a community conversion.
Tencent is not affiliated with, associated with, sponsoring, endorsing, or maintaining this repository, this conversion, or any service that uses it.
The original model is licensed under the Tencent HY Community License Agreement. A copy is included as LICENSE.txt, and the redistribution notice is included as NOTICE.
The Tencent HY license states that the agreement does not apply in the European Union and defines its Territory as worldwide excluding the EU. Do not use, reproduce, modify, distribute, or display Tencent HY Works outside the permitted Territory. You are responsible for reading and complying with the full license and Acceptable Use Policy before using this model.
What Was Converted
- Base model: tencent/Hy-MT2-30B-A3B
- Architecture: hy_v3
- Model family: Hy-MT2 multilingual translation
- Parameters: 30B total, about 3B active per token
- Source precision: BF16
- Target format: MLX safetensors
- Quantization: MLX affine 4-bit, group size 64
- Reported bits per weight: 4.502
- Output size: about 16 GB
- Conversion date: 2026-05-22
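The bits-per-weight figure in the list above can be sanity-checked with a little arithmetic. The sketch below assumes that group-wise affine quantization stores one 16-bit scale and one 16-bit bias per group, which adds 32/64 = 0.5 bits per weight at group size 64. The function names and the nominal 30e9 parameter count are illustrative and are not taken from the conversion script.

```python
def affine_bits_per_weight(bits: int, group_size: int,
                           scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective bits per weight for group-wise affine quantization.

    Assumption: each group stores one scale and one bias next to its
    packed low-bit weights.
    """
    return bits + (scale_bits + bias_bits) / group_size


def approx_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough on-disk size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


ideal_bpw = affine_bits_per_weight(bits=4, group_size=64)
print(ideal_bpw)  # 4.5

# Nominal 30B parameters at the reported 4.502 bits per weight.
estimate_gb = approx_size_gb(30e9, 4.502)
print(f"{estimate_gb:.1f} GB")  # 16.9 GB
```

The estimate is only as good as the nominal parameter count, so treat it as a ballpark that lands near the "about 16 GB" output size rather than an exact prediction.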
The converted repository includes a custom hy_v3.py adapter because mlx-lm 0.31.3 did not include built-in hy_v3 model support at conversion time. Loading this repository executes that local adapter file through MLX-LM's model_file mechanism. Please inspect the file before running it if you have any concern about custom model code.
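One way to look at the adapter before running it is to parse it statically. The sketch below uses Python's standard `ast` module to list the imports and top-level definitions in a source file without executing it. The helper name `summarize_module` is hypothetical, and a static summary is only a first pass, not a substitute for reading the file.

```python
import ast
from pathlib import Path


def summarize_module(path: str) -> dict:
    """List imports and top-level definitions without executing the file."""
    tree = ast.parse(Path(path).read_text(encoding="utf-8"))

    # Collect every imported module name, including imports nested in functions.
    imports = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imports.add("." * node.level + (node.module or ""))

    # Collect only the classes and functions defined at module level.
    definitions = [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]
    return {"imports": sorted(imports), "definitions": definitions}
```

After downloading the repository, calling `summarize_module` on the local copy of `hy_v3.py` shows at a glance whether the adapter imports anything unexpected for a model definition, such as networking or subprocess modules.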
Tested Hardware
Tested on:
- MacBook Pro M5 Max
- 128 GB unified memory
- macOS with Apple Silicon Metal acceleration
- mlx==0.31.2
- mlx-lm==0.31.3
- Python 3.13
Short smoke test:
Input:
오늘 날씨가 정말 좋네요.
Output:
The weather is really nice today.
Generation speed:
92.380 tokens/sec
Peak memory:
17.030 GB
This is a short smoke test, not a full benchmark. Throughput, memory use, and translation quality will vary by Mac model, macOS version, prompt length, output length, batch size, KV-cache size, and workload.
Installation
Install MLX-LM:
python3 -m pip install -U mlx-lm
For best results, use a recent macOS release on an Apple Silicon Mac. The model files are about 16 GB, and actual memory use grows with prompt length and generated output length.
Quick Start: Python
from mlx_lm import load, stream_generate
from mlx_lm.sample_utils import make_sampler

model_id = "QwQbb/Hy-MT2-30B-A3B-MLX-4bit"
model, tokenizer = load(model_id)

source_text = "오늘 날씨가 정말 좋네요."
prompt = (
    "Translate the following text into English. "
    "Note that you should only output the translated result without any additional explanation:\n\n"
    f"{source_text}"
)

messages = [{"role": "user", "content": prompt}]
prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=False,
)

sampler = make_sampler(temp=0.7, top_p=1.0, top_k=0)

parts = []
for response in stream_generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=4096,
    sampler=sampler,
):
    print(response.text, end="", flush=True)
    parts.append(response.text)

print("\n\nFinal:", "".join(parts).strip())
Quick Start: MLX-LM Server
Start an OpenAI-compatible local server:
mlx_lm.server \
  --model QwQbb/Hy-MT2-30B-A3B-MLX-4bit \
  --host 127.0.0.1 \
  --port 8080 \
  --temp 0.7 \
  --top-p 1.0 \
  --max-tokens 4096 \
  --trust-remote-code
Call it with curl:
curl -X POST "http://127.0.0.1:8080/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "QwQbb/Hy-MT2-30B-A3B-MLX-4bit",
    "messages": [
      {
        "role": "user",
        "content": "Translate the following text into English. Note that you should only output the translated result without any additional explanation:\n\n오늘 날씨가 정말 좋네요."
      }
    ],
    "temperature": 0.7,
    "top_p": 1.0,
    "max_tokens": 4096,
    "stream": true
  }'
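The same request can be sent from Python with only the standard library. This is a minimal sketch, assuming the server above is listening on 127.0.0.1:8080 and returns the usual OpenAI-style `choices[0].message.content` field when `stream` is false. The helper names `build_request`, `extract_text`, and `translate` are illustrative.

```python
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_request(source_text: str, target_lang: str = "English") -> urllib.request.Request:
    """Build a non-streaming chat completion request for one translation."""
    prompt = (
        f"Translate the following text into {target_lang}. "
        "Note that you should only output the translated result "
        "without any additional explanation:\n\n"
        f"{source_text}"
    )
    payload = {
        "model": "QwQbb/Hy-MT2-30B-A3B-MLX-4bit",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_p": 1.0,
        "max_tokens": 4096,
        "stream": False,
    }
    return urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Pull the translated text out of an OpenAI-style response body."""
    return response_body["choices"][0]["message"]["content"].strip()


def translate(source_text: str, target_lang: str = "English") -> str:
    """Send one request; requires the local server to be running."""
    with urllib.request.urlopen(build_request(source_text, target_lang)) as resp:
        return extract_text(json.load(resp))
```

With the server running, `translate("오늘 날씨가 정말 좋네요.")` should return a single English sentence. Setting `stream` to false keeps the parsing to one JSON document instead of a sequence of server-sent events.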
Recommended Generation Settings
Tencent recommends the following parameters for Hy-MT2-30B-A3B:
{
  "temperature": 0.7,
  "top_p": 1.0,
  "top_k": -1,
  "repetition_penalty": 1.0,
  "max_tokens": 4096
}
In MLX-LM, top_k=0 disables top-k filtering, which corresponds to the intent of Tencent's top_k=-1 setting.
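The mapping between the two conventions can be written down as a small helper. This is a sketch under the assumption that the keyword names match the `make_sampler(temp=..., top_p=..., top_k=...)` call used in the Python quick start. The function name `to_mlx_sampler_kwargs` is hypothetical.

```python
def to_mlx_sampler_kwargs(settings: dict) -> dict:
    """Translate Tencent-style generation settings into make_sampler keywords.

    A non-positive top_k (such as -1) means "no top-k filtering", which
    MLX-LM expresses as top_k=0.
    """
    top_k = settings.get("top_k", -1)
    return {
        "temp": settings["temperature"],
        "top_p": settings["top_p"],
        "top_k": 0 if top_k is None or top_k <= 0 else top_k,
    }


tencent_settings = {
    "temperature": 0.7,
    "top_p": 1.0,
    "top_k": -1,
    "repetition_penalty": 1.0,
    "max_tokens": 4096,
}

print(to_mlx_sampler_kwargs(tencent_settings))
# {'temp': 0.7, 'top_p': 1.0, 'top_k': 0}
```

The helper drops `repetition_penalty` and `max_tokens` on purpose: a penalty of 1.0 leaves the logits unchanged, and `max_tokens` is passed to the generation call rather than to the sampler.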
Prompt Template
Use full language names in prompts, for example English, Korean, Japanese, Traditional Chinese, or French.
Translate the following text into {target_lang}. Note that you should only output the translated result without any additional explanation:

{source_text}
For terminology, style, background context, or structured-data translation, follow the instruction examples in the original Tencent Hy-MT2 model card.
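For the basic case, filling in the template can be wrapped in a small helper. This is a minimal sketch: the name `build_translation_prompt` is illustrative, and the blank line between the instruction and the source text follows the `\n\n` used in the quick-start examples.

```python
PROMPT_TEMPLATE = (
    "Translate the following text into {target_lang}. "
    "Note that you should only output the translated result "
    "without any additional explanation:\n\n{source_text}"
)


def build_translation_prompt(source_text: str, target_lang: str) -> str:
    """Fill the translation template with a full language name and the text."""
    target_lang = target_lang.strip()
    if not target_lang:
        raise ValueError("target_lang must be a full language name, e.g. 'English'")
    return PROMPT_TEMPLATE.format(target_lang=target_lang, source_text=source_text)


prompt = build_translation_prompt("오늘 날씨가 정말 좋네요.", "English")
print(prompt)
```

The resulting string is what goes into the `content` field of the user message before the chat template is applied.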
Supported Languages
Hy-MT2 supports translation among the following languages:
| Language | Code |
|---|---|
| Chinese | zh |
| English | en |
| French | fr |
| Portuguese | pt |
| Spanish | es |
| Japanese | ja |
| Turkish | tr |
| Russian | ru |
| Arabic | ar |
| Korean | ko |
| Thai | th |
| Italian | it |
| German | de |
| Vietnamese | vi |
| Malay | ms |
| Indonesian | id |
| Filipino | tl |
| Hindi | hi |
| Traditional Chinese | zh-Hant |
| Polish | pl |
| Czech | cs |
| Dutch | nl |
| Khmer | km |
| Burmese | my |
| Persian | fa |
| Gujarati | gu |
| Urdu | ur |
| Telugu | te |
| Marathi | mr |
| Hebrew | he |
| Bengali | bn |
| Tamil | ta |
| Ukrainian | uk |
| Tibetan | bo |
| Kazakh | kk |
| Mongolian | mn |
| Uyghur | ug |
| Cantonese | yue |
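Because prompts should use full language names rather than codes, a lookup from code to name can be handy when an application stores language codes. The dictionary below is copied from the table above; the function name `language_name` is hypothetical.

```python
LANGUAGE_NAMES = {
    "zh": "Chinese", "en": "English", "fr": "French", "pt": "Portuguese",
    "es": "Spanish", "ja": "Japanese", "tr": "Turkish", "ru": "Russian",
    "ar": "Arabic", "ko": "Korean", "th": "Thai", "it": "Italian",
    "de": "German", "vi": "Vietnamese", "ms": "Malay", "id": "Indonesian",
    "tl": "Filipino", "hi": "Hindi", "zh-Hant": "Traditional Chinese",
    "pl": "Polish", "cs": "Czech", "nl": "Dutch", "km": "Khmer",
    "my": "Burmese", "fa": "Persian", "gu": "Gujarati", "ur": "Urdu",
    "te": "Telugu", "mr": "Marathi", "he": "Hebrew", "bn": "Bengali",
    "ta": "Tamil", "uk": "Ukrainian", "bo": "Tibetan", "kk": "Kazakh",
    "mn": "Mongolian", "ug": "Uyghur", "yue": "Cantonese",
}


def language_name(code: str) -> str:
    """Return the full language name to use in a prompt for a given code."""
    try:
        return LANGUAGE_NAMES[code]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None


print(language_name("ko"))  # Korean
```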
Files
- model-00001-of-00004.safetensors through model-00004-of-00004.safetensors: MLX 4-bit quantized weights
- model.safetensors.index.json: weight index
- config.json: HyV3 model configuration plus MLX quantization metadata
- hy_v3.py: custom MLX-LM model adapter for HYV3
- tokenizer.json, tokenizer_config.json, chat_template.jinja: tokenizer and chat template files from the base model
- LICENSE.txt: Tencent HY Community License Agreement from the base model
- NOTICE: redistribution notice and conversion notice
- conversion_info.json: conversion metadata and smoke-test result
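After downloading, it can be worth checking that every weight shard named in the index is actually present. The sketch below assumes that model.safetensors.index.json follows the common layout with a `weight_map` object mapping tensor names to shard filenames. The helper name `missing_shards` is illustrative.

```python
import json
from pathlib import Path


def missing_shards(model_dir: str) -> list[str]:
    """Return shard filenames listed in the index but absent from model_dir."""
    root = Path(model_dir)
    index_path = root / "model.safetensors.index.json"
    index = json.loads(index_path.read_text(encoding="utf-8"))

    # Assumption: "weight_map" maps each tensor name to the shard that holds it.
    shard_names = sorted(set(index["weight_map"].values()))
    return [name for name in shard_names if not (root / name).is_file()]
```

An empty list from `missing_shards` on the local download directory means all referenced shards exist. It does not verify file contents, only their presence.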
Limitations
- This is an experimental community conversion, not an official Tencent artifact.
- This is a regular MLX affine 4-bit quantization.
- It is intended for MLX-LM on Apple Silicon. It is not a Transformers checkpoint.
- The custom hy_v3.py adapter was tested with a short translation smoke test, but it has not been exhaustively validated on every long-context, batching, or edge-case workload.
- Lower-bit quantization may affect translation quality compared with BF16 or 8-bit.
- The benchmark number above is a short generation test and should not be treated as a universal throughput guarantee.
- Users are responsible for license compliance, applicable law compliance, and Acceptable Use Policy compliance.
Attribution
Base model:
tencent/Hy-MT2-30B-A3B
https://huggingface.co/tencent/Hy-MT2-30B-A3B
Hy-MT2 paper:
@misc{zheng2026hymt2familyfastefficient,
  title={Hy-MT2: A Family of Fast, Efficient and Powerful Multilingual Translation Models in the Wild},
  author={Mao Zheng and Zheng Li and Tao Chen and Bo Lv and Mingrui Sun and Mingyang Song and Jinlong Song and Hong Huang and Decheng Wu and Hai Wang and Yifan Song and Yanfeng Chen and Guanwei Zhang},
  year={2026},
  eprint={2605.22064},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.22064},
}
Required Notice
Tencent HY is licensed under the Tencent HY Community License Agreement, Copyright (c) 2026 Tencent. All Rights Reserved. The trademark rights of "Tencent HY" are owned by Tencent or its affiliate.