Meeting transcription demo on Axera
If you are interested in model conversion, you can export the axmodel from the original repo:
How to Convert from ONNX to axmodel
See the model conversion reference.
Verified on Python 3.10.
Requirements
pip3 install -r requirements.txt
Supports real-time segmented transcription from the browser microphone; when the meeting ends, it automatically runs speaker clustering + ASR and calls an OpenAI-compatible API to generate meeting minutes.
Start the server:
python -m ax_meeting.server
Open in a browser:
http://127.0.0.1:8000
Environment variables (optional, used for meeting-minutes generation):
OPENAI_API_KEY=xxx
OPENAI_BASE_URL=http://127.0.0.1:8001/v1 # set when using a local OpenAI-compatible service
OPENAI_MODEL=AXERA-TECH/Qwen3-1.7B
HOST=0.0.0.0
PORT=8000
SSL_CERT=cert.pem
SSL_KEY=key.pem
AX_MODEL_DIR=/path/to/ax_model
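The list above can be applied as ordinary shell exports before launching the server; the values below are the same placeholders shown above, not verified defaults.

```shell
# Placeholder values from this README; adjust to your deployment.
export OPENAI_API_KEY=xxx
export OPENAI_BASE_URL=http://127.0.0.1:8001/v1   # local OpenAI-compatible service
export OPENAI_MODEL=AXERA-TECH/Qwen3-1.7B
export HOST=0.0.0.0
export PORT=8000
export AX_MODEL_DIR=/path/to/ax_model
# then start the server:
# python -m ax_meeting.server
```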
Dependency note (WebSocket): install websockets or uvicorn[standard]; otherwise real-time browser streaming will fail.
Device permission note: on a /dev/axcl_host permission error, run with an account that has access, or with sudo.
axengine dependency note: pyaxengine is required; place the local wheel in ax_meeting/vendor/, or set AXENGINE_WHEEL=/path/to/pyaxengine.whl.
HTTPS (recommended; makes browser microphone permission easier):
openssl req -x509 -newkey rsa:2048 -nodes \
  -keyout key.pem -out cert.pem -days 365 \
  -subj "/CN=<your-IP>"
SSL_CERT=cert.pem SSL_KEY=key.pem python -m ax_meeting.server
Use the bundled self-signed certificate (packaged under ax_meeting/certs/ by default):
SSL_CERT=ax_meeting/certs/cert.pem SSL_KEY=ax_meeting/certs/key.pem python -m ax_meeting.server
Web UI parameters (speaker clustering):
- mer_cos: smaller values split speakers more readily (more sensitive, may over-split)
- min_cluster_size: smaller values tend to yield more speakers
- AHC is more stable but can be conservative; spectral is more sensitive

Build the wheel:
./build_wheel.sh
The build output is placed in the dist/ directory.
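The effect of the merge threshold can be illustrated with a toy sketch. This is a conceptual illustration, not the ax_meeting clustering code; `cluster_by_threshold` and the 3-dimensional embeddings are invented for the demo. The point it shows: a smaller cosine-distance merge threshold leaves more clusters, i.e. more distinct speakers.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means identical direction.
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_by_threshold(embeddings, mer_cos):
    """Greedy clustering: attach a segment to the first cluster whose
    centroid is within mer_cos cosine distance, else open a new cluster."""
    clusters = []  # each cluster is a list of embedding indices
    for i, e in enumerate(embeddings):
        for c in clusters:
            centroid = np.mean(embeddings[c], axis=0)
            if cosine_distance(e, centroid) <= mer_cos:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

rng = np.random.default_rng(0)
# Two synthetic "speakers": tight clouds around orthogonal directions.
spk_a = rng.normal([1.0, 0.0, 0.0], 0.05, size=(4, 3))
spk_b = rng.normal([0.0, 1.0, 0.0], 0.05, size=(4, 3))
embs = np.vstack([spk_a, spk_b])
# Moderate threshold: the two speakers separate cleanly into 2 clusters.
print(len(cluster_by_threshold(embs, 0.3)))
# Very strict (small) threshold: segments over-split into more "speakers".
print(len(cluster_by_threshold(embs, 1e-4)))
```

The same trade-off applies to min_cluster_size: stricter settings surface more speakers at the cost of possible over-splitting.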
Run speaker clustering + ASR on a single meeting audio file and export the text; an optional meeting summary can be generated (the LLM is configured via parameters):
python -m ax_meeting.vad_asr_cli --input wav/vad_example.wav --output_dir output_dir
Speaker diarization + ASR (offline):
python -m ax_meeting.diar_asr_cli \
  --wav_file wav/vad_example.wav \
  --output_dir output_dir
Meeting summary:
python -m ax_meeting.summarize_cli \
  --input output_dir/vad_example.txt \
  --openai_base_url http://127.0.0.1:8001/v1 \
  --openai_model AXERA-TECH/Qwen3-1.7B \
  --openai_api_key xxx
Python API:
from ax_meeting import VadAsrEngine, DiarAsrEngine, IncrementalSummarizer
# VAD + ASR (streaming)
vad_asr = VadAsrEngine(stream=True)
vad_asr.feed(audio_chunk)  # accepts numpy / bytes / path / list
segments = vad_asr.poll()  # may return an empty result
# Speaker diarization + ASR (offline)
diar = DiarAsrEngine()
text = diar.transcribe("wav/vad_example.wav")
# Meeting summary
summarizer = IncrementalSummarizer()
summary = summarizer.summarize_incrementally(text)
Example scripts:
- examples/vad_asr_stream.py
- examples/diar_asr_offline.py
- examples/summarize_text.py

AX650N
| model | latency(ms) |
|---|---|
| vad | 5.441 |
| campplus | 2.907 |
| sensevoice | 25.482 |
RTF: approximately 0.2
Example:
Inference time for vad_example.wav: 10.92 seconds
- VAD processing time: 2.20 seconds
- Speaker embedding extraction time: 1.88 seconds
- Speaker clustering time: 0.16 seconds
- ASR processing time: 3.75 seconds
load model + Inference time for vad_example.wav: 13.08 seconds
Audio duration: 70.47 seconds
RTF: 0.15
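RTF (real-time factor) here is processing time divided by audio duration; values below 1.0 mean faster than real time. The figures above can be reproduced directly:

```python
# Real-time factor: processing seconds per second of audio.
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    return processing_seconds / audio_seconds

# Figures from the sample run above (70.47 s of audio):
print(f"{rtf(10.92, 70.47):.2f}")  # inference only -> 0.15
print(f"{rtf(13.08, 70.47):.2f}")  # including model load -> 0.19
```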
Reference:
Base model
FunAudioLLM/SenseVoiceSmall