
GLM-ASR with vLLM

An automatic speech recognition (ASR) project that integrates the GLM-ASR model with vLLM for high-performance inference. This project provides both local inference capabilities and a scalable API server using Docker.

You can find more of the code in the GitHub project.

This project is an extension/supplement to the original GLM-ASR project, adding vLLM integration for production-ready deployment and OpenAI-compatible API support.

Features

  • Audio Transcription: Transcribe audio files using GLM-ASR model
  • Audio Description: Generate textual descriptions of audio content
  • OpenAI-Compatible API: vLLM server provides OpenAI-compatible API endpoints
  • Docker Support: Easy deployment with Docker and Docker Compose
  • High Performance: Leverages vLLM for efficient GPU-accelerated inference
  • Flexible Audio Input: Supports various audio formats and input methods

Project Structure

glm_asr_vllm/
β”œβ”€β”€ model/                  # Model configuration and implementation
β”‚   β”œβ”€β”€ configuration_glmasr.py    # GLM-ASR configuration
β”‚   β”œβ”€β”€ modeling_glmasr.py         # GLM-ASR model implementation
β”‚   β”œβ”€β”€ modeling_audio.py          # Audio encoding/decoding
β”‚   └── processing_glmasr.py       # Audio processing utilities
β”œβ”€β”€ server/                # vLLM integration files
β”‚   β”œβ”€β”€ glmasr_audio.py     # Audio processing for vLLM
β”‚   β”œβ”€β”€ glm_asr.py          # GLM-ASR vLLM model wrapper
β”‚   β”œβ”€β”€ registry.py         # Model registry (vLLM)
β”‚   └── server_ws.py        # WebSocket server
β”œβ”€β”€ wavs/                  # Sample audio files
β”œβ”€β”€ docker-compose.yaml     # Docker Compose configuration
β”œβ”€β”€ dockerfile             # Docker image build configuration
β”œβ”€β”€ hf_demo.py             # HuggingFace Transformers demo
└── test_vllm_api.py       # OpenAI API client test script

Prerequisites

  • Python 3.12+
  • CUDA-capable GPU (recommended)
  • Docker (for containerized deployment)
  • Docker Compose (optional)

Installation

Option 1: Local Installation

  1. Clone the repository:
git clone <repository-url>
cd glm_asr_vllm
  2. Install dependencies:
pip install torch transformers soundfile librosa openai
  3. Download the model from HuggingFace and place it in the ./model/ directory:
# Download using huggingface-cli (recommended)
huggingface-cli download bupalinyu/glm-asr-eligant --local-dir ./model

# Or use git lfs
git lfs install
git clone https://huggingface.co/bupalinyu/glm-asr-eligant ./model

Note: The model ID on HuggingFace is bupalinyu/glm-asr-eligant. After downloading, ensure all model files are in the ./model/ directory.

Option 2: Docker Deployment

  1. Build the Docker image:
docker build -t vllm-glmasr:latest .
  2. Deploy using Docker Compose:
docker-compose up -d

Usage

HuggingFace Transformers Demo

Run the local demo script to transcribe audio files:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "./model/",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to("cuda")

processor = AutoProcessor.from_pretrained("./model/", trust_remote_code=True)

# Define conversations
conversations = [
    [
        {
            "role": "user",
            "content": [
                {"type": "audio", "path": "./wavs/dufu.wav"},
                {"type": "text", "text": "Please transcribe this audio."},
            ],
        }
    ],
]

# Process and generate
inputs = processor.apply_chat_template(
    conversations,
    return_tensors="pt",
    sampling_rate=16000,
    audio_padding="longest",
).to("cuda")

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)

print(processor.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True))

Run the demo:

python hf_demo.py

vLLM API Server

Start the Server

Using Docker Compose:

docker-compose up -d

Or manually with Docker:

docker run -d \
  --name vllm-glmasr \
  --gpus all \
  --ipc host \
  --shm-size 8gb \
  -p 8300:8300 \
  -e CUDA_VISIBLE_DEVICES=2 \
  vllm-glmasr:latest

The server will be available at http://localhost:8300
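Once the container is up, you can confirm the OpenAI-compatible endpoint is reachable using vLLM's standard /v1/models route (adjust the port if you changed the mapping):

```shell
# List the served models; the response should include "glm-asr-eligant"
# once the server has finished loading.
curl -s http://localhost:8300/v1/models \
  -H "Authorization: Bearer EMPTY"
```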

API Client Example

Use the OpenAI-compatible API to transcribe audio:

import base64
import io
import soundfile as sf
import librosa
import numpy as np
from openai import OpenAI

# Configure client
client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8300/v1"
)

# Load and prepare audio
def load_wav_16k(path: str):
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    audio = audio.astype(np.float32)
    if sr != 16000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=16000).astype(np.float32)
        sr = 16000  # update the rate so the WAV header matches the resampled data
    return audio, sr

# Convert to base64
def wav_to_base64(wav: np.ndarray, sr: int) -> str:
    buf = io.BytesIO()
    sf.write(buf, wav, sr, format="WAV", subtype="PCM_16")
    return base64.b64encode(buf.getvalue()).decode("utf-8")

# Transcribe
pcm, sr = load_wav_16k("path/to/audio.wav")
audio_b64 = wav_to_base64(pcm, sr)

resp = client.chat.completions.create(
    model="glm-asr-eligant",
    max_completion_tokens=256,
    temperature=0.0,
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please transcribe this audio.<|audio|>"},
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": audio_b64,
                        "format": "wav",
                    },
                },
            ],
        }
    ],
)

print(resp.choices[0].message.content)

Run the test script:

python test_vllm_api.py

Configuration

Docker Compose Settings

Modify docker-compose.yaml to adjust:

  • GPU Selection: CUDA_VISIBLE_DEVICES environment variable
  • Port: ports mapping (default: 8300:8300)
  • GPU Memory: gpu-memory-utilization parameter (default: 0.1)
  • Model Length: max-model-len parameter (default: 4096)

vLLM Server Parameters

Key parameters configured in docker-compose.yaml:

  • --host: Server host address (default: 0.0.0.0)
  • --port: Server port (default: 8300)
  • --served-model-name: Model name for API calls (default: glm-asr-eligant)
  • --dtype: Data type (default: auto)
  • --tensor-parallel-size: Tensor parallelism size (default: 1)
  • --max-model-len: Maximum model sequence length (default: 4096)
  • --trust-remote-code: Allow remote code execution
  • --gpu-memory-utilization: GPU memory utilization 0-1 (default: 0.1)
  • --api-key: API key for authentication (default: EMPTY)

Model Architecture

GLM-ASR combines:

  • Whisper Encoder: Audio feature extraction
  • LLM Backbone: Text generation (based on GLM architecture)
  • Multimodal Adapter: Bridges audio and text representations

Key configurations from configuration_glmasr.py:

  • Adapter Type: MLP (default) with merge factor of 4
  • RoPE: Rotary Position Embeddings enabled
  • Spec Aug: Spectral augmentation (disabled by default)
  • Max Whisper Length: 1500 tokens
  • MLP Activation: GELU

Audio Input Requirements

  • Sampling Rate: 16 kHz (audio will be resampled if needed)
  • Channels: Mono (stereo will be downmixed to mono)
  • Formats: WAV, FLAC, OGG (via base64 encoding)
  • Duration: Limited by max_model_len parameter
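As an illustration of these constraints, the snippet below synthesizes one second of 16 kHz mono PCM-16 audio using only the standard library and base64-encodes it, producing the kind of payload the API examples expect. This is a sketch for testing; real inputs would come from an audio file:

```python
import array
import base64
import io
import math
import wave

def make_test_wav_b64(duration_s: float = 1.0, sr: int = 16000) -> str:
    """Generate a 440 Hz sine tone as a 16 kHz mono PCM-16 WAV, base64-encoded."""
    n = int(duration_s * sr)
    samples = array.array(
        "h",
        (int(32767 * 0.3 * math.sin(2 * math.pi * 440 * i / sr)) for i in range(n)),
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono, per the requirements above
        w.setsampwidth(2)   # 16-bit PCM
        w.setframerate(sr)  # 16 kHz
        w.writeframes(samples.tobytes())
    return base64.b64encode(buf.getvalue()).decode("utf-8")

audio_b64 = make_test_wav_b64()
print(audio_b64[:4])  # WAV files start with "RIFF", which encodes to "UklG"
```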

The processor (processing_glmasr.py) supports:

  • Audio file paths
  • NumPy arrays
  • Base64 encoded audio
  • Batch processing with padding

API Reference

Chat Completions Endpoint

POST /v1/chat/completions

Request body:

{
  "model": "glm-asr-eligant",
  "max_completion_tokens": 256,
  "temperature": 0.0,
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "Please transcribe this audio.<|audio|>"
        },
        {
          "type": "input_audio",
          "input_audio": {
            "data": "<base64_encoded_audio>",
            "format": "wav"
          }
        }
      ]
    }
  ]
}
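The request body above can be assembled programmatically with just the standard library. A minimal sketch, where the audio_b64 argument stands in for real base64-encoded audio:

```python
import json

def build_transcription_request(audio_b64: str, model: str = "glm-asr-eligant") -> str:
    """Build the JSON body for a chat-completions transcription request."""
    payload = {
        "model": model,
        "max_completion_tokens": 256,
        "temperature": 0.0,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Please transcribe this audio.<|audio|>"},
                    {
                        "type": "input_audio",
                        "input_audio": {"data": audio_b64, "format": "wav"},
                    },
                ],
            }
        ],
    }
    return json.dumps(payload)

body = build_transcription_request("UklGRi...")  # placeholder, not real audio
print(json.loads(body)["model"])  # → glm-asr-eligant
```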

vLLM Integration

The project integrates GLM-ASR with vLLM through the files in server/:

  • glm_asr.py: GLM-ASR model wrapper for vLLM
  • glmasr_audio.py: audio processing for vLLM inputs
  • registry.py: model registry that makes the GLM-ASR architecture known to vLLM

Troubleshooting

GPU Memory Issues

Reduce gpu-memory-utilization or decrease max-model-len in docker-compose.yaml

Slow Inference

  • Enable tensor parallelism with --tensor-parallel-size
  • Ensure proper GPU selection via CUDA_VISIBLE_DEVICES
  • Check GPU utilization with nvidia-smi

Connection Refused

  • Verify the Docker container is running: docker ps
  • Check port mapping is correct
  • Ensure firewall allows traffic on port 8300

Model Loading Issues

  • Verify model weights are in the correct directory (./model/)
  • Check trust_remote_code is enabled
  • Ensure sufficient disk space for model files

License

This project uses the Apache 2.0 license (see server/registry.py).

Acknowledgments

  • GLM-ASR Model: Original model authors
  • vLLM: High-performance LLM inference engine
  • Transformers: HuggingFace model utilities
  • Whisper: OpenAI audio encoder

Related Projects

GPA - ASR, TTS and Voice Conversion in One


A unified audio model that combines ASR (Automatic Speech Recognition), TTS (Text-to-Speech), and voice conversion in just 0.3B parameters. This model is specifically designed for:

  • Edge deployment: Lightweight model suitable for mobile devices and edge computing
  • Commercial services: Optimized for large-scale production deployment
  • All-in-one solution: Single model for speech recognition, synthesis, and voice conversion

If you need a more compact, multi-functional audio solution for edge or commercial scenarios, consider exploring the GPA project.

Model size: 2B parameters (BF16, Safetensors)