---
license: apache-2.0
language:
- en
tags:
- audio
- medical-audio
- respiratory-sounds
- cardiac-sounds
- auscultation
- cardiopulmonary
- representation-learning
- cross-modal-alignment
- audio-language-alignment
- self-supervised-learning
- clinical-ai
- pytorch
pipeline_tag: feature-extraction
library_name: pytorch
arxiv: 2512.04847
---

![AcuLa overview](https://cdn-uploads.huggingface.co/production/uploads/6506cb686ba49887d312cfa2/C4gFTr-FqYuJazDm_-Xwn.png)

# AcuLa

AcuLa (**Audio–Clinical Understanding via Language Alignment**) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.

This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:

**GitHub:** https://github.com/janine714/AcuLA

This work is described in the paper **"Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding."**

---

## Intended Use

AcuLa is intended for research on clinically informed medical audio representation learning.

| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |

AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.

---

## Installation

Clone the GitHub repository:

```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```

Install dependencies:

```bash
pip install -r requirements.txt
```

If you use OPERA-family encoders, make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.

---

## Training

To train AcuLa, first clone the repository:

```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```

Then run training with:

```bash
python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb
```

Expected CSV format:

| Column | Description |
|---|---|
| `audio_path` | Path to the audio recording |
| `Gen_Report` | Clinical text report paired with the audio recording |

Example:

| audio_path | Gen_Report |
|---|---|
| `/path/to/audio.wav` | `The recording is consistent with normal pulmonary findings...` |
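If you assemble the CSV programmatically, a minimal sketch with pandas is shown below. The column names follow the table above; the paths and report strings are placeholders for your own paired data, and the output file name simply matches the `--csv_path` argument used in the training command.

```python
# Minimal sketch: assemble the alignment CSV expected by main.py.
# Column names follow the table above; paths and report strings
# are placeholders, not real data.
import pandas as pd

rows = [
    {
        "audio_path": "/path/to/audio_001.wav",
        "Gen_Report": "The recording is consistent with normal pulmonary findings.",
    },
    {
        "audio_path": "/path/to/audio_002.wav",
        "Gen_Report": "A murmur is audible over the recording.",
    },
]

pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)
```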
---

## Checkpoint Loading

The checkpoint can be loaded together with the AcuLa codebase:

```python
import torch

from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

# Initialize the pretrained backbone, then overwrite it with the
# aligned AcuLa weights.
audio_model = initialize_pretrained_model(pretrain="operaGT")
ckpt = torch.load(checkpoint_path, map_location="cpu")

# Checkpoint variants store the encoder weights under different keys.
if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()
```

Extract audio features:

```python
import torch

with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
```

The variable `audio_input` should follow the preprocessing format expected by the selected audio encoder.

---

## Input Format

AcuLa expects medical audio recordings preprocessed into the format required by the selected audio encoder. A typical preprocessing setup is:

| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |

During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
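As a concrete illustration, the hedged sketch below prepares a waveform along these lines with torchaudio. The sampling rate, segment length, and mel-bin count come from the table above; the `n_fft` and `hop_length` values are assumptions and must be matched to the preprocessing of the encoder you actually load.

```python
# Hedged sketch: build a log-mel input roughly matching the table above.
# Sampling rate, segment length, and mel-bin count come from the card;
# n_fft and hop_length are illustrative assumptions.
import torch
import torchaudio

TARGET_SR = 16_000       # 16 kHz sampling rate
SEGMENT_SECONDS = 8      # fixed-length segment of about 8 seconds
N_MELS = 64              # 64 mel bins

def preprocess(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

    # Pad or truncate to the fixed segment length.
    target_len = TARGET_SR * SEGMENT_SECONDS
    if waveform.shape[-1] < target_len:
        waveform = torch.nn.functional.pad(
            waveform, (0, target_len - waveform.shape[-1])
        )
    else:
        waveform = waveform[..., :target_len]

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=TARGET_SR,
        n_fft=1024,        # assumption, not specified by the card
        hop_length=512,    # assumption, not specified by the card
        n_mels=N_MELS,
    )(waveform)
    return torch.log(mel + 1e-6)  # log-mel spectrogram
```

The resulting tensor would then play the role of `audio_input` in the feature-extraction snippet above, subject to any shape, normalization, or batching conventions of the selected backbone.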
---

## Training Data

AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.

| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |

The paper reports more than 100,000 paired audio-report samples for alignment.

---

## Downstream Evaluation

The paper evaluates AcuLa on 18 cardio-respiratory tasks.

| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |

The main evaluation protocol uses frozen embeddings with lightweight supervised prediction heads, so performance differences reflect representation quality rather than head capacity.

---

## Reported Findings

The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.

| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains in challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |

Please refer to the paper for full task-by-task results and experimental details.

---

## Checkpoint Contents

Depending on the variant, the checkpoint may contain one or more of the following components:

| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |

You can inspect the checkpoint keys with:

```python
import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
```

---

## Limitations

| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |

---

## Citation

Please cite the paper if you use this checkpoint:

```bibtex
@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding},
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847},
}
```

---

## Acknowledgment

This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.