---
license: apache-2.0
language:
- en
tags:
- audio
- medical-audio
- respiratory-sounds
- cardiac-sounds
- auscultation
- cardiopulmonary
- representation-learning
- cross-modal-alignment
- audio-language-alignment
- self-supervised-learning
- clinical-ai
- pytorch
pipeline_tag: feature-extraction
library_name: pytorch
arxiv: 2512.04847
---
![Screenshot 2026-04-27 17.30.20](https://cdn-uploads.huggingface.co/production/uploads/6506cb686ba49887d312cfa2/C4gFTr-FqYuJazDm_-Xwn.png)
# AcuLa
AcuLa (**Audio–Clinical Understanding via Language Alignment**) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.
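For intuition, a common way to implement such an audio-text alignment objective is a symmetric contrastive loss between projected embeddings. The sketch below is a generic illustration, not the exact objective from the paper; the feature dimensions and projection heads are placeholders.

```python
import torch
import torch.nn.functional as F

# Illustrative dimensions only; the real projection heads live in the AcuLa codebase.
audio_proj = torch.nn.Linear(768, 512)   # audio encoder features -> shared space
text_proj = torch.nn.Linear(2048, 512)   # language-model features -> shared space

def alignment_loss(audio_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE between paired audio and report embeddings."""
    a = F.normalize(audio_proj(audio_feats), dim=-1)
    t = F.normalize(text_proj(text_feats), dim=-1)
    logits = a @ t.T / temperature              # (B, B) similarity matrix
    targets = torch.arange(a.size(0))           # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```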
This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:
**GitHub:** https://github.com/janine714/AcuLA
This work is described in the paper **“Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”**
---
## Intended Use
AcuLa is intended for research on clinically informed medical audio representation learning.
| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |
AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.
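As a concrete example of the retrieval use case, the sketch below ranks candidate report embeddings against a single audio embedding by cosine similarity. The embedding arrays are random placeholders; in practice they would come from the aligned encoder and projection heads.

```python
import numpy as np

# Placeholder embeddings; real ones come from the aligned model.
audio_emb = np.random.randn(512)           # one audio clip embedding
text_embs = np.random.randn(100, 512)      # 100 candidate report embeddings

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

scores = l2_normalize(text_embs) @ l2_normalize(audio_emb)
top5 = np.argsort(-scores)[:5]             # indices of the five closest reports
```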
---
## Installation
Clone the GitHub repository:
```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```
Install dependencies:
```bash
pip install -r requirements.txt
```
If using OPERA-family encoders, please make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.
---
## Training
To train AcuLa, run the following from the root of the cloned repository:
```bash
python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb
```
Expected CSV format:
| Column | Description |
|---|---|
| `audio_path` | Path to the audio recording |
| `Gen_Report` | Clinical text report paired with the audio recording |
Example:
| audio_path | Gen_Report |
|---|---|
| `/path/to/audio.wav` | `The recording is consistent with normal pulmonary findings...` |
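As a minimal illustration of producing this file, the snippet below writes a two-column CSV with pandas; the paths and report text are placeholders.

```python
import pandas as pd

# Placeholder rows; real reports are generated from dataset metadata.
rows = [
    {"audio_path": "/data/icbhi/patient_001.wav",
     "Gen_Report": "The recording is consistent with normal pulmonary findings."},
    {"audio_path": "/data/circor/patient_002.wav",
     "Gen_Report": "A systolic murmur is audible over the recording."},
]

pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)
```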
---
## Checkpoint Loading
The checkpoint can be loaded together with the AcuLa codebase.
```python
import torch

from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

audio_model = initialize_pretrained_model(pretrain="operaGT")

ckpt = torch.load(checkpoint_path, map_location="cpu")
if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()
```
Extract audio features:
```python
import torch

with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
```
The variable `audio_input` should follow the preprocessing format expected by the selected audio encoder.
---
## Input Format
AcuLa expects medical audio recordings that are preprocessed into the format required by the selected audio encoder.
A typical preprocessing setup is:
| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |
During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
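A minimal preprocessing sketch matching the settings above, written with torchaudio; each encoder may expect slightly different transform parameters, so treat this as illustrative rather than the exact AcuLa pipeline.

```python
import torch
import torchaudio

TARGET_SR = 16_000       # 16 kHz sampling rate
SEGMENT_SECONDS = 8      # fixed-length segment
N_MELS = 64              # number of mel bins

def preprocess(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)  # mix down to mono
    if sr != TARGET_SR:
        waveform = torchaudio.transforms.Resample(sr, TARGET_SR)(waveform)

    # Pad or truncate to the fixed segment length.
    target_len = TARGET_SR * SEGMENT_SECONDS
    if waveform.size(1) < target_len:
        waveform = torch.nn.functional.pad(waveform, (0, target_len - waveform.size(1)))
    else:
        waveform = waveform[:, :target_len]

    # Log-mel spectrogram.
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=TARGET_SR, n_mels=N_MELS)(waveform)
    return torch.log(mel + 1e-6)
```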
---
## Training Data
AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.
| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |
The paper reports more than 100,000 paired audio-report samples for alignment.
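For intuition, generating a report from structured metadata might look like the hypothetical template below; the field names and wording are invented for illustration and are not the paper's generation pipeline.

```python
# Hypothetical metadata-to-report template; field names are illustrative only.
def metadata_to_report(meta: dict) -> str:
    finding = "abnormal" if meta.get("label") == 1 else "normal"
    return (f"Recording of {meta.get('modality', 'auscultation')} sounds "
            f"from a {meta.get('age', 'unknown')}-year-old subject, "
            f"consistent with {finding} findings.")

print(metadata_to_report({"modality": "lung", "age": 54, "label": 0}))
```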
---
## Downstream Evaluation
The paper evaluates AcuLa on 18 cardio-respiratory tasks.
| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |
The main evaluation protocol uses frozen embeddings and lightweight supervised prediction heads, allowing performance differences to reflect representation quality.
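A minimal linear-probe sketch in this spirit, using scikit-learn on frozen embeddings; the feature and label arrays are placeholders for embeddings extracted with `forward_feature`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder frozen embeddings and binary labels.
X = np.random.randn(500, 512)
y = np.random.randint(0, 2, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUROC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```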
---
## Reported Findings
The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.
| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains on challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |
Please refer to the paper for full task-by-task results and experimental details.
---
## Checkpoint Contents
Depending on the uploaded variant, the checkpoint may contain one or more of the following components:
| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |
Users can inspect the checkpoint keys with:
```python
import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
```
---
## Limitations
| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |
---
## Citation
Please cite the paper if you use this checkpoint:
```bibtex
@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding},
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847},
}
```
---
## Acknowledgment
This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.