---
license: apache-2.0
language:
  - en
tags:
  - audio
  - medical-audio
  - respiratory-sounds
  - cardiac-sounds
  - auscultation
  - cardiopulmonary
  - representation-learning
  - cross-modal-alignment
  - audio-language-alignment
  - self-supervised-learning
  - clinical-ai
  - pytorch
pipeline_tag: feature-extraction
library_name: pytorch
arxiv: 2512.04847
---

![Screenshot 2026-04-27 17.30.20](https://cdn-uploads.huggingface.co/production/uploads/6506cb686ba49887d312cfa2/C4gFTr-FqYuJazDm_-Xwn.png)

# AcuLa

AcuLa (**Audio–Clinical Understanding via Language Alignment**) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.

This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:

**GitHub:** https://github.com/janine714/AcuLA

This work is described in the paper **“Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”**

---

## Intended Use

AcuLa is intended for research on clinically informed medical audio representation learning.

| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |

AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.
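
The linear-probing use case can be illustrated with a lightweight classifier trained on frozen embeddings. The sketch below is illustrative only: it assumes embeddings have already been extracted with `forward_feature` and saved to the hypothetical files `embeddings.npy` and `labels.npy`, and it uses scikit-learn rather than any utility from the AcuLa codebase.

    # Minimal linear-probing sketch on frozen embeddings (binary classification).
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    X = np.load("embeddings.npy")   # (num_clips, embedding_dim), hypothetical file
    y = np.load("labels.npy")       # (num_clips,), binary labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )

    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)

    scores = probe.predict_proba(X_test)[:, 1]
    print("AUROC:", roc_auc_score(y_test, scores))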


---

## Installation

Clone the GitHub repository:

    git clone https://github.com/janine714/AcuLA
    cd AcuLA

Install dependencies:

    pip install -r requirements.txt

If using OPERA-family encoders, please make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.

---


## Training

To train AcuLa, run the following from the cloned repository root:

    python main.py \
      --csv_path /path/to/combined_dataset.csv \
      --audio_ckpt /path/to/encoder-operaGT.ckpt \
      --output_dir ./checkpoints \
      --audio_backbone operaGT \
      --llm_type google/medgemma-4b-pt \
      --epochs 50 \
      --batch_size 24 \
      --grad_accum_steps 2 \
      --warmup_steps 400 \
      --lr 1e-5 \
      --lambda_align 1.0 \
      --lambda_mam 1.0 \
      --use_wandb

Expected CSV format:

| Column | Description |
|---|---|
| `audio_path` | Path to the audio recording |
| `Gen_Report` | Clinical text report paired with the audio recording |

Example:

| audio_path | Gen_Report |
|---|---|
| `/path/to/audio.wav` | `The recording is consistent with normal pulmonary findings...` |
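
A CSV in this format can be assembled with pandas; the rows below are placeholders, not real data:

    import pandas as pd

    # Placeholder rows; replace with your own audio paths and paired reports.
    rows = [
        {"audio_path": "/path/to/audio_001.wav",
         "Gen_Report": "The recording is consistent with normal pulmonary findings."},
        {"audio_path": "/path/to/audio_002.wav",
         "Gen_Report": "The recording suggests the presence of wheezes."},
    ]

    pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)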

---

## Checkpoint Loading

The checkpoint can be loaded using the AcuLa codebase:

    import torch
    from audio_encoder import initialize_pretrained_model

    checkpoint_path = "path/to/acula_checkpoint.pt"

    # Instantiate the backbone encoder, then load the aligned AcuLa weights into it.
    audio_model = initialize_pretrained_model(pretrain="operaGT")
    ckpt = torch.load(checkpoint_path, map_location="cpu")

    # Checkpoint variants may store the encoder weights under different keys.
    if "audio_model_state_dict" in ckpt:
        state_dict = ckpt["audio_model_state_dict"]
    elif "state_dict" in ckpt:
        state_dict = ckpt["state_dict"]
    else:
        state_dict = ckpt

    # strict=False tolerates keys that do not belong to the encoder (e.g. projection heads).
    audio_model.load_state_dict(state_dict, strict=False)
    audio_model.eval()

Extract audio features:

    import torch

    with torch.no_grad():
        features = audio_model.forward_feature(audio_input)

The variable `audio_input` should follow the preprocessing format expected by the selected audio encoder.

---

## Input Format

AcuLa expects medical audio recordings that are preprocessed into the format required by the selected audio encoder.

A typical preprocessing setup is:

| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |

During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
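
As a rough illustration of this setup, the sketch below loads a recording with torchaudio, resamples it to 16 kHz, pads or truncates it to an 8-second segment, and computes a 64-bin log-mel spectrogram. The `n_fft` and `hop_length` values are assumptions chosen for illustration; match them, and the rest of the pipeline, to the preprocessing of the encoder you actually use.

    import torch
    import torchaudio

    TARGET_SR = 16_000      # 16 kHz sampling rate
    SEGMENT_SECONDS = 8     # fixed-length segment
    N_MELS = 64             # number of mel bins

    def preprocess(path: str) -> torch.Tensor:
        waveform, sr = torchaudio.load(path)
        waveform = waveform.mean(dim=0, keepdim=True)  # convert to mono
        if sr != TARGET_SR:
            waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

        # Pad or truncate to the fixed segment length.
        target_len = TARGET_SR * SEGMENT_SECONDS
        if waveform.shape[-1] < target_len:
            waveform = torch.nn.functional.pad(waveform, (0, target_len - waveform.shape[-1]))
        else:
            waveform = waveform[..., :target_len]

        # Log-mel spectrogram; n_fft and hop_length here are illustrative.
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=TARGET_SR, n_fft=1024, hop_length=160, n_mels=N_MELS
        )(waveform)
        return torch.log(mel + 1e-6)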

---

## Training Data

AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.

| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |

The paper reports more than 100,000 paired audio-report samples for alignment.

---

## Downstream Evaluation

The paper evaluates AcuLa on 18 cardio-respiratory tasks.

| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |

The main evaluation protocol uses frozen embeddings and lightweight supervised prediction heads, allowing performance differences to reflect representation quality.
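
For regression-style tasks such as lung function estimation, the same protocol can be sketched with a ridge-regression head on frozen embeddings. As in the earlier probing sketch, the file names are hypothetical and scikit-learn is used for illustration only:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    X = np.load("embeddings.npy")     # frozen clip embeddings, hypothetical file
    y = np.load("fvc_targets.npy")    # continuous targets (e.g. FVC), hypothetical file

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    head = Ridge(alpha=1.0)
    head.fit(X_train, y_train)
    print("MAE:", mean_absolute_error(y_test, head.predict(X_test)))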

---

## Reported Findings

The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.

| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains on challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |

Please refer to the paper for full task-by-task results and experimental details.

---

## Checkpoint Contents

Depending on the uploaded checkpoint variant, the checkpoint may contain one or more of the following components:

| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |

Users can inspect the checkpoint keys with:

    import torch

    ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
    print(ckpt.keys())
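
If both projection heads are available, retrieval in the shared space reduces to cosine similarity between projected audio and text embeddings. The sketch below assumes those embedding matrices have already been computed with the codebase's projection heads and uses random tensors purely as stand-ins:

    import torch
    import torch.nn.functional as F

    # Stand-ins for shared-space embeddings: (num_audio, dim) and (num_reports, dim).
    audio_emb = torch.randn(100, 512)
    text_emb = torch.randn(40, 512)

    # Cosine similarity is the dot product of L2-normalized embeddings.
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    similarity = audio_emb @ text_emb.T        # (num_audio, num_reports)

    # For each audio clip, the index of the most similar clinical report.
    best_report = similarity.argmax(dim=-1)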

---

## Limitations

| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |

---

## Citation

Please cite the paper if you use this checkpoint:

    @misc{wang2026languagemodelssemanticteachers,
      title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding}, 
      author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
      year={2026},
      eprint={2512.04847},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2512.04847}, 
    }

---

## Acknowledgment

This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.