tsnngw committed on
Commit
9e30bee
·
verified ·
1 Parent(s): f867275

Update README.md

Files changed (1)
  1. README.md +81 -93
README.md CHANGED
@@ -8,6 +8,7 @@ tags:
8
  - respiratory-sounds
9
  - cardiac-sounds
10
  - auscultation
 
11
  - representation-learning
12
  - cross-modal-alignment
13
  - audio-language-alignment
@@ -16,74 +17,104 @@ tags:
16
  - pytorch
17
  pipeline_tag: feature-extraction
18
  library_name: pytorch
 
19
  ---
20
 
21
- # AcuLa: Audio–Clinical Understanding via Language Alignment
22
 
23
- This repository provides a checkpoint associated with the paper **“Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”**
24
 
25
- AcuLa is a lightweight post-training alignment framework for improving medical audio representations. The method aligns a pretrained audio encoder with clinical-language representations from a language model, encouraging audio embeddings to capture clinically meaningful semantic structure while preserving fine-grained acoustic information.
26
 
27
- The checkpoint is intended for research use, especially feature extraction, representation analysis, and downstream evaluation on cardio-respiratory audio tasks.
28
 
29
  ---
30
 
31
- ![Screenshot 2026-04-27 17.30.20](https://cdn-uploads.huggingface.co/production/uploads/6506cb686ba49887d312cfa2/cdaYDrKIqssBeueIs_esA.png)
32
 
33
- ## Model Overview
34
 
35
- | Field | Description |
36
  |---|---|
37
- | Method name | **AcuLa**: Audio–Clinical Understanding via Language Alignment |
38
- | Paper title | **Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding** |
39
- | Model type | Post-trained / aligned medical audio encoder |
40
- | Primary function | Audio representation learning and feature extraction |
41
- | Input modality | Medical audio |
42
- | Target domains | Respiratory sounds, cough sounds, breathing sounds, exhalation sounds, heart sounds |
43
- | Training paradigm | Audio-language representation alignment with self-supervised audio preservation |
44
- | Main use cases | Feature extraction, linear probing, transfer learning, retrieval-style analysis |
45
- | Framework | PyTorch |
 
 
46
 
47
  ---
48
 
 
49
 
50
- ## Intended Applications
51
 
52
- This checkpoint is designed for research on medical audio understanding and clinically informed audio representation learning.
 
53
 
54
- | Application | Description |
55
- |---|---|
56
- | Feature extraction | Extract frozen embeddings from cardio-respiratory audio |
57
- | Linear probing | Train lightweight downstream classifiers or regressors |
58
- | Transfer learning | Adapt the aligned encoder to task-specific medical audio datasets |
59
- | Representation analysis | Study semantic organization in audio embedding spaces |
60
- | Audio-text retrieval | Explore similarity between medical audio and clinical text representations |
61
- | Benchmarking | Compare audio-language alignment methods and pretrained audio encoders |
62
 
63
  ---
64
 
65
- ## Method Summary
66
 
67
- AcuLa follows a teacher-student alignment strategy. A pretrained language model provides clinical-language representations, while a pretrained audio encoder is adapted to better reflect those semantic structures.
68
 
69
- | Component | Role |
70
- |---|---|
71
- | Audio encoder | Encodes medical audio into acoustic representations |
72
- | Language model | Provides clinical-language semantic representations |
73
- | Audio projection head | Maps audio features into a shared representation space |
74
- | Language projection head | Maps language features into the same shared space |
75
- | Alignment objective | Encourages audio and language representations to share similar geometry |
76
- | Self-supervised objective | Preserves detailed acoustic modeling ability during alignment |
77
 
78
- The total training objective combines semantic alignment and acoustic preservation:
79
 
80
- L_total = lambda_align * L_align + lambda_ssm * L_ssm
 
81
 
82
- where `L_align` denotes the audio-language alignment loss and `L_ssm` denotes the self-supervised audio modeling loss.
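
As a minimal sketch of the weighted sum described in this (removed) passage, assuming both losses have already been computed as scalars. The function name `total_loss` and the default weights of 1.0 are illustrative assumptions; the README does not state the weights used in training.

```python
def total_loss(l_align, l_ssm, lambda_align=1.0, lambda_ssm=1.0):
    """Weighted sum of the alignment loss and the self-supervised loss.

    The default weights are placeholders (assumption), not values taken
    from the paper or the AcuLa codebase.
    """
    return lambda_align * l_align + lambda_ssm * l_ssm


# Example with made-up loss values.
print(total_loss(2.0, 3.0, lambda_align=0.5, lambda_ssm=2.0))  # 7.0
```

Because the function only uses `*` and `+`, the same expression should also work on scalar PyTorch tensors, so gradients would flow through both terms.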
83
 
84
  ---
85
 
86
 
 
87
  ## Training Data
88
 
89
  AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.
@@ -102,46 +133,6 @@ The paper reports more than 100,000 paired audio-report samples for alignment.
102
 
103
  ---
104
 
105
- ## Clinical Report Generation
106
-
107
- The paired text reports were generated from structured metadata associated with each audio recording. This provides scalable semantic supervision for audio-language alignment.
108
-
109
- | Metadata type | Examples |
110
- |---|---|
111
- | Recording information | Dataset, modality, recording condition |
112
- | Diagnostic labels | COVID-19, COPD, smoker status, murmur, symptomatic status |
113
- | Acoustic annotations | Crackles, wheezes, murmurs, normal findings |
114
- | Physiological information | Lung-function-related information when available |
115
- | Subject metadata | Demographic information when available |
116
-
117
- The generated reports are used to guide representation learning and provide clinically meaningful textual context for the audio recordings.
118
-
119
- ---
120
-
121
- ## Input Format
122
-
123
- The expected input is medical audio. A typical preprocessing pipeline follows the alignment setup used in the paper.
124
-
125
- | Step | Setting |
126
- |---|---|
127
- | Sampling rate | 16 kHz |
128
- | Segment length | Fixed-length segments, commonly around 8 seconds |
129
- | Audio representation | Log-mel spectrogram |
130
- | Number of mel bins | 64 |
131
- | Padding/truncation | Applied as needed |
132
- | Training augmentation | Optional |
133
-
134
- Possible training augmentations include:
135
-
136
- | Augmentation | Purpose |
137
- |---|---|
138
- | Volume adjustment | Robustness to loudness variation |
139
- | Normalization | Reduced recording-level amplitude variation |
140
- | Low-pass filtering | Robustness to frequency-response differences |
141
- | High-pass filtering | Robustness to recording-condition differences |
142
-
143
- ---
144
-
145
  ## Downstream Evaluation
146
 
147
  The paper evaluates AcuLa on 18 cardio-respiratory tasks.
@@ -172,6 +163,17 @@ Please refer to the paper for full task-by-task results and experimental details
172
 
173
  ---
174
 
175
 
176
 
177
  ## Limitations
@@ -186,20 +188,6 @@ Please refer to the paper for full task-by-task results and experimental details
186
 
187
  ---
188
 
189
- ## Ethical Considerations
190
-
191
- Medical audio research involves sensitive data and potential real-world implications. Users should evaluate models carefully before applying them beyond research settings.
192
-
193
- | Consideration | Description |
194
- |---|---|
195
- | Privacy | Medical audio data may contain sensitive information |
196
- | Consent | Data should be collected and used with appropriate consent |
197
- | Fairness | Performance should be evaluated across relevant demographic groups |
198
- | Robustness | Models should be tested across devices, environments, and recording conditions |
199
- | Expert review | Clinical interpretation should involve domain experts |
200
-
201
- ---
202
-
203
  ## Citation
204
 
205
  Please cite the paper if you use this checkpoint:
 
8
  - respiratory-sounds
9
  - cardiac-sounds
10
  - auscultation
11
+ - cardiopulmonary
12
  - representation-learning
13
  - cross-modal-alignment
14
  - audio-language-alignment
 
17
  - pytorch
18
  pipeline_tag: feature-extraction
19
  library_name: pytorch
20
+ arxiv: 2512.04847
21
  ---
22
 
23
+ # AcuLa
24
 
25
+ AcuLa (**Audio–Clinical Understanding via Language Alignment**) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, allowing the audio encoder to capture richer clinical semantics while preserving fine-grained acoustic information.
26
 
27
+ This repository provides the checkpoint for AcuLa. The accompanying code is available at:
28
 
29
+ **GitHub:** https://github.com/janine714/AcuLA
30
+
31
+ This work is described in the paper **“Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”**
32
+
33
+ ![Screenshot 2026-04-27 17.30.20](https://cdn-uploads.huggingface.co/production/uploads/6506cb686ba49887d312cfa2/KzktEnDsOrNqoBY-BMxuJ.png)
34
 
35
  ---
36
 
 
37
 
 
38
 
39
+ ## Intended Use
40
+
41
+ AcuLa is designed for research on clinically informed medical audio representation learning.
42
+
43
+ It can be used for:
44
+
45
+ | Task | Description |
46
  |---|---|
47
+ | Feature extraction | Extract embeddings from cardio-respiratory audio |
48
+ | Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
49
+ | Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
50
+ | Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
51
+ | Cardiac audio analysis | Study heart sound representations |
52
+ | Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
53
+ | Representation analysis | Analyze how clinical semantics are reflected in audio embeddings |
54
+
55
+ AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.
56
+
57
+ > This checkpoint is intended for research use.
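
The audio-text retrieval use case in the table above can be sketched as a nearest-neighbour ranking over embeddings. This is a hedged illustration only: it assumes audio and report embeddings have already been projected into a shared space, and it uses cosine similarity as the score, which the README does not specify. The names `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, not part of the AcuLa codebase.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       candidates: Sequence[Sequence[float]]) -> List[int]:
    """Return candidate indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 2-D vectors standing in for one audio embedding and three report embeddings.
audio_embedding = [1.0, 0.0]
report_embeddings = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_by_similarity(audio_embedding, report_embeddings))  # [1, 2, 0]
```

With real embeddings, the same ranking would normally be done in batch with tensor operations; the pure-Python version is kept here only to make the logic explicit.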
58
 
59
  ---
60
 
61
+ ## Installation
62
 
63
+ Clone the GitHub repository:
64
 
65
+ git clone https://github.com/janine714/AcuLA
66
+ cd AcuLA
67
 
68
+ Install the required dependencies:
69
+
70
+ pip install -r requirements.txt
71
+
72
+ If you use OPERA-family encoders, please make sure the required OPERA dependencies and checkpoints are available in your environment.
 
 
 
73
 
74
  ---
75
 
76
+ ## How to Use
77
 
78
+ The checkpoint can be loaded together with the AcuLa codebase.
79
 
80
+ First, clone the repository and enter the project directory:
81
+
82
+ git clone https://github.com/janine714/AcuLA
83
+ cd AcuLA
84
 
85
+ Then load the checkpoint:
86
 
87
+ import torch
88
+ from audio_encoder import initialize_pretrained_model
89
 
90
+ checkpoint_path = "path/to/acula.pt"
91
+
92
+ audio_model = initialize_pretrained_model(pretrain="operaGT")
93
+ ckpt = torch.load(checkpoint_path, map_location="cpu")
94
+
95
+ if "audio_model_state_dict" in ckpt:
96
+     state_dict = ckpt["audio_model_state_dict"]
97
+ elif "state_dict" in ckpt:
98
+     state_dict = ckpt["state_dict"]
99
+ else:
100
+     state_dict = ckpt
101
+
102
+ audio_model.load_state_dict(state_dict, strict=False)
103
+ audio_model.eval()
104
+
105
+ Extract audio features:
106
+
107
+ import torch
108
+
109
+ with torch.no_grad():
110
+     features = audio_model.forward_feature(audio_input)
111
+
112
+ The variable `audio_input` should follow the preprocessing format expected by the selected audio encoder.
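
The earlier revision of this README (the removed Input Format section in this diff) listed 16 kHz audio and fixed-length segments of roughly 8 seconds, with padding or truncation applied as needed. Assuming those settings still apply, the length-fixing step might look like the sketch below. The helper name `fix_length` and the choice to zero-pad at the end are assumptions; the log-mel spectrogram step is not shown, and the exact pipeline should be taken from the AcuLa codebase.

```python
from typing import List, Sequence

SAMPLE_RATE_HZ = 16_000  # from the earlier README revision's Input Format table
SEGMENT_SECONDS = 8.0    # "commonly around 8 seconds"; treated here as an assumption


def fix_length(samples: Sequence[float],
               sample_rate: int = SAMPLE_RATE_HZ,
               seconds: float = SEGMENT_SECONDS) -> List[float]:
    """Truncate or zero-pad a mono waveform to a fixed number of samples."""
    target_len = int(round(sample_rate * seconds))
    clipped = list(samples[:target_len])                  # truncate if too long
    clipped.extend([0.0] * (target_len - len(clipped)))   # pad with silence if too short
    return clipped


short_clip = [0.1] * 16_000          # 1 second of audio
long_clip = [0.1] * (16_000 * 10)    # 10 seconds of audio
print(len(fix_length(short_clip)), len(fix_length(long_clip)))  # 128000 128000
```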
113
 
114
  ---
115
 
116
 
117
+
118
  ## Training Data
119
 
120
  AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.
 
133
 
134
  ---
135
 
136
  ## Downstream Evaluation
137
 
138
  The paper evaluates AcuLa on 18 cardio-respiratory tasks.
 
163
 
164
  ---
165
 
166
+ ## Code
167
+
168
+ The implementation is available at:
169
+
170
+ https://github.com/janine714/AcuLA
171
+
172
+ Repository setup:
173
+
174
+ git clone https://github.com/janine714/AcuLA
175
+ cd AcuLA
176
+
177
 
178
 
179
  ## Limitations
 
188
 
189
  ---
190
 
191
  ## Citation
192
 
193
  Please cite the paper if you use this checkpoint: