---
license: apache-2.0
language:
- en
tags:
- audio
- medical-audio
- respiratory-sounds
- cardiac-sounds
- auscultation
- cardiopulmonary
- representation-learning
- cross-modal-alignment
- audio-language-alignment
- self-supervised-learning
- clinical-ai
- pytorch
pipeline_tag: feature-extraction
library_name: pytorch
arxiv: 2512.04847
---

# AcuLa

AcuLa (**Audio–Clinical Understanding via Language Alignment**) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.
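
As a rough intuition, the alignment step can be pictured as pulling each audio embedding toward the language-model embedding of its paired clinical report in a shared space. The sketch below is a generic contrastive (InfoNCE-style) formulation with illustrative dimensions, not the exact AcuLa objective; see the paper for the actual loss.

```python
import torch
import torch.nn.functional as F

# Illustrative projection heads into a shared space; the dimensions here are
# assumptions for the sketch, not AcuLa's actual sizes
audio_proj = torch.nn.Linear(768, 256)
text_proj = torch.nn.Linear(2048, 256)

def alignment_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Generic InfoNCE-style loss: matched audio/report pairs lie on the diagonal."""
    a = F.normalize(audio_proj(audio_emb), dim=-1)
    t = F.normalize(text_proj(text_emb), dim=-1)
    logits = a @ t.T / 0.07              # pairwise cosine similarities / temperature
    targets = torch.arange(a.size(0))    # index of the positive pair for each row
    return F.cross_entropy(logits, targets)
```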

This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:

**GitHub:** https://github.com/janine714/AcuLA

This work is described in the paper **“Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”**

---

## Intended Use

AcuLa is intended for research on clinically informed medical audio representation learning.

| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |

AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.

---

## Installation

Clone the GitHub repository:

```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```

Install dependencies:

```bash
pip install -r requirements.txt
```

If using OPERA-family encoders, please make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.

---

## Training

With the repository cloned and dependencies installed as described under Installation, run training with:

```bash
python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb
```
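
Assuming `--lambda_align` and `--lambda_mam` weight an audio-text alignment term and a masked-audio-modeling term, as the flag names suggest, the overall objective would take the form below; see the paper for the exact losses.

$$\mathcal{L} = \lambda_{\text{align}} \, \mathcal{L}_{\text{align}} + \lambda_{\text{mam}} \, \mathcal{L}_{\text{mam}}$$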

Expected CSV format:

| Column | Description |
|---|---|
| `audio_path` | Path to the audio recording |
| `Gen_Report` | Clinical text report paired with the audio recording |

Example:

| audio_path | Gen_Report |
|---|---|
| `/path/to/audio.wav` | `The recording is consistent with normal pulmonary findings...` |
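
For convenience, a CSV in this format can be assembled with pandas. The path and report text below are placeholders for illustration, not files shipped with this repository:

```python
import pandas as pd

# Hypothetical rows; in practice, reports are generated from dataset metadata
rows = [
    {
        "audio_path": "/data/lung/clip_0001.wav",
        "Gen_Report": "The recording is consistent with normal pulmonary findings...",
    },
]
pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)
```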

---

## Checkpoint Loading

The checkpoint can be loaded together with the AcuLa codebase.

```python
import torch
from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

# Initialize the backbone that was used during alignment (here: OPERA-GT)
audio_model = initialize_pretrained_model(pretrain="operaGT")
ckpt = torch.load(checkpoint_path, map_location="cpu")

# Checkpoint variants may store the encoder weights under different keys
if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()
```

Extract audio features:

```python
import torch

# Run the frozen encoder without tracking gradients
with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
```

The variable `audio_input` should follow the preprocessing format expected by the selected audio encoder (see the Input Format section below).

---

## Input Format

AcuLa expects medical audio recordings that are preprocessed into the format required by the selected audio encoder.

A typical preprocessing setup is:

| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |
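
A minimal preprocessing sketch following the table above, using torchaudio; the exact mel parameters (window, hop length, normalization) depend on the selected encoder and should be matched to its training configuration:

```python
import torch
import torchaudio

TARGET_SR = 16_000       # 16 kHz sampling rate
SEGMENT_SECONDS = 8      # fixed-length segment

wave, sr = torchaudio.load("path/to/audio.wav")
wave = wave.mean(dim=0, keepdim=True)  # downmix to mono
if sr != TARGET_SR:
    wave = torchaudio.functional.resample(wave, sr, TARGET_SR)

# Pad or truncate to the fixed segment length
target_len = TARGET_SR * SEGMENT_SECONDS
if wave.shape[-1] < target_len:
    wave = torch.nn.functional.pad(wave, (0, target_len - wave.shape[-1]))
else:
    wave = wave[..., :target_len]

# 64-bin log-mel spectrogram
mel = torchaudio.transforms.MelSpectrogram(sample_rate=TARGET_SR, n_mels=64)(wave)
audio_input = torch.log(mel + 1e-6)
```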

During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
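
A sketch of such augmentations using `torchaudio.functional`; the parameter ranges here are illustrative assumptions, not the values used in the paper:

```python
import torch
import torchaudio.functional as F

def augment(wave: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    # Random volume adjustment of up to +/- 6 dB (range is illustrative)
    wave = F.gain(wave, gain_db=float(torch.empty(1).uniform_(-6.0, 6.0)))
    # Peak normalization
    wave = wave / wave.abs().max().clamp(min=1e-8)
    # Randomly apply either a low-pass or a high-pass biquad filter
    if torch.rand(1).item() < 0.5:
        wave = F.lowpass_biquad(wave, sr, cutoff_freq=4000.0)
    else:
        wave = F.highpass_biquad(wave, sr, cutoff_freq=100.0)
    return wave
```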

---

## Training Data

AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.

| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |

The paper reports more than 100,000 paired audio-report samples for alignment.

---

## Downstream Evaluation

The paper evaluates AcuLa on 18 cardio-respiratory tasks.

| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |

The main evaluation protocol uses frozen embeddings and lightweight supervised prediction heads, allowing performance differences to reflect representation quality.
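
A minimal sketch of such a linear-probe evaluation, assuming embeddings have already been extracted with the frozen encoder; the random arrays below stand in for real features and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-ins for frozen AcuLa embeddings and binary labels (replace with real data)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 2, 50)

# Lightweight supervised head on top of frozen embeddings
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
auroc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Linear-probe AUROC: {auroc:.3f}")
```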

---

## Reported Findings

The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.

| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains on challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |

Please refer to the paper for full task-by-task results and experimental details.

---

## Checkpoint Contents

Depending on the uploaded checkpoint variant, the checkpoint may contain one or more of the following components:

| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |

Users can inspect the checkpoint keys with:

```python
import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
```

---

## Limitations

| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |

---

## Citation

Please cite the paper if you use this checkpoint:

```bibtex
@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding},
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847}
}
```

---

## Acknowledgment

This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.