---
license: apache-2.0
language:
- en
tags:
- audio
- medical-audio
- respiratory-sounds
- cardiac-sounds
- auscultation
- cardiopulmonary
- representation-learning
- cross-modal-alignment
- audio-language-alignment
- self-supervised-learning
- clinical-ai
- pytorch
pipeline_tag: feature-extraction
library_name: pytorch
arxiv: 2512.04847
---

![AcuLa overview](https://cdn-uploads.huggingface.co/production/uploads/6506cb686ba49887d312cfa2/C4gFTr-FqYuJazDm_-Xwn.png)

# AcuLa

AcuLa (**Audio–Clinical Understanding via Language Alignment**) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.

This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:

**GitHub:** https://github.com/janine714/AcuLA

This work is described in the paper **"Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding."**

---

## Intended Use

AcuLa is intended for research on clinically informed medical audio representation learning.

| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |

AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.

---

## Installation

Clone the GitHub repository:

```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```

Install dependencies:

```bash
pip install -r requirements.txt
```

If you use OPERA-family encoders, make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.

---

## Training

To train AcuLa, first clone the repository:

```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```

Then run training with:

```bash
python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb
```

Expected CSV format:

| Column | Description |
|---|---|
| `audio_path` | Path to the audio recording |
| `Gen_Report` | Clinical text report paired with the audio recording |

Example:

| audio_path | Gen_Report |
|---|---|
| `/path/to/audio.wav` | `The recording is consistent with normal pulmonary findings...` |
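If you assemble the CSV programmatically, a minimal sketch with pandas is shown below. The column names follow the table above; the paths and report strings are placeholders for your own paired data, and the output file name simply matches the `--csv_path` argument used in the training command.

```python
# Minimal sketch: assemble the alignment CSV expected by main.py.
# Column names follow the table above; paths and report strings
# are placeholders, not real data.
import pandas as pd

rows = [
    {
        "audio_path": "/path/to/audio_001.wav",
        "Gen_Report": "The recording is consistent with normal pulmonary findings.",
    },
    {
        "audio_path": "/path/to/audio_002.wav",
        "Gen_Report": "A murmur is audible over the recording.",
    },
]

pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)
```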
---

## Checkpoint Loading

The checkpoint can be loaded together with the AcuLa codebase:

```python
import torch

from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

# Initialize the pretrained backbone, then overwrite it with the
# aligned AcuLa weights.
audio_model = initialize_pretrained_model(pretrain="operaGT")
ckpt = torch.load(checkpoint_path, map_location="cpu")

# Checkpoint variants store the encoder weights under different keys.
if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()
```

Extract audio features:

```python
import torch

with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
```

The variable `audio_input` should follow the preprocessing format expected by the selected audio encoder.

---

## Input Format

AcuLa expects medical audio recordings preprocessed into the format required by the selected audio encoder. A typical preprocessing setup is:

| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |

During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
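As a concrete illustration, the hedged sketch below prepares a waveform along these lines with torchaudio. The sampling rate, segment length, and mel-bin count come from the table above; the `n_fft` and `hop_length` values are assumptions and must be matched to the preprocessing of the encoder you actually load.

```python
# Hedged sketch: build a log-mel input roughly matching the table above.
# Sampling rate, segment length, and mel-bin count come from the card;
# n_fft and hop_length are illustrative assumptions.
import torch
import torchaudio

TARGET_SR = 16_000       # 16 kHz sampling rate
SEGMENT_SECONDS = 8      # fixed-length segment of about 8 seconds
N_MELS = 64              # 64 mel bins

def preprocess(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != TARGET_SR:
        waveform = torchaudio.functional.resample(waveform, sr, TARGET_SR)

    # Pad or truncate to the fixed segment length.
    target_len = TARGET_SR * SEGMENT_SECONDS
    if waveform.shape[-1] < target_len:
        waveform = torch.nn.functional.pad(
            waveform, (0, target_len - waveform.shape[-1])
        )
    else:
        waveform = waveform[..., :target_len]

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=TARGET_SR,
        n_fft=1024,        # assumption, not specified by the card
        hop_length=512,    # assumption, not specified by the card
        n_mels=N_MELS,
    )(waveform)
    return torch.log(mel + 1e-6)  # log-mel spectrogram
```

The resulting tensor would then play the role of `audio_input` in the feature-extraction snippet above, subject to any shape, normalization, or batching conventions of the selected backbone.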
---

## Training Data

AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.

| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |

The paper reports more than 100,000 paired audio-report samples for alignment.

---

## Downstream Evaluation

The paper evaluates AcuLa on 18 cardio-respiratory tasks.

| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |

The main evaluation protocol uses frozen embeddings with lightweight supervised prediction heads, so performance differences reflect representation quality rather than head capacity.

---

## Reported Findings

The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.

| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains in challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |

Please refer to the paper for full task-by-task results and experimental details.

---

## Checkpoint Contents

Depending on the variant, the checkpoint may contain one or more of the following components:

| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |

You can inspect the checkpoint keys with:

```python
import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
```

---

## Limitations

| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |

---

## Citation

Please cite the paper if you use this checkpoint:

```bibtex
@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding},
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847},
}
```

---

## Acknowledgment

This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.