---
license: apache-2.0
language:
- en
tags:
- audio
- medical-audio
- respiratory-sounds
- cardiac-sounds
- auscultation
- cardiopulmonary
- representation-learning
- cross-modal-alignment
- audio-language-alignment
- self-supervised-learning
- clinical-ai
- pytorch
pipeline_tag: feature-extraction
library_name: pytorch
arxiv: 2512.04847
---

# AcuLa

AcuLa (**Audio–Clinical Understanding via Language Alignment**) is a post-training alignment framework for medical audio understanding. It improves pretrained audio encoders by aligning their representations with clinical-language representations from a language model, encouraging audio embeddings to capture richer clinical semantics while preserving fine-grained acoustic information.
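
As a rough intuition, the alignment step can be pictured as pulling each audio embedding toward the language-model embedding of its paired clinical report in a shared space. The sketch below is a generic contrastive (InfoNCE-style) formulation with illustrative dimensions, not the exact AcuLa objective; see the paper for the actual loss.

```python
import torch
import torch.nn.functional as F

# Illustrative projection heads into a shared space; the dimensions here are
# assumptions for the sketch, not AcuLa's actual sizes
audio_proj = torch.nn.Linear(768, 256)
text_proj = torch.nn.Linear(2048, 256)

def alignment_loss(audio_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Generic InfoNCE-style loss: matched audio/report pairs lie on the diagonal."""
    a = F.normalize(audio_proj(audio_emb), dim=-1)
    t = F.normalize(text_proj(text_emb), dim=-1)
    logits = a @ t.T / 0.07              # pairwise cosine similarities / temperature
    targets = torch.arange(a.size(0))    # index of the positive pair for each row
    return F.cross_entropy(logits, targets)
```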

This repository provides the checkpoint for AcuLa. The accompanying implementation is available at:

**GitHub:** https://github.com/janine714/AcuLA

This work is described in the paper **“Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding.”**

---

## Intended Use

AcuLa is intended for research on clinically informed medical audio representation learning.

| Use case | Description |
|---|---|
| Feature extraction | Extract embeddings from cardio-respiratory audio recordings |
| Linear probing | Train lightweight classifiers or regressors on frozen embeddings |
| Transfer learning | Adapt the aligned encoder to downstream medical audio datasets |
| Respiratory analysis | Study cough, breath, exhalation, and lung sound representations |
| Cardiac audio analysis | Study heart sound representations |
| Audio-text retrieval | Retrieve semantically related clinical reports or audio samples |
| Representation analysis | Analyze how clinical semantics are reflected in audio embedding spaces |

AcuLa was evaluated on 18 downstream cardio-respiratory tasks, including respiratory condition inference, lung function estimation, and cardiac condition inference.

---

## Installation

Clone the GitHub repository:

```bash
git clone https://github.com/janine714/AcuLA
cd AcuLA
```

Install dependencies:

```bash
pip install -r requirements.txt
```

If using OPERA-family encoders, please make sure the required OPERA dependencies and pretrained checkpoints are available in your environment.

---

## Training

With the repository cloned and dependencies installed as described under Installation, run training with:

```bash
python main.py \
  --csv_path /path/to/combined_dataset.csv \
  --audio_ckpt /path/to/encoder-operaGT.ckpt \
  --output_dir ./checkpoints \
  --audio_backbone operaGT \
  --llm_type google/medgemma-4b-pt \
  --epochs 50 \
  --batch_size 24 \
  --grad_accum_steps 2 \
  --warmup_steps 400 \
  --lr 1e-5 \
  --lambda_align 1.0 \
  --lambda_mam 1.0 \
  --use_wandb
```
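
Assuming `--lambda_align` and `--lambda_mam` weight an audio-text alignment term and a masked-audio-modeling term, as the flag names suggest, the overall objective would take the form below; see the paper for the exact losses.

$$\mathcal{L} = \lambda_{\text{align}} \, \mathcal{L}_{\text{align}} + \lambda_{\text{mam}} \, \mathcal{L}_{\text{mam}}$$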

Expected CSV format:

| Column | Description |
|---|---|
| `audio_path` | Path to the audio recording |
| `Gen_Report` | Clinical text report paired with the audio recording |

Example:

| audio_path | Gen_Report |
|---|---|
| `/path/to/audio.wav` | `The recording is consistent with normal pulmonary findings...` |
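
For convenience, a CSV in this format can be assembled with pandas. The path and report text below are placeholders for illustration, not files shipped with this repository:

```python
import pandas as pd

# Hypothetical rows; in practice, reports are generated from dataset metadata
rows = [
    {
        "audio_path": "/data/lung/clip_0001.wav",
        "Gen_Report": "The recording is consistent with normal pulmonary findings...",
    },
]
pd.DataFrame(rows).to_csv("combined_dataset.csv", index=False)
```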

---

## Checkpoint Loading

The checkpoint can be loaded together with the AcuLa codebase.

```python
import torch
from audio_encoder import initialize_pretrained_model

checkpoint_path = "path/to/acula_checkpoint.pt"

# Initialize the backbone that was used during alignment (here: OPERA-GT)
audio_model = initialize_pretrained_model(pretrain="operaGT")
ckpt = torch.load(checkpoint_path, map_location="cpu")

# Checkpoint variants may store the encoder weights under different keys
if "audio_model_state_dict" in ckpt:
    state_dict = ckpt["audio_model_state_dict"]
elif "state_dict" in ckpt:
    state_dict = ckpt["state_dict"]
else:
    state_dict = ckpt

audio_model.load_state_dict(state_dict, strict=False)
audio_model.eval()
```

Extract audio features:

```python
import torch

# Run the frozen encoder without tracking gradients
with torch.no_grad():
    features = audio_model.forward_feature(audio_input)
```

The variable `audio_input` should follow the preprocessing format expected by the selected audio encoder (see the Input Format section below).

---

## Input Format

AcuLa expects medical audio recordings that are preprocessed into the format required by the selected audio encoder.

A typical preprocessing setup is:

| Step | Setting |
|---|---|
| Sampling rate | 16 kHz |
| Segment length | Fixed-length segment, commonly around 8 seconds |
| Audio representation | Log-mel spectrogram |
| Number of mel bins | 64 |
| Padding/truncation | Applied as needed |
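
A minimal preprocessing sketch following the table above, using torchaudio; the exact mel parameters (window, hop length, normalization) depend on the selected encoder and should be matched to its training configuration:

```python
import torch
import torchaudio

TARGET_SR = 16_000       # 16 kHz sampling rate
SEGMENT_SECONDS = 8      # fixed-length segment

wave, sr = torchaudio.load("path/to/audio.wav")
wave = wave.mean(dim=0, keepdim=True)  # downmix to mono
if sr != TARGET_SR:
    wave = torchaudio.functional.resample(wave, sr, TARGET_SR)

# Pad or truncate to the fixed segment length
target_len = TARGET_SR * SEGMENT_SECONDS
if wave.shape[-1] < target_len:
    wave = torch.nn.functional.pad(wave, (0, target_len - wave.shape[-1]))
else:
    wave = wave[..., :target_len]

# 64-bin log-mel spectrogram
mel = torchaudio.transforms.MelSpectrogram(sample_rate=TARGET_SR, n_mels=64)(wave)
audio_input = torch.log(mel + 1e-6)
```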

During training, optional audio augmentations may include volume adjustment, normalization, low-pass filtering, and high-pass filtering.
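
A sketch of such augmentations using `torchaudio.functional`; the parameter ranges here are illustrative assumptions, not the values used in the paper:

```python
import torch
import torchaudio.functional as F

def augment(wave: torch.Tensor, sr: int = 16_000) -> torch.Tensor:
    # Random volume adjustment of up to +/- 6 dB (range is illustrative)
    wave = F.gain(wave, gain_db=float(torch.empty(1).uniform_(-6.0, 6.0)))
    # Peak normalization
    wave = wave / wave.abs().max().clamp(min=1e-8)
    # Randomly apply either a low-pass or a high-pass biquad filter
    if torch.rand(1).item() < 0.5:
        wave = F.lowpass_biquad(wave, sr, cutoff_freq=4000.0)
    else:
        wave = F.highpass_biquad(wave, sr, cutoff_freq=100.0)
    return wave
```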

---

## Training Data

AcuLa was trained using paired medical audio and clinical reports generated from structured metadata. The alignment corpus contains cardio-respiratory audio from multiple public datasets.

| Dataset | Modality |
|---|---|
| ICBHI | Lung sounds |
| HFLung | Lung sounds |
| UK COVID-19 | Induced cough and exhalation |
| CoughVID | Cough sounds |
| CirCor | Heart sounds |
| SPRSound | Lung sounds |
| ZCHSound | Heart sounds |

The paper reports more than 100,000 paired audio-report samples for alignment.

---

## Downstream Evaluation

The paper evaluates AcuLa on 18 cardio-respiratory tasks.

| Task group | Example tasks | Metric |
|---|---|---|
| Respiratory condition inference | COVID-19 detection, COPD classification, smoker classification, obstructive-vs-healthy classification, COPD severity classification | AUROC |
| Lung function estimation | FVC, FEV1, FEV1/FVC, respiratory rate | MAE |
| Cardiac condition inference | Murmur detection, symptomatic-vs-healthy classification | AUROC |

The main evaluation protocol uses frozen embeddings and lightweight supervised prediction heads, allowing performance differences to reflect representation quality.
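
A minimal sketch of such a linear-probe evaluation, assuming embeddings have already been extracted with the frozen encoder; the random arrays below stand in for real features and labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-ins for frozen AcuLa embeddings and binary labels (replace with real data)
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 768)), rng.integers(0, 2, 200)
X_test, y_test = rng.normal(size=(50, 768)), rng.integers(0, 2, 50)

# Lightweight supervised head on top of frozen embeddings
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
auroc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Linear-probe AUROC: {auroc:.3f}")
```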

---

## Reported Findings

The paper shows that AcuLa improves medical audio representations across diverse cardio-respiratory tasks.

| Finding | Summary |
|---|---|
| Stronger classification representations | Improved AUROC across respiratory and cardiac condition inference tasks |
| Improved cough-based analysis | Large gains on challenging COVID-19 cough detection settings |
| Better physiological estimation | Improved performance on multiple lung-function estimation tasks |
| Model-agnostic improvements | Consistent gains across several pretrained audio backbones |
| Zero-shot potential | Competitive retrieval-style audio-text similarity results on respiratory tasks |

Please refer to the paper for full task-by-task results and experimental details.

---

## Checkpoint Contents

Depending on the uploaded checkpoint variant, the checkpoint may contain one or more of the following components:

| Component | Description |
|---|---|
| Audio encoder weights | Aligned medical audio encoder parameters |
| Audio projection head | Projection layer for shared-space audio embeddings |
| Language projection head | Projection layer for shared-space text embeddings |
| Training metadata | Optional optimizer, scheduler, or epoch information |

Users can inspect the checkpoint keys with:

```python
import torch

ckpt = torch.load("path/to/acula_checkpoint.pt", map_location="cpu")
print(ckpt.keys())
```

---

## Limitations

| Limitation | Description |
|---|---|
| Research-stage checkpoint | Intended for research evaluation and downstream development |
| Dataset dependence | Performance may vary across datasets, devices, and recording conditions |
| Synthetic text supervision | Alignment reports are generated from metadata and may simplify clinical details |
| Clip-level representation | The method learns global clip embeddings and does not explicitly localize events |
| Downstream adaptation | Task-specific classifiers or regressors may still be needed for final applications |

---

## Citation

Please cite the paper if you use this checkpoint:

```bibtex
@misc{wang2026languagemodelssemanticteachers,
  title={Language Models as Semantic Teachers: Post-Training Alignment for Medical Audio Understanding},
  author={Tsai-Ning Wang and Lin-Lin Chen and Neil Zeghidour and Aaqib Saeed},
  year={2026},
  eprint={2512.04847},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2512.04847}
}
```

---

## Acknowledgment

This checkpoint is released to support reproducibility and further research on medical audio understanding, audio-language alignment, and clinically informed representation learning.