PRAXIS-AUSCULT — Conformer-CTC ASR for British GP Audio

A 117M-parameter (live inference) Conformer-CTC speech recogniser trained from scratch on British General Practice consultation audio (PriMock57). Built as the ASR component of PRAXIS — an MSc dissertation system at the University of Leicester for AI-assisted clinical documentation in NHS general practice.

AUSCULT is named after Latin auscultare — "to listen carefully" — the root of auscultation, the foundational act of clinical listening with a stethoscope. The model learns to listen to GP consultations the way a clinician learns to listen to a chest.

Model summary


Architecture	Conformer-CTC (encoder-only, CTC head)
Parameters	119.63M (checkpoint), 117.06M (inference, CTC head removed)
Layers	18 Conformer blocks
`d_model` / heads / FFN	512 / 8 / 2048
Conv kernel	31
Vocabulary	3,001 (SentencePiece, see `praxis_tokenizer_v2.model`)
Real-time factor (RTF)	0.036–0.044 (Mac CPU, verified live)
Framework	Raw PyTorch (no `transformers`)
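
The real-time factor in the table is processing time divided by audio duration, so any value below 1 means faster than real time. A minimal sketch of that arithmetic follows; the function names are illustrative and are not part of the PRAXIS code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def expected_processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: compute time implied by an RTF and a clip length."""
    return rtf * audio_seconds


# A 253-second recording at RTF 0.036 implies roughly 9 seconds of compute.
print(round(expected_processing_seconds(0.036, 253.0), 1))  # 9.1
```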

Training


Training data	PriMock57 (57 simulated British GP consultations), chunked into 4,824 segments
Training compute	vast.ai (NVIDIA GPUs); see `vast_train.py` for the exact training script
Loss	CTC
Tokeniser	SentencePiece, 3,001 BPE-style tokens
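
Because the model is trained with a CTC loss, the simplest way to read out a transcript is greedy decoding: take the most likely token per frame, merge consecutive repeats, and drop the blank symbol. The sketch below shows that collapse on plain Python lists. The blank index of 0 is an assumption (this card does not state it), and mapping token ids back to text with the SentencePiece tokeniser is left out.

```python
from typing import List, Optional, Sequence

BLANK_ID = 0  # assumption: the checkpoint's actual blank index is not documented here


def greedy_ctc_decode(frame_scores: Sequence[Sequence[float]],
                      blank_id: int = BLANK_ID) -> List[int]:
    """Collapse per-frame scores (frames x vocab) into a token-id sequence."""
    decoded: List[int] = []
    previous: Optional[int] = None
    for scores in frame_scores:
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax for this frame
        # Emit only when the label changes and is not the blank symbol.
        if best != previous and best != blank_id:
            decoded.append(best)
        previous = best
    return decoded


# Frame-wise argmax is [1, 1, 0, 1, 2, 2, 0]; the blank between the 1s keeps both.
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
    [0.9, 0.05, 0.05],
]
print(greedy_ctc_decode(frames))  # [1, 1, 2]
```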

Headline results — PriMock57 test split

WER (lower is better)

Model	Architecture	Params	WER ↓	Notes
`openai/whisper-large-v3`	seq2seq encoder–decoder	1.55B	16.76%	Multilingual, ~680K hours pre-training
`openai/whisper-medium`	seq2seq encoder–decoder	769M	—	Listed for context
`openai/whisper-small`	seq2seq encoder–decoder	244M	—	Listed for context
PRAXIS-AUSCULT (this model, greedy)	Conformer-CTC	117M	50.12%	Trained only on PriMock57
PRAXIS-AUSCULT + SNOMED biasing	Conformer-CTC + hot-word boosted beam	117M	medical-term WER −31% (relative, POC n=14)	185-term UK primary-care lexicon, zero retraining. Full-benchmark MC-WER measurement pending.

Honest framing. This Conformer is trained from scratch on 4,824 chunked segments of British GP audio — roughly four orders of magnitude less data than Whisper-large-v3 was pre-trained on. The resulting WER (50.12%) is higher than Whisper-large-v3 (16.76%) on the same test split. This is not a SOTA WER claim. It is a deliberately reported negative result that establishes the floor against which the dissertation's medical-term-recall, hallucination, and clinical-validation contributions are evaluated. Reporting this honestly — rather than only reporting the metrics on which the model wins — is part of the methodological contribution.
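
For readers who want to check figures like these on their own transcripts, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain-Python version under one stated assumption: text is split on whitespace only, since the normalisation used for the dissertation numbers is not described in this card. The `relative_reduction` helper shows how a relative change such as the biasing figure is computed; the numbers in the example are invented.

```python
from typing import List


def word_errors(reference: List[str], hypothesis: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(substitution, previous[j] + 1, current[j - 1] + 1))
        previous = current
    return previous[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Errors divided by reference length; whitespace tokenisation only."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_errors(ref_words, hypothesis.split()) / len(ref_words)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed, e.g. 0.5 -> 0.25 gives 0.5."""
    return (baseline - improved) / baseline


# One deletion ("had") and one substitution ("diziness") over five reference words.
print(wer("i have had some dizziness", "i have some diziness"))  # 0.4
```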

Verified live (2026-05-04): RTF 0.036 on macOS CPU with a 253-second sample consultation. Model loaded in 0.47s. Live transcription is intelligible but contains the word-level errors expected at 50% WER (e.g. "diziness" for "dizziness"). Use the medical spell-checker in src/speech_to_text/medical_spell_checker.py and SNOMED biasing for a cleaner downstream string.
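
The internals of the medical spell-checker are not shown in this card. As a rough illustration of how a lexicon-based pass can repair an error like "diziness", the sketch below uses fuzzy matching from the Python standard library. The lexicon, the cutoff, and the function names are all hypothetical and are not taken from the PRAXIS module.

```python
import difflib
from typing import List, Sequence

# Hypothetical mini-lexicon; the real checker's vocabulary is not published here.
MEDICAL_LEXICON: List[str] = ["dizziness", "nausea", "paracetamol", "hypertension"]


def correct_word(word: str, lexicon: Sequence[str], cutoff: float = 0.85) -> str:
    """Return the closest lexicon entry if it is similar enough, else the word itself."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_transcript(text: str, lexicon: Sequence[str] = MEDICAL_LEXICON) -> str:
    """Apply word-by-word correction; ordinary words fall below the cutoff and pass through."""
    return " ".join(correct_word(word, lexicon) for word in text.split())


print(correct_transcript("i have had some diziness"))  # i have had some dizziness
```

A high cutoff keeps the checker conservative: it only rewrites words that are already very close to a lexicon term, which matters when the output feeds a clinical note.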

What the model is good for

Medical-term recall with SNOMED CT contextual biasing — 31% relative reduction in medical-term WER at decode time using a 185-term UK primary-care hotword lexicon (drugs, conditions, NHS abbreviations). Zero retraining; pyctcdecode beam search with hotword boosting. POC measurement, n=14; full PriMock57-test MC-WER sweep is pending.
Domain-specific phrasing — common British GP turns of phrase, EMIS-style consultation patterns, NHS abbreviations (e.g. CKS, BNF, NICE).
On-device, offline operation: full inference runs locally on Apple Silicon with no cloud dependency, so consultation audio does not have to leave the device. This design supports GDPR and HIPAA data-protection requirements.
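
The contextual biasing in the first item happens inside pyctcdecode's beam search. As a much simpler illustration of the same idea, the sketch below rescores an n-best list by adding a fixed bonus for every lexicon term a hypothesis contains. It is a toy stand-in, not pyctcdecode's algorithm, and the scores, bonus weight, and two-term lexicon are invented for the example.

```python
from typing import Iterable, List, Set, Tuple

HOTWORDS: Set[str] = {"amlodipine", "hypertension"}  # illustrative; the real lexicon has 185 terms
HOTWORD_BONUS = 2.0  # assumed log-score bonus per matched term


def rescore(hypotheses: Iterable[Tuple[str, float]],
            hotwords: Set[str] = HOTWORDS,
            bonus: float = HOTWORD_BONUS) -> List[Tuple[str, float]]:
    """Add `bonus` per hotword occurrence and return hypotheses sorted best-first."""
    rescored = []
    for text, score in hypotheses:
        matches = sum(1 for word in text.lower().split() if word in hotwords)
        rescored.append((text, score + bonus * matches))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Made-up acoustic scores: the garbled hypothesis starts ahead, the boosted one wins.
n_best = [
    ("start a low dose of am low the pin", -4.0),
    ("start a low dose of amlodipine", -5.0),
]
print(rescore(n_best)[0][0])  # start a low dose of amlodipine
```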

Intended use

Research and educational use within PRAXIS (MSc dissertation, University of Leicester).
Studying domain-specific ASR with limited training data.
Studying contextual-biasing methods (the SNOMED hotword boost is a contribution in its own right).

Out of scope

Not a clinical device. Output must be reviewed by a qualified clinician before being entered into a patient record.
Not validated outside British primary-care GP audio. Will likely degrade on hospital, secondary-care, paediatric, ICU, or non-UK English audio.
Not validated for code-switched, multilingual, or strongly accented non-UK English.

Files in this repo

File	Purpose
`best_model_v2_120M.pt`	Generic checkpoint, val_loss 0.679 on the multi-corpus validation set.
`final_model_v2_120M.pt`	Final-epoch generic checkpoint.
`praxis_tokenizer_v2.model`	SentencePiece tokeniser, vocab 3,001. Required for inference.
`vast_train.py`	The exact training script used on vast.ai. Provided for reproducibility.

Which checkpoint to use. The PRAXIS reference engine prefers a PriMock57-fine-tuned checkpoint (`best_primock57.pt`) when it is present locally; otherwise it falls back to `best_model_v2_120M.pt`. The PriMock57-fine-tuned variant produced the dissertation WER (50.12% on the PriMock57 test split) and is recommended for British GP audio. The size suffix in the checkpoint filenames is historical (an early target size), not an exact parameter count: the checkpoint state dict holds 119.6M parameters and the live inference graph 117M, because ~2.5M training-only output-head parameters are stripped at load.
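
The preference order described above can be sketched in a few lines. This is an illustration of the fallback rule only, not the reference engine's code; `pick_checkpoint` is a hypothetical helper name.

```python
from pathlib import Path

PREFERRED = "best_primock57.pt"      # PriMock57-fine-tuned, used if present locally
FALLBACK = "best_model_v2_120M.pt"   # generic checkpoint shipped in this repo


def pick_checkpoint(model_dir: str) -> Path:
    """Return the fine-tuned checkpoint if it exists, otherwise the generic one."""
    directory = Path(model_dir)
    for name in (PREFERRED, FALLBACK):
        candidate = directory / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no checkpoint found in {directory}")
```

Calling `pick_checkpoint(local_dir)` on a downloaded snapshot would return the generic checkpoint unless a fine-tuned file has been placed alongside it.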

How to use

This model is not packaged as a transformers model. It is loaded with raw PyTorch.

The full inference engine (ConformerEngine, SNOMED biaser, audio preprocessor) lives in the PRAXIS reference implementation, which is private until dissertation submission. Contact the author for research access.

Inspecting the checkpoint without the reference repo

This works standalone:

```python
import torch
from huggingface_hub import hf_hub_download

# Download (or reuse from the local cache) the generic checkpoint.
ckpt_path = hf_hub_download(repo_id="Satyawan1/praxis-auscult", filename="best_model_v2_120M.pt")
# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

print(ckpt["config"])
# {'vocab_size': 3001, 'd_model': 512, 'n_heads': 8, 'ff_dim': 2048, 'n_layers': 18, 'conv_kernel': 31}
print(ckpt["val_loss"])  # 0.679
state_dict = ckpt["model_state"]  # PyTorch state dict, ready to load into a matching Conformer-CTC architecture
```
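
To see where the checkpoint's parameters sit, the state dict can be summed per top-level module. The helper below is a sketch that relies only on each entry exposing `.numel()`, as PyTorch tensors do. The module names inside the checkpoint are not documented in this card, so it groups by whatever prefix appears before the first dot. A state dict can also contain non-trainable buffers, so the total may differ slightly from a strict parameter count.

```python
from collections import Counter
from typing import Dict, Mapping


def count_parameters(state_dict: Mapping[str, object]) -> Dict[str, int]:
    """Sum element counts per top-level module prefix (text before the first dot)."""
    totals: Counter = Counter()
    for name, tensor in state_dict.items():
        totals[name.split(".", 1)[0]] += tensor.numel()
    return dict(totals)


# Usage with the `state_dict` loaded from the checkpoint:
#   counts = count_parameters(state_dict)
#   print(counts, sum(counts.values()) / 1e6)  # total in millions
```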

Running inference

Requires the PRAXIS engine code. Once available:

```python
from huggingface_hub import snapshot_download
from speech_to_text.conformer_engine import ConformerEngine   # from PRAXIS repo

local_dir = snapshot_download(repo_id="Satyawan1/praxis-auscult")
engine = ConformerEngine(model_dir=local_dir, bias=False)
text = engine.transcribe("/path/to/consultation.wav")

# With SNOMED CT contextual biasing (31% relative MC-WER reduction, POC n=14):
engine_biased = ConformerEngine(model_dir=local_dir, bias=True)
text = engine_biased.transcribe("/path/to/consultation.wav")
```

Limitations and risks

WER is high in absolute terms (50.12%). Use only as part of a downstream pipeline that includes clinician-in-the-loop review.
Trained on simulated GP consultations (PriMock57), not real clinical recordings. Real-world performance is not yet established.
Speaker-level diarisation is handled by a separate model (PRAXIS ECAPA-TDNN, not in this repo).
The SNOMED hotword lexicon is curated for UK primary care; transferability to other healthcare systems has not been evaluated.
No safety guard rails are baked into the ASR itself. Hallucination detection lives in the downstream PRAXIS TrustScore pipeline (NLI entailment gate, Asgari-style harness, MHI), not here.

Citation

If you use this model in academic work, please cite the dissertation (in preparation):

```bibtex
@mastersthesis{singh2026praxis,
  author       = {Singh, Satyawan},
  title        = {{PRAXIS}: AI-Powered Clinical Documentation for {NHS} General Practice},
  school       = {University of Leicester},
  year         = {2026},
  type         = {{MSc} Dissertation},
  note         = {Programme: AI for Business Intelligence; in preparation}
}
```

License

Author

Satyawan Singh, University of Leicester MSc (AI for Business Intelligence)
Supervisor: Eliyas
HF: Satyawan1
GitHub: ss1738

