PRAXIS-AUSCULT — Conformer-CTC ASR for British GP Audio
A 117M-parameter (live inference) Conformer-CTC speech recogniser trained from scratch on British General Practice consultation audio (PriMock57). Built as the ASR component of PRAXIS — an MSc dissertation system at the University of Leicester for AI-assisted clinical documentation in NHS general practice.
AUSCULT is named after Latin auscultare — "to listen carefully" — the root of auscultation, the foundational act of clinical listening with a stethoscope. The model learns to listen to GP consultations the way a clinician learns to listen to a chest.
Model summary
| Property | Value |
|---|---|
| Architecture | Conformer-CTC (encoder-only, CTC head) |
| Parameters | 119.63M (checkpoint), 117.06M (inference, CTC head removed) |
| Layers | 18 Conformer blocks |
| d_model / heads / FFN | 512 / 8 / 2048 |
| Conv kernel | 31 |
| Vocabulary | 3,001 (SentencePiece, see praxis_tokenizer_v2.model) |
| Real-time factor (RTF) | 0.036–0.044 (Mac CPU, verified live) |
| Framework | Raw PyTorch (no transformers) |
Training
| Item | Details |
|---|---|
| Training data | PriMock57 (57 simulated British GP consultations), chunked into 4,824 segments |
| Training compute | vast.ai (NVIDIA GPUs); see vast_train.py for the exact training script |
| Loss | CTC |
| Tokeniser | SentencePiece, 3,001 BPE-style tokens |
Headline results — PriMock57 test split
WER (lower is better)
| Model | Architecture | Params | WER ↓ | Notes |
|---|---|---|---|---|
| openai/whisper-large-v3 | seq2seq encoder–decoder | 1.55B | 16.76% | Multilingual, ~680K hours pre-training |
| openai/whisper-medium | seq2seq encoder–decoder | 769M | n/a | Listed for context |
| openai/whisper-small | seq2seq encoder–decoder | 244M | n/a | Listed for context |
| PRAXIS-AUSCULT (this model, greedy) | Conformer-CTC | 117M | 50.12% | Trained only on PriMock57 |
| PRAXIS-AUSCULT + SNOMED biasing | Conformer-CTC + hot-word boosted beam | 117M | medical-term WER −31% (relative, POC n=14) | 185-term UK primary-care lexicon, zero retraining. Full-benchmark MC-WER measurement pending. |
Honest framing. This Conformer is trained from scratch on 4,824 chunked segments of British GP audio — roughly four orders of magnitude less data than Whisper-large-v3 was pre-trained on. The resulting WER (50.12%) is higher than Whisper-large-v3 (16.76%) on the same test split. This is not a SOTA WER claim. It is a deliberately reported negative result that establishes the floor against which the dissertation's medical-term-recall, hallucination, and clinical-validation contributions are evaluated. Reporting this honestly — rather than only reporting the metrics on which the model wins — is part of the methodological contribution.
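For readers who want to see what the WER figures measure, here is a minimal sketch of word-level edit distance divided by reference length. The text normalisation (lower-casing and whitespace splitting) is an assumption for illustration; the dissertation's scoring pipeline may normalise differently, and corpus-level WER is usually computed by summing edits and reference words across all utterances rather than averaging per-utterance rates.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentence pair (invented, not taken from PriMock57).
reference = "i have been feeling dizziness since monday"
hypothesis = "i have been feeling diziness since monday"
print(f"{word_error_rate(reference, hypothesis):.2%}")  # 14.29% (1 substitution / 7 words)
```

A single misspelt medical term costs one full word error here, which is why term-level metrics are reported separately from overall WER.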
Verified live (2026-05-04): RTF 0.036 on macOS CPU with a 253-second sample consultation. Model loaded in 0.47s. Live transcription is intelligible but contains the word-level errors expected at 50% WER (e.g. "diziness" for "dizziness"). Use the medical spell-checker in src/speech_to_text/medical_spell_checker.py and SNOMED biasing for a cleaner downstream string.
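As a quick check on the RTF figure, real-time factor is processing time divided by audio duration, so an RTF of 0.036 on a 253-second recording implies roughly nine seconds of compute. The sketch below only restates that arithmetic; the function name is illustrative and not part of the PRAXIS code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF below 1.0 means transcription runs faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


audio_seconds = 253.0  # sample consultation length reported above
reported_rtf = 0.036   # RTF reported above

# Processing time implied by the reported RTF.
implied_processing_seconds = reported_rtf * audio_seconds
print(f"{implied_processing_seconds:.1f} s")  # 9.1 s
```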
What the model is good for
- Medical-term recall with SNOMED CT contextual biasing — 31% relative reduction in medical-term WER at decode time using a 185-term UK primary-care hotword lexicon (drugs, conditions, NHS abbreviations). Zero retraining; pyctcdecode beam search with hotword boosting. POC measurement, n=14; full PriMock57-test MC-WER sweep is pending.
- Domain-specific phrasing — common British GP turns of phrase, EMIS-style consultation patterns, NHS abbreviations (e.g. CKS, BNF, NICE).
- On-device, offline operation: full inference runs locally on Apple Silicon with no cloud dependency, so audio never leaves the machine. This supports HIPAA / GDPR-compatible deployment.
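To illustrate the idea behind hotword boosting, here is a deliberately simplified sketch. pyctcdecode applies its boost inside the beam search itself; this toy version instead rescores a finished n-best list, so it is a stand-in for the concept and not the method used in PRAXIS. The hypotheses, log scores, lexicon entries, and boost weight are all invented for the example.

```python
def rescore_with_hotwords(hypotheses, hotwords, boost=2.0):
    """Pick the best (text, log_score) pair after adding `boost` per hotword occurrence."""
    lexicon = {word.lower() for word in hotwords}

    def boosted_score(item):
        text, log_score = item
        hits = sum(1 for word in text.lower().split() if word in lexicon)
        return log_score + boost * hits

    return max(hypotheses, key=boosted_score)[0]


# Hypothetical n-best list: the acoustically preferred candidate splits the drug name.
n_best = [
    ("patient takes met for min daily", -4.1),
    ("patient takes metformin daily", -5.0),
]
hotwords = ["metformin", "amlodipine"]  # stand-in for the 185-term lexicon

print(rescore_with_hotwords(n_best, hotwords, boost=0.0))  # patient takes met for min daily
print(rescore_with_hotwords(n_best, hotwords, boost=2.0))  # patient takes metformin daily
```

The boost only changes the ranking of candidates the decoder already proposes, which is why this kind of biasing needs no retraining.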
Intended use
- Research and educational use within PRAXIS (MSc dissertation, University of Leicester).
- Studying domain-specific ASR with limited training data.
- Studying contextual-biasing methods (the SNOMED hotword boost is a contribution in its own right).
Out of scope
- Not a clinical device. Output must be reviewed by a qualified clinician before being entered into a patient record.
- Not validated outside British primary-care GP audio. Will likely degrade on hospital, secondary-care, paediatric, ICU, or non-UK English audio.
- Not validated for code-switched, multilingual, or strongly accented non-UK English.
Files in this repo
| File | Purpose |
|---|---|
| best_model_v2_120M.pt | Generic checkpoint, val_loss 0.679 on the multi-corpus validation set. |
| final_model_v2_120M.pt | Final-epoch generic checkpoint. |
| praxis_tokenizer_v2.model | SentencePiece tokeniser, vocab 3,001. Required for inference. |
| vast_train.py | The exact training script used on vast.ai. Provided for reproducibility. |
Which checkpoint to use. The PRAXIS reference engine prefers a PriMock57-fine-tuned checkpoint (best_primock57.pt) when it is present locally; otherwise it falls back to best_model_v2_120M.pt. The PriMock-fine-tuned variant is the one used to obtain the dissertation WER number (50.12% on PriMock57 test) and is recommended for British GP audio. The _80M suffix in the filename is historical (an early target size); the live inference graph is 117M params (checkpoint state-dict 119.6M; ~2.5M training-only output-head params are stripped at load).
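The preference order described above can be sketched as a small helper. This is an illustration of the fallback rule only: `select_checkpoint` is a hypothetical name, and the actual logic inside the private PRAXIS engine may differ.

```python
import tempfile
from pathlib import Path

PREFERRED = "best_primock57.pt"     # PriMock57-fine-tuned checkpoint, if present locally
FALLBACK = "best_model_v2_120M.pt"  # generic checkpoint shipped in this repo


def select_checkpoint(model_dir) -> Path:
    """Return the fine-tuned checkpoint when it exists, otherwise the generic one."""
    model_dir = Path(model_dir)
    for name in (PREFERRED, FALLBACK):
        candidate = model_dir / name
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"no checkpoint found in {model_dir}")


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / FALLBACK).touch()
    print(select_checkpoint(tmp).name)  # best_model_v2_120M.pt
    (Path(tmp) / PREFERRED).touch()
    print(select_checkpoint(tmp).name)  # best_primock57.pt
```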
How to use
This model is not packaged as a transformers model. It is loaded with raw PyTorch.
The full inference engine (ConformerEngine, SNOMED biaser, audio preprocessor) lives in the PRAXIS reference implementation, which is private until dissertation submission. Contact the author for research access.
Inspecting the checkpoint without the reference repo
This works standalone:
```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hub (cached locally after the first call).
ckpt_path = hf_hub_download(repo_id="Satyawan1/praxis-auscult", filename="best_model_v2_120M.pt")

# weights_only=False unpickles arbitrary Python objects, so only load checkpoints you trust.
ckpt = torch.load(ckpt_path, map_location="cpu", weights_only=False)

print(ckpt["config"])
# {'vocab_size': 3001, 'd_model': 512, 'n_heads': 8, 'ff_dim': 2048, 'n_layers': 18, 'conv_kernel': 31}
print(ckpt["val_loss"])  # 0.679

# PyTorch state dict, ready to load into a matching Conformer-CTC architecture.
state_dict = ckpt["model_state"]
```
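To sanity-check the parameter counts quoted in this card, one option is to sum tensor sizes by name. The sketch below works on a plain mapping of names to shapes so it runs without PyTorch; the key names and the `train_only_head.` prefix are hypothetical placeholders, since the real state-dict keys are not documented here. Note that a state dict can also contain non-trainable buffers, so a raw sum may differ slightly from a trainable-parameter count.

```python
from math import prod


def count_parameters(shapes, exclude_prefixes=()):
    """Sum element counts over entries whose names do not start with an excluded prefix."""
    excluded = tuple(exclude_prefixes)
    return sum(
        prod(shape)
        for name, shape in shapes.items()
        if not name.startswith(excluded)
    )


# With the real checkpoint, the mapping could be built as:
#   shapes = {name: tuple(tensor.shape) for name, tensor in state_dict.items()}
# Toy shapes below are invented for illustration.
toy_shapes = {
    "encoder.layers.0.ffn.weight": (2048, 512),
    "encoder.layers.0.ffn.bias": (2048,),
    "train_only_head.weight": (3001, 512),  # hypothetical training-only head
}

print(count_parameters(toy_shapes))                                         # 2587136
print(count_parameters(toy_shapes, exclude_prefixes=["train_only_head."]))  # 1050624
```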
Running inference
Requires the PRAXIS engine code. Once available:
```python
from huggingface_hub import snapshot_download
from speech_to_text.conformer_engine import ConformerEngine  # from PRAXIS repo

# Fetch every file in the model repo (checkpoints, tokeniser, training script).
local_dir = snapshot_download(repo_id="Satyawan1/praxis-auscult")

# Without biasing.
engine = ConformerEngine(model_dir=local_dir, bias=False)
text = engine.transcribe("/path/to/consultation.wav")

# With SNOMED CT contextual biasing (31% relative MC-WER reduction, POC n=14):
engine_biased = ConformerEngine(model_dir=local_dir, bias=True)
text = engine_biased.transcribe("/path/to/consultation.wav")
```
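The headline WER is reported for greedy CTC decoding. As a rough illustration of that rule, the sketch below takes the per-frame argmax token ids, merges consecutive repeats, and then drops the blank symbol. The blank index of 0 and the example frame ids are assumptions; this card does not state which index the model uses for blank.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge runs of identical ids, then remove blanks (standard greedy CTC rule)."""
    return [token for token, _ in groupby(frame_ids) if token != blank_id]


# Hypothetical per-frame argmax output. The blank between the two runs of 7
# is what lets the same token appear twice in a row in the final sequence.
frames = [0, 7, 7, 0, 7, 3, 3, 0, 0, 5]
print(ctc_greedy_collapse(frames))  # [7, 7, 3, 5]
```

The resulting ids would then be passed to the SentencePiece tokeniser to recover text.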
Limitations and risks
- WER is high in absolute terms (50.12%). Use only as part of a downstream pipeline that includes clinician-in-the-loop review.
- Trained on simulated GP consultations (PriMock57), not real clinical recordings. Real-world performance is not yet established.
- Speaker-level diarisation is handled by a separate model (PRAXIS ECAPA-TDNN, not in this repo).
- The SNOMED hotword lexicon is curated for UK primary care; transferability to other healthcare systems has not been evaluated.
- No safety guard rails are baked into the ASR itself. Hallucination detection lives in the downstream PRAXIS TrustScore pipeline (NLI entailment gate, Asgari-style harness, MHI), not here.
Citation
If you use this model in academic work, please cite the dissertation (in preparation):
@mastersthesis{singh2026praxis,
author = {Singh, Satyawan},
title = {{PRAXIS}: AI-Powered Clinical Documentation for {NHS} General Practice},
school = {University of Leicester},
year = {2026},
type = {{MSc} Dissertation},
note = {Programme: AI for Business Intelligence; in preparation}
}
License
Released under custom MSc-dissertation rights ("All rights reserved") pending a formal open licence post-submission. Contact the author for research / non-commercial use.
Author
Satyawan Singh — University of Leicester MSc (AI for Business Intelligence)
Supervisor: Eliyas
HF: Satyawan1
GitHub: ss1738
Evaluation results
- Word Error Rate (greedy CTC, no biasing) on PriMock57 (test split): 50.12 (self-reported)