A newer version of the Gradio SDK is available: 6.14.0
ADR 006: BF16 Native Inference for Local Triage
Status
Approved
Context
The initial implementation of the local specialist (Tier 1) used Unsloth with 4-bit quantization (bitsandbytes) to minimize memory footprint. However, during validation on AMD MI300X hardware, we observed "semantic collapse" and repetitive token generation (e.g., repeating punctuation or phrases indefinitely).
Initial investigations suggested that 4-bit quantization artifacts, combined with the CDNA3 architecture's handling of certain sub-8bit kernels, were degrading inference quality for clinical tasks requiring high precision.
Decision
We will migrate the LocalModelManager from Unsloth/4-bit to native BFloat16 (BF16) precision using the standard transformers and peft libraries.
Given the AMD MI300X's massive 192GB of VRAM, the increased memory requirement of BF16 (~14GB for a 7B model) is well within the hardware's capabilities, even with multiple concurrent agents.
Consequences
- Positive: Improved semantic stability and elimination of repetitive output artifacts.
- Positive: Full compatibility with ROCm native bfloat16 kernels.
- Positive: Higher fidelity clinical reasoning compared to 4-bit quantized versions.
- Neutral: Higher VRAM consumption (approx. 2.5x compared to 4-bit), which is negligible on MI300X.
- Neutral: Slightly slower initialization time as larger weights are loaded into HBM3.