OncoAgent / docs /ADR /006-bf16-native-inference.md
MaximoLopezChenlo's picture
Upload folder using huggingface_hub
e1624f5 verified
# ADR 006: BF16 Native Inference for Local Triage
## Status
Approved
## Context
The initial implementation of the local specialist (Tier 1) used Unsloth with 4-bit quantization (bitsandbytes) to minimize memory footprint. However, during validation on AMD MI300X hardware, we observed "semantic collapse" and repetitive token generation (e.g., repeating punctuation or phrases indefinitely).
Initial investigations suggested that 4-bit quantization artifacts, combined with the CDNA3 architecture's handling of certain sub-8bit kernels, were degrading inference quality for clinical tasks requiring high precision.
## Decision
We will migrate the `LocalModelManager` from Unsloth/4-bit to native **BFloat16 (BF16)** precision using the standard `transformers` and `peft` libraries.
Given the AMD MI300X's massive 192GB of VRAM, the increased memory requirement of BF16 (~14GB for a 7B model) is well within the hardware's capabilities, even with multiple concurrent agents.
## Consequences
- **Positive:** Improved semantic stability and elimination of repetitive output artifacts.
- **Positive:** Full compatibility with ROCm native bfloat16 kernels.
- **Positive:** Higher fidelity clinical reasoning compared to 4-bit quantized versions.
- **Neutral:** Higher VRAM consumption (approx. 2.5x compared to 4-bit), which is negligible on MI300X.
- **Neutral:** Slightly slower initialization time as larger weights are loaded into HBM3.