Spaces:

lablab-ai-amd-developer-hackathon
/

OncoAgent

Running

OncoAgent / docs /ADR /006-bf16-native-inference.md

Upload folder using huggingface_hub

e1624f5 verified 12 days ago

1.44 kB

	# ADR 006: BF16 Native Inference for Local Triage

	## Status
	Approved

	## Context
	The initial implementation of the local specialist (Tier 1) used Unsloth with 4-bit quantization (bitsandbytes) to minimize memory footprint. However, during validation on AMD MI300X hardware, we observed "semantic collapse" and repetitive token generation (e.g., repeating punctuation or phrases indefinitely).

	Initial investigations suggested that 4-bit quantization artifacts, combined with the CDNA3 architecture's handling of certain sub-8bit kernels, were degrading inference quality for clinical tasks requiring high precision.

	## Decision
	We will migrate the `LocalModelManager` from Unsloth/4-bit to native BFloat16 (BF16) precision using the standard `transformers` and `peft` libraries.

	Given the AMD MI300X's massive 192GB of VRAM, the increased memory requirement of BF16 (~14GB for a 7B model) is well within the hardware's capabilities, even with multiple concurrent agents.

	## Consequences
	- Positive: Improved semantic stability and elimination of repetitive output artifacts.
	- Positive: Full compatibility with ROCm native bfloat16 kernels.
	- Positive: Higher fidelity clinical reasoning compared to 4-bit quantized versions.
	- Neutral: Higher VRAM consumption (approx. 2.5x compared to 4-bit), which is negligible on MI300X.
	- Neutral: Slightly slower initialization time as larger weights are loaded into HBM3.