# GPU database — v0.1.
#
# DATA PROVENANCE:
# Numeric specs (memory_gb, nvlink_bandwidth_gbps, fp16_tflops, fp8/fp4_support)
# come from public vendor datasheets and commonly-cited benchmarks. Each entry
# records its source in `spec_source` so users can audit.
#
# Conventions:
# - memory_gb: per-card HBM / GDDR in GB (vendor nominal)
# - nvlink_bandwidth_gbps: aggregate NVLink (or equivalent, e.g. xGMI/HCCS)
#   bandwidth in GB/s. 0 if the GPU has no high-bandwidth interconnect
#   (e.g. consumer Ada removed NVLink).
# - fp16_tflops: peak dense FP16/BF16 with Tensor Cores; vendor-cited figure.
# - fp8_support / fp4_support: whether the GPU has NATIVE Tensor Core
#   acceleration for that precision. Software emulation does NOT count.
#
# To add a new GPU: append an entry with all required fields + spec_source.
# See docs/architecture-guide.md "How to add a new GPU".
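#
# A minimal sketch of a new entry (illustrative: the id, numbers, and source
# below are placeholders, not a real product):
#
#   - id: NEWGPU-80G
#     aliases: [NEWGPU, NEWGPU-SXM]
#     memory_gb: 80                # vendor nominal
#     nvlink_bandwidth_gbps: 600   # GB/s; 0 if PCIe-only
#     memory_bandwidth_gbps: 2000
#     fp16_tflops: 500             # dense Tensor-Core peak
#     fp8_support: false           # native acceleration only
#     fp4_support: false
#     spec_source: "Vendor datasheet (vendor.example/newgpu)"
#     notes_en: "One-line summary."
#     notes_zh: "一句话摘要。"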
schema_version: 1
gpus:
  # ========================================================================
  # NVIDIA Blackwell (2024+) — native FP4
  # ========================================================================
  - id: B200
    aliases: [B200-SXM, B200-192G]
    memory_gb: 192
    nvlink_bandwidth_gbps: 1800
    memory_bandwidth_gbps: 8000
    fp16_tflops: 2250
    fp8_support: true
    fp4_support: true
    spec_source: "NVIDIA Blackwell architecture overview (nvidia.com/blackwell)"
    notes_en: "Blackwell flagship. Native FP4 Tensor Cores. First GPU that accelerates DeepSeek-V4-Flash-style FP4 models at the hardware level."
    notes_zh: "Blackwell 旗舰。原生 FP4 Tensor Core,首款在硬件层加速 DeepSeek-V4-Flash 类 FP4 模型的 GPU。"
  - id: GB200
    aliases: [Grace-Blackwell, GB200-per-GPU]
    memory_gb: 192
    nvlink_bandwidth_gbps: 1800
    memory_bandwidth_gbps: 8000
    fp16_tflops: 2250
    fp8_support: true
    fp4_support: true
    spec_source: "NVIDIA GB200 Superchip datasheet 2024 — per-GPU view. Each GB200 = 2x B200 + Grace CPU. Per B200: 192 GB HBM3e, 8 TB/s, 2250 TFLOPS dense FP16 (4500 with sparsity). Grace CPU adds up to 480 GB LPDDR5X accessible via NVLink-C2C."
    notes_en: "Grace Blackwell superchip — 2 B200 GPUs + Grace CPU on one module. Per-GPU specs here match B200, but each GB200 module offers 384 GB HBM3e total (192+192) plus coherent access to 480 GB of Grace CPU LPDDR5X. Native FP4. Only deployable in NVL4/NVL72 rack-scale systems with liquid cooling. Per-GPU TDP 1200W."
    notes_zh: "Grace Blackwell 超级芯片 — 双 B200 GPU + Grace CPU 融合。此处展示单 GPU 视角规格,与 B200 基本一致。每块 GB200 模组合计 384 GB HBM3e(双卡),并通过 NVLink-C2C 一致访问 480 GB Grace CPU 的 LPDDR5x。原生 FP4。仅在 NVL4 / NVL72 液冷机架系统中部署。单 GPU TDP 1200W。"
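  # Rough capacity arithmetic (illustrative, weights only): a 405B-parameter
  # dense model at FP4 needs ~405e9 * 0.5 bytes ≈ 203 GB, already more than
  # one 192 GB card before KV cache and activations. The same weights are
  # ~405 GB at FP8 and ~810 GB at FP16.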
  # ========================================================================
  # NVIDIA Hopper (2022+)
  # ========================================================================
  - id: H100
    aliases: [H100-SXM5, H100-80G, H100-SXM]
    memory_gb: 80
    nvlink_bandwidth_gbps: 900
    memory_bandwidth_gbps: 3350
    fp16_tflops: 989
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA H100 datasheet (nvidia.com/h100)"
    notes_en: "Hopper flagship. Full NVLink bandwidth."
    notes_zh: "Hopper 架构旗舰,完整 NVLink 带宽。"
  - id: H800
    aliases: [H800-SXM5, H800-80G]
    memory_gb: 80
    nvlink_bandwidth_gbps: 400
    memory_bandwidth_gbps: 3350
    fp16_tflops: 989
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA H800 compliance variant — NVLink reduced from H100 per US export controls"
    notes_en: "China-regulated H100 variant. NVLink bandwidth reduced (400 vs 900 GB/s). Same HBM and compute as H100."
    notes_zh: "H100 的中国合规版本。NVLink 带宽削减(400 vs 900 GB/s),HBM 容量和算力与 H100 相同。"
  - id: H200
    aliases: [H200-SXM, H200-141G]
    memory_gb: 141
    nvlink_bandwidth_gbps: 900
    memory_bandwidth_gbps: 4800
    fp16_tflops: 989
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA H200 datasheet (nvidia.com/h200)"
    notes_en: "Hopper with HBM3e. 141 GB per GPU."
    notes_zh: "搭载 HBM3e 的 Hopper,单卡 141 GB。"
  - id: GH200
    aliases: [Grace-Hopper, GH200-144G, GH200-96G]
    memory_gb: 144
    nvlink_bandwidth_gbps: 900
    memory_bandwidth_gbps: 4800
    fp16_tflops: 989
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA GH200 Grace Hopper datasheet 2023 (144 GB HBM3e variant; dense FP16 = 989 TFLOPS, sparsity doubles it)"
    notes_en: "Grace Hopper superchip — Hopper GPU + Grace CPU on one module. 144 GB HBM3e (a 96 GB HBM3 variant also exists). NVLink-C2C provides 900 GB/s of coherent CPU<->GPU bandwidth. TDP programmable 450-1000W. Ideal for models that spill beyond single-GPU memory, because the GPU can access the CPU's LPDDR coherently."
    notes_zh: "Grace Hopper 超级芯片 — Hopper GPU + Grace CPU 融合模组。144 GB HBM3e(另有 96 GB HBM3 版本)。NVLink-C2C 让 CPU/GPU 共享统一内存空间,900 GB/s 双向。TDP 可编程 450-1000W。模型单卡显存装不下时,可一致地访问 CPU 的 LPDDR。"
  - id: H20
    aliases: [H20-96G, H20-SXM]
    memory_gb: 96
    nvlink_bandwidth_gbps: 900
    memory_bandwidth_gbps: 4000
    fp16_tflops: 148
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA H20 — released 2024 as China-compliant successor to H800. Compute heavily reduced (~15% of H100); memory bandwidth and HBM3e preserved."
    notes_en: "China-compliant Hopper for the post-Oct-2023 export rules. Compute is ~15% of H100 (148 vs 989 TFLOPS), but HBM3e memory bandwidth is preserved. Good for memory-bound LLM inference, poor for training."
    notes_zh: "2023 年 10 月出口管制后的中国合规 Hopper。算力仅为 H100 的约 15%(148 vs 989 TFLOPS),但 HBM3e 显存带宽保留。推理(显存带宽受限)尚可,训练基本不实用。"
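  # Why the H20 suits memory-bound inference (illustrative back-of-envelope,
  # single stream, batch 1): decoding one token reads every weight once, so a
  # 70B model at FP8 (~70 GB) has a bandwidth floor of 70 / 4000 GB/s
  # ≈ 17.5 ms/token (~57 tok/s), while the compute side (~2 FLOPs/param
  # ≈ 140 GFLOP) takes only ~140e9 / 148e12 ≈ 0.95 ms at peak FP16.
  # Bandwidth, not compute, is the binding constraint.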
  # ========================================================================
  # NVIDIA Ada Lovelace — FP8 yes, NVLink no
  # ========================================================================
  - id: L40S
    aliases: [L40-S, L40S-48G]
    memory_gb: 48
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 864
    fp16_tflops: 362
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA L40S datasheet 2023"
    notes_en: "Ada datacenter. 48 GB GDDR6. No NVLink — multi-GPU setups rely on PCIe. Cost-effective for small/medium model inference."
    notes_zh: "Ada 架构数据中心卡,48 GB GDDR6。无 NVLink,多卡需走 PCIe。中小模型推理性价比高。"
  - id: L40
    aliases: [L40-48G]
    memory_gb: 48
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 864
    fp16_tflops: 181
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA L40 datasheet 2022"
    notes_en: "Ada datacenter predecessor to L40S. Same 48 GB, half the compute. Widely deployed in enterprise clouds."
    notes_zh: "L40S 的前代,Ada 架构数据中心卡。同为 48 GB,算力减半。企业私有云部署量较大。"
  - id: L4
    aliases: [L4-24G]
    memory_gb: 24
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 300
    fp16_tflops: 121
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA L4 datasheet 2023"
    notes_en: "Low-profile Ada, 24 GB GDDR6. Common in low-concurrency inference / transcoding. No NVLink."
    notes_zh: "低功耗 Ada,24 GB GDDR6。常用于低并发推理和转码场景。无 NVLink。"
  - id: RTX6000-Ada
    aliases: [RTX-6000-Ada, RTX6000Ada, L6000]
    memory_gb: 48
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 960
    fp16_tflops: 365
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA RTX 6000 Ada datasheet 2022"
    notes_en: "Ada Pro workstation card. 48 GB, similar to L40S but aimed at workstations. FP8 yes, no NVLink."
    notes_zh: "Ada Pro 工作站卡。48 GB,规格接近 L40S 但面向工作站。支持 FP8,无 NVLink。"
  - id: RTX4090
    aliases: ["4090", RTX-4090]
    memory_gb: 24
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 1008
    fp16_tflops: 165
    fp8_support: true
    fp4_support: false
    spec_source: "NVIDIA RTX 4090 datasheet 2022"
    notes_en: "Consumer Ada. No NVLink. Large models need multi-GPU via PCIe (slower)."
    notes_zh: "消费级 Ada 架构,无 NVLink。大模型多卡只能走 PCIe(明显更慢)。"
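  # Scale of the "PCIe (slower)" penalty (illustrative): an RTX 4090 rides
  # PCIe Gen4 x16 at ~32 GB/s per direction, vs 900 GB/s aggregate NVLink on
  # an H100, roughly a 28x gap for tensor-parallel traffic; hence
  # nvlink_bandwidth_gbps: 0 on all the Ada entries above.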
  # ========================================================================
  # NVIDIA Ampere (2020+)
  # ========================================================================
  - id: A100-80G
    aliases: [A100-80, A100-SXM-80G]
    memory_gb: 80
    nvlink_bandwidth_gbps: 600
    memory_bandwidth_gbps: 2039
    fp16_tflops: 312
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA A100 datasheet 2020"
    notes_en: "Ampere. No native FP8. Still widely deployed."
    notes_zh: "Ampere 架构。不原生支持 FP8,但部署量仍然非常大。"
  - id: A100-40G
    aliases: [A100-40, A100-SXM-40G]
    memory_gb: 40
    nvlink_bandwidth_gbps: 600
    memory_bandwidth_gbps: 1555
    fp16_tflops: 312
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA A100 40GB datasheet 2020"
    notes_en: "Ampere 40 GB variant. Smaller HBM limits large-model single-node deployments."
    notes_zh: "Ampere 的 40 GB 版本,显存较小,大模型单机部署受限。"
  - id: A40
    aliases: [A40-48G]
    memory_gb: 48
    nvlink_bandwidth_gbps: 112
    memory_bandwidth_gbps: 696
    fp16_tflops: 150
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA A40 datasheet 2020"
    notes_en: "Ampere workstation. 48 GB with NVLink bridge (limited bandwidth). No FP8."
    notes_zh: "Ampere 工作站卡,48 GB + NVLink 桥接(带宽较低)。不支持 FP8。"
  - id: A10
    aliases: [A10-24G]
    memory_gb: 24
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 600
    fp16_tflops: 125
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA A10 datasheet 2021"
    notes_en: "Ampere inference card. 24 GB GDDR6. Widely used for low-cost inference in enterprise clouds."
    notes_zh: "Ampere 推理卡,24 GB GDDR6。企业云低成本推理常用配置。"
  - id: A10G
    aliases: [A10G-24G]
    memory_gb: 24
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 600
    fp16_tflops: 125
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA A10G — AWS-specific variant of A10, g5 instances"
    notes_en: "AWS-specific A10 variant. Same silicon as A10, deployed in g5 EC2 instances. No NVLink."
    notes_zh: "AWS 定制版 A10,用于 g5 EC2 实例。核心规格与 A10 相同,无 NVLink。"
  # ========================================================================
  # NVIDIA Volta / Turing (older, still deployed)
  # ========================================================================
  - id: V100-SXM2-32G
    aliases: [V100, V100-32G, V100-SXM2]
    memory_gb: 32
    nvlink_bandwidth_gbps: 300
    memory_bandwidth_gbps: 900
    fp16_tflops: 125
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA V100 SXM2 datasheet 2017"
    notes_en: "Volta. No FP8. Still deployed in many existing clusters — works for smaller models, tight for 70B+."
    notes_zh: "Volta 架构。不支持 FP8,但仍在大量老集群中服役。小模型够用,70B+ 紧张。"
  - id: V100-PCIe-32G
    aliases: [V100-PCIe, V100-PCI]
    memory_gb: 32
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 900
    fp16_tflops: 112
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA V100 PCIe datasheet 2017 — PCIe variant of V100, no NVLink."
    notes_en: "PCIe version of V100. No NVLink, lower clocks than SXM2. Common in older servers."
    notes_zh: "V100 的 PCIe 版本,无 NVLink,主频稍低。老服务器常见配置。"
  - id: T4
    aliases: [T4-16G]
    memory_gb: 16
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 320
    fp16_tflops: 65
    fp8_support: false
    fp4_support: false
    spec_source: "NVIDIA T4 datasheet 2018"
    notes_en: "Turing inference card. 16 GB, no NVLink, no FP8. Common as the cheapest cloud GPU option."
    notes_zh: "Turing 推理卡。16 GB,无 NVLink,无 FP8。各云厂商最便宜的 GPU 选项之一。"
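  # Fit check (illustrative): a 7B model at FP16 is ~14 GB of weights, leaving
  # ~2 GB of the T4's 16 GB for KV cache and runtime, workable only at small
  # batch and context. INT8 (~7 GB of weights) is the usual T4 choice.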
  # ========================================================================
  # AMD (ROCm, xGMI instead of NVLink)
  # ========================================================================
  - id: MI325X
    aliases: [MI325X-256G, AMD-MI325X]
    memory_gb: 256
    nvlink_bandwidth_gbps: 896
    memory_bandwidth_gbps: 6000
    fp16_tflops: 1307
    fp8_support: true
    fp4_support: false
    spec_source: "AMD Instinct MI325X datasheet 2024 — 256 GB HBM3E, 6 TB/s bandwidth, 1000W TDP, CDNA 3."
    notes_en: "AMD flagship 2024. 256 GB HBM3E (largest single-card memory in the v0.1 database). Upgraded MI300X with faster HBM3E and more capacity. Dense FP16 1307 TFLOPS, FP8 2615 TFLOPS. 1000W TDP, OAM format. ROCm software stack."
    notes_zh: "AMD 2024 年旗舰。256 GB HBM3E(v0.1 数据库中单卡最大)。MI300X 升级版,HBM3E 更快、容量更大。Dense FP16 1307 TFLOPS,FP8 2615 TFLOPS。1000W TDP,OAM 形态。需要 ROCm 软件栈。"
  - id: MI300X
    aliases: [MI300X-192G, AMD-MI300X]
    memory_gb: 192
    nvlink_bandwidth_gbps: 896
    memory_bandwidth_gbps: 5300
    fp16_tflops: 1307
    fp8_support: true
    fp4_support: false
    spec_source: "AMD Instinct MI300X datasheet 2023-12"
    notes_en: "AMD flagship 2023. 192 GB HBM3. xGMI 896 GB/s (NVLink-class). Software stack: ROCm + vLLM. Support for new models (DeepSeek V4 etc.) typically lags NVIDIA by weeks."
    notes_zh: "AMD 2023 年旗舰。192 GB HBM3。xGMI 互联 896 GB/s(类 NVLink)。需要 ROCm + vLLM 栈。新模型支持通常比 NVIDIA 晚几周。"
  - id: MI250X
    aliases: [MI250X-128G, AMD-MI250X]
    memory_gb: 128
    nvlink_bandwidth_gbps: 800
    memory_bandwidth_gbps: 3280
    fp16_tflops: 383
    fp8_support: false
    fp4_support: false
    spec_source: "AMD Instinct MI250X datasheet 2022"
    notes_en: "AMD previous-gen. 128 GB HBM2e. No FP8. Deployed in some HPC clusters (Frontier)."
    notes_zh: "AMD 上代数据中心卡。128 GB HBM2e,不支持 FP8。少数 HPC 集群(如 Frontier 超算)有部署。"
  - id: MI210
    aliases: [MI210-64G, AMD-MI210]
    memory_gb: 64
    nvlink_bandwidth_gbps: 300
    memory_bandwidth_gbps: 1600
    fp16_tflops: 181
    fp8_support: false
    fp4_support: false
    spec_source: "AMD Instinct MI210 datasheet 2022 — CDNA 2, single-die version of MI250. 64 GB HBM2e."
    notes_en: "AMD CDNA 2 single-die. 64 GB HBM2e, 1.6 TB/s. No FP8 (CDNA 2 limitation). Common as entry-level AMD datacenter card."
    notes_zh: "AMD CDNA 2 单 die 版本,64 GB HBM2e,1.6 TB/s 带宽。不支持 FP8(CDNA 2 限制)。AMD 入门数据中心卡常见配置。"
  # ========================================================================
  # Intel Habana Gaudi
  # ========================================================================
  - id: Gaudi3
    aliases: [Gaudi-3, Habana-Gaudi3]
    memory_gb: 128
    nvlink_bandwidth_gbps: 1200
    memory_bandwidth_gbps: 3700
    fp16_tflops: 1835
    fp8_support: true
    fp4_support: false
    spec_source: "Intel Gaudi 3 datasheet 2024"
    notes_en: "Intel Habana Gaudi 3. 128 GB HBM2e. FP8 support. Software stack: SynapseAI (not CUDA). vLLM support via Intel fork."
    notes_zh: "Intel Habana Gaudi 3。128 GB HBM2e,支持 FP8。软件栈为 SynapseAI(非 CUDA)。vLLM 需走 Intel 分支。"
  - id: Gaudi2
    aliases: [Gaudi-2, Habana-Gaudi2]
    memory_gb: 96
    nvlink_bandwidth_gbps: 600  # 24x100GbE = 600 GB/s bidirectional, same convention as Gaudi 3's 1200
    memory_bandwidth_gbps: 2450
    fp16_tflops: 432
    fp8_support: true
    fp4_support: false
    spec_source: "Intel Gaudi 2 datasheet 2022"
    notes_en: "Intel Habana Gaudi 2. 96 GB HBM2e with 24x100GbE on-board (used for scale-out). FP8 support."
    notes_zh: "Intel Habana Gaudi 2。96 GB HBM2e,板载 24 个 100GbE(用于横向扩展)。支持 FP8。"
  # ========================================================================
  # Huawei Ascend
  # ========================================================================
  # The 910B "series" is actually a set of sub-variants (B1/B2/B3/B4) with
  # different compute tiers and memory sizes. `910B` as a plain id resolves
  # to 910B3 (the most common training configuration).
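  # Illustrative lookup (resolver behavior, not part of the schema):
  #   "910B"        -> entry id: 910B3   (via its aliases list below)
  #   "Ascend-910B" -> entry id: 910B3
  #   "910B1"       -> entry id: 910B1   (exact id match)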
| - id: "910A" | |
| aliases: [Ascend-910A] | |
| memory_gb: 32 | |
| nvlink_bandwidth_gbps: 400 | |
| memory_bandwidth_gbps: 1200 | |
| fp16_tflops: 256 | |
| fp8_support: false | |
| fp4_support: false | |
| spec_source: "Ascend 910 (1st gen) — 7nm, 32 GB HBM. Community-compiled spec." | |
| notes_en: "Huawei Ascend 910 (1st gen, 2019). Predecessor to 910B. Still deployed in many older clusters. HCCS interconnect." | |
| notes_zh: "华为昇腾 910 第一代(2019 年),910B 的前身。很多老集群仍在使用。HCCS 互联。" | |
| - id: "910B1" | |
| aliases: [Ascend-910B1] | |
| memory_gb: 64 | |
| nvlink_bandwidth_gbps: 400 | |
| memory_bandwidth_gbps: 1600 | |
| fp16_tflops: 414 | |
| fp8_support: false | |
| fp4_support: false | |
| spec_source: "Ascend 910B1 — training variant, Atlas 800T A2. Commonly cited as top-tier 910B sub-variant; TSMC 7nm process." | |
| notes_en: "Top-tier 910B training variant. 64 GB HBM2, 414 TFLOPS FP16. Used in Atlas 800T A2 training servers. No native FP8." | |
| notes_zh: "910B 系列顶配训练版本。64 GB HBM2,FP16 算力 414 TFLOPS。搭载于 Atlas 800T A2 训练服务器。不原生支持 FP8。" | |
| - id: "910B2" | |
| aliases: [Ascend-910B2] | |
| memory_gb: 64 | |
| nvlink_bandwidth_gbps: 400 | |
| memory_bandwidth_gbps: 1600 | |
| fp16_tflops: 376 | |
| fp8_support: false | |
| fp4_support: false | |
| spec_source: "Ascend 910B2 — training variant, commonly cited as standard 910B training configuration." | |
| notes_en: "Standard 910B training variant. 64 GB HBM2, 376 TFLOPS FP16. General-purpose training server baseline." | |
| notes_zh: "910B 常规训练版本。64 GB HBM2,FP16 算力 376 TFLOPS。通用训练服务器标准配置。" | |
| - id: "910B3" | |
| aliases: [Ascend-910B3, "910B", Ascend-910B] | |
| memory_gb: 64 | |
| nvlink_bandwidth_gbps: 400 | |
| memory_bandwidth_gbps: 1600 | |
| fp16_tflops: 313 | |
| fp8_support: false | |
| fp4_support: false | |
| spec_source: "Ascend 910B3 — training variant, SMIC-produced per industry reports. (aliased as bare `910B` for convenience)" | |
| notes_en: "910B3 training variant, 313 TFLOPS FP16. Believed to be SMIC-produced (vs TSMC for B1/B2). The `910B` bare name resolves here since B3 is the most commonly referenced." | |
| notes_zh: "910B3 训练版本,FP16 算力 313 TFLOPS。业界普遍认为由中芯国际生产(B1/B2 据传为台积电)。裸写 `910B` 时默认解析到此条目(最常被引用)。" | |
| - id: "910B4" | |
| aliases: [Ascend-910B4] | |
| memory_gb: 32 | |
| nvlink_bandwidth_gbps: 400 | |
| memory_bandwidth_gbps: 1600 | |
| fp16_tflops: 280 | |
| fp8_support: false | |
| fp4_support: false | |
| spec_source: "Ascend 910B4 — inference variant, 32 GB HBM (half of B1/B2/B3). Atlas 800I A2 inference server." | |
| notes_en: "910B4 is the inference-oriented 910B variant. 32 GB HBM (half of training variants), 280 TFLOPS FP16. Deployed in Atlas 800I A2 inference servers." | |
| notes_zh: "910B4 是 910B 系列的推理版本。32 GB HBM(训练版本的一半),FP16 算力 280 TFLOPS。搭载于 Atlas 800I A2 推理服务器。" | |
| - id: "910C" | |
| aliases: [Ascend-910C] | |
| memory_gb: 64 | |
| nvlink_bandwidth_gbps: 400 | |
| memory_bandwidth_gbps: 3200 | |
| fp16_tflops: 780 | |
| fp8_support: false | |
| fp4_support: false | |
| spec_source: "Huawei Ascend 910C — launched 2024, commonly cited specs pending official datasheet" | |
| notes_en: "Huawei Ascend 910C (2024). Roughly 2x compute vs 910B at similar memory. FP8 support status unclear — check CANN version notes. Software ecosystem matures but still behind NVIDIA." | |
| notes_zh: "华为昇腾 910C(2024 年)。算力大约是 910B 的两倍,显存相当。FP8 支持情况需看 CANN 版本。软件生态持续完善但仍落后于 NVIDIA。" | |
  - id: Atlas-300I-Duo
    aliases: [Atlas300IDuo, 300I-Duo]
    memory_gb: 48
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 204
    fp16_tflops: 140
    fp8_support: false
    fp4_support: false
    spec_source: "Huawei Atlas 300I Duo inference card — 2x Ascend 310P per card. 140 TFLOPS FP16 per card, 48 GB LPDDR4X."
    notes_en: "Huawei Atlas 300I Duo inference card: 2x Ascend 310P with combined 48 GB LPDDR4X (96 GB variant available). 280 TOPS INT8. LPDDR4X gives 204 GB/s total bandwidth — much lower than HBM-based cards. PCIe-only, no NVLink. Best for cost-sensitive inference."
    notes_zh: "华为 Atlas 300I Duo 推理卡:双 Ascend 310P,合计 48 GB LPDDR4X(另有 96 GB 版本)。INT8 280 TOPS。显存是 LPDDR4X,带宽 204 GB/s,远低于 HBM 卡。仅 PCIe,无 NVLink。主要面向成本敏感的推理场景。"
  # ========================================================================
  # Chinese domestic AI accelerators (non-NVIDIA / non-AMD)
  # ========================================================================
  - id: MXC500
    aliases: [MetaX-MXC500, XiYun-C500, 曦云C500]
    memory_gb: 64
    nvlink_bandwidth_gbps: 800
    memory_bandwidth_gbps: 1800
    fp16_tflops: 240
    fp8_support: false
    fp4_support: false
    spec_source: "MetaX 沐曦 MXC500 / 曦云 C500 (PCIe variant, 350W). OAM variant has 280 TFLOPS FP16 @ 450W. 64 GB HBM2e, 1.8 TB/s memory bandwidth, MetaXLink interconnect."
    notes_en: "MetaX (沐曦) MXC500. 7nm, CUDA-compatible via the MXMACA stack. PCIe variant: 240 TFLOPS FP16, 350W. OAM variant: 280 TFLOPS FP16, 450W. Targets A100-class workloads. No native FP8."
    notes_zh: "沐曦曦云 C500。7nm 工艺,通过 MXMACA 软件栈兼容 CUDA。PCIe 版本 FP16 240 TFLOPS / 350W,OAM 版本 280 TFLOPS / 450W。对标 A100 场景。不原生支持 FP8。"
  - id: MXC550
    aliases: [MetaX-MXC550, XiYun-C550, 曦云C550]
    memory_gb: 64
    nvlink_bandwidth_gbps: 896
    memory_bandwidth_gbps: 1600
    fp16_tflops: 240
    fp8_support: false
    fp4_support: false
    spec_source: "MetaX 沐曦 MXC550 / 曦云 C550 (OAM, 2024). Partial specs from third-party comparison docs; full datasheet TBD. 8-card fabric bandwidth 896 GB/s."
    notes_en: "MetaX (沐曦) MXC550 — 2024 OAM-format flagship. Supports OAM 1.5 + 2.0. 8-card fabric bandwidth 896 GB/s. Full specs pending official datasheet — figures here are from third-party comparison articles."
    notes_zh: "沐曦曦云 C550 — 2024 年 OAM 形态旗舰。支持 OAM 1.5 + 2.0 规范。八卡全互联带宽 896 GB/s。完整规格待官方数据表披露,此处数字来自第三方对比资料。"
  - id: Kunlun-P800
    aliases: [KunlunXin-P800, 昆仑芯P800, Kunlun-Gen3]
    memory_gb: 96
    nvlink_bandwidth_gbps: 400
    memory_bandwidth_gbps: 2000
    fp16_tflops: 345
    fp8_support: true
    fp4_support: false
    spec_source: "KunlunXin P800 (3rd gen, 2024). 96 GB HBM3 (largest among Chinese domestic AI chips). Baidu Cloud uses P800 for first-party inference. Specs partially inferred from public Baidu announcements; official datasheet limited distribution."
    notes_en: "Baidu KunlunXin P800 — 3rd gen, 2024. 96 GB HBM3. Reported to support 8-bit inference and MoE optimizations. Baidu's internal clusters run Kunlun P800 at 10k+ card scale. Figures here are from public Baidu materials; the official spec sheet is not fully public."
    notes_zh: "百度昆仑芯 P800 — 第三代,2024 年。96 GB HBM3(国产 AI 芯片中显存最大之一)。报告支持 8bit 推理和 MoE 优化。百度内部 1 万卡以上规模部署。数字来自百度公开资料,完整规格表未完全披露。"
  - id: Kunlun-R200
    aliases: [KunlunXin-R200, 昆仑芯R200, Kunlun-Gen2]
    memory_gb: 32
    nvlink_bandwidth_gbps: 200
    memory_bandwidth_gbps: 512
    fp16_tflops: 128
    fp8_support: false
    fp4_support: false
    spec_source: "KunlunXin R200 (2nd gen, 2021). 7nm XPU architecture. FP16 128 TFLOPS / INT8 256 TOPS."
    notes_en: "Baidu KunlunXin R200 — 2nd gen, 7nm. FP16 128 TFLOPS, INT8 256 TOPS. XPU architecture. PCIe 4.0 + XCCL interconnect. No FP8."
    notes_zh: "百度昆仑芯 R200 — 第二代,7nm XPU 架构。FP16 128 TFLOPS,INT8 256 TOPS。PCIe 4.0 + 昆仑芯互联 XCCL。无 FP8。"
  - id: BR100
    aliases: [Biren-BR100, 壁仞BR100, 壁砺100]
    memory_gb: 64
    nvlink_bandwidth_gbps: 512
    memory_bandwidth_gbps: 1640
    fp16_tflops: 1024
    fp8_support: false
    fp4_support: false
    spec_source: "Biren 壁仞 BR100 (OAM, 550W). 7nm chiplet, 77B transistors. BF16/FP16 1024 TFLOPS, INT8 2048 TOPS, 64 GB HBM2e 1.64 TB/s. BLINK 512 GB/s 8-card fabric."
    notes_en: "Biren BR100 (壁仞) — 2022 flagship. OAM format, 550W. 1024 TFLOPS BF16/FP16 (PFLOPS class), 64 GB HBM2e. BLINK interconnect 512 GB/s (8-card fabric). No FP8. US export-restricted since 2022 — production status uncertain."
    notes_zh: "壁仞 BR100 — 2022 年旗舰 OAM 卡,550W。BF16/FP16 1024 TFLOPS(PFLOPS 级),64 GB HBM2e。BLINK 互联 512 GB/s(8 卡全互联)。无 FP8。2022 年被美国出口管制,后续量产状态不明。"
  - id: BR104
    aliases: [Biren-BR104, 壁仞BR104, 壁砺104]
    memory_gb: 32
    nvlink_bandwidth_gbps: 128
    memory_bandwidth_gbps: 820
    fp16_tflops: 512
    fp8_support: false
    fp4_support: false
    spec_source: "Biren 壁仞 BR104 (PCIe, 300W). Single-die version of BR100 with halved specs. BF16/FP16 512 TFLOPS, 32 GB HBM2e. Won MLPerf Inference ResNet50 and BERT single-card top-1 in its class."
    notes_en: "Biren BR104 — PCIe single-die version of BR100. 300W, 512 TFLOPS BF16/FP16, 32 GB HBM2e. Won MLPerf Inference BERT (1.58x A100 in server mode). No FP8. Export-restricted."
    notes_zh: "壁仞 BR104 — BR100 的单 die PCIe 版本。300W,BF16/FP16 512 TFLOPS,32 GB HBM2e。MLPerf Inference BERT 测试 server 模式性能达 A100 的 1.58 倍。无 FP8。已被出口管制。"
  - id: BI-V100
    aliases: [Iluvatar-BI-V100, 天数天垓100, TianGai-100]
    memory_gb: 32
    nvlink_bandwidth_gbps: 64
    memory_bandwidth_gbps: 1200
    fp16_tflops: 147
    fp8_support: false
    fp4_support: false
    spec_source: "Iluvatar CoreX 天数智芯 BI-V100 (天垓100). 7nm, SIMT, 24B transistors, 2.5D CoWoS packaging. FP16 147 TFLOPS / INT8 295 TOPS. 32 GB HBM2, 1.2 TB/s bandwidth. PCIe 4.0 x16, 250W TDP."
    notes_en: "Iluvatar (天数智芯) BI-V100 — training/general-purpose. 7nm SIMT architecture, 32 GB HBM2, 1.2 TB/s memory bandwidth. FP16 147 TFLOPS, INT8 295 TOPS. 250W TDP. Interconnect bandwidth per card is modest (~64 GB/s shared)."
    notes_zh: "天数智芯 BI-V100(天垓100)— 训练/通用 GPU。7nm SIMT 架构,32 GB HBM2,1.2 TB/s 显存带宽。FP16 147 TFLOPS,INT8 295 TOPS。250W TDP。单卡互联带宽 ~64 GB/s,相对较低。"
  - id: MR-V100
    aliases: [Iluvatar-MR-V100, 天数智铠100, ZhiKai-100]
    memory_gb: 32
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 1200
    fp16_tflops: 100
    fp8_support: false
    fp4_support: false
    spec_source: "Iluvatar CoreX 天数智芯 智铠100 (MR-V100) 2022. Inference card, 32 GB HBM2E, ~200 TFLOPS aggregated mixed-precision (BF16/FP16), 128-channel 1080p video decode, 150W TDP."
    notes_en: "Iluvatar inference card (智铠100). 32 GB HBM2E. 150W TDP. Primarily inference-focused — mixed-precision aggregated throughput ~200 TFLOPS."
    notes_zh: "天数智芯智铠100 推理卡。32 GB HBM2E,150W TDP。主要面向推理场景,混合精度聚合算力约 200 TFLOPS。"
  - id: MLU370-X8
    aliases: [Cambricon-MLU370-X8, 寒武纪MLU370-X8, 思元370-X8]
    memory_gb: 48
    nvlink_bandwidth_gbps: 200
    memory_bandwidth_gbps: 614
    fp16_tflops: 48
    fp8_support: false
    fp4_support: false
    spec_source: "Cambricon 寒武纪 MLU370-X8 (dual MLU370 chiplet, 250W). 48 GB LPDDR5, INT8 256 TOPS, FP32 24 TFLOPS (FP16 ~48 TFLOPS estimated, official figure not given). MLU-Link 200 GB/s."
    notes_en: "Cambricon (寒武纪) MLU370-X8 — dual-chip package, 250W. 48 GB LPDDR5 (not HBM), INT8 256 TOPS, FP32 24 TFLOPS. MLU-Link 200 GB/s for 8-card setups. LPDDR5 means lower memory bandwidth than HBM cards."
    notes_zh: "寒武纪 MLU370-X8 — 双芯粒封装,250W。48 GB LPDDR5(非 HBM),INT8 256 TOPS,FP32 24 TFLOPS。MLU-Link 200 GB/s,支持 8 卡部署。LPDDR5 意味着显存带宽低于 HBM 卡。"
  - id: MLU590
    aliases: [Cambricon-MLU590, 寒武纪MLU590, 思元590]
    memory_gb: 80
    nvlink_bandwidth_gbps: 372
    memory_bandwidth_gbps: 2000
    fp16_tflops: 314
    fp8_support: false
    fp4_support: false
    spec_source: "Cambricon 寒武纪 思元590 (MLU590) — 7nm, MLUv02/MLUarch05. 80 GB HBM (likely HBM2e given the 2 TB/s bandwidth), FP16 314 TFLOPS, FP32 80 TFLOPS, MLU-Link 372 GB/s. Reportedly used in Baidu's ERNIE (文心一言) project."
    notes_en: "Cambricon (寒武纪) MLU590 — flagship AI training chip. 80 GB HBM, 2 TB/s memory bandwidth. FP16 314 TFLOPS (dense), comparable to NVIDIA A100-level FP16 compute. MLU-Link 372 GB/s 8-card fabric. No FP8. Production volume and ecosystem still maturing."
    notes_zh: "寒武纪思元590 — 旗舰 AI 训练芯片。80 GB HBM,2 TB/s 显存带宽。FP16 314 TFLOPS(dense),综合性能约为 A100 级别。MLU-Link 372 GB/s 八卡互联。无 FP8。量产规模和生态仍在成熟。"
  - id: Hygon-K100-AI
    aliases: [K100-AI, 海光K100AI, DCU-K100-AI]
    memory_gb: 64
    nvlink_bandwidth_gbps: 184
    memory_bandwidth_gbps: 896
    fp16_tflops: 192
    fp8_support: false
    fp4_support: false
    spec_source: "Hygon 海光 K100 AI — DCU architecture (GPGPU+AI hybrid), 64 GB HBM, 896 GB/s memory bandwidth, 350W TDP. FP16 192 TFLOPS dense (some sources cite 256 TFLOPS; values vary). xGMI 184 GB/s."
    notes_en: "Hygon (海光) K100 AI — DCU series. 64 GB HBM, 896 GB/s bandwidth. FP16 192 TFLOPS (industry reports vary 100-256 TFLOPS depending on compute unit/mode). ROCm-compatible, can leverage the AMD software ecosystem. Positioned against the A800 for the Chinese market. 350W TDP."
    notes_zh: "海光 K100 AI — DCU 系列。64 GB HBM,896 GB/s 带宽。FP16 192 TFLOPS(公开资料数字因计算单元和精度模式不同有 100-256 TFLOPS 差异)。兼容 ROCm,可复用 AMD 软件生态。面向国产 A800 替代场景。350W TDP。"
  - id: Hygon-Z100
    aliases: [Z100, 海光Z100, DCU-Z100, 深算二号]
    memory_gb: 32
    nvlink_bandwidth_gbps: 184
    memory_bandwidth_gbps: 1000
    fp16_tflops: 180
    fp8_support: false
    fp4_support: false
    spec_source: "Hygon 海光 DCU Z100 (深算二号) — 32 GB HBM2, 1 TB/s bandwidth, 8192 compute cores, FP32 90 TFLOPS, FP16 ~180 TFLOPS (2x FP32), FP64 10.8 TFLOPS. xGMI 184 GB/s. Performance reported as 80-90% of A100. 350W TDP."
    notes_en: "Hygon (海光) DCU Z100 / 深算二号. 32 GB HBM2, 1 TB/s bandwidth, 8192 compute units. FP16 180 TFLOPS, FP32 90 TFLOPS, FP64 10.8 TFLOPS. 350W. Performance cited at 80-90% of A100. ROCm stack, PCIe Gen4 + xGMI multi-card."
    notes_zh: "海光 DCU Z100(深算二号)。32 GB HBM2,1 TB/s 带宽,8192 计算单元。FP16 180 TFLOPS,FP32 90 TFLOPS,FP64 10.8 TFLOPS。350W。综合性能约为 A100 的 80-90%。基于 ROCm 栈,PCIe Gen4 + xGMI 多卡互联。"
  - id: MTT-S4000
    aliases: [MooreThreads-S4000, 摩尔线程S4000, MTT-S4000-48G]
    memory_gb: 48
    nvlink_bandwidth_gbps: 240
    memory_bandwidth_gbps: 768
    fp16_tflops: 100
    fp8_support: false
    fp4_support: false
    spec_source: "Moore Threads MTT S4000 datasheet 2023 — 3rd-gen MUSA (曲院). 48 GB GDDR6, 768 GB/s bandwidth. FP16/BF16 100 TFLOPS, INT8 200 TOPS. MTLink 1.0 240 GB/s."
    notes_en: "Moore Threads (摩尔线程) S4000 — domestic AI training card. 48 GB GDDR6 (not HBM), 768 GB/s. FP16/BF16 100 TFLOPS. MTLink 1.0 240 GB/s. CUDA compatibility via MUSA translation."
    notes_zh: "摩尔线程 S4000 — 国产训推加速卡。48 GB GDDR6(非 HBM),768 GB/s 带宽。FP16/BF16 100 TFLOPS。MTLink 1.0 互联 240 GB/s。通过 MUSA 兼容 CUDA 生态。"
  - id: MTT-S3000
    aliases: [MooreThreads-S3000, 摩尔线程S3000]
    memory_gb: 32
    nvlink_bandwidth_gbps: 0
    memory_bandwidth_gbps: 448
    fp16_tflops: 30
    fp8_support: false
    fp4_support: false
    spec_source: "Moore Threads MTT S3000 — MUSA 春晓 architecture. 32 GB GDDR6, 448 GB/s. FP32 ~15.2 TFLOPS inferred from S4000 comparison (S4000 is 64%+ higher); FP16 ~30 TFLOPS estimate (datasheet not fully public)."
    notes_en: "Moore Threads (摩尔线程) S3000 — predecessor to S4000. 32 GB GDDR6, 448 GB/s. FP16 specs not fully published; estimated ~30 TFLOPS based on S4000 comparison. Multi-purpose server GPU, also supports rendering."
    notes_zh: "摩尔线程 S3000 — S4000 的前代。32 GB GDDR6,448 GB/s。FP16 官方未完全披露,基于 S4000 对比推算约 30 TFLOPS。通用服务器 GPU,兼顾渲染场景。"