# BrahmAI/superbpe-brahmai-65k-ev

BrahmAI SuperBPE v5-EV Tokenizer: byte-level BPE with two-phase SuperBPE training, specialised for EV / IoT / Smart Home / Edge domains.
| Property | Value |
|---|---|
| Vocab size | 65,536 (2¹⁶) |
| Phase 1 vocab | 45,000 (subword BPE, whitespace-aware) |
| Phase 2 slots | 20,536 (cross-word superwords) |
| BPE type | byte-level |
| Pretokenizer | `EV_v5` |
| Domain vocab tokens | 797 fixed EV/IoT/Smart Home acronyms |
| Domain special tokens | 26 modality control token pairs |
| TPU alignment | ✅ 2¹⁶ vocab means zero MXU padding waste on TPUv4/v5 |
## Architecture

Two-phase SuperBPE training:

- **Phase 1**: byte-level BPE with the `EV_v5` pretokenizer (whitespace-aware, single-digit number rule). Learns subwords within word boundaries.
- **Phase 2**: BPE without a pretokenizer, allowing merges across whitespace. Learns cross-word superword tokens such as `ĠofĠthe`, `ĠBMSĠSOC`, `ĠOCPPĠ2.0.1`.
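The two phases can be illustrated with a toy pure-Python BPE loop (a didactic sketch, not the actual training code; `most_frequent_pair` and `apply_merge` are illustrative helpers):

```python
from collections import Counter

def most_frequent_pair(seqs):
    """Most common adjacent token pair across all sequences."""
    pairs = Counter()
    for toks in seqs:
        pairs.update(zip(toks, toks[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(seqs, pair):
    """Replace each occurrence of `pair` with its concatenation."""
    merged, out = pair[0] + pair[1], []
    for toks in seqs:
        new, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and (toks[i], toks[i + 1]) == pair:
                new.append(merged)
                i += 2
            else:
                new.append(toks[i])
                i += 1
        out.append(new)
    return out

corpus = ["of the"] * 3

# Phase 1: whitespace pretokenizer -- merges stay inside word boundaries,
# so the best this phase can learn is whole words like "of" and "the".
words = [list(w) for line in corpus for w in line.split()]
for _ in range(4):
    pair = most_frequent_pair(words)
    if pair is None:
        break
    words = apply_merge(words, pair)

# Phase 2: no pretokenizer -- merges may cross the space, collapsing the
# whole line into one superword token "of the" (what the byte-level vocab
# would render as ĠofĠthe).
seqs = [list(line) for line in corpus]
for _ in range(6):
    pair = most_frequent_pair(seqs)
    if pair is None:
        break
    seqs = apply_merge(seqs, pair)
```

Phase 1 never produces a token containing a space; Phase 2's extra merges are exactly what buys the superword slots.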
| Feature | Value | Benefit |
|---|---|---|
| Byte-level BPE | No `<unk>` | Handles any Unicode / binary input |
| Single-digit `[0-9]` rule | Frees Phase 1 merge budget | More domain acronym tokens |
| Two-bucket rolling dedup | Bucket size 3M | ≥50% coverage at all times |
| `NUM_RESERVED` | 128 | Future domain expansion |
| Phase 2 max line chars | 1,000 | Prevents RAM explosion |
| Pinned domain shard | `syn_text.txt.gz` ×3 | Stronger EV/IoT frequency signal |
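One plausible reading of the two-bucket rolling dedup row is sketched below (a hypothetical reconstruction; the actual implementation is not published): two hash-set buckets roll over, so the most recently seen lines are always covered and coverage of the window never drops below 50%.

```python
class RollingDedup:
    """Two-bucket rolling deduplicator (illustrative sketch only).

    Holds at most 2 * bucket_size line hashes. When the active bucket
    fills, the older bucket is discarded and a fresh one started, so at
    least the most recent bucket_size lines are always remembered.
    """

    def __init__(self, bucket_size):
        self.bucket_size = bucket_size
        self.old = set()
        self.active = set()

    def is_duplicate(self, line):
        h = hash(line)
        if h in self.active or h in self.old:
            return True
        if len(self.active) >= self.bucket_size:
            self.old = self.active   # roll: oldest bucket dropped
            self.active = set()
        self.active.add(h)
        return False

dedup = RollingDedup(bucket_size=3)
flags = [dedup.is_duplicate(x) for x in ["a", "b", "a", "c", "d", "e", "a"]]
```

Memory stays bounded at two buckets while recent duplicates (like the final `"a"`, still in the rolled-over bucket) are still caught.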
## Domain Coverage (v5-EV)

This tokenizer is optimised for EV, IoT, and Smart Home workloads:

| Domain | Examples |
|---|---|
| EV Charging | OCPP, ISO 15118-20, CHAdeMO, CCS, NACS, V2G, V2H, EVSE |
| EV Battery & BMS | SOC, SOH, LFP, NMC, CellBalancing, ThermalRunaway |
| EV Powertrain | BEV, PHEV, SiC-Inverter, eAxle, RegenerativeBraking |
| EV Bus Protocols | CAN-FD, UDS, J1939, AUTOSAR, SOME/IP, SecOC |
| Smart Home | Matter, Thread, Zigbee, Z-Wave, HomeKit, HomeAssistant |
| Energy Management | HEMS, OpenADR, VPP, TOU, BidirectionalEVSE, SunSpec |
| IoT Security | TLS1.3, mTLS, TPM, TrustZone, SecureBoot, X.509 |
| Edge Embedded | ESP32, STM32, nRF52, FreeRTOS, Zephyr, embassy, TFLite-Micro |
## Domain Modality Tokens (26 control tokens)

| Token pair | Use case |
|---|---|
| `<SENSOR_START>` / `<SENSOR_END>` | Raw sensor telemetry blocks |
| `<CAN_FRAME_START>` / `<CAN_FRAME_END>` | Raw CAN / CAN-FD frames |
| `<OCPP_START>` / `<OCPP_END>` | OCPP charge-session messages |
| `<OBD_START>` / `<OBD_END>` | OBD-II / UDS diagnostic blocks |
| `<MATTER_START>` / `<MATTER_END>` | Matter protocol frames |
| `<V2G_START>` / `<V2G_END>` | ISO 15118-20 BPT V2G sessions |
| `<THREAD_FRAME_START>` / `<THREAD_FRAME_END>` | Thread TLV / MeshCoP / SRP |
| `<FAULT_START>` / `<FAULT_END>` | Fault / DTC blocks |
| `<ALERT_START>` / `<ALERT_END>` | Alert / notification blocks |
| `<VISION_START>` / `<VISION_END>` | Camera / radar / lidar frames |
| `<ACTION_START>` / `<ACTION_END>` | Actuation commands |
| `<JSON_START>` / `<JSON_END>` | Structured JSON data |
| `<CMD_START>` / `<CMD_END>` | Generic commands / instructions |
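For example, a data pipeline might wrap raw telemetry in the matching pair before tokenization (`wrap_modality` is a hypothetical helper, not part of this tokenizer):

```python
import json

def wrap_modality(tag, payload):
    """Wrap a payload in its <TAG_START> / <TAG_END> control tokens."""
    return f"<{tag}_START>{payload}<{tag}_END>"

# Compact JSON keeps the telemetry block short before tokenization.
telemetry = json.dumps({"ts": 1712345678, "SOC": 82.3}, separators=(",", ":"))
block = wrap_modality("SENSOR", telemetry)
```

Because each control token is a single vocabulary entry, the model sees an unambiguous one-token boundary around the telemetry rather than a splittable string.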
## Fixed Domain Vocabulary

797 domain acronyms are registered as AddedTokens, so BPE can never split them:

```
OCPP, BMS, SOC, CAN-FD, EVSE, V2G, Matter, Thread, Zigbee,
FreeRTOS, Zephyr, TLS1.3, ISO15118-20, AUTOSAR, and 783 more.
```
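The effect of AddedTokens can be sketched with a simple protected pre-split (illustrative only, not the `tokenizers` library internals): protected strings are cut out of the text first, so BPE only ever sees the remaining spans.

```python
import re

# A handful of the 797 protected acronyms, for illustration.
PROTECTED = ["CAN-FD", "OCPP", "BMS", "SOC"]

# Longest-first so "CAN-FD" wins over any shorter overlapping match.
pattern = re.compile(
    "|".join(re.escape(t) for t in sorted(PROTECTED, key=len, reverse=True))
)

def pre_split(text):
    """Split text into (is_protected, span) pieces before BPE runs."""
    pieces, last = [], 0
    for m in pattern.finditer(text):
        if m.start() > last:
            pieces.append((False, text[last:m.start()]))
        pieces.append((True, m.group()))   # BPE never touches these spans
        last = m.end()
    if last < len(text):
        pieces.append((False, text[last:]))
    return pieces

pieces = pre_split("BMS SOC over CAN-FD")
```

Only the `(False, …)` spans would be handed to the BPE merges; every protected acronym survives as an atomic token.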
## Special Tokens

| Token | Purpose |
|---|---|
| `<pad>` | Padding |
| `<eos>` | End of sequence |
| `<bos>` | Beginning of sequence |
| `<unk>` | Unknown (rarely used in byte-level BPE) |
| `<\|system\|>` | System turn |
| `<\|user\|>` | User turn |
| `<\|assistant\|>` | Assistant turn |
| `<\|end_of_turn\|>` | Turn separator |
| `<think>` / `</think>` | Chain-of-thought reasoning |
| `<\|nothink\|>` | Disable chain-of-thought |
| `<tool_call>` / `</tool_call>` | Tool invocation |
| `<tool_response>` / `</tool_response>` | Tool result |
| `<\|fim_prefix\|>` / `<\|fim_middle\|>` / `<\|fim_suffix\|>` | Fill-in-the-middle |
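Assuming the conventional prefix-suffix-middle (PSM) ordering, a FIM prompt could be assembled by hand from these tokens (a sketch; verify the ordering against the tokenizer's chat template, which handles this via the `fim_request` role shown in the usage section):

```python
def build_fim_prompt(prefix, suffix):
    """PSM ordering: the model generates the missing middle after
    <|fim_middle|>. The ordering is an assumption, not confirmed here."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt("def charge_ev(soc):\n if soc < ", ":\n return True")
```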
## Training Details

| Parameter | Value |
|---|---|
| Phase 1 vocab target | 45,000 |
| Phase 2 superword slots | 20,536 |
| Total vocab | 65,536 |
| Total merges | 64,279 |
| Peak RSS | 34.02 GB |
| Training time | 0.19 h |
## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BrahmAI/superbpe-brahmai-65k-ev")

# Basic encoding
ids = tokenizer.encode("BMS SOC=98.5% OCPP2.0.1 ChargeStart")
print(ids)

# Chat template
messages = [
    {"role": "system", "content": "You are a BrahmAI EV assistant."},
    {"role": "user", "content": "What is the current SOC of the battery?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

# Assistant turn with chain-of-thought
messages = [
    {"role": "user", "content": "Diagnose DTC P0A7F."},
    {"role": "assistant",
     "thought": "P0A7F is a hybrid battery voltage fault.",
     "content": "Replace the high-voltage battery module."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Sensor telemetry turn
messages = [
    {"role": "sensor", "content": '{"ts":1712345678,"SOC":82.3,"voltage":396.1}'},
    {"role": "user", "content": "Is the battery healthy?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# OCPP message turn
messages = [
    {"role": "ocpp", "content": '[2,"abc","StatusNotification",{"status":"Charging"}]'},
    {"role": "user", "content": "Why is the charger stuck?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Fill-in-the-middle
fim = [{"role": "fim_request",
        "prefix": "def charge_ev(soc):\n if soc < ",
        "suffix": ":\n return True"}]
prompt = tokenizer.apply_chat_template(fim, tokenize=False)
```
## Citation

```bibtex
@inproceedings{liu-etal-2025-superbpe,
  title     = {SuperBPE: Space travel for language models},
  author    = {Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi},
  booktitle = {Second Conference on Language Modeling},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.13423}
}
```