# BrahmAI/superbpe-brahmai-65k-ev

**BrahmAI SuperBPE v5-EV Tokenizer**: byte-level BPE with two-phase SuperBPE training, specialised for EV / IoT / Smart Home / Edge domains.

| Property | Value |
|---|---|
| Vocab size | 65,536 (2¹⁶) |
| Phase 1 vocab | 45,000 (subword BPE, whitespace-aware) |
| Phase 2 slots | 20,536 (cross-word superwords) |
| BPE type | byte-level |
| Pretokenizer | EV_v5 |
| Domain vocab tokens | 797 fixed EV/IoT/SmartHome acronyms |
| Domain special tokens | 26 modality control tokens (13 pairs) |
| TPU alignment | ✅ 2¹⁶ vocab → zero MXU padding waste on TPUv4/v5 |
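The alignment claim can be checked with a quick sketch. The `padded_vocab` helper and the 50,257 comparison vocab (GPT-2's, chosen here only for contrast) are illustrative, not part of this tokenizer:

```python
# Hypothetical helper: round a vocab size up to a hardware tile multiple
# (TPU MXU lanes are 128 wide), then report the wasted embedding rows.
def padded_vocab(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return -(-vocab_size // multiple) * multiple

for v in (65_536, 50_257):  # this tokenizer vs. a non-aligned vocab (GPT-2)
    waste = padded_vocab(v) - v
    print(f"vocab={v:>6}  padded={padded_vocab(v):>6}  wasted rows={waste}")
# 65,536 is already a multiple of 128, so no rows are wasted;
# 50,257 would pad up to 50,304, wasting 47 rows.
```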

## Architecture

Two-phase SuperBPE training:

1. **Phase 1**: byte-level BPE with the EV_v5 pretokenizer (whitespace-aware, single-digit number rule). Learns subwords within word boundaries.
2. **Phase 2**: BPE run without a pretokenizer, allowing merges across whitespace. Learns cross-word superword tokens such as `ĠofĠthe`, `ĠBMSĠSOC`, `ĠOCPPĠ2.0.1`.
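The two phases can be illustrated with a self-contained toy: a greedy merge loop over a tiny corpus. This is a sketch of the idea only, not the project's actual training code:

```python
from collections import Counter

def top_merge(seqs):
    """Most frequent adjacent token pair across all sequences (one BPE step)."""
    pairs = Counter((a, b) for s in seqs for a, b in zip(s, s[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def apply_merge(seq, pair):
    """Replace every occurrence of `pair` in `seq` with the merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1]); i += 2
        else:
            out.append(seq[i]); i += 1
    return out

# Phase 1: pretokenized -- each word is its own sequence, so merges can
# never cross whitespace ("Ġ" marks a leading space, byte-level style).
words = [list("Ġof"), list("Ġthe")] * 3
while (pair := top_merge(words)):
    words = [apply_merge(w, pair) for w in words]
phase1_tokens = {w[0] for w in words}   # every word is now one subword token

# Phase 2: no pretokenizer -- the whole line is one sequence, so the next
# best merge crosses the word boundary and yields a superword.
line = ["Ġof", "Ġthe", "Ġof", "Ġthe"]
line = apply_merge(line, top_merge([line]))
print(phase1_tokens, line)
```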
| Feature | Value | Benefit |
|---|---|---|
| Byte-level BPE | no `<unk>` | handles any Unicode / binary input |
| Single-digit `[0-9]` rule | frees Phase 1 merge budget | more domain acronym tokens |
| Two-bucket rolling dedup | bucket size 3M | ≥50% coverage at all times |
| `NUM_RESERVED` | 128 | future domain expansion |
| Phase 2 max line length | 1,000 chars | prevents RAM explosion |
| Pinned domain shard | `syn_text.txt.gz` ×3 | stronger EV/IoT frequency signal |
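The single-digit rule can be sketched with a regex-based pretokenizer. The exact EV_v5 regex is not published, so this models only the stated rule (every digit becomes its own pretoken, so BPE never spends merges memorizing multi-digit numbers):

```python
import re

# Assumed behaviour of the single-digit rule, in Llama-style digit splitting:
# digits split one at a time, other runs of non-space characters stay whole.
DIGIT_SPLIT = re.compile(r"[0-9]|[^0-9\s]+|\s+")

def pretokenize(text: str):
    """Split text so each digit is an isolated pretoken."""
    return DIGIT_SPLIT.findall(text)

print(pretokenize("SOC=98.5%"))  # ['SOC=', '9', '8', '.', '5', '%']
```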

## Domain Coverage (v5-EV)

This tokenizer is optimised for EV, IoT, and Smart Home workloads:

| Domain | Examples |
|---|---|
| EV Charging | OCPP, ISO 15118-20, CHAdeMO, CCS, NACS, V2G, V2H, EVSE |
| EV Battery & BMS | SOC, SOH, LFP, NMC, CellBalancing, ThermalRunaway |
| EV Powertrain | BEV, PHEV, SiC-Inverter, eAxle, RegenerativeBraking |
| EV Bus Protocols | CAN-FD, UDS, J1939, AUTOSAR, SOME/IP, SecOC |
| Smart Home | Matter, Thread, Zigbee, Z-Wave, HomeKit, HomeAssistant |
| Energy Management | HEMS, OpenADR, VPP, TOU, BidirectionalEVSE, SunSpec |
| IoT Security | TLS1.3, mTLS, TPM, TrustZone, SecureBoot, X.509 |
| Edge Embedded | ESP32, STM32, nRF52, FreeRTOS, Zephyr, embassy, TFLite-Micro |

## Domain Modality Tokens (13 pairs, 26 tokens)

| Token pair | Use case |
|---|---|
| `<SENSOR_START>` / `<SENSOR_END>` | raw sensor telemetry blocks |
| `<CAN_FRAME_START>` / `<CAN_FRAME_END>` | raw CAN / CAN-FD frames |
| `<OCPP_START>` / `<OCPP_END>` | OCPP charge session messages |
| `<OBD_START>` / `<OBD_END>` | OBD-II / UDS diagnostic blocks |
| `<MATTER_START>` / `<MATTER_END>` | Matter protocol frames |
| `<V2G_START>` / `<V2G_END>` | ISO 15118-20 BPT V2G sessions |
| `<THREAD_FRAME_START>` / `<THREAD_FRAME_END>` | Thread TLV / MeshCoP / SRP |
| `<FAULT_START>` / `<FAULT_END>` | fault / DTC blocks |
| `<ALERT_START>` / `<ALERT_END>` | alert / notification blocks |
| `<VISION_START>` / `<VISION_END>` | camera / radar / lidar frames |
| `<ACTION_START>` / `<ACTION_END>` | actuation commands |
| `<JSON_START>` / `<JSON_END>` | structured JSON data |
| `<CMD_START>` / `<CMD_END>` | generic commands / instructions |
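Because the pairs are ordinary added tokens, framing machine data is plain string concatenation. A hypothetical helper (name and sample frame are assumptions for illustration, not part of any released API):

```python
# Hypothetical helper showing how the paired modality tokens above are
# intended to delimit raw machine data inside a prompt.
def wrap_modality(kind: str, payload: str) -> str:
    """Wrap a raw payload in its <KIND_START> ... <KIND_END> token pair."""
    return f"<{kind}_START>{payload}<{kind}_END>"

# An illustrative J1939-style CAN frame, fenced so a model can treat it
# as telemetry rather than natural language:
frame = wrap_modality("CAN_FRAME", "18FEF100#F004FFFF4CE5FFFF")
print(frame)
```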

## Fixed Domain Vocabulary

797 domain acronyms are registered as AddedTokens, so BPE can never split them: `OCPP`, `BMS`, `SOC`, `CAN-FD`, `EVSE`, `V2G`, `Matter`, `Thread`, `Zigbee`, `FreeRTOS`, `Zephyr`, `TLS1.3`, `ISO15118-20`, `AUTOSAR`, and 783 more.
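Conceptually, AddedTokens are atomic strings matched (longest-first) before BPE runs. A minimal sketch of that behaviour, not the real implementation, using a six-entry subset of the 797:

```python
import re

# Tiny subset of the domain vocabulary, for illustration only.
DOMAIN_VOCAB = ["CAN-FD", "CAN", "OCPP", "BMS", "SOC", "EVSE"]

# Longest-first alternation keeps "CAN-FD" from matching as "CAN" + "-FD".
pattern = re.compile(
    "|".join(re.escape(t) for t in sorted(DOMAIN_VOCAB, key=len, reverse=True))
)

def protect_domain_tokens(text: str):
    """Split text into (is_atomic, span) pieces; atomic spans skip BPE."""
    pieces, last = [], 0
    for m in pattern.finditer(text):
        if m.start() > last:
            pieces.append((False, text[last:m.start()]))
        pieces.append((True, m.group()))
        last = m.end()
    if last < len(text):
        pieces.append((False, text[last:]))
    return pieces

print(protect_domain_tokens("BMS SOC via CAN-FD"))
```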

## Special Tokens

| Token | Purpose |
|---|---|
| `<pad>` | padding |
| `<eos>` | end of sequence |
| `<bos>` | beginning of sequence |
| `<unk>` | unknown (rarely used in byte-level BPE) |
| `<|system|>` | system turn |
| `<|user|>` | user turn |
| `<|assistant|>` | assistant turn |
| `<|end_of_turn|>` | turn separator |
| `<think>` / `</think>` | chain-of-thought reasoning |
| `<|nothink|>` | disable chain-of-thought |
| `<tool_call>` / `</tool_call>` | tool invocation |
| `<tool_response>` / `</tool_response>` | tool result |
| `<|fim_prefix|>` / `<|fim_middle|>` / `<|fim_suffix|>` | fill-in-the-middle |

## Training Details

| Parameter | Value |
|---|---|
| Phase 1 vocab target | 45,000 |
| Phase 2 superword slots | 20,536 |
| Total vocab | 65,536 |
| Total merges | 64,279 |
| Peak RSS | 34.02 GB |
| Training time | 0.19 h (≈11 min) |
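A quick sanity check of the published numbers: the two phases together fill the entire 2¹⁶ table, which is exactly the power-of-two size behind the TPU-alignment claim earlier in the card:

```python
# Numbers from the training-details table above.
PHASE1_VOCAB = 45_000
PHASE2_SLOTS = 20_536

TOTAL = PHASE1_VOCAB + PHASE2_SLOTS
print(TOTAL, TOTAL == 2 ** 16)  # 65536 True
```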

## Usage

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BrahmAI/superbpe-brahmai-65k-ev")

# Basic encoding
ids = tokenizer.encode("BMS SOC=98.5% OCPP2.0.1 ChargeStart")
print(ids)

# Chat template — standard conversation
messages = [
    {"role": "system", "content": "You are a BrahmAI EV assistant."},
    {"role": "user",   "content": "What is the current SOC of the battery?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)

# Chat template — with chain-of-thought
messages = [
    {"role": "user", "content": "Diagnose DTC P0A7F."},
    {"role": "assistant",
     "thought": "P0A7F is a hybrid battery voltage fault.",
     "content": "Replace the high-voltage battery module."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False)

# Chat template — sensor telemetry block (v5-EV)
messages = [
    {"role": "sensor", "content": '{"ts":1712345678,"SOC":82.3,"voltage":396.1}'},
    {"role": "user",   "content": "Is the battery healthy?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Chat template — OCPP session block (v5-EV)
messages = [
    {"role": "ocpp", "content": '[2,"abc","StatusNotification",{"status":"Charging"}]'},
    {"role": "user", "content": "Why is the charger stuck?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Fill-in-the-middle
fim = [{"role": "fim_request",
        "prefix": "def charge_ev(soc):\n    if soc < ",
        "suffix": ":\n        return True"}]
prompt = tokenizer.apply_chat_template(fim, tokenize=False)
```

## Citation

```bibtex
@inproceedings{liu-etal-2025-superbpe,
  title     = {SuperBPE: Space travel for language models},
  author    = {Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi},
  booktitle = {Second Conference on Language Modeling},
  year      = {2025},
  url       = {https://arxiv.org/abs/2503.13423}
}
```