---
license: apache-2.0
library_name: transformers
tags:
- causal-lm
- text-generation
- transformer
- decoder-only
- fixed-embeddings
- binary-token-codes
- research
language:
- en
---

# Fixed Minimal Binary Code Model

Research checkpoint for the paper: **Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes**

## Model variant

This repository contains the **fixed minimal binary token-code model**. Instead of a trainable input embedding table, each token ID is represented by its exact minimal binary code.

For vocabulary size:

```text
V = 65,536
```

the minimal injective binary code width is:

```text
K = ceil(log2(V)) = 16
```

The 16-dimensional binary code is tiled to model width 1024. The model therefore uses:

```text
0 trainable input-embedding parameters
```

The output projection remains standard and trainable. An illustrative sketch of this code construction appears at the end of this card.

## Architecture

- decoder-only Transformer
- vocabulary size: 65,536
- model width: 1024
- number of layers: 32
- number of attention heads: 32
- context length: 1024
- rotary positional embeddings
- GELU activations
- untied trainable output projection

## Loading example

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "Bochkov/llm-fix-min-fixed-minimal-binary-code"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
model.eval()

prompt = "Question: What is the capital of France?\nAnswer:"
input_ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=3, do_sample=False)

print(tokenizer.decode(output_ids[0].tolist()))
```

## Intended use

This checkpoint is provided for reproducibility of the paper's main claim: a trainable input embedding table is not necessary for useful language modeling in the studied regime.

## Limitations

This model is a research checkpoint. It is not intended for deployment. It may produce incorrect, biased, unsafe, or nonsensical outputs.

## Training data

The model was trained on the same FineWeb-Edu + Cosmopedia mixture used for the matched comparisons in the paper. Dataset terms and licenses are those of the original datasets.

---

## 🧑‍🔬 Citation & Concept

If you use this model or the underlying concepts in your research, please cite our work:

```bibtex
@misc{bochkov2026languagemodelstrainableinput,
      title={Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes},
      author={A. Bochkov},
      year={2026},
      eprint={2605.09751},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.09751},
}
```
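
## Illustrative sketch: binary token codes

The mapping from token IDs to fixed model inputs is simple enough to sketch. The snippet below is a minimal illustration, not the checkpoint's actual implementation: the helper name `binary_code_embedding`, the LSB-first bit order, the 0/1 value scale, and the exact tiling layout are assumptions made for this example; the code shipped with the repository (loaded via `trust_remote_code=True`) defines the real construction.

```python
import torch

V = 65_536      # vocabulary size
K = 16          # ceil(log2(V)): minimal injective binary code width
D_MODEL = 1024  # model width; 1024 / 16 = 64 copies of the code

def binary_code_embedding(token_ids: torch.Tensor) -> torch.Tensor:
    """Map integer token IDs to fixed binary-code inputs of width D_MODEL.

    Each ID is expanded into its K-bit binary representation (LSB first in
    this sketch) and the K-dimensional code is tiled to fill the model
    width. No parameters are involved, so nothing in this mapping is
    trainable.
    """
    bit_positions = torch.arange(K, device=token_ids.device)  # [K]
    bits = (token_ids.unsqueeze(-1) >> bit_positions) & 1     # [..., K]
    codes = bits.to(torch.float32)                            # 0/1 values
    return torch.tile(codes, (D_MODEL // K,))                 # [..., D_MODEL]

# Example: a batch of two length-3 sequences
ids = torch.tensor([[1, 2, 65_535], [3, 4, 5]])
print(binary_code_embedding(ids).shape)  # torch.Size([2, 3, 1024])
```

Because the mapping is a fixed function of the token ID, it contributes zero trainable parameters, which is the sense in which the card reports `0 trainable input-embedding parameters`.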