E6E831728 committed
Commit b9820e0 · verified · 1 parent: 6b5e157

Create README.md

Files changed (1): README.md (new file, +97 lines)
---
license: other
library_name: transformers
tags:
- language-modeling
- transformer
- decoder-only
- table-free-input
- binary-token-codes
- affine-recoding
- research
---

# Affine-Recoded Minimal Code Table-Free Model

This is an anonymized research checkpoint for the paper:

**Language Models Without a Trainable Input Embedding Table: Learning from Fixed Minimal Binary Token Codes**

## Model variant

This repository contains the **fully table-free affine-recoded minimal binary-code model**.

The model does not use an input embedding table. Instead, token codes are computed directly from token IDs.

For each token ID `t`, the model computes the 16-bit binary expansion (16 bits suffice because the vocabulary contains 65,536 = 2^16 tokens):

```text
c(t) = bin_16(t)
```

It then applies a fixed invertible affine recoding over GF(2):

```text
c_tilde(t) = A c(t) xor b
```

where:

- `A` is an invertible binary matrix in `GL(16, 2)`
- `b` is a fixed binary shift vector

The resulting 16-dimensional binary code is tiled to the model width of 1024.
+ The model uses:
46
+
47
+ ```text
48
+ 0 trainable input-embedding parameters
49
+ 0 input embedding table
50
+ ```
51
+
52
+ The output projection remains standard and trainable.
53
+
54
+ ## Architecture
55
+
56
+ - decoder-only Transformer
57
+ - vocabulary size: 65,536
58
+ - model width: 1024
59
+ - number of layers: 32
60
+ - number of attention heads: 32
61
+ - context length: 1024
62
+ - rotary positional embeddings
63
+ - GELU activations
64
+ - untied trainable output projection
65
+
66
+ ## Loading example
67
+
68
+ ```python
69
+ import torch
70
+ from transformers import AutoTokenizer, AutoModelForCausalLM
71
+
72
+ repo_id = "E6E831728/affine-recoded-minimal-code-table-free"
73
+
74
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
75
+ model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
76
+ model.eval()
77
+
78
+ prompt = "Question: What is the capital of United Kingdom?\nAnswer:"
79
+ input_ids = torch.tensor([tokenizer.encode(prompt)], dtype=torch.long)
80
+
81
+ with torch.no_grad():
82
+ output_ids = model.generate(input_ids, max_new_tokens=16, do_sample=False)
83
+
84
+ print(tokenizer.decode(output_ids[0].tolist()))
85
+ ```
86
+
87
+ ## Intended use
88
+
89
+ This checkpoint is provided for anonymous review and reproducibility. It demonstrates that the fixed minimal-code input interface remains viable even when the canonical token-ID binary code is randomly recoded by an invertible affine transform.
90
+
91
+ ## Limitations
92
+
93
+ This model is a research checkpoint. It is not intended for deployment. It may produce incorrect, biased, unsafe, or nonsensical outputs.
94
+
95
+ ## Training data
96
+
97
+ The model was trained on the same FineWeb-Edu + Cosmopedia mixture used for the matched comparisons in the paper. Dataset terms and licenses are those of the original datasets.