guus4324343 committed on
Commit
a55860f
·
verified ·
1 Parent(s): e9dadd9

Create README.md

Files changed (1)
  1. README.md +273 -0
README.md ADDED
@@ -0,0 +1,273 @@
---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
pretty_name: Echo88 150M Instruct
tags:
- text-generation
- causal-lm
- instruct
- chat
- decoder-only
- autoregressive
- from-scratch
- llama
- retro
- 1980s
- usenet
- magazines
- books
- computer-history
- english
base_model:
- guus4324343/Echo88-150M-Base
datasets:
- guus4324343/Echo88-150M-Base
- guus4324343/Echo88-Instruct-173K
---

# Echo88-150M-Instruct

Echo88-150M-Instruct is a small, experimental instruction-tuned language model based on **Echo88-150M-Base**.

Echo88 is designed to feel like a helpful retro computer assistant whose records go up to the end of 1988. The model focuses on older books, magazines, Usenet-style discussion, early personal computing, 1980s culture, and historical computer terminology.

This is the first public instruction-tuned version of Echo88.

**Echo88-150M-Instruct v2 is coming soon.**

## Model Details

- **Model name:** Echo88-150M-Instruct
- **Base model:** `guus4324343/Echo88-150M-Base`
- **Model type:** decoder-only causal language model
- **Architecture:** LLaMA-style transformer
- **Training type:** supervised fine-tuning after base pretraining
- **Parameter count:** 163,606,272 parameters
- **Language:** English
- **Context length:** 2048 tokens
- **Tokenizer:** custom Echo88 byte-level BPE tokenizer
- **Vocabulary size:** 32,768
- **Training objective:** autoregressive next-token prediction + supervised instruction tuning

## Training Data

The base model was trained from scratch on the Echo88 pretraining dataset.

Base pretraining data:

- **Train tokens:** 1,470,629,888
- **Eval tokens:** 1,454,080
- **Block size:** 2048 tokens
- **Dataset:** `Echo88-150M-Base`

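
The figures above can be cross-checked with a little arithmetic. The sketch below assumes the token counts describe text packed into fixed 2048-token blocks, which this card does not state explicitly; the tokens-per-parameter ratio covers a single pass over the data, since the number of epochs is not given. The constant and function names are illustrative only.

```python
# Figures quoted in this model card.
TRAIN_TOKENS = 1_470_629_888
EVAL_TOKENS = 1_454_080
BLOCK_SIZE = 2048
PARAMETERS = 163_606_272


def full_blocks(tokens, block_size=BLOCK_SIZE):
    """Return (number of full blocks, leftover tokens)."""
    return divmod(tokens, block_size)


train_blocks, train_rest = full_blocks(TRAIN_TOKENS)
eval_blocks, eval_rest = full_blocks(EVAL_TOKENS)

# Tokens seen per parameter in one pass over the training split.
tokens_per_parameter = TRAIN_TOKENS / PARAMETERS

print(train_blocks, train_rest)         # 718081 0
print(eval_blocks, eval_rest)           # 710 0
print(round(tokens_per_parameter, 2))   # 8.99
```
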
The instruction version was fine-tuned using:

- `guus4324343/Echo88-Instruct-173K`
- additional small synthetic repair data for common pre-1989 facts and post-1988 boundary behavior

The instruction data includes examples from or based on:

- UTZOO Usenet
- BYTE Magazine
- PC Magazine
- TIME Magazine
- Internet Archive Magazine Rack text
- Gutenberg-style book text
- synthetic 1988-safe fact repair examples
- synthetic post-1988 boundary examples

## Intended Use

Echo88-150M-Instruct is intended for:

- retro AI experiments
- small language model testing
- 1980s-style assistant behavior
- computer-history Q&A
- text generation with a historical / retro flavor
- experimentation with small from-scratch language models

Example uses:

```text
Ask about early personal computers
Ask about modems, BASIC, DOS, floppy disks, BBS systems, Usenet
Generate retro computer-magazine style text
Experiment with 1980s-limited assistant behavior
```

## Chat Format

Recommended prompt format:

```text
<|system|>
You are Echo88, a helpful computer assistant whose records go up to the end of 1988. Answer clearly. Do not pretend to know events, products, or culture after 1988.
<|end|>
<|user|>
What is a modem?
<|assistant|>
```

The model was trained with these special tokens:

```text
<|endoftext|>
<|pad|>
<|unk|>
<|system|>
<|user|>
<|assistant|>
<|end|>
```

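
The string handling around this template can be exercised without loading the model. The following is a minimal sketch: `build_prompt` and `extract_answer` are hypothetical helper names, and cutting the reply at `<|end|>` or `<|endoftext|>` assumes the model closes its turn with one of those tokens, which this card does not spell out.

```python
SYSTEM_PROMPT = (
    "You are Echo88, a helpful computer assistant whose records go up to the end of 1988. "
    "Answer clearly. Do not pretend to know events, products, or culture after 1988."
)


def build_prompt(question, system=SYSTEM_PROMPT):
    """Assemble a single-turn prompt in the recommended chat format."""
    return (
        "<|system|>\n" + system + "\n<|end|>\n"
        + "<|user|>\n" + question + "\n<|assistant|>\n"
    )


def extract_answer(decoded):
    """Keep only the text after the last <|assistant|> tag, up to an end marker."""
    answer = decoded.split("<|assistant|>")[-1]
    # Assumption: the reply ends at <|end|> or <|endoftext|>, whichever comes first.
    for marker in ("<|end|>", "<|endoftext|>"):
        answer = answer.split(marker)[0]
    return answer.strip()


prompt = build_prompt("What is a modem?")
# Simulated decoder output, used here only to show the parsing step.
decoded = prompt + "A modem connects a computer to a telephone line.\n<|end|><|endoftext|>"
print(extract_answer(decoded))
```
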
## Example Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "guus4324343/Echo88-150M-Instruct"

# device_map="auto" requires the `accelerate` package to be installed.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

SYSTEM_PROMPT = (
    "You are Echo88, a helpful computer assistant whose records go up to the end of 1988. "
    "Answer clearly. Do not pretend to know events, products, or culture after 1988."
)


def ask(question, max_new_tokens=120):
    # Build a single-turn prompt in the recommended chat format.
    prompt = (
        "<|system|>\n"
        + SYSTEM_PROMPT
        + "\n<|end|>\n"
        + "<|user|>\n"
        + question
        + "\n<|assistant|>\n"
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Sample with a low temperature and repetition controls to limit loops.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.55,
            top_p=0.85,
            repetition_penalty=1.18,
            no_repeat_ngram_size=4,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # Keep special tokens so the assistant turn can be cut out of the decoded text.
    text = tokenizer.decode(output[0], skip_special_tokens=False)
    answer = text.split("<|assistant|>")[-1].split("<|end|>")[0].strip()
    return answer


print(ask("What is a modem?"))
```

## Example Prompts

```text
What is a modem?
What is the IBM PC?
What is BASIC?
What is a bulletin board system?
What is desktop publishing?
Who is Michael Jackson?
What is the Cold War?
What happened at Chernobyl?
What is Google?
Who won the World Cup in 1994?
```

## Knowledge Boundary

Echo88 is designed around a knowledge boundary ending at the close of **1988**.

It should be cautious with topics after 1988, such as:

* Google
* Facebook
* iPhone
* smartphones
* Wikipedia
* YouTube
* Windows 95
* PlayStation
* COVID-19
* 1990s, 2000s, 2010s, and 2020s events

Because this is a small experimental model, it may still hallucinate or answer incorrectly about later topics.

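
One rough way to probe the boundary is to label a few prompts as pre- or post-1988 and count where the model refuses. The sketch below is only illustrative: the probe labels, the `REFUSAL_HINTS` keywords, and the function names are assumptions rather than part of the Echo88 training or evaluation setup, and `fake_ask` stands in for a real model call so the scorer can run on its own.

```python
# Hypothetical probe set: questions taken from this card's example prompts,
# labelled "pre" (answerable before 1989) or "post" (after the cutoff).
PROBES = [
    ("What is a modem?", "pre"),
    ("What is the IBM PC?", "pre"),
    ("What happened at Chernobyl?", "pre"),
    ("What is Google?", "post"),
    ("Who won the World Cup in 1994?", "post"),
]

# Assumed refusal phrasing; adjust after reading real model outputs.
REFUSAL_HINTS = ("my records", "do not have", "don't have", "not aware")


def looks_like_refusal(answer):
    """Crude keyword check for a 'beyond my records' style answer."""
    lowered = answer.lower()
    return any(hint in lowered for hint in REFUSAL_HINTS)


def score_boundary(ask_fn, probes=PROBES):
    """Count over-refusals on pre-1989 probes and missed refusals on later ones."""
    counts = {"ok": 0, "over_refused": 0, "missed_refusal": 0}
    for question, era in probes:
        refused = looks_like_refusal(ask_fn(question))
        if era == "pre" and refused:
            counts["over_refused"] += 1
        elif era == "post" and not refused:
            counts["missed_refusal"] += 1
        else:
            counts["ok"] += 1
    return counts


def fake_ask(question):
    """Stand-in for a real model call, used only to exercise the scorer."""
    if "Google" in question:
        return "That is beyond my records, which end in 1988."
    return "It is a device used with computers."


print(score_boundary(fake_ask))  # {'ok': 4, 'over_refused': 0, 'missed_refusal': 1}
```
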
## Limitations

Echo88-150M-Instruct is experimental and small.

Known limitations:

* may hallucinate
* may repeat phrases
* may confuse people, places, or events
* may produce incorrect facts
* may over-refuse some valid pre-1989 topics
* may fail to refuse some post-1988 topics
* may produce OCR-like or magazine-like wording
* may struggle with reasoning
* may answer with outdated or historically biased language

This model is not intended for high-stakes use.

## Current Version

This is **Echo88-150M-Instruct v0**.

As the first instruction-tuned version of Echo88, it can answer some retro computing and general historical questions, but it is not yet reliable.

A better version is planned.

## Coming Soon

Planned improvements for Echo88-150M-Instruct v2:

* better factual repair data
* stronger post-1988 boundary behavior
* better pop culture and history answers
* fewer loops and repetitions
* cleaner chat behavior
* better answer style
* improved evaluation prompts
* possibly a larger model or expanded pretraining data

## Related Models and Datasets

* Base model: `guus4324343/Echo88-150M-Base`
* Base dataset: `guus4324343/Echo88-Pretrain-1.17B`
* Instruction dataset: `guus4324343/Echo88-Instruct-173K`

## Bias and Historical Content

Echo88 was trained on historical books, magazines, Usenet text, and synthetic instruction data. It may reproduce outdated assumptions, language, stereotypes, or viewpoints from older source material.

Users should review outputs carefully.

## License

The model weights are released under the Apache 2.0 license.

The training datasets are mixed-source and released separately. Users are responsible for checking dataset source rights, licensing, and suitability for their own use case.