Harley-ml committed on
Commit
d714626
·
verified ·
1 Parent(s): d76ef63

Update README.md

Files changed (1)
  1. README.md +221 -3
README.md CHANGED
@@ -18,6 +18,224 @@ tags:
18
 
19
  # MCODLarge
20
 
21
- MCOD, which stands for "Model Configs on Drugs," large is a 4.7M parameter model trained on 7.1M tokens of Hugging Face model configs.
22
- We are well aware that 7.1M tokens is under the Chinchilla optimal target, but including more tokens wouldn't help diversity. For example, after cleaning the full 90M token dataset, we were left with 7.1M tokens after deduping (over 13k docs) and filtering (by lang and length).
23
- Anyway, MCODLarge
21
+ MCOD, which stands for "Model Configs on Drugs," is a 4.7M parameter model trained on 7.1M tokens of Hugging Face model configs.
22
+
23
+ We are well aware that 7.1M tokens is below the Chinchilla-optimal target, but including more tokens wouldn't improve diversity. The full dataset had 90M tokens, but deduplication (over 13k docs) and filtering (by language and length) left only 7.1M.
24
+
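To put the gap in numbers, here is a minimal sketch of the arithmetic. It assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter as the Chinchilla-optimal ratio; that constant is an approximation and does not come from this README.

```python
# Rough Chinchilla-style budget check.
# The 20 tokens/param ratio is an assumed rule of thumb, not a figure from the README.
PARAMS = 4_700_000          # model size stated in the README
TRAIN_TOKENS = 7_100_000    # tokens left after deduplication and filtering
RAW_TOKENS = 90_000_000     # tokens before cleaning
TOKENS_PER_PARAM = 20       # assumption

target_tokens = PARAMS * TOKENS_PER_PARAM
actual_ratio = TRAIN_TOKENS / PARAMS
kept_fraction = TRAIN_TOKENS / RAW_TOKENS

print(f"Rule-of-thumb target : {target_tokens / 1e6:.0f}M tokens")
print(f"Actual tokens/param  : {actual_ratio:.2f}")
print(f"Kept after cleaning  : {kept_fraction:.1%}")
```

Under that assumption, even the uncleaned 90M tokens would sit just below the target, so the shortfall comes from the size of the source data and not only from the cleaning step.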
25
+ MCOD generates plausible-looking configs whose hyperparameters generally match the model family.
26
+
27
+ ## Architecture
28
+
29
+ | Parameter | Value |
30
+ |-------------------------|-------|
31
+ | hidden_size | 256 |
32
+ | num_hidden_layers | 4 |
33
+ | num_attention_heads | 4 |
34
+ | num_key_value_heads | 4 |
35
+ | intermediate_size | 1024 |
36
+ | max_position_embeddings | 1024 |
37
+ | rope_theta | 100000.0 |
38
+ | tie_word_embeddings | true |
39
+
40
+ MCOD uses the Qwen3 architecture.
41
+
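As a sanity check on the 4.7M figure, the table above can be turned into a rough parameter count. This is only a sketch under assumptions the README does not state: a head dimension of `hidden_size / num_attention_heads` (64), bias-free projections, a gated MLP with three projection matrices, and RMSNorm weights including per-head query/key norms, as in the Qwen3 implementation in Transformers. The vocabulary size is not given, so the embedding matrix is left out.

```python
# Rough non-embedding parameter count from the architecture table.
# Assumptions (not stated in the README): head_dim = hidden_size / num_attention_heads,
# no biases, gated MLP (gate/up/down projections), RMSNorm weights only.
hidden_size = 256
num_hidden_layers = 4
num_attention_heads = 4
num_key_value_heads = 4
intermediate_size = 1024
head_dim = hidden_size // num_attention_heads  # assumed: 64

q_proj = hidden_size * num_attention_heads * head_dim
kv_proj = 2 * hidden_size * num_key_value_heads * head_dim
o_proj = num_attention_heads * head_dim * hidden_size
attention = q_proj + kv_proj + o_proj

mlp = 3 * hidden_size * intermediate_size   # gate, up, and down projections
norms = 2 * hidden_size + 2 * head_dim      # two layer norms plus query/key norms

per_layer = attention + mlp + norms
non_embedding = num_hidden_layers * per_layer + hidden_size  # plus the final norm

print(f"Per layer     : {per_layer:,}")
print(f"Non-embedding : {non_embedding:,}")
```

Under these assumptions the transformer blocks account for about 4.2M parameters, which would leave roughly 0.5M of the stated 4.7M for the tied embedding matrix.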
42
+ ## Training
43
+
44
+ MCOD was trained on 18k entries totaling 7.1M tokens (about 1M words).
45
+
46
+ ### Hardware
47
+
48
+ MCOD was trained on a single NVIDIA RTX 2060 (6 GB) for 3 epochs with a batch size of 8.
49
+
50
+ ### Training Results
51
+
52
+ | Step | Epoch | Train Loss | Train PPL | Eval Loss | Eval PPL |
53
+ |------|-------|------------|-----------|-----------|----------|
54
+ | 200 | 0.44 | 4.1022 | 60.53 | 1.8218 | 6.18 |
55
+ | 400 | 0.88 | 1.0227 | 2.78 | 0.5671 | 1.76 |
56
+ | 600 | 1.33 | 0.5434 | 1.72 | 0.3560 | 1.43 |
57
+ | 800 | 1.77 | 0.3978 | 1.49 | 0.2939 | 1.34 |
58
+ | 1000 | 2.21 | 0.3486 | 1.42 | 0.2514 | 1.29 |
59
+ | 1200 | 2.65 | 0.2944 | 1.34 | 0.2259 | 1.25 |
60
+
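The perplexity columns appear to follow the usual relation PPL = exp(loss) for a mean cross-entropy loss measured in nats. The short check below recomputes the Eval PPL column from the Eval Loss column; the values are copied from the table above, and the relation itself is an assumption about how the table was produced.

```python
import math

# (step, eval_loss, reported_eval_ppl) rows copied from the table above
eval_rows = [
    (200, 1.8218, 6.18),
    (400, 0.5671, 1.76),
    (600, 0.3560, 1.43),
    (800, 0.2939, 1.34),
    (1000, 0.2514, 1.29),
    (1200, 0.2259, 1.25),
]

def perplexity(loss: float) -> float:
    """Perplexity of a mean cross-entropy loss given in nats."""
    return math.exp(loss)

for step, loss, reported in eval_rows:
    recomputed = perplexity(loss)
    print(f"step {step:>4}: exp({loss:.4f}) = {recomputed:.2f} (reported {reported:.2f})")
```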
61
+
62
+ ## Generations
63
+
64
+ Prompt: `{"model_type": "gpt2"`
65
+
66
+ Output:
67
+ ```
68
+ , "n_ctx": 1024, "n_embd": 768, "n_head": 12, "n_inner": null, "n_layer": 6, "n_positions": 1024, "reorder_and_upcast_attn": false, "resid_pdrop": 0.1, "scale_attn_by_inverse_layer_idx": false, "scale_attn_weights": true, "summary_activation": null, "summary_first_dropout": 0.1, "summary_proj_to_labels": true, "summary_type": "cls_index", "summary_use_proj": true, "task_specific_params": {"text-generation": {"do_sample": true, "max_length": 50}}, "torch_dtype": "float32", "transformers_version": "4.25.1", "use_cache": true, "vocab_size": 50257}
69
+ ```
70
+
71
+ Prompt: `{`
72
+
73
+ Output:
74
+ ```
75
+ "architectures": ["RobertaForSequenceClassification"], "attention_probs_dropout_prob": 0.1, "bos_token_id": 0, "classifier_dropout": null, "eos_token_id": 2, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": {"0": "LABEL_0"}, "initializer_range": 0.02, "intermediate_size": 3072, "label2id": {"LABEL_0": 0}, "layer_norm_eps": 1e-05, "max_position_embeddings": 514, "model_type": "roberta", "num_attention_heads": 12, "num_hidden_layers": 6, "pad_token_id": 1, "position_embedding_type": "absolute", "problem_type": "single_label_classification", "torch_dtype": "float32", "transformers_version": "4.11.3", "type_vocab_size": 1, "use_cache": true, "vocab_size": 50265}
76
+ ```
77
+
78
+ Prompt: `The cat crossed`
79
+
80
+ Output:
81
+ ```
82
+ , "architectures": ["BertForSequenceClassification"], "attention_probs_dropout_prob": 0.1, "classifier_dropout": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 768, "id2label": {"0": "anger", "1": "fear", "2": "surprise"}, "initializer_range": 0.02, "intermediate_size": 3072, "label2id": {"sadness": 0, "neutral": 1, "joy": 2}, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "model_type": "bert", "num_attention_heads": 12, "num_hidden_layers": 12, "pad_token_id": 0, "position_embedding_type": "absolute", "problem_type": "single_label_classification", "torch_dtype": "float32", "transformers_version": "4.11.3", "type_vocab_size": 2, "use_cache": true, "vocab_size": 30522}
83
+ ```
84
+
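Because the model continues whatever prefix it is given, a completion only forms a usable config when the prompt plus the output parses as JSON. The helper below is a minimal sketch of that check; `parse_config` is a hypothetical name, and the sample strings are shortened from the generations above.

```python
import json
from typing import Optional

def parse_config(prompt: str, completion: str) -> Optional[dict]:
    """Return the config as a dict, or None if prompt + completion is not a JSON object."""
    try:
        parsed = json.loads(prompt + completion)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

# A JSON prefix plus its continuation parses cleanly.
good = parse_config('{"model_type": "gpt2"', ', "n_ctx": 1024, "n_embd": 768, "n_head": 12}')
# A natural-language prompt leaves the combined text invalid.
bad = parse_config("The cat crossed", ', "model_type": "bert", "vocab_size": 30522}')

print(good)
print(bad)
```

Parsing only checks syntax: the third sample above would still pair mismatched `id2label` and `label2id` entries, so a semantic check is a separate step.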
85
+ ## Limitations
86
+
87
+ 1. Only generates model configs
88
+ 2. Cannot converse or reason
89
+ 3. Most unconditionally generated configs are BERT- or BART-centered
90
+
91
+ ## Use Cases
92
+
93
+ 1. Educational research
94
+ 2. JSON modeling
95
+ 3. Generating synthetic configs for pretraining or fine-tuning datasets (be careful; the model hallucinates a lot)
96
+ 4. Or, more simply, for fun.
97
+
98
+ ## Inference
99
+
100
+ ```python
101
+ # =============================================================================
102
+ # Inference
103
+ # =============================================================================
104
+
105
+ MODEL_DIR = "Harley-ml/MCOD-4.7M" # Hub repo id or local model directory
106
+ TOKENIZER_PATH = MODEL_DIR
107
+
108
+ # --- Generation settings ---
109
+ PROMPT = "{" # prompt
110
+ MAX_NEW_TOKENS = 1024
111
+ TEMPERATURE = 0.7
112
+ TOP_P = 0.95
113
+ TOP_K = 50
114
+ REPETITION_PENALTY = 1.1
115
+ DO_SAMPLE = True
116
+
117
+ # =============================================================================
118
+
119
+ import torch
120
+ from pathlib import Path
121
+ from transformers import (
122
+     AutoModelForCausalLM,
123
+     PreTrainedTokenizerFast,
124
+     AddedToken,
125
+ )
126
+
127
+ # ---------------------------------------------------------------------------
128
+ # Device
129
+ # ---------------------------------------------------------------------------
130
+
131
+ device = (
132
+     "cuda" if torch.cuda.is_available() else
133
+     "mps" if torch.backends.mps.is_available() else
134
+     "cpu"
135
+ )
136
+ print(f"Device : {device}")
137
+
138
+ # ---------------------------------------------------------------------------
139
+ # Tokenizer (mirrors training setup)
140
+ # ---------------------------------------------------------------------------
141
+
142
+ def load_tokenizer(path: str):
143
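+     p = Path(path)
144
+     # A tokenizer.json file loads directly, as in training; a Hub repo id or model directory goes through from_pretrained
145
+     tok = (PreTrainedTokenizerFast(tokenizer_file=str(p.resolve())) if p.is_file()
146
+            else PreTrainedTokenizerFast.from_pretrained(path))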
147
+     specials = {}
148
+     if tok.bos_token is None: specials["bos_token"] = AddedToken("<|bos|>", special=True)
149
+     if tok.eos_token is None: specials["eos_token"] = AddedToken("<|eos|>", special=True)
150
+     if tok.unk_token is None: specials["unk_token"] = AddedToken("<|unk|>", special=True)
151
+     if tok.pad_token is None:
152
+         if tok.eos_token is not None:
153
+             tok.pad_token = tok.eos_token
154
+         else:
155
+             specials["pad_token"] = AddedToken("<|pad|>", special=True)
156
+     if specials:
157
+         tok.add_special_tokens(specials)
158
+     tok.padding_side = "left" # left-pad for batched generation
159
+     return tok
160
+
161
+ print("Loading tokenizer...")
162
+ tokenizer = load_tokenizer(TOKENIZER_PATH)
163
+ print(f" Vocab size : {tokenizer.vocab_size}")
164
+ print(f" BOS : {tokenizer.bos_token!r}")
165
+ print(f" EOS : {tokenizer.eos_token!r}")
166
+ print(f" PAD : {tokenizer.pad_token!r} (id={tokenizer.pad_token_id})")
167
+
168
+ # ---------------------------------------------------------------------------
169
+ # Model
170
+ # ---------------------------------------------------------------------------
171
+
172
+ print(f"\nLoading model from {MODEL_DIR} ...")
173
+ model = AutoModelForCausalLM.from_pretrained(
174
+     MODEL_DIR,
175
+     dtype=torch.float16 if device == "cuda" else torch.float32,
176
+     low_cpu_mem_usage=True,
177
+ )
178
+ model.eval()
179
+ model.to(device)
180
+
181
+ total_params = sum(p.numel() for p in model.parameters())
182
+ print(f" Parameters : {total_params:,}")
183
+
184
+ # ---------------------------------------------------------------------------
185
+ # Generation helper
186
+ # ---------------------------------------------------------------------------
187
+
188
+ def generate(
189
+     prompt: str = PROMPT,
190
+     max_new_tokens: int = MAX_NEW_TOKENS,
191
+     temperature: float = TEMPERATURE,
192
+     top_p: float = TOP_P,
193
+     top_k: int = TOP_K,
194
+     repetition_penalty: float = REPETITION_PENALTY,
195
+     do_sample: bool = DO_SAMPLE,
196
+ ) -> str:
197
+
198
+     bos = tokenizer.bos_token or ""
199
+     full_prompt = bos + prompt
200
+
201
+     inputs = tokenizer(
202
+         full_prompt,
203
+         return_tensors="pt",
204
+         add_special_tokens=False,
205
+     ).to(device)
206
+     inputs.pop("token_type_ids", None) # Qwen3 doesn't use this
207
+
208
+     gen_kwargs = dict(
209
+         max_new_tokens=max_new_tokens,
210
+         do_sample=do_sample,
211
+         repetition_penalty=repetition_penalty,
212
+         eos_token_id=tokenizer.eos_token_id,
213
+         pad_token_id=tokenizer.pad_token_id,
214
+     )
215
+     if do_sample:
216
+         gen_kwargs["temperature"] = temperature
217
+         gen_kwargs["top_p"] = top_p
218
+         gen_kwargs["top_k"] = top_k
219
+
220
+     with torch.inference_mode():
221
+         output_ids = model.generate(**inputs, **gen_kwargs)
222
+
223
+     # Strip the prompt tokens so we only return what was generated
224
+     prompt_len = inputs["input_ids"].shape[-1]
225
+     new_ids = output_ids[0][prompt_len:]
226
+     return tokenizer.decode(new_ids, skip_special_tokens=True)
227
+
228
+
229
+ # ---------------------------------------------------------------------------
230
+ # Run
231
+ # ---------------------------------------------------------------------------
232
+
233
+ if __name__ == "__main__":
234
+     print(f"\nPrompt : {PROMPT!r}")
235
+     print("-" * 60)
236
+
237
+     output = generate(PROMPT)
238
+
239
+     print("Generated:")
240
+     print(output)
241
+ ```