JaydeepR committed · Commit 00a102d · verified · 1 Parent(s): 4b11e6c

Upload README.md with huggingface_hub

  1. README.md +151 -41
README.md CHANGED
@@ -18,16 +18,78 @@ datasets:
18
 
19
  # SmolLM-135M-SFT-exp01
20
 
21
- Supervised fine-tuning of [SmolLM-135M-CPT-LoRA-r32](https://huggingface.co/JaydeepR/SmolLM-135M-CPT-LoRA-r32) on 300K synthetic ML paper instruction pairs.
22
 
23
  This is **exp01** in a series of SFT experiments on top of the CPT-adapted SmolLM-135M.
24
 
25
  ## Model Description
26
 
27
- - **Base model:** `paperbd/smollm_135M_arxiv_cpt` (CPT-adapted SmolLM-135M, merged)
28
  - **Method:** Supervised Fine-Tuning (SFT) with LoRA + `train_on_responses_only`
29
  - **Domain:** ML/arXiv paper research tasks
30
- - **Task:** Instruction following: bullets, QA, triplets, retrieval, comparison, etc.
31
 
32
  ## Training Details
33
 
@@ -46,46 +108,95 @@ This is **exp01** in a series of SFT experiments on top of the CPT-adapted SmolL
46
  | Total steps | 11,355 |
47
  | Sequence length | 2048 (packed) |
48
  | Chat template | ChatML |
49
  | Hardware | NVIDIA RTX 4090 |
50
  | Training time | ~10 hours |
51
 
52
- ## Training Data
53
 
54
- - **Dataset:** `paperbd/paper_instructions_300K-v1`: 300K synthetic instruction-response pairs generated from arXiv ML papers
55
- - **Variations:** 2 (conversation extension) → ~600K effective training examples
56
- - **Train/val split:** 98% / 2%
57
- - **Response-only training:** Loss computed only on assistant turns, not user prompts
58
 
59
- ## Evaluation Results
60
 
61
- Evaluated on 1000 samples from the `paper_instructions_300K-v1` test split, judged by `grok-3-mini`:
62
 
63
- | Metric | Score (1-5) |
64
  |---|---|
65
  | Faithfulness | 2.70 |
66
  | Answer Correctness | 1.98 |
67
- | Relevance | 3.04 |
68
  | Completeness | 1.85 |
69
  | **Overall** | **2.39** |
70
 
71
- ## How to Use
72
 
73
- ### As PaperResearcher API
74
 
75
  ```python
76
  from paper_researcher import PaperResearcher
77
 
78
  researcher = PaperResearcher("JaydeepR/SmolLM-135M-SFT-exp01")
79
-
80
  passage = "Attention mechanisms compute weighted sums of values..."
81
 
82
- bullets = researcher.extract_bullets(passage)
83
- qa_pairs = researcher.generate_qa_pairs(passage)
84
- triplets = researcher.extract_triplets(passage)
85
- answer = researcher.answer("What does attention compute?", passage)
86
  ```
87
 
88
- ### Raw inference
89
 
90
  ```python
91
  from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -99,36 +210,35 @@ model = AutoModelForCausalLM.from_pretrained(base_model_id)
99
  model = PeftModel.from_pretrained(model, adapter_id)
100
 
101
  messages = [
102
- {"role": "system", "content": "You are an expert in AI and ML research."},
103
- {"role": "user", "content": "Extract the key points from this passage as bullet points.\n\nAttention mechanisms..."},
104
  ]
105
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
106
  inputs = tokenizer(prompt, return_tensors="pt")
107
- outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=1.1)
108
  print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
109
  ```
110
 
111
- ## Supported Tasks
112
-
113
- | Task | Method |
114
- |---|---|
115
- | Extract bullet points | `researcher.extract_bullets(passage)` |
116
- | Generate Q&A pairs | `researcher.generate_qa_pairs(passage)` |
117
- | Generate a question | `researcher.generate_question(passage)` |
118
- | Extract a fact | `researcher.extract_fact(passage)` |
119
- | Answer a question | `researcher.answer(question, passage)` |
120
- | Rephrase passage | `researcher.rephrase(passage)` |
121
- | Continue passage | `researcher.continue_from(passage_start)` |
122
- | Extract knowledge graph | `researcher.extract_triplets(passage)` |
123
- | Compare two passages | `researcher.compare(passage_a, passage_b)` |
124
- | Retrieval | `researcher.find_relevant(question, passages)` |
125
 
126
  ## Limitations
127
 
128
- - 135M parameter model: limited factual recall and reasoning
129
- - Trained on synthetic data: may not generalise to all instruction styles
130
- - Relevance is the strongest dimension (3.04/5); correctness and completeness are weak (< 2/5)
131
- - Best used for structured extraction tasks, not open-ended QA
132
 
133
  ## Citation
134
 
 
18
 
19
  # SmolLM-135M-SFT-exp01
20
 
21
+ Supervised fine-tuning of [SmolLM-135M-CPT-LoRA-r32](https://huggingface.co/JaydeepR/SmolLM-135M-CPT-LoRA-r32) on 300K synthetic ML paper instruction pairs. The result is a structured research assistant API for ML papers, not a general chatbot.
22
 
23
  This is **exp01** in a series of SFT experiments on top of the CPT-adapted SmolLM-135M.
24
 
25
+ ---
26
+
27
+ ## Full Pipeline
28
+
29
+ ```
30
+ arXiv ML papers (188)
31
+ │
32
+ ▼
33
+ text-albumentations
34
+ (chunking + constrained synthetic generation)
35
+ │
36
+ ▼
37
+ paperbd/paper_instructions_300K-v1
38
+ (300K instruction-response pairs)
39
+ │
40
+ ▼
41
+ SFT training (LoRA r=32, ChatML, train_on_responses_only)
42
+ │
43
+ ▼
44
+ SmolLM-135M-SFT-exp01
45
+ │
46
+ ▼
47
+ PaperResearcher API (10 structured tasks)
48
+ ```
49
+
50
+ ---
51
+
52
  ## Model Description
53
 
54
+ - **Base model:** `paperbd/smollm_135M_arxiv_cpt` (SmolLM-135M after continued pre-training on arXiv ML papers)
55
  - **Method:** Supervised Fine-Tuning (SFT) with LoRA + `train_on_responses_only`
56
  - **Domain:** ML/arXiv paper research tasks
57
+ - **Design:** Restricted API with 10 fixed task types, not a general chatbot
58
+
59
+ ---
60
+
61
+ ## Data Generation Pipeline
62
+
63
+ The training dataset was built from raw arXiv ML papers using a synthetic data generation pipeline:
64
+
65
+ ### 1. Chunking
66
+ Raw paper text is split into overlapping 500-word chunks (100-word overlap) to create manageable context windows for generation.
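
The chunking step can be pictured with a minimal sketch. This is an illustration only, not the pipeline's actual code: the function name `chunk_words` and the whitespace-based word splitting are assumptions, and only the 500-word window and 100-word overlap come from the description above.

```python
def chunk_words(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()  # assumption: words are whitespace-separated tokens
    step = size - overlap  # 400 new words per chunk with the defaults
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With these defaults a 1,200-word paper yields three chunks covering words 0-499, 400-899 and 800-1199, so each chunk repeats the last 100 words of the previous one.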
67
+
68
+ ### 2. Augmentation with `text-albumentations`
69
+ Each chunk is passed through a set of stochastic augmentation tasks. Each task runs with 25% probability per chunk, so the mix of tasks varies from chunk to chunk:
70
+
71
+ | Task | Description | Output type |
72
+ |---|---|---|
73
+ | `bullet_augmentation` | Extract key points as markdown bullets | `list[str]` |
74
+ | `qa_pair_augmentation` | Generate question-answer pairs | `list[QAPair]` |
75
+ | `rephrase_augmentation` | Elaborate and restate the passage | `str` |
76
+ | `continuation_augmentation` | Continue from a passage prefix | `str` |
77
+ | `triplet_augmentation` | Extract knowledge graph triplets | `list[Triplet]` |
78
+ | `retrieval_augmentation` | Cross-chunk: which passage answers a question | `RetrievalResult` |
79
+ | `comparison_augmentation` | Cross-chunk: compare two passages | `str` |
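
A rough sketch of the per-chunk task sampling, assuming each of the seven tasks in the table is drawn independently with probability 0.25. The helper `sample_tasks` is hypothetical; how `text-albumentations` actually schedules tasks (in particular the cross-chunk ones) is not stated here.

```python
import random

# Task names taken from the table above.
TASKS = [
    "bullet_augmentation",
    "qa_pair_augmentation",
    "rephrase_augmentation",
    "continuation_augmentation",
    "triplet_augmentation",
    "retrieval_augmentation",
    "comparison_augmentation",
]

def sample_tasks(rng: random.Random, p: float = 0.25) -> list[str]:
    """Return the tasks selected for one chunk; each task is an independent coin flip."""
    return [task for task in TASKS if rng.random() < p]

rng = random.Random(0)  # seeded so a run is reproducible
selected = sample_tasks(rng)
```

Under this independence assumption a chunk receives 7 × 0.25 = 1.75 tasks on average, and about 13% of chunks (0.75^7) receive none.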
80
+
81
+ ### 3. Constrained Decoding via Outlines
82
+ All generation during data prep uses **[Outlines](https://github.com/dottxt-ai/outlines)** for structured output. Outlines is a constrained decoding library that guarantees the generator returns outputs matching a predefined schema (Pydantic model or regex). This ensures:
83
+ - QA pairs always have valid `question` / `answer` fields
84
+ - Triplets always follow `(subject, relation, object)` format
85
+ - Retrieval results always return a valid passage index
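
To make these guarantees concrete, the sketch below shows the kind of schema check that constrained decoding makes unnecessary after the fact. It does not call Outlines; standard-library dataclasses stand in for the Pydantic models, and `parse_records` and `check_index` are hypothetical helpers written for illustration.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class Triplet:
    subject: str
    relation: str
    object: str

def parse_records(raw: str, cls):
    """Parse a JSON list and require every item to have exactly the fields of `cls`."""
    expected = {f.name for f in fields(cls)}
    records = []
    for item in json.loads(raw):
        if not isinstance(item, dict) or set(item) != expected:
            raise ValueError(f"expected keys {sorted(expected)}, got {item!r}")
        records.append(cls(**item))
    return records

def check_index(index: int, num_passages: int) -> int:
    """Require a 0-based passage index that points at an existing passage."""
    if not 0 <= index < num_passages:
        raise ValueError(f"index {index} out of range for {num_passages} passages")
    return index
```

With unconstrained generation a check like this would reject some fraction of outputs; with schema-constrained decoding every output passes by construction.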
86
+
87
+ Default runtime: `mlx-community/Qwen3.5-4B-OptiQ-4bit` via MLX (Apple Silicon). Async and batch variants available for large-scale generation.
88
+
89
+ ### 4. Dataset
90
+ The final dataset `paperbd/paper_instructions_300K-v1` contains **300K instruction-response pairs** across all task types, uploaded to HuggingFace for reuse.
91
+
92
+ ---
93
 
94
  ## Training Details
95
 
 
108
  | Total steps | 11,355 |
109
  | Sequence length | 2048 (packed) |
110
  | Chat template | ChatML |
111
+ | Response-only training | Yes (loss on assistant turns only) |
112
+ | Data variations | 2 (conversation extension) → ~600K effective examples |
113
  | Hardware | NVIDIA RTX 4090 |
114
  | Training time | ~10 hours |
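
Response-only training means prompt tokens are excluded from the loss. The sketch below illustrates the idea with plain lists; it is not the `train_on_responses_only` implementation, and the per-token `roles` list is an assumption made for readability. The value -100 is the label that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def mask_non_assistant(input_ids: list[int], roles: list[str]) -> list[int]:
    """Build labels that keep assistant tokens and mask system/user tokens."""
    if len(input_ids) != len(roles):
        raise ValueError("input_ids and roles must have the same length")
    return [
        token if role == "assistant" else IGNORE_INDEX
        for token, role in zip(input_ids, roles)
    ]

# Toy example: two prompt tokens followed by two response tokens.
labels = mask_non_assistant(
    [11, 12, 13, 14], ["system", "user", "assistant", "assistant"]
)
```

Here `labels` is `[-100, -100, 13, 14]`, so only the two assistant tokens contribute to the loss.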
115
 
116
+ ---
117
+
118
+ ## Evaluation
119
 
120
+ ### Method
121
+ 1000 samples drawn from the `paper_instructions_300K-v1` test split. The fine-tuned model generates responses, which are then scored by `grok-3-mini` as an LLM judge.
122
 
123
+ ### Judge Prompt (4 dimensions, 1–5 scale)
124
+ - **Faithfulness:** Does the response contain only factually correct claims? Penalise hallucinations.
125
+ - **Answer Correctness:** How closely does the response match the ground truth semantically?
126
+ - **Relevance:** Does the response directly address what was asked, without padding or going off-topic?
127
+ - **Completeness:** Does the response cover the key points from the ground truth without omitting important details?
128
 
129
+ ### Results
130
 
131
+ | Metric | Score (1–5) |
132
  |---|---|
133
  | Faithfulness | 2.70 |
134
  | Answer Correctness | 1.98 |
135
+ | Relevance | **3.04** |
136
  | Completeness | 1.85 |
137
  | **Overall** | **2.39** |
138
 
139
+ **Interpretation:** Relevance is the strongest dimension: the model stays on topic. Answer correctness and completeness are limited by the 135M parameter count; the model understands task structure but struggles to recall and reproduce factual content precisely.
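
The card does not say how the overall score is aggregated. As a quick check, the reported 2.39 is consistent with an unweighted mean of the four dimension scores; treating it as a plain mean is an assumption.

```python
# Dimension scores copied from the results table.
scores = {
    "Faithfulness": 2.70,
    "Answer Correctness": 1.98,
    "Relevance": 3.04,
    "Completeness": 1.85,
}

overall = sum(scores.values()) / len(scores)  # unweighted mean (assumed aggregation)
print(f"{overall:.2f}")  # prints 2.39
```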
140
+
141
+ ---
142
+
143
+ ## PaperResearcher API
144
 
145
+ The model is designed to be used as a structured API, not a free-form chatbot. The `PaperResearcher` class exposes 10 typed methods, each using the exact instruction strings the model was trained on:
146
 
147
  ```python
148
  from paper_researcher import PaperResearcher
149
 
150
  researcher = PaperResearcher("JaydeepR/SmolLM-135M-SFT-exp01")
 
151
  passage = "Attention mechanisms compute weighted sums of values..."
152
 
153
+ # Extract key points
154
+ bullets: list[str] = researcher.extract_bullets(passage)
155
+
156
+ # Generate Q&A pairs
157
+ pairs: list[QAPair] = researcher.generate_qa_pairs(passage)
158
+ # → [QAPair(question="What does attention compute?", answer="Weighted sums of values")]
159
+
160
+ # Extract knowledge graph triplets
161
+ triplets: list[Triplet] = researcher.extract_triplets(passage)
162
+ # → [Triplet(subject="attention", relation="computes", object="weighted sums")]
163
+
164
+ # Answer a question given a passage
165
+ answer: str = researcher.answer("What does attention compute?", passage)
166
+
167
+ # Rephrase and elaborate
168
+ rephrased: str = researcher.rephrase(passage)
169
+
170
+ # Continue a passage from its beginning
171
+ continuation: str = researcher.continue_from(passage[:200])
172
+
173
+ # Extract a single key fact
174
+ fact: str = researcher.extract_fact(passage)
175
+
176
+ # Generate a question from a passage
177
+ question: str = researcher.generate_question(passage)
178
+
179
+ # Compare two passages
180
+ comparison: str = researcher.compare(passage_a, passage_b)
181
+
182
+ # Retrieval: which passage answers the question?
183
+ result: RetrievalResult = researcher.find_relevant(question, [passage_a, passage_b])
184
+ # → RetrievalResult(index=0, reasoning="Passage 1 directly defines...")
185
  ```
186
 
187
+ ### Return Types
188
+
189
+ | Method | Return Type | Description |
190
+ |---|---|---|
191
+ | `extract_bullets` | `list[str]` | Parsed bullet points |
192
+ | `generate_qa_pairs` | `list[QAPair]` | `.question` and `.answer` fields |
193
+ | `extract_triplets` | `list[Triplet]` | `.subject`, `.relation`, `.object` fields |
194
+ | `find_relevant` | `RetrievalResult` | `.index` (0-based), `.reasoning` |
195
+ | All others | `str` | Raw text response |
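
As an illustration of what "parsed bullet points" could mean for `extract_bullets`, here is a minimal parser that turns a markdown bullet list into `list[str]`. The helper `parse_bullets` is hypothetical and may differ from what `PaperResearcher` does internally.

```python
import re

# A bullet line: optional indent, a "-", "*" or "+" marker, whitespace, then the text.
_BULLET = re.compile(r"^\s*[-*+]\s+(.*\S)\s*$")

def parse_bullets(text: str) -> list[str]:
    """Return the text of each markdown bullet line, skipping non-bullet lines."""
    bullets = []
    for line in text.splitlines():
        match = _BULLET.match(line)
        if match:
            bullets.append(match.group(1))
    return bullets

raw = "- Attention computes weighted sums\n- Weights come from query-key scores\nTrailing note"
points = parse_bullets(raw)
```

Lines that are not bullets, such as the trailing note in `raw`, are dropped rather than raising an error, which is a forgiving choice for a small model whose output format can drift.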
196
+
197
+ ---
198
+
199
+ ## Raw Inference
200
 
201
  ```python
202
  from transformers import AutoModelForCausalLM, AutoTokenizer
 
210
  model = PeftModel.from_pretrained(model, adapter_id)
211
 
212
  messages = [
213
+ {"role": "system", "content": "You are an expert in AI and ML research. Your answers are concise and helpful."},
214
+ {"role": "user", "content": "Extract the important points from this passage as markdown bullet points.\n\nAttention mechanisms..."},
215
  ]
216
  prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
217
  inputs = tokenizer(prompt, return_tensors="pt")
218
+ outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=1.1, no_repeat_ngram_size=4)
219
  print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
220
  ```
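
For readers unfamiliar with ChatML, the sketch below shows roughly what `apply_chat_template` produces for this message list. It is a hand-written approximation of the common ChatML layout; the authoritative template is the one shipped with the tokenizer and may differ in details such as a default system prompt.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Render messages in the common ChatML layout (approximation, not the tokenizer's template)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # cue the model to start its reply
    return "".join(parts)

prompt_text = render_chatml([
    {"role": "system", "content": "You are an expert in AI and ML research."},
    {"role": "user", "content": "Extract the important points from this passage."},
])
```

The trailing `<|im_start|>assistant` line is what `add_generation_prompt=True` adds: generation starts inside an open assistant turn.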
221
 
222
+ ---
223
 
224
  ## Limitations
225
 
226
+ - 135M parameter model: limited factual recall and reasoning capacity
227
+ - Trained on synthetic data: instruction format matters, so use the exact prompts from `tasks.py`
228
+ - Relevance is the strongest dimension (3.04/5); answer correctness and completeness are weak (< 2/5)
229
+ - Best suited for structured extraction (bullets, triplets, QA) over open-ended generation
230
+ - No comparison against the base model without instruction tuning yet; planned for exp02
231
+
232
+ ---
233
+
234
+ ## Related Models
235
+
236
+ | Model | Description |
237
+ |---|---|
238
+ | [JaydeepR/SmolLM-135M-CPT-LoRA-r32](https://huggingface.co/JaydeepR/SmolLM-135M-CPT-LoRA-r32) | CPT base (this model's starting point) |
239
+ | [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) | Original base model |
240
+
241
+ ---
242
 
243
  ## Citation
244