	---
	language:
	- en
	license: apache-2.0
	base_model: paperbd/smollm_135M_arxiv_cpt
	tags:
	- sft
	- instruction-tuning
	- lora
	- unsloth
	- scientific
	- arxiv
	- nlp
	- paper-researcher
	datasets:
	- paperbd/paper_instructions_300K-v1
	---

	# SmolLM-135M-SFT-exp01

	Supervised fine-tuning of [SmolLM-135M-CPT-LoRA-r32](https://huggingface.co/JaydeepR/SmolLM-135M-CPT-LoRA-r32) on 300K synthetic ML paper instruction pairs. The result is a structured research assistant API for ML papers — not a general chatbot.

	This is exp01 in a series of SFT experiments on top of the CPT-adapted SmolLM-135M.

	---

	## Full Pipeline

	```
	arXiv ML papers (188)
	│
	▼
	text-albumentations
	(chunking + constrained synthetic generation)
	│
	▼
	paperbd/paper_instructions_300K-v1
	(300K instruction-response pairs)
	│
	▼
	SFT training (LoRA r=32, ChatML, train_on_responses_only)
	│
	▼
	SmolLM-135M-SFT-exp01
	│
	▼
	PaperResearcher API (10 structured tasks)
	```

	---

	## Model Description

	- Base model: `paperbd/smollm_135M_arxiv_cpt` — SmolLM-135M after continued pre-training on arXiv ML papers
	- Method: Supervised Fine-Tuning (SFT) with LoRA + `train_on_responses_only`
	- Domain: ML/arXiv paper research tasks
	- Design: Restricted API — 10 fixed task types, not a general chatbot

	---

	## Data Generation Pipeline

	The training dataset was built from raw arXiv ML papers using a synthetic data generation pipeline:

	### 1. Chunking
	Raw paper text is split into overlapping 500-word chunks (100-word overlap) to create manageable context windows for generation.
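
A minimal sketch of this chunking step, assuming whitespace-delimited words. The function name `chunk_words` is hypothetical and not taken from the pipeline's code.

```python
def chunk_words(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows of whitespace-delimited words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # with the defaults, a new chunk starts every 400 words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, consecutive chunks share their boundary 100 words, so a sentence cut at one chunk edge still appears whole in the neighbouring chunk.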

	### 2. Augmentation with `text-albumentations`
	Each chunk is passed through stochastic augmentation tasks. Each task runs with 25% probability per chunk, ensuring dataset diversity:

	| Task | Description | Output type |
	|---|---|---|
	| `bullet_augmentation` | Extract key points as markdown bullets | `list[str]` |
	| `qa_pair_augmentation` | Generate question-answer pairs | `list[QAPair]` |
	| `rephrase_augmentation` | Elaborate and restate the passage | `str` |
	| `continuation_augmentation` | Continue from a passage prefix | `str` |
	| `triplet_augmentation` | Extract knowledge graph triplets | `list[Triplet]` |
	| `retrieval_augmentation` | Cross-chunk: which passage answers a question | `RetrievalResult` |
	| `comparison_augmentation` | Cross-chunk: compare two passages | `str` |
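
The per-chunk sampling can be sketched as below, assuming each task is drawn independently with probability 0.25. The `sample_tasks` helper is hypothetical and does not reflect the `text-albumentations` API; it also ignores that the two cross-chunk tasks need a second chunk.

```python
import random

TASKS = [
    "bullet_augmentation",
    "qa_pair_augmentation",
    "rephrase_augmentation",
    "continuation_augmentation",
    "triplet_augmentation",
    "retrieval_augmentation",
    "comparison_augmentation",
]

def sample_tasks(rng: random.Random, p: float = 0.25) -> list[str]:
    """Keep each task independently with probability p (assumed independence)."""
    return [task for task in TASKS if rng.random() < p]

rng = random.Random(0)  # fixed seed so repeated runs select the same tasks
selected = sample_tasks(rng)
```

Under this independence assumption a chunk receives 7 × 0.25 = 1.75 tasks on average, and some chunks receive none.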

	### 3. Constrained Decoding via Outlines
	All generation during data prep uses [Outlines](https://github.com/dottxt-ai/outlines) for structured output — a constrained decoding library that guarantees the generator returns outputs matching a predefined schema (Pydantic model or regex). This ensures:
	- QA pairs always have valid `question` / `answer` fields
	- Triplets always follow `(subject, relation, object)` format
	- Retrieval results always return a valid passage index
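
To illustrate what a regex constraint buys, here is a sketch of a pattern for a single triplet together with a strict parser. The pattern and the `parse_triplet` helper are hypothetical; the schemas actually used in the pipeline are not shown in this card, and the sketch deliberately avoids calling Outlines itself.

```python
import re

# Hypothetical pattern: "(subject, relation, object)" with no commas or
# parentheses inside a field. The pipeline's real schema may differ.
TRIPLET_PATTERN = re.compile(r"\(([^,()]+), ([^,()]+), ([^,()]+)\)")

def parse_triplet(text: str) -> tuple[str, str, str]:
    """Return (subject, relation, object) or raise if the text does not match."""
    match = TRIPLET_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid triplet: {text!r}")
    subject, relation, obj = match.groups()
    return subject, relation, obj
```

Constrained decoding enforces a pattern like this during generation, so a parser of this kind should never hit the error branch on generated data; without the constraint, malformed outputs would have to be filtered after the fact.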

	The default runtime is `mlx-community/Qwen3.5-4B-OptiQ-4bit` via MLX (Apple Silicon). Async and batch variants are available for large-scale generation.

	### 4. Dataset
	The final dataset, `paperbd/paper_instructions_300K-v1`, contains 300K instruction-response pairs across all task types and is published on Hugging Face for reuse.

	---

	## Training Details

	| Parameter | Value |
	|---|---|
	| LoRA rank | 32 |
	| LoRA alpha | 32 |
	| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
	| Trainable params | ~9.7M / 144M (6.77%) |
	| Quantization | 4-bit (QLoRA via Unsloth) |
	| Batch size | 32 |
	| Gradient accumulation | 4 (effective batch: 128) |
	| Learning rate | 2e-4 (linear decay) |
	| Warmup ratio | 0.03 |
	| Epochs | 3 |
	| Total steps | 11,355 |
	| Sequence length | 2048 (packed) |
	| Chat template | ChatML |
	| Response-only training | Yes (loss on assistant turns only) |
	| Data variations | 2 (conversation extension) → ~600K effective examples |
	| Hardware | NVIDIA RTX 4090 |
	| Training time | ~10 hours |
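
Response-only training means that system and user tokens contribute nothing to the loss. The sketch below shows the label masking idea on already-tokenized turns; the `build_labels` helper and the toy token ids are hypothetical, and this is not Unsloth's implementation. The value -100 is the index that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default

def build_labels(turns: list[tuple[str, list[int]]]) -> tuple[list[int], list[int]]:
    """Concatenate turns and mask every token that is not part of an assistant turn."""
    input_ids: list[int] = []
    labels: list[int] = []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)  # supervised: the model learns these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # context only
    return input_ids, labels

# Toy token ids, for illustration only.
ids, labels = build_labels([
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
])
```

The model still attends to the full sequence in `ids`; only the positions kept in `labels` produce gradient.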

	---

	## Evaluation

	### Method
	1,000 samples are drawn from the `paper_instructions_300K-v1` test split. The fine-tuned model generates responses, which are then scored by `grok-3-mini` as an LLM judge.

	### Judge Prompt (4 dimensions, 1–5 scale)
	- Faithfulness — Does the response contain only factually correct claims? Penalise hallucinations.
	- Answer Correctness — How closely does the response match the ground truth semantically?
	- Relevance — Does the response directly address what was asked, without padding or going off-topic?
	- Completeness — Does the response cover the key points from the ground truth without omitting important details?

	### Results

	| Metric | Score (1–5) |
	|---|---|
	| Faithfulness | 2.70 |
	| Answer Correctness | 1.98 |
	| Relevance | 3.04 |
	| Completeness | 1.85 |
	| Overall | 2.39 |
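
The overall score is consistent with an unweighted mean of the four dimensions. This is an inference from the reported numbers rather than a documented aggregation rule; a quick check:

```python
scores = {
    "Faithfulness": 2.70,
    "Answer Correctness": 1.98,
    "Relevance": 3.04,
    "Completeness": 1.85,
}

# Assumption: "Overall" is the plain average of the four per-dimension means.
overall = sum(scores.values()) / len(scores)
print(f"{overall:.2f}")  # prints 2.39
```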

	Interpretation: Relevance is the strongest dimension — the model stays on topic. Answer correctness and completeness are limited by the 135M parameter count; the model understands task structure but struggles to recall and reproduce factual content precisely.

	---

	## PaperResearcher API

	The model is designed to be used as a structured API, not a free-form chatbot. The `PaperResearcher` class exposes 10 typed methods, each using the exact instruction strings the model was trained on:

	```python
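	# Assumption: the result types are exported by the same module as PaperResearcher.
	from paper_researcher import PaperResearcher, QAPair, RetrievalResult, Triplet

	researcher = PaperResearcher("JaydeepR/SmolLM-135M-SFT-exp01")
	passage = "Attention mechanisms compute weighted sums of values..."
	# Placeholder passages for the two cross-passage methods further down.
	passage_a = passage
	passage_b = "Convolutional layers apply learned filters over local regions..."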

	# Extract key points
	bullets: list[str] = researcher.extract_bullets(passage)

	# Generate Q&A pairs
	pairs: list[QAPair] = researcher.generate_qa_pairs(passage)
	# → [QAPair(question="What does attention compute?", answer="Weighted sums of values")]

	# Extract knowledge graph triplets
	triplets: list[Triplet] = researcher.extract_triplets(passage)
	# → [Triplet(subject="attention", relation="computes", object="weighted sums")]

	# Answer a question given a passage
	answer: str = researcher.answer("What does attention compute?", passage)

	# Rephrase and elaborate
	rephrased: str = researcher.rephrase(passage)

	# Continue a passage from its beginning
	continuation: str = researcher.continue_from(passage[:200])

	# Extract a single key fact
	fact: str = researcher.extract_fact(passage)

	# Generate a question from a passage
	question: str = researcher.generate_question(passage)

	# Compare two passages
	comparison: str = researcher.compare(passage_a, passage_b)

	# Retrieval: which passage answers the question?
	result: RetrievalResult = researcher.find_relevant(question, [passage_a, passage_b])
	# → RetrievalResult(index=0, reasoning="Passage 1 directly defines...")
	```

	### Return Types

	| Method | Return Type | Description |
	|---|---|---|
	| `extract_bullets` | `list[str]` | Parsed bullet points |
	| `generate_qa_pairs` | `list[QAPair]` | `.question` and `.answer` fields |
	| `extract_triplets` | `list[Triplet]` | `.subject`, `.relation`, `.object` fields |
	| `find_relevant` | `RetrievalResult` | `.index` (0-based), `.reasoning` |
	| All others | `str` | Raw text response |
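
For orientation, the structured return types could look like the dataclasses below. This is a sketch built only from the field names in the table; the real classes may be Pydantic models or carry extra fields, and the `parse_bullets` helper is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str

@dataclass
class Triplet:
    subject: str
    relation: str
    object: str

@dataclass
class RetrievalResult:
    index: int  # 0-based position of the chosen passage
    reasoning: str

def parse_bullets(markdown: str) -> list[str]:
    """Collect lines that start with '- ' or '* ' and strip the marker."""
    bullets = []
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith(("- ", "* ")):
            bullets.append(stripped[2:].strip())
    return bullets
```

Lines that are not bullets are dropped silently in this sketch, which matters for a small model that may wrap its list in extra prose.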

	---

	## Raw Inference

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	from peft import PeftModel

	adapter_id = "JaydeepR/SmolLM-135M-SFT-exp01"
	base_model_id = "paperbd/smollm_135M_arxiv_cpt"

	tokenizer = AutoTokenizer.from_pretrained(adapter_id)
	model = AutoModelForCausalLM.from_pretrained(base_model_id)
	model = PeftModel.from_pretrained(model, adapter_id)

	messages = [
	    {"role": "system", "content": "You are an expert in AI and ML research. Your answers are concise and helpful."},
	    {"role": "user", "content": "Extract the important points from this passage as markdown bullet points.\n\nAttention mechanisms..."},
	]
	prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = tokenizer(prompt, return_tensors="pt")
	outputs = model.generate(**inputs, max_new_tokens=256, repetition_penalty=1.1, no_repeat_ngram_size=4)
	print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
	```
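
For reference, the generic ChatML layout that a chat template of this kind produces can be sketched as follows. The `render_chatml` helper is hypothetical; the tokenizer's own template is authoritative and may differ in whitespace or inject a default system message.

```python
def render_chatml(messages: list[dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in the generic ChatML layout."""
    parts = [
        f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        for message in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so that generation continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

example = render_chatml([
    {"role": "system", "content": "You are an expert in AI and ML research."},
    {"role": "user", "content": "Summarize the passage."},
])
```

Comparing this string with the output of `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to confirm the prompt format before debugging poor generations.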

	---

	## Limitations

	- 135M parameter model — limited factual recall and reasoning capacity
	- Trained on synthetic data — instruction format matters; use the exact prompts from `tasks.py`
	- Relevance strongest (3.04/5); correctness and completeness weak (< 2/5)
	- Best suited for structured extraction (bullets, triplets, QA) over open-ended generation
	- No comparison against uninstructed base model yet — exp02 planned

	---

	## Related Models

	\| Model \| Description \|
	\|---\|---\|
	\| [JaydeepR/SmolLM-135M-CPT-LoRA-r32](https://huggingface.co/JaydeepR/SmolLM-135M-CPT-LoRA-r32) \| CPT base (this model's starting point) \|
	\| [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M) \| Original base model \|

	---

	## Citation

	```
	@misc{smollm135m-sft-exp01,
	author = {Jaydeep Raijada},
	title = {SmolLM-135M SFT exp01 — Instruction Tuning on ML Paper Research Tasks},
	year = {2026},
	url = {https://huggingface.co/JaydeepR/SmolLM-135M-SFT-exp01}
	}
	```