nimendraai committed on
Commit
ad0ead8
·
verified ·
1 Parent(s): 0763735

update readme

Files changed (1)
  1. README.md +355 -10
README.md CHANGED
@@ -1,20 +1,365 @@
  ---
  tags:
- - gguf
- - llama.cpp
- - unsloth

  ---

- # NuExtract-tiny-Resume-Data-Extractor : GGUF

- This model was finetuned and converted to GGUF format using [Unsloth](https://github.com/unslothai/unsloth).

- **Example usage**:
- - For text only LLMs: `llama-cli -hf nimendraai/NuExtract-tiny-Resume-Data-Extractor --jinja`
- - For multimodal models: `llama-mtmd-cli -hf nimendraai/NuExtract-tiny-Resume-Data-Extractor --jinja`

- ## Available Model files:
- - `NuExtract-tiny-v1.5.Q4_K_M.gguf`
  This was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)
  [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
 
  ---
+ license: mit
+ base_model: numind/NuExtract-tiny-v1.5
  tags:
+ - nlp
+ - json
+ - information-extraction
+ - resume-parsing
+ - structured-extraction
+ - qwen2
+ - unsloth
+ - lora
+ - gguf
+ - ollama
+ - langchain
+ language:
+ - en
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ # NuExtract-tiny-Resume-Data-Extractor
+
+ A fine-tuned version of [numind/NuExtract-tiny-v1.5](https://huggingface.co/numind/NuExtract-tiny-v1.5)
+ (Qwen2.5-0.5B backbone) specialised for **resume / CV structured extraction**.
+
+ Given raw resume text, the model returns a single JSON object with name, contact
+ details, skills, work experience, education, and other details, ready to plug into
+ a hiring pipeline, ATS, or LangChain workflow.
+
+ ---
+
+ ## Model Details
+
+ | Property | Value |
+ |---|---|
+ | Base model | `numind/NuExtract-tiny-v1.5` |
+ | Backbone | Qwen2.5-0.5B |
+ | Total parameters | 511,388,160 |
+ | Trainable (LoRA) | 17,596,416 (3.44%) |
+ | LoRA rank / alpha | r=32 / alpha=64 |
+ | Quantisation | Q4_K_M GGUF (Ollama-ready) |
+ | Vocabulary size | 151,665 (unchanged from base) |
+ | License | MIT |
+
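The trainable-parameter figure in the table can be reproduced with a little arithmetic. The sketch below is only a cross-check under two assumptions that this README does not state: that LoRA adapters sit on all seven linear projections of every transformer block, and that the backbone uses the published Qwen2.5-0.5B dimensions (hidden size 896, key/value width 128, MLP width 4,864, 24 layers).

```python
# Assumed Qwen2.5-0.5B dimensions (not stated in the README).
HIDDEN = 896
KV_DIM = 128        # 2 key/value heads x 64 dims per head
MLP_DIM = 4864
NUM_LAYERS = 24
RANK = 32           # LoRA rank from the table above

# (in_features, out_features) of each adapted projection in one block.
# Assumption: all seven linear layers carry an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_params(rank: int) -> int:
    """A LoRA adapter adds rank * (in + out) weights per projection."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in PROJECTIONS.values())
    return per_layer * NUM_LAYERS


trainable = lora_params(RANK)
total = 511_388_160  # total parameters reported in the table
share = 100 * trainable / total
print(trainable, f"{share:.2f}%")
```

Under these assumptions the count and the percentage both agree with the table, which suggests (but does not prove) that this was the adapter configuration used.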
+ ---
+
+ ## Training
+
+ | Property | Value |
+ |---|---|
+ | Method | QLoRA via Unsloth |
+ | Dataset | 3,000 synthetic resumes (generated) |
+ | Train / eval split | 95% / 5% (2,850 / 150) |
+ | Packed sequences | 1,125 |
+ | Epochs | 4 |
+ | Total steps | 284 |
+ | Batch size | 16 (2 per device × 8 grad accum) |
+ | Learning rate | 2e-4 (cosine schedule, 14 warmup steps) |
+ | Hardware | 1× NVIDIA Tesla T4 (Google Colab) |
+ | Training time | ~24 minutes |
+
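The step count and learning-rate schedule in the table fit together as sketched below. Two things here are assumptions rather than facts from this README: that a final partial batch counts as one optimizer step, and that the schedule is a linear warmup followed by cosine decay to zero (the table only says "cosine schedule, 14 warmup steps").

```python
import math

PACKED_SEQUENCES = 1125
EFFECTIVE_BATCH = 2 * 8      # per-device batch x gradient accumulation
EPOCHS = 4
PEAK_LR = 2e-4
WARMUP_STEPS = 14

# Assumption: a final partial batch still counts as one optimizer step.
steps_per_epoch = math.ceil(PACKED_SEQUENCES / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS


def lr_at(step: int) -> float:
    """Linear warmup, then cosine decay to zero (assumed schedule shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps)
for step in (0, WARMUP_STEPS, 149, total_steps):
    print(step, lr_at(step))
```

The 14 warmup steps are roughly 5% of the run, and the rate peaks at 2e-4 exactly when warmup ends.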
+ ### Loss Curve
+
+ | Step | Epoch | Train Loss | Val Loss |
+ |---|---|---|---|
+ | 100 | 1.4 | 0.2355 | 0.2354 |
+ | 200 | 2.8 | 0.2298 | 0.2313 |
+ | 284 | 4.0 | 0.2276 | 0.2296 |
+
+ The train/validation gap stays at or below 0.002 at every logged step, so no overfitting was observed.
+ The best checkpoint (step 284, validation loss 0.2296) was loaded automatically.
+
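A quick way to read the loss table is to compute the validation-minus-train gap and the epoch position for each logged step. The sketch below copies the loss values from the table and assumes an even 71 optimizer steps per epoch (284 total steps over 4 epochs).

```python
# (step, train_loss, val_loss) rows copied from the loss table.
LOSS_LOG = [
    (100, 0.2355, 0.2354),
    (200, 0.2298, 0.2313),
    (284, 0.2276, 0.2296),
]
STEPS_PER_EPOCH = 284 / 4  # total steps / epochs, assumed evenly spread

for step, train, val in LOSS_LOG:
    gap = val - train
    print(f"step {step}: epoch {step / STEPS_PER_EPOCH:.1f}, gap {gap:+.4f}")

# Largest absolute train/validation gap across the logged steps.
max_gap = max(abs(val - train) for _, train, val in LOSS_LOG)
```

A gap that stays this small relative to the loss itself is the usual informal sign that the adapter is not memorising the training split.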
+ ---
+
+ ## Output Schema
+
+ ```json
+ {
+   "name": "string or null",
+   "email": "string or null",
+   "phone": "string or null",
+   "website": "string or null",
+   "skills": ["string"],
+   "experience": [{"title": "string", "company": "string", "duration": "string"}],
+   "education": [{"degree": "string", "institution": "string", "year": "string"}],
+   "other_details": ["string"]
+ }
+ ```
+
+ - Missing scalar fields → `null`
+ - Missing list fields → `[]`
+ - `skills` contains technical skills only; soft skills are excluded
+ - `other_details` captures certifications, languages, awards, and publications
+
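The null and empty-list rules above can also be enforced defensively after parsing, so downstream code never has to check for absent keys. This is a minimal sketch; `normalise_record` is a hypothetical helper, and treating empty strings as missing is an assumption, not documented model behaviour.

```python
from typing import Any

SCALAR_FIELDS = ("name", "email", "phone", "website")
LIST_FIELDS = ("skills", "experience", "education", "other_details")


def normalise_record(record: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `record` that always contains every schema key.

    Missing or blank scalar fields become None; missing list fields become [].
    """
    clean: dict[str, Any] = {}
    for key in SCALAR_FIELDS:
        value = record.get(key)
        # Assumption: empty or whitespace-only strings count as missing.
        if isinstance(value, str) and value.strip():
            clean[key] = value.strip()
        else:
            clean[key] = None
    for key in LIST_FIELDS:
        value = record.get(key)
        clean[key] = list(value) if isinstance(value, list) else []
    return clean


example = normalise_record({"name": "Ada Lovelace", "email": "", "skills": ["Python"]})
print(example)
```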
+ ---
+
+ ## Inference Speed (Ollama, Tesla T4)
+
+ | Metric | Value |
+ |---|---|
+ | Prompt eval | 161 tokens in ~28 ms |
+ | Generation | 154 tokens in ~2,986 ms |
+ | Total (typical resume) | ~7.5 seconds |
+ | Throughput | ~52 tokens/sec |
+
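The throughput row follows directly from the generation row. The short calculation below uses only the numbers in the table; the helper name is illustrative.

```python
GENERATED_TOKENS = 154
GENERATION_MS = 2986
PROMPT_TOKENS = 161
PROMPT_MS = 28


def tokens_per_second(tokens: int, elapsed_ms: float) -> float:
    """Convert a token count and a duration in milliseconds to tokens/sec."""
    return tokens / (elapsed_ms / 1000.0)


generation_tps = tokens_per_second(GENERATED_TOKENS, GENERATION_MS)
prompt_tps = tokens_per_second(PROMPT_TOKENS, PROMPT_MS)
print(f"generation: {generation_tps:.1f} tok/s, prompt eval: {prompt_tps:.0f} tok/s")
```

Prompt evaluation plus generation accounts for about 3 seconds of the ~7.5 second total; the table does not break down the remainder.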
+ ---
+
+ ## Usage
+
+ ### Ollama (recommended)
+
+ **Step 1: Create a Modelfile**
+
+ ```dockerfile
+ FROM hf.co/nimendraai/NuExtract-tiny-Resume-Data-Extractor:Q4_K_M
+
+ PARAMETER temperature 0
+ PARAMETER top_k 10
+ PARAMETER top_p 0.9
+ PARAMETER repeat_penalty 1.1
+ PARAMETER seed 42
+ PARAMETER num_ctx 2048
+ PARAMETER num_predict 600
+ PARAMETER stop "<|end-output|>"
+ PARAMETER stop "<|endoftext|>"
+
+ TEMPLATE """<|input|>
+ ### Template:
+ {
+     "name": "",
+     "email": "",
+     "phone": "",
+     "website": "",
+     "skills": [""],
+     "experience": [{"title": "", "company": "", "duration": ""}],
+     "education": [{"degree": "", "institution": "", "year": ""}],
+     "other_details": [""]
+ }
+ ### Text:
+ {{ .Prompt }}
+
+ <|output|>
+ """
+
+ LICENSE """MIT License"""
+ ```
+
+ **Step 2: Create the model**
+
+ ```bash
+ ollama create agenthire-extractor -f Modelfile
+ ```
+
+ **Step 3: Query**
+
+ ```bash
+ curl http://localhost:11434/api/generate \
+   -X POST \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "agenthire-extractor",
+     "format": "json",
+     "stream": false,
+     "prompt": "<resume text here>"
+   }'
+ ```
+
+ > Always apply brace-counting extraction to the `response` field of the reply (see the Python helper below).
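
The same request can be made from Python with only the standard library. This is a sketch, not part of the commit: `build_request`, `parse_reply`, and `extract_via_ollama` are hypothetical names, and instead of brace counting it uses `json.JSONDecoder.raw_decode`, which parses the first JSON value and ignores whatever follows it.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint from the curl example


def build_request(resume_text: str, model: str = "agenthire-extractor") -> urllib.request.Request:
    """Build the same POST request as the curl command."""
    payload = {"model": model, "format": "json", "stream": False, "prompt": resume_text}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_reply(body: bytes) -> dict:
    """Pull the first JSON object out of the reply's `response` field.

    Raises ValueError if the generated text contains no '{' at all.
    """
    text = json.loads(body)["response"]
    obj, _end = json.JSONDecoder().raw_decode(text, text.index("{"))
    return obj


def extract_via_ollama(resume_text: str) -> dict:
    """Send one resume to a locally running Ollama server and parse the result."""
    with urllib.request.urlopen(build_request(resume_text), timeout=60) as reply:
        return parse_reply(reply.read())
```

Keeping request building and reply parsing in separate functions makes the parsing step testable without a running server.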

  ---

+ ### Python (transformers)

+ ```python
+ import json
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ model_name = "nimendraai/NuExtract-tiny-Resume-Data-Extractor"
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name, torch_dtype=torch.bfloat16, trust_remote_code=True
+ ).eval().cuda()  # requires a CUDA-capable GPU
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

+ # Empty template describing the fields to extract.
+ TEMPLATE = json.dumps({
+     "name": "", "email": "", "phone": "", "website": "",
+     "skills": [""],
+     "experience": [{"title": "", "company": "", "duration": ""}],
+     "education": [{"degree": "", "institution": "", "year": ""}],
+     "other_details": [""],
+ }, indent=4)
+
+ def extract_first_json(text):
+     """Return the first balanced {...} span, or the input unchanged if none is found."""
+     depth, start = 0, None
+     for i, ch in enumerate(text):
+         if ch == "{":
+             if start is None:
+                 start = i
+             depth += 1
+         elif ch == "}":
+             depth -= 1
+             if depth == 0 and start is not None:
+                 return text[start:i + 1]
+     return text
+
+ def extract(resume_text: str) -> dict:
+     # NuExtract prompt layout: template, then text, then the output marker.
+     prompt = (
+         "<|input|>\n"
+         f"### Template:\n{TEMPLATE}\n"
+         f"### Text:\n{resume_text}\n\n"
+         "<|output|>"
+     )
+     # With right-side truncation (the usual default), an over-long resume would
+     # lose the trailing <|output|> marker, so shorten such inputs beforehand.
+     inputs = tokenizer(
+         prompt, return_tensors="pt", truncation=True, max_length=2048
+     ).to(model.device)
+     with torch.no_grad():
+         out = model.generate(
+             **inputs, max_new_tokens=512, do_sample=False  # greedy decoding
+         )
+     decoded = tokenizer.decode(out[0], skip_special_tokens=True)
+     # Keep only the text generated after the output marker.
+     raw = decoded.split("<|output|>")[-1].strip()
+     return json.loads(extract_first_json(raw))
+ ```
+
+ ---
+
+ ### LangChain
+
+ ```python
+ import json
+ from typing import Optional
+
+ from langchain_core.prompts import PromptTemplate
+ from langchain_ollama import OllamaLLM
+ from pydantic import BaseModel, Field
+
+ class Experience(BaseModel):
+     title: str = Field(default="")
+     company: str = Field(default="")
+     duration: str = Field(default="")
+
+ class Education(BaseModel):
+     degree: str = Field(default="")
+     institution: str = Field(default="")
+     year: str = Field(default="")
+
+ class ResumeExtraction(BaseModel):
+     name: Optional[str] = None
+     email: Optional[str] = None
+     phone: Optional[str] = None
+     website: Optional[str] = None
+     skills: list[str] = Field(default_factory=list)
+     experience: list[Experience] = Field(default_factory=list)
+     education: list[Education] = Field(default_factory=list)
+     other_details: list[str] = Field(default_factory=list)
+
+ def extract_first_json(text):
+     """Return the first balanced {...} span, or the input unchanged if none is found."""
+     depth, start = 0, None
+     for i, ch in enumerate(text):
+         if ch == "{":
+             if start is None:
+                 start = i
+             depth += 1
+         elif ch == "}":
+             depth -= 1
+             if depth == 0 and start is not None:
+                 return text[start:i + 1]
+     return text
+
+ # The Modelfile TEMPLATE wraps the prompt, so the raw resume text is passed as-is.
+ llm = OllamaLLM(model="agenthire-extractor", format="json", temperature=0)
+
+ def extract_resume(text: str) -> ResumeExtraction:
+     raw = llm.invoke(text)
+     return ResumeExtraction(**json.loads(extract_first_json(raw)))
+
+ # Batch processing (resume_1, resume_2, resume_3 are resume strings defined elsewhere)
+ resumes = [resume_1, resume_2, resume_3]
+ results = [
+     ResumeExtraction(**json.loads(extract_first_json(r)))
+     for r in llm.batch(resumes)
+ ]
+
+ # Pipeline with scoring: a second, general-purpose model rates the extracted candidate
+ scoring_prompt = PromptTemplate.from_template(
+     "Job: {job_description}\n\nCandidate: {candidate}\n\n"
+     "Score 1-10 and explain."
+ )
+ scorer = OllamaLLM(model="llama3", temperature=0.3)
+
+ def process_application(resume_text, job_description):
+     candidate = extract_resume(resume_text).model_dump()
+     evaluation = (scoring_prompt | scorer).invoke({
+         "job_description": job_description,
+         "candidate": json.dumps(candidate, indent=2),
+     })
+     return {"candidate": candidate, "evaluation": evaluation}
+ ```
+
+ ---
+
+ ## Important Notes
+
+ **Always use brace-counting extraction** on raw model output before `json.loads()`.
+ The model occasionally appends a small amount of text after the closing `}`, and parsing
+ the raw string directly then raises `JSONDecodeError: Extra data`.
+
+ ```python
+ import json
+
+ def extract_first_json(text):
+     """Return the first balanced {...} span, or the input unchanged if none is found."""
+     depth, start = 0, None
+     for i, ch in enumerate(text):
+         if ch == "{":
+             if start is None:
+                 start = i
+             depth += 1
+         elif ch == "}":
+             depth -= 1
+             if depth == 0 and start is not None:
+                 return text[start:i + 1]
+     return text
+
+ result = json.loads(extract_first_json(raw_output))  # raw_output: the model's raw text
+ ```
+
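One caveat with plain brace counting is that a `{` or `}` inside a string value (for example a skill written as `"C}"`) shifts the depth counter and can cut the object short. The variant below is a sketch of one way around that, not the author's helper: `extract_first_json_strict` is a hypothetical name, and it skips over JSON string literals while counting.

```python
import json


def extract_first_json_strict(text: str) -> str:
    """Return the first balanced {...} span, ignoring braces inside JSON strings.

    Raises ValueError when no complete object is found.
    """
    depth = 0
    start = None
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            # Inside a string literal: only look for the closing quote.
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and start is not None:
            in_string = True
        elif ch == "{":
            if start is None:
                start = i
            depth += 1
        elif ch == "}" and start is not None:
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    raise ValueError("no complete JSON object found")


raw_output = '{"name": "Ada {Countess}", "skills": ["C}"]} trailing text'
result = json.loads(extract_first_json_strict(raw_output))
print(result)
```

Raising on failure, rather than returning the input unchanged, is a design choice here: it separates "no JSON at all" from "malformed JSON" for the caller.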
+ **Do not call the raw Hugging Face model directly via Ollama** (`hf.co/nimendraai/...`)
+ without a Modelfile. The NuExtract `<|input|> / ### Template: / ### Text:` prompt
+ format must be applied, and the Modelfile `TEMPLATE` block handles this automatically.
+
+ **Skill capitalisation** is normalised via `.title()` during training, so `FastAPI`
+ may appear as `Fastapi` in output. Apply a canonical map in post-processing if needed.
+
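A canonical map of the kind mentioned above can be a plain dictionary keyed by the lower-cased skill. This is a minimal sketch; the entries in `CANONICAL_SKILLS` and the helper name are illustrative, and unknown skills pass through unchanged.

```python
# Hypothetical canonical spellings; extend with the skills that matter to you.
CANONICAL_SKILLS = {
    "fastapi": "FastAPI",
    "postgresql": "PostgreSQL",
    "javascript": "JavaScript",
    "aws": "AWS",
}


def canonicalise_skills(skills: list[str]) -> list[str]:
    """Map each skill to its canonical casing, dropping duplicates in order."""
    seen: set[str] = set()
    result: list[str] = []
    for skill in skills:
        key = skill.strip().lower()
        if not key or key in seen:
            continue
        seen.add(key)
        # Fall back to the model's own spelling when the skill is not in the map.
        result.append(CANONICAL_SKILLS.get(key, skill.strip()))
    return result


fixed = canonicalise_skills(["Fastapi", "Postgresql", "Docker", "fastapi"])
print(fixed)
```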
+ ---
+
+ ## Limitations
+
+ - Trained on **synthetic** English resumes, so real-world resumes with unusual layouts
+   may produce lower accuracy. Fine-tuning on 30+ real examples is likely to improve results.
+ - Skills are extracted with light normalisation; canonical casing (`FastAPI` vs `Fastapi`)
+   requires a post-processing map.
+ - Phone numbers are extracted as-is, without E.164 normalisation.
+ - Best suited to English resumes. The Qwen2.5 backbone may provide some multilingual
+   capability, but this was not tested.
+
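For the phone-number limitation, a rough E.164 pass can be added in post-processing. The sketch below is a heuristic only: `to_e164` is a hypothetical helper, the default country code of `1` is an assumption, and the 8-digit lower bound is an arbitrary sanity check. Numbers that already carry a country code but no leading `+`, or that include an extension, will be mis-handled; a dedicated library such as `phonenumbers` is the safer choice for production use.

```python
import re
from typing import Optional


def to_e164(raw: Optional[str], default_country_code: str = "1") -> Optional[str]:
    """Best-effort E.164 formatting; returns None when the input is unusable.

    Assumption: numbers without a leading '+' belong to `default_country_code`.
    """
    if not raw:
        return None
    has_plus = raw.strip().startswith("+")
    digits = re.sub(r"\D", "", raw)  # drop spaces, dashes, brackets, dots
    if not has_plus:
        # Strip a national trunk prefix such as a leading 0 before prefixing.
        digits = default_country_code + digits.lstrip("0")
    # E.164 allows at most 15 digits after the '+'; the lower bound is a heuristic.
    if not 8 <= len(digits) <= 15:
        return None
    return "+" + digits


print(to_e164("+44 20 7946 0958"))
print(to_e164("(415) 555-0132"))
```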
+ ---
+
+ ## Citation
+
+ If you use this model, please also cite the original NuExtract work:
+
+ ```bibtex
+ @misc{nuextract2024,
+   author = {NuMind},
+   title  = {NuExtract: A Foundation Model for Structured Extraction},
+   year   = {2024},
+   url    = {https://numind.ai/blog/nuextract-a-foundation-model-for-structured-extraction}
+ }
+ ```
+
+ ---
+
+ ## License
+
+ MIT, the same as the base model [`numind/NuExtract-tiny-v1.5`](https://huggingface.co/numind/NuExtract-1.5-tiny).
+
+ ---
  This was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth)
  [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)