# Nthuku-Fast 2.6

Compact multimodal model for CPU-friendly image and text generation.
Version 2.6 is a full fine-tune of 2.5, trained on a vision-only filtered manifest. This release improves the stored training loss/perplexity and slightly improves the weakness benchmark score relative to 2.3/2.4/2.5.
## What Changed in 2.6
- Training source: nthuku-fast-2.5
- Mode: full fine-tune
- Epochs: 10
- Max samples: 23,086
- Data file: nthuku_dataset/manifest_train_vision_only.parquet
- Objective of this pass: reduce text-only contamination that previously triggered code-like outputs during image prompts
## Key Metrics

### Training Metrics (from config)
| Version | Best Loss | Perplexity (exp(loss)) |
|---|---|---|
| 2.3 | 9.129752 | 9,226 |
| 2.4 | 10.342586 | 31,026 |
| 2.5 | 9.753154 | 17,208 |
| 2.6 | 8.607122 | 5,470 |
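The perplexity column follows directly from the stored loss as exp(loss); a minimal sketch to reproduce the table:

```python
import math

# Best stored training loss per version (values from the table above)
best_loss = {"2.3": 9.129752, "2.4": 10.342586, "2.5": 9.753154, "2.6": 8.607122}

# Perplexity is exp(loss); round to whole tokens for display
perplexity = {v: round(math.exp(loss)) for v, loss in best_loss.items()}

for version in sorted(perplexity):
    print(f"{version}: loss={best_loss[version]:.6f}  perplexity~{perplexity[version]:,}")
```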
Interpretation:
- 2.6 has the best stored training objective in this 2.3 to 2.6 range.
- Lower training loss did not fully translate into robust instruction-following behavior yet, so this should be treated as a necessary but not sufficient improvement.
### Weakness Benchmark (21 Auto Probes) with Real Image-Backed Probes

The evaluation was re-run with 15,000 real images drawn from vision_text sources into the probes/ directory; per-version reports:
- eval_results_23_with_images/weakness_report.json
- eval_results_24_with_images/weakness_report.json
- eval_results_25_with_images/weakness_report.json
- eval_results_26_with_images/weakness_report.json
| Version | Overall Score | Vision Counting | Language Instruction |
|---|---|---|---|
| 2.3 | 0.6467 | 0.6166 | 0.5333 |
| 2.4 | 0.6451 | 0.6000 | 0.5333 |
| 2.5 | 0.6451 | 0.6000 | 0.5333 |
| 2.6 | 0.6483 | 0.6348 | 0.5333 |
Key Findings:
- 2.6 achieves the best overall score (0.6483) with real images
- 2.6 vision/counting improved to 0.6348 (from 0.6000 in 2.4/2.5)
- Language/instruction remains the stable bottleneck at 0.5333 across all versions
- Most other vision dimensions (spatial, attribute, OCR, scene) plateau at the 0.6 baseline for all models
Deltas vs. prior versions:
- 2.6 - 2.5: +0.0032 overall
- 2.6 - 2.4: +0.0032 overall
- 2.6 - 2.3: +0.0016 overall
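These deltas can be reproduced from the per-version reports; the sketch below assumes each weakness_report.json exposes an overall score under an "overall_score" key (the field name is an assumption):

```python
import json

# Per-version report paths; the JSON structure is an assumption, and only an
# "overall_score" field is relied on here.
report_paths = {
    "2.3": "eval_results_23_with_images/weakness_report.json",
    "2.4": "eval_results_24_with_images/weakness_report.json",
    "2.5": "eval_results_25_with_images/weakness_report.json",
    "2.6": "eval_results_26_with_images/weakness_report.json",
}

def load_overall(path):
    # e.g. load_overall(report_paths["2.6"]) against the real report files
    with open(path) as f:
        return json.load(f)["overall_score"]  # assumed field name

def deltas_vs_prior(scores, latest="2.6"):
    """Overall-score deltas of `latest` against every other version."""
    return {v: round(scores[latest] - s, 4) for v, s in scores.items() if v != latest}

# With the scores from the table above:
scores = {"2.3": 0.6467, "2.4": 0.6451, "2.5": 0.6451, "2.6": 0.6483}
print(deltas_vs_prior(scores))  # {'2.3': 0.0016, '2.4': 0.0032, '2.5': 0.0032}
```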
### Speed (from probe_results `elapsed_s`)
| Version | Avg Seconds / Probe | Total Inference Time (s, 21 probes) | Avg Generated Tokens |
|---|---|---|---|
| 2.3 | 0.275 | 5.77 | 252.5 |
| 2.4 | 0.321 | 6.74 | 252.4 |
| 2.5 | 0.346 | 7.26 | 252.1 |
| 2.6 | 0.285 | 5.98 | 252.5 |
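A sketch of how these aggregates could be computed from probe results; the record layout (an "elapsed_s" and a "generated_tokens" field per probe) is an assumption, shown here on synthetic data:

```python
def summarize_speed(probe_results):
    """Aggregate per-probe timing into the figures shown in the table above.

    `probe_results` is assumed to be a list of dicts with "elapsed_s" and
    "generated_tokens" fields (both field names are assumptions).
    """
    n = len(probe_results)
    total_s = sum(p["elapsed_s"] for p in probe_results)
    avg_tokens = sum(p["generated_tokens"] for p in probe_results) / n
    return {
        "avg_s_per_probe": round(total_s / n, 3),
        "total_s": round(total_s, 2),
        "avg_generated_tokens": round(avg_tokens, 1),
    }

# Synthetic example: 21 probes of 0.285 s each (matching 2.6's average)
fake = [{"elapsed_s": 0.285, "generated_tokens": 252} for _ in range(21)]
print(summarize_speed(fake))
```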
Interpretation:
- 2.6 is materially faster than 2.4 and 2.5 on this harness.
- 2.6 is close to 2.3 latency while scoring slightly higher overall.
## Evaluation Status: Image-Backed Probes Active
As of the latest evaluation run, the probe suite now contains 15,000 real images sourced from:
- sources/vision_text/caption/images/
- sources/vision_text/vqa/images/
- sources/vision_text/conversation/images/
Current state:
- ✅ probes/ populated with real images
- ✅ Vision probe scores now grounded in actual image inference
- ⚠️ Most vision dimensions still plateau at the 0.6 baseline, indicating fundamental capability gaps
Interpretation:
- Vision scores are now valid multi-modal evaluations
- 0.6 baseline on most vision dimensions suggests that instruction-constrained visual reasoning is weak across the current architecture
- Strongest signal: v2.6 improves vision/counting to 0.6348 vs. 0.6000 in prior versions, but this is still a modest gain
- Critical bottleneck: language/instruction at 0.5333 prevents the model from producing compliant outputs even when visual understanding is available
## Image and Response Review

### Reviewed image paths
- sources/vision_text/mcqa/images/sciqa_0000000.jpg
- sources/vision_text/caption/images/coco_0004314.jpg
- sources/vision_text/detailed_caption/images/coco_0002088.jpg
### Observations
- MCQA map prompt
  - Image content is clear (a US map with labeled states).
  - The expected answer for farthest north among the given choices is West Virginia.
  - The actual terminal response for 2.6 remained mostly off-task and verbose, mixing unrelated fragments.
  - Conclusion: instruction-binding for constrained visual QA is still weak.
- General caption behavior
  - Responses often over-generate repetitive filler patterns (for example, repeated short tokens/phrases).
  - Even when image context exists, the completion drifts instead of staying compact and grounded.
  - Conclusion: decoding and instruction adherence remain the dominant bottlenecks.
- Improvement vs 2.5
  - Reduced code-like contamination is partially visible in the aggregate metrics (especially the counting and overall scores).
  - The model is not yet reliably production-grade for strict-format visual QA/captioning.
## Weakness Ranking (2.6 with Image-Backed Probes)
From eval_results_26_with_images/weakness_report.json:
| Dimension | Score | Severity |
|---|---|---|
| language/instruction | 0.5333 | ⚠️ CRITICAL |
| vision/spatial | 0.6000 | MEDIUM |
| vision/attribute | 0.6000 | MEDIUM |
| vision/ocr | 0.6000 | MEDIUM |
| vision/scene | 0.6000 | MEDIUM |
| reasoning/temporal | 0.6000 | MEDIUM |
| reasoning/counterfact | 0.6000 | MEDIUM |
| reasoning/causal | 0.6133 | MEDIUM |
| vision/counting | 0.6348 | MEDIUM |
| language/fluency | 0.7500 | LOW |
| language/detail | 1.0000 | ✅ STRENGTH |
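The severity labels can be reproduced with simple score thresholds; the cut-offs below are inferred from the table, not taken from the evaluation code:

```python
def severity(score):
    """Bucket a dimension score into a severity label.

    Thresholds are inferred from the ranking table above; the actual
    evaluation code may use different cut-offs.
    """
    if score < 0.60:
        return "CRITICAL"
    if score < 0.70:
        return "MEDIUM"
    if score < 1.00:
        return "LOW"
    return "STRENGTH"

dims = {
    "language/instruction": 0.5333,
    "vision/counting": 0.6348,
    "language/fluency": 0.7500,
    "language/detail": 1.0000,
}
for name, score in sorted(dims.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f} -> {severity(score)}")
```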
Strategic Insight:
- The single largest improvement lever for v2.7 is instruction adherence (0.5333 → target 0.75+)
- Vision dimensions will not improve significantly without addressing the output format constraint issue
- The model has strong language capability (detail and fluency) but cannot reliably follow instructions on what to output
## Recommended Next Steps for v2.7

Step 1 is now complete: the real probe pack with 15,000 images has been built and vision metrics are valid.

Focus for v2.7: address the instruction/output-constraint bottleneck (0.5333 → 0.75+).
Remaining priority actions:

- ✅ DONE: Build real probe pack
  - probes/ populated with 15,000 images
  - Vision metrics are grounded and valid
- Add constrained-format visual QA curriculum
  - Add many samples with strict outputs:
    - single-word answers for MCQA
    - one-line numeric answers for counting
    - exact 3-sentence captions for description
  - Keep a high weight on image-conditioned instruction-following data
  - Target: increase language/instruction from 0.5333 to 0.70+
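The strict output formats above lend themselves to mechanical validation; a minimal sketch (the exact formats the eval would enforce are assumptions):

```python
import re

def is_single_word(answer):
    """MCQA: a single word with no surrounding chatter."""
    return bool(re.fullmatch(r"[A-Za-z][A-Za-z'-]*", answer.strip()))

def is_one_line_numeric(answer):
    """Counting: one line containing only an integer."""
    return bool(re.fullmatch(r"\d+", answer.strip()))

def is_three_sentences(answer):
    """Captioning: exactly three sentences, naively split on terminators."""
    sentences = [s for s in re.split(r"[.!?]+", answer.strip()) if s.strip()]
    return len(sentences) == 3

print(is_single_word("Virginia"))                       # True
print(is_one_line_numeric("7"))                         # True
print(is_three_sentences("A dog. It runs. It barks."))  # True
```

Multi-word MCQA options (e.g. "West Virginia") would need a looser pattern; the point is that format compliance is cheap to measure automatically.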
- Penalize repetition during evaluation and tune decoding defaults
  - Use a lower `max_new_tokens` for constrained prompts (e.g., 5-15 tokens for MCQA)
  - Apply a stronger repetition penalty (increase from 1.1 to 1.3-1.5)
  - Implement early stopping for short-answer tasks
  - Reduce verbose filler generation (observed "so so so" and "a a a" patterns in probes)
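These decoding changes can be collected into per-task defaults; the keys below follow common Hugging Face `generate()` argument names, and the values are the ranges proposed here, not validated settings:

```python
# Per-task decoding defaults sketching the tuning suggestions above.
# Keys follow common Hugging Face `generate()` argument names; the values
# are the ranges proposed in this report, not validated settings.
DECODING_DEFAULTS = {
    "mcqa": {
        "max_new_tokens": 10,        # 5-15 token budget for single-word answers
        "repetition_penalty": 1.3,   # raised from the previous 1.1
        "early_stopping": True,      # stop early on short-answer tasks
    },
    "counting": {
        "max_new_tokens": 8,
        "repetition_penalty": 1.3,
        "early_stopping": True,
    },
    "caption": {
        "max_new_tokens": 96,        # room for three short sentences
        "repetition_penalty": 1.4,
        "no_repeat_ngram_size": 3,   # suppresses "so so so" / "a a a" loops
    },
}

def decoding_for(task):
    """Return the decoding defaults for a task, falling back to captioning."""
    return DECODING_DEFAULTS.get(task, DECODING_DEFAULTS["caption"])
```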
- Add a hard pass/fail validation gate
  - Create 200-500 curated high-quality items with deterministic expected outputs
  - Track exact-match accuracy for MCQA and format compliance rate
  - Use as a release gate for v2.7
  - Expected outcome: avoid production deployments where instruction binding is broken
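A sketch of such a gate; the item layout and the thresholds are illustrative assumptions, not calibrated values:

```python
def release_gate(items, exact_match_min=0.75, format_min=0.90):
    """Hard pass/fail gate over curated items.

    Each item is assumed to carry a model `output`, the `expected` answer,
    and a `format_ok` flag from a format checker; the thresholds are
    illustrative, not calibrated.
    """
    n = len(items)
    exact = sum(1 for it in items if it["output"].strip() == it["expected"]) / n
    fmt = sum(1 for it in items if it["format_ok"]) / n
    passed = exact >= exact_match_min and fmt >= format_min
    return {"exact_match": round(exact, 4), "format_compliance": round(fmt, 4), "pass": passed}

# Synthetic example: 4 of 5 exact matches, all format-compliant -> gate passes
items = [
    {"output": "dog", "expected": "dog", "format_ok": True},
    {"output": "cat", "expected": "cat", "format_ok": True},
    {"output": "3", "expected": "3", "format_ok": True},
    {"output": "red", "expected": "blue", "format_ok": True},
    {"output": "map", "expected": "map", "format_ok": True},
]
print(release_gate(items))
```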
Expected outcome of v2.7: +0.15-0.20 on language/instruction while preserving 2.6's loss/perplexity and speed gains.
## Local Quick Test

Text-only:

```bash
python test_local_model.py --model_dir ./nthuku-fast-2.6 --prompt "Hello, how are you?"
```

Vision + text:

```bash
python test_local_model.py --model_dir ./nthuku-fast-2.6 --prompt "Describe this image in English only, in 3 short sentences." --image ./image.jpg
```
## License
Released under the Apache 2.0 License. Free to use, modify, and distribute with attribution.