# Nthuku-Fast 2.6

Compact multimodal model for CPU-friendly image and text generation.
Version 2.6 is a full fine-tune of 2.5, trained on a vision-only filtered manifest. This release improves the stored training loss/perplexity and slightly improves the weakness benchmark score relative to 2.3/2.4/2.5.
## What Changed in 2.6
- Training source: nthuku-fast-2.5
- Mode: full fine-tune
- Epochs: 10
- Max samples: 23,086
- Data file: nthuku_dataset/manifest_train_vision_only.parquet
- Objective of this pass: reduce text-only contamination that previously triggered code-like outputs during image prompts
## Key Metrics

### Training Metrics (from config)
| Version | Best Loss | Perplexity (exp(loss)) |
|---|---|---|
| 2.3 | 9.129752 | 9,226 |
| 2.4 | 10.342586 | 31,026 |
| 2.5 | 9.753154 | 17,208 |
| 2.6 | 8.607122 | 5,470 |
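The perplexity column follows directly from the stored loss as exp(loss); a minimal sketch to reproduce the table:

```python
import math

# Best stored training loss per version (values from the table above)
best_loss = {"2.3": 9.129752, "2.4": 10.342586, "2.5": 9.753154, "2.6": 8.607122}

# Perplexity is exp(loss); round to whole tokens for display
perplexity = {v: round(math.exp(loss)) for v, loss in best_loss.items()}

for version in sorted(perplexity):
    print(f"{version}: loss={best_loss[version]:.6f}  perplexity~{perplexity[version]:,}")
```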
Interpretation:
- 2.6 has the best stored training objective in this 2.3 to 2.6 range.
- Lower training loss did not fully translate into robust instruction-following behavior yet, so this should be treated as a necessary but not sufficient improvement.
### Weakness Benchmark (21 Auto Probes) with Real Image-Backed Probes

The evaluation was re-run with 15,000 real images drawn from vision_text sources into the probes/ directory; per-version reports:
- eval_results_23_with_images/weakness_report.json
- eval_results_24_with_images/weakness_report.json
- eval_results_25_with_images/weakness_report.json
- eval_results_26_with_images/weakness_report.json
| Version | Overall Score | Vision Counting | Language Instruction |
|---|---|---|---|
| 2.3 | 0.6467 | 0.6166 | 0.5333 |
| 2.4 | 0.6451 | 0.6000 | 0.5333 |
| 2.5 | 0.6451 | 0.6000 | 0.5333 |
| 2.6 | 0.6483 | 0.6348 | 0.5333 |
Key Findings:
- 2.6 achieves the best overall score (0.6483) with real images
- 2.6 vision/counting improved to 0.6348 (from 0.6000 in 2.4/2.5)
- Language/instruction remains the stable bottleneck at 0.5333 across all versions
- Most other vision dimensions (spatial, attribute, OCR, scene) plateau at the 0.6 baseline for all models
Deltas vs. prior versions:
- 2.6 - 2.5: +0.0032 overall
- 2.6 - 2.4: +0.0032 overall
- 2.6 - 2.3: +0.0016 overall
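These deltas can be reproduced from the per-version reports; the sketch below assumes each weakness_report.json exposes an overall score under an "overall_score" key (the field name is an assumption):

```python
import json

# Per-version report paths; the JSON structure is an assumption, and only an
# "overall_score" field is relied on here.
report_paths = {
    "2.3": "eval_results_23_with_images/weakness_report.json",
    "2.4": "eval_results_24_with_images/weakness_report.json",
    "2.5": "eval_results_25_with_images/weakness_report.json",
    "2.6": "eval_results_26_with_images/weakness_report.json",
}

def load_overall(path):
    # e.g. load_overall(report_paths["2.6"]) against the real report files
    with open(path) as f:
        return json.load(f)["overall_score"]  # assumed field name

def deltas_vs_prior(scores, latest="2.6"):
    """Overall-score deltas of `latest` against every other version."""
    return {v: round(scores[latest] - s, 4) for v, s in scores.items() if v != latest}

# With the scores from the table above:
scores = {"2.3": 0.6467, "2.4": 0.6451, "2.5": 0.6451, "2.6": 0.6483}
print(deltas_vs_prior(scores))  # {'2.3': 0.0016, '2.4': 0.0032, '2.5': 0.0032}
```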
### Speed (from probe_results `elapsed_s`)
| Version | Avg Seconds / Probe | Total Inference Time (s, 21 probes) | Avg Generated Tokens |
|---|---|---|---|
| 2.3 | 0.275 | 5.77 | 252.5 |
| 2.4 | 0.321 | 6.74 | 252.4 |
| 2.5 | 0.346 | 7.26 | 252.1 |
| 2.6 | 0.285 | 5.98 | 252.5 |
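A sketch of how these aggregates could be computed from probe results; the record layout (an "elapsed_s" and a "generated_tokens" field per probe) is an assumption, shown here on synthetic data:

```python
def summarize_speed(probe_results):
    """Aggregate per-probe timing into the figures shown in the table above.

    `probe_results` is assumed to be a list of dicts with "elapsed_s" and
    "generated_tokens" fields (both field names are assumptions).
    """
    n = len(probe_results)
    total_s = sum(p["elapsed_s"] for p in probe_results)
    avg_tokens = sum(p["generated_tokens"] for p in probe_results) / n
    return {
        "avg_s_per_probe": round(total_s / n, 3),
        "total_s": round(total_s, 2),
        "avg_generated_tokens": round(avg_tokens, 1),
    }

# Synthetic example: 21 probes of 0.285 s each (matching 2.6's average)
fake = [{"elapsed_s": 0.285, "generated_tokens": 252} for _ in range(21)]
print(summarize_speed(fake))
```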
Interpretation:
- 2.6 is materially faster than 2.4 and 2.5 on this harness.
- 2.6 is close to 2.3 latency while scoring slightly higher overall.
## Evaluation Status: Image-Backed Probes Active
As of the latest evaluation run, the probe suite now contains 15,000 real images sourced from:
- sources/vision_text/caption/images/
- sources/vision_text/vqa/images/
- sources/vision_text/conversation/images/
Current state:
- ✅ probes/ populated with real images
- ✅ Vision probe scores now grounded in actual image inference
- ⚠️ Most vision dimensions still plateau at the 0.6 baseline, indicating fundamental capability gaps
Interpretation:
- Vision scores are now valid multi-modal evaluations
- 0.6 baseline on most vision dimensions suggests that instruction-constrained visual reasoning is weak across the current architecture
- Strongest signal: v2.6 improves vision/counting to 0.6348 vs. 0.6000 in prior versions, but this is still a modest gain
- Critical bottleneck: language/instruction at 0.5333 prevents the model from producing compliant outputs even when visual understanding is available
## Image and Response Review

### Reviewed image paths
- sources/vision_text/mcqa/images/sciqa_0000000.jpg
- sources/vision_text/caption/images/coco_0004314.jpg
- sources/vision_text/detailed_caption/images/coco_0002088.jpg
### Observations
- MCQA map prompt
  - Image content is clear (a US map with labeled states).
  - The expected answer for farthest north among the given choices is West Virginia.
  - The actual terminal response for 2.6 remained mostly off-task and verbose, mixing unrelated fragments.
  - Conclusion: instruction-binding for constrained visual QA is still weak.
- General caption behavior
  - Responses often over-generate repetitive filler patterns (for example, repeated short tokens/phrases).
  - Even when image context exists, the completion drifts instead of staying compact and grounded.
  - Conclusion: decoding and instruction adherence remain the dominant bottlenecks.
- Improvement vs 2.5
  - Reduced code-like contamination is partially visible in the aggregate metrics (especially the counting and overall scores).
  - The model is not yet reliably production-grade for strict-format visual QA/captioning.
## Weakness Ranking (2.6 with Image-Backed Probes)
From eval_results_26_with_images/weakness_report.json:
| Dimension | Score | Severity |
|---|---|---|
| language/instruction | 0.5333 | ⚠️ CRITICAL |
| vision/spatial | 0.6000 | MEDIUM |
| vision/attribute | 0.6000 | MEDIUM |
| vision/ocr | 0.6000 | MEDIUM |
| vision/scene | 0.6000 | MEDIUM |
| reasoning/temporal | 0.6000 | MEDIUM |
| reasoning/counterfact | 0.6000 | MEDIUM |
| reasoning/causal | 0.6133 | MEDIUM |
| vision/counting | 0.6348 | MEDIUM |
| language/fluency | 0.7500 | LOW |
| language/detail | 1.0000 | ✅ STRENGTH |
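The severity labels can be reproduced with simple score thresholds; the cut-offs below are inferred from the table, not taken from the evaluation code:

```python
def severity(score):
    """Bucket a dimension score into a severity label.

    Thresholds are inferred from the ranking table above; the actual
    evaluation code may use different cut-offs.
    """
    if score < 0.60:
        return "CRITICAL"
    if score < 0.70:
        return "MEDIUM"
    if score < 1.00:
        return "LOW"
    return "STRENGTH"

dims = {
    "language/instruction": 0.5333,
    "vision/counting": 0.6348,
    "language/fluency": 0.7500,
    "language/detail": 1.0000,
}
for name, score in sorted(dims.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.4f} -> {severity(score)}")
```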
Strategic Insight:
- The single largest improvement lever for v2.7 is instruction adherence (0.5333 → target 0.75+)
- Vision dimensions will not improve significantly without addressing the output format constraint issue
- The model has strong language capability (detail and fluency) but cannot reliably follow instructions on what to output
## Recommended Next Steps for v2.7

Step 1 is now complete: the real probe pack with 15,000 images has been built and vision metrics are valid.

Focus for v2.7: address the instruction/output-constraint bottleneck (0.5333 → 0.75+).
Remaining priority actions:

- ✅ DONE: Build real probe pack
  - probes/ populated with 15,000 images
  - Vision metrics are grounded and valid
- Add constrained-format visual QA curriculum
  - Add many samples with strict outputs:
    - single-word answers for MCQA
    - one-line numeric answers for counting
    - exact 3-sentence captions for description
  - Keep a high weight on image-conditioned instruction-following data
  - Target: increase language/instruction from 0.5333 to 0.70+
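The strict output formats above lend themselves to mechanical validation; a minimal sketch (the exact formats the eval would enforce are assumptions):

```python
import re

def is_single_word(answer):
    """MCQA: a single word with no surrounding chatter."""
    return bool(re.fullmatch(r"[A-Za-z][A-Za-z'-]*", answer.strip()))

def is_one_line_numeric(answer):
    """Counting: one line containing only an integer."""
    return bool(re.fullmatch(r"\d+", answer.strip()))

def is_three_sentences(answer):
    """Captioning: exactly three sentences, naively split on terminators."""
    sentences = [s for s in re.split(r"[.!?]+", answer.strip()) if s.strip()]
    return len(sentences) == 3

print(is_single_word("Virginia"))                       # True
print(is_one_line_numeric("7"))                         # True
print(is_three_sentences("A dog. It runs. It barks."))  # True
```

Multi-word MCQA options (e.g. "West Virginia") would need a looser pattern; the point is that format compliance is cheap to measure automatically.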
- Penalize repetition during evaluation and tune decoding defaults
  - Use a lower `max_new_tokens` for constrained prompts (e.g., 5-15 tokens for MCQA)
  - Apply a stronger repetition penalty (increase from 1.1 to 1.3-1.5)
  - Implement early stopping for short-answer tasks
  - Reduce verbose filler generation (observed "so so so" and "a a a" patterns in probes)
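These decoding changes can be collected into per-task defaults; the keys below follow common Hugging Face `generate()` argument names, and the values are the ranges proposed here, not validated settings:

```python
# Per-task decoding defaults sketching the tuning suggestions above.
# Keys follow common Hugging Face `generate()` argument names; the values
# are the ranges proposed in this report, not validated settings.
DECODING_DEFAULTS = {
    "mcqa": {
        "max_new_tokens": 10,        # 5-15 token budget for single-word answers
        "repetition_penalty": 1.3,   # raised from the previous 1.1
        "early_stopping": True,      # stop early on short-answer tasks
    },
    "counting": {
        "max_new_tokens": 8,
        "repetition_penalty": 1.3,
        "early_stopping": True,
    },
    "caption": {
        "max_new_tokens": 96,        # room for three short sentences
        "repetition_penalty": 1.4,
        "no_repeat_ngram_size": 3,   # suppresses "so so so" / "a a a" loops
    },
}

def decoding_for(task):
    """Return the decoding defaults for a task, falling back to captioning."""
    return DECODING_DEFAULTS.get(task, DECODING_DEFAULTS["caption"])
```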
- Add a hard pass/fail validation gate
  - Create 200-500 curated high-quality items with deterministic expected outputs
  - Track exact-match accuracy for MCQA and format compliance rate
  - Use as a release gate for v2.7
  - Expected outcome: avoid production deployments where instruction binding is broken
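A sketch of such a gate; the item layout and the thresholds are illustrative assumptions, not calibrated values:

```python
def release_gate(items, exact_match_min=0.75, format_min=0.90):
    """Hard pass/fail gate over curated items.

    Each item is assumed to carry a model `output`, the `expected` answer,
    and a `format_ok` flag from a format checker; the thresholds are
    illustrative, not calibrated.
    """
    n = len(items)
    exact = sum(1 for it in items if it["output"].strip() == it["expected"]) / n
    fmt = sum(1 for it in items if it["format_ok"]) / n
    passed = exact >= exact_match_min and fmt >= format_min
    return {"exact_match": round(exact, 4), "format_compliance": round(fmt, 4), "pass": passed}

# Synthetic example: 4 of 5 exact matches, all format-compliant -> gate passes
items = [
    {"output": "dog", "expected": "dog", "format_ok": True},
    {"output": "cat", "expected": "cat", "format_ok": True},
    {"output": "3", "expected": "3", "format_ok": True},
    {"output": "red", "expected": "blue", "format_ok": True},
    {"output": "map", "expected": "map", "format_ok": True},
]
print(release_gate(items))
```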
Expected outcome of v2.7: +0.15-0.20 on language/instruction while preserving 2.6's loss/perplexity and speed gains.
## Local Quick Test

Text-only:

```bash
python test_local_model.py --model_dir ./nthuku-fast-2.6 --prompt "Hello, how are you?"
```

Vision + text:

```bash
python test_local_model.py --model_dir ./nthuku-fast-2.6 --prompt "Describe this image in English only, in 3 short sentences." --image ./image.jpg
```
## License
Released under the Apache 2.0 License. Free to use, modify, and distribute with attribution.