███╗   ██╗████████╗██╗  ██╗██╗   ██╗██╗  ██╗██╗   ██╗    ███████╗ █████╗ ███████╗████████╗
████╗  ██║╚══██╔══╝██║  ██║██║   ██║██║ ██╔╝██║   ██║    ██╔════╝██╔══██╗██╔════╝╚══██╔══╝
██╔██╗ ██║   ██║   ███████║██║   ██║█████╔╝ ██║   ██║    █████╗  ███████║███████╗   ██║
██║╚██╗██║   ██║   ██╔══██║██║   ██║██╔═██╗ ██║   ██║    ██╔══╝  ██╔══██║╚════██║   ██║
██║ ╚████║   ██║   ██║  ██║╚██████╔╝██║  ██╗╚██████╔╝    ██║     ██║  ██║███████║   ██║
╚═╝  ╚═══╝   ╚═╝   ╚═╝  ╚═╝ ╚═════╝ ╚═╝  ╚═╝ ╚═════╝     ╚═╝     ╚═╝  ╚═╝╚══════╝   ╚═╝

Version 2.6

License: Apache 2.0

Compact multimodal model for CPU-friendly image + text generation.

Nthuku-Fast 2.6

Version 2.6 is a full fine-tune of 2.5 trained on a vision-only filtered manifest. This release improves the stored training loss/perplexity and slightly improves the weakness-benchmark score relative to 2.3, 2.4, and 2.5.

What Changed In 2.6

  • Training source: nthuku-fast-2.5
  • Mode: full fine-tune
  • Epochs: 10
  • Max samples: 23,086
  • Data file: nthuku_dataset/manifest_train_vision_only.parquet
  • Objective of this pass: reduce text-only contamination that previously triggered code-like outputs during image prompts
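
The bullet points above amount to a small fine-tune configuration. A sketch as a Python dict; the key names are illustrative assumptions, only the values come from this release note:

```python
# Illustrative record of the 2.6 fine-tune pass; key names are assumptions,
# the values are taken from the release notes above.
finetune_config = {
    "base_model": "nthuku-fast-2.5",
    "mode": "full_finetune",
    "epochs": 10,
    "max_samples": 23_086,
    "train_file": "nthuku_dataset/manifest_train_vision_only.parquet",
}

print(f"{finetune_config['mode']} for {finetune_config['epochs']} epochs "
      f"on {finetune_config['max_samples']:,} samples")
```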

Key Metrics

Training Metrics (from config)

| Version | Best Loss | Perplexity (exp(loss)) |
|---------|-----------|------------------------|
| 2.3 | 9.129752 | 9,226 |
| 2.4 | 10.342586 | 31,026 |
| 2.5 | 9.753154 | 17,208 |
| 2.6 | 8.607122 | 5,470 |

Interpretation:

  • 2.6 has the best stored training objective in this 2.3 to 2.6 range.
  • Lower training loss did not fully translate into robust instruction-following behavior yet, so this should be treated as a necessary but not sufficient improvement.
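
Since the perplexity column is just exp(best loss), the table can be sanity-checked directly:

```python
import math

# Stored best training losses from the config, per the table above
best_loss = {"2.3": 9.129752, "2.4": 10.342586, "2.5": 9.753154, "2.6": 8.607122}

for version, loss in best_loss.items():
    # Perplexity here is defined as exp(loss)
    print(f"{version}: exp({loss}) = {math.exp(loss):,.0f}")
```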

Weakness Benchmark (21 auto probes) — With Real Image-Backed Probes

Evaluation was re-run with 15,000 real images from vision_text sources placed in the probes/ directory:

  • eval_results_23_with_images/weakness_report.json
  • eval_results_24_with_images/weakness_report.json
  • eval_results_25_with_images/weakness_report.json
  • eval_results_26_with_images/weakness_report.json

| Version | Overall Score | Vision/Counting | Language/Instruction |
|---------|---------------|-----------------|----------------------|
| 2.3 | 0.6467 | 0.6166 | 0.5333 |
| 2.4 | 0.6451 | 0.6000 | 0.5333 |
| 2.5 | 0.6451 | 0.6000 | 0.5333 |
| 2.6 | 0.6483 | 0.6348 | 0.5333 |

Key Findings:

  • 2.6 achieves the best overall score (0.6483) with real images
  • 2.6 vision/counting improved to 0.6348 (from 0.6000 in 2.4/2.5)
  • Language/instruction remains the bottleneck, flat at 0.5333 across all versions
  • Most other vision dimensions (spatial, attribute, OCR, scene) plateau at the 0.6 baseline for all models

Delta vs prior:

  • 2.6 - 2.5: +0.0032 overall
  • 2.6 - 2.4: +0.0032 overall
  • 2.6 - 2.3: +0.0016 overall
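
The deltas follow directly from the overall-score column:

```python
# Overall weakness-benchmark scores from the table above
overall = {"2.3": 0.6467, "2.4": 0.6451, "2.5": 0.6451, "2.6": 0.6483}

for prior in ("2.5", "2.4", "2.3"):
    print(f"2.6 - {prior}: {overall['2.6'] - overall[prior]:+.4f}")
```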

Speed (from probe_results elapsed_s)

| Version | Avg Seconds / Probe | Total Inference Time (21 probes) | Avg Generated Tokens |
|---------|---------------------|----------------------------------|----------------------|
| 2.3 | 0.275 | 5.77 | 252.5 |
| 2.4 | 0.321 | 6.74 | 252.4 |
| 2.5 | 0.346 | 7.26 | 252.1 |
| 2.6 | 0.285 | 5.98 | 252.5 |

Interpretation:

  • 2.6 is materially faster than 2.4 and 2.5 on this harness.
  • 2.6 is close to 2.3 latency while scoring slightly higher overall.
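
The totals are consistent with the per-probe averages times the 21 probes, up to rounding of the average:

```python
# (avg seconds per probe, reported total over 21 probes) from the table above
speed = {
    "2.3": (0.275, 5.77),
    "2.4": (0.321, 6.74),
    "2.5": (0.346, 7.26),
    "2.6": (0.285, 5.98),
}

for version, (avg_s, total_s) in speed.items():
    print(f"{version}: {avg_s} * 21 = {avg_s * 21:.3f}s (reported {total_s}s)")
```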

Evaluation Status: Image-Backed Probes Active

As of the latest evaluation run, the probe suite now contains 15,000 real images sourced from:

  • sources/vision_text/caption/images/
  • sources/vision_text/vqa/images/
  • sources/vision_text/conversation/images/

Current state:

  • ✅ probes/ populated with real images
  • ✅ Vision probe scores now grounded in actual image inference
  • ⚠️ Most vision dimensions still plateau at the 0.6 baseline, indicating fundamental capability gaps

Interpretation:

  • Vision scores are now valid multimodal evaluations
  • The 0.6 baseline on most vision dimensions suggests that instruction-constrained visual reasoning is weak across the current architecture
  • Strongest signal: 2.6 improves vision/counting to 0.6348 vs. 0.6000 in prior versions, though the gain is modest
  • Critical bottleneck: language/instruction at 0.5333 prevents the model from producing compliant output even when the visual understanding is available

Image and Response Review

Reviewed image paths

  • sources/vision_text/mcqa/images/sciqa_0000000.jpg
  • sources/vision_text/caption/images/coco_0004314.jpg
  • sources/vision_text/detailed_caption/images/coco_0002088.jpg

Observations

  1. MCQA map prompt
  • Image content is clear (US map with labeled states).
  • Expected answer for farthest north among the given choices is West Virginia.
  • Actual terminal response for 2.6 remained mostly off-task and verbose, mixing unrelated fragments.
  • Conclusion: instruction-binding for constrained visual QA is still weak.
  2. General caption behavior
  • Responses often over-generate repetitive filler patterns (for example repeated short tokens/phrases).
  • Even when image context exists, completion drifts instead of staying compact and grounded.
  • Conclusion: decoding and instruction adherence remain the dominant bottlenecks.
  3. Improvement vs 2.5
  • Reduced code-like contamination is partially visible in aggregate metrics (especially counting score and overall score).
  • But the model is not yet reliably production-grade for strict-format visual QA/captioning.
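
The repetitive filler noted above ("so so so", "a a a") is easy to flag mechanically. A minimal sketch; the run-length threshold of 3 is an arbitrary choice, not something taken from the eval harness:

```python
def has_token_repetition(text: str, min_run: int = 3) -> bool:
    """Return True if any whitespace token repeats min_run or more times in a row."""
    tokens = text.split()
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run >= min_run:
            return True
    return False

print(has_token_repetition("the map shows so so so many states"))  # True
print(has_token_repetition("a clear photo of two dogs on grass"))  # False
```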

Weakness Ranking (2.6 with Image-Backed Probes)

From eval_results_26_with_images/weakness_report.json:

| Dimension | Score | Severity |
|-----------|-------|----------|
| language/instruction | 0.5333 | ⚠️ CRITICAL |
| vision/spatial | 0.6000 | MEDIUM |
| vision/attribute | 0.6000 | MEDIUM |
| vision/ocr | 0.6000 | MEDIUM |
| vision/scene | 0.6000 | MEDIUM |
| reasoning/temporal | 0.6000 | MEDIUM |
| reasoning/counterfact | 0.6000 | MEDIUM |
| reasoning/causal | 0.6133 | MEDIUM |
| vision/counting | 0.6348 | MEDIUM |
| language/fluency | 0.7500 | LOW |
| language/detail | 1.0000 | ✅ STRENGTH |

Strategic Insight:

  • The single largest improvement lever for v2.7 is instruction adherence (0.5333 β†’ target 0.75+)
  • Vision dimensions will not improve significantly without addressing the output format constraint issue
  • The model has strong language capability (detail and fluency) but cannot reliably follow instructions on what to output
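
Assuming weakness_report.json maps dimension names to scores (the real schema may differ; the values below are copied from the table above), ranking the bottlenecks is a one-liner:

```python
# Hypothetical shape of eval_results_26_with_images/weakness_report.json;
# the actual file schema is not confirmed here.
report = {
    "language/instruction": 0.5333,
    "vision/counting": 0.6348,
    "language/fluency": 0.7500,
    "language/detail": 1.0000,
}

# Sort ascending: the weakest dimension is the biggest lever for v2.7
ranked = sorted(report.items(), key=lambda kv: kv[1])
print(ranked[0])  # ('language/instruction', 0.5333)
```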

Recommended Next Step for v2.7

Step 1 is complete: a real probe pack with 15,000 images has been built, and the vision metrics are now valid.

Focus for v2.7: Address the instruction/output-constraint bottleneck (0.5333 β†’ 0.75+)

Remaining priority actions:

  1. ✅ DONE — Build real probe pack
  • probes/ populated with 15,000 images
  • Vision metrics are grounded and valid
  2. Add constrained-format visual QA curriculum
  • Add many samples with strict outputs:
    • single-word answers for MCQA
    • one-line numeric answers for counting
    • exact 3-sentence captions for description
  • Keep high weight on image-conditioned instruction-following data
  • Target: increase language/instruction from 0.5333 to 0.70+
  3. Penalize repetition during evaluation and tune decoding defaults
  • Use lower max_new_tokens for constrained prompts (e.g., 5-15 tokens for MCQA)
  • Apply stronger repetition penalty (increase from 1.1 to 1.3-1.5)
  • Implement early stopping for short-answer tasks
  • Reduce verbose filler generation (observed "so so so" and "a a a" patterns in probes)
  4. Add hard pass/fail validation gate
  • Create 200-500 curated high-quality items with deterministic expected outputs
  • Track exact-match accuracy for MCQA and format compliance rate
  • Use as release gate for v2.7
  • Expected outcome: avoid production deployments where instruction binding is broken
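
The stronger repetition penalty suggested for decoding follows the standard CTRL-style update: divide the positive logits of already-seen tokens by the penalty, multiply the negative ones. A minimal, library-independent sketch:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Lower the scores of tokens already present in the generated context."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit (or multiplying a negative one) reduces
        # that token's post-softmax probability.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Token 0 was already generated, so its logit is pushed down
print(apply_repetition_penalty([2.6, -1.0, 0.5], generated_ids=[0], penalty=1.3))
```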

Expected outcome of v2.7: +0.15-0.20 on language/instruction while preserving 2.6's loss/perplexity and speed gains.
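
The proposed release gate reduces to two numbers per candidate model: exact-match accuracy and a format-compliance rate. A sketch with an invented item schema, since the curated set does not exist yet:

```python
def gate_metrics(items, predictions):
    """items: list of {'expected': str, 'max_words': int}; predictions: model outputs."""
    exact = sum(pred.strip().lower() == item["expected"].lower()
                for item, pred in zip(items, predictions))
    compliant = sum(len(pred.split()) <= item["max_words"]
                    for item, pred in zip(items, predictions))
    n = len(items)
    return exact / n, compliant / n

items = [{"expected": "West Virginia", "max_words": 3},
         {"expected": "4", "max_words": 1}]
preds = ["west virginia", "there are 4 dogs"]
print(gate_metrics(items, preds))  # (0.5, 0.5)
```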

Local Quick Test

Text-only:

```bash
python test_local_model.py --model_dir ./nthuku-fast-2.6 --prompt "Hello, how are you?"
```

Vision + text:

```bash
python test_local_model.py --model_dir ./nthuku-fast-2.6 --prompt "Describe this image in English only, in 3 short sentences." --image ./image.jpg
```

License

Released under the Apache 2.0 License. Free to use, modify, and distribute with attribution.
