Harley-ml committed on
Commit 5a43b2d · verified · 1 Parent(s): aff2749

Update README.md

Files changed (1)
  1. README.md +10 -2
README.md CHANGED
@@ -541,10 +541,17 @@ BPB is simply the amount of yes-no questions the model needs to predict the next
 
  ---
 
- We decided to evaluate the model on each source it trained on to see the difference in perplexity.
 
- [omitted temporarily]
 
  ---
 
  ## Benchmarks
@@ -572,6 +579,7 @@ We decided to evaluate the model on each source it trained on to see the differe
  The model achieves random or near-random on most tasks, which is expected. An 8M parameter model cannot store world-level knowledge or thoroughly reason.
 
  Note: The full breakdown (LM Harness Output) is right [here](https://huggingface.co/Harley-ml/Tenete-8M/blob/main/raw_lmharness_eval_output.txt)
  ### Coherency Benchmark
 
  To evaluate the **coherency, factuality, and fluency** of our (and other) models, we use **Qwen3-32B** to grade **300 different generations** generated from an **unconditional prompt**.
 
 
  ---
 
+ We decided to evaluate the model on each source to see the difference in perplexity.
 
 
+ | Source | Loss | Perplexity |
+ |--------|------|------------|
+ | Textbooks | **2.02** | **7.57** |
+ | Q&A | 3.20 | 24.65 |
+ | Books | 3.73 | 41.88 |
+ | Medium articles | 3.79 | 44.40 |
+
+ The textbooks' perplexity is roughly six times lower than that of the Medium articles (7.57 vs. 44.40). This is expected: Tiny-Textbooks uses a templated structure (e.g., Section 1, Conclusion) with an LLM generating the rest, resulting in lower entropy than standard English. Medium articles are structurally, tonally, and stylistically more diverse and unpredictable. The same could be said for the books.
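
As a rough sanity check on the table, the reported perplexities can be compared with the exponential of the reported losses. This is only a sketch: it assumes the loss column is the mean per-token cross-entropy in nats, which the README does not state, and the helper name `perplexity_from_loss` is hypothetical. Small differences are expected because the losses are rounded to two decimals.

```python
import math

# Loss and perplexity values copied from the table above.
results = {
    "Textbooks": (2.02, 7.57),
    "Q&A": (3.20, 24.65),
    "Books": (3.73, 41.88),
    "Medium articles": (3.79, 44.40),
}


def perplexity_from_loss(loss: float) -> float:
    """Return exp(loss); assumes loss is mean cross-entropy in nats per token."""
    return math.exp(loss)


# Recompute perplexity from each rounded loss and show it next to the reported value.
recomputed = {source: perplexity_from_loss(loss) for source, (loss, _) in results.items()}
for source, (loss, reported) in results.items():
    print(f"{source:16s} loss={loss:.2f} exp(loss)={recomputed[source]:6.2f} reported={reported:6.2f}")

# Ratio behind the "roughly six times lower" comparison.
ratio = results["Medium articles"][1] / results["Textbooks"][1]
print(f"Medium articles / Textbooks perplexity ratio: {ratio:.2f}")
```

Under that assumption, each recomputed value should land within about half a percent of the reported perplexity, since the only mismatch comes from rounding the loss.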
  ---
 
  ## Benchmarks
 
  The model achieves random or near-random on most tasks, which is expected. An 8M parameter model cannot store world-level knowledge or thoroughly reason.
 
  Note: The full breakdown (LM Harness Output) is right [here](https://huggingface.co/Harley-ml/Tenete-8M/blob/main/raw_lmharness_eval_output.txt)
+
  ### Coherency Benchmark
 
  To evaluate the **coherency, factuality, and fluency** of our (and other) models, we use **Qwen3-32B** to grade **300 different generations** generated from an **unconditional prompt**.