Harley-ml/Tenete-8M


Harley-ml committed 21 days ago

Commit

5a43b2d


1 Parent(s): aff2749

Update README.md

Files changed (1)

README.md +10 -2


@@ -541,10 +541,17 @@ BPB is simply the amount of yes-no questions the model needs to predict the next
 ---
-We decided to evaluate the model on each source it trained on to see the difference in perplexity.
-[omitted temporarily]
 ---
 ## Benchmarks
@@ -572,6 +579,7 @@ We decided to evaluate the model on each source it trained on to see the differe
 The model scores at or near random chance on most tasks, which is expected: an 8M-parameter model cannot store world-level knowledge or reason thoroughly.
 Note: The full breakdown (LM Harness Output) is right [here](https://huggingface.co/Harley-ml/Tenete-8M/blob/main/raw_lmharness_eval_output.txt)
 ### Coherency Benchmark
 To evaluate the **coherency, factuality, and fluency** of our (and other) models, we use **Qwen3-32B** to grade **300 different generations** sampled from an **unconditional prompt**.

 ---
+We evaluated the model on each source separately to see how perplexity differs between them.
+| Source | Loss | Perplexity |
+|--------|------|------------|
+| Textbooks | **2.02** | **7.57** |
+| Q&A | 3.20 | 24.65 |
+| Books | 3.73 | 41.88 |
+| Medium articles | 3.79 | 44.40 |
+Perplexity on the textbooks is roughly six times lower than on the Medium articles (7.57 vs. 44.40). This is expected: Tiny-Textbooks follows a templated structure (e.g., "Section 1", "Conclusion") with an LLM generating the rest, which results in lower entropy than standard English. Medium articles are structurally, tonally, and stylistically more diverse and therefore less predictable. The same could be said for the books.
 ---
 ## Benchmarks
 The model scores at or near random chance on most tasks, which is expected: an 8M-parameter model cannot store world-level knowledge or reason thoroughly.
 Note: The full breakdown (LM Harness Output) is right [here](https://huggingface.co/Harley-ml/Tenete-8M/blob/main/raw_lmharness_eval_output.txt)
 ### Coherency Benchmark
 To evaluate the **coherency, factuality, and fluency** of our (and other) models, we use **Qwen3-32B** to grade **300 different generations** sampled from an **unconditional prompt**.
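
The Loss and Perplexity columns in the per-source table added by this commit can be cross-checked against each other. The sketch below assumes the reported loss is the mean per-token cross-entropy in nats, in which case perplexity is `exp(loss)`; the README does not state this, so treat it as an assumption. The helper name `perplexity_from_loss` is illustrative and does not come from the repository. The numbers are copied from the table.

```python
import math

# (loss, perplexity) per source, copied from the table in this commit.
reported = {
    "Textbooks": (2.02, 7.57),
    "Q&A": (3.20, 24.65),
    "Books": (3.73, 41.88),
    "Medium articles": (3.79, 44.40),
}


def perplexity_from_loss(loss: float) -> float:
    """Convert a loss to perplexity.

    Assumption: `loss` is the mean cross-entropy per token in nats.
    """
    return math.exp(loss)


# Compare each reported perplexity with the value recomputed from the loss.
# Small gaps are expected because the table rounds loss to two decimals.
for source, (loss, ppl) in reported.items():
    recomputed = perplexity_from_loss(loss)
    print(f"{source:16s} loss={loss:.2f} reported={ppl:.2f} exp(loss)={recomputed:.2f}")

# Ratio behind the "roughly six times lower" statement.
ratio = reported["Medium articles"][1] / reported["Textbooks"][1]
print(f"Medium articles / Textbooks perplexity ratio: {ratio:.2f}")
```

Under that assumption the recomputed values land within about half a percent of the reported ones, which is consistent with the loss column having been rounded to two decimals.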