Update README.md
README.md
CHANGED
|
@@ -541,10 +541,17 @@ BPB is simply the amount of yes-no questions the model needs to predict the next
|
|
| 541 |
|
| 542 |
---
|
| 543 |
|
| 544 |
-
We decided to evaluate the model on each source
|
| 545 |
|
| 546 |
-
[omitted temporarily]
|
| 547 |
|
| 548 |
---
|
| 549 |
|
| 550 |
## Benchmarks
|
|
@@ -572,6 +579,7 @@ We decided to evaluate the model on each source it trained on to see the differe
|
|
| 572 |
The model achieves random or near-random on most tasks, which is expected. An 8M parameter model cannot store world-level knowledge or thoroughly reason.
|
| 573 |
|
| 574 |
Note: The full breakdown (LM Harness Output) is right [here](https://huggingface.co/Harley-ml/Tenete-8M/blob/main/raw_lmharness_eval_output.txt)
|
|
|
|
| 575 |
### Coherency Benchmark
|
| 576 |
|
| 577 |
To evaluate the **coherency, factuality, and fluency** of our (and other) models, we use **Qwen3-32B** to grade **300 different generations** generated from an **unconditional prompt**.
|
|
|
|
| 541 |
|
| 542 |
---
|
| 543 |
|
| 544 |
+
We decided to evaluate the model on each source to see the difference in perplexity.
|
| 545 |
|
|
|
|
| 546 |
|
| 547 |
+| Source | Loss | Perplexity |
+|--------|------|------------|
+| Textbooks | **2.02** | **7.57** |
+| Q&A | 3.20 | 24.65 |
+| Books | 3.73 | 41.88 |
+| Medium articles | 3.79 | 44.40 |
|
| 553 |
+
|
| 554 |
+
The textbooks' perplexity is roughly six times lower than that of the Medium articles (7.57 vs. 44.40). This is expected: Tiny-Textbooks uses a templated structure (e.g., Section 1, conclusion) with an LLM generating the rest, resulting in lower entropy than standard English. Medium articles are structurally, tonally, and stylistically more diverse and unpredictable. The same could be said of the books.
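As a rough sanity check on the table above, the sketch below recomputes perplexity from the reported losses and derives the Medium-to-Textbooks ratio. It assumes each loss is a mean per-token cross-entropy in nats, so that perplexity is `exp(loss)`; the README excerpt here does not state this explicitly. Because the losses are rounded to two decimals, the recomputed perplexities differ slightly from the table values.

```python
import math

# Per-source losses copied from the table above (rounded to two decimals).
# Assumption: each loss is a mean per-token cross-entropy measured in nats.
losses = {
    "Textbooks": 2.02,
    "Q&A": 3.20,
    "Books": 3.73,
    "Medium articles": 3.79,
}


def perplexity(loss_nats: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss_nats)


perplexities = {source: perplexity(loss) for source, loss in losses.items()}

for source, ppl in perplexities.items():
    print(f"{source:<16} loss={losses[source]:.2f}  perplexity~{ppl:.2f}")

# How many times higher the Medium perplexity is than the Textbooks one.
ratio = perplexities["Medium articles"] / perplexities["Textbooks"]
print(f"Medium articles / Textbooks perplexity ratio: {ratio:.2f}")
```

Under that assumption the ratio depends only on the loss gap (3.79 - 2.02 = 1.77 nats), which is why a modest difference in loss turns into a several-fold difference in perplexity.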
|
| 555 |
---
|
| 556 |
|
| 557 |
## Benchmarks
|
|
|
|
| 579 |
The model achieves random or near-random on most tasks, which is expected. An 8M parameter model cannot store world-level knowledge or thoroughly reason.
|
| 580 |
|
| 581 |
Note: The full breakdown (LM Harness Output) is right [here](https://huggingface.co/Harley-ml/Tenete-8M/blob/main/raw_lmharness_eval_output.txt)
|
| 582 |
+
|
| 583 |
### Coherency Benchmark
|
| 584 |
|
| 585 |
To evaluate the **coherency, factuality, and fluency** of our (and other) models, we use **Qwen3-32B** to grade **300 different generations** generated from an **unconditional prompt**.
|