Are you sure you limited this to only 1899?
Random example of generated content: "mechanisation remains a calling for men and women alike". It's not the "men and women alike" part that's the problem. Mechanisation was an exceedingly rare word in the Victorian era, only taking off in the 1920s. I've searched through the corpus to 1899. It only occurs twice, both of them duplicates, from the 1840s. Now, it's possible that it could have coined the word from "mechanise" - mechanise was a word one would find sometimes in the corpus. But it had a different meaning back then - it meant "to make a person like a machine" (e.g. enslave, zombify, etc.), not "to replace human labour with that of machines". So even if it were to have coined the word from "mechanise", it seems very unlikely that it would further decide that it was a "calling". "Mechanist" was a calling (certainly not for women at the time), but it seems very unlikely it would just coin "mechanisation" from "mechanist" and decide it's a calling. I know it's a quite primitive model and is very prone to confabulation, but in this example (and others) its behavior seems... suspicious.
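To illustrate the kind of search I mean - this is just a toy sketch, with a made-up in-memory stand-in for the corpus (the real corpus format would differ), counting a term's occurrences per decade:

```python
import re
from collections import Counter

# Toy stand-in for the corpus: (year, text) records. Entirely invented
# for illustration; the real corpus would be loaded from disk.
corpus = [
    (1843, "the engine began to mechanise the poor wretch's very soul"),
    (1848, "mechanisation of the spirit, a dreadful thing"),
    (1895, "the mechanist plied his calling with great skill"),
]

def decade_counts(term, records):
    """Count occurrences of words starting with `term`, grouped by decade."""
    counts = Counter()
    pattern = re.compile(rf"\b{re.escape(term)}\w*", re.IGNORECASE)
    for year, text in records:
        hits = len(pattern.findall(text))
        if hits:
            counts[(year // 10) * 10] += hits
    return dict(counts)

print(decade_counts("mechanis", corpus))
```

A prefix search like this catches "mechanise", "mechanisation", and "mechanist" together, which is what you'd want when checking whether a word family even existed in the era.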
Short answer: the model is pretrained on British library books from 1837 to 1899, but instruct tuned on modern "synthetic" data to teach it to do turn-based conversation and recognize modern input. That might be where/how it's generating some of what you're noticing. I'm hoping future iterations will clean that up. For the long answer, check out the narrative documentation here: https://www.estragon.news/mr-chatterbox-or-the-modern-prometheus/
That would explain it, and unfortunately I don't think that's what most people who have been sharing this model thought it was. Thanks for clearing that up though.
May I suggest, for getting QA pairs for finetuning: perhaps have an existing (lightweight) model scan the dataset for dialogue between two individuals and extract the dialogue into QA pairs (with varying numbers of QA entries in the context)? That should be an easy task even for a model running on a cheap gaming GPU.
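To sketch the idea: here's a minimal, purely hypothetical first pass that pairs consecutive quoted utterances when the first ends in a question mark. A real pipeline would hand the candidates to a small model to filter out non-conversational matches; this just shows the shape of the extraction step:

```python
import re

# Match double-quoted spans of plausible dialogue length.
# Assumes dialogue is set off with double quotes, as in most Victorian prose.
QUOTE_RE = re.compile(r'"([^"]{10,300})"')

def extract_qa_pairs(text):
    """Pair each quoted question with the next quoted utterance."""
    utterances = QUOTE_RE.findall(text)
    pairs = []
    for q, a in zip(utterances, utterances[1:]):
        # Keep only pairs where the first turn reads like a question.
        if q.rstrip().endswith("?"):
            pairs.append({"question": q.strip(), "answer": a.strip()})
    return pairs

sample = (
    'He turned to her. "Have you seen the new engine?" she asked. '
    '"I have not," he replied, "though I hear it is a marvel." '
    '"Then we must go at once."'
)
print(extract_qa_pairs(sample))
```

Crude, but even this level of filtering would give a lightweight model much less to wade through.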
It's a very interesting project though (even though such a "pure" Victorian model will always be fundamentally weak), and it does really seem like there's two directions to go, both of which have their own merits. At least with Victorian texts you have the option to go for a "fully pure" model, like you're working on. But once you get to, say, ancient Latin or Greek texts, let alone other less common ancient languages, there's not enough data to train a foundation (and even a foundation from Victorian texts is pretty weak). I've been thinking a lot about how to best do a non-fully-pure model. I'm torn between (A) synthetic data created from a fully pure model (e.g. model asked to expound upon / derive implications of chunks of preexisting texts, or given bullet-pointed topics / facts and asked to write about them in its own voice), vs. (B) abliterating modern knowledge and vocabulary from an existing high-quality model and then fine-tuning on historic data (or the other way around, or perhaps historic data -> abliteration -> historic data).
Anyway, neat stuff, and thanks :)
(I also find this project interesting not just from the perspective of virtually-resurrecting past cultures, but also because it's an interesting research challenge for how efficiently one can learn from a limited dataset - a problem with great implications toward model training in general)
BTW, Mademe Hardy on Bluesky discovered this, which is basically an actual Victorian-era QA dataset :) https://www.loc.gov/resource/gdcmassbookdig.newcommonschoolq00crai/?sp=19
It's from the US, not UK - I don't know if you're focused on purely UK texts - but it's basically exactly what one would want to use (plus other sources for longer form / more diverse dialogue styles)
Wow! That is an incredible find. It's a great idea as a sort of pre-made QA dataset. I began to search, and I'm amazed this hadn't occurred to me before: there is an entire genre of QA literature made as pedagogical tools and study aids from that era. I'm definitely going to use this material moving forward.
To your earlier point: I did a similar process to the one you described to extract QA pairs from the corpus, albeit with a python script built by Claude Code and not a language model. The output was, admittedly, pretty poor. Maybe a language model will do a better job of extracting them, but it is a large corpus and that job will likely take a while. To your other point, as I mentioned in the piece, for supervised fine-tuning I trained on synthetic data. Abliterating modern knowledge from a model is interesting. I was planning on just fine-tuning something like Llama-3.1-4b on the corpus to generate synthetic data next, but I hadn't considered knocking out all modern knowledge in order to make it even more "pure."
I hadn't realized that I had stumbled across such an interesting research challenge when I began building! Hopefully it can contribute to that knowledge more formally moving forward.