---
library_name: transformers
tags:
- gpt2
- causal-lm
- bilingual
- sentencepiece
- french
- english
pipeline_tag: text-generation
datasets:
- climb-mao/babylm-fra
- elliepreed/l2-corpus-10m
license: other # change to "apache-2.0" or "mit" if that's correct
model-index:
- name: BGPT (French+English) – 128k steps
  results: []
---

# BGPT – French + English (GPT-2 style)

A small bilingual GPT-2-style language model trained on French and English with SentencePiece tokenizers.

This model is trained on both French 🇫🇷 and English 🇬🇧, but it does not come with a single `AutoTokenizer`.
Instead, we provide two SentencePiece tokenizers:

- `tokenizers/french.model`
- `tokenizers/english.model`

You can load either depending on the language you want to work with.
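For illustration, here is a tiny dispatch helper (hypothetical, not part of the repo) that maps an ISO language code to the matching tokenizer file in this repository:

```python
# Hypothetical helper: choose the SentencePiece model file by language code.
# The file paths are the ones shipped in this repo; the helper itself is
# just a sketch of one way to dispatch between the two tokenizers.
TOKENIZER_FILES = {
    "fr": "tokenizers/french.model",
    "en": "tokenizers/english.model",
}

def tokenizer_file_for(lang: str) -> str:
    try:
        return TOKENIZER_FILES[lang]
    except KeyError:
        raise ValueError(
            f"unsupported language: {lang!r}; expected one of {sorted(TOKENIZER_FILES)}"
        )

print(tokenizer_file_for("fr"))  # tokenizers/french.model
```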

## Load the model

```python
import torch
from transformers import AutoModelForCausalLM

model_id = "elliepreed/bgpt-french-english"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLM.from_pretrained(model_id).to(device).eval()
```

## Load both tokenizers

```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

fr_path = hf_hub_download(model_id, "tokenizers/french.model")
en_path = hf_hub_download(model_id, "tokenizers/english.model")

sp_fr = spm.SentencePieceProcessor(model_file=fr_path)
sp_en = spm.SentencePieceProcessor(model_file=en_path)
```

## Example: French generation

```python
# Encode the French prompt and append EOS.
prompt = "Paris est"
ids = sp_fr.encode(prompt, out_type=int) + [sp_fr.eos_id()]
input_ids = torch.tensor([ids], device=device)

out = model.generate(
    input_ids,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.95,
    temperature=0.9,
    eos_token_id=sp_fr.eos_id(),
    pad_token_id=sp_fr.pad_id(),
)

# Decode only the newly generated tokens (skip the prompt).
print("FR:", sp_fr.decode(out[0].tolist()[len(ids):]))
```
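The `top_p=0.95` argument above enables nucleus sampling: at each step, sampling is restricted to the smallest set of tokens whose cumulative probability reaches 0.95, and the rest are discarded. A minimal pure-Python sketch of that filtering step, using a toy distribution rather than the model's real logits:

```python
# Nucleus (top-p) filtering sketch on a toy next-token distribution.
def top_p_filter(probs, top_p=0.95):
    # Sort token indices by probability, descending, and keep the smallest
    # prefix whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    # Renormalize over the kept tokens; everything else gets probability 0.
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / total
    return out

probs = [0.5, 0.3, 0.15, 0.04, 0.01]
# With top_p=0.9, only the first three tokens survive (0.5 + 0.3 + 0.15 >= 0.9).
print(top_p_filter(probs, top_p=0.9))
```

`temperature=0.9` is applied to the logits before this filtering, slightly sharpening the distribution; values below 1.0 make sampling more conservative.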

**Model size:** 50.9M parameters · F32 · Safetensors
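As a rough sanity check on the numbers above: 50.9M float32 parameters at 4 bytes each give the approximate checkpoint size (a back-of-the-envelope sketch that ignores file metadata overhead):

```python
# Back-of-the-envelope checkpoint size: params * bytes-per-float32-value.
params = 50.9e6
size_bytes = params * 4  # float32 = 4 bytes per parameter
print(f"~{size_bytes / 1e6:.0f} MB")  # ~204 MB
```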