Model Card: Gemma3-1B Turkish CPT (2nd Epoch – Stage 2, 50K–100K Subset, 1 Epoch)

Overview

This model is the 2nd-epoch Stage 2 Turkish Continued Pretraining (CPT) variant of Gemma-3-1B.

Unlike the epoch-1 stages, which started from google/gemma-3-1b-pt,
this model is initialized from the preceding epoch-2 Stage 1 checkpoint:

  • canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1

Stage 2 of the second epoch continues domain adaptation through the next shard in the sequential training pipeline.

The model was trained for 1 epoch on samples 50,000 to 100,000 of the Turkish web corpus.

Conceptually:

  • Epoch-1 was completed via sequential shards (0–200K).
  • Epoch-2 Stage 1 revisited samples 0–50K.
  • This release continues epoch-2 with the 50K–100K shard.

Training Lineage

  • Stage 0: google/gemma-3-1b-pt
  • Epoch-1 Stage 1: Samples 0–50,000 (1 epoch)
  • Epoch-1 Stage 2: Samples 50,000–100,000 (1 epoch)
  • Epoch-1 Stage 3: Samples 100,000–150,000 (1 epoch)
  • Epoch-1 Stage 4: Samples 150,000–200,000 (1 epoch, end of epoch-1)
  • Epoch-2 Stage 1: Samples 0–50,000 (1 epoch)
  • Epoch-2 Stage 2 (this release): Samples 50,000–100,000 (1 epoch)

This represents sequential CPT across disjoint data shards, repeated for a second epoch.


Training Setup

  • Dataset: canbingol/vngrs-web-corpus-200k
  • Subset Used: Samples 50,000–100,000
  • Initialization: canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1
  • Training Objective: Continued Pretraining
  • Epochs: 1 (for this shard)
  • Data Regime: Plain text
  • Token Count (this stage): ~21.5M tokens
  • Cumulative Token Exposure:
    • After epoch-1: ~86.1M tokens
    • After epoch-2 Stage 1: ~107.7M tokens
    • After this stage (epoch-2 Stage 2): ~129.2M tokens
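The cumulative figures follow from adding one ~21.5M-token shard per stage. A rough bookkeeping sketch (per-shard token counts vary slightly in practice, so a flat ~21.5M only approximates the card's reported totals):

```python
# Reported token count for one 50K-sample shard (this card, Stage 2).
PER_SHARD_TOKENS = 21.5e6

after_epoch1 = 4 * PER_SHARD_TOKENS               # 4 shards -> ~86M (card: ~86.1M)
after_e2_stage1 = after_epoch1 + PER_SHARD_TOKENS # ~107.5M (card: ~107.7M)
after_e2_stage2 = after_e2_stage1 + PER_SHARD_TOKENS  # ~129M (card: ~129.2M)

print(f"{after_e2_stage2 / 1e6:.1f}M tokens")
```

The small gaps against the reported numbers come from the per-shard counts not being exactly equal.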

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage2"

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).to(device)

prompt = "bundan böyle"  # Turkish: "from now on"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9
)

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
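A quick way to sanity-check a CPT checkpoint is perplexity on held-out Turkish text. A minimal sketch (the sentence is an arbitrary example, not from the training corpus):

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "Türkiye'nin başkenti Ankara'dır."  # arbitrary held-out Turkish sentence
inputs = tokenizer(text, return_tensors="pt")

# Passing labels=input_ids makes the model return the mean
# next-token cross-entropy loss; perplexity = exp(loss).
with torch.no_grad():
    out = model(**inputs, labels=inputs["input_ids"])
ppl = math.exp(out.loss.item())
print(f"perplexity: {ppl:.1f}")
```

Lower perplexity on Turkish text relative to the base checkpoint is the expected effect of the continued pretraining described above.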
Model Details

  • Format: Safetensors
  • Model size: 1.0B params
  • Tensor type: BF16