A newer version of this model is available: canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1

Model Card: Gemma3-1B Turkish CPT (50K–100K Subset, 1 Epoch – Stage 2)

Overview

This model is the Stage 2 Turkish Continued Pretraining (CPT) variant of Gemma-3-1B.

Unlike Stage 1, which was initialized from google/gemma-3-1b-pt,
this model was initialized from:

canbingol/gemma3_1B_base-tr-cpt-1epoch_stage1

Stage 2 continues domain adaptation by exposing the model to new data rather than repeating the same subset.

The model was trained for 1 epoch on samples 50,000 to 100,000 of the Turkish web corpus.

Because this model is a direct continuation of Stage 1, it has cumulatively been trained on samples 0–100,000 of the corpus (Stage 1: 0–50K, Stage 2: 50K–100K).

Training Lineage

Stage 0: google/gemma-3-1b-pt
Stage 1: Samples 0–50,000 (1 epoch)
Stage 2 (this release): Samples 50,000–100,000 (1 epoch)

Cumulative data exposure: 0–100,000 samples

This represents sequential CPT across disjoint data shards.
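
The lineage above can be checked with a small sketch. It assumes the sample ranges are half-open index intervals (start inclusive, end exclusive), which the card does not state explicitly. The `STAGES` table and both helper functions are hypothetical names used only for illustration, not part of the training code.

```python
# Hypothetical illustration: stage ranges as half-open index intervals [start, end).
# The half-open convention is an assumption; the card only lists the two ranges.
STAGES = {
    "stage1": (0, 50_000),
    "stage2": (50_000, 100_000),
}

def shards_are_disjoint(ranges):
    """Return True if no two [start, end) intervals overlap."""
    ordered = sorted(ranges)
    return all(
        prev_end <= next_start
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    )

def cumulative_samples(ranges):
    """Total number of samples covered by the intervals."""
    return sum(end - start for start, end in ranges)

disjoint = shards_are_disjoint(STAGES.values())
total = cumulative_samples(STAGES.values())
print(disjoint, total)
```

Under this reading the two stages share no samples and together cover 100,000 samples, matching the cumulative exposure stated above.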

Training Setup

Dataset: canbingol/vngrs-web-corpus-200k
Subset Used: Samples 50,000–100,000
Initialization: Stage 1 checkpoint
Training Objective: Continued Pretraining
Epochs: 1
Data Regime: Plain text
Token Count (Stage 2 only): ~21–22M tokens
Cumulative Token Exposure (Stage 1 + Stage 2): ~43M tokens
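
As a rough consistency check on the figures above, the sketch below derives the implied average tokens per sample and the cumulative range. It assumes Stage 1 saw about the same number of tokens as Stage 2, which the ~43M cumulative figure suggests but the card does not state. All variable names are illustrative.

```python
# Back-of-the-envelope check of the token counts quoted in the card.
SAMPLES_PER_STAGE = 50_000
STAGE2_TOKENS_LOW = 21_000_000   # lower end of the "~21-22M" estimate
STAGE2_TOKENS_HIGH = 22_000_000  # upper end of the estimate

# Implied average sample length in tokens.
avg_low = STAGE2_TOKENS_LOW / SAMPLES_PER_STAGE
avg_high = STAGE2_TOKENS_HIGH / SAMPLES_PER_STAGE

# Assumption: Stage 1 is roughly the same size as Stage 2.
cumulative_low = 2 * STAGE2_TOKENS_LOW
cumulative_high = 2 * STAGE2_TOKENS_HIGH

print(f"avg tokens/sample: {avg_low:.0f}-{avg_high:.0f}")
print(f"cumulative tokens: {cumulative_low / 1e6:.0f}M-{cumulative_high / 1e6:.0f}M")
```

The implied average is roughly 420 to 440 tokens per sample, and the doubled range of 42M to 44M brackets the ~43M cumulative figure.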

Usage Example

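import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "canbingol/gemma3_1B_base-tr-cpt-1epoch_stage2"

# Use a GPU when one is available; otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the Stage 2 checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = model.to(device)

# This is a base model, so the prompt is text to be continued, not an instruction.
prompt = "bundan böyle"  # Turkish for "from now on"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample up to 50 new tokens with temperature and nucleus (top-p) sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

# Decode the full sequence (prompt plus continuation) back to text.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)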