A newer version of this model is available: canbingol/gemma3_1B_base-tr-cpt-2nd_epoch_stage1

Model Card: Gemma3-1B Turkish CPT (150K–200K Subset, 1 Epoch – Stage 4)

Overview

This model is the Stage 4 Turkish Continued Pretraining (CPT) variant of Gemma-3-1B.

Unlike Stage 1, which was initialized from google/gemma-3-1b-pt,
this model was initialized from:

canbingol/gemma3_1B_base-tr-cpt-1epoch_stage3

Stage 4 continues domain adaptation by exposing the model to new data rather than repeating the same subset.

The model was trained for 1 epoch on samples 150,000 to 200,000 of the Turkish web corpus.

This model is a direct continuation of Stage 3, so cumulatively it has been trained on samples 0–200,000 of the corpus (Stage 1: 0–50K, Stage 2: 50K–100K, Stage 3: 100K–150K, Stage 4: 150K–200K).

This stage completes the first full epoch over the 200K-sample dataset, which was covered as four sequential shards rather than in a single run.

Training Lineage

Stage 0: google/gemma-3-1b-pt
Stage 1: Samples 0–50,000 (1 epoch)
Stage 2: Samples 50,000–100,000 (1 epoch)
Stage 3: Samples 100,000–150,000 (1 epoch)
Stage 4 (this release): Samples 150,000–200,000 (1 epoch, end of epoch-1)

Cumulative data exposure: 0–200,000 samples

This represents sequential CPT across disjoint data shards.
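The shard layout above can be sketched as a small helper. This is only an illustration: `stage_shard` is a hypothetical function, not part of the author's training code, and treating each range as half-open (`start` included, `end` excluded) is an assumption, since the card does not say how boundary samples were handled.

```python
# Hypothetical helper: the shard size and stage numbering mirror the lineage
# listed in this card, but the function itself is illustrative only.
def stage_shard(stage: int, shard_size: int = 50_000, total: int = 200_000) -> tuple[int, int]:
    """Return the assumed half-open sample range [start, end) for a 1-based stage."""
    num_stages = total // shard_size
    if not 1 <= stage <= num_stages:
        raise ValueError(f"stage must be in 1..{num_stages}, got {stage}")
    start = (stage - 1) * shard_size
    return start, start + shard_size


shards = [stage_shard(s) for s in range(1, 5)]

# Disjoint and contiguous: each shard starts exactly where the previous one ends.
assert all(prev[1] == cur[0] for prev, cur in zip(shards, shards[1:]))

for stage, (start, end) in enumerate(shards, start=1):
    print(f"Stage {stage}: samples {start:,}-{end:,}")
```

Because the ranges never overlap, no sample is seen twice within the first epoch, which is what distinguishes this setup from repeating a single subset.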

Training Setup

Dataset: canbingol/vngrs-web-corpus-200k
Subset Used: Samples 150,000–200,000
Initialization: Stage 3 checkpoint
Training Objective: Continued Pretraining
Epochs: 1
Data Regime: Plain text
Token Count: ~21.6M tokens
Cumulative Token Exposure (Stages 1–4): ~86.1M tokens

Notes on cumulative exposure:

Although Stage 4 trains only on the 150K–200K shard, it inherits all adaptations learned from previous stages.
After this stage, the model has effectively completed exposure to the entire 0–200K dataset range through sequential continuation.
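As a rough sanity check on the reported token counts, the arithmetic below derives a few secondary figures. It uses only the two approximate numbers given in this card and the 50,000-sample shard size. The per-stage average for Stages 1–3 is an inference from those totals, not a reported value.

```python
# Figures reported in this card (both approximate).
stage4_tokens = 21.6e6
cumulative_tokens = 86.1e6
stage4_samples = 50_000  # size of the 150K-200K shard

# Derived quantities (inferences, not reported values).
earlier_tokens = cumulative_tokens - stage4_tokens   # Stages 1-3 combined
avg_earlier_stage = earlier_tokens / 3                # mean per earlier stage
stage4_share = stage4_tokens / cumulative_tokens      # fraction of total exposure
tokens_per_sample = stage4_tokens / stage4_samples    # mean sample length in Stage 4

print(f"Stages 1-3 combined: ~{earlier_tokens / 1e6:.1f}M tokens")
print(f"Average per earlier stage: ~{avg_earlier_stage / 1e6:.1f}M tokens")
print(f"Stage 4 share of cumulative exposure: {stage4_share:.1%}")
print(f"Average tokens per Stage 4 sample: ~{tokens_per_sample:.0f}")
```

With the reported figures, Stage 4 accounts for about a quarter of cumulative exposure and averages roughly 432 tokens per sample, which is consistent with four shards of similar size.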

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

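model_id = "canbingol/gemma3_1B_base-tr-cpt-1epoch_stage4"

# Use a GPU when one is available, otherwise fall back to CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the Stage 4 checkpoint and its tokenizer from the Hugging Face Hub.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = model.to(device)
model.eval()  # inference only: disable dropout

# This is a base (non-instruction-tuned) model, so the prompt is plain text to continue.
prompt = "bundan böyle"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Sample a continuation with nucleus sampling; output varies between runs.
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
)

# Decode the full sequence (prompt plus continuation) back to text.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)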