# TalTechNLP/Llama-3.1-70B-Instruct-summ-et

## Model Details
- Model name: TalTechNLP/Llama-3.1-70B-Instruct-summ-et
- Base model: Meta Llama-3.1-70B-Instruct
- Model type: Causal Language Model (instruction-tuned)
- Adaptation method: LoRA fine-tuning
- Primary language: Estonian (et)
- License: Llama 3.1 Community License (inherited from the base model; see Meta's terms)
- Availability: Hugging Face Hub
## Model Description
TalTechNLP/Llama-3.1-70B-Instruct-summ-et is a LoRA-adapted version of Meta’s Llama-3.1-70B-Instruct model, specifically optimized for Estonian abstractive summarization.
The model was fine-tuned on a diverse Estonian summarization corpus, significantly improving its ability to generate high-quality summaries using natural prompts without requiring strict formatting.
## Training Data
The model was fine-tuned on a combined Estonian summarization dataset, including:
- ERR radio news corpus (ERR raadiouudiste korpus)
- ERR web news corpus (ERR veebiuudiste korpus)
- DialogSum (automatically translated into Estonian)
- SAMSum (automatically translated into Estonian)
- GPT-4-generated datasets:
  - a short-summaries corpus
  - a long-summaries corpus
This dataset mix includes:
- News summarization
- Dialogue summarization
- Both short and long summaries
- Diverse writing styles and structures
## Training Procedure
- Base model: Llama-3.1-70B-Instruct
- Fine-tuning method: LoRA (Low-Rank Adaptation)
- Objective: Improve Estonian summarization performance
- Prompt style: Natural language instructions
## Evaluation

The model shows improved performance on Estonian summarization benchmarks. For example, on the ERR Raadiouudised (radio news) corpus:

| Model | ROUGE-1 |
|---|---|
| Base Llama-3.1-70B-Instruct | 15.5 |
| Fine-tuned (this model) | 20.0 |
This reflects improvements in:
- Content coverage
- Fluency in Estonian
- Summary relevance
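For reference, here is a minimal sketch of how ROUGE-1 can be computed with the Hugging Face `evaluate` package. The exact evaluation setup behind the scores above is not documented, so treat this only as an illustration:

```python
# Sketch of ROUGE-1 scoring; pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["Genereeritud kokkuvõte..."],  # model output ("Generated summary...")
    references=["Referentskokkuvõte..."],       # gold summary ("Reference summary...")
    use_stemmer=False,  # the default stemmer is English-specific; keep it off for Estonian
)
print(scores["rouge1"] * 100)  # compute() returns fractions; the card reports them x100
```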
## Usage

### Example Prompt (Estonian)

```text
Palun tee järgmisest tekstist lühike kokkuvõte:

[TEKST]
```

(English: "Please write a short summary of the following text:", with the input text in place of `[TEKST]`.)
### Python Example

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "TalTechNLP/Llama-3.1-70B-Instruct-summ-et"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # shard the 70B model across available GPUs
)

text = "Sisendtekst siia..."  # "Input text here..."
prompt = f"Palun tee järgmisest tekstist kokkuvõte:\n\n{text}"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
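Because the base model is instruction-tuned, wrapping the prompt in the Llama 3.1 chat template may give better results. The card does not state which prompt format was used during fine-tuning, so this is a hedged variant of the example above:

```python
# Variant using the chat template (assumes the snippet above has already run).
messages = [{"role": "user", "content": prompt}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(input_ids, max_new_tokens=200)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```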
## Intended Use

### Primary Use Cases
- Estonian news summarization
- Dialogue summarization (e.g. chat, transcripts)
- General-purpose Estonian summarization
- Research on low-resource language adaptation
## Limitations
- May hallucinate or omit important details
- Performance depends on similarity to training domains
- Automatically translated datasets may introduce artifacts
- Not optimized for highly specialized domains
## Recipe: Adapting the Model for Domain-Specific Summarization
This recipe assumes your data is in JSONL format.

### 1. Prepare the JSONL data

Each line should contain a source text and its reference summary (the Estonian placeholders mean "Long input text..." and "Short summary..."):

```json
{"text": "Pikk sisendtekst...", "summary": "Lühike kokkuvõte..."}
```

### 2. Validate data quality

Ensure that:

- each line is valid JSON
- both `text` and `summary` fields exist
- there are no empty or corrupted examples
- the summary style is consistent

A minimal validation sketch is shown below.
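This sketch checks the default file paths used by the fine-tuning script (`data/train.jsonl`, `data/validation.jsonl`); adapt as needed:

```python
import json

def validate_jsonl(path: str) -> int:
    """Count problems: unparseable lines and missing/empty text or summary fields."""
    problems = 0
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                print(f"{path}:{lineno}: invalid JSON")
                problems += 1
                continue
            for field in ("text", "summary"):
                if not str(row.get(field, "")).strip():
                    print(f"{path}:{lineno}: missing or empty '{field}'")
                    problems += 1
    return problems

for split in ("data/train.jsonl", "data/validation.jsonl"):
    print(split, "->", validate_jsonl(split), "problem(s)")
```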
### 3. Fine-tune the model on your data

The script below runs a LoRA fine-tune with `trl`'s `SFTTrainer`:
```python
#!/usr/bin/env python3
"""
Fine-tune TalTechNLP/Llama-3.1-70B-Instruct-summ-et on domain-specific JSONL
summarization data.

Expected JSONL format:
    {"text": "source document...", "summary": "reference summary..."}

Files:
    data/train.jsonl
    data/validation.jsonl

This script uses:
    - transformers
    - datasets
    - peft
    - trl (written against trl versions where SFTTrainer still accepts
      tokenizer/max_seq_length/dataset_text_field/packing directly, i.e. < 0.12;
      newer releases move these into SFTConfig)
    - bitsandbytes (optional, for 4-bit loading)
"""
import argparse

import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer


def build_prompt(text: str) -> str:
    # Keep the prompt simple and natural for Estonian summarization.
    return (
        "Palun tee järgmisest tekstist kokkuvõte:\n\n"
        f"{text}\n\n"
        "Kokkuvõte:"
    )


def format_example(example):
    # Convert one JSONL row into a single supervised training string.
    prompt = build_prompt(example["text"].strip())
    summary = example["summary"].strip()
    return {"text": f"{prompt} {summary}"}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="TalTechNLP/Llama-3.1-70B-Instruct-summ-et")
    parser.add_argument("--train_file", type=str, default="data/train.jsonl")
    parser.add_argument("--validation_file", type=str, default="data/validation.jsonl")
    parser.add_argument("--output_dir", type=str, default="estonian-summ-lora")
    parser.add_argument("--max_seq_length", type=int, default=2048)
    parser.add_argument("--num_train_epochs", type=float, default=2.0)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=1)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--warmup_ratio", type=float, default=0.03)
    parser.add_argument("--logging_steps", type=int, default=10)
    parser.add_argument("--eval_steps", type=int, default=200)
    parser.add_argument("--save_steps", type=int, default=200)
    parser.add_argument("--max_steps", type=int, default=-1)
    parser.add_argument("--use_4bit", action="store_true")
    args = parser.parse_args()

    # Load tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True)

    # Llama models often do not have a pad token by default.
    # Reuse EOS as PAD for training stability.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Optional 4-bit quantized loading to reduce memory usage.
    quantization_config = None
    if args.use_4bit:
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

    # Load base model.
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
        device_map="auto",
        quantization_config=quantization_config,
    )

    # Enable training-friendly settings for k-bit models.
    if args.use_4bit:
        model = prepare_model_for_kbit_training(model)

    # LoRA configuration.
    # These target modules are standard for Llama-style architectures.
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    )

    # Load datasets from JSONL files.
    dataset = load_dataset(
        "json",
        data_files={
            "train": args.train_file,
            "validation": args.validation_file,
        },
    )

    # Convert each example to a single text field used by SFTTrainer.
    dataset = dataset.map(format_example, remove_columns=dataset["train"].column_names)

    # Training arguments.
    training_args = TrainingArguments(
        output_dir=args.output_dir,
        num_train_epochs=args.num_train_epochs,
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        warmup_ratio=args.warmup_ratio,
        logging_steps=args.logging_steps,
        eval_strategy="steps",  # named evaluation_strategy in transformers < 4.41
        eval_steps=args.eval_steps,
        save_strategy="steps",
        save_steps=args.save_steps,
        save_total_limit=2,
        bf16=torch.cuda.is_available(),  # Use bf16 on supported GPUs.
        fp16=False,
        report_to="none",
        optim="paged_adamw_8bit" if args.use_4bit else "adamw_torch",
        lr_scheduler_type="cosine",
        max_steps=args.max_steps,
        remove_unused_columns=False,
    )

    # SFTTrainer handles packing, tokenization, and causal LM loss.
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        peft_config=peft_config,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        dataset_text_field="text",
        packing=False,
    )

    # Train.
    trainer.train()

    # Save adapter and tokenizer.
    trainer.model.save_pretrained(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)

    # Optional: save the final trainer state too.
    trainer.save_state()

    print(f"Training complete. Adapter saved to: {args.output_dir}")


if __name__ == "__main__":
    main()
```
Run it like this:

```bash
python finetune_summarizer.py \
    --train_file data/train.jsonl \
    --validation_file data/validation.jsonl \
    --output_dir estonian-domain-summ-lora \
    --use_4bit
```
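After training, the saved adapter can be loaded on top of the base checkpoint with `peft`. A sketch, where `estonian-domain-summ-lora` is the output directory from the run above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "TalTechNLP/Llama-3.1-70B-Instruct-summ-et",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "estonian-domain-summ-lora")
tokenizer = AutoTokenizer.from_pretrained("estonian-domain-summ-lora")

# Optional: merge the adapter into the base weights for faster inference.
# model = model.merge_and_unload()
```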