TalTechNLP/Llama-3.1-70B-Instruct-summ-et

Model Details

  • Model name: TalTechNLP/Llama-3.1-70B-Instruct-summ-et
  • Base model: Meta Llama-3.1-70B-Instruct
  • Model type: Causal Language Model (instruction-tuned)
  • Adaptation method: LoRA fine-tuning
  • Primary language: Estonian (et)
  • License: Llama 3.1 Community License, inherited from the base model (see Meta terms)
  • Availability: Hugging Face Hub

Model Description

TalTechNLP/Llama-3.1-70B-Instruct-summ-et is a LoRA-adapted version of Meta’s Llama-3.1-70B-Instruct model, specifically optimized for Estonian abstractive summarization.

The model was fine-tuned on a diverse Estonian summarization corpus, significantly improving its ability to generate high-quality summaries using natural prompts without requiring strict formatting.


Training Data

The model was fine-tuned on a combined Estonian summarization dataset, including:

  • ERR radio news corpus (ERR raadiouudiste korpus)
  • ERR web news corpus (ERR veebiuudiste korpus)
  • DialogSum (automatically translated to Estonian)
  • SAMSum (automatically translated to Estonian)
  • GPT-4 generated datasets:
    • Short summaries corpus
    • Long summaries corpus

This dataset mix includes:

  • News summarization
  • Dialogue summarization
  • Both short and long summaries
  • Diverse writing styles and structures

Training Procedure

  • Base model: Llama-3.1-70B-Instruct
  • Fine-tuning method: LoRA (Low-Rank Adaptation)
  • Objective: Improve Estonian summarization performance
  • Prompt style: Natural language instructions

Evaluation

The model shows improved performance on Estonian summarization benchmarks.

Example: ERR radio news corpus (Raadiouudised)

  • ROUGE-1 score:
    • Base model: 15.5
    • Fine-tuned model: 20.0

This reflects improvements in:

  • Content coverage
  • Fluency in Estonian
  • Summary relevance
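
For a rough sense of how such numbers are obtained, ROUGE-1 can be computed with the Hugging Face evaluate library. The sketch below is illustrative only, not the authors' exact evaluation pipeline; the prediction and reference strings are placeholders, and stemming is left off because the built-in Porter stemmer targets English rather than Estonian.

import evaluate

rouge = evaluate.load("rouge")

predictions = ["Mudeli genereeritud kokkuvõte..."]   # model output (placeholder)
references = ["Inimese kirjutatud kokkuvõte..."]     # reference summary (placeholder)

# use_stemmer=False: the built-in stemmer is English-only, so skipping
# stemming is the safer choice for Estonian text.
scores = rouge.compute(predictions=predictions, references=references, use_stemmer=False)
print(scores["rouge1"])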

Usage

Example Prompt (Estonian)

Palun tee järgmisest tekstist lühike kokkuvõte:

[TEKST]

(In English: "Please write a short summary of the following text:", followed by the input text.)

Python Example

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "TalTechNLP/Llama-3.1-70B-Instruct-summ-et"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: at 70B parameters, the fp16 weights alone are roughly 140 GB, so this
# typically requires multiple GPUs; device_map="auto" shards the model across
# the available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

text = "Sisendtekst siia..."

prompt = f"Palun tee järgmisest tekstist kokkuvõte:\n\n{text}"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
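
Because the base model is instruction-tuned, the request can also be wrapped in the Llama 3.1 chat template rather than passed as raw text. The sketch below reuses the text variable from the example above; the model card's own examples use plain prompts, so treat the chat-template variant as an alternative rather than the canonical format.

messages = [
    {"role": "user", "content": f"Palun tee järgmisest tekstist kokkuvõte:\n\n{text}"},
]

# apply_chat_template inserts the Llama 3.1 special tokens around the turn
# and appends the assistant header so generation starts with the summary.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))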

Intended Use

Primary Use Cases

  • Estonian news summarization
  • Dialogue summarization (e.g. chat, transcripts)
  • General-purpose Estonian summarization
  • Research on low-resource language adaptation

Limitations

  • May hallucinate or omit important details
  • Performance depends on similarity to training domains
  • Automatically translated datasets may introduce artifacts
  • Not optimized for highly specialized domains

Recipe: Adapting the Model for Domain-Specific Summarization

This recipe assumes your data is in JSONL format.

  1. Prepare the JSONL data

Each line should contain a source text and its reference summary:

    {"text": "Pikk sisendtekst...", "summary": "Lühike kokkuvõte..."}

  2. Validate data quality

Ensure (a minimal check script is sketched after this list):

  • valid JSON per line
  • both text and summary fields exist
  • no empty or corrupted examples
  • consistent summary style
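
A minimal validation sketch (the file paths match the training defaults used below and are otherwise an assumption):

import json

def validate_jsonl(path: str) -> None:
    """Fail fast on malformed lines, missing fields, or empty examples."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            try:
                row = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}, line {i}: invalid JSON ({exc})")
            for field in ("text", "summary"):
                value = row.get(field)
                if not isinstance(value, str) or not value.strip():
                    raise ValueError(f"{path}, line {i}: missing or empty '{field}'")

validate_jsonl("data/train.jsonl")
validate_jsonl("data/validation.jsonl")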
  3. Fine-tune the model on your data

Code:

#!/usr/bin/env python3
"""
Fine-tune TalTechNLP/Llama-3.1-70B-Instruct-summ-et on domain-specific JSONL summarization data.

Expected JSONL format:
{"text": "source document...", "summary": "reference summary..."}

Files:
  data/train.jsonl
  data/validation.jsonl

This script uses:
- transformers
- datasets
- peft
- trl
- bitsandbytes (optional, for 4-bit loading)
"""

import argparse

import torch
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)
from peft import LoraConfig, prepare_model_for_kbit_training
from trl import SFTTrainer


def build_prompt(text: str) -> str:
    # Keep the prompt simple and natural for Estonian summarization.
    return (
        "Palun tee järgmisest tekstist kokkuvõte:\n\n"
        f"{text}\n\n"
        "Kokkuvõte:"
    )


def format_example(example):
    # Convert one JSONL row into a single supervised training string.
    prompt = build_prompt(example["text"].strip())
    summary = example["summary"].strip()
    return {"text": f"{prompt} {summary}"}


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name", type=str, default="TalTechNLP/Llama-3.1-70B-Instruct-summ-et")
    parser.add_argument("--train_file", type=str, default="data/train.jsonl")
    parser.add_argument("--validation_file", type=str, default="data/validation.jsonl")
    parser.add_argument("--output_dir", type=str, default="estonian-summ-lora")
    parser.add_argument("--max_seq_length", type=int, default=2048)
    parser.add_argument("--num_train_epochs", type=float, default=2.0)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--per_device_eval_batch_size", type=int, default=1)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=8)
    parser.add_argument("--learning_rate", type=float, default=2e-5)
    parser.add_argument("--warmup_ratio", type=float, default=0.03)
    parser.add_argument("--logging_steps", type=int, default=10)
    parser.add_argument("--eval_steps", type=int, default=200)
    parser.add_argument("--save_steps", type=int, default=200)
    parser.add_argument("--max_steps", type=int, default=-1)
    parser.add_argument("--use_4bit", action="store_true")
    args = parser.parse_args()

    # Load tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_fast=True)

    # Llama models often do not have a pad token by default.
    # Reuse EOS as PAD for training stability.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Optional 4-bit quantized loading to reduce memory usage.
    quantization_config = None
    if args.use_4bit:
        quantization_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
        )

    # Load base model.
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name,
        torch_dtype=torch.bfloat16 if torch.cuda.is_available() else torch.float32,
        device_map="auto",
        quantization_config=quantization_config,
    )

    # Enable training-friendly settings for k-bit models.
    if args.use_4bit:
        model = prepare_model_for_kbit_training(model)

    # LoRA configuration.
    # These target modules are standard for Llama-style architectures.
    peft_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    )

    # Load datasets from JSONL files.
    dataset = load_dataset(
        "json",
        data_files={
            "train": args.train_file,
            "validation": args.validation_file,
        },
    )

    # Convert each example to a single text field used by SFTTrainer.
    dataset = dataset.map(format_example, remove_columns=dataset["train"].column_names)

    # Training arguments.
    training_args = TrainingArguments(
        output_dir=args.output_dir,
        num_train_epochs=args.num_train_epochs,
        per_device_train_batch_size=args.per_device_train_batch_size,
        per_device_eval_batch_size=args.per_device_eval_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        warmup_ratio=args.warmup_ratio,
        logging_steps=args.logging_steps,
        eval_strategy="steps",
        eval_steps=args.eval_steps,
        save_strategy="steps",
        save_steps=args.save_steps,
        save_total_limit=2,
        bf16=torch.cuda.is_available(),  # Use bf16 on supported GPUs.
        fp16=False,
        report_to="none",
        optim="paged_adamw_8bit" if args.use_4bit else "adamw_torch",
        lr_scheduler_type="cosine",
        max_steps=args.max_steps,
        remove_unused_columns=False,
    )

    # SFTTrainer handles tokenization and the causal LM loss.
    # Note: these keyword arguments (max_seq_length, dataset_text_field,
    # packing, tokenizer) match the older SFTTrainer API; in recent trl
    # releases they move into SFTConfig and `tokenizer` becomes
    # `processing_class`.
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["validation"],
        peft_config=peft_config,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        dataset_text_field="text",
        packing=False,
    )

    # Train.
    trainer.train()

    # Save adapter and tokenizer.
    trainer.model.save_pretrained(args.output_dir)
    tokenizer.save_pretrained(args.output_dir)

    # Optional: save the final trainer state too.
    trainer.save_state()

    print(f"Training complete. Adapter saved to: {args.output_dir}")


if __name__ == "__main__":
    main()

Run like this:

python finetune_summarizer.py \
  --train_file data/train.jsonl \
  --validation_file data/validation.jsonl \
  --output_dir estonian-domain-summ-lora \
  --use_4bit
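
After training, the LoRA adapter can be loaded back on top of the summarization model for inference. A minimal sketch, assuming the adapter directory from the command above and a bf16 (non-quantized) base load so the adapter can be merged:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_name = "TalTechNLP/Llama-3.1-70B-Instruct-summ-et"
adapter_dir = "estonian-domain-summ-lora"

tokenizer = AutoTokenizer.from_pretrained(adapter_dir)
model = AutoModelForCausalLM.from_pretrained(
    base_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the trained LoRA weights; merge_and_unload() folds them into the
# base model so inference needs no peft-specific code afterwards.
model = PeftModel.from_pretrained(model, adapter_dir)
model = model.merge_and_unload()

prompt = "Palun tee järgmisest tekstist kokkuvõte:\n\nSisendtekst siia..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))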