Transformers documentation
Optimizers and schedulers
An optimizer updates model weights during training. The scheduler wraps the optimizer and adjusts the learning rate each training step. Trainer creates both when it calls create_optimizer_and_scheduler().
```
┌─────────────────────┐      ┌──────────────┐
│ Optimizer           │      │  Scheduler   │
│ (adamw_torch_fused) │◄─────│   (linear)   │
│                     │      │              │
│ param_groups        │      │              │
│   └ lr ◄────────────┼──────┤              │
│   └ weight_decay    │      │              │
└──────────┬──────────┘      └──────────────┘
           │
┌──── EACH TRAINING STEP ─────────────────────────────────┐
│          │                                              │
│   model(batch)                                          │
│          │                                              │
│          ▼                                              │
│   loss ──► loss.backward() ──► param.grad               │
│                                    │                    │
│          ┌─────────────────────────┘                    │
│          ▼                                              │
│   optimizer.step()                                      │
│          │                                              │
│          ▼                                              │
│   param.data updated                                    │
│          │                                              │
│          ▼                                              │
│   lr_scheduler.step() ──► recalculates lr and writes it │
│          │                to optimizer.param_groups     │
│          ▼                                              │
│   model.zero_grad()                                     │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

Configure optimizer and scheduler behavior, like lr_scheduler_type and optim, in TrainingArguments. The defaults (the adamw_torch optimizer and a linear warmup scheduler) are a good starting point for most fine-tuning runs.
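For reference, the per-step flow in the diagram corresponds to a plain PyTorch loop like the following (a toy sketch with a made-up model and a simple linear-decay schedule; Trainer runs the equivalent of this internally):

```python
import torch
from torch import nn

# toy model and data, for illustration only
model = nn.Linear(4, 1)
batch = torch.randn(8, 4)
target = torch.randn(8, 1)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
# linear decay from the initial lr to 0 over 10 steps
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: max(0.0, 1 - step / 10)
)

for _ in range(3):
    loss = nn.functional.mse_loss(model(batch), target)  # model(batch) -> loss
    loss.backward()                                      # populates param.grad
    optimizer.step()                                     # updates param.data
    lr_scheduler.step()                                  # writes the new lr to param_groups
    model.zero_grad()
```

In practice you configure this loop through TrainingArguments rather than writing it yourself.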
```py
from transformers import TrainingArguments

args = TrainingArguments(
    ...,
    # Optimizer
    optim="adamw_torch",          # or "adamw_torch_fused", "adafactor", "sgd", etc.
    learning_rate=2e-5,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    # Scheduler
    lr_scheduler_type="cosine",   # "linear", "cosine", "constant_with_warmup", etc.
    warmup_steps=500,
    lr_scheduler_kwargs={"num_cycles": 3},  # scheduler-specific extras
)
```

Metric-based schedulers
Some schedulers adapt to training dynamics instead of following a fixed schedule.
GreedyLR updates the learning rate from evaluation results. It raises the learning rate by dividing it by factor (a value below 1, like 0.95) when the metric keeps improving, and lowers the learning rate by multiplying it by factor when the metric stalls. When the learning rate bottoms out at min_lr and the metric still doesn't improve after reset_start steps, GreedyLR resets to its initial state and starts a new cycle.
GreedyLR requires evaluation during training. Set eval_strategy to "steps" or "epoch".
```diff
args = TrainingArguments(
+   lr_scheduler_type="greedy",
+   lr_scheduler_kwargs={"patience": 10, "factor": 0.95, "min_lr": 1e-5},
+   eval_strategy="steps",
+   eval_steps=200,
    ...  # remaining args from the TrainingArguments intro config
)
```

The default mode="min" works for loss. If you're tracking a metric where a higher value is better, like accuracy, pass "mode": "max" in lr_scheduler_kwargs.
See the GreedyLR class for the full list of configurable parameters.
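To build intuition for the update rule described above, here is a toy, pure-Python sketch of the greedy logic. This is not the actual GreedyLR implementation; it ignores patience and reset cycles, and the names and bounds are illustrative:

```python
def greedy_lr_step(lr, metric, best, factor=0.95, min_lr=1e-5, max_lr=1e-2, mode="min"):
    """One illustrative greedy update: grow lr while the metric improves,
    shrink it when the metric stalls. Returns (new_lr, new_best)."""
    improved = metric < best if mode == "min" else metric > best
    if improved:
        new_lr = min(lr / factor, max_lr)   # dividing by factor < 1 raises lr
        best = metric
    else:
        new_lr = max(lr * factor, min_lr)   # multiplying by factor < 1 lowers lr
    return new_lr, best

lr, best = 1e-4, float("inf")
lr, best = greedy_lr_step(lr, metric=0.9, best=best)  # improvement -> lr rises
lr, best = greedy_lr_step(lr, metric=1.2, best=best)  # no improvement -> lr falls
```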
Optimizer integrations
Transformers integrates third-party optimizers for specialized training scenarios.
| Optimizer | Install | optim="value" | Description |
|---|---|---|---|
| APOLLO | apollo-torch | apollo_adamw | Memory-efficient full-param via random projections; rank-1 sufficient |
| FlashOptim | flashoptim | flash_adamw, flash_adam, flash_sgd, flash_sgdw, flash_lion | Reduces optimizer memory with low-precision master weights |
| GrokAdamW | grokadamw | grokadamw | Targets delayed generalization (grokking) |
| LOMO / AdaLomo | lomo-optim | lomo / adalomo | Fuses gradient + update step for low-memory full-param fine-tuning |
| Schedule Free | schedulefree | schedule_free_adamw, schedule_free_radam, schedule_free_sgd | Eliminates LR annealing; pair with lr_scheduler_type="constant" |
| GaLore | galore-torch | galore_adamw, galore_adafactor, galore_adamw_8bit | Full-parameter learning via gradient low-rank projection |
| StableAdamW | torch-optimi | stable_adamw | AdamW + AdaFactor update clipping; no gradient clipping needed |
```bash
pip install apollo-torch
```
Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO) is a memory-efficient optimizer for full-parameter learning during pretraining and fine-tuning. It matches AdamW performance with SGD-like memory cost by using cheap random projections instead of SVD. For extreme memory savings, use APOLLO-Mini, a rank-1 variant.
Use the optim_target_modules parameter to specify which layers to train.
```diff
args = TrainingArguments(
+   optim="apollo_adamw",
+   optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    ...  # remaining args from the TrainingArguments intro config
)
```

Pass additional hyperparameters through optim_args.
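The optim_args string is a comma-separated list of key=value pairs, as in the APOLLO-Mini example below. A quick sketch of how such a string maps to a dict (illustrative parsing, not the library's internal code):

```python
def parse_optim_args(s):
    """Split an 'a=1,b=2' style string into a dict of strings."""
    return dict(pair.split("=", 1) for pair in s.split(",")) if s else {}

print(parse_optim_args("proj=random,rank=1,scale=128.0"))
# {'proj': 'random', 'rank': '1', 'scale': '128.0'}
```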
Set scale to n/r, where n is the original space dimension and r is the low-rank space dimension. Adjusting the learning rate while keeping scale at its default achieves a similar effect.
| parameter | description | APOLLO | APOLLO-Mini |
|---|---|---|---|
| rank | rank of the auxiliary sub-space for gradient scaling | 256 | 1 |
| scale_type | how scaling factors are applied | channel (per-channel scaling) | tensor (per-tensor scaling) |
| scale | adjusts gradient updates to stabilize training | 1.0 | 128 |
| update_proj_gap | steps before updating projection matrices | 200 | 200 |
| proj | projection type | random | random |
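The n/r rule of thumb above is simple enough to sanity-check in a line of Python (the dimensions below are made up for illustration):

```python
def suggested_scale(n, r):
    # scale = n / r: original space dimension over low-rank dimension
    return n / r

print(suggested_scale(256, 256))  # 1.0   (rank equals the original dimension)
print(suggested_scale(128, 1))    # 128.0 (rank-1, APOLLO-Mini style)
```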
Enable APOLLO-Mini with a rank-1 configuration.
```py
args = TrainingArguments(
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
    ...  # remaining args from the TrainingArguments intro config
)
```

Customizing optimizer and scheduler
Create a custom optimizer and scheduler to use an optimizer not yet integrated, adjust per-layer learning rates, or apply custom logic.
Pass a class and kwargs
The optimizer_cls_and_kwargs argument of Trainer accepts a custom optimizer class while delegating parameter grouping and device placement to Trainer.
Trainer defers building the optimizer until create_optimizer() runs, so the model is already on the correct device.
```py
import torch
from transformers import Trainer

trainer = Trainer(
    ...,
    optimizer_cls_and_kwargs=(
        torch.optim.SGD,
        {"momentum": 0.9, "nesterov": True},
    ),
)
```

Pass prebuilt instances
Pass a predefined optimizer and scheduler to the optimizers argument of Trainer. Trainer skips create_optimizer() and create_scheduler() when prebuilt instances are provided. If you don't pass a scheduler, Trainer automatically creates one.
Build the optimizer after placing your model on the correct device. Parameters are resolved at construction time, before Trainer moves the model. In distributed training, mismatched devices can silently cause incorrect behavior.
```py
import torch
from transformers import Trainer, get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=500, num_training_steps=10_000
)

trainer = Trainer(
    ...,
    optimizers=(optimizer, scheduler),
)
```

Prebuilt instances bypass create_optimizer() and create_scheduler(), so you need to specify your own parameter groups.
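A common grouping when building the optimizer yourself is to exclude biases and normalization weights from weight decay. A minimal sketch, assuming a toy model whose module names mark the no-decay parameters:

```python
import torch
from torch import nn

class Toy(nn.Module):
    """Stand-in for a real model; module names are illustrative."""
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.layer_norm = nn.LayerNorm(4)

model = Toy()
no_decay = ("bias", "layer_norm")  # substrings that mark no-weight-decay params

decay_params = [p for n, p in model.named_parameters()
                if not any(s in n for s in no_decay)]
no_decay_params = [p for n, p in model.named_parameters()
                   if any(s in n for s in no_decay)]

optimizer = torch.optim.AdamW(
    [
        {"params": decay_params, "weight_decay": 0.01},
        {"params": no_decay_params, "weight_decay": 0.0},
    ],
    lr=2e-5,
)
print(len(optimizer.param_groups))  # 2
```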
Override optimizer and scheduler methods
Subclass create_optimizer() and create_scheduler() for full control. Both methods run during train().
Override create_scheduler() to use a scheduler like OneCycleLR that isn’t available in SchedulerType.
For each method, make sure to assign to self and return it.
```py
import torch
from transformers import Trainer

class MyTrainer(Trainer):
    def create_scheduler(self, num_training_steps, optimizer=None):
        optimizer = optimizer or self.optimizer
        self.lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer,
            max_lr=0.1,
            total_steps=num_training_steps,
        )
        return self.lr_scheduler
```

You don't need to override create_optimizer() if the default optimizer works. Extending a method with super() is easier than replacing it entirely. For example, add an extra parameter group while keeping everything else the same.
```py
class MyTrainer(Trainer):
    def create_optimizer(self, model=None):
        super().create_optimizer(model)  # builds the default two param groups
        # add an extra param group for the classifier head
        self.optimizer.add_param_group({
            "params": self.model.classifier.parameters(),
            "lr": self.args.learning_rate * 10,
        })
        return self.optimizer
```
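The same add_param_group mechanism can be tried outside Trainer. A toy sketch in plain torch, mirroring the override above with made-up modules standing in for the model body and classifier head:

```python
import torch
from torch import nn

body = nn.Linear(4, 4)        # stand-in for the model body
classifier = nn.Linear(4, 2)  # stand-in for model.classifier

base_lr = 2e-5
optimizer = torch.optim.AdamW(body.parameters(), lr=base_lr)
optimizer.add_param_group({
    "params": classifier.parameters(),
    "lr": base_lr * 10,  # the head group trains at 10x the base lr
})

print([group["lr"] for group in optimizer.param_groups])
```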