Silent Inconsistency in Data-Parallel Full Fine-Tuning

Experimental Fine-Tuned Models (S1-1 / S1-2 / S1-3)

This repository provides three fully fine-tuned models corresponding to the experimental settings in the paper:

Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misalignment

The models were trained to reproduce and analyze the phenomenon of worker-level optimization misalignment under synchronous data-parallel (DDP) full-parameter fine-tuning.


πŸ“Œ Background

In synchronous data-parallel training with All-Reduce, model parameters are strictly synchronized after each update step. However, synchronization of parameters does not guarantee consistency in worker-level optimization dynamics before gradient aggregation.

The paper introduces the concept of:

Silent Inconsistency β€” a hidden divergence in worker-level loss and gradient behavior that remains invisible in the global averaged loss curve.

To diagnose this issue, the paper proposes three lightweight monitoring metrics:

  • Loss Dispersion — standard deviation and range of per-worker losses
  • Gradient-Norm Dispersion — standard deviation of per-worker gradient L2 norms
  • Gradient-Direction Consistency — average pairwise cosine similarity between worker gradients

These metrics can be computed online with negligible overhead and without modifying the optimization algorithm.
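As a rough illustration of how these diagnostics could be computed, the sketch below evaluates all three from per-worker losses and flattened gradients. It assumes the per-worker values have already been collected (in real DDP training they would be gathered across ranks, e.g. via `torch.distributed.all_gather`, before the averaging step); the function name and NumPy formulation are illustrative, not the paper's implementation.

```python
import numpy as np

def worker_diagnostics(losses, grads):
    """Sketch of the three worker-level metrics (illustrative, not the paper's code).

    losses: shape (num_workers,)      -- per-worker scalar training loss
    grads:  shape (num_workers, dim)  -- per-worker flattened gradient vectors
    """
    # Loss dispersion: std and range of per-worker losses
    loss_std = losses.std()
    loss_range = losses.max() - losses.min()

    # Gradient-norm dispersion: std of per-worker gradient L2 norms
    norms = np.linalg.norm(grads, axis=1)
    norm_std = norms.std()

    # Gradient-direction consistency: mean pairwise cosine similarity
    unit = grads / norms[:, None]
    sims = unit @ unit.T
    upper = np.triu_indices(len(losses), k=1)
    avg_cos = sims[upper].mean()

    return loss_std, loss_range, norm_std, avg_cos
```

With identical workers the metrics should collapse to zero dispersion and cosine similarity 1.0, which makes the function easy to sanity-check before wiring it into a training loop.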


🧠 Base Model

All three models are fully fine-tuned from:

Ascend Tribe – openPangu-Embedded-1B-V1.1 (1B parameters)
πŸ”— https://ai.gitcode.com/ascend-tribe/openPangu-Embedded-1B-V1.1

  • Architecture: Causal LM
  • Parameters: ~1B
  • Precision: bf16 mixed precision during training
  • Training mode: Full-parameter fine-tuning (no LoRA / adapters)

πŸ“š Dataset

Fine-tuned on:

tatsu-lab/alpaca (instruction-tuning dataset)
πŸ”— https://huggingface.co/datasets/tatsu-lab/alpaca

  • Template format: Instruction – Input – Response
  • Maximum sequence length: 1024
  • Loss computed only on the Response tokens
  • Prompt and template tokens are masked during loss calculation
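Response-only supervision of this kind is typically implemented by setting the prompt and template positions of the label sequence to an ignore index (PyTorch's `CrossEntropyLoss` skips positions labeled `-100`). A minimal sketch, assuming the prompt length in tokens is known:

```python
def build_labels(input_ids, prompt_len, ignore_index=-100):
    """Mask prompt/template tokens so loss is computed only on response tokens.

    input_ids:  full token-id sequence (prompt + response)
    prompt_len: number of leading tokens belonging to the prompt/template
    """
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = ignore_index  # excluded from cross-entropy loss
    return labels
```

`build_labels([101, 2023, 999, 102], prompt_len=2)` would keep only the last two tokens as supervision targets, with the rest set to `-100`.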

πŸ§ͺ Experimental Settings

πŸ”Ή S1-1 β€” Strict Consistency

  • All ranks use the same random seed
  • Deterministic DistributedSampler
  • Identical shuffling behavior across workers

Expected behavior:

  • Low loss dispersion
  • Low gradient-norm dispersion
  • High gradient-direction consistency

πŸ”Ή S1-2 β€” Mild Inconsistency

  • Rank 0 uses a different random seed
  • Other ranks follow the baseline seed

Expected behavior:

  • Slight increase in worker-level dispersion
  • Global averaged loss remains smooth

πŸ”Ή S1-3 β€” Significant Inconsistency

  • Each rank uses a distinct seed dependent on rank ID

Expected behavior:

  • Large loss dispersion
  • Larger gradient-norm variation
  • Reduced gradient-direction consistency
  • Global loss may still appear normal
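The three settings differ only in how each rank derives its random seed. A small sketch of that logic, with `base_seed=42` and the `seed_for_rank` helper chosen here purely for illustration:

```python
def seed_for_rank(rank, setting, base_seed=42):
    """Illustrative per-rank seeding for the three experimental settings."""
    if setting == "S1-1":
        # Strict consistency: every rank uses the same seed
        return base_seed
    if setting == "S1-2":
        # Mild inconsistency: only rank 0 deviates from the baseline seed
        return base_seed + 1 if rank == 0 else base_seed
    if setting == "S1-3":
        # Significant inconsistency: each rank gets a rank-dependent seed
        return base_seed + rank
    raise ValueError(f"unknown setting: {setting}")
```

In practice the returned value would be passed to `torch.manual_seed` (and to the `DistributedSampler`'s seed argument) on each rank before training starts.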