Qwen2.5-1.5B JSON Repair (Pruned + Distilled)

Overview

This model is a lightweight (~1.5B parameter) transformer specialized in repairing malformed JSON outputs generated by large language models.

It was created by:

  1. Structured pruning of Qwen2.5-3B-Instruct (50% layer reduction)
  2. Knowledge distillation from the 3B teacher model
  3. Fine-tuning on synthetic malformed → corrected JSON pairs
  4. Training on an NVIDIA A100 GPU using bfloat16 precision

The objective is syntactic repair and structural correction of JSON under realistic LLM failure patterns.

This is a specialized structural model — not a general-purpose reasoning model.


Base Model and Pruning Strategy

Teacher Model

  • Qwen2.5-3B-Instruct
  • ~3B parameters
  • 36 transformer layers

Student Model

  • 18 transformer layers (50% structured pruning)
  • Retained layers: 0, 2, 4, ..., 34
  • Embeddings, normalization layers, and LM head copied from teacher
  • No random reinitialization of retained layers

Parameter Reduction

  • Teacher: ~3.0B parameters
  • Student: ~1.5B parameters
  • ~50% reduction in depth and inference FLOPs

This uses structured depth pruning rather than magnitude pruning.

Transformers contain significant redundancy across layers. Retaining alternating layers preserves representation diversity while reducing compute and memory.
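The alternating-layer selection can be sketched as a simple slice over the decoder stack. The helper below is illustrative only; a toy `nn.ModuleList` stands in for the 36-layer stack, which in `transformers` lives at `model.model.layers` for Qwen2 checkpoints:

```python
import torch.nn as nn

def prune_alternating_layers(layers: nn.ModuleList, stride: int = 2) -> nn.ModuleList:
    """Keep every `stride`-th block (indices 0, 2, 4, ...), reusing the
    teacher's weights rather than reinitializing them."""
    return nn.ModuleList(layers[i] for i in range(0, len(layers), stride))

# Toy stand-in for the teacher's 36-layer decoder stack.
teacher_layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(36))
student_layers = prune_alternating_layers(teacher_layers)

assert len(student_layers) == 18
assert student_layers[1] is teacher_layers[2]  # same module object: shared, not copied
```

Embeddings, final normalization, and the LM head would be carried over from the teacher unchanged, as described above.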


Dataset Construction

Source dataset: glaiveai/glaive-function-calling-v2

Procedure:

  1. Extract valid JSON objects from conversational samples.
  2. Apply synthetic corruptions:
    • Remove brace
    • Remove quote
    • Remove colon
    • Remove comma
  3. Keep only malformed → corrected pairs where the corrupted string actually differs from the original.
  4. Target dataset size: 10,000 pairs.
  5. 90/10 train-test split.

The corruption strategy simulates realistic LLM output failures:

  • Missing quotes
  • Missing separators
  • Truncated structures
  • Structural incompleteness
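The corruption step can be sketched as removing a single structural character from a valid JSON string. The `corrupt_json` helper below is a hypothetical reconstruction of the procedure, not the exact dataset script:

```python
import json
import random

def corrupt_json(valid: str, rng: random.Random) -> str:
    """Delete one structural character (brace, quote, colon, or comma)
    to simulate a malformed LLM output."""
    target = rng.choice(["{", "}", '"', ":", ","])
    positions = [i for i, ch in enumerate(valid) if ch == target]
    if not positions:
        return valid  # nothing to corrupt; the pair would be discarded
    i = rng.choice(positions)
    return valid[:i] + valid[i + 1:]

rng = random.Random(0)
clean = json.dumps({"status": "success", "code": 200})
broken = corrupt_json(clean, rng)
assert broken != clean  # keep only pairs where the corruption changed the string
```

Pairs where the corrupted string still parses, or is unchanged, would be filtered out per step 3 above.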

Training Methodology

Training uses Knowledge Distillation combining:

  1. Cross-entropy loss against ground truth
  2. KL divergence between teacher and student logits

Loss formulation:

L = α * CE(student, target) + (1 - α) * KL(student || teacher)

Both student and teacher logits are softened by the temperature T before the KL term is computed.
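A minimal PyTorch sketch of this objective, assuming standard temperature-scaled distillation with the listed α = 0.5 and T = 2.0 (the exact reduction used in training is not specified):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      alpha: float = 0.5, T: float = 2.0) -> torch.Tensor:
    """alpha * CE against ground truth + (1 - alpha) * temperature-scaled
    KL divergence between student and teacher distributions."""
    ce = F.cross_entropy(student_logits, targets)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after softening by T
    return alpha * ce + (1 - alpha) * kl

# Shapes: (batch, vocab) for logits, (batch,) for target token ids.
s = torch.randn(4, 10)
t = torch.randn(4, 10)
y = torch.randint(0, 10, (4,))
loss = distillation_loss(s, t, y)
```

With the teacher frozen, only the student's parameters receive gradients from this loss.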

Hyperparameters

  • Epochs: 3
  • Training time: ~40 minutes
  • GPU: NVIDIA A100
  • Precision: bfloat16
  • Batch size: 4
  • Gradient accumulation: 4
  • Effective batch size: 16
  • Learning rate: 2e-5
  • Temperature (T): 2.0
  • Alpha: 0.5
  • Max sequence length: 256
  • Optimizer: AdamW (Transformers default)
  • Checkpoint saving: disabled

The teacher model remained frozen during training.


Evaluation Protocol

Evaluation was conducted on 100 samples from the test split.

Robust parsing was performed using:

json.JSONDecoder().raw_decode

This allows extraction of valid JSON even if trailing text exists.
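A sketch of this parsing step, wrapping `raw_decode` in a hypothetical `extract_first_json` helper:

```python
import json

def extract_first_json(text: str):
    """Parse the first JSON object in `text`, ignoring any trailing tokens."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    if start == -1:
        return None
    try:
        obj, _end = decoder.raw_decode(text, start)
        return obj
    except json.JSONDecodeError:
        return None

extract_first_json('{"status": "ok"} Assistant: done')
# → {'status': 'ok'}
```

Outputs where no object can be recovered count against the Valid JSON Rate below.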

Results

  • Valid JSON Rate: 55%
  • Exact Match Accuracy: 14%

Metric Definitions

Valid JSON Rate: Percentage of outputs that can be successfully parsed as JSON.

Exact Match Accuracy: Percentage of predictions that are semantically identical to ground truth (Python dict equality).
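Both metrics can be computed as below; `score` is an illustrative helper, assuming exact match is checked after parsing (so key order and whitespace do not matter):

```python
import json

def score(predictions, references):
    """Return (valid_json_rate, exact_match_accuracy) over paired lists."""
    valid = exact = 0
    for pred, ref in zip(predictions, references):
        try:
            parsed = json.loads(pred)
        except json.JSONDecodeError:
            continue  # unparseable output counts against both metrics
        valid += 1
        if parsed == json.loads(ref):  # dict equality, order-insensitive
            exact += 1
    n = len(references)
    return valid / n, exact / n

score(['{"a": 1}', '{"a" 1}'], ['{"a": 1}', '{"a": 1}'])
# → (0.5, 0.5)
```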

Interpretation:

The model frequently reconstructs syntactically valid JSON but sometimes:

  • Drops outer fields
  • Alters keys
  • Truncates top-level structure

This behavior is expected given:

  • Heavy pruning (50%)
  • Limited dataset size (10k)
  • Short training duration (3 epochs)

Intended Use

Designed for:

  • Post-processing node in agent pipelines
  • JSON validation layers
  • Function-calling repair
  • LangGraph validation nodes
  • Structured output correction
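As a post-processing node, the model only needs to run when parsing fails. A minimal sketch of that pattern, where `repair_fn` is a placeholder for a call into the model's generate loop:

```python
import json

def repair_if_invalid(raw: str, repair_fn) -> dict:
    """Pass valid JSON through untouched; invoke the repair model
    only when parsing fails."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(repair_fn(raw))

# A stub stands in for the model call here.
repair_if_invalid('{"a" 1}', lambda s: '{"a": 1}')
```

In an agent pipeline, `repair_fn` would wrap the tokenize/generate/decode sequence shown in the usage example.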

Not designed for:

  • General reasoning tasks
  • Complex semantic correction
  • Code generation
  • Multi-step instruction following

Known Failure Modes

Observed behaviors:

  • Run-on text continuation after valid JSON
  • Missing top-level keys
  • Partial schema reconstruction
  • Repetition artifacts

These are typical in partially converged distilled models.


How to Improve Performance

Potential improvements:

  1. Increase dataset size (50k–100k pairs)
  2. Train for 5–10 epochs
  3. Use cosine learning rate decay
  4. Introduce curriculum learning (simple → nested JSON)
  5. Add stop-token conditioning
  6. Apply grammar-constrained decoding
  7. Increase effective batch size
  8. Train longer on A100 (several hours)

With these changes, validity rate is expected to increase significantly (>80%).


Usage Example

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "jamal-ibrahim/qwen2.5-1.5b-json-repair"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

malformed_json = '{"status": "success", "message" "QR code generated successfully"}'

prompt = f"User: Repair this JSON: {malformed_json}\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Conceptual Perspective

This model demonstrates that:

Structural knowledge survives aggressive pruning.

Distillation can preserve formal syntax understanding.

Small specialized models can effectively act as structural validators.

Instead of relying on large general models for everything, a modular architecture using lightweight structural validators can be more efficient and robust.

Training Compute Summary

Hardware: NVIDIA A100

Precision: bfloat16

Epochs: 3

Total training time: ~40 minutes

Estimated CO₂ emissions: ~0.3 kg

License

Apache 2.0 (inherits from base model license).

Citation

If you use this model in research or production systems, please cite this repository.
