Qwen2.5-1.5B JSON Repair (Pruned + Distilled)
Overview
This model is a lightweight (~1.5B parameter) transformer specialized in repairing malformed JSON outputs generated by large language models.
It was created by:
- Structured pruning of Qwen2.5-3B-Instruct (50% layer reduction)
- Knowledge distillation from the 3B teacher model
- Fine-tuning on synthetic malformed → corrected JSON pairs
- Training on an NVIDIA A100 GPU using bfloat16 precision
The objective is syntactic repair and structural correction of JSON under realistic LLM failure patterns.
This is a specialized structural model — not a general-purpose reasoning model.
Base Model and Pruning Strategy
Teacher Model
- Qwen2.5-3B-Instruct
- ~3B parameters
- 36 transformer layers
Student Model
- 18 transformer layers (50% structured pruning)
- Retained layers: 0, 2, 4, ..., 34
- Embeddings, normalization layers, and LM head copied from teacher
- No random reinitialization of retained layers
Parameter Reduction
- Teacher: ~3.0B parameters
- Student: ~1.5B parameters
- ~50% reduction in depth and inference FLOPs
This uses structured depth pruning rather than magnitude pruning.
Transformers contain significant redundancy across layers. Retaining alternating layers preserves representation diversity while reducing compute and memory.
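The retained-layer scheme can be sketched with a toy stack. Here `prune_alternating_layers` is an illustrative helper, not the actual pruning script, and a plain `nn.Linear` stands in for each decoder layer:

```python
import torch.nn as nn

def prune_alternating_layers(layers: nn.ModuleList, keep_every: int = 2) -> nn.ModuleList:
    """Keep layers 0, 2, 4, ... from the teacher stack, with no reinitialization."""
    return nn.ModuleList(layers[i] for i in range(0, len(layers), keep_every))

# Toy stand-in for the 36-layer teacher decoder:
teacher_layers = nn.ModuleList(nn.Linear(8, 8) for _ in range(36))
student_layers = prune_alternating_layers(teacher_layers)
print(len(student_layers))  # 18 retained layers: teacher indices 0, 2, ..., 34
```

Because the retained modules are the teacher's own (not copies of random initializations), the student starts from coherent representations at every depth.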
Dataset Construction
Source dataset: glaiveai/glaive-function-calling-v2
Procedure:
- Extract valid JSON objects from conversational samples.
- Apply synthetic corruptions:
- Remove brace
- Remove quote
- Remove colon
- Remove comma
- Keep only malformed → corrected pairs where the corrupted string actually differs from the original.
- Target dataset size: 10,000 pairs.
- 90/10 train-test split.
The corruption strategy simulates realistic LLM output failures:
- Missing quotes
- Missing separators
- Truncated structures
- Structural incompleteness
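The corruption step can be sketched as follows. The exact pipeline is not published, so this simply reproduces the four listed corruption types on one example:

```python
import json
import random

def corrupt(json_str: str, rng: random.Random) -> str:
    """Remove the first occurrence of one randomly chosen structural character."""
    char = rng.choice(['{', '"', ':', ','])
    idx = json_str.find(char)
    if idx == -1:
        return json_str  # nothing to corrupt for this character type
    return json_str[:idx] + json_str[idx + 1:]

rng = random.Random(0)
clean = json.dumps({"status": "success", "code": 200})
broken = corrupt(clean, rng)
if broken != clean:  # keep only pairs where the corruption changed the text
    pair = {"input": broken, "target": clean}
```

Removing any one of these characters from a well-formed object reliably produces a string that `json.loads` rejects, which is what makes the pair useful as a repair example.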
Training Methodology
Training uses Knowledge Distillation combining:
- Cross-entropy loss against ground truth
- KL divergence between teacher and student logits
Loss formulation:
L = α * CE(student, target) + (1 - α) * KL(student || teacher)
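A minimal PyTorch sketch of this loss, assuming the usual temperature-scaled soft targets (T and α match the hyperparameters below). Note that `F.kl_div` takes log-probabilities as input and probabilities as target, so passing the teacher's softened distribution as the target follows the standard distillation convention:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5, T=2.0):
    # Hard-label cross-entropy against ground truth
    ce = F.cross_entropy(student_logits, targets)
    # KL term on temperature-softened distributions; the T**2 factor keeps
    # gradient magnitudes comparable across temperatures
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * ce + (1 - alpha) * kl

torch.manual_seed(0)
student = torch.randn(4, 10)  # toy [batch, vocab] logits
teacher = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student, teacher, labels)
```

With α = 1 the KL term vanishes and the loss reduces to plain cross-entropy, which is a quick sanity check on the implementation.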
Hyperparameters
- Epochs: 3
- Training time: ~40 minutes
- GPU: NVIDIA A100
- Precision: bfloat16
- Batch size: 4
- Gradient accumulation: 4
- Effective batch size: 16
- Learning rate: 2e-5
- Temperature (T): 2.0
- Alpha: 0.5
- Max sequence length: 256
- Optimizer: AdamW (Transformers default)
- Checkpoint saving: disabled
The teacher model remained frozen during training.
Evaluation Protocol
Evaluation was conducted on 100 samples from the test split.
Robust parsing was performed using:
json.JSONDecoder().raw_decode
This allows extraction of valid JSON even if trailing text exists.
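For reference, `raw_decode` parses the first complete JSON value in a string and returns the index where it ends, which is what makes trailing text tolerable:

```python
import json

decoder = json.JSONDecoder()
raw_output = '{"status": "success"} and some trailing chatter'

# raw_decode returns (parsed_object, end_index); it raises JSONDecodeError
# only if the string does not *start* with a valid JSON value.
obj, end = decoder.raw_decode(raw_output)
print(obj)                  # {'status': 'success'}
print(raw_output[end:])     # the ignored trailing text
```

Note that leading non-JSON text still causes a failure, so outputs are expected to begin with the repaired object.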
Results
- Valid JSON Rate: 55%
- Exact Match Accuracy: 14%
Metric Definitions
Valid JSON Rate: Percentage of outputs that can be successfully parsed as JSON.
Exact Match Accuracy: Percentage of predictions that are semantically identical to ground truth (Python dict equality).
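Under these definitions, the evaluation loop can be sketched as follows (illustrative code, not the original harness):

```python
import json

def evaluate(predictions, references):
    """Return (valid_json_rate, exact_match_rate) over paired samples."""
    decoder = json.JSONDecoder()
    valid = exact = 0
    for pred, ref in zip(predictions, references):
        try:
            obj, _ = decoder.raw_decode(pred.strip())
        except json.JSONDecodeError:
            continue                      # unparseable: counts toward neither metric
        valid += 1
        if obj == json.loads(ref):        # Python dict equality
            exact += 1
    n = len(predictions)
    return valid / n, exact / n

preds = ['{"a": 1} extra text', '{"a": 1}', 'not json']
refs = ['{"a": 1}', '{"a": 2}', '{"a": 3}']
valid_rate, exact_rate = evaluate(preds, refs)
```

Dict equality ignores key order and whitespace, so an output can count as an exact match even when its surface form differs from the reference string.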
Interpretation:
The model frequently reconstructs syntactically valid JSON but sometimes:
- Drops outer fields
- Alters keys
- Truncates top-level structure
This behavior is expected given:
- Heavy pruning (50%)
- Limited dataset size (10k)
- Short training duration (3 epochs)
Intended Use
Designed for:
- Post-processing node in agent pipelines
- JSON validation layers
- Function-calling repair
- LangGraph validation nodes
- Structured output correction
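As a post-processing node, the model only needs to run when direct parsing fails. A minimal sketch, with `repair_fn` standing in for a call into this model (see the usage example below):

```python
import json

def ensure_json(llm_output: str, repair_fn):
    """Parse LLM output as JSON, invoking the repair model once on failure."""
    decoder = json.JSONDecoder()
    try:
        return decoder.raw_decode(llm_output.strip())[0]
    except json.JSONDecodeError:
        repaired = repair_fn(llm_output)  # single repair attempt via the model
        return decoder.raw_decode(repaired.strip())[0]

# Toy usage with a stub repair function:
fixed = ensure_json('{"a" 1}', lambda s: '{"a": 1}')
```

Gating the repair call on a parse failure keeps the common (already-valid) path free of any extra model latency.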
Not designed for:
- General reasoning tasks
- Complex semantic correction
- Code generation
- Multi-step instruction following
Known Failure Modes
Observed behaviors:
- Run-on text continuation after valid JSON
- Missing top-level keys
- Partial schema reconstruction
- Repetition artifacts
These are typical in partially converged distilled models.
How to Improve Performance
Potential improvements:
- Increase dataset size (50k–100k pairs)
- Train for 5–10 epochs
- Use cosine learning rate decay
- Introduce curriculum learning (simple → nested JSON)
- Add stop-token conditioning
- Apply grammar-constrained decoding
- Increase effective batch size
- Train longer on A100 (several hours)
With these changes, the valid-JSON rate would be expected to rise substantially, plausibly above 80%.
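As one concrete example, cosine learning-rate decay can be added with PyTorch's built-in scheduler (a sketch; the original run used the Transformers default schedule at a constant 2e-5):

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.AdamW([param], lr=2e-5)
# Anneal from 2e-5 toward 0 over a toy horizon of 1000 optimizer steps
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1000)

lrs = []
for _ in range(1000):
    opt.step()        # training step would go here
    sched.step()
    lrs.append(sched.get_last_lr()[0])
# lrs decays smoothly from ~2e-5 toward ~0
```

A smooth decay of this kind tends to reduce late-training oscillation, which is one plausible contributor to the run-on and repetition artifacts noted above.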
Usage Example
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "jamal-ibrahim/qwen2.5-1.5b-json-repair"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# A realistic corruption: the colon after "message" is missing
malformed_json = '{"status": "success", "message" "QR code generated successfully"}'

prompt = f"User: Repair this JSON: {malformed_json}\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the repair deterministic
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Conceptual Perspective
This model demonstrates that:
Structural knowledge survives aggressive pruning.
Distillation can preserve formal syntax understanding.
Small specialized models can effectively act as structural validators.
Instead of relying on large general models for everything, a modular architecture using lightweight structural validators can be more efficient and robust.
Training Compute Summary
Hardware: NVIDIA A100
Precision: bfloat16
Epochs: 3
Total training time: ~40 minutes
Approximate CO₂ emissions: ~0.3 kg (estimated)
License
Apache 2.0 (inherits from base model license).
Citation
If you use this model in research or production systems, please cite this repository.