Disclaimer: This model is a research experiment and does not provide legal advice. The author is not responsible for any legal inaccuracies or misuse of the information generated by this model. Always consult the Official Journal of the European Union for the definitive text of the AI Act.
Gemma 3n (E4B) Fine-tuned on EU AI Act — Spanish Legal Dataset
Domain-adapted LLM for the EU AI Act, with full training pipeline, evaluation, and overfitting analysis.
v1 — Baseline fine-tuning + real-world limitations
Overview
This repository contains a fine-tuned version of Gemma 3n (E4B) trained on a Spanish dataset derived from the EU AI Act.
The goal was to evaluate how far a relatively small model (≈4B parameters) can be pushed in a highly specialized legal domain using supervised fine-tuning.
The dataset (~7.3k Q&A pairs) was generated from a single legal source, which makes it ideal for testing domain adaptation — but also exposes important limitations.
Training Metrics
Final metrics:
- Train loss: 1.08
- Best eval loss: 1.98
- Perplexity: 7.26
- ROUGE-L (avg): 0.28
Interpretation:
The model converges without severe overfitting, but the evaluation results point to limited generalization caused by the homogeneity of the dataset.
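As a quick consistency check, the reported perplexity follows directly from the eval loss: perplexity is the exponential of the mean cross-entropy loss.

```python
import math

# Perplexity is exp(cross-entropy loss); using the best eval loss above.
eval_loss = 1.98
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # ~7.24, matching the reported 7.26 to within rounding
```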
Key Findings
- Fine-tuning improves domain awareness significantly
- The model stops asking for clarification and answers confidently
- However, generalization remains limited
Observed behavior:
- Learns structure and terminology of the AI Act
- Produces plausible but sometimes incorrect legal statements
- Hallucinates when precise recall is required
Overfitting Analysis
This project intentionally documents a realistic failure mode:
- Train loss decreases consistently
- Eval loss plateaus early (~1.98)
- Minimal gap between best and final eval → no catastrophic overfitting
- BUT: poor factual robustness
👉 The model is not simply memorizing: it is learning patterns without true grounding.
Core Insight
Dataset diversity matters more than dataset size.
Even with thousands of samples, fine-tuning on a single document does not produce a reliable standalone legal model.
This is especially critical in domains like law, where precision matters.
Training Setup
- Framework: Unsloth + TRL (SFTTrainer)
- Method: LoRA fine-tuning
- Precision: bfloat16 training
- Export: GGUF (q4_k_m)
- Max sequence length: 512
- Learning rate: 5e-5
- Batch (effective): 8
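The setup above can be sketched with Unsloth and TRL roughly as follows. This is a minimal, illustrative sketch, not the repository's actual script: the base-model repo ID and dataset filename are placeholders, the LoRA rank and per-device/accumulation split are assumptions not stated in this card, and keyword names vary slightly across TRL versions.

```python
# LoRA fine-tuning sketch (Unsloth + TRL). Repo ID, file names, and
# LoRA hyperparameters below are placeholders / assumptions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/MODEL_REPO_ID",  # placeholder for the Gemma base model
    max_seq_length=512,
    dtype=None,  # auto-selects bfloat16 on supported GPUs
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank: an assumption, not stated in this card
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("json", data_files="qa_pairs.jsonl")["train"]  # hypothetical file

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        learning_rate=5e-5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # 2 x 4 = effective batch of 8
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```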
Pipeline includes:
- Dataset validation and deduplication
- Baseline model evaluation
- Post-training qualitative comparison
- ROUGE-L evaluation on test set
- Perplexity tracking
- Full logging + reproducibility config
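The deduplication step can be illustrated with a toy sketch. Field names and the normalization rules are assumptions; the actual logic lives in the training scripts.

```python
import re

def normalize(text: str) -> str:
    """Collapse case, punctuation, and whitespace so near-identical questions match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(pairs: list[dict]) -> list[dict]:
    """Keep the first Q&A pair seen for each normalized question."""
    seen, unique = set(), []
    for pair in pairs:
        key = normalize(pair["question"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

data = [
    {"question": "¿Qué es un sistema de IA de alto riesgo?", "answer": "..."},
    {"question": "Qué es un sistema de IA de alto riesgo", "answer": "..."},
]
print(len(deduplicate(data)))  # the two phrasing variants collapse to 1
```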
Model Outputs
The repository includes:
- GGUF model (quantized for local inference)
- Multimodal projection file (if applicable)
- Training logs and evaluation outputs
- Training scripts (end-to-end pipeline)
Usage
The model is exported to GGUF and can be used with:
- LM Studio
- Ollama
- llama.cpp
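For example, the q4_k_m GGUF can be run locally with llama.cpp's CLI; the file name below is a placeholder, and Ollama or LM Studio can load the same file through their own interfaces.

```shell
# Run the quantized model with llama.cpp (model path is a placeholder)
./llama-cli -m ./model-q4_k_m.gguf \
  -p "¿Qué obligaciones impone el AI Act a los sistemas de alto riesgo?" \
  -n 256
```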
Example use cases:
- Spanish legal assistant for querying the EU AI Act
- Internal compliance support tools
- Prototyping legal AI workflows
Limitations
This model should NOT be used as a standalone legal authority.
Known limitations:
- Inaccurate or incomplete legal interpretations
- Hallucinations in edge cases
- Lack of grounding in exact legal text
- Sensitivity to prompt phrasing
Dataset
⚠️ Dataset Notice: The Q&A dataset (~7.3k pairs) was synthetically generated with Claude (Anthropic) from the official EU AI Act text. It has not been verified by legal experts; only a subset of samples was manually reviewed. Do not use it for real legal advice.
Known limitation: the dataset contains significant topical redundancy due to the generation method. Many Q&A pairs cover identical concepts with slight phrasing variations, which contributes to the observed generalization plateau.
This model was fine-tuned on the EU AI Act Spanish Dataset.
Recommended Use (Important)
This model performs significantly better when combined with a Retrieval-Augmented Generation (RAG) pipeline.
👉 Without RAG → approximate reasoning
👉 With RAG → grounded legal assistant
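The RAG pattern recommended here can be sketched with a toy example: retrieve the most relevant chunk of the Act and prepend it to the prompt so answers are grounded in the actual text. The word-overlap scoring below is purely illustrative; a real pipeline would use embeddings and a vector store.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, chunks: list[str]) -> str:
    """Return the chunk with the highest word overlap with the query."""
    q = tokens(query)
    return max(chunks, key=lambda chunk: len(q & tokens(chunk)))

# Illustrative chunks of the Act (not verbatim legal text)
chunks = [
    "Artículo 5: prácticas de IA prohibidas en la Unión ...",
    "Artículo 6: reglas de clasificación de los sistemas de IA de alto riesgo ...",
]
question = "¿Cómo se clasifican los sistemas de alto riesgo?"
context = retrieve(question, chunks)
prompt = f"Contexto:\n{context}\n\nPregunta: {question}"
print(context[:10])  # Artículo 6
```

The grounded prompt is then sent to the fine-tuned model, which combines the retrieved wording with its learned domain vocabulary.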
Next Steps
- Incorporate multiple legal sources (GDPR, case law, guidelines)
- Compare fine-tuning vs RAG vs hybrid approaches
- Evaluate against real-world legal queries
- Improve dataset diversity and structure
Why this repo
This is not a "perfect model" repository.
It is a transparent, real-world experiment showing:
- What works in legal LLM fine-tuning
- What breaks
- And why RAG is often necessary
Author note
Built as part of a practical exploration of AI systems applied to legal and regulatory domains (EU AI Act).
Focus: real deployment constraints, not just benchmarks.