Disclaimer: This model is a research experiment and does not provide legal advice. The author is not responsible for any legal inaccuracies or misuse of the information generated by this model. Always consult the Official Journal of the European Union for the definitive text of the AI Act.
Gemma 3n (E4B) Fine-tuned on EU AI Act — Spanish Legal Dataset
Domain-adapted LLM for the EU AI Act, with full training pipeline, evaluation, and overfitting analysis.
v1 — Baseline fine-tuning + real-world limitations
Overview
This repository contains a fine-tuned version of Gemma 3n (E4B) trained on a Spanish dataset derived from the EU AI Act.
The goal was to evaluate how far a relatively small model (≈4B parameters) can be pushed in a highly specialized legal domain using supervised fine-tuning.
The dataset (~7.3k Q&A pairs) was generated from a single legal source, which makes it ideal for testing domain adaptation — but also exposes important limitations.
Training Metrics
Final metrics:
- Train loss: 1.08
- Best eval loss: 1.98
- Perplexity: 7.26
- ROUGE-L (avg): 0.28
Interpretation:
The model converges without severe overfitting, but the evaluation results point to limited generalization caused by the homogeneity of the dataset.
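As a quick consistency check, the reported perplexity follows directly from the eval loss: perplexity is the exponential of the mean cross-entropy loss.

```python
import math

# Perplexity is exp(cross-entropy loss); using the best eval loss above.
eval_loss = 1.98
perplexity = math.exp(eval_loss)
print(round(perplexity, 2))  # ~7.24, matching the reported 7.26 to within rounding
```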
Key Findings
- Fine-tuning improves domain awareness significantly
- The model stops asking for clarification and answers confidently
- However, generalization remains limited
Observed behavior:
- Learns structure and terminology of the AI Act
- Produces plausible but sometimes incorrect legal statements
- Hallucinates when precise recall is required
Overfitting Analysis
This project intentionally documents a realistic failure mode:
- Train loss decreases consistently
- Eval loss plateaus early (~1.98)
- Minimal gap between best and final eval → no catastrophic overfitting
- BUT: poor factual robustness
👉 The model is not simply memorizing: it is learning patterns without true grounding.
Core Insight
Dataset diversity matters more than dataset size.
Even with thousands of samples, fine-tuning on a single document does not produce a reliable standalone legal model.
This is especially critical in domains like law, where precision matters.
Training Setup
- Framework: Unsloth + TRL (SFTTrainer)
- Method: LoRA fine-tuning
- Precision: bfloat16 training
- Export: GGUF (q4_k_m)
- Max sequence length: 512
- Learning rate: 5e-5
- Batch (effective): 8
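The setup above can be sketched with Unsloth and TRL roughly as follows. This is a minimal, illustrative sketch, not the repository's actual script: the base-model repo ID and dataset filename are placeholders, the LoRA rank and per-device/accumulation split are assumptions not stated in this card, and keyword names vary slightly across TRL versions.

```python
# LoRA fine-tuning sketch (Unsloth + TRL). Repo ID, file names, and
# LoRA hyperparameters below are placeholders / assumptions.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/MODEL_REPO_ID",  # placeholder for the Gemma base model
    max_seq_length=512,
    dtype=None,  # auto-selects bfloat16 on supported GPUs
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank: an assumption, not stated in this card
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

dataset = load_dataset("json", data_files="qa_pairs.jsonl")["train"]  # hypothetical file

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        learning_rate=5e-5,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # 2 x 4 = effective batch of 8
        bf16=True,
        output_dir="outputs",
    ),
)
trainer.train()
```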
Pipeline includes:
- Dataset validation and deduplication
- Baseline model evaluation
- Post-training qualitative comparison
- ROUGE-L evaluation on test set
- Perplexity tracking
- Full logging + reproducibility config
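The deduplication step can be illustrated with a toy sketch. Field names and the normalization rules are assumptions; the actual logic lives in the training scripts.

```python
import re

def normalize(text: str) -> str:
    """Collapse case, punctuation, and whitespace so near-identical questions match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(pairs: list[dict]) -> list[dict]:
    """Keep the first Q&A pair seen for each normalized question."""
    seen, unique = set(), []
    for pair in pairs:
        key = normalize(pair["question"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique

data = [
    {"question": "¿Qué es un sistema de IA de alto riesgo?", "answer": "..."},
    {"question": "Qué es un sistema de IA de alto riesgo", "answer": "..."},
]
print(len(deduplicate(data)))  # the two phrasing variants collapse to 1
```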
Model Outputs
The repository includes:
- GGUF model (quantized for local inference)
- Multimodal projection file (if applicable)
- Training logs and evaluation outputs
- Training scripts (end-to-end pipeline)
Usage
The model is exported to GGUF and can be used with:
- LM Studio
- Ollama
- llama.cpp
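For example, the q4_k_m GGUF can be run locally with llama.cpp's CLI; the file name below is a placeholder, and Ollama or LM Studio can load the same file through their own interfaces.

```shell
# Run the quantized model with llama.cpp (model path is a placeholder)
./llama-cli -m ./model-q4_k_m.gguf \
  -p "¿Qué obligaciones impone el AI Act a los sistemas de alto riesgo?" \
  -n 256
```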
Example use cases:
- Spanish legal assistant for querying the EU AI Act
- Internal compliance support tools
- Prototyping legal AI workflows
Limitations
This model should NOT be used as a standalone legal authority.
Known limitations:
- Inaccurate or incomplete legal interpretations
- Hallucinations in edge cases
- Lack of grounding in exact legal text
- Sensitivity to prompt phrasing
Dataset
⚠️ Dataset Notice: The Q&A dataset (~7.3k pairs) was synthetically generated with Claude (Anthropic) from the official EU AI Act text. It has not been verified by legal experts; only a subset of samples was manually reviewed. Do not use it for real legal advice.
Known limitation: the dataset contains significant topical redundancy due to the generation method. Many Q&A pairs cover identical concepts with slight phrasing variations, which contributes to the observed generalization plateau.
This model was fine-tuned on the EU AI Act Spanish Dataset.
Recommended Use (Important)
This model performs significantly better when combined with a Retrieval-Augmented Generation (RAG) pipeline.
👉 Without RAG → approximate reasoning
👉 With RAG → grounded legal assistant
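The RAG pattern recommended here can be sketched with a toy example: retrieve the most relevant chunk of the Act and prepend it to the prompt so answers are grounded in the actual text. The word-overlap scoring below is purely illustrative; a real pipeline would use embeddings and a vector store.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, chunks: list[str]) -> str:
    """Return the chunk with the highest word overlap with the query."""
    q = tokens(query)
    return max(chunks, key=lambda chunk: len(q & tokens(chunk)))

# Illustrative chunks of the Act (not verbatim legal text)
chunks = [
    "Artículo 5: prácticas de IA prohibidas en la Unión ...",
    "Artículo 6: reglas de clasificación de los sistemas de IA de alto riesgo ...",
]
question = "¿Cómo se clasifican los sistemas de alto riesgo?"
context = retrieve(question, chunks)
prompt = f"Contexto:\n{context}\n\nPregunta: {question}"
print(context[:10])  # Artículo 6
```

The grounded prompt is then sent to the fine-tuned model, which combines the retrieved wording with its learned domain vocabulary.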
Next Steps
- Incorporate multiple legal sources (GDPR, case law, guidelines)
- Compare fine-tuning vs RAG vs hybrid approaches
- Evaluate against real-world legal queries
- Improve dataset diversity and structure
Why this repo
This is not a "perfect model" repository.
It is a transparent, real-world experiment showing:
- What works in legal LLM fine-tuning
- What breaks
- And why RAG is often necessary
Author note
Built as part of a practical exploration of AI systems applied to legal and regulatory domains (EU AI Act).
Focus: real deployment constraints, not just benchmarks.