Salamandra 7B - Legal Administrative (SINAI)
This repository contains a domain-adapted version of the Salamandra 7B model, optimized for the Spanish legal and administrative domain.
This model is the result of a continual pre-training process on the Salamandra 7B Base model, followed by instruction tuning using specialized datasets.
The original Salamandra family is released under a permissive Apache 2.0 license.
DISCLAIMER: This model is a domain-specific proof-of-concept designed to demonstrate the capabilities of Salamandra in the legal-administrative field. While optimized for this domain, it has NOT been aligned through RLHF to filter or avoid sensitive topics. As a result, it may generate harmful or inappropriate content, or legally inaccurate information. Users should verify any legal information generated against official sources.
Model Details
Description
This model is a Transformer-based decoder-only language model. It builds upon the Salamandra 7B architecture through the following adaptation process:
Continual Pre-training (CPT): The base model was further pre-trained on the SINAI/ALIA-legal-administrative corpus to adapt its weights to the specific vocabulary and structures of legal and administrative Spanish.
Architecture
| Hyperparameter | Value |
|---|---|
| Base Model | Salamandra 7B |
| Total Parameters | 7,768,117,248 |
| Embedding Parameters | 1,048,576,000 |
| Layers | 32 |
| Hidden size | 4,096 |
| Attention heads | 32 |
| Context length | 8,192 |
| Vocabulary size | 256,000 |
| Precision | bfloat16 |
| Embedding type | RoPE |
| Activation Function | SwiGLU |
| Layer normalization | RMS Norm |
| Flash attention | ✅ |
| Grouped Query Attention | ✅ |
| Num. query groups | 8 |
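The derived figures in the table can be sanity-checked with a little arithmetic; the sketch below uses only the values listed above:

```python
# Sanity-check the architecture numbers from the table above.
vocab_size = 256_000
hidden_size = 4_096
n_heads = 32
n_query_groups = 8

# Embedding parameters = vocabulary size × hidden size
embedding_params = vocab_size * hidden_size
print(embedding_params)  # 1,048,576,000 — matches the table

# Per-head dimension with 32 attention heads
head_dim = hidden_size // n_heads
print(head_dim)  # 128

# With Grouped Query Attention, query heads share KV heads group-wise
queries_per_kv_head = n_heads // n_query_groups
print(queries_per_kv_head)  # 4 query heads per KV head
```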
Intended Use
Direct Use
The model is intended for research and commercial use specifically within the Spanish legal and public administration context. Typical use cases include:
- Summarization of administrative documents.
- Question answering regarding public procedures.
- Simplification of legal jargon ("Plain Language").
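For instance, the "Plain Language" use case can be framed as a chat message before applying the model's chat template. `simplify_request` below is a hypothetical helper for illustration, not part of any released API, and the instruction wording is only a suggestion:

```python
# Illustrative only: wrap an administrative text in a plain-language
# rewriting instruction, in the chat-message format expected by
# tokenizer.apply_chat_template.
def simplify_request(document: str) -> list[dict]:
    instruction = (
        "Reescribe el siguiente texto administrativo en lenguaje claro, "
        "conservando su significado legal:\n\n"
    )
    return [{"role": "user", "content": instruction + document}]

messages = simplify_request(
    "Se requiere al interesado la subsanación de la solicitud presentada."
)
print(messages[0]["role"])  # user
```

The resulting list can be passed directly to `tokenizer.apply_chat_template` as in the usage example in this card.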
Out-of-scope Use
The model is not intended for malicious activities. It is explicitly out of scope to use this model as a replacement for a qualified lawyer or legal advisor. Any downstream application must comply with current laws and regulations.
How to use
Python Example
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Replace with the actual path to your uploaded model
model_id = "SINAI/salamandra-7b-legal-admin"
text = "¿Cuáles son los requisitos para presentar una instancia administrativa?"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Format the question using the model's chat template
message = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
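The call above uses greedy decoding. For longer or more varied answers, sampling options can be passed to `model.generate` as keyword arguments; the values below are illustrative starting points, not tuned recommendations:

```python
# Illustrative sampling settings for model.generate(...); the specific
# values are assumptions, not recommendations tuned for this model.
gen_kwargs = dict(
    max_new_tokens=400,       # allow longer completions
    do_sample=True,           # enable sampling instead of greedy decoding
    temperature=0.7,          # soften the output distribution
    top_p=0.9,                # nucleus sampling cutoff
    repetition_penalty=1.1,   # discourage verbatim repetition
)
# Usage: model.generate(input_ids=inputs.to(model.device), **gen_kwargs)
print(sorted(gen_kwargs))
```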
Data
Domain Adaptation Data (SINAI)
To specialize the model, we utilized high-quality datasets provided by the SINAI Research Group (Universidad de Jaén):
- Continual Pre-training
  - Dataset: SINAI/ALIA-legal-administrative
  - Description: A large corpus of Spanish texts from the legal and administrative domain, used to adapt the linguistic distribution of the base model to the target domain.
Original Pre-training Data (Base Model)
The underlying base model (Salamandra 7B) was pre-trained on 12.875 trillion tokens of highly curated data covering 35 European languages and code. For a detailed list of the original pre-training sources, please refer to the Original Salamandra Model Card.
Citation
If you use this model or the datasets, please cite the Salamandra Technical Report and the SINAI datasets accordingly.