jua-4B-legal-only
jua-4B-legal-only is a Brazilian Portuguese legal embedding model based on Qwen/Qwen3-Embedding-4B. It was adapted for legal retrieval with legal-domain supervision only, and is intended for scenarios where stronger specialization on institutionally framed legal search is preferred over broader cross-domain robustness.
This model is presented in the paper "Domain-Adaptive Dense Retrieval for Brazilian Legal Search" (arXiv:2605.04005), where it corresponds to the legal-only condition.
Model Overview
- Base model: Qwen/Qwen3-Embedding-4B
- Model type: text embedding
- Primary language: Brazilian Portuguese
- Intended use: dense retrieval for Brazilian legal search
- Training profile: legal-only adaptation
The legal-only training regime uses legal supervision from:
- JUÁ-Juris training pairs
- Ulysses-derived legislative supervision
- a small synthetic legislative extension based on alternative query formulations
Unlike the mixed model, this model does not add SQuAD-pt.
Intended Use
This model is best suited for:
- jurisprudence retrieval
- institutionally framed legal search
- retrieval settings where legal phrasing and specialized domain supervision are especially important
If your use case is more heterogeneous, question-driven, or closer to broader semantic retrieval, the mixed model may be a better option:
ufca-llms/jua-4B-mixed
Usage
Sentence Transformers
# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("ufca-llms/jua-4B-legal-only")
queries = [
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: aposentadoria por pensão estatutária",
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: normas de auditoria operacional do TCU",
]
documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "As normas de auditoria operacional do TCU estabelecem diretrizes para planejamento, execução e relatório.",
]
query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
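The similarity matrix printed above has one row per query and one column per document. A minimal sketch of turning such a matrix into a per-query ranking follows. `top_k_documents` is a hypothetical helper, not part of Sentence Transformers, and the scores in the example are made-up placeholders rather than model output. With the real model, `similarity.tolist()` would supply the nested list.

```python
from typing import List, Sequence, Tuple


def top_k_documents(
    scores: Sequence[Sequence[float]], k: int = 10
) -> List[List[Tuple[int, float]]]:
    """Return, for each query row, the k highest-scoring (doc_index, score) pairs."""
    rankings = []
    for row in scores:
        # Sort document indices by descending score; ties keep the lower index first.
        order = sorted(range(len(row)), key=lambda j: (-row[j], j))
        rankings.append([(j, row[j]) for j in order[:k]])
    return rankings


# Placeholder scores (2 queries x 3 documents), NOT produced by the model.
toy_scores = [
    [0.12, 0.71, 0.33],
    [0.64, 0.08, 0.59],
]
toy_rankings = top_k_documents(toy_scores, k=2)
print(toy_rankings)
```

The returned indices refer to positions in the `documents` list, so they can be mapped back to the original passages directly.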
Transformers
# Requires transformers>=4.51.0
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoModel, AutoTokenizer
def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # With left padding, every sequence ends at the final position.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, index each sequence at its last non-padding token.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]


def get_detailed_instruct(task_description: str, query: str) -> str:
    # Queries carry an instruction prefix; documents are encoded as-is.
    return f"Instruct: {task_description}\nQuery: {query}"
task = "Given a Brazilian legal search query, retrieve relevant legal passages or documents."
queries = [
    get_detailed_instruct(task, "aposentadoria por pensão estatutária"),
    get_detailed_instruct(task, "normas de auditoria operacional do TCU"),
]
documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "As normas de auditoria operacional do TCU estabelecem diretrizes para planejamento, execução e relatório.",
]
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained(
    "ufca-llms/jua-4B-legal-only",
    padding_side="left",
)
model = AutoModel.from_pretrained("ufca-llms/jua-4B-legal-only")
batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)
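batch_dict = batch_dict.to(model.device)
# Inference only: skip gradient tracking to save memory.
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
# L2-normalize so the dot product below is cosine similarity.
embeddings = F.normalize(embeddings, p=2, dim=1)
# Rows are queries, columns are documents.
scores = embeddings[: len(queries)] @ embeddings[len(queries) :].T
print(scores.tolist())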
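`last_token_pool` above picks one hidden state per sequence: the last column when the batch is left-padded, otherwise the position of the last real token. The index logic can be illustrated without loading the model. `last_token_index` below is a hypothetical pure-Python helper that mirrors that logic on a single attention-mask row, and the masks are toy inputs.

```python
from typing import Sequence


def last_token_index(mask_row: Sequence[int]) -> int:
    """Index of the last real (non-padding) token in one attention-mask row."""
    if mask_row[-1] == 1:
        # Left-padded or unpadded row: the final position holds a real token.
        return len(mask_row) - 1
    # Right-padded row: real tokens occupy the first sum(mask_row) positions.
    return sum(mask_row) - 1


left_padded = [0, 0, 1, 1, 1]
right_padded = [1, 1, 1, 0, 0]
print(last_token_index(left_padded), last_token_index(right_padded))
```

With `padding_side="left"`, as in the tokenizer call above, every row takes the first branch, which is why the pooling reduces to `last_hidden_states[:, -1]`.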
Evaluation
JUÁ + Quati
The table below reproduces the legal-only results reported in the paper over the five legal datasets in the JUÁ evaluation environment plus Quati.
| Dataset | NDCG@10 | MRR@10 | MAP@10 |
|---|---|---|---|
| JUÁ-Juris | 0.294 | 0.233 | 0.233 |
| JurisTCU | 0.375 | 0.650 | 0.179 |
| NormasTCU | 0.310 | 0.461 | 0.186 |
| Ulysses-RFCorpus | 0.426 | 0.619 | 0.301 |
| BR-TaxQA-R | 0.756 | 0.779 | 0.677 |
| Quati | 0.438 | 0.770 | 0.197 |
| Average | 0.433 | 0.585 | 0.296 |
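For readers unfamiliar with the column headers, the sketch below shows common binary-relevance definitions of MRR@k and NDCG@k for a single query. It is an illustration only: the evaluation tooling and relevance grading used in the paper are not described in this card, and `mrr_at_k` and `ndcg_at_k` are hypothetical helper names. Reported scores of this kind are typically averages of the per-query values.

```python
import math
from typing import Sequence, Set


def mrr_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant
    )
    # The ideal ranking places all relevant documents (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking (assumed data): the only relevant document appears at rank 2.
ranking = ["d7", "d3", "d9"]
gold = {"d3"}
print(mrr_at_k(ranking, gold), round(ndcg_at_k(ranking, gold), 4))
```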
Shared legal comparison against broader baselines
On the four legal datasets shared by all baselines in the paper's broader comparison (JUÁ-Juris, JurisTCU, NormasTCU, and BR-TaxQA-R), this model obtains:
- NDCG@10: 0.434
- MRR@10: 0.531
- MAP@10: 0.319
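Both the "Average" row and the shared-subset figures can be re-derived from the per-dataset rows of the results table. The sketch below does this with `decimal` arithmetic and half-up rounding to three places. The rounding convention is an assumption, and `mean_3dp` is a hypothetical helper name.

```python
from decimal import Decimal, ROUND_HALF_UP

# (NDCG@10, MRR@10, MAP@10) per dataset, copied from the results table.
RESULTS = {
    "JUÁ-Juris": ("0.294", "0.233", "0.233"),
    "JurisTCU": ("0.375", "0.650", "0.179"),
    "NormasTCU": ("0.310", "0.461", "0.186"),
    "Ulysses-RFCorpus": ("0.426", "0.619", "0.301"),
    "BR-TaxQA-R": ("0.756", "0.779", "0.677"),
    "Quati": ("0.438", "0.770", "0.197"),
}
# Datasets shared by all baselines in the broader comparison.
SHARED = ["JUÁ-Juris", "JurisTCU", "NormasTCU", "BR-TaxQA-R"]


def mean_3dp(datasets):
    """Average each metric column over the given datasets, rounded half-up to 3 places."""
    means = []
    for col in range(3):
        total = sum(Decimal(RESULTS[name][col]) for name in datasets)
        mean = total / Decimal(len(datasets))
        means.append(str(mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)))
    return means


all_means = mean_3dp(list(RESULTS))
shared_means = mean_3dp(SHARED)
print(all_means)     # compare with the "Average" row
print(shared_means)  # compare with the shared-subset figures
```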
Notes
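- Query-side instructions are recommended: prefix each query with the "Instruct: ... Query: ..." template shown in the Usage section, and encode documents without a prefix.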
- This model is specialized for Brazilian legal retrieval and may be less robust than the mixed model on broader semantic retrieval settings.
- For a more balanced profile across legal and question-driven retrieval regimes, see ufca-llms/jua-4B-mixed.
Citation
If you use this model, please cite:
@misc{pereira2026domainadaptivedenseretrievalbrazilian,
title={Domain-Adaptive Dense Retrieval for Brazilian Legal Search},
author={Jayr Pereira and Roberto Lotufo and Luiz Bonifacio},
year={2026},
eprint={2605.04005},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2605.04005},
}