jua-4B-legal-only

jua-4B-legal-only is a Brazilian Portuguese legal embedding model based on Qwen/Qwen3-Embedding-4B. It was adapted for legal retrieval using legal-domain supervision only, and is intended for scenarios where specialization in institutionally framed legal search matters more than broad cross-domain robustness.

This model accompanies the paper Domain-Adaptive Dense Retrieval for Brazilian Legal Search, where it corresponds to the legal-only condition.

Model Overview

  • Base model: Qwen/Qwen3-Embedding-4B
  • Model type: text embedding
  • Primary language: Brazilian Portuguese
  • Intended use: dense retrieval for Brazilian legal search
  • Training profile: legal-only adaptation

The legal-only training regime uses legal supervision from:

  • JUÁ-Juris training pairs
  • Ulysses-derived legislative supervision
  • a small synthetic legislative extension based on alternative query formulations

Unlike the mixed model (ufca-llms/jua-4B-mixed), this model's training data does not include SQuAD-pt.

Intended Use

This model is best suited for:

  • jurisprudence retrieval
  • institutionally framed legal search
  • retrieval settings where legal phrasing and specialized domain supervision are especially important

If your use case is more heterogeneous, question-driven, or closer to broader semantic retrieval, the mixed model may be a better option:

  • ufca-llms/jua-4B-mixed

Usage

Sentence Transformers

# Requires transformers>=4.51.0
# Requires sentence-transformers>=2.7.0

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ufca-llms/jua-4B-legal-only")

queries = [
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: aposentadoria por pensão estatutária",
    "Instruct: Given a Brazilian legal search query, retrieve relevant legal passages or documents.\nQuery: normas de auditoria operacional do TCU",
]

documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "As normas de auditoria operacional do TCU estabelecem diretrizes para planejamento, execução e relatório.",
]

query_embeddings = model.encode(queries)
document_embeddings = model.encode(documents)

similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)

Transformers

# Requires transformers>=4.51.0

import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoModel, AutoTokenizer


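def last_token_pool(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    """Return the hidden state of the last non-padding token of each sequence."""
    # With left padding, every sequence ends at the final position,
    # so the last column of the attention mask is all ones.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    # With right padding, index each row at its own last real token.
    sequence_lengths = attention_mask.sum(dim=1) - 1
    batch_size = last_hidden_states.shape[0]
    return last_hidden_states[
        torch.arange(batch_size, device=last_hidden_states.device),
        sequence_lengths,
    ]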


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f"Instruct: {task_description}\nQuery: {query}"


task = "Given a Brazilian legal search query, retrieve relevant legal passages or documents."
queries = [
    get_detailed_instruct(task, "aposentadoria por pensão estatutária"),
    get_detailed_instruct(task, "normas de auditoria operacional do TCU"),
]

documents = [
    "O art. 5º da Lei 9.717/1998 trata do regime previdenciário dos servidores públicos.",
    "As normas de auditoria operacional do TCU estabelecem diretrizes para planejamento, execução e relatório.",
]

input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained(
    "ufca-llms/jua-4B-legal-only",
    padding_side="left",
)
model = AutoModel.from_pretrained("ufca-llms/jua-4B-legal-only")

batch_dict = tokenizer(
    input_texts,
    padding=True,
    truncation=True,
    max_length=8192,
    return_tensors="pt",
)
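# Move the tokenized batch to the same device as the model.
batch_dict = batch_dict.to(model.device)

# Inference only: disable gradient tracking to save memory.
with torch.no_grad():
    outputs = model(**batch_dict)

# Pool the last real token of each sequence, then L2-normalize so that
# the dot product below equals cosine similarity.
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)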

scores = embeddings[: len(queries)] @ embeddings[len(queries) :].T
print(scores.tolist())
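
The score matrix has one row per query and one column per document. The following is a minimal sketch of turning it into a ranked result list, assuming the matrix has already been converted to nested lists with `scores.tolist()`. The helper name `rank_documents` and the sample scores are hypothetical and are not output from the model.

```python
from typing import List, Tuple


def rank_documents(
    scores: List[List[float]], top_k: int = 10
) -> List[List[Tuple[int, float]]]:
    """For each query row, return the top_k (document_index, score) pairs, best first."""
    ranked = []
    for row in scores:
        # Sort document indices by descending similarity; ties keep their original order.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:top_k]])
    return ranked


# Hypothetical 2x2 score matrix (illustrative values only, not real model output).
example_scores = [[0.62, 0.18], [0.21, 0.74]]
for query_index, hits in enumerate(rank_documents(example_scores, top_k=2)):
    print(query_index, hits)
```

For large corpora, a vector index is usually a better fit than sorting a dense score matrix; this sketch only illustrates how to read the matrix.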

Evaluation

JUÁ + Quati

The table below reproduces the legal-only results reported in the paper over the five legal datasets in the JUÁ evaluation environment plus Quati.

Dataset             NDCG@10   MRR@10   MAP@10
JUÁ-Juris           0.294     0.233    0.233
JurisTCU            0.375     0.650    0.179
NormasTCU           0.310     0.461    0.186
Ulysses-RFCorpus    0.426     0.619    0.301
BR-TaxQA-R          0.756     0.779    0.677
Quati               0.438     0.770    0.197
Average             0.433     0.585    0.296
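
For readers unfamiliar with the three metrics, the sketch below computes them for a single query using common textbook definitions with binary relevance. It is an assumption that the paper's evaluation follows these exact conventions (for example, how average precision at 10 is normalized); the function names are illustrative and the ranked list is assumed to contain no duplicate document IDs.

```python
import math
from typing import List, Set


def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Binary-gain NDCG@k: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2) for i, doc in enumerate(ranked[:k]) if doc in relevant
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


def rr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for i, doc in enumerate(ranked[:k]):
        if doc in relevant:
            return 1.0 / (i + 1)
    return 0.0


def ap_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Average precision at k, normalized by min(|relevant|, k) (one common convention)."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked[:k]):
        if doc in relevant:
            hits += 1
            total += hits / (i + 1)  # precision at this relevant position
    denom = min(len(relevant), k)
    return total / denom if denom > 0 else 0.0


# Hypothetical query: the relevant documents appear at ranks 2 and 3.
ranked_ids = ["d3", "d1", "d7"]
relevant_ids = {"d1", "d7"}
print(ndcg_at_k(ranked_ids, relevant_ids))
print(rr_at_k(ranked_ids, relevant_ids))
print(ap_at_k(ranked_ids, relevant_ids))
```

Dataset-level NDCG@10, MRR@10, and MAP@10 are the means of these per-query values over all queries in the dataset.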

Shared legal comparison against broader baselines

On the four legal datasets shared by all baselines in the paper's broader comparison (JUÁ-Juris, JurisTCU, NormasTCU, and BR-TaxQA-R), this model obtains the following average scores:

  • NDCG@10: 0.434
  • MRR@10: 0.531
  • MAP@10: 0.319
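
These figures are consistent, up to rounding, with the unweighted mean of the corresponding per-dataset scores reported in this card. A quick check using only those rounded values (the dictionary layout and variable names are illustrative):

```python
# Rounded per-dataset scores copied from the evaluation table:
# (NDCG@10, MRR@10, MAP@10).
results = {
    "JUÁ-Juris": (0.294, 0.233, 0.233),
    "JurisTCU": (0.375, 0.650, 0.179),
    "NormasTCU": (0.310, 0.461, 0.186),
    "BR-TaxQA-R": (0.756, 0.779, 0.677),
}

metric_names = ("NDCG@10", "MRR@10", "MAP@10")

# Unweighted (macro) mean of each metric over the four shared datasets.
shared_means = {
    name: sum(scores[i] for scores in results.values()) / len(results)
    for i, name in enumerate(metric_names)
}

for name, value in shared_means.items():
    print(f"{name}: {value:.3f}")
```

Because the inputs are already rounded to three decimals, the printed values may differ from the reported ones in the last digit.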

Notes

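  • Query-side instructions are recommended: prefix each query with the Instruct/Query template shown in Usage, and encode documents without an instruction.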
  • This model is specialized for Brazilian legal retrieval and may be less robust than the mixed model on broader semantic retrieval settings.
  • For a more balanced profile across legal and question-driven retrieval regimes, see ufca-llms/jua-4B-mixed.

Citation

If you use this model, please cite:

@misc{pereira2026domainadaptivedenseretrievalbrazilian,
      title={Domain-Adaptive Dense Retrieval for Brazilian Legal Search}, 
      author={Jayr Pereira and Roberto Lotufo and Luiz Bonifacio},
      year={2026},
      eprint={2605.04005},
      archivePrefix={arXiv},
      primaryClass={cs.IR},
      url={https://arxiv.org/abs/2605.04005}, 
}