Doc-to-LoRA: Ministral-3-3B Hypernetwork

Built for the Mistral AI Worldwide Hackathon 2026

A Perceiver-based hypernetwork that converts documents into rank-8 LoRA adapters for Ministral-3-3B-Instruct-2512 in sub-second time, enabling knowledge injection without context window overhead.

Ported from Sakana AI's Doc-to-LoRA (originally targeting Gemma-2-2B) to Ministral-3-3B.

Used by: Thoth Agent

This hypernetwork powers Thoth, also built for the Mistral AI Worldwide Hackathon 2026. Thoth is an agent that uses Doc-to-LoRA to alleviate context size pressure -- instead of feeding entire documents into the context window, it converts them into LoRA adapters on the fly, freeing up the context for reasoning and conversation while retaining document knowledge in the model's weights. Thoth includes an MLX-based inference implementation, enabling the trained hypernetwork to run natively on Apple Silicon Macs.

How It Works

Doc-to-LoRA is a Perceiver-based hypernetwork (~309M parameters) that reads a document and generates a rank-8 LoRA adapter for Ministral-3-3B's MLP down_proj layers. Instead of stuffing documents into the context window at inference time, the model "absorbs" the document into its weights via the generated LoRA.
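A minimal numeric sketch of what a generated adapter does to a down_proj layer. Dimensions and the alpha scaling here are illustrative assumptions, not Ministral-3-3B's actual values:

```python
import numpy as np

# Toy dimensions -- not Ministral-3-3B's real sizes.
d_model, d_ff, rank = 64, 256, 8
rng = np.random.default_rng(0)

W = rng.standard_normal((d_model, d_ff)) * 0.02  # frozen down_proj weight
A = rng.standard_normal((rank, d_ff)) * 0.02     # hypernetwork-generated LoRA "A"
B = rng.standard_normal((d_model, rank)) * 0.02  # hypernetwork-generated LoRA "B"

def down_proj_with_lora(x, alpha=16.0):
    # y = W x + (alpha / rank) * B (A x): the full d_model x d_ff delta
    # is never materialized -- only two small extra matmuls are added.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.standard_normal(d_ff)
y = down_proj_with_lora(x)
```

The document's knowledge lives entirely in the small A and B matrices, which is why no context tokens are consumed at inference time.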

Key properties:

  • Sub-second LoRA generation from any document
  • No context window consumed at inference time
  • Composable: long documents are chunked and their LoRAs composed along the rank dimension
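Composition along the rank dimension (the third property above) can be sketched like this. The identity it relies on is that stacking two adapters' A rows and B columns yields a delta equal to the sum of the individual deltas:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_ff, r = 64, 256, 8  # toy sizes, not the real model dims

# Two rank-8 adapters generated from two chunks of one long document.
A1, B1 = rng.standard_normal((r, d_ff)), rng.standard_normal((d_model, r))
A2, B2 = rng.standard_normal((r, d_ff)), rng.standard_normal((d_model, r))

# Compose along the rank dimension: stack A's rows and B's columns,
# giving a single rank-16 adapter.
A = np.concatenate([A1, A2], axis=0)   # (2r, d_ff)
B = np.concatenate([B1, B2], axis=1)   # (d_model, 2r)

# The composed delta equals the sum of the per-chunk deltas.
assert np.allclose(B @ A, B1 @ A1 + B2 @ A2)
```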

Why Not RAG?

|  | Doc-to-LoRA | RAG |
| --- | --- | --- |
| Context window | Free -- knowledge lives in LoRA weights | Consumed by retrieved chunks |
| Latency | Sub-second LoRA generation, then normal inference | Retrieval + reranking at every query |
| Knowledge depth | Full document absorbed into weights | Limited to retrieved snippets |
| Composability | Multiple document LoRAs can be composed | Context window limits how many chunks fit |
| Trade-off | Requires training a hypernetwork | Works out of the box with any LLM |

Doc-to-LoRA is complementary to RAG -- it works best for documents that are queried repeatedly, where the upfront cost of LoRA generation pays off across many queries.

Results

W&B Report: Training Report

Training Charts

Training Run

| Metric | Value |
| --- | --- |
| Final train loss | 0.744 |
| Final KL loss | 0.824 |
| Training steps | 4,000 |
| Training time | ~8.2 hours |

Training

Trained using context distillation:

  1. Ministral-3-3B reads a document and answers questions (teacher signal with logprobs)
  2. The hypernetwork generates a LoRA from the same document
  3. The model without the document but with the generated LoRA tries to answer the same questions
  4. KL divergence loss between the teacher (full context) and student (LoRA only) logprobs
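The distillation objective in step 4 can be sketched as follows, with toy logits standing in for real model outputs (the actual run additionally applies an L1 penalty of 0.1 on the generated LoRA parameters, per the configuration below):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kl_distill_loss(teacher_logits, student_logits):
    # KL(teacher || student), summed over the vocab, averaged over positions.
    # Teacher: model with the document in context; student: LoRA-only model.
    lp_t = log_softmax(teacher_logits)
    lp_s = log_softmax(student_logits)
    return (np.exp(lp_t) * (lp_t - lp_s)).sum(axis=-1).mean()

rng = np.random.default_rng(2)
teacher = rng.standard_normal((5, 32))   # (positions, vocab) -- toy sizes
student = rng.standard_normal((5, 32))
loss = kl_distill_loss(teacher, student)
```

The loss is zero exactly when the LoRA-only student reproduces the full-context teacher's next-token distribution, which is the "absorption" the hypernetwork is trained toward.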

Training Data

Self-generated QA pairs (using Ministral-3-3B via vLLM) over 4 compact datasets:

  • SQuAD -- Wikipedia factual QA
  • DROP -- Discrete reasoning over paragraphs
  • ROPES -- Science cause/effect reasoning
  • PwC (Papers with Code) -- Academic papers

Training Configuration

| Parameter | Value |
| --- | --- |
| Base model | mistralai/Ministral-3-3B-Instruct-2512 |
| LoRA rank | 8 |
| Target modules | down_proj |
| Context encoder | Per-layer activations |
| Latent queries | 8 |
| Perceiver blocks | 9 |
| Max steps | 4,000 |
| Gradient accumulation | 8 |
| Max packed context length | 2,048 |
| Max packed input length | 2,048 |
| KL loss | Enabled |
| L1 regularization | 0.1 |
| Hardware | 4x NVIDIA A100 80GB |

Porting: Gemma-2-2B to Ministral-3-3B

Ministral-3-3B-Instruct-2512 presented several compatibility challenges with the original Doc-to-LoRA codebase:

| Challenge | Solution |
| --- | --- |
| Model packaged as multimodal (Mistral3ForConditionalGeneration) | Extract the text-only MistralForCausalLM from the multimodal wrapper |
| FP8 weights incompatible with vLLM 0.8.5 and transformers 4.51.3 | Use the official BF16 variant (Ministral-3-3B-Instruct-2512-BF16) |
| Tekken v13 tokenizer not supported | Upgrade to mistral-common>=1.9.0 and patch vLLM tokenizer assertions |
| ministral3 model type unrecognized by transformers | Register it as MistralConfig at import time |

Usage

This model is designed to be used with the doc-to-lora codebase or the Thoth agent.

# See the doc-to-lora repository for full usage instructions
# https://github.com/Neopolita/doc-to-lora

Limitations

  • Smaller training set: Trained on a 10% subset of 4 compact QA datasets, without the FineWeb QA dataset used in the original paper. This may limit generalization to out-of-domain documents.
  • Fewer training steps: 4,000 steps vs ~20,000 in the original Gemma-2-2B training. Longer training with more data would likely improve quality.
  • Single-document focus: Each LoRA is generated from a single document chunk (up to 2,048 tokens). Very long documents require chunking and LoRA composition, which was not extensively evaluated.

License

This model is based on the Doc-to-LoRA codebase by Sakana AI, which does not specify a license. Please refer to Sakana AI for licensing terms regarding the original code and methodology.

Citation

@article{doc-to-lora,
  title={Doc-to-LoRA: Sub-Second Knowledge Injection into LLMs via Document-to-LoRA Translation},
  author={Sakana AI},
  year={2026},
  url={https://pub.sakana.ai/doc-to-lora/}
}