md-reheader

Restore heading hierarchy in markdown documents with a fine-tuned 0.6B LLM.

The problem

PDF-to-markdown tools like MinerU, Docling, and Marker do great text extraction — then collapse your document structure. Every heading becomes # or ##. TOCs break. RAG chunking breaks. Navigation breaks.

md-reheader fixes it. A 0.6B-parameter Qwen3 fine-tune reads the document and predicts the correct H1–H6 level for every heading in a single forward pass.

Source PDF → md-reheader → hierarchy restored vs. flat PDF-parser output

Quick start

CLI

pip install md-reheader
rehead --input flat.md --output fixed.md

Auto-detects CUDA. Use --cpu or --gpu to override. Omit --output to stream to stdout (pipe-friendly).

rehead -i flat.md | tee fixed.md      # pipe
rehead -i flat.md --gpu -o out/fixed.md  # creates nested dirs
rehead --help                          # all flags

Python API

from md_reheader.inference.predict import load_model, reheader_document

model, tokenizer = load_model("joelbarmettler/md-reheader")

flat = open("document.md").read()
fixed = reheader_document(flat, model, tokenizer)

The package handles preprocessing (flattening + body stripping) and postprocessing (applying predicted levels back to the original document) automatically.

Direct `transformers` usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("joelbarmettler/md-reheader")
model = AutoModelForCausalLM.from_pretrained(
    "joelbarmettler/md-reheader",
    dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a markdown document structure expert. Given a markdown document with incorrect or flattened heading levels, output each heading with its correct markdown prefix (# for level 1, ## for level 2, etc.), one per line."},
    {"role": "user", "content": "# Introduction\n\nSome text...\n\n# Background\n\nMore text...\n\n# Methods"},
]

input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4096, do_sample=False)

generated = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
# # Introduction
# ## Background
# ## Methods

Important: pass enable_thinking=False to apply_chat_template. Without it, the model enters a repetition loop because training used the non-thinking chat format.

Self-host with vLLM

md-reheader exposes the standard OpenAI-compatible chat endpoint when served with vLLM — higher throughput than raw transformers, and drop-in client compatibility.

pip install vllm
vllm serve joelbarmettler/md-reheader --dtype bfloat16 --max-model-len 8192

On <10 GB cards (e.g. RTX 2000/3060), add --enforce-eager --gpu-memory-utilization 0.70 to skip CUDA-graph allocations that otherwise OOM.

Remote inference (vLLM or any OpenAI-compatible endpoint)

Once a server is running, use md-reheader as a thin client — no local weights needed.

CLI:

rehead -i flat.md -o fixed.md --endpoint http://localhost:8000/v1
# With auth:
rehead -i flat.md -o fixed.md --endpoint https://api.example.com/v1 --api-key sk-xxx
# or set MD_REHEADER_API_KEY in the environment

Python:

from md_reheader.inference.remote import reheader_document_remote

fixed = reheader_document_remote(
    open("flat.md").read(),
    endpoint="http://localhost:8000/v1",
    model="joelbarmettler/md-reheader",
    api_key=None,  # or a bearer token
)

The remote client preprocesses locally (flatten + strip), sends a chat completion to the server with chat_template_kwargs={"enable_thinking": false} to match training, and applies predicted levels back to the original document. Identical output to local inference.

How it works

flat markdown  ──►  flatten headings to #  ──►  strip body to 128+128 tokens
                                                         │
                                                         ▼
        restored markdown  ◄──  apply predicted levels  ◄── Qwen3-0.6B (fine-tuned)

Extract headings with markdown-it-py (correctly skips code blocks).
Flatten every heading to # — the model ignores input levels.
Strip each section's body to its first 128 + last 128 tokens — preserves structural cues, kills the context bloat.
Qwen3-0.6B predicts the correct # prefix per heading.
Levels get mapped back to the original document.

Evaluation

Benchmarked on 7,321 held-out documents from GitHub markdown and Wikipedia.

Metric	All-H1 baseline	Heuristic	md-reheader
Exact match	0.0%	14.5%	56.1%
Per-heading accuracy	13.1%	49.1%	80.6%
Hierarchy preservation	61.3%	68.6%	91.0%
Mean absolute error	1.38	0.62	0.22

Per-level accuracy

	H1	H2	H3	H4	H5	H6
Accuracy	77%	85%	78%	68%	45%	50%

H1–H3 land in the 77–85% band; H5/H6 drop but still beat baselines. Most deep-level errors are off-by-one — the relative structure survives.

By document depth

Max depth	Exact match	Per-heading accuracy	Hierarchy
Depth 2	83%	91%	95%
Depth 3	54%	82%	90%
Depth 4	32%	70%	88%
Depth 5-6	33%	65%	89%

By source

Source	Exact match	Per-heading accuracy
GitHub markdown	49.5%	74.0%
Wikipedia	71.3%	95.5%

Speed

Document size	RTX 4090 (BF16)	CPU (fp32)
< 1k tokens	0.4s	5s
1k–2k tokens	0.8s	10s
2k–4k tokens	1.4s	~20s
4k–8k tokens	3.4s	~60s

Documents longer than ~8k tokens (after stripping) are truncated from the tail.

Training

Item	Value
Base model	Qwen/Qwen3-0.6B (text-only)
Framework	Axolotl
Training data	~197k markdown docs (GitHub + Wikipedia, depth 4+ oversampled 2–8×)
Hardware	2× RTX 4090, DDP, BF16
Sequence length	8,192 tokens with sample packing
Learning rate	5e-5, cosine schedule
Epochs	2 (epoch-1 checkpoint — epoch 2 overfits)
Effective batch	24

Input format during training

All headings flattened to # .
Body text per section truncated to first 128 + last 128 tokens.
Document truncated to 8k tokens.
Assistant output: one heading per line with its correct # prefix.

Limitations

Deep nesting (H5/H6): accuracy drops to 45–50%. Relative structure is preserved but absolute depth gets compressed by 1–2 levels.
Ambiguous structure: heading levels are subjective. The model learns common conventions; it can't resolve genuine ambiguity.
Long documents: >8k tokens (after stripping) get truncated. Headings past the cutoff retain their input levels.
English-centric: trained primarily on English content.

Author

Built by Joel Barmettler.

Citation

@software{barmettler2026mdreheader,
  author = {Barmettler, Joel},
  title  = {md-reheader: Restoring Heading Hierarchy in Markdown Documents},
  year   = {2026},
  url    = {https://github.com/joelbarmettlerUZH/md-reheader}
}

Downloads last month: 340

Safetensors

Model size

0.6B params

Tensor type

BF16

Model tree for joelbarmettler/md-reheader

Base model

Qwen/Qwen3-0.6B-Base

Finetuned

Qwen/Qwen3-0.6B

Finetuned

(795)

this model

joelbarmettler
/

md-reheader

md-reheader

The problem

Quick start

CLI

Python API

Direct `transformers` usage

Self-host with vLLM

Remote inference (vLLM or any OpenAI-compatible endpoint)

How it works

Evaluation

Per-level accuracy

By document depth

By source

Speed

Training

Input format during training

Limitations

Author

Citation

Model tree for joelbarmettler/md-reheader

Dataset used to train joelbarmettler/md-reheader

md-reheader

The problem

Quick start

CLI

Python API

Direct transformers usage

Self-host with vLLM

Remote inference (vLLM or any OpenAI-compatible endpoint)

How it works

Evaluation

Per-level accuracy

By document depth

By source

Speed

Training

Input format during training

Limitations

Author

Citation

Model tree for joelbarmettler/md-reheader

Dataset used to train joelbarmettler/md-reheader

Direct `transformers` usage