# Chunky — Bitext Chunk Alignment Model (LoRA)

A LoRA adapter fine-tuned from Qwen3-4B-Instruct-2507 for finding optimal split points in parallel bilingual text (bitext chunking). Given a source and target text pair with pre-inserted split markers, the model predicts which pairs of split indices align semantically.
## Task

Given `<src>` and `<tgt>` blocks with numbered split markers `[|1|]`, `[|2|]`, ..., predict the optimal alignment pairs as `<answer>src_idx-tgt_idx, ...</answer>`.
Example input:

```
<src>Document title[|1|]First paragraph content.[|2|]Second paragraph.</src>
<tgt>문서 제목[|1|]첫 번째 단락 내용.[|2|]두 번째 단락.</tgt>
```

Expected output:

```
<answer>1-1, 2-2</answer>
```
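Once the model returns alignment pairs, the chosen indices can be applied back to the marked-up text to recover the merged chunks. A minimal sketch of that post-processing step — the `split_at` helper and its marker regex are illustrative, not part of the released code:

```python
import re

def split_at(text: str, indices: list[int]) -> list[str]:
    """Split marked-up text only at the chosen [|n|] indices, merging the rest."""
    # re.split with a capturing group alternates: chunk, idx, chunk, idx, ...
    parts = re.split(r"\[\|(\d+)\|\]", text)
    out, current = [], parts[0]
    for i in range(1, len(parts), 2):
        idx, chunk = int(parts[i]), parts[i + 1]
        if idx in indices:
            out.append(current)   # close the segment at a chosen split point
            current = chunk
        else:
            current += chunk      # merge across an unchosen marker
    out.append(current)
    return out

print(split_at("Document title[|1|]First paragraph content.[|2|]Second paragraph.", [1, 2]))
# ['Document title', 'First paragraph content.', 'Second paragraph.']
```

Unchosen markers simply disappear, so a subset of the inserted split points yields fewer, longer segments.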
## Model Details
| Property | Value |
|---|---|
| Base model | unsloth/Qwen3-4B-Instruct-2507 |
| Method | SFT with LoRA (PEFT) |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Training precision | fp32 |
| Training steps | 8000 (best checkpoint: step 7000) |
| Max sequence length | 12800 tokens |
| Training samples | ~316k augmented bitext alignment samples (Korean–English) |
| Optimizer | AdamW (lr=2e-4, warmup=5%, cosine decay) |
| Effective batch size | 8 (4 × grad_accum 2) |
| Best eval loss | 0.03788 (step 7000) |
| Final eval loss | 0.03844 (step 8000) |
## Framework Versions
- PEFT 0.18.1
- TRL 0.23.0
- Transformers 4.56.2
- PyTorch 2.9.1
- Unsloth 2026.4.4
## Usage

### With Unsloth (recommended)
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from peft import PeftModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=12800,
    load_in_4bit=False,
)
tokenizer = get_chat_template(tokenizer, chat_template="qwen3-instruct")
model = PeftModel.from_pretrained(model, "p4b/chunky-qwen3-4b-sft")
model = FastLanguageModel.for_inference(model)
```
### With standard transformers + PEFT
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "unsloth/Qwen3-4B-Instruct-2507"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, "p4b/chunky-qwen3-4b-sft")
model.eval()
```
## Inference Example
```python
SYSTEM_PROMPT = """## Task Description
You are a Linguistic Structure Analyst and Translation Alignment Expert. Your task is to analyze the provided `<src>` (source) and `<tgt>` (target) blocks to identify "Optimal Split Points". Each split should be in closed form when translating bidirectionally. You should carefully look for pronoun relations or tense.
## Task Objective
Find the index numbers `[|n|]` where the text can be naturally divided into two parts. A split is considered "optimal" if the segments before and after the split remain independently understandable and do not break the semantic flow.
## Guidelines for Selection
1. **Structural Cues**: Prioritize indices located next to structural markers, such as hyphens (`-`), bullet points, or section dividers.
2. **Contextual Independence**: The content after the split point should start a new logical section or thought (e.g., a new heading or a different category of information).
3. Example Logic: In the text `...Information [|32|]-[|33|] Nearby...`, the index `[|33|]` is an ideal split point because it follows a hyphen and precedes a new sub-topic.
4. Alignment: Match the corresponding split point index from the `<src>` block with the equivalent split point index in the `<tgt>` block.
## Constraint
- The output must be formatted strictly as: `<answer>SourceIndex-TargetIndex, SourceIndex-TargetIndex</answer>`
## Input
"""

def insert_split_tokens(chunks: list[str]) -> str:
    parts = []
    for i, chunk in enumerate(chunks, start=1):
        parts.append(chunk)
        parts.append(f"[|{i}|]")
    return "".join(parts)

src_chunks = ["Introduction paragraph.", "Main content section.", "Conclusion."]
tgt_chunks = ["서론 단락.", "본문 내용 섹션.", "결론."]
text = f"<src>{insert_split_tokens(src_chunks)}</src><tgt>{insert_split_tokens(tgt_chunks)}</tgt>"

messages = [{"role": "user", "content": SYSTEM_PROMPT + text}]
input_text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(input_text, return_tensors="pt", add_special_tokens=False).to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
    )

response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(response)  # <answer>2-2</answer>
```
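For downstream use, the raw `<answer>…</answer>` string can be parsed into integer index pairs. A small helper sketch following the output format documented above (`parse_answer` is illustrative, not part of the adapter repo):

```python
import re

def parse_answer(text: str) -> list[tuple[int, int]]:
    """Extract 'src-tgt' index pairs from an <answer>...</answer> response."""
    m = re.search(r"<answer>(.*?)</answer>", text)
    if not m:
        return []  # malformed response: no answer tag found
    pairs = []
    for part in m.group(1).split(","):
        src, tgt = part.strip().split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

print(parse_answer("<answer>1-1, 2-2</answer>"))  # [(1, 1), (2, 2)]
```

Returning an empty list on a missing tag mirrors the parse-error case counted in the evaluation below.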
## Training Data
Fine-tuned on augmented bitext alignment samples generated from Korean–English parallel corpora. Augmentation was applied to increase diversity in split-point configurations. Training used the augmented subset only (~316k samples), leaving the original ~105k samples as an unseen evaluation pool.
## Evaluation Metrics
Evaluated using FN/FP reward scoring (same as GRPO training objective):
| Metric | Description | Weight |
|---|---|---|
| FN (false negative) | Missed ground-truth split pairs | 1.0 |
| FP (false positive) | Predicted pairs not in ground truth | 0.2 |
| Length penalty | Predicted segments >3× longer than GT | 1.0 |
| Reward | -(1.0×FN + 0.2×FP + length_penalty) | — |
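The reward above amounts to a set comparison between predicted and ground-truth pairs. A minimal reconstruction of the scoring in the table — the length-penalty term (predicted segments more than 3× longer than ground truth) is passed in precomputed here rather than derived from the texts:

```python
def reward(pred: set[tuple[int, int]], gold: set[tuple[int, int]],
           length_penalty: float = 0.0) -> float:
    """Negative weighted error: 0.0 means a perfect prediction."""
    fn = len(gold - pred)   # ground-truth pairs the model missed
    fp = len(pred - gold)   # predicted pairs not in the ground truth
    return -(1.0 * fn + 0.2 * fp + length_penalty)

print(reward({(1, 1)}, {(1, 1), (2, 2)}))  # -1.0 (one missed pair)
```

Because FN is weighted 5× heavier than FP, the objective favors over-predicting splits slightly rather than missing aligned ones.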
## Evaluation Results
Evaluated on 300 randomly sampled held-out original (non-augmented) samples from `train_split.jsonl`:
| Metric | Value |
|---|---|
| Perfect reward (=0) | 53.7% (161/300) |
| Reward mean / median | -1.654 / 0.000 |
| Reward stdev | 3.277 |
| FN mean | 1.393 |
| FP mean | 0.687 |
| Length penalty mean | 0.070 |
| Parse error rate | 5.3% |
Reward distribution:
| Range | Count | % |
|---|---|---|
| = 0 (perfect) | 161 | 53.7% |
| -1 ~ 0 | 3 | 1.0% |
| -2 ~ -1 | 70 | 23.3% |
| -5 ~ -2 | 43 | 14.3% |
| ≤ -5 | 23 | 7.7% |
The median reward is 0 — the majority of samples are predicted perfectly. The mean is pulled down by a small number of hard samples with many chunks (100+).
## Limitations
- Primarily trained on Korean–English parallel text; other language pairs are untested.
- May underperform on documents longer than 12800 tokens.
- Trained without a thinking/reasoning mode; the model emits the answer directly, with no chain-of-thought.