ALJIACHI committed on
Commit
ef36cb3
·
verified ·
1 Parent(s): cf797aa

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +229 -14
README.md CHANGED
@@ -1,20 +1,235 @@
1
  ---
2
- title: Mizan-Rerank-V2 Demo
3
- emoji: 🔍
4
- colorFrom: blue
5
- colorTo: indigo
6
- sdk: gradio
7
- sdk_version: 5.29.0
8
- app_file: app.py
9
- pinned: false
10
  license: apache-2.0
11
- models:
12
- - ALJIACHI/Mizan-Rerank-v2
13
- short_description: Arabic Long-Context Reranking Model Demo
14
  ---
15
 
16
- # Mizan-Rerank-v2 Demo
17
 
18
- Interactive demo for **Mizan-Rerank-v2**, a 305M-parameter cross-encoder model for Arabic long-context text reranking.
19
 
20
- Enter a query and a list of documents (one per line) to see them reranked by relevance.
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - ar
5
+ - en
6
+ base_model:
7
+ - Alibaba-NLP/gte-multilingual-reranker-base
8
+ tags:
9
+ - sentence-transformers
10
+ - cross-encoder
11
+ - reranker
12
+ - arabic
13
+ - long-context
14
+ pipeline_tag: text-ranking
15
+ library_name: sentence-transformers
16
+ inference: true
17
  ---
18
 
19
+ # Mizan-Rerank-v2
20
 
21
+ An open-source cross-encoder for reranking long Arabic texts, fine-tuned from Alibaba-NLP/gte-multilingual-reranker-base. Among the models compared below, it has the highest NDCG@10 on the MIRACL (long documents) and MedQA benchmarks.
22
 
23
+ ![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Mizan--Rerank--v2-blue)
24
+ ![Model Size](https://img.shields.io/badge/Parameters-305M-green)
25
+ ![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen)
26
+ [![Demo](https://img.shields.io/badge/🤗%20Demo-Mizan--Rerank--V2-yellow)](https://huggingface.co/spaces/ALJIACHI/Mizan-Rerank-V2-Demo)
27
+
28
+ ## Try It Out
29
+
30
+ <iframe
31
+ src="https://ALJIACHI-Mizan-Rerank-V2-Demo.hf.space"
32
+ frameborder="0"
33
+ width="100%"
34
+ height="600"
35
+ ></iframe>
36
+
37
+ > **[Open full demo →](https://huggingface.co/spaces/ALJIACHI/Mizan-Rerank-V2-Demo)**
38
+
39
+ ## Overview
40
+
41
+ Mizan-Rerank-v2 is a cross-encoder reranking model based on [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base), fine-tuned specifically for Arabic text reranking. It accepts inputs of up to 8192 tokens and scores higher than both its base model and the larger BAAI/bge-reranker-v2-m3 on the MIRACL (long documents) and MedQA benchmarks; on WikiQA it trails bge-reranker-v2-m3.
42
+
43
+ ## Key Features
44
+
45
+ - **Long Document Support**: Handles up to 8192 tokens using RoPE position embeddings with NTK scaling
46
+ - **Strong Arabic Performance**: Outperforms BAAI/bge-reranker-v2-m3 (568M) on MIRACL and MedQA despite being roughly half the size
47
+ - **Arabic Language Optimization**: Fine-tuned on about 1.2M (1,199,634) Arabic query-document pairs from diverse sources
48
+
49
+ ## Performance Benchmarks
50
+
51
+ ![Reranker Benchmark Comparison](chart-1.png)
52
+
53
+ ### Reranking Evaluation (ndcg@10)
54
+
+ | Model | Parameters | Reranking | Triplet | MIRACL (Long Docs) | WikiQA | MedQA |
+ |-------|-----------|-----------|---------|---------------------|--------|-------|
+ | **Mizan-Rerank-v2** | **305M** | **1.0000** | 0.9993 | **0.8091** | 0.8258 | **0.6775** |
+ | BAAI/bge-reranker-v2-m3 | 568M | **1.0000** | **0.9998** | 0.7231 | **0.8669** | 0.6584 |
+ | Alibaba-NLP/gte-multilingual-reranker-base | 305M | **1.0000** | 0.9991 | 0.7539 | 0.8275 | 0.6648 |
+ | ALJIACHI/Mizan-Rerank-v1 | 149M | 0.9986 | 0.9955 | 0.7370 | 0.7739 | 0.5502 |
61
+
+ ### Comparison with Base Model
+
+ | Benchmark | Base Model | Mizan-Rerank-v2 | Difference |
+ |-----------|-----------|-----------------|------------|
+ | Reranking | 1.0000 | 1.0000 | -- |
+ | Triplet | 0.9991 | 0.9993 | +0.0002 |
+ | MIRACL (Long Docs) | 0.7539 | 0.8091 | **+0.0552** |
+ | WikiQA | 0.8275 | 0.8258 | -0.0017 |
+ | MedQA | 0.6648 | 0.6775 | **+0.0127** |
71
+
+ ### Comparison with BAAI/bge-reranker-v2-m3
+
+ | Benchmark | bge-reranker-v2-m3 | Mizan-Rerank-v2 | Difference |
+ |-----------|-------------------|-----------------|------------|
+ | Reranking | 1.0000 | 1.0000 | -- |
+ | Triplet | 0.9998 | 0.9993 | -0.0005 |
+ | MIRACL (Long Docs) | 0.7231 | 0.8091 | **+0.0860** |
+ | WikiQA | 0.8669 | 0.8258 | -0.0411 |
+ | MedQA | 0.6584 | 0.6775 | **+0.0191** |
81
+
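The per-benchmark differences in the two comparison tables are plain subtractions of the NDCG@10 scores in the main benchmark table. A small sketch that recomputes them; the dictionary and function names are illustrative, and the numbers are copied from the table above:

```python
# NDCG@10 scores copied from the benchmark table in this model card.
SCORES = {
    "Mizan-Rerank-v2": {
        "Reranking": 1.0000, "Triplet": 0.9993, "MIRACL": 0.8091,
        "WikiQA": 0.8258, "MedQA": 0.6775,
    },
    "bge-reranker-v2-m3": {
        "Reranking": 1.0000, "Triplet": 0.9998, "MIRACL": 0.7231,
        "WikiQA": 0.8669, "MedQA": 0.6584,
    },
    "gte-multilingual-reranker-base": {
        "Reranking": 1.0000, "Triplet": 0.9991, "MIRACL": 0.7539,
        "WikiQA": 0.8275, "MedQA": 0.6648,
    },
}


def deltas(model: str, baseline: str) -> dict:
    """Absolute NDCG@10 difference (model minus baseline) per benchmark."""
    return {
        bench: round(SCORES[model][bench] - SCORES[baseline][bench], 4)
        for bench in SCORES[model]
    }


vs_base = deltas("Mizan-Rerank-v2", "gte-multilingual-reranker-base")
vs_bge = deltas("Mizan-Rerank-v2", "bge-reranker-v2-m3")

for bench in vs_base:
    print(f"{bench:<10} vs base: {vs_base[bench]:+.4f}   vs bge: {vs_bge[bench]:+.4f}")
```

The largest gains are on MIRACL (long documents) in both comparisons, while WikiQA is the one benchmark where the fine-tuned model scores lower than both references.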
82
+ ## Model Details
83
+
84
+ - **Model Type:** Cross Encoder
85
+ - **Base Model:** [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)
86
+ - **Architecture:** NewForSequenceClassification (12 layers, 768 hidden, 12 heads)
87
+ - **Maximum Sequence Length:** 8192 tokens
88
+ - **Position Embeddings:** RoPE with NTK scaling (factor 8.0)
89
+ - **Number of Output Labels:** 1
90
+ - **Language:** Arabic (ar), English (en)
91
+ - **License:** Apache 2.0
92
+
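The position-embedding entry above can be made concrete with a small sketch. It uses one common formulation of NTK-aware RoPE scaling, in which the rotary base is enlarged to `base * factor ** (dim / (dim - 2))`. The model's own remote code may implement a different variant, and the rotary base of 10000 is an assumption rather than a value stated in this card. The per-head dimension follows from the listed architecture (768 hidden / 12 heads = 64).

```python
HIDDEN_SIZE = 768    # from Model Details
NUM_HEADS = 12       # from Model Details
NTK_FACTOR = 8.0     # from Model Details
ROPE_BASE = 10000.0  # assumption: common RoPE default, not stated in this card


def rope_inv_freq(head_dim: int, base: float) -> list:
    """Inverse rotation frequencies, one per pair of channels."""
    return [1.0 / (base ** (i / head_dim)) for i in range(0, head_dim, 2)]


def ntk_scaled_base(base: float, factor: float, head_dim: int) -> float:
    """NTK-aware scaling (one common formulation): enlarge the rotary base."""
    return base * factor ** (head_dim / (head_dim - 2))


head_dim = HIDDEN_SIZE // NUM_HEADS  # 64
plain = rope_inv_freq(head_dim, ROPE_BASE)
scaled = rope_inv_freq(head_dim, ntk_scaled_base(ROPE_BASE, NTK_FACTOR, head_dim))

# Ratio of unscaled to scaled frequency for the fastest and slowest channels.
print(plain[0] / scaled[0])
print(plain[-1] / scaled[-1])
```

Under this formulation the fastest-rotating channel is left unchanged, while the slowest channel's frequency is divided by the scaling factor, which is what lets positions far beyond the original training length stay distinguishable.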
93
+ ## Usage
94
+
95
+ ### Using Sentence Transformers
96
+
97
+ ```bash
98
+ pip install -U sentence-transformers
99
+ ```
100
+
+ ```python
+ from sentence_transformers import CrossEncoder
+
+ # Load model
+ model = CrossEncoder("ALJIACHI/Mizan-Rerank-v2", max_length=8192, trust_remote_code=True)
+
+ # Score query-document pairs
+ # Query: "What is the interpretation of the verse 'And We made from water every living thing'?"
+ pairs = [
+     ["ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
+      "تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة."],
+     ["ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
+      "تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة."],
+     ["ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
+      "تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة."],
+ ]
+
+ scores = model.predict(pairs)
+ print(scores)
+ # High score for the relevant passage, low scores for irrelevant ones
+
+ # Or rank documents for a query
+ ranks = model.rank(
+     "ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
+     [
+         "تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة.",
+         "تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة.",
+         "تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة.",
+     ]
+ )
+ print(ranks)
+ # [{'corpus_id': 0, 'score': ...}, {'corpus_id': 1, 'score': ...}, ...]
+ ```
133
+
134
+ ### Using Transformers Directly
135
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ import torch
+
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "ALJIACHI/Mizan-Rerank-v2",
+     trust_remote_code=True,
+     torch_dtype=torch.float16,
+ )
+ tokenizer = AutoTokenizer.from_pretrained("ALJIACHI/Mizan-Rerank-v2")
+
+ def get_relevance_score(query, passage):
+     # Encode the query and passage together as one cross-encoder input
+     inputs = tokenizer(query, passage, return_tensors="pt", padding=True, truncation=True, max_length=8192)
+     with torch.no_grad():
+         outputs = model(**inputs)
+     # The model has a single output label; sigmoid maps its logit to (0, 1)
+     return torch.sigmoid(outputs.logits).item()
+
+ # Query: "What are the benefits of vitamin D?"
+ query = "ما هي فوائد فيتامين د؟"
+ passages = [
+     "يساعد فيتامين د في تعزيز صحة العظام وتقوية الجهاز المناعي، كما يلعب دوراً مهماً في امتصاص الكالسيوم.",
+     "يستخدم فيتامين د في بعض الصناعات الغذائية كمادة حافظة.",
+     "أطلقت وزارة الزراعة حملة وطنية لزيادة الوعي بأهمية الزراعة العضوية.",
+ ]
+
+ # Score each passage, then sort from most to least relevant
+ scores = [(p, get_relevance_score(query, p)) for p in passages]
+ reranked = sorted(scores, key=lambda x: x[1], reverse=True)
+
+ for passage, score in reranked:
+     print(f"Score: {score:.4f} | {passage[:80]}...")
+ ```
166
+
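In the Transformers example, the sigmoid only rescales each logit into the range (0, 1). Because it is monotonic, sorting by raw logits yields the same order as sorting by sigmoid scores, so the sigmoid matters only when an absolute score or threshold is needed. A small sketch with made-up logits (the values and passage names are illustrative, not model output):

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw logits for three passages; not produced by the model.
logits = {"passage_a": 2.3, "passage_b": -0.7, "passage_c": -4.1}

by_logit = sorted(logits, key=logits.get, reverse=True)
by_score = sorted(logits, key=lambda name: sigmoid(logits[name]), reverse=True)

print(by_logit == by_score)
for name in by_score:
    print(f"{name}: logit={logits[name]:+.1f} score={sigmoid(logits[name]):.4f}")
```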
167
+ ## Training Details
168
+
169
+ ### Training Data
170
+
171
+ Trained on **1,199,634 query-document pairs** from diverse Arabic sources.
172
+
173
+ ### Training Configuration
174
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Base Model | Alibaba-NLP/gte-multilingual-reranker-base |
+ | Max Sequence Length | 8192 |
+ | Batch Size | 2 |
+ | Gradient Accumulation Steps | 16 |
+ | Effective Batch Size | 32 |
+ | Learning Rate | 5e-7 |
+ | LR Scheduler | Cosine |
+ | Warmup Ratio | 0.1 |
+ | Precision | FP16 |
+ | Gradient Checkpointing | Enabled |
+ | Loss Function | BinaryCrossEntropyLoss (pos_weight=1.24) |
188
+
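As a rough illustration of how the values in this table fit together: the effective batch size is the per-step batch size times the gradient accumulation steps (2 × 16 = 32), and a cosine schedule with a 0.1 warmup ratio ramps the learning rate linearly up to 5e-7 before decaying it. The sketch assumes a single device and one epoch over the 1,199,634 pairs, neither of which is stated in this card, and it uses the standard linear-warmup plus cosine-decay formula rather than the exact trainer code.

```python
import math

NUM_PAIRS = 1_199_634   # from Training Data
BATCH_SIZE = 2          # from the table
GRAD_ACCUM_STEPS = 16   # from the table
PEAK_LR = 5e-7          # from the table
WARMUP_RATIO = 0.1      # from the table
NUM_EPOCHS = 1          # assumption: not stated in this card
NUM_DEVICES = 1         # assumption: not stated in this card

effective_batch = BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_DEVICES
total_steps = math.ceil(NUM_PAIRS / effective_batch) * NUM_EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.3e}")
```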
189
+
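The `pos_weight` in the loss row scales the penalty on positive (relevant) pairs relative to negative ones. A minimal pure-Python sketch of binary cross-entropy on a single logit with a positive-class weight, following the usual `BCEWithLogitsLoss` definition. The function name is illustrative, and the training code itself is not shown in this card.

```python
import math

POS_WEIGHT = 1.24  # from the Training Configuration table


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_bce_with_logits(logit: float, label: float,
                             pos_weight: float = POS_WEIGHT) -> float:
    """-[pos_weight * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))].

    Uses log(sigmoid(x)) = -softplus(-x) and log(1 - sigmoid(x)) = -softplus(x).
    """
    return pos_weight * label * softplus(-logit) + (1.0 - label) * softplus(logit)


# At an uninformative logit of 0, a positive pair costs 1.24x a negative pair.
pos_loss = weighted_bce_with_logits(0.0, 1.0)
neg_loss = weighted_bce_with_logits(0.0, 0.0)
print(pos_loss, neg_loss, pos_loss / neg_loss)
```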
190
+ ## Applications
191
+
192
+ - Arabic search engines and information retrieval systems
193
+ - RAG (Retrieval-Augmented Generation) pipelines
194
+ - Islamic text search and jurisprudence Q&A
195
+ - Digital library and archive search
196
+ - Long-document Arabic content analysis
197
+ - E-learning platforms with Arabic content
198
+
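For the search and RAG use cases listed above, a reranker usually sits after a fast first-stage retriever: the retriever returns a candidate list and the reranker reorders it before the top few passages go to the generator. The sketch below shows that second stage with a pluggable scoring function. `rerank_top_k` and `overlap_score` are hypothetical names, and the toy word-overlap scorer only stands in for the model so that the example runs without downloading weights.

```python
from typing import Callable, List, Sequence, Tuple


def rerank_top_k(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, candidate) pair and return the k best, highest first."""
    pairs = [(query, doc) for doc in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:k]]


def overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in scorer: fraction of query words that appear in the document."""
    out = []
    for query, doc in pairs:
        q_words = set(query.lower().split())
        d_words = set(doc.lower().split())
        out.append(len(q_words & d_words) / max(1, len(q_words)))
    return out


candidates = [
    "vitamin d supports bone health",
    "organic farming awareness campaign",
    "benefits of vitamin d for immunity",
]
top = rerank_top_k("benefits of vitamin d", candidates, overlap_score, k=2)
print(top)
```

With the real model, `score_fn` could be the `model.predict` method from the Sentence Transformers example, which accepts a list of query-document pairs.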
199
+ ## Framework Versions
200
+
201
+ - Python: 3.10.14
202
+ - Sentence Transformers: 5.4.1
203
+ - Transformers: 4.55.4
204
+ - PyTorch: 2.8.0+cu126
205
+ - Accelerate: 1.10.0
206
+ - Datasets: 3.5.0
207
+ - Tokenizers: 0.21.0
208
+
209
+ ## Citation
210
+
+ ```bibtex
+ @software{Mizan_Rerank_v2_2026,
+   author = {Ali Aljiachi},
+   title = {Mizan-Rerank-v2: Arabic Long-Context Text Reranking Model},
+   year = {2026},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/ALJIACHI/Mizan-Rerank-v2}
+ }
+ ```
220
+
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+   title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+   author = "Reimers, Nils and Gurevych, Iryna",
+   booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+   month = "11",
+   year = "2019",
+   publisher = "Association for Computational Linguistics",
+   url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
232
+
233
+ ## License
234
+
235
+ Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).