ALJIACHI committed on
Commit cf797aa · verified · 1 Parent(s): c19131a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +14 -217
README.md CHANGED
@@ -1,223 +1,20 @@
  ---
  license: apache-2.0
- language:
- - ar
- - en
- base_model:
- - Alibaba-NLP/gte-multilingual-reranker-base
- tags:
- - sentence-transformers
- - cross-encoder
- - reranker
- - arabic
- - long-context
- pipeline_tag: text-ranking
- library_name: sentence-transformers
- inference: true
  ---

- # Mizan-Rerank-v2

- A high-performance open-source cross-encoder model for reranking Arabic long texts, fine-tuned from Alibaba-NLP/gte-multilingual-reranker-base with state-of-the-art results on Arabic reranking benchmarks.

- ![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Mizan--Rerank--v2-blue)
- ![Model Size](https://img.shields.io/badge/Parameters-305M-green)
- ![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen)
-
- ## Overview
-
- Mizan-Rerank-v2 is a cross-encoder reranking model based on [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base), specifically fine-tuned for Arabic text reranking. It excels at reranking long documents (up to 8192 tokens) and outperforms both its base model and larger competitors on Arabic reranking benchmarks.
-
- ## Key Features
-
- - **Long Document Support**: Handles up to 8192 tokens using RoPE position embeddings with NTK scaling
- - **Superior Arabic Performance**: Outperforms BAAI/bge-reranker-v2-m3 (568M) despite being nearly half the size
- - **Arabic Language Optimization**: Fine-tuned on 1.2M+ Arabic query-document pairs from diverse sources
-
- ## Performance Benchmarks
-
- ![Reranker Benchmark Comparison](chart-1.png)
-
- ### Reranking Evaluation (ndcg@10)
-
- | Model | Parameters | Reranking | Triplet | MIRACL (Long Docs) | WikiQA | MedQA |
- |-------|-----------|-----------|---------|---------------------|--------|-------|
- | **Mizan-Rerank-v2** | **305M** | **1.0000** | 0.9993 | **0.8091** | 0.8258 | **0.6775** |
- | BAAI/bge-reranker-v2-m3 | 568M | **1.0000** | **0.9998** | 0.7231 | **0.8669** | 0.6584 |
- | Alibaba-NLP/gte-multilingual-reranker-base | 305M | **1.0000** | 0.9991 | 0.7539 | 0.8275 | 0.6648 |
- | ALJIACHI/Mizan-Rerank-v1 | 149M | 0.9986 | 0.9955 | 0.7370 | 0.7739 | 0.5502 |
-
- ### Key Improvements over Base Model
-
- | Benchmark | Base Model | Mizan-Rerank-v2 | Improvement |
- |-----------|-----------|-----------------|-------------|
- | Reranking | 1.0000 | 1.0000 | -- |
- | Triplet | 0.9991 | 0.9993 | +0.0002 |
- | MIRACL (Long Docs) | 0.7539 | 0.8091 | **+0.0552** |
- | WikiQA | 0.8275 | 0.8258 | -0.0017 |
- | MedQA | 0.6648 | 0.6775 | **+0.0127** |
-
- ### Key Improvements over BAAI/bge-reranker-v2-m3
-
- | Benchmark | bge-reranker-v2-m3 | Mizan-Rerank-v2 | Improvement |
- |-----------|-------------------|-----------------|-------------|
- | Reranking | 1.0000 | 1.0000 | -- |
- | Triplet | 0.9998 | 0.9993 | -0.0005 |
- | MIRACL (Long Docs) | 0.7231 | 0.8091 | **+0.0860** |
- | WikiQA | 0.8669 | 0.8258 | -0.0411 |
- | MedQA | 0.6584 | 0.6775 | **+0.0191** |
-
- ## Model Details
-
- - **Model Type:** Cross Encoder
- - **Base Model:** [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base)
- - **Architecture:** NewForSequenceClassification (12 layers, 768 hidden, 12 heads)
- - **Maximum Sequence Length:** 8192 tokens
- - **Position Embeddings:** RoPE with NTK scaling (factor 8.0)
- - **Number of Output Labels:** 1
- - **Language:** Arabic (ar), English (en)
- - **License:** Apache 2.0
-
- ## Usage
-
- ### Using Sentence Transformers
-
- ```bash
- pip install -U sentence-transformers
- ```
-
- ```python
- from sentence_transformers import CrossEncoder
-
- # Load model
- model = CrossEncoder("ALJIACHI/Mizan-Rerank-v2", max_length=8192, trust_remote_code=True)
-
- # Score query-document pairs
- pairs = [
-     ["ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
-      "تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة."],
-     ["ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
-      "تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة."],
-     ["ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
-      "تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة."],
- ]
-
- scores = model.predict(pairs)
- print(scores)
- # High score for the relevant passage, low scores for irrelevant ones
-
- # Or rank documents for a query
- ranks = model.rank(
-     "ما هو تفسير الآية وجعلنا من الماء كل شيء حي",
-     [
-         "تعني الآية أن الماء هو عنصر أساسي في حياة جميع الكائنات الحية، وهو ضروري لاستمرار الحياة.",
-         "تم اكتشاف كواكب خارج المجموعة الشمسية تحتوي على مياه متجمدة.",
-         "تحدث القرآن الكريم عن البرق والرعد في عدة مواضع مختلفة.",
-     ]
- )
- print(ranks)
- # [{'corpus_id': 0, 'score': ...}, {'corpus_id': 1, 'score': ...}, ...]
- ```
-
- ### Using Transformers Directly
-
- ```python
- from transformers import AutoModelForSequenceClassification, AutoTokenizer
- import torch
-
- model = AutoModelForSequenceClassification.from_pretrained(
-     "ALJIACHI/Mizan-Rerank-v2",
-     trust_remote_code=True,
-     torch_dtype=torch.float16,
- )
- tokenizer = AutoTokenizer.from_pretrained("ALJIACHI/Mizan-Rerank-v2")
-
- def get_relevance_score(query, passage):
-     inputs = tokenizer(query, passage, return_tensors="pt", padding=True, truncation=True, max_length=8192)
-     with torch.no_grad():
-         outputs = model(**inputs)
-     return torch.sigmoid(outputs.logits).item()
-
- query = "ما هي فوائد فيتامين د؟"
- passages = [
-     "يساعد فيتامين د في تعزيز صحة العظام وتقوية الجهاز المناعي، كما يلعب دوراً مهماً في امتصاص الكالسيوم.",
-     "يستخدم فيتامين د في بعض الصناعات الغذائية كمادة حافظة.",
-     "أطلقت وزارة الزراعة حملة وطنية لزيادة الوعي بأهمية الزراعة العضوية.",
- ]
-
- scores = [(p, get_relevance_score(query, p)) for p in passages]
- reranked = sorted(scores, key=lambda x: x[1], reverse=True)
-
- for passage, score in reranked:
-     print(f"Score: {score:.4f} | {passage[:80]}...")
- ```
-
- ## Training Details
-
- ### Training Data
-
- Trained on **1,199,634 query-document pairs** from diverse Arabic sources
-
- ### Training Configuration
-
- | Parameter | Value |
- |-----------|-------|
- | Base Model | Alibaba-NLP/gte-multilingual-reranker-base |
- | Max Sequence Length | 8192 |
- | Batch Size | 2 |
- | Gradient Accumulation Steps | 16 |
- | Effective Batch Size | 32 |
- | Learning Rate | 5e-7 |
- | LR Scheduler | Cosine |
- | Warmup Ratio | 0.1 |
- | Precision | FP16 |
- | Gradient Checkpointing | Enabled |
- | Loss Function | BinaryCrossEntropyLoss (pos_weight=1.24) |
-
-
- ## Applications
-
- - Arabic search engines and information retrieval systems
- - RAG (Retrieval-Augmented Generation) pipelines
- - Islamic text search and jurisprudence Q&A
- - Digital library and archive search
- - Long-document Arabic content analysis
- - E-learning platforms with Arabic content
-
- ### Framework Versions
-
- - Python: 3.10.14
- - Sentence Transformers: 5.4.1
- - Transformers: 4.55.4
- - PyTorch: 2.8.0+cu126
- - Accelerate: 1.10.0
- - Datasets: 3.5.0
- - Tokenizers: 0.21.0
-
- ## Citation
-
- ```bibtex
- @software{Mizan_Rerank_v2_2026,
-   author = {Ali Aljiachi},
-   title = {Mizan-Rerank-v2: Arabic Long-Context Text Reranking Model},
-   year = {2026},
-   publisher = {Hugging Face},
-   url = {https://huggingface.co/ALJIACHI/Mizan-Rerank-v2}
- }
- ```
-
- ```bibtex
- @inproceedings{reimers-2019-sentence-bert,
-   title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
-   author = "Reimers, Nils and Gurevych, Iryna",
-   booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
-   month = "11",
-   year = "2019",
-   publisher = "Association for Computational Linguistics",
-   url = "https://arxiv.org/abs/1908.10084",
- }
- ```
-
- ## License
-
- Released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).
 
  ---
+ title: Mizan-Rerank-V2 Demo
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: indigo
+ sdk: gradio
+ sdk_version: 5.29.0
+ app_file: app.py
+ pinned: false
  license: apache-2.0
+ models:
+ - ALJIACHI/Mizan-Rerank-v2
+ short_description: Arabic Long-Context Reranking Model Demo
  ---

+ # Mizan-Rerank-v2 Demo

+ Interactive demo for **Mizan-Rerank-v2**, a 305M-parameter cross-encoder model for Arabic long-context text reranking.

+ Enter a query and a list of documents (one per line) to see them reranked by relevance.
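
The interaction the new README describes (one query, documents entered one per line, results ordered by relevance) could be handled roughly as sketched here. This is an illustration only, not the Space's actual `app.py`, which is not part of this commit. The names `split_documents`, `rerank_lines`, and `overlap_scorer` are hypothetical, and the scorer is passed in as an argument so the logic can be tried without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes (query, document) pairs and returns one relevance score per pair.
Scorer = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def split_documents(text: str) -> List[str]:
    """Split a multi-line text box into documents, dropping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def rerank_lines(query: str, text: str, score_fn: Scorer) -> List[Tuple[str, float]]:
    """Return (document, score) tuples sorted from most to least relevant."""
    docs = split_documents(text)
    if not query.strip() or not docs:
        return []
    scores = score_fn([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked]


def overlap_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for the model: counts words shared by query and document."""
    return [float(len(set(q.split()) & set(d.split()))) for q, d in pairs]
```

In a real demo, `score_fn` would presumably be the `predict` method of a `CrossEncoder` loaded with `"ALJIACHI/Mizan-Rerank-v2"` and `trust_remote_code=True`, as in the usage example from the previous README; the toy scorer only exists to keep the sketch self-contained.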