RobBobin committed · Commit 371a39d · verified · 1 Parent(s): 8608c67

Upload paper/math_embeddings.tex with huggingface_hub

  1. paper/math_embeddings.tex +469 -0
paper/math_embeddings.tex ADDED
@@ -0,0 +1,469 @@
1
+ \documentclass[11pt,a4paper]{article}
2
+
3
+ \usepackage[utf8]{inputenc}
4
+ \usepackage[T1]{fontenc}
5
+ \usepackage{amsmath,amssymb,amsthm}
6
+ \usepackage{booktabs}
7
+ \usepackage{graphicx}
8
+ \usepackage{hyperref}
9
+ \usepackage[margin=1in]{geometry}
10
+ \usepackage{enumitem}
11
+ \usepackage{xcolor}
12
+ \usepackage{algorithm}
13
+ \usepackage{algpseudocode}
14
+
15
+ \hypersetup{
16
+ colorlinks=true,
17
+ linkcolor=blue!70!black,
18
+ citecolor=green!50!black,
19
+ urlcolor=blue!70!black,
20
+ }
21
+
22
+ \title{Knowledge-Graph-Guided Fine-Tuning of Embedding Models\\
23
+ for Mathematical Document Retrieval}
24
+ \author{Robin Langer\thanks{The author thanks Claude (Anthropic) for assistance with code development and manuscript preparation.}}
25
+ \date{}
26
+
27
+ \begin{document}
28
+
29
+ \maketitle
30
+
31
+ \begin{abstract}
32
+ We present a method for improving semantic search over mathematical research
33
+ papers by fine-tuning embedding models using contrastive learning, guided by
34
+ a knowledge graph extracted from the corpus. General-purpose embedding models
35
+ (e.g., OpenAI's \texttt{text-embedding-3-small}) and even scientific embedding
36
+ models (SPECTER2, SciNCL) perform poorly on mathematical retrieval tasks because
37
+ they lack understanding of the semantic relationships between mathematical
38
+ concepts. Our approach exploits an existing knowledge graph --- whose nodes are
39
+ mathematical concepts and whose edges encode relationships such as
40
+ \emph{generalizes}, \emph{proves}, and \emph{is\_instance\_of} --- to
41
+ automatically generate training data for contrastive fine-tuning. We benchmark
42
+ baseline models against our fine-tuned model on a retrieval task over 4,794
43
+ paper chunks spanning 75 papers in algebraic combinatorics, and show that
44
+ domain-specific fine-tuning raises MRR from 0.461 (best baseline) to 0.816.
45
+ The method is general: given any corpus of mathematical papers and a
46
+ knowledge graph over their concepts, the same pipeline produces a
47
+ domain-adapted embedding model.
48
+ \end{abstract}
49
+
50
+
51
+ \section{Introduction}
52
+
53
+ The increasing volume of mathematical literature makes automated retrieval
54
+ tools indispensable for researchers. A common approach is
55
+ \emph{retrieval-augmented generation} (RAG): chunk papers into passages, embed
56
+ them in a vector space, and retrieve relevant passages via nearest-neighbor
57
+ search over embeddings. The quality of retrieval depends critically on the
58
+ embedding model's ability to capture \emph{mathematical semantic similarity}
59
+ --- the idea that a query like ``Rogers--Ramanujan identities'' should retrieve
60
+ not only passages containing that exact phrase but also passages discussing
61
+ Bailey's lemma, $q$-series transformations, and partition identities.
62
+
63
+ General-purpose embedding models are trained on broad web text and lack this
64
+ kind of domain knowledge. Scientific embedding models such as SPECTER2
65
+ \cite{specter2} and SciNCL \cite{scincl} are trained on citation graphs from
66
+ Semantic Scholar, but mathematics is underrepresented in their training data,
67
+ and they are optimized for \emph{paper-to-paper} similarity rather than
68
+ \emph{concept-to-passage} retrieval.
69
+
70
+ We address this gap by fine-tuning an embedding model specifically for
71
+ mathematical concept retrieval. Our key insight is that a \textbf{knowledge
72
+ graph} (KG) extracted from the corpus provides exactly the supervision signal
73
+ needed for contrastive learning:
74
+ \begin{itemize}[nosep]
75
+ \item Each KG concept (e.g., ``Macdonald polynomials'') maps to specific
76
+ papers, and hence to specific text chunks. These form
77
+ \emph{positive pairs} for contrastive training.
78
+ \item KG edges (e.g., ``Bailey's lemma \emph{generalizes}
79
+ Rogers--Ramanujan identities'') provide \emph{cross-concept
80
+ positives} that teach the model about mathematical relationships.
81
+ \item In-batch negatives from unrelated concepts provide the contrastive
82
+ signal automatically.
83
+ \end{itemize}
84
+
85
+ This paper makes the following contributions:
86
+ \begin{enumerate}[nosep]
87
+ \item A benchmark comparing general-purpose and scientific embedding
88
+ models on mathematical concept retrieval (Section~\ref{sec:benchmark}).
89
+ \item A method for automatically generating contrastive training data from
90
+ a knowledge graph (Section~\ref{sec:training-data}).
91
+ \item A fine-tuned embedding model that outperforms all baselines on our
92
+ benchmark (Section~\ref{sec:finetuning}).
93
+ \item An open-source pipeline\footnote{Code available at
94
+ \url{https://github.com/RaggedR/embeddings}. Model available at
95
+ \url{https://huggingface.co/RobBobin/math-embed}.} that can be applied to any
96
+ mathematical corpus with an associated knowledge graph.
97
+ \end{enumerate}
98
+
99
+
100
+ \section{Related Work}
101
+
102
+ \paragraph{Scientific document embeddings.}
103
+ SPECTER \cite{specter} introduced citation-based contrastive learning for
104
+ scientific document embeddings, training on (paper, cited paper, non-cited
105
+ paper) triplets. SPECTER2 \cite{specter2} extended this to 6 million citation
106
+ triplets across 23 fields of study and introduced task-specific adapters
107
+ (proximity, classification, regression). SciNCL \cite{scincl} improved on
108
+ SPECTER by using citation graph \emph{neighborhood} sampling for harder
109
+ negatives. All three models use SciBERT \cite{scibert} as their backbone and
110
+ produce 768-dimensional embeddings.
111
+
112
+ \paragraph{Mathematics-specific models.}
113
+ MathBERT \cite{mathbert} pre-trained BERT on mathematical curricula and arXiv
114
+ abstracts, but only with masked language modeling --- it was not contrastively
115
+ trained for retrieval. No widely adopted embedding model exists that is
116
+ specifically trained for mathematical semantic similarity.
117
+
118
+ \paragraph{Contrastive fine-tuning.}
119
+ The sentence-transformers framework \cite{sbert} provides
120
+ \texttt{MultipleNegativesRankingLoss} (MNRL), which treats all other examples
121
+ in a batch as negatives. Matryoshka Representation Learning \cite{matryoshka}
122
+ trains embeddings so that any prefix of the full vector is itself a useful
123
+ embedding, enabling flexible dimensionality--quality tradeoffs at inference.
124
+
125
+
126
+ \section{Data}
127
+ \label{sec:data}
128
+
129
+ \subsection{Corpus}
130
+
131
+ Our corpus consists of 75 research papers in algebraic combinatorics,
132
+ $q$-series, and related areas, sourced from arXiv. Papers are chunked into
133
+ passages of up to 1,500 characters with 200-character overlap, yielding
134
+ \textbf{4,794 chunks}. The chunks are stored in a ChromaDB vector database
135
+ with embeddings from OpenAI's \texttt{text-embedding-3-small} (1536-dim).
136
+
137
+ \subsection{Knowledge graph}
138
+
139
+ A knowledge graph was constructed by having GPT-4o-mini extract concepts and
140
+ relationships from representative chunks (first two and last two) of each
141
+ paper \cite{kg-extraction}. After normalization and deduplication, the KG
142
+ contains:
143
+ \begin{itemize}[nosep]
144
+ \item \textbf{559 concepts} (218 objects, 92 theorems, 77 definitions,
145
+ 56 techniques, 28 persons, 26 formulas, 25 identities, 11
146
+ conjectures, and others)
147
+ \item \textbf{486 edges} with typed relationships (\emph{related\_to}:
148
+ 110, \emph{uses}: 78, \emph{generalizes}: 54,
149
+ \emph{is\_instance\_of}: 45, \emph{implies}: 40, \emph{defines}: 39,
150
+ and others)
151
+ \item Coverage of all 75 papers
152
+ \end{itemize}
153
+
154
+
155
+ \section{Benchmark}
156
+ \label{sec:benchmark}
157
+
158
+ \subsection{Ground truth construction}
159
+
160
+ We construct a retrieval benchmark from the KG. For each concept $c$ matched
161
+ to at least two papers in the corpus (\texttt{min\_degree} $= 2$):
162
+ \begin{itemize}[nosep]
163
+ \item \textbf{Query}: the concept's display name (e.g., ``Rogers--Ramanujan
164
+ identities'')
165
+ \item \textbf{Relevant documents}: all chunks from the concept's source
166
+ papers
167
+ \end{itemize}
168
+
169
+ This yields \textbf{108 queries}. The ground truth is approximate --- not
170
+ every chunk in a relevant paper directly discusses the concept --- but this
171
+ bias is consistent across models, making relative comparisons valid.
172
+
173
+ \subsection{Metrics}
174
+
175
+ We report:
176
+ \begin{itemize}[nosep]
177
+ \item \textbf{MRR} (Mean Reciprocal Rank): the average inverse rank of the
178
+ first relevant result.
179
+ \item \textbf{NDCG@$k$} (Normalized Discounted Cumulative Gain): measures
180
+ ranking quality with position-dependent discounting.
181
+ \item \textbf{Recall@$k$}: fraction of relevant documents retrieved in the
182
+ top $k$. Note that Recall@$k$ appears low because relevant sets are
183
+ large (often 100+ chunks per concept); MRR and NDCG are the
184
+ meaningful comparison metrics.
185
+ \end{itemize}
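+
+ For concreteness, one standard way to write these metrics is sketched below.
+ The symbols are introduced here for illustration: $Q$ is the query set,
+ $R_q$ the set of relevant chunks for query $q$, $\pi_q(i)$ the chunk ranked
+ at position $i$, and $\mathrm{rank}_q$ the position of the first relevant
+ chunk. The binary gain and $\log_2$ discount in NDCG are assumed
+ conventions; the text above does not fix a particular variant.
+ \begin{align*}
+ \mathrm{MRR} &= \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\mathrm{rank}_q}, \\
+ \mathrm{Recall@}k &= \frac{1}{|Q|} \sum_{q \in Q}
+ \frac{|\{\pi_q(1), \ldots, \pi_q(k)\} \cap R_q|}{|R_q|}, \\
+ \mathrm{DCG@}k(q) &= \sum_{i=1}^{k}
+ \frac{\mathbf{1}[\pi_q(i) \in R_q]}{\log_2(i+1)}, \\
+ \mathrm{NDCG@}k &= \frac{1}{|Q|} \sum_{q \in Q}
+ \frac{\mathrm{DCG@}k(q)}{\mathrm{IDCG@}k(q)},
+ \end{align*}
+ where $\mathrm{IDCG@}k(q) = \sum_{i=1}^{\min(k, |R_q|)} 1/\log_2(i+1)$ is
+ the DCG of an ideal ranking. Under this definition the per-query recall is
+ at most $k / |R_q|$, so for a query with $|R_q| \geq 100$ even a perfect
+ ranking gives Recall@10 of at most $0.10$.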
186
+
187
+ All metrics are computed using a Rust implementation with rayon parallelism
188
+ for batch kNN and metric aggregation \cite{rust-metrics}.
189
+
190
+ \subsection{Baseline results}
191
+
192
+ \begin{table}[h]
193
+ \centering
194
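+ \caption{Baseline embedding model comparison on mathematical concept retrieval.
195
+ All models evaluated on 108 queries over 4,794 chunks. The best baseline MRR and NDCG@10 are in bold; the fine-tuned model (Section~\ref{sec:finetuning}) is shown below the rule for reference.}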
196
+ \label{tab:baselines}
197
+ \begin{tabular}{lcccccc}
198
+ \toprule
199
+ Model & Dim & R@5 & R@10 & R@20 & MRR & NDCG@10 \\
200
+ \midrule
201
+ \texttt{openai-small} & 1536 & 0.010 & 0.019 & 0.037 & \textbf{0.461} & \textbf{0.324} \\
202
+ SPECTER2 (proximity) & 768 & 0.007 & 0.013 & 0.024 & 0.360 & 0.225 \\
203
+ SciNCL & 768 & 0.006 & 0.012 & 0.024 & 0.306 & 0.205 \\
204
+ \midrule
205
+ Math-Embed (ours) & 768 & \textbf{0.030} & \textbf{0.058} & \textbf{0.111} & \textbf{0.816} & \textbf{0.736} \\
206
+ \bottomrule
207
+ \end{tabular}
208
+ \end{table}
209
+
210
+ The general-purpose OpenAI model outperforms both scientific models by a wide
211
+ margin (28\% higher MRR than SPECTER2, 51\% higher than SciNCL). This is
212
+ notable because SPECTER2 was trained on 6 million scientific citation triplets
213
+ --- yet it underperforms a model with no scientific specialization. We
214
+ attribute this, tentatively, to two factors:
215
+ \begin{enumerate}[nosep]
216
+ \item \textbf{Dimensionality}: OpenAI's 1536-dim space has more capacity
217
+ than the 768-dim BERT-based models.
218
+ \item \textbf{Task mismatch}: SPECTER2 and SciNCL were trained for
219
+ paper-to-paper similarity (title + abstract), not concept-to-chunk
220
+ retrieval. A query like ``Rogers--Ramanujan identities'' is not a
221
+ paper title --- it is a mathematical concept name, and retrieving
222
+ relevant passages requires understanding what that concept means.
223
+ \end{enumerate}
224
+
225
+
226
+ \section{Training Data from Knowledge Graphs}
227
+ \label{sec:training-data}
228
+
229
+ We generate contrastive training data automatically from the KG and corpus.
230
+
231
+ \subsection{Direct pairs}
232
+
233
+ For each concept $c$ with papers $P_1, \ldots, P_m$ in the KG, and each
234
+ paper $P_j$ with chunks $\{d_{j,1}, \ldots, d_{j,n_j}\}$ in the corpus:
235
+ \begin{align}
236
+ \text{Pairs}_{\text{name}}(c) &= \{(\texttt{name}(c),\; d_{j,k}) :
237
+ j \in [m],\; k \in [n_j]\} \\
238
+ \text{Pairs}_{\text{desc}}(c) &= \{(\texttt{desc}(c),\; d_{j,k}) :
239
+ j \in [m],\; k \in [n_j]\}
240
+ \end{align}
241
+
242
+ Using both the concept name and its description as anchors provides anchor
243
+ diversity: short anchors (e.g., ``Macdonald polynomials'') train exact-match
244
+ retrieval, while longer descriptions (e.g., ``A family of orthogonal
245
+ symmetric polynomials generalizing Schur functions'') train paraphrase
246
+ retrieval.
247
+
248
+ We cap at 20 chunks per concept to prevent over-representation of
249
+ high-degree concepts.
250
+
251
+ \subsection{Edge pairs}
252
+
253
+ For each edge $(c_1, c_2, r)$ in the KG with relation $r$ (e.g.,
254
+ \emph{generalizes}, \emph{uses}):
255
+ \begin{equation}
256
+ \text{Pairs}_{\text{edge}}(c_1, c_2) = \{(\texttt{name}(c_1),\; d) :
257
+ d \in \text{chunks}(c_2)\} \cup \{(\texttt{name}(c_2),\; d) :
258
+ d \in \text{chunks}(c_1)\}
259
+ \end{equation}
260
+
261
+ These cross-concept pairs teach the model that mathematically related concepts
262
+ should embed nearby. For example, if ``Bailey's lemma'' \emph{generalizes}
263
+ ``Rogers--Ramanujan identities,'' then chunks about Rogers--Ramanujan should
264
+ be somewhat relevant to queries about Bailey's lemma.
265
+
266
+ We cap at 5 chunks per edge direction to prevent edge pairs from dominating
267
+ the dataset.
268
+
269
+ \subsection{Dataset statistics}
270
+
271
+ \begin{table}[h]
272
+ \centering
273
+ \caption{Training dataset statistics.}
274
+ \label{tab:dataset}
275
+ \begin{tabular}{lr}
276
+ \toprule
277
+ Direct pairs (concept $\to$ chunk) & 21,544 \\
278
+ Edge pairs (cross-concept) & 4,855 \\
279
+ Total unique pairs & 25,121 \\
280
+ Training set (90\%) & 22,609 \\
281
+ Validation set (10\%) & 2,512 \\
282
+ Unique anchors & 1,114 \\
283
+ \bottomrule
284
+ \end{tabular}
285
+ \end{table}
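+
+ As a consistency check on Table~\ref{tab:dataset}, the caps described above
+ bound the pair counts. The reading assumed here is that the 20-chunk cap
+ applies separately to name and description anchors, and that duplicates are
+ removed after the direct and edge pairs are merged; neither detail is
+ spelled out in the text.
+ \begin{align*}
+ \text{direct pairs} &\le 559 \times 2 \times 20 = 22{,}360
+ && (\text{reported: } 21{,}544), \\
+ \text{edge pairs} &\le 486 \times 2 \times 5 = 4{,}860
+ && (\text{reported: } 4{,}855), \\
+ \text{duplicates removed} &= 21{,}544 + 4{,}855 - 25{,}121 = 1{,}278, \\
+ \text{training split} &= 0.9 \times 25{,}121 \approx 22{,}609 .
+ \end{align*}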
286
+
287
+
288
+ \section{Fine-Tuning}
289
+ \label{sec:finetuning}
290
+
291
+ \subsection{Method}
292
+
293
+ We fine-tune the SPECTER2 base model (\texttt{allenai/specter2\_base},
294
+ 768-dim, SciBERT backbone) using the sentence-transformers framework
295
+ \cite{sbert}. Despite SPECTER2's poor off-the-shelf performance on our
296
+ benchmark, its pre-training on 6 million scientific citation triplets provides
297
+ a strong initialization for mathematical text --- the model already understands
298
+ scientific language structure, and we teach it mathematical concept semantics
299
+ on top.
300
+
301
+ \paragraph{Loss function.}
302
+ We use \texttt{MultipleNegativesRankingLoss} (MNRL) wrapped in
303
+ \texttt{MatryoshkaLoss}. MNRL treats all other examples in a batch as
304
+ negatives, providing $B(B-1)$ negative comparisons per batch of size $B$
305
+ without explicit negative mining. MatryoshkaLoss computes the same contrastive
306
+ loss at multiple embedding truncation points (768, 512, 256, 128 dimensions),
307
+ training the model to frontload important information into the first
308
+ dimensions.
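+
+ As a sketch of the objective, under the assumption that similarity is a
+ scaled cosine similarity (the scale $\tau$ and the notation below are
+ introduced here for illustration): let $(a_i, p_i)_{i=1}^{B}$ be the
+ anchor--chunk pairs in a batch, $f$ the encoder, and $f^{(d)}$ its output
+ truncated to the first $d$ dimensions. The in-batch loss at dimension $d$ is
+ \begin{equation*}
+ \mathcal{L}_d = -\frac{1}{B} \sum_{i=1}^{B} \log
+ \frac{\exp\bigl(\tau \cos(f^{(d)}(a_i), f^{(d)}(p_i))\bigr)}
+ {\sum_{j=1}^{B} \exp\bigl(\tau \cos(f^{(d)}(a_i), f^{(d)}(p_j))\bigr)},
+ \end{equation*}
+ so each anchor $a_i$ is contrasted against the $B-1$ chunks $p_j$ with
+ $j \neq i$, giving $B(B-1)$ negative comparisons in total. The Matryoshka
+ objective then combines the truncation points, here assumed to be equally
+ weighted:
+ \begin{equation*}
+ \mathcal{L} = \sum_{d \in \{768,\, 512,\, 256,\, 128\}} \mathcal{L}_d .
+ \end{equation*}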
309
+
310
+ \paragraph{Training details.}
311
+ \begin{itemize}[nosep]
312
+ \item Micro-batch size: 8, with gradient accumulation over 4 steps
313
+ (effective batch size 32, yielding 56 in-batch negative comparisons
314
+ per micro-batch)
315
+ \item Max sequence length: 256 tokens (truncating longer chunks)
316
+ \item Learning rate: $2 \times 10^{-5}$ with 10\% linear warmup
317
+ \item Epochs: 3 (2,118 optimization steps)
318
+ \item Duplicate-free batch sampling to maximize negative diversity
319
+ \item Final model taken after epoch 3 (training loss decreased
320
+ from $\sim$11 to $\sim$5)
321
+ \item Hardware: Apple M-series GPU (MPS backend), $\sim$4 hours wall time
322
+ \end{itemize}
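+
+ The figures in this list are mutually consistent with the 22,609 training
+ pairs in Table~\ref{tab:dataset}, as the following arithmetic shows. It
+ assumes that the step count is obtained by integer division of micro-batches
+ by accumulation steps, which is not stated explicitly above.
+ \begin{align*}
+ \text{negatives per micro-batch} &= 8 \times (8 - 1) = 56, \\
+ \text{effective batch size} &= 8 \times 4 = 32, \\
+ \text{micro-batches per epoch} &= \lceil 22{,}609 / 8 \rceil = 2{,}827, \\
+ \text{optimizer steps per epoch} &= \lfloor 2{,}827 / 4 \rfloor = 706, \\
+ \text{total optimizer steps} &= 3 \times 706 = 2{,}118 .
+ \end{align*}
+ Because in-batch negatives are drawn within each micro-batch, gradient
+ accumulation enlarges the effective batch size but leaves the number of
+ negatives per anchor at $8 - 1 = 7$.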
323
+
324
+ \subsection{Results}
325
+
326
+ \begin{table}[h]
327
+ \centering
328
+ \caption{Final comparison including fine-tuned model. All models evaluated
329
+ on 108 queries over 4,794 chunks. Best results in bold.}
330
+ \label{tab:final}
331
+ \begin{tabular}{lcccccc}
332
+ \toprule
333
+ Model & Dim & R@5 & R@10 & R@20 & MRR & NDCG@10 \\
334
+ \midrule
335
+ \texttt{openai-small} & 1536 & 0.010 & 0.019 & 0.037 & 0.461 & 0.324 \\
336
+ SPECTER2 (proximity) & 768 & 0.007 & 0.013 & 0.024 & 0.360 & 0.225 \\
337
+ SciNCL & 768 & 0.006 & 0.012 & 0.024 & 0.306 & 0.205 \\
338
+ \midrule
339
+ Math-Embed (ours) & 768 & \textbf{0.030} & \textbf{0.058} & \textbf{0.111} & \textbf{0.816} & \textbf{0.736} \\
340
+ \bottomrule
341
+ \end{tabular}
342
+ \end{table}
343
+
344
+ Our fine-tuned model outperforms all baselines by a wide margin.
345
+ MRR improves from 0.461 (OpenAI) to \textbf{0.816}, a 77\% relative
346
+ improvement; the harmonic-mean rank of the first relevant result ($1/\mathrm{MRR}$) drops
347
+ from $\sim$2.2 to $\sim$1.2. NDCG@10 more than doubles from
348
+ 0.324 to 0.736, and Recall@20 triples from 0.037 to 0.111.
349
+
350
+ Remarkably, the fine-tuned model uses half the embedding dimensions (768
351
+ vs.\ 1536) of the OpenAI model yet dramatically outperforms it. The same
352
+ base model (SPECTER2) that trailed the OpenAI baseline (MRR 0.360) becomes
353
+ the best performer after fine-tuning, a 127\% improvement from the same
354
+ architecture with no additional parameters, demonstrating that the
355
+ knowledge-graph-derived training signal is highly effective.
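+
+ The relative figures quoted above follow directly from
+ Table~\ref{tab:final}:
+ \begin{align*}
+ \text{MRR vs.\ OpenAI:} \quad & 0.816 / 0.461 \approx 1.77
+ \quad (+77\%), \\
+ \text{MRR vs.\ SPECTER2:} \quad & 0.816 / 0.360 \approx 2.27
+ \quad (+127\%), \\
+ \text{NDCG@10 vs.\ OpenAI:} \quad & 0.736 / 0.324 \approx 2.27, \\
+ \text{Recall@20 vs.\ OpenAI:} \quad & 0.111 / 0.037 = 3.00 .
+ \end{align*}
+ Since MRR averages reciprocal ranks, $1/\mathrm{MRR}$ (here
+ $1/0.816 \approx 1.23$ and $1/0.461 \approx 2.17$) is a harmonic mean of
+ first-relevant ranks and therefore a lower bound on their arithmetic mean.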
356
+
357
+
358
+ \section{Discussion}
359
+
360
+ \subsection{Why scientific embedding models fail at math}
361
+
362
+ The poor performance of SPECTER2 and SciNCL --- models explicitly trained on
363
+ scientific literature --- highlights that \emph{scientific} training is not
364
+ the same as \emph{mathematical} training. These models learn paper-level
365
+ similarity from citation patterns: ``paper A cites paper B, so they should
366
+ embed nearby.'' But mathematical retrieval requires a different kind of
367
+ similarity: understanding that the text ``$\sum_{n=0}^{\infty}
368
+ \frac{q^{n^2}}{(q;q)_n}$'' is about the Rogers--Ramanujan identities, even
369
+ though it contains no occurrence of that phrase.
370
+
371
+ Standard tokenizers (BERT WordPiece) fragment mathematical notation into
372
+ meaningless subwords. Fine-tuning cannot fix the tokenizer, but it can teach
373
+ the model that certain patterns of subword tokens, when they appear together,
374
+ carry specific mathematical meaning.
375
+
376
+ \subsection{Knowledge graphs as supervision}
377
+
378
+ Our approach requires a knowledge graph, which itself requires an LLM
379
+ extraction step (GPT-4o-mini in our case). This may seem circular --- we use
380
+ an LLM to generate training data for a different model. But the key insight is
381
+ that these are \emph{complementary capabilities}:
382
+ \begin{itemize}[nosep]
383
+ \item The LLM excels at \emph{reading individual passages} and extracting
384
+ structured information (concepts, relationships), but is too slow
385
+ and expensive for real-time retrieval over thousands of chunks.
386
+ \item The embedding model excels at \emph{fast similarity search} over
387
+ large corpora, but needs training data to learn domain-specific
388
+ semantics.
389
+ \end{itemize}
390
+
391
+ The KG is a one-time cost that distills the LLM's understanding into a
392
+ reusable supervision signal.
393
+
394
+ \subsection{Generalizability}
395
+
396
+ The pipeline is not specific to algebraic combinatorics. Given:
397
+ \begin{enumerate}[nosep]
398
+ \item A corpus of mathematical papers (any subfield)
399
+ \item A knowledge graph over their concepts (extractable by LLM)
400
+ \end{enumerate}
401
+ the same code produces a domain-adapted embedding model. The fine-tuned model
402
+ should generalize to new papers in the same mathematical area, since it learns
403
+ \emph{concept semantics} rather than memorizing specific passages.
404
+
405
+
406
+ \section{Conclusion}
407
+
408
+ We demonstrated that general-purpose and scientific embedding models perform
409
+ poorly on mathematical concept retrieval, and presented a pipeline that
410
+ automatically generates contrastive training data from a knowledge graph to
411
+ fine-tune a domain-specific embedding model. Our approach requires no manual
412
+ annotation --- the knowledge graph provides the supervision signal --- and
413
+ produces a portable model that can be deployed in any RAG system for
414
+ mathematical literature.
415
+
416
+ Future work includes: (1) scaling to larger mathematical corpora spanning
417
+ multiple subfields, (2) incorporating mathematical notation awareness into
418
+ the tokenizer, and (3) exploring whether the fine-tuned model's understanding
419
+ of mathematical relationships transfers across subfields.
420
+
421
+
422
+ \begin{thebibliography}{10}
423
+
424
+ \bibitem{specter}
425
+ A.~Cohan, S.~Feldman, I.~Beltagy, D.~Downey, and D.~S.~Weld,
426
+ ``SPECTER: Document-level representation learning using citation-informed
427
+ transformers,'' in \emph{Proc.\ ACL}, 2020.
428
+
429
+ \bibitem{specter2}
430
+ A.~Singh, M.~D'Arcy, A.~Cohan, D.~Downey, and S.~Feldman,
431
+ ``SciRepEval: A multi-format benchmark for scientific document
432
+ representations,'' in \emph{Proc.\ EMNLP}, 2023.
433
+
434
+ \bibitem{scincl}
435
+ M.~Ostendorff, N.~Rethmeier, I.~Augenstein, B.~Gipp, and G.~Rehm,
436
+ ``Neighborhood contrastive learning for scientific document
437
+ representations with citation embeddings,'' in \emph{Proc.\ EMNLP}, 2022.
438
+
439
+ \bibitem{scibert}
440
+ I.~Beltagy, K.~Lo, and A.~Cohan,
441
+ ``SciBERT: A pretrained language model for scientific text,'' in
442
+ \emph{Proc.\ EMNLP}, 2019.
443
+
444
+ \bibitem{mathbert}
445
+ S.~Peng, K.~Yuan, L.~Gao, and Z.~Tang,
446
+ ``MathBERT: A pre-trained model for mathematical formula understanding,''
447
+ \emph{arXiv:2105.00377}, 2021.
448
+
449
+ \bibitem{sbert}
450
+ N.~Reimers and I.~Gurevych,
451
+ ``Sentence-BERT: Sentence embeddings using Siamese BERT-networks,'' in
452
+ \emph{Proc.\ EMNLP}, 2019.
453
+
454
+ \bibitem{matryoshka}
455
+ A.~Kusupati, G.~Bhatt, A.~Rege, M.~Wallingford, A.~Sinha, V.~Ramanujan,
456
+ W.~Howard-Snyder, K.~Chen, S.~Kakade, P.~Jain, and A.~Farhadi,
457
+ ``Matryoshka representation learning,'' in \emph{Proc.\ NeurIPS}, 2022.
458
+
459
+ \bibitem{kg-extraction}
460
+ Knowledge graph extraction via LLM-based concept and relationship
461
+ identification from scientific text, internal methodology.
462
+
463
+ \bibitem{rust-metrics}
464
+ Custom Rust implementation of batch kNN and IR metrics (Recall@$k$, MRR,
465
+ NDCG@$k$) with rayon parallelism and PyO3 Python bindings.
466
+
467
+ \end{thebibliography}
468
+
469
+ \end{document}