---
title: SciPaths
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# SciPaths

SciPaths runs an end-to-end target-contribution pathway pipeline for arXiv papers. It collects downstream citation evidence, derives target contributions from refined citation clusters, decomposes each target contribution into enabling contributions, and grounds those enabling contributions in prior studies.

The Hugging Face Space launches the Streamlit app from `hf_space/streamlit_app.py`.

## Citation

If you find this useful, please cite our paper as:

```bibtex
@misc{chamoun2026scipathsforecastingpathwaysscientific,
      title={SciPaths: Forecasting Pathways to Scientific Discovery}, 
      author={Eric Chamoun and Yizhou Chi and Yulong Chen and Rui Cao and Zifeng Ding and Michalis Korakakis and Andreas Vlachos},
      year={2026},
      eprint={2605.14600},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2605.14600}, 
}
```

Paper URL: https://arxiv.org/abs/2605.14600

## Required Secrets

Set this in the Space settings before publishing:

```text
GEMINI_API_KEY=<Google Gemini API key>
```

Optional, for saving completed run artifacts to a Hugging Face Dataset:

```text
HF_WRITE_TOKEN=<Hugging Face write token>
RUNS_REPO_ID=<owner/dataset-name>
RUNS_REPO_TYPE=dataset
```
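When all three are set, completed run artifacts can be pushed to the configured dataset repo. As an illustration only, here is a minimal sketch that uses `huggingface_hub` directly rather than the app's own upload code (the job id is hypothetical, and the folder path follows the run layout described below):

```python
import os
from huggingface_hub import HfApi

# Sketch: push one completed run directory to the configured dataset repo.
# Assumes the env vars above are set; the app's actual upload logic may differ.
api = HfApi(token=os.environ["HF_WRITE_TOKEN"])
job_id = "example-job"  # hypothetical job id

api.upload_folder(
    folder_path=f"hf_space/runs/{job_id}",
    repo_id=os.environ["RUNS_REPO_ID"],
    repo_type=os.environ.get("RUNS_REPO_TYPE", "dataset"),
    path_in_repo=f"runs/{job_id}",
)
```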

Optional, for higher Semantic Scholar limits:

```text
SEMANTIC_SCHOLAR_API_KEY=<Semantic Scholar API key>
```

## Run The Demo Locally

```bash
pip install -r requirements.txt
streamlit run hf_space/streamlit_app.py
```

Then enter an arXiv URL or ID, for example:

```text
https://arxiv.org/abs/2211.08788
```

The app writes each run under:

```text
hf_space/runs/<job_id>/
```
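To enumerate completed jobs programmatically, a minimal sketch that assumes only this layout:

```python
from pathlib import Path

# List the job directories created by the Streamlit app.
runs_root = Path("hf_space/runs")
for job_dir in sorted(runs_root.iterdir()):
    if job_dir.is_dir():
        print(job_dir.name)
```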

## Run One Example From The Command Line

This example stores all intermediate files under `runs/example/processed_papers`.

```bash
mkdir -p runs/example
printf '[{"id":"2211.08788","title":"","id_type":"ArXiv"}]\n' > runs/example/input_ids.json

python src/step_01_fetch/fetch_metadata.py \
  --ids runs/example/input_ids.json \
  --outdir runs/example/processed_papers

python src/step_02_mark_citations/replace_citation_markers.py \
  --root runs/example/processed_papers

python src/step_03_usage_contexts/build_usage_contexts.py \
  --root runs/example/processed_papers \
  --out-name usage_contexts.json

python src/step_04_label_citations/label_citation_functions.py \
  --root runs/example/processed_papers \
  --model-path Deep-Citation/Workspace/acl_scicite_wksp_trl/best_model.pt \
  --model-data-dir Deep-Citation/Data \
  --model-class-def Deep-Citation/Data/class_def.json \
  --model-lm scibert \
  --device cpu

python src/step_05_verify_uses_extends/verify_uses_extends.py \
  --root runs/example/processed_papers \
  --k 0 \
  --batch-size 25

python src/step_06_extract_paragraphs/extract_arxiv_paragraphs.py \
  --root runs/example/processed_papers

python src/step_07_extract_and_refine/extract_contributions_from_citations.py \
  --root runs/example/processed_papers

python src/step_07_extract_and_refine/refine_and_filter_clusters_llm.py \
  --root runs/example/processed_papers \
  --inplace \
  --overwrite

PYTHONPATH=src \
python -m step_08_annotation.cli run \
  --paper-dir runs/example/processed_papers/2211.08788 \
  --provider gemini \
  --model gemini/gemini-3.1-pro-preview \
  --formatter-model gemini/gemini-3.1-pro-preview \
  --judge-model gemini/gemini-3.1-pro-preview \
  --candidate-count 3 \
  --output-root runs/example/two_pass_outputs
```

The final UI payload is written as `pass_2_ui_payload.json` inside the annotation run directory printed by the last command.
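To take a quick look at the result, you can search the output root for the payload and inspect its top-level structure; a minimal sketch (the payload schema is whatever step 8 emits, so this only prints keys):

```python
import json
from pathlib import Path

# Find the final UI payload(s) under the --output-root used above.
for path in sorted(Path("runs/example/two_pass_outputs").rglob("pass_2_ui_payload.json")):
    payload = json.loads(path.read_text())
    # The schema is defined by step 8; here we only peek at the top level.
    keys = sorted(payload) if isinstance(payload, dict) else f"list of {len(payload)} items"
    print(path, "->", keys)
```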

## Run Each Step On A Set Of Papers

Create an ID file with one entry per paper:

```json
[
  {"id": "2211.08788", "title": "", "id_type": "ArXiv"},
  {"id": "2311.14919", "title": "", "id_type": "ArXiv"}
]
```
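You can also generate this file from a plain list of arXiv IDs; a minimal sketch that writes it straight to the expected path:

```python
import json
from pathlib import Path

# Emit the ID file in the exact format shown above.
arxiv_ids = ["2211.08788", "2311.14919"]
entries = [{"id": arxiv_id, "title": "", "id_type": "ArXiv"} for arxiv_id in arxiv_ids]

out = Path("runs/batch/input_ids.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(entries, indent=2) + "\n")
```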

With the ID file in place at `runs/batch/input_ids.json`, run:

```bash
mkdir -p runs/batch

# 1. Fetch metadata + LaTeX for each input paper.
python src/step_01_fetch/fetch_metadata.py \
  --ids runs/batch/input_ids.json \
  --outdir runs/batch/processed_papers

# 2. Add explicit citation markers to the target-paper text.
python src/step_02_mark_citations/replace_citation_markers.py \
  --root runs/batch/processed_papers

# 3. Build downstream citation usage contexts.
python src/step_03_usage_contexts/build_usage_contexts.py \
  --root runs/batch/processed_papers \
  --out-name usage_contexts.json

# 4. Label citation functions with the bundled Deep-Citation classifier.
python src/step_04_label_citations/label_citation_functions.py \
  --root runs/batch/processed_papers \
  --model-path Deep-Citation/Workspace/acl_scicite_wksp_trl/best_model.pt \
  --model-data-dir Deep-Citation/Data \
  --model-class-def Deep-Citation/Data/class_def.json \
  --model-lm scibert \
  --device cpu

# 5. Verify USES/EXTENDS citations with an LLM.
python src/step_05_verify_uses_extends/verify_uses_extends.py \
  --root runs/batch/processed_papers \
  --k 0 \
  --batch-size 25

# 6. Extract arXiv paragraphs from downstream citing papers.
python src/step_06_extract_paragraphs/extract_arxiv_paragraphs.py \
  --root runs/batch/processed_papers

# 7. Extract downstream contribution clusters, then merge/filter them.
python src/step_07_extract_and_refine/extract_contributions_from_citations.py \
  --root runs/batch/processed_papers

python src/step_07_extract_and_refine/refine_and_filter_clusters_llm.py \
  --root runs/batch/processed_papers \
  --inplace \
  --overwrite

# 8. Annotate each ready paper: target contributions, enabling contributions, and groundings.
for paper_dir in runs/batch/processed_papers/*; do
  [ -d "$paper_dir" ] || continue
  [ -f "$paper_dir/usage_discovery_from_contributions.json" ] || continue
  PYTHONPATH=src \
  python -m step_08_annotation.cli run \
    --paper-dir "$paper_dir" \
    --provider gemini \
    --model gemini/gemini-3.1-pro-preview \
    --formatter-model gemini/gemini-3.1-pro-preview \
    --judge-model gemini/gemini-3.1-pro-preview \
    --candidate-count 3 \
    --output-root runs/batch/two_pass_outputs
done
```

## Pipeline Steps

1. **Fetch metadata + LaTeX.** Downloads target-paper metadata, references, citing-paper metadata, and arXiv source where available.
2. **Add citation markers.** Inserts normalized citation markers into the target paper so downstream citation contexts can be aligned.
3. **Build usage contexts.** Collects text windows around downstream citations to the target paper (the sketch after this list illustrates the windowing idea).
4. **Label citation functions.** Uses the bundled Deep-Citation classifier to label citation contexts with categories such as background, use, extension, and comparison.
5. **Verify USES/EXTENDS.** Uses an LLM to check whether candidate downstream citations genuinely use or extend the target paper.
6. **Extract arXiv paragraphs.** Retrieves fuller paragraphs from citing papers so the system has enough context for contribution extraction.
7. **Extract and refine target-contribution clusters.** Extracts what downstream papers use the target paper for, clusters near-duplicates, and filters weak/non-usage evidence.
8. **Annotate pathways.** Derives target contributions from the refined clusters, decomposes each into enabling contributions, selects primary groundings, and records additional grounding studies.
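Step 3's windowing is the conceptual core of evidence collection. A minimal sketch of the general idea, not the repository's implementation (the marker format, window size, and function name are all assumptions):

```python
def citation_windows(text: str, marker: str, radius: int = 300):
    """Yield a character window around each occurrence of a citation marker.

    Illustrates step 3's idea only: the real pipeline aligns the markers
    inserted in step 2 and may window on sentences rather than characters.
    """
    start = 0
    while True:
        hit = text.find(marker, start)
        if hit == -1:
            return
        yield text[max(0, hit - radius): hit + len(marker) + radius]
        start = hit + len(marker)


# Hypothetical usage: the marker syntax here is an assumption, not the
# pipeline's actual format.
doc = "... as shown by [CITE:2211.08788], the method transfers well ..."
for window in citation_windows(doc, "[CITE:2211.08788]"):
    print(window)
```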

## Important Files

```text
hf_space/streamlit_app.py                  Streamlit UI
hf_space/runner.py                         Orchestrates steps 1-7 for the UI
hf_space/streamlit_config.py               Example papers and tab names
src/common/                                Shared LLM and paper-package utilities
src/step_01_fetch/                         Metadata, references, citations, and LaTeX
src/step_02_mark_citations/                Citation-marker insertion
src/step_03_usage_contexts/                Downstream usage-context construction
src/step_04_label_citations/               Deep-Citation citation-function labeling
src/step_05_verify_uses_extends/           LLM verification of USES/EXTENDS citations
src/step_06_extract_paragraphs/            ArXiv paragraph extraction from citing papers
src/step_07_extract_and_refine/            Contribution extraction and cluster refinement
src/step_08_annotation/                    Target/enabling contribution annotation and grounding
Deep-Citation/                             Bundled citation-function classifier assets
```