---
title: SciPaths
emoji: 🔬
colorFrom: blue
colorTo: indigo
sdk: docker
pinned: false
---

# SciPaths

SciPaths runs an end-to-end target-contribution pathway pipeline for arXiv papers. It collects downstream citation evidence, derives target contributions from refined citation clusters, decomposes each target contribution into enabling contributions, and grounds those enabling contributions in prior studies.

The Hugging Face Space launches the Streamlit app from `hf_space/streamlit_app.py`.

## Citation

If you find this useful, please cite our paper as:

```bibtex
@misc{chamoun2026scipathsforecastingpathwaysscientific,
  title={SciPaths: Forecasting Pathways to Scientific Discovery},
  author={Eric Chamoun and Yizhou Chi and Yulong Chen and Rui Cao and Zifeng Ding and Michalis Korakakis and Andreas Vlachos},
  year={2026},
  eprint={2605.14600},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2605.14600},
}
```

Paper URL: https://arxiv.org/abs/2605.14600

## Required Secrets

Set this in the Space settings before publishing:

```text
GEMINI_API_KEY=<Google Gemini API key>
```

Optional, for saving completed run artifacts to a Hugging Face Dataset:

```text
HF_WRITE_TOKEN=<Hugging Face write token>
RUNS_REPO_ID=<owner/dataset-name>
RUNS_REPO_TYPE=dataset
```

Optional, for higher Semantic Scholar rate limits:

```text
SEMANTIC_SCHOLAR_API_KEY=<Semantic Scholar API key>
```

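As a quick pre-run sanity check, a sketch like the following (a hypothetical helper, not part of the repo) can confirm which of these variables are set before launching:

```python
import os

# Variable names taken from the secrets listed above.
REQUIRED = ["GEMINI_API_KEY"]
OPTIONAL = ["HF_WRITE_TOKEN", "RUNS_REPO_ID", "RUNS_REPO_TYPE", "SEMANTIC_SCHOLAR_API_KEY"]

def check_secrets(env=os.environ):
    """Return (missing_required, missing_optional) variable names."""
    missing_req = [k for k in REQUIRED if not env.get(k)]
    missing_opt = [k for k in OPTIONAL if not env.get(k)]
    return missing_req, missing_opt

if __name__ == "__main__":
    missing_req, missing_opt = check_secrets()
    if missing_req:
        raise SystemExit(f"Missing required secrets: {', '.join(missing_req)}")
    if missing_opt:
        print(f"Note: optional secrets not set: {', '.join(missing_opt)}")
```

Only `GEMINI_API_KEY` aborts the run here; the optional variables just disable their corresponding features.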
## Run The Demo Locally

```bash
pip install -r requirements.txt
streamlit run hf_space/streamlit_app.py
```

Then enter an arXiv URL or ID, for example:

```text
https://arxiv.org/abs/2211.08788
```

The app writes each run under:

```text
hf_space/runs/<job_id>/
```

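Since both an arXiv URL and a bare ID are accepted, a minimal normalization sketch (an illustration, not the app's actual parsing code) could reduce either form to the bare ID:

```python
import re

# Matches a modern arXiv ID such as 2211.08788, optionally with a version suffix.
ARXIV_ID_RE = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?$")

def normalize_arxiv_id(text: str) -> str:
    """Extract a bare arXiv ID from an ID or an abs/pdf URL; raise if none is found."""
    text = text.strip().rstrip("/")
    if text.endswith(".pdf"):
        text = text[: -len(".pdf")]
    match = ARXIV_ID_RE.search(text)
    if not match:
        raise ValueError(f"Not a recognizable arXiv ID or URL: {text!r}")
    return match.group(1)
```

For example, `normalize_arxiv_id("https://arxiv.org/abs/2211.08788")` and `normalize_arxiv_id("2211.08788")` both yield `"2211.08788"`.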
## Run One Example From The Command Line

This example stores all intermediate files under `runs/example/processed_papers`.

```bash
mkdir -p runs/example
printf '[{"id":"2211.08788","title":"","id_type":"ArXiv"}]\n' > runs/example/input_ids.json

python src/step_01_fetch/fetch_metadata.py \
  --ids runs/example/input_ids.json \
  --outdir runs/example/processed_papers

python src/step_02_mark_citations/replace_citation_markers.py \
  --root runs/example/processed_papers

python src/step_03_usage_contexts/build_usage_contexts.py \
  --root runs/example/processed_papers \
  --out-name usage_contexts.json

python src/step_04_label_citations/label_citation_functions.py \
  --root runs/example/processed_papers \
  --model-path Deep-Citation/Workspace/acl_scicite_wksp_trl/best_model.pt \
  --model-data-dir Deep-Citation/Data \
  --model-class-def Deep-Citation/Data/class_def.json \
  --model-lm scibert \
  --device cpu

python src/step_05_verify_uses_extends/verify_uses_extends.py \
  --root runs/example/processed_papers \
  --k 0 \
  --batch-size 25

python src/step_06_extract_paragraphs/extract_arxiv_paragraphs.py \
  --root runs/example/processed_papers

python src/step_07_extract_and_refine/extract_contributions_from_citations.py \
  --root runs/example/processed_papers

python src/step_07_extract_and_refine/refine_and_filter_clusters_llm.py \
  --root runs/example/processed_papers \
  --inplace \
  --overwrite

PYTHONPATH=src \
python -m step_08_annotation.cli run \
  --paper-dir runs/example/processed_papers/2211.08788 \
  --provider gemini \
  --model gemini/gemini-3.1-pro-preview \
  --formatter-model gemini/gemini-3.1-pro-preview \
  --judge-model gemini/gemini-3.1-pro-preview \
  --candidate-count 3 \
  --output-root runs/example/two_pass_outputs
```

The final UI payload is written as `pass_2_ui_payload.json` inside the annotation run directory printed by the last command.

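If the printed run directory has scrolled out of view, a small lookup sketch (a hypothetical helper, assuming only the output root and the `pass_2_ui_payload.json` filename stated above) finds the most recently written payload:

```python
from pathlib import Path

def latest_ui_payload(output_root: str) -> Path:
    """Return the most recently modified pass_2_ui_payload.json under output_root."""
    payloads = sorted(
        Path(output_root).rglob("pass_2_ui_payload.json"),
        key=lambda p: p.stat().st_mtime,
    )
    if not payloads:
        raise FileNotFoundError(f"No pass_2_ui_payload.json under {output_root}")
    return payloads[-1]
```

For instance, `latest_ui_payload("runs/example/two_pass_outputs")` returns the payload from the latest annotation run.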
## Run Each Step On A Set Of Papers

Create an ID file with one entry per paper:

```json
[
  {"id": "2211.08788", "title": "", "id_type": "ArXiv"},
  {"id": "2311.14919", "title": "", "id_type": "ArXiv"}
]
```

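For longer paper lists, the same file can be generated from bare IDs; a minimal sketch (a hypothetical helper, using only the entry schema shown above):

```python
import json

def write_input_ids(arxiv_ids, path):
    """Write a SciPaths ID file for a list of bare arXiv IDs."""
    entries = [{"id": i, "title": "", "id_type": "ArXiv"} for i in arxiv_ids]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f, indent=2)
```

For example, `write_input_ids(["2211.08788", "2311.14919"], "runs/batch/input_ids.json")` produces the file shown above.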
Save it as `runs/batch/input_ids.json`, then run:

```bash
mkdir -p runs/batch

# 1. Fetch metadata + LaTeX for each input paper.
python src/step_01_fetch/fetch_metadata.py \
  --ids runs/batch/input_ids.json \
  --outdir runs/batch/processed_papers

# 2. Add explicit citation markers to the target-paper text.
python src/step_02_mark_citations/replace_citation_markers.py \
  --root runs/batch/processed_papers

# 3. Build downstream citation usage contexts.
python src/step_03_usage_contexts/build_usage_contexts.py \
  --root runs/batch/processed_papers \
  --out-name usage_contexts.json

# 4. Label citation functions with the bundled Deep-Citation classifier.
python src/step_04_label_citations/label_citation_functions.py \
  --root runs/batch/processed_papers \
  --model-path Deep-Citation/Workspace/acl_scicite_wksp_trl/best_model.pt \
  --model-data-dir Deep-Citation/Data \
  --model-class-def Deep-Citation/Data/class_def.json \
  --model-lm scibert \
  --device cpu

# 5. Verify USES/EXTENDS citations with an LLM.
python src/step_05_verify_uses_extends/verify_uses_extends.py \
  --root runs/batch/processed_papers \
  --k 0 \
  --batch-size 25

# 6. Extract arXiv paragraphs from downstream citing papers.
python src/step_06_extract_paragraphs/extract_arxiv_paragraphs.py \
  --root runs/batch/processed_papers

# 7. Extract downstream contribution clusters, then merge/filter them.
python src/step_07_extract_and_refine/extract_contributions_from_citations.py \
  --root runs/batch/processed_papers

python src/step_07_extract_and_refine/refine_and_filter_clusters_llm.py \
  --root runs/batch/processed_papers \
  --inplace \
  --overwrite

# 8. Annotate each ready paper: target contributions, enabling contributions, and groundings.
for paper_dir in runs/batch/processed_papers/*; do
  [ -d "$paper_dir" ] || continue
  [ -f "$paper_dir/usage_discovery_from_contributions.json" ] || continue
  PYTHONPATH=src \
  python -m step_08_annotation.cli run \
    --paper-dir "$paper_dir" \
    --provider gemini \
    --model gemini/gemini-3.1-pro-preview \
    --formatter-model gemini/gemini-3.1-pro-preview \
    --judge-model gemini/gemini-3.1-pro-preview \
    --candidate-count 3 \
    --output-root runs/batch/two_pass_outputs
done
```

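The step-8 loop skips papers whose step-7 output is missing; the same readiness check can be mirrored in Python to see which papers will be annotated before launching the loop (an illustrative sketch, relying only on the `usage_discovery_from_contributions.json` filename used in the loop above):

```python
from pathlib import Path

def ready_paper_dirs(root: str):
    """Yield per-paper directories that contain the step-7 output step 8 requires."""
    for paper_dir in sorted(Path(root).iterdir()):
        marker = paper_dir / "usage_discovery_from_contributions.json"
        if paper_dir.is_dir() and marker.is_file():
            yield paper_dir
```

Running `list(ready_paper_dirs("runs/batch/processed_papers"))` after step 7 shows exactly which papers step 8 will process.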
## Pipeline Steps

1. **Fetch metadata + LaTeX.** Downloads target-paper metadata, references, citing-paper metadata, and arXiv source where available.
2. **Add citation markers.** Inserts normalized citation markers into the target paper so downstream citation contexts can be aligned.
3. **Build usage contexts.** Collects text windows around downstream citations to the target paper.
4. **Label citation functions.** Uses the bundled Deep-Citation classifier to label citation contexts as background, use, extension, comparison, and related categories.
5. **Verify USES/EXTENDS.** Uses an LLM to check whether candidate downstream citations genuinely use or extend the target paper.
6. **Extract arXiv paragraphs.** Retrieves fuller paragraphs from citing papers so the system has enough context for contribution extraction.
7. **Extract and refine target-contribution clusters.** Extracts what downstream papers use the target paper for, clusters near-duplicates, and filters weak or non-usage evidence.
8. **Annotate pathways.** Derives target contributions from the refined clusters, decomposes each into enabling contributions, selects primary groundings, and records additional grounding studies.

## Important Files

```text
hf_space/streamlit_app.py          Streamlit UI
hf_space/runner.py                 Orchestrates steps 1-7 for the UI
hf_space/streamlit_config.py       Example papers and tab names
src/common/                        Shared LLM and paper-package utilities
src/step_01_fetch/                 Metadata, references, citations, and LaTeX
src/step_02_mark_citations/        Citation-marker insertion
src/step_03_usage_contexts/        Downstream usage-context construction
src/step_04_label_citations/       Deep-Citation citation-function labeling
src/step_05_verify_uses_extends/   LLM verification of USES/EXTENDS citations
src/step_06_extract_paragraphs/    ArXiv paragraph extraction from citing papers
src/step_07_extract_and_refine/    Contribution extraction and cluster refinement
src/step_08_annotation/            Target/enabling contribution annotation and grounding
Deep-Citation/                     Bundled citation-function classifier assets
```