# GLEN: Generative Retrieval via Lexical Index Learning (EMNLP 2023)
This is the official code for the EMNLP 2023 paper "[GLEN: Generative Retrieval via Lexical Index Learning](https://arxiv.org/abs/2311.03057)".
## Overview
GLEN (**G**enerative retrieval via **LE**xical i**N**dex learning) is a generative retrieval model that learns to dynamically assign lexical identifiers using a two-phase index learning strategy.

The poster and slide files are available here: [poster](assets/glen_poster.pdf) and [slide](assets/glen_slide.pdf). We also provide a blog post (in Korean) [here](https://dial.skku.edu/blog/2023_glen). Please refer to the paper for more details: [arXiv](https://arxiv.org/abs/2311.03057) or [ACL Anthology](https://aclanthology.org/2023.emnlp-main.477/).
## Environment
We have confirmed that the results are reproduced successfully in `python==3.8.12`, `transformers==4.15.0`, `pytorch==1.10.0` with `cuda 12.0`. Please create a conda environment and install the required packages with `requirements.txt`.
```
# Clone this repo
git clone https://github.com/skleee/GLEN.git
cd GLEN
# Set conda environment
conda create -n glen python=3.8
conda activate glen
# Install tevatron as editable
pip install --editable .
# Install dependencies
pip install -r requirements.txt
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
```
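Before training, it can help to confirm that the environment roughly matches the versions above. A minimal sanity check (it only reports what is installed and assumes nothing beyond the package names):

```python
# Report installed versions of the packages listed above.
# This only inspects the environment; it does not modify anything.
import importlib.metadata as md
import sys

def installed_version(pkg: str) -> str:
    """Return the installed version of `pkg`, or 'not installed'."""
    try:
        return md.version(pkg)
    except md.PackageNotFoundError:
        return "not installed"

print("python:", sys.version.split()[0])
for pkg in ("transformers", "torch"):
    print(f"{pkg}:", installed_version(pkg))
```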
Optionally, you can also install [GradCache](https://github.com/luyug/GradCache) to enable the gradient cache feature when training **ranking-based ID refinement**:
```
git clone https://github.com/luyug/GradCache
cd GradCache
pip install .
```
## Dataset
Datasets can be downloaded from: [NQ320k](https://drive.google.com/drive/folders/1qYV-kAUpSDKkzvcy36pSoelTbvsiZtcQ?usp=sharing), [MS MARCO Passage Ranking set](https://drive.google.com/drive/folders/1rErON3bK0-_DeNCSQUHxcSkewSIs5c2r?usp=sharing), [BEIR](https://drive.google.com/drive/folders/1bBNnqbEPOQ5ic1ybiVULAd8meZXA4pqC?usp=sharing).
After downloading each folder, unzip it into the `data` folder. The structure of each folder is as follows.
```
data
├── BEIR_dataset
│   ├── arguana
│   └── nfcorpus
├── nq320k
└── marco_passage
```
- For NQ320k, we follow the same data preprocessing as [NCI](https://github.com/solidsea98/Neural-Corpus-Indexer-NCI) and the setup in [GENRET](https://github.com/sunnweiwei/GenRet), splitting the test set into two subsets: *seen test* and *unseen test*.
- For MS MARCO passage ranking set, we use the official development set consisting of 6,980 queries with a **full corpus**, i.e., 8.8M passages.
- For BEIR, we evaluate the model on Arguana and NFCorpus; the evaluation code is based on [BEIR](https://github.com/beir-cellar/beir).
- Further details are described in the paper.
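To make the *seen*/*unseen* split concrete: a test query counts as *seen* if its gold document also appears as a training target, and *unseen* otherwise. A toy sketch of that partition (the (query, doc_id) pairs here are invented for illustration and do not reflect the actual NQ320k files):

```python
# Toy illustration of a seen/unseen test split: a test query is "seen"
# if its gold document id was used as a training target, else "unseen".
# All pairs below are made up for illustration only.
train_pairs = [("who wrote hamlet", "d1"), ("capital of france", "d2")]
test_pairs = [("hamlet author", "d1"), ("tallest mountain", "d3")]

train_docs = {doc_id for _, doc_id in train_pairs}
seen_test = [(q, d) for q, d in test_pairs if d in train_docs]
unseen_test = [(q, d) for q, d in test_pairs if d not in train_docs]

print(len(seen_test), len(unseen_test))  # 1 1
```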
## Training
The training process consists of two phases: **(1) Keyword-based ID assignment** and **(2) Ranking-based ID refinement**. In the `/examples` folder, we provide GLEN code for each phase: `glen_phase1`, `glen_phase2`. Please refer to `src/tevatron` for the trainer.
Run the scripts to train GLEN from scratch for NQ320k or MS MARCO.
### NQ320k
```
# (1) Keyword-based ID assignment
sh scripts/train_glen_p1_nq.sh
```
```
# (2) Ranking-based ID refinement
sh scripts/train_glen_p2_nq.sh
```
### MS MARCO
```
# (1) Keyword-based ID assignment
sh scripts/train_glen_p1_marco.sh
```
```
# (2) Ranking-based ID refinement
sh scripts/train_glen_p2_marco.sh
```
You can directly download our trained checkpoints for each stage from the following links: [NQ320k](https://drive.google.com/drive/folders/1ERopkRAJf7Ea-r_nJWoeaZFUp7e54eok?usp=sharing), [MS MARCO](https://drive.google.com/drive/folders/1mp4HIIbKnohNizLccaNFkJVMS-pJl_6T?usp=sharing).
## Evaluation
The evaluation process consists of two stages: **(1) Document processing**, which generates document identifiers, and **(2) Query processing**, which runs inference over queries.

Run the scripts to evaluate GLEN for each dataset.
### NQ320k
```
sh scripts/eval_make_docid_glen_nq.sh
sh scripts/eval_inference_query_glen_nq.sh
```
### MS MARCO
```
sh scripts/eval_make_docid_glen_marco.sh
sh scripts/eval_inference_query_glen_marco.sh
```
### BEIR
```
# Arguana
sh scripts/eval_make_docid_glen_arguana.sh
sh scripts/eval_inference_query_glen_arguana.sh
```
```
# NFCorpus
sh scripts/eval_make_docid_glen_nfcorpus.sh
sh scripts/eval_inference_query_glen_nfcorpus.sh
```
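The inference scripts above produce a ranked document list per query. As a reference for how the usual retrieval metrics are computed from such rankings (a generic sketch with toy data, not this repo's own evaluation code):

```python
# Generic MRR@k / Recall@k over ranked id lists; the data below is a toy
# example, not the output format of the scripts above.
def mrr_at_k(ranked, gold, k=10):
    """Mean reciprocal rank of the first relevant doc within the top k."""
    total = 0.0
    for ids, g in zip(ranked, gold):
        for rank, doc in enumerate(ids[:k], start=1):
            if doc in g:
                total += 1.0 / rank
                break
    return total / len(ranked)

def recall_at_k(ranked, gold, k=10):
    """Fraction of queries with at least one relevant doc in the top k."""
    return sum(1 for ids, g in zip(ranked, gold) if g & set(ids[:k])) / len(ranked)

ranked = [["d3", "d1", "d9"], ["d5", "d6", "d7"]]
gold = [{"d1"}, {"d8"}]
print(mrr_at_k(ranked, gold))     # 0.25  (1/2 for query 1, 0 for query 2)
print(recall_at_k(ranked, gold))  # 0.5
```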
## Acknowledgement
Our code is mainly based on [Tevatron](https://github.com/texttron/tevatron). Also, we learned a lot from [NCI](https://github.com/solidsea98/Neural-Corpus-Indexer-NCI), [Transformers](https://github.com/huggingface/transformers), and [BEIR](https://github.com/beir-cellar/beir). We appreciate all the authors for sharing their codes.
## Citation
If you find this work useful for your research, please cite our paper:
```
@inproceedings{lee-etal-2023-glen,
title = "{GLEN}: Generative Retrieval via Lexical Index Learning",
author = "Lee, Sunkyung and
Choi, Minjin and
Lee, Jongwuk",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.477",
doi = "10.18653/v1/2023.emnlp-main.477",
pages = "7693--7704",
}
```
## Contacts
For any questions, please contact the following authors via email or feel free to open an issue 😊
- Sunkyung Lee sk1027@skku.edu
- Minjin Choi zxcvxd@skku.edu
# GLEN Model for The Vault Dataset
This repository contains an implementation of the GLEN (Generative Retrieval via Lexical Index Learning) model for document retrieval and query processing on The Vault dataset.
## Table of Contents
- [Prerequisites](#prerequisites)
- [Environment Setup](#environment-setup)
- [Data Preparation](#data-preparation)
- [Quick Testing](#quick-testing)
- [Full Training](#full-training)
- [Model Evaluation](#model-evaluation)
- [Troubleshooting](#troubleshooting)
## Prerequisites
- Python 3.8 or higher
- CUDA-capable GPU (recommended) or CPU
- Git
- pip (Python package manager)
## Environment Setup
1. Clone the repository:
```bash
git clone
cd GLEN-model
```
2. Create and activate a virtual environment:
```bash
# Windows
python -m venv .env
.env\Scripts\activate
# Linux/Mac
python -m venv .env
source .env/bin/activate
```
3. Install required packages:
```bash
pip install -r requirements.txt
```
4. Create necessary directories:
```bash
mkdir -p logs/test_glen_vault
mkdir -p data/the_vault
```
## Data Preparation
1. Place your dataset in the `the_vault_dataset` directory:
```
the_vault_dataset/
├── DOC_VAULT_train.tsv
├── GTQ_VAULT_train.tsv
└── GTQ_VAULT_dev.tsv
```
2. Run data preprocessing:
```bash
python scripts/preprocess_vault_dataset.py \
--input_dir the_vault_dataset/ \
--output_dir data/the_vault/ \
--sample_size 1000 \
--create_test_set
```
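A hedged sketch of what `--sample_size`-style subsampling could look like internally. The two-column TSV layout and the flag's semantics are assumptions for illustration, not taken from the actual `preprocess_vault_dataset.py`:

```python
# Illustrative TSV subsampling, roughly what a --sample_size option might do.
# The "doc_id<TAB>text" two-column layout is an assumption for this sketch.
import csv
import os
import random
import tempfile

def sample_tsv_rows(path, sample_size, seed=42):
    """Read a TSV and return a reproducible random sample of its rows."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    random.seed(seed)
    return random.sample(rows, min(sample_size, len(rows)))

# Tiny self-contained demo: write 5 fake rows, sample 3 of them.
tmp = tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False, newline="")
for i in range(5):
    tmp.write(f"d{i}\tdocument text {i}\n")
tmp.close()
sampled = sample_tsv_rows(tmp.name, 3)
print(len(sampled))  # 3
os.unlink(tmp.name)
```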
## Quick Testing
To test the model with a small dataset (1000 samples):
1. Run the test script:
```bash
bash scripts/test_small_training.sh
```
This script will:
- Preprocess a small subset of data
- Train Phase 1 (Document ID Assignment)
- Train Phase 2 (Ranking-based Refinement)
- Generate document IDs
- Run query inference
Expected output directories:
```
logs/test_glen_vault/
├── GLEN_P1_test/ # Phase 1 model
├── GLEN_P2_test/ # Phase 2 model
└── GLEN_P2_test_docids.tsv # Generated document IDs
```
## Full Training
To train the model on the complete dataset:
1. Run the full training script:
```bash
bash scripts/train_full_vault.sh
```
This script will:
- Use the entire dataset
- Train both phases with full parameters
- Generate document IDs for all documents
- Run comprehensive query inference
Expected output directories:
```
logs/glen_vault/
├── GLEN_P1/ # Phase 1 model
├── GLEN_P2/ # Phase 2 model
└── GLEN_P2_docids.tsv # Generated document IDs
```
## Model Evaluation
After training, you can evaluate the model:
1. For test results:
```bash
python examples/glen_phase2/evaluate_glen.py \
--model_name_or_path logs/glen_vault/GLEN_P2 \
--infer_dir logs/glen_vault/GLEN_P2 \
--dataset_name the_vault \
--docid_file_name GLEN_P2_docids \
--per_device_eval_batch_size 1 \
--q_max_len 32 \
--num_return_sequences 5 \
--logs_dir logs/glen_vault
```
## Troubleshooting
### Common Issues
1. **CUDA Out of Memory**:
   - Reduce batch sizes in the training scripts
   - Enable gradient accumulation
   - Use a smaller model (e.g., t5-small instead of t5-base)
2. **CPU Training is Slow**:
   - Reduce the dataset size for testing
   - Increase gradient accumulation steps
   - Use smaller batch sizes
3. **Missing Files**:
   - Ensure all required directories exist
   - Check file permissions
   - Verify that data preprocessing completed successfully
### Resource Requirements
Minimum recommended specifications:
- CPU: 8 cores
- RAM: 16GB
- GPU: 8GB VRAM (for full training)
- Storage: 10GB free space
### Performance Tips
1. For CPU-only training:
   - Use smaller batch sizes (1-2)
   - Increase gradient accumulation steps
   - Disable dataloader workers
2. For GPU training:
   - Adjust batch sizes based on GPU memory
   - Enable dataloader workers
   - Use mixed precision (FP16) training
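Gradient accumulation, mentioned in the tips above, sums gradients over several micro-batches before a single optimizer step, reducing peak memory at the cost of extra forward passes. A framework-free sketch of why averaged micro-batch gradients match the full-batch gradient (equal-size micro-batches assumed; toy data, not this repo's trainer):

```python
# Why gradient accumulation works: for mean-squared loss, averaging the
# gradients of equal-size micro-batches equals the full-batch gradient.
# Pure-Python demo with a one-parameter model y = w * x.
def grad_mse(w, batch):
    """d/dw of mean((w*x - y)^2) over the batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.0)]
w = 0.5

full = grad_mse(w, data)
# Accumulate over two equal-size micro-batches, then average.
micro = [data[:2], data[2:]]
accumulated = sum(grad_mse(w, b) for b in micro) / len(micro)

print(abs(full - accumulated) < 1e-12)  # True
```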
## Directory Structure
```
GLEN-model/
├── data/
│   └── the_vault/            # Processed dataset
├── examples/
│   ├── glen_phase1/          # Phase 1 implementation
│   └── glen_phase2/          # Phase 2 implementation
├── logs/
│   ├── test_glen_vault/      # Test run outputs
│   └── glen_vault/           # Full training outputs
├── scripts/
│   ├── preprocess_vault_dataset.py
│   ├── test_small_training.sh
│   └── train_full_vault.sh
├── .env/                     # Virtual environment
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
## License
[Add your license information here]
## Citation
[Add citation information here]