# GLEN: Generative Retrieval via Lexical Index Learning (EMNLP 2023)
This is the official code for the EMNLP 2023 paper "[GLEN: Generative Retrieval via Lexical Index Learning](https://arxiv.org/abs/2311.03057)".  


## Overview

GLEN (***G**enerative retrieval via **LE**xical ***N***dex learning*) is a generative retrieval model that learns to dynamically assign lexical identifiers using a two-phase index learning strategy.


![GLEN](assets/model.png)

The poster and the slide files are available at each link: [poster](assets/glen_poster.pdf) and [slide](assets/glen_slide.pdf). We also provide blog posts (Korean) at [here](https://dial.skku.edu/blog/2023_glen). Please refer to the paper for more details: [arXiv](https://arxiv.org/abs/2311.03057) or [ACL Anthology](https://aclanthology.org/2023.emnlp-main.477/).


## Environment
We have confirmed that the results are reproduced successfully in `python==3.8.12`, `transformers==4.15.0`, `pytorch==1.10.0` with `cuda 12.0`. Please create a conda environment and install the required packages with `requirements.txt`.
```
# Clone this repo
git clone https://github.com/skleee/GLEN.git
cd GLEN

# Set conda environment
conda create -n glen python=3.8
conda activate glen

# Install tevatron as editable
pip install --editable .

# Install dependencies 
pip install -r requirements.txt
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
```
Optionally, you can also install [GradCache](https://github.com/luyug/GradCache) to gradient cache feature during training **ranking-based ID refinement** by:
```
git clone https://github.com/luyug/GradCache
cd GradCache
pip install .
```

## Dataset
Datasets can be downloaded from: [NQ320k](https://drive.google.com/drive/folders/1qYV-kAUpSDKkzvcy36pSoelTbvsiZtcQ?usp=sharing), [MS MARCO Passage Ranking set](https://drive.google.com/drive/folders/1rErON3bK0-_DeNCSQUHxcSkewSIs5c2r?usp=sharing), [BEIR](https://drive.google.com/drive/folders/1bBNnqbEPOQ5ic1ybiVULAd8meZXA4pqC?usp=sharing).  
After downloading each folder, unzip it into the `data` folder. The structure of each folder is as follows.

```
data
├── BEIR_dataset
│   ├── arguana
│   └── nfcorpus
├── nq320k
└── marco_passage
```

- For NQ320k, we follow the same data preprocessing as [NCI](https://github.com/solidsea98/Neural-Corpus-Indexer-NCI) and the setup in [GENRET](https://github.com/sunnweiwei/GenRet), splitting the test set into two subsets; *seen test* and *unseen test*. 
- For MS MARCO passage ranking set, we use the official development set consisting of 6,980 queries with a **full corpus**, i.e., 8.8M passages. 
- For BEIR, we assess the model on Arguana and NFCorpus and the code is based on [BEIR](https://github.com/beir-cellar/beir).
- Further details are described in the paper.


## Training
The training process consists of two phases: **(1) Keyword-based ID assignment** and **(2) Ranking-based ID refinement**. In the `/examples` folder, we provide GLEN code for each phase: `glen_phase1`, `glen_phase2`.  Please refer to `src/tevatron` for the trainer.
Run the scripts to train GLEN from the scratch for NQ320k or MS MARCO.<br>

### NQ320k
```
# (1) Keyword-based ID assignment
sh scripts/train_glen_p1_nq.sh
```

```
# (2) Ranking-based ID refinement
sh scripts/train_glen_p2_nq.sh
```

### MS MARCO
```
# (1) Keyword-based ID assignment
sh scripts/train_glen_p1_marco.sh
```

```
# (2) Ranking-based ID refinement
sh scripts/train_glen_p2_marco.sh
```

You can directly download our trained checkpoints for each stage from the following link: [NQ320k](https://drive.google.com/drive/folders/1ERopkRAJf7Ea-r_nJWoeaZFUp7e54eok?usp=sharing), [MS MARCO](https://drive.google.com/drive/folders/1mp4HIIbKnohNizLccaNFkJVMS-pJl_6T?usp=sharing)


## Evaluation
The evaluation process consists of two stages: **(1) Document processing via making document identifiers** and **(2) Query processing via inference**.  

![GLEN](assets/evaluation.png)
Run the scripts to evalute GLEN for each dataset.<br>

### NQ320k
```
sh scripts/eval_make_docid_glen_nq.sh
sh scripts/eval_inference_query_glen_nq.sh
```

### MS MARCO
```
sh scripts/eval_make_docid_glen_marco.sh
sh scripts/eval_inference_query_glen_marco.sh
```

### BEIR
```
# Arguana
sh scripts/eval_make_docid_glen_arguana.sh
sh scripts/eval_inference_query_glen_arguana.sh
```

```
# NFCorpus
sh scripts/eval_make_docid_glen_nfcorpus.sh
sh scripts/eval_inference_query_glen_nfcorpus.sh 
```

## Acknowledgement
Our code is mainly based on [Tevatron](https://github.com/texttron/tevatron). Also, we learned a lot from [NCI](https://github.com/solidsea98/Neural-Corpus-Indexer-NCI), [Transformers](https://github.com/huggingface/transformers), and [BEIR](https://github.com/beir-cellar/beir). We appreciate all the authors for sharing their codes.


## Citation
If you find this work useful for your research, please cite our paper:
```
@inproceedings{lee-etal-2023-glen,
    title = "{GLEN}: Generative Retrieval via Lexical Index Learning",
    author = "Lee, Sunkyung  and
      Choi, Minjin  and
      Lee, Jongwuk",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.477",
    doi = "10.18653/v1/2023.emnlp-main.477",
    pages = "7693--7704",
}
```

## Contacts
For any questions, please contact the following authors via email or feel free to open an issue 😊
- Sunkyung Lee sk1027@skku.edu
- Minjin Choi zxcvxd@skku.edu

# GLEN Model for The Vault Dataset

This repository contains the implementation of the GLEN (Generative Language ENcoder) model for document retrieval and query processing on The Vault dataset.

## Table of Contents
- [Prerequisites](#prerequisites)
- [Environment Setup](#environment-setup)
- [Data Preparation](#data-preparation)
- [Quick Testing](#quick-testing)
- [Full Training](#full-training)
- [Model Evaluation](#model-evaluation)
- [Troubleshooting](#troubleshooting)

## Prerequisites

- Python 3.8 or higher
- CUDA-capable GPU (recommended) or CPU
- Git
- pip (Python package manager)

## Environment Setup

1. Clone the repository:
```bash
git clone <repository-url>
cd GLEN-model
```

2. Create and activate a virtual environment:
```bash
# Windows
python -m venv .env
.env\Scripts\activate

# Linux/Mac
python -m venv .env
source .env/bin/activate
```

3. Install required packages:
```bash
pip install -r requirements.txt
```

4. Create necessary directories:
```bash
mkdir -p logs/test_glen_vault
mkdir -p data/the_vault
```

## Data Preparation

1. Place your dataset in the `the_vault_dataset` directory:
```
the_vault_dataset/
├── DOC_VAULT_train.tsv
├── GTQ_VAULT_train.tsv
└── GTQ_VAULT_dev.tsv
```

2. Run data preprocessing:
```bash
python scripts/preprocess_vault_dataset.py \
    --input_dir the_vault_dataset/ \
    --output_dir data/the_vault/ \
    --sample_size 1000 \
    --create_test_set
```

## Quick Testing

To test the model with a small dataset (1000 samples):

1. Run the test script:
```bash
bash scripts/test_small_training.sh
```

This script will:
- Preprocess a small subset of data
- Train Phase 1 (Document ID Assignment)
- Train Phase 2 (Ranking-based Refinement)
- Generate document IDs
- Run query inference

Expected output directories:
```
logs/test_glen_vault/
├── GLEN_P1_test/           # Phase 1 model
├── GLEN_P2_test/           # Phase 2 model
└── GLEN_P2_test_docids.tsv # Generated document IDs
```

## Full Training

To train the model on the complete dataset:

1. Run the full training script:
```bash
bash scripts/train_full_vault.sh
```

This script will:
- Use the entire dataset
- Train both phases with full parameters
- Generate document IDs for all documents
- Run comprehensive query inference

Expected output directories:
```
logs/glen_vault/
├── GLEN_P1/           # Phase 1 model
├── GLEN_P2/           # Phase 2 model
└── GLEN_P2_docids.tsv # Generated document IDs
```

## Model Evaluation

After training, you can evaluate the model:

1. For test results:
```bash
python examples/glen_phase2/evaluate_glen.py \
    --model_name_or_path logs/glen_vault/GLEN_P2 \
    --infer_dir logs/glen_vault/GLEN_P2 \
    --dataset_name the_vault \
    --docid_file_name GLEN_P2_docids \
    --per_device_eval_batch_size 1 \
    --q_max_len 32 \
    --num_return_sequences 5 \
    --logs_dir logs/glen_vault
```

## Troubleshooting

### Common Issues

1. **CUDA Out of Memory**:
   - Reduce batch sizes in the training scripts
   - Enable gradient accumulation
   - Use smaller model (e.g., t5-small instead of t5-base)

2. **CPU Training is Slow**:
   - Reduce dataset size for testing
   - Increase gradient accumulation steps
   - Use smaller batch sizes

3. **Missing Files**:
   - Ensure all required directories exist
   - Check file permissions
   - Verify data preprocessing completed successfully

### Resource Requirements

Minimum recommended specifications:
- CPU: 8 cores
- RAM: 16GB
- GPU: 8GB VRAM (for full training)
- Storage: 10GB free space

### Performance Tips

1. For CPU-only training:
   - Use smaller batch sizes (1-2)
   - Increase gradient accumulation steps
   - Disable dataloader workers
   - Use FP16 precision

2. For GPU training:
   - Adjust batch sizes based on GPU memory
   - Enable dataloader workers
   - Use mixed precision training

## Directory Structure

```
GLEN-model/
├── data/
│   └── the_vault/           # Processed dataset
├── examples/
│   ├── glen_phase1/        # Phase 1 implementation
│   └── glen_phase2/        # Phase 2 implementation
├── logs/
│   ├── test_glen_vault/    # Test run outputs
│   └── glen_vault/         # Full training outputs
├── scripts/
│   ├── preprocess_vault_dataset.py
│   ├── test_small_training.sh
│   └── train_full_vault.sh
├── .env/                   # Virtual environment
├── requirements.txt        # Python dependencies
└── README.md              # This file
```

## License

[Add your license information here]

## Citation

[Add citation information here]