---
metrics:
- perplexity
library_name: transformers
---

# **LABOR-LLM Replication Models**

This repository contains replication models for the paper **"LABOR-LLM: Language-Based Occupational Representations with Large Language Models"** (Athey et al., 2024).

Paper: https://arxiv.org/abs/2406.17972

These models are **Llama-2** checkpoints fine-tuned on longitudinal survey data (**NLSY79** and **NLSY97**) to predict labor market transitions. Tabular career histories are converted into text-based "resumes," allowing the models to leverage the semantic knowledge of LLMs and outperform traditional econometric benchmarks at predicting a worker's next occupation.

## **📦 Model Variants & Ablations**

The repository hosts **12 model checkpoints**. You must pass the `subfolder` argument to load the correct variant.

| Model Size | Dataset | Variant Type | Description |
| :---- | :---- | :---- | :---- |
| **7B / 13B** | NLSY79 / NLSY97 | `with_birth_year` | **Main Model.** Uses natural-language job titles and includes the worker's birth year in the context window to capture cohort/age effects. |
| **7B** | NLSY79 / NLSY97 | `numeric` | **Ablation Baseline.** Replaces natural-language job titles with unique numeric codes. Used to demonstrate that the LLM's performance drops when semantic job information is removed. |

**Note on PSID Models:** Models fine-tuned on the Panel Study of Income Dynamics (PSID) are not hosted here due to data licensing restrictions. They will be distributed through the Inter-university Consortium for Political and Social Research (ICPSR).

### **📝 Understanding Checkpoints (`ckpt_3` vs `ckpt_bo5`)**

* **`ckpt_3`**: The model with the lowest validation loss from a training run scheduled for **3 epochs**.
* **`ckpt_bo5`**: The model with the lowest validation loss from a training run scheduled for **5 epochs**.

**Note:** Because the learning rate schedule (e.g., warmup and decay steps) is computed from the total number of epochs, the first 3 epochs of the "5-epoch run" differ from the "3-epoch run." Therefore, `ckpt_bo5` is not simply a longer-trained version of `ckpt_3`; they represent distinct training trajectories. Empirically, we found that the `ckpt_bo5` models achieve better performance.

## **🚀 Usage**

**Crucial:** You cannot load a model using just the repository ID. You must provide the subfolder name corresponding to the specific variant you want to use.

### **Installation**

```bash
pip install transformers torch
```

### **Loading a Model**

You can load a model either directly from the Hugging Face Hub or by first downloading the checkpoints to your local disk (see the download sketch after the example below).

**Note:** For a complete demonstration of usage, please refer to the `demo_notebook.ipynb` included in this repository.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1. Select your variant
# Example: 7B model on NLSY79 with birth year data (best checkpoint from the 5-epoch run)
repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# 2. Load Tokenizer & Model
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder=variant,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 3. Inference
# The model expects a career history formatted as a text sequence.
# Example prompt format (varies by dataset; check the paper for exact templates):
input_text = (
    "Born in 1980. In 2002, works as a Waiter. "
    "In 2003, works as a"
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
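### **Downloading Checkpoints Locally**

If you prefer to keep the files on disk (for example, on a machine without Hub access at runtime), you can fetch a single variant with `huggingface_hub.snapshot_download` and point `from_pretrained` at the resulting folder. The sketch below is one way to do this; the local directory name is just an example.

```python
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForCausalLM

variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# Download only the files belonging to this variant.
# "./labor_llm_checkpoints" is an arbitrary example path.
local_root = snapshot_download(
    repo_id="tianyudu/LABOR_LLM",
    allow_patterns=[f"{variant}/*"],
    local_dir="./labor_llm_checkpoints",
)

# Load directly from the downloaded subfolder.
tokenizer = AutoTokenizer.from_pretrained(f"{local_root}/{variant}")
model = AutoModelForCausalLM.from_pretrained(f"{local_root}/{variant}")
```

### **Scoring Candidate Next Occupations**

Because the models are evaluated with perplexity-style metrics, a common way to use them is to compare the conditional likelihood of candidate job titles rather than relying on free-form generation. The sketch below is not the paper's exact evaluation pipeline: the prompt template and candidate titles are placeholders, and the slicing assumes the prompt tokenizes to the same ids with and without the continuation appended.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, subfolder=variant, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompt = "Born in 1980. In 2002, works as a Waiter. In 2003, works as a"
# Placeholder candidates; in practice, use the job-title vocabulary from the paper.
candidates = [" Waiter", " Cook", " Cashier"]

def continuation_log_likelihood(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities assigned to the continuation tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given all preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions belonging to the continuation (assumes the prompt
    # tokenizes identically with and without the continuation appended).
    n_prompt = prompt_ids.shape[1]
    return token_log_probs[0, n_prompt - 1:].sum().item()

scores = {c.strip(): continuation_log_likelihood(prompt, c) for c in candidates}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Higher (less negative) scores indicate occupations the fine-tuned model considers more likely as the next transition; dividing each score by its token count gives a per-token log-likelihood, which relates directly to perplexity.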
## **📂 Full List of Subfolders**

**Main Models (Natural Language + Birth Year)**

* `ft_7b_NLSY79_with_birth_year_ckpt_3`
* `ft_7b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_7b_NLSY97_with_birth_year_ckpt_3`
* `ft_7b_NLSY97_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY79_with_birth_year_ckpt_3`
* `ft_13b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY97_with_birth_year_ckpt_3`
* `ft_13b_NLSY97_with_birth_year_ckpt_bo5`

**Ablation Models (Numeric Codes)**

* `ft_7b_NLSY79_numeric_ckpt_3`
* `ft_7b_NLSY79_numeric_ckpt_bo5`
* `ft_7b_NLSY97_numeric_ckpt_3`
* `ft_7b_NLSY97_numeric_ckpt_bo5`

## **📜 Citation**

If you use these models, please cite the original paper:

```
@article{athey2024labor,
  title={LABOR-LLM: Language-Based Occupational Representations with Large Language Models},
  author={Athey, Susan and Brunborg, Herman and Du, Tianyu and Kanodia, Ayush and Vafa, Keyon},
  journal={arXiv preprint arXiv:2406.17972},
  year={2024},
  url={https://arxiv.org/abs/2406.17972}
}
```