---
metrics:
- perplexity
library_name: transformers
---
# **LABOR-LLM Replication Models**

This repository contains replication models for the paper **"LABOR-LLM: Language-Based Occupational Representations with Large Language Models"** (Athey et al., 2024).

Link to the paper: https://arxiv.org/abs/2406.17972

These models are **Llama-2** checkpoints fine-tuned on longitudinal survey data (**NLSY79** and **NLSY97**) to predict labor market transitions. By converting tabular career histories into text-based "resumes," these models leverage the semantic knowledge of LLMs to outperform traditional econometric benchmarks in predicting a worker's next occupation.

## **📦 Model Variants & Ablations**

The repository hosts **12 model checkpoints**. You must pass the `subfolder` argument to load the correct variant.

| Model Size | Dataset | Variant Type | Description |
| :---- | :---- | :---- | :---- |
| **7B / 13B** | NLSY79 / NLSY97 | `with_birth_year` | **Main Model.** Uses natural language job titles and includes the worker's birth year in the context window to capture cohort/age effects. |
| **7B** | NLSY79 / NLSY97 | `numeric` | **Ablation Baseline.** Replaces natural language job titles with unique numeric codes. Used to demonstrate that the LLM's performance drops when semantic job information is removed. |

**Note on PSID Models:** Models fine-tuned on the Panel Study of Income Dynamics (PSID) are not hosted here due to data licensing restrictions. They will be distributed through the Inter-university Consortium for Political and Social Research (ICPSR).

### **🔍 Understanding Checkpoints (`ckpt_3` vs `ckpt_bo5`)**

* **`ckpt_3`**: The model with the lowest validation loss from a training run scheduled for **3 epochs**.
* **`ckpt_bo5`**: The model with the lowest validation loss from a training run scheduled for **5 epochs**.

**Note:** Because the learning rate schedule (e.g., warmup and decay steps) is computed from the total number of epochs, the first 3 epochs of the "5-epoch run" differ from the "3-epoch run." Therefore, `ckpt_bo5` is not simply a longer version of `ckpt_3`; they represent distinct training trajectories. Empirically, we found that the `ckpt_bo5` model achieves better performance.

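
To make the note above concrete, here is a minimal sketch of why a schedule derived from the total epoch count changes the early trajectory. The hyperparameters below (warmup steps, peak learning rate, steps per epoch) are illustrative assumptions, not the paper's actual settings:

```python
import math

def lr_at(step, total_steps, warmup_steps=100, peak_lr=2e-4):
    """Linear warmup followed by cosine decay; the decay horizon is
    derived from total_steps, i.e. the scheduled number of epochs."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = 1000
# The same training step sees different learning rates under the
# 3-epoch and 5-epoch schedules, so the two runs diverge right
# after warmup, not only after epoch 3.
lr_3epoch = lr_at(500, 3 * steps_per_epoch)
lr_5epoch = lr_at(500, 5 * steps_per_epoch)
```

Under the longer schedule the cosine decay is stretched, so the 5-epoch run still holds a higher learning rate at the same step.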
## **🚀 Usage**

**Crucial:** You cannot load the model using just the repository ID. You must provide the subfolder name corresponding to the specific variant you want to use.

### **Installation**

```shell
pip install transformers torch
```

### **Loading a Model**

You can load the model directly from the Hugging Face Hub or download the checkpoints manually to your local disk.

**Note:** For a complete demonstration of usage, please refer to the `demo_notebook.ipynb` included in this repository.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1. Select your variant
# Example: 7B model on NLSY79 with birth year data (best-of-5 checkpoint)
repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# 2. Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder=variant,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 3. Inference
# The model expects a career history formatted as a text sequence.
# Example prompt format (varies by dataset; check the paper for exact templates):
input_text = "Born in 1980. In 2002, works as a Waiter. In 2003, works as a"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

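
Free-form generation can emit titles outside the occupation vocabulary. Since the evaluation here is likelihood-based (hence the `perplexity` metric in the front matter), a common alternative is to score a fixed candidate set by conditional log-probability. The helper below is a hypothetical sketch of that scoring math over per-step logits; with a real model, each row of logits would come from `model(**inputs).logits`:

```python
import math

def sequence_logprob(step_logits, token_ids):
    """Sum of log P(token | context) over a candidate's tokens.
    step_logits[t] holds the unnormalized vocabulary logits at step t."""
    total = 0.0
    for logits, tok in zip(step_logits, token_ids):
        m = max(logits)  # stabilize the log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += logits[tok] - log_z
    return total

def best_candidate(candidates):
    """candidates maps occupation name -> (step_logits, token_ids);
    returns the occupation with the highest log-likelihood."""
    return max(candidates, key=lambda name: sequence_logprob(*candidates[name]))
```

Ranking candidates this way keeps predictions inside the occupation set and matches how perplexity-style metrics are computed.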
## **📂 Full List of Subfolders**

**Main Models (Natural Language + Birth Year)**

* `ft_7b_NLSY79_with_birth_year_ckpt_3`
* `ft_7b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_7b_NLSY97_with_birth_year_ckpt_3`
* `ft_7b_NLSY97_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY79_with_birth_year_ckpt_3`
* `ft_13b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY97_with_birth_year_ckpt_3`
* `ft_13b_NLSY97_with_birth_year_ckpt_bo5`

**Ablation Models (Numeric Codes)**

* `ft_7b_NLSY79_numeric_ckpt_3`
* `ft_7b_NLSY79_numeric_ckpt_bo5`
* `ft_7b_NLSY97_numeric_ckpt_3`
* `ft_7b_NLSY97_numeric_ckpt_bo5`

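
The subfolder names follow a single pattern, so a small helper (ours, not part of the repository) can assemble a name and catch unsupported combinations before attempting a download:

```python
def subfolder_name(size, dataset, variant, ckpt):
    """Build a checkpoint subfolder name; rejects combinations
    that are not hosted in this repository."""
    if variant == "numeric" and size != "7b":
        raise ValueError("numeric ablations are only available at 7B")
    assert size in {"7b", "13b"}
    assert dataset in {"NLSY79", "NLSY97"}
    assert variant in {"with_birth_year", "numeric"}
    assert ckpt in {"3", "bo5"}
    return f"ft_{size}_{dataset}_{variant}_ckpt_{ckpt}"
```

For example, `subfolder_name("7b", "NLSY79", "with_birth_year", "bo5")` yields the variant used in the loading example above.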
## **📜 Citation**

If you use these models, please cite the original paper:

```bibtex
@article{athey2024labor,
  title={LABOR-LLM: Language-Based Occupational Representations with Large Language Models},
  author={Athey, Susan and Brunborg, Herman and Du, Tianyu and Kanodia, Ayush and Vafa, Keyon},
  journal={arXiv preprint arXiv:2406.17972},
  year={2024},
  url={https://arxiv.org/abs/2406.17972}
}
```