---
license: apache-2.0
metrics:
- perplexity
library_name: transformers
---
# **LABOR-LLM Replication Models**

This repository contains replication models for the paper **"LABOR-LLM: Language-Based Occupational Representations with Large Language Models"** (Athey et al., 2024).

Link to the paper: https://arxiv.org/abs/2406.17972

These models are **Llama-2** checkpoints fine-tuned on longitudinal survey data (**NLSY79** and **NLSY97**) to predict labor market transitions. By converting tabular career histories into text-based "resumes," these models leverage the semantic knowledge of LLMs to outperform traditional econometric benchmarks in predicting a worker's next occupation.

## **📦 Model Variants & Ablations**

The repository hosts **12 model checkpoints**. You must pass the `subfolder` argument to load the correct variant.

| Model Size | Dataset | Variant Type | Description |
| :---- | :---- | :---- | :---- |
| **7B / 13B** | NLSY79 / NLSY97 | `with_birth_year` | **Main Model.** Uses natural-language job titles and includes the worker's birth year in the context window to capture cohort/age effects. |
| **7B** | NLSY79 / NLSY97 | `numeric` | **Ablation Baseline.** Replaces natural-language job titles with unique numeric codes. Used to demonstrate that the LLM's performance drops when semantic job information is removed. |

**Note on PSID models:** Models fine-tuned on the Panel Study of Income Dynamics (PSID) are not hosted here due to data licensing restrictions. They will be distributed through the Inter-university Consortium for Political and Social Research (ICPSR).

### **📝 Understanding Checkpoints (`ckpt_3` vs. `ckpt_bo5`)**

* **`ckpt_3`**: The checkpoint with the lowest validation loss from a training run scheduled for **3 epochs**.
* **`ckpt_bo5`**: The checkpoint with the lowest validation loss from a training run scheduled for **5 epochs**.

**Note:** Because the learning-rate schedule (e.g., warmup and decay steps) is computed from the total number of epochs, the first 3 epochs of the 5-epoch run differ from the 3-epoch run. `ckpt_bo5` is therefore not simply a longer-trained version of `ckpt_3`; the two represent distinct training trajectories. Empirically, we found that the `ckpt_bo5` models achieve better performance.

## **🚀 Usage**

**Crucial:** You cannot load a model using the repository ID alone. You must also provide the subfolder name corresponding to the specific variant you want to use.

### **Installation**

```bash
pip install transformers torch
```

### **Loading a Model**

You can load a model either directly from the Hugging Face Hub or by first downloading the checkpoint to your local disk.

**Note:** For a complete demonstration of usage, please refer to the `demo_notebook.ipynb` included in this repository.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1. Select your variant
# Example: 7B model on NLSY79 with birth year data (best-of-5 checkpoint)
repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# 2. Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder=variant,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 3. Inference
# The model expects a career history formatted as a text sequence.
# Example prompt format (varies by dataset; see the paper for exact templates):
input_text = "Born in 1980. In 2002, works as a Waiter. In 2003, works as a"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

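The prompt above is written out by hand; in practice you would serialize each career history the same way. As a minimal sketch, a helper that turns a birth year and a chronological job history into this text format might look like the following. The function name and the exact template are assumptions for illustration, not part of the released code — the paper's templates vary by dataset.

```python
def build_prompt(birth_year, history, next_year):
    """Serialize a career history into a next-occupation prompt.

    history: list of (year, job_title) tuples in chronological order.
    The prompt ends mid-sentence so the model completes the next job title.
    (Illustrative template only; the official format may differ.)
    """
    parts = [f"Born in {birth_year}."]
    for year, title in history:
        parts.append(f"In {year}, works as a {title}.")
    parts.append(f"In {next_year}, works as a")
    return " ".join(parts)


prompt = build_prompt(1980, [(2002, "Waiter")], 2003)
print(prompt)
# -> Born in 1980. In 2002, works as a Waiter. In 2003, works as a
```

The output string can be passed directly to the tokenizer in place of the hand-written `input_text` above.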
## **📂 Full List of Subfolders**

**Main Models (Natural Language + Birth Year)**

* `ft_7b_NLSY79_with_birth_year_ckpt_3`
* `ft_7b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_7b_NLSY97_with_birth_year_ckpt_3`
* `ft_7b_NLSY97_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY79_with_birth_year_ckpt_3`
* `ft_13b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY97_with_birth_year_ckpt_3`
* `ft_13b_NLSY97_with_birth_year_ckpt_bo5`

**Ablation Models (Numeric Codes)**

* `ft_7b_NLSY79_numeric_ckpt_3`
* `ft_7b_NLSY79_numeric_ckpt_bo5`
* `ft_7b_NLSY97_numeric_ckpt_3`
* `ft_7b_NLSY97_numeric_ckpt_bo5`

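The subfolder names follow a regular pattern, `ft_{size}_{dataset}_{variant}_{checkpoint}`, so the full set can be enumerated programmatically — convenient for batch evaluation across all variants. A short sketch (the pattern is inferred from the list above):

```python
from itertools import product

# Enumerate all 12 hosted subfolder names from the naming pattern
# ft_{size}_{dataset}_{variant}_{checkpoint}, inferred from the list above.
main = [
    f"ft_{size}_{ds}_with_birth_year_{ck}"
    for size, ds, ck in product(
        ["7b", "13b"], ["NLSY79", "NLSY97"], ["ckpt_3", "ckpt_bo5"]
    )
]
ablation = [
    f"ft_7b_{ds}_numeric_{ck}"
    for ds, ck in product(["NLSY79", "NLSY97"], ["ckpt_3", "ckpt_bo5"])
]
subfolders = main + ablation

assert len(subfolders) == 12
print(subfolders[0])
# -> ft_7b_NLSY79_with_birth_year_ckpt_3
```

Each generated name can be passed as the `subfolder` argument shown in the usage section.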
95
+
96
+ If you use these models, please cite the original paper:
97
+ ```
98
+ @article{athey2024labor,
99
+ title={LABOR-LLM: Language-Based Occupational Representations with Large Language Models},
100
+ author={Athey, Susan and Brunborg, Herman and Du, Tianyu and Kanodia, Ayush and Vafa, Keyon},
101
+ journal={arXiv preprint arXiv:2406.17972},
102
+ year={2024},
103
+ url={[https://arxiv.org/abs/2406.17972](https://arxiv.org/abs/2406.17972)}
104
+ }
105
+ ```