---
metrics:
- perplexity
library_name: transformers
---

# **LABOR-LLM Replication Models**

This repository contains replication models for the paper **"LABOR-LLM: Language-Based Occupational Representations with Large Language Models"** (Athey et al., 2024).

Paper: https://arxiv.org/abs/2406.17972

These models are **Llama-2** checkpoints fine-tuned on longitudinal survey data (**NLSY79** and **NLSY97**) to predict labor market transitions. Tabular career histories are converted into text-based "resumes," allowing the models to leverage the semantic knowledge of LLMs and outperform traditional econometric benchmarks at predicting a worker's next occupation.

## **📦 Model Variants & Ablations**

The repository hosts **12 model checkpoints**. You must pass the `subfolder` argument to load the correct variant.

| Model Size | Dataset | Variant Type | Description |
| :---- | :---- | :---- | :---- |
| **7B / 13B** | NLSY79 / NLSY97 | `with_birth_year` | **Main Model.** Uses natural-language job titles and includes the worker's birth year in the context window to capture cohort/age effects. |
| **7B** | NLSY79 / NLSY97 | `numeric` | **Ablation Baseline.** Replaces natural-language job titles with unique numeric codes. Used to demonstrate that the LLM's performance drops when semantic job information is removed. |

**Note on PSID Models:** Models fine-tuned on the Panel Study of Income Dynamics (PSID) are not hosted here due to data licensing restrictions. They will be distributed through the Inter-university Consortium for Political and Social Research (ICPSR).

### **📝 Understanding Checkpoints (`ckpt_3` vs `ckpt_bo5`)**

* **`ckpt_3`**: The model with the lowest validation loss from a training run scheduled for **3 epochs**.
* **`ckpt_bo5`**: The model with the lowest validation loss from a training run scheduled for **5 epochs**.

**Note:** Because the learning rate schedule (e.g., warmup and decay steps) is computed from the total number of epochs, the first 3 epochs of the "5-epoch run" differ from the "3-epoch run." Therefore, `ckpt_bo5` is not simply a longer-trained version of `ckpt_3`; they represent distinct training trajectories. Empirically, we found that the `ckpt_bo5` models achieve better performance.

## **🚀 Usage**

**Crucial:** You cannot load a model using just the repository ID. You must provide the subfolder name corresponding to the specific variant you want to use.

### **Installation**

```bash
pip install transformers torch
```

### **Loading a Model**

You can load a model either directly from the Hugging Face Hub or by first downloading the checkpoints to your local disk (see the download sketch after the example below).

**Note:** For a complete demonstration of usage, please refer to the `demo_notebook.ipynb` included in this repository.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1. Select your variant
# Example: 7B model on NLSY79 with birth year data (best checkpoint from the 5-epoch run)
repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# 2. Load Tokenizer & Model
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder=variant,
    torch_dtype=torch.float16,
    device_map="auto"
)

# 3. Inference
# The model expects a career history formatted as a text sequence.
# Example prompt format (varies by dataset; check the paper for exact templates):
input_text = (
    "Born in 1980. In 2002, works as a Waiter. "
    "In 2003, works as a"
)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
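### **Downloading Checkpoints Locally**

If you prefer to keep the files on disk (for example, on a machine without Hub access at runtime), you can fetch a single variant with `huggingface_hub.snapshot_download` and point `from_pretrained` at the resulting folder. The sketch below is one way to do this; the local directory name is just an example.

```python
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer, AutoModelForCausalLM

variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# Download only the files belonging to this variant.
# "./labor_llm_checkpoints" is an arbitrary example path.
local_root = snapshot_download(
    repo_id="tianyudu/LABOR_LLM",
    allow_patterns=[f"{variant}/*"],
    local_dir="./labor_llm_checkpoints",
)

# Load directly from the downloaded subfolder.
tokenizer = AutoTokenizer.from_pretrained(f"{local_root}/{variant}")
model = AutoModelForCausalLM.from_pretrained(f"{local_root}/{variant}")
```

### **Scoring Candidate Next Occupations**

Because the models are evaluated with perplexity-style metrics, a common way to use them is to compare the conditional likelihood of candidate job titles rather than relying on free-form generation. The sketch below is not the paper's exact evaluation pipeline: the prompt template and candidate titles are placeholders, and the slicing assumes the prompt tokenizes to the same ids with and without the continuation appended.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, subfolder=variant, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

prompt = "Born in 1980. In 2002, works as a Waiter. In 2003, works as a"
# Placeholder candidates; in practice, use the job-title vocabulary from the paper.
candidates = [" Waiter", " Cook", " Cashier"]

def continuation_log_likelihood(prompt: str, continuation: str) -> float:
    """Sum of log-probabilities assigned to the continuation tokens, given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probability of each token given all preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = full_ids[:, 1:]
    token_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions belonging to the continuation (assumes the prompt
    # tokenizes identically with and without the continuation appended).
    n_prompt = prompt_ids.shape[1]
    return token_log_probs[0, n_prompt - 1:].sum().item()

scores = {c.strip(): continuation_log_likelihood(prompt, c) for c in candidates}
print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))
```

Higher (less negative) scores indicate occupations the fine-tuned model considers more likely as the next transition; dividing each score by its token count gives a per-token log-likelihood, which relates directly to perplexity.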
## **📂 Full List of Subfolders**

**Main Models (Natural Language + Birth Year)**

* `ft_7b_NLSY79_with_birth_year_ckpt_3`
* `ft_7b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_7b_NLSY97_with_birth_year_ckpt_3`
* `ft_7b_NLSY97_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY79_with_birth_year_ckpt_3`
* `ft_13b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY97_with_birth_year_ckpt_3`
* `ft_13b_NLSY97_with_birth_year_ckpt_bo5`

**Ablation Models (Numeric Codes)**

* `ft_7b_NLSY79_numeric_ckpt_3`
* `ft_7b_NLSY79_numeric_ckpt_bo5`
* `ft_7b_NLSY97_numeric_ckpt_3`
* `ft_7b_NLSY97_numeric_ckpt_bo5`

## **📜 Citation**

If you use these models, please cite the original paper:

```
@article{athey2024labor,
  title={LABOR-LLM: Language-Based Occupational Representations with Large Language Models},
  author={Athey, Susan and Brunborg, Herman and Du, Tianyu and Kanodia, Ayush and Vafa, Keyon},
  journal={arXiv preprint arXiv:2406.17972},
  year={2024},
  url={https://arxiv.org/abs/2406.17972}
}
```