---
license: apache-2.0
metrics:
- perplexity
library_name: transformers
---
# **LABOR-LLM Replication Models**

This repository contains replication models for the paper **"LABOR-LLM: Language-Based Occupational Representations with Large Language Models"** (Athey et al., 2024).

Link to the paper: https://arxiv.org/abs/2406.17972

These models are **Llama-2** checkpoints fine-tuned on longitudinal survey data (**NLSY79** and **NLSY97**) to predict labor market transitions. By converting tabular career histories into text-based "resumes," these models leverage the semantic knowledge of LLMs to outperform traditional econometric benchmarks in predicting a worker's next occupation.

## **📦 Model Variants & Ablations**

The repository hosts **12 model checkpoints**. You must pass the `subfolder` argument to load the correct variant.

| Model Size | Dataset | Variant Type | Description |
| :---- | :---- | :---- | :---- |
| **7B / 13B** | NLSY79 / NLSY97 | `with_birth_year` | **Main Model.** Uses natural-language job titles and includes the worker's birth year in the context window to capture cohort/age effects. |
| **7B** | NLSY79 / NLSY97 | `numeric` | **Ablation Baseline.** Replaces natural-language job titles with unique numeric codes. Used to demonstrate that the LLM's performance drops when semantic job information is removed. |

**Note on PSID models:** Models fine-tuned on the Panel Study of Income Dynamics (PSID) are not hosted here due to data licensing restrictions. They will be distributed through the Inter-university Consortium for Political and Social Research (ICPSR).

### **📝 Understanding Checkpoints (`ckpt_3` vs. `ckpt_bo5`)**

* **`ckpt_3`**: The checkpoint with the lowest validation loss from a training run scheduled for **3 epochs**.
* **`ckpt_bo5`**: The checkpoint with the lowest validation loss from a training run scheduled for **5 epochs**.

**Note:** Because the learning-rate schedule (e.g., warmup and decay steps) is computed from the total number of epochs, the first 3 epochs of the 5-epoch run differ from the 3-epoch run. `ckpt_bo5` is therefore not simply a longer-trained version of `ckpt_3`; the two represent distinct training trajectories. Empirically, we found that the `ckpt_bo5` models achieve better performance.

## **🚀 Usage**

**Crucial:** You cannot load a model using the repository ID alone. You must also provide the subfolder name corresponding to the specific variant you want to use.

### **Installation**

```bash
pip install transformers torch
```

### **Loading a Model**

You can load a model either directly from the Hugging Face Hub or by first downloading the checkpoint to your local disk.

**Note:** For a complete demonstration of usage, please refer to the `demo_notebook.ipynb` included in this repository.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# 1. Select your variant
# Example: 7B model on NLSY79 with birth year data (best-of-5 checkpoint)
repo_id = "tianyudu/LABOR_LLM"
variant = "ft_7b_NLSY79_with_birth_year_ckpt_bo5"

# 2. Load tokenizer & model
tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=variant)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    subfolder=variant,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 3. Inference
# The model expects a career history formatted as a text sequence.
# Example prompt format (varies by dataset; see the paper for exact templates):
input_text = "Born in 1980. In 2002, works as a Waiter. In 2003, works as a"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

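The prompt above is written out by hand; in practice you would serialize each career history the same way. As a minimal sketch, a helper that turns a birth year and a chronological job history into this text format might look like the following. The function name and the exact template are assumptions for illustration, not part of the released code — the paper's templates vary by dataset.

```python
def build_prompt(birth_year, history, next_year):
    """Serialize a career history into a next-occupation prompt.

    history: list of (year, job_title) tuples in chronological order.
    The prompt ends mid-sentence so the model completes the next job title.
    (Illustrative template only; the official format may differ.)
    """
    parts = [f"Born in {birth_year}."]
    for year, title in history:
        parts.append(f"In {year}, works as a {title}.")
    parts.append(f"In {next_year}, works as a")
    return " ".join(parts)


prompt = build_prompt(1980, [(2002, "Waiter")], 2003)
print(prompt)
# -> Born in 1980. In 2002, works as a Waiter. In 2003, works as a
```

The output string can be passed directly to the tokenizer in place of the hand-written `input_text` above.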
## **📂 Full List of Subfolders**

**Main Models (Natural Language + Birth Year)**

* `ft_7b_NLSY79_with_birth_year_ckpt_3`
* `ft_7b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_7b_NLSY97_with_birth_year_ckpt_3`
* `ft_7b_NLSY97_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY79_with_birth_year_ckpt_3`
* `ft_13b_NLSY79_with_birth_year_ckpt_bo5`
* `ft_13b_NLSY97_with_birth_year_ckpt_3`
* `ft_13b_NLSY97_with_birth_year_ckpt_bo5`

**Ablation Models (Numeric Codes)**

* `ft_7b_NLSY79_numeric_ckpt_3`
* `ft_7b_NLSY79_numeric_ckpt_bo5`
* `ft_7b_NLSY97_numeric_ckpt_3`
* `ft_7b_NLSY97_numeric_ckpt_bo5`

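The subfolder names follow a regular pattern, `ft_{size}_{dataset}_{variant}_{checkpoint}`, so the full set can be enumerated programmatically — convenient for batch evaluation across all variants. A short sketch (the pattern is inferred from the list above):

```python
from itertools import product

# Enumerate all 12 hosted subfolder names from the naming pattern
# ft_{size}_{dataset}_{variant}_{checkpoint}, inferred from the list above.
main = [
    f"ft_{size}_{ds}_with_birth_year_{ck}"
    for size, ds, ck in product(
        ["7b", "13b"], ["NLSY79", "NLSY97"], ["ckpt_3", "ckpt_bo5"]
    )
]
ablation = [
    f"ft_7b_{ds}_numeric_{ck}"
    for ds, ck in product(["NLSY79", "NLSY97"], ["ckpt_3", "ckpt_bo5"])
]
subfolders = main + ablation

assert len(subfolders) == 12
print(subfolders[0])
# -> ft_7b_NLSY79_with_birth_year_ckpt_3
```

Each generated name can be passed as the `subfolder` argument shown in the usage section.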
95
+
96
+ If you use these models, please cite the original paper:
97
+ ```
98
+ @article{athey2024labor,
99
+ title={LABOR-LLM: Language-Based Occupational Representations with Large Language Models},
100
+ author={Athey, Susan and Brunborg, Herman and Du, Tianyu and Kanodia, Ayush and Vafa, Keyon},
101
+ journal={arXiv preprint arXiv:2406.17972},
102
+ year={2024},
103
+ url={[https://arxiv.org/abs/2406.17972](https://arxiv.org/abs/2406.17972)}
104
+ }
105
+ ```