burtenshaw HF Staff committed on
Commit 4f4b027 · verified · 1 parent: 832c6f7

Publish codex workspace

This view is limited to 50 files because it contains too many changes.
Files changed (50)
  1. AGENTS.md +0 -0
  2. README.md +32 -0
  3. skills/.system/.codex-system-skills.marker +1 -0
  4. skills/.system/openai-docs/LICENSE.txt +201 -0
  5. skills/.system/openai-docs/SKILL.md +69 -0
  6. skills/.system/openai-docs/agents/openai.yaml +14 -0
  7. skills/.system/openai-docs/assets/openai-small.svg +3 -0
  8. skills/.system/openai-docs/assets/openai.png +0 -0
  9. skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md +433 -0
  10. skills/.system/openai-docs/references/latest-model.md +35 -0
  11. skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md +164 -0
  12. skills/.system/skill-creator/SKILL.md +413 -0
  13. skills/.system/skill-creator/agents/openai.yaml +5 -0
  14. skills/.system/skill-creator/assets/skill-creator-small.svg +3 -0
  15. skills/.system/skill-creator/assets/skill-creator.png +0 -0
  16. skills/.system/skill-creator/license.txt +202 -0
  17. skills/.system/skill-creator/references/openai_yaml.md +49 -0
  18. skills/.system/skill-creator/scripts/generate_openai_yaml.py +226 -0
  19. skills/.system/skill-creator/scripts/init_skill.py +400 -0
  20. skills/.system/skill-creator/scripts/quick_validate.py +101 -0
  21. skills/.system/skill-installer/LICENSE.txt +202 -0
  22. skills/.system/skill-installer/SKILL.md +58 -0
  23. skills/.system/skill-installer/agents/openai.yaml +5 -0
  24. skills/.system/skill-installer/assets/skill-installer-small.svg +3 -0
  25. skills/.system/skill-installer/assets/skill-installer.png +0 -0
  26. skills/.system/skill-installer/scripts/github_utils.py +21 -0
  27. skills/.system/skill-installer/scripts/install-skill-from-github.py +308 -0
  28. skills/.system/skill-installer/scripts/list-skills.py +107 -0
  29. skills/agent-kernel/SKILL.md +379 -0
  30. skills/hugging-face-evaluation/SKILL.md +262 -0
  31. skills/hugging-face-evaluation/examples/.env.example +7 -0
  32. skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +380 -0
  33. skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +220 -0
  34. skills/hugging-face-evaluation/examples/eval.example.yaml +11 -0
  35. skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
  36. skills/hugging-face-evaluation/examples/metric_mapping.json +118 -0
  37. skills/hugging-face-evaluation/references/hf_cli_for_prs.md +258 -0
  38. skills/hugging-face-evaluation/references/hf_papers_extraction.md +297 -0
  39. skills/hugging-face-evaluation/references/model_card_extraction.md +244 -0
  40. skills/hugging-face-evaluation/scripts/check_prs.py +98 -0
  41. skills/hugging-face-evaluation/scripts/import_aa.py +353 -0
  42. skills/hugging-face-model-trainer/SKILL.md +718 -0
  43. skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
  44. skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
  45. skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
  46. skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
  47. skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
  48. skills/hugging-face-model-trainer/references/training_methods.md +150 -0
  49. skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
  50. skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
AGENTS.md ADDED
File without changes
README.md ADDED
@@ -0,0 +1,32 @@
+ ---
+ tags:
+ - codex
+ - agent
+ library_name: codex
+ agent_name: ".codex"
+ agent_emoji: "🤖"
+ ---
+
+ # 🤖 .codex
+
+ A Codex workspace published to the Hugging Face Hub.
+
+ ## Quick Start
+
+ ```bash
+ hf claw install --codex burtenshaw/codex
+
+ # Or install and launch it in one shot
+ hf claw run --codex burtenshaw/codex
+ ```
+
+ ## AGENTS.md
+
+
+ ## Included Directories
+
+ - `skills/`
+
+ ---
+
+ *Published with [hf-claw](https://github.com/huggingface/harness)*
skills/.system/.codex-system-skills.marker ADDED
@@ -0,0 +1 @@
+ 415286eb412224fe
skills/.system/openai-docs/LICENSE.txt ADDED
@@ -0,0 +1,201 @@
+ Apache License
+ Version 2.0, January 2004
+ http://www.apache.org/licenses/
+
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+ 1. Definitions.
+
+ "License" shall mean the terms and conditions for use, reproduction,
+ and distribution as defined by Sections 1 through 9 of this document.
+
+ "Licensor" shall mean the copyright owner or entity authorized by
+ the copyright owner that is granting the License.
+
+ "Legal Entity" shall mean the union of the acting entity and all
+ other entities that control, are controlled by, or are under common
+ control with that entity. For the purposes of this definition,
+ "control" means (i) the power, direct or indirect, to cause the
+ direction or management of such entity, whether by contract or
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
+ outstanding shares, or (iii) beneficial ownership of such entity.
+
+ "You" (or "Your") shall mean an individual or Legal Entity
+ exercising permissions granted by this License.
+
+ "Source" form shall mean the preferred form for making modifications,
+ including but not limited to software source code, documentation
+ source, and configuration files.
+
+ "Object" form shall mean any form resulting from mechanical
+ transformation or translation of a Source form, including but
+ not limited to compiled object code, generated documentation,
+ and conversions to other media types.
+
+ "Work" shall mean the work of authorship, whether in Source or
+ Object form, made available under the License, as indicated by a
+ copyright notice that is included in or attached to the work
+ (an example is provided in the Appendix below).
+
+ "Derivative Works" shall mean any work, whether in Source or Object
+ form, that is based on (or derived from) the Work and for which the
+ editorial revisions, annotations, elaborations, or other modifications
+ represent, as a whole, an original work of authorship. For the purposes
+ of this License, Derivative Works shall not include works that remain
+ separable from, or merely link (or bind by name) to the interfaces of,
+ the Work and Derivative Works thereof.
+
+ "Contribution" shall mean any work of authorship, including
+ the original version of the Work and any modifications or additions
+ to that Work or Derivative Works thereof, that is intentionally
+ submitted to Licensor for inclusion in the Work by the copyright owner
+ or by an individual or Legal Entity authorized to submit on behalf of
+ the copyright owner. For the purposes of this definition, "submitted"
+ means any form of electronic, verbal, or written communication sent
+ to the Licensor or its representatives, including but not limited to
+ communication on electronic mailing lists, source code control systems,
+ and issue tracking systems that are managed by, or on behalf of, the
+ Licensor for the purpose of discussing and improving the Work, but
+ excluding communication that is conspicuously marked or otherwise
+ designated in writing by the copyright owner as "Not a Contribution."
+
+ "Contributor" shall mean Licensor and any individual or Legal Entity
+ on behalf of whom a Contribution has been received by Licensor and
+ subsequently incorporated within the Work.
+
+ 2. Grant of Copyright License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ copyright license to reproduce, prepare Derivative Works of,
+ publicly display, publicly perform, sublicense, and distribute the
+ Work and such Derivative Works in Source or Object form.
+
+ 3. Grant of Patent License. Subject to the terms and conditions of
+ this License, each Contributor hereby grants to You a perpetual,
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+ (except as stated in this section) patent license to make, have made,
+ use, offer to sell, sell, import, and otherwise transfer the Work,
+ where such license applies only to those patent claims licensable
+ by such Contributor that are necessarily infringed by their
+ Contribution(s) alone or by combination of their Contribution(s)
+ with the Work to which such Contribution(s) was submitted. If You
+ institute patent litigation against any entity (including a
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
+ or a Contribution incorporated within the Work constitutes direct
+ or contributory patent infringement, then any patent licenses
+ granted to You under this License for that Work shall terminate
+ as of the date such litigation is filed.
+
+ 4. Redistribution. You may reproduce and distribute copies of the
+ Work or Derivative Works thereof in any medium, with or without
+ modifications, and in Source or Object form, provided that You
+ meet the following conditions:
+
+ (a) You must give any other recipients of the Work or
+ Derivative Works a copy of this License; and
+
+ (b) You must cause any modified files to carry prominent notices
+ stating that You changed the files; and
+
+ (c) You must retain, in the Source form of any Derivative Works
+ that You distribute, all copyright, patent, trademark, and
+ attribution notices from the Source form of the Work,
+ excluding those notices that do not pertain to any part of
+ the Derivative Works; and
+
+ (d) If the Work includes a "NOTICE" text file as part of its
+ distribution, then any Derivative Works that You distribute must
+ include a readable copy of the attribution notices contained
+ within such NOTICE file, excluding those notices that do not
+ pertain to any part of the Derivative Works, in at least one
+ of the following places: within a NOTICE text file distributed
+ as part of the Derivative Works; within the Source form or
+ documentation, if provided along with the Derivative Works; or,
+ within a display generated by the Derivative Works, if and
+ wherever such third-party notices normally appear. The contents
+ of the NOTICE file are for informational purposes only and
+ do not modify the License. You may add Your own attribution
+ notices within Derivative Works that You distribute, alongside
+ or as an addendum to the NOTICE text from the Work, provided
+ that such additional attribution notices cannot be construed
+ as modifying the License.
+
+ You may add Your own copyright statement to Your modifications and
+ may provide additional or different license terms and conditions
+ for use, reproduction, or distribution of Your modifications, or
+ for any such Derivative Works as a whole, provided Your use,
+ reproduction, and distribution of the Work otherwise complies with
+ the conditions stated in this License.
+
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
+ any Contribution intentionally submitted for inclusion in the Work
+ by You to the Licensor shall be under the terms and conditions of
+ this License, without any additional terms or conditions.
+ Notwithstanding the above, nothing herein shall supersede or modify
+ the terms of any separate license agreement you may have executed
+ with Licensor regarding such Contributions.
+
+ 6. Trademarks. This License does not grant permission to use the trade
+ names, trademarks, service marks, or product names of the Licensor,
+ except as required for reasonable and customary use in describing the
+ origin of the Work and reproducing the content of the NOTICE file.
+
+ 7. Disclaimer of Warranty. Unless required by applicable law or
+ agreed to in writing, Licensor provides the Work (and each
+ Contributor provides its Contributions) on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ implied, including, without limitation, any warranties or conditions
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+ PARTICULAR PURPOSE. You are solely responsible for determining the
+ appropriateness of using or redistributing the Work and assume any
+ risks associated with Your exercise of permissions under this License.
+
+ 8. Limitation of Liability. In no event and under no legal theory,
+ whether in tort (including negligence), contract, or otherwise,
+ unless required by applicable law (such as deliberate and grossly
+ negligent acts) or agreed to in writing, shall any Contributor be
+ liable to You for damages, including any direct, indirect, special,
+ incidental, or consequential damages of any character arising as a
+ result of this License or out of the use or inability to use the
+ Work (including but not limited to damages for loss of goodwill,
+ work stoppage, computer failure or malfunction, or any and all
+ other commercial damages or losses), even if such Contributor
+ has been advised of the possibility of such damages.
+
+ 9. Accepting Warranty or Additional Liability. While redistributing
+ the Work or Derivative Works thereof, You may choose to offer,
+ and charge a fee for, acceptance of support, warranty, indemnity,
+ or other liability obligations and/or rights consistent with this
+ License. However, in accepting such obligations, You may act only
+ on Your own behalf and on Your sole responsibility, not on behalf of
+ any other Contributor, and only if You agree to indemnify,
+ defend, and hold each Contributor harmless for any liability
+ incurred by, or claims asserted against, such Contributor by reason
+ of your accepting any such warranty or additional liability.
+
+ END OF TERMS AND CONDITIONS
+
+ APPENDIX: How to apply the Apache License to your work.
+
+ To apply the Apache License to your work, attach the following
+ boilerplate notice, with the fields enclosed by brackets "[]"
+ replaced with your own identifying information. (Don't include
+ the brackets!) The text should be enclosed in the appropriate
+ comment syntax for the file format. We also recommend that a
+ file or class name and description of purpose be included on the
+ same "printed page" as the copyright notice for easier
+ identification within third-party archives.
+
+ Copyright [yyyy] [name of copyright owner]
+
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
skills/.system/openai-docs/SKILL.md ADDED
@@ -0,0 +1,69 @@
+ ---
+ name: "openai-docs"
+ description: "Use when the user asks how to build with OpenAI products or APIs and needs up-to-date official documentation with citations, help choosing the latest model for a use case, or explicit GPT-5.4 upgrade and prompt-upgrade guidance; prioritize OpenAI docs MCP tools, use bundled references only as helper context, and restrict any fallback browsing to official OpenAI domains."
+ ---
+
+
+ # OpenAI Docs
+
+ Provide authoritative, current guidance from OpenAI developer docs using the developers.openai.com MCP server. Always prioritize the developer docs MCP tools over web.run for OpenAI-related questions. This skill may also load targeted files from `references/` for model-selection and GPT-5.4-specific requests, but current OpenAI docs remain authoritative. Only if the MCP server is installed and returns no meaningful results should you fall back to web search.
+
+ ## Quick start
+
+ - Use `mcp__openaiDeveloperDocs__search_openai_docs` to find the most relevant doc pages.
+ - Use `mcp__openaiDeveloperDocs__fetch_openai_doc` to pull exact sections and quote/paraphrase accurately.
+ - Use `mcp__openaiDeveloperDocs__list_openai_docs` only when you need to browse or discover pages without a clear query.
+ - Load only the relevant file from `references/` when the question is about model selection or a GPT-5.4 upgrade.
+
+ ## OpenAI product snapshots
+
+ 1. Apps SDK: Build ChatGPT apps by providing a web component UI and an MCP server that exposes your app's tools to ChatGPT.
+ 2. Responses API: A unified endpoint designed for stateful, multimodal, tool-using interactions in agentic workflows.
+ 3. Chat Completions API: Generate a model response from a list of messages comprising a conversation.
+ 4. Codex: OpenAI's coding agent for software development that can write, understand, review, and debug code.
+ 5. gpt-oss: Open-weight OpenAI reasoning models (gpt-oss-120b and gpt-oss-20b) released under the Apache 2.0 license.
+ 6. Realtime API: Build low-latency, multimodal experiences including natural speech-to-speech conversations.
+ 7. Agents SDK: A toolkit for building agentic apps where a model can use tools and context, hand off to other agents, stream partial results, and keep a full trace.
+
+ ## If MCP server is missing
+
+ If MCP tools fail or no OpenAI docs resources are available:
+
+ 1. Run the install command yourself: `codex mcp add openaiDeveloperDocs --url https://developers.openai.com/mcp`
+ 2. If it fails due to permissions/sandboxing, immediately retry the same command with escalated permissions and include a 1-sentence justification for approval. Do not ask the user to run it yet.
+ 3. Only if the escalated attempt fails, ask the user to run the install command.
+ 4. Ask the user to restart Codex.
+ 5. Re-run the doc search/fetch after restart.
+
+ ## Workflow
+
+ 1. Clarify the product scope and whether the request is general docs lookup, model selection, a GPT-5.4 upgrade, or a GPT-5.4 prompt upgrade.
+ 2. If it is a model-selection request, load `references/latest-model.md`.
+ 3. If it is an explicit GPT-5.4 upgrade request, load `references/upgrading-to-gpt-5p4.md`.
+ 4. If the upgrade may require prompt changes, or the workflow is research-heavy, tool-heavy, coding-oriented, multi-agent, or long-running, also load `references/gpt-5p4-prompting-guide.md`.
+ 5. Search docs with a precise query.
+ 6. Fetch the best page and the exact section needed (use `anchor` when possible).
+ 7. For GPT-5.4 upgrade reviews, always make the per-usage-site output explicit: target model, starting reasoning recommendation, `phase` assessment when relevant, prompt blocks, and compatibility status.
+ 8. Answer with concise guidance and cite the doc source, using the reference files only as helper context.
+
+ ## Reference map
+
+ Read only what you need:
+
+ - `references/latest-model.md` -> model-selection and "best/latest/current model" questions; verify every recommendation against current OpenAI docs before answering.
+ - `references/upgrading-to-gpt-5p4.md` -> only for explicit GPT-5.4 upgrade and upgrade-planning requests; verify the checklist and compatibility guidance against current OpenAI docs before answering.
+ - `references/gpt-5p4-prompting-guide.md` -> prompt rewrites and prompt-behavior upgrades for GPT-5.4; verify prompting guidance against current OpenAI docs before answering.
+
+ ## Quality rules
+
+ - Treat OpenAI docs as the source of truth; avoid speculation.
+ - Keep quotes short and within policy limits; prefer paraphrase with citations.
+ - If multiple pages differ, call out the difference and cite both.
+ - Reference files are convenience guides only; for volatile guidance such as recommended models, upgrade instructions, or prompting advice, current OpenAI docs always win.
+ - If docs do not cover the user’s need, say so and offer next steps.
+
+ ## Tooling notes
+
+ - Always use MCP doc tools before any web search for OpenAI-related questions.
+ - If the MCP server is installed but returns no meaningful results, then use web search as a fallback.
+ - When falling back to web search, restrict to official OpenAI domains (developers.openai.com, platform.openai.com) and cite sources.
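The tool-ordering rules above (MCP doc search first, fetch the best hit, web search only as a domain-restricted fallback) can be sketched as a small dispatcher. This is an illustrative sketch, not part of the published workspace: `search_docs`, `fetch_doc`, and `web_search` are hypothetical stand-ins for the MCP and browsing tools named in the skill.

```python
# Sketch of the lookup order from SKILL.md: MCP doc search first, MCP
# fetch on the top hit, and web search only as a last resort restricted
# to official OpenAI domains. The three callables are hypothetical
# stand-ins for the real tools.
def answer_docs_question(query, search_docs, fetch_doc, web_search):
    hits = search_docs(query)  # stands in for search_openai_docs
    if hits:
        # stands in for fetch_openai_doc on the most relevant page
        return "mcp", fetch_doc(hits[0])
    # Fallback browsing stays on official OpenAI domains only.
    allowed = ("developers.openai.com", "platform.openai.com")
    results = [r for r in web_search(query) if r["domain"] in allowed]
    return "web", results
```

With stub tools, a query the docs index can answer never reaches the web fallback, matching the priority order the skill prescribes.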
skills/.system/openai-docs/agents/openai.yaml ADDED
@@ -0,0 +1,14 @@
+ interface:
+   display_name: "OpenAI Docs"
+   short_description: "Reference official OpenAI docs, including upgrade guidance"
+   icon_small: "./assets/openai-small.svg"
+   icon_large: "./assets/openai.png"
+   default_prompt: "Look up official OpenAI docs, load relevant GPT-5.4 upgrade references when applicable, and answer with concise, cited guidance."
+
+ dependencies:
+   tools:
+     - type: "mcp"
+       value: "openaiDeveloperDocs"
+       description: "OpenAI Developer Docs MCP server"
+       transport: "streamable_http"
+       url: "https://developers.openai.com/mcp"
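A manifest with this shape is easy to sanity-check before publishing. The sketch below mirrors the fields of the `openai.yaml` above as a plain dict (to stay dependency-free); the required-key lists and the `validate_manifest` helper are assumptions made for illustration, not part of the workspace tooling.

```python
# Structural sanity check for a skill manifest shaped like openai.yaml.
# The dict mirrors the YAML above; which keys count as "required" is an
# assumption made for this sketch.
def validate_manifest(manifest):
    errors = []
    interface = manifest.get("interface", {})
    for key in ("display_name", "short_description", "default_prompt"):
        if key not in interface:
            errors.append(f"interface.{key} missing")
    for tool in manifest.get("dependencies", {}).get("tools", []):
        if tool.get("type") == "mcp" and not tool.get("url"):
            errors.append(f"mcp tool {tool.get('value')} has no url")
    return errors

manifest = {
    "interface": {
        "display_name": "OpenAI Docs",
        "short_description": "Reference official OpenAI docs, including upgrade guidance",
        "default_prompt": "Look up official OpenAI docs and answer with cited guidance.",
    },
    "dependencies": {
        "tools": [
            {
                "type": "mcp",
                "value": "openaiDeveloperDocs",
                "transport": "streamable_http",
                "url": "https://developers.openai.com/mcp",
            }
        ]
    },
}
```

`validate_manifest(manifest)` returns an empty list for the well-formed manifest and a list of human-readable problems otherwise.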
skills/.system/openai-docs/assets/openai-small.svg ADDED
skills/.system/openai-docs/assets/openai.png ADDED
skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md ADDED
@@ -0,0 +1,433 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # GPT-5.4 prompting upgrade guide
2
+
3
+ Use this guide when prompts written for older models need to be adapted for GPT-5.4 during an upgrade. Start lean: keep the model-string change narrow, preserve the original task intent, and add only the smallest prompt changes needed to recover behavior.
4
+
5
+ ## Default upgrade posture
6
+
7
+ - Start with `model string only` whenever the old prompt is already short, explicit, and task-bounded.
8
+ - Move to `model string + light prompt rewrite` only when regressions appear in completeness, persistence, citation quality, verification, or verbosity.
9
+ - Prefer one or two targeted prompt additions over a broad rewrite.
10
+ - Treat reasoning effort as a last-mile knob. Start lower, then increase only after prompt-level fixes and evals.
11
+ - Before increasing reasoning effort, first add a completeness contract, a verification loop, and tool persistence rules - depending on the usage case.
12
+ - If the workflow clearly depends on implementation changes rather than prompt changes, treat it as blocked for prompt-only upgrade guidance.
13
+ - Do not classify a case as blocked just because the workflow uses tools; block only if the upgrade requires changing tool definitions, wiring, or other implementation details.
14
+
15
+ ## Behavioral differences to account for
16
+
17
+ Current GPT-5.4 upgrade guidance suggests these strengths:
18
+
19
+ - stronger personality and tone adherence, with less drift over long answers
20
+ - better long-horizon and agentic workflow stamina
21
+ - stronger spreadsheet, finance, and formatting tasks
22
+ - more efficient tool selection and fewer unnecessary calls by default
23
+ - stronger structured generation and classification reliability
24
+
25
+ The main places where prompt guidance still helps are:
26
+
27
+ - retrieval-heavy workflows that need persistent tool use and explicit completeness
28
+ - research and citation discipline
29
+ - verification before irreversible or high-impact actions
30
+ - terminal and tool workflow hygiene
31
+ - defaults and implied follow-through
32
+ - verbosity control for compact, information-dense answers
33
+
34
+ Start with the smallest set of instructions that preserves correctness. Add the prompt blocks below only for workflows that actually need them.
35
+
36
+ ## Prompt rewrite patterns
37
+
38
+ | Older prompt pattern | GPT-5.4 adjustment | Why | Example addition |
39
+ | --- | --- | --- | --- |
40
+ | Long, repetitive instructions that compensate for weaker instruction following | Remove duplicate scaffolding and keep only the constraints that materially change behavior | GPT-5.4 usually needs less repeated steering | Replace repeated reminders with one concise rule plus a verification block |
41
+ | Fast assistant prompt with no verbosity control | Keep the prompt as-is first; add a verbosity clamp only if outputs become too long | Many GPT-4o or GPT-4.1 upgrades work with just a model-string swap | Add `output_verbosity_spec` only after a verbosity regression |
42
+ | Tool-heavy agent prompt that assumes the model will keep searching until complete | Add persistence and verification rules | GPT-5.4 may use fewer tool calls by default for efficiency | Add `tool_persistence_rules` and `verification_loop` |
43
+ | Tool-heavy workflow where later actions depend on earlier lookup or retrieval | Add prerequisite and missing-context rules before action steps | GPT-5.4 benefits from explicit dependency-aware routing when context is still thin | Add `dependency_checks` and `missing_context_gating` |
44
+ | Retrieval workflow with several independent lookups | Add selective parallelism guidance | GPT-5.4 is strong at parallel tool use, but should not parallelize dependent steps | Add `parallel_tool_calling` |
45
+ | Batch workflow prompt that often misses items | Add an explicit completeness contract | Item accounting benefits from direct instruction | Add `completeness_contract` |
46
+ | Research prompt that needs grounding and citation discipline | Add research, citation, and empty-result recovery blocks | Multi-pass retrieval is stronger when the model is told how to react to weak or empty search results | Add `research_mode`, `citation_rules`, and `empty_result_handling`; add `tool_persistence_rules` when retrieval tools are already in use |
47
+ | Coding or terminal prompt with shell misuse or early stop failures | Keep the same tool surface and add terminal hygiene and verification instructions | Tool-using coding workflows are not blocked just because tools exist; they usually need better prompt steering, not host rewiring | Add `terminal_tool_hygiene` and `verification_loop`, optionally `tool_persistence_rules` |
48
+ | Multi-agent or support-triage workflow with escalation or completeness requirements | Add one lightweight control block for persistence, completeness, or verification | GPT-5.4 can be more efficient by default, so multi-step support flows benefit from an explicit completion or verification contract | Add at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop` |
49
+
50
+ ## Prompt blocks
51
+
52
+ Use these selectively. Do not add all of them by default.
53
+
54
+ ### `output_verbosity_spec`
55
+
56
+ Use when:
57
+
58
+ - the upgraded model gets too wordy
59
+ - the host needs compact, information-dense answers
60
+ - the workflow benefits from a short overview plus a checklist
61
+
62
+ ```text
63
+ <output_verbosity_spec>
64
+ - Default: 3-6 sentences or up to 6 bullets.
65
+ - If the user asked for a doc or report, use headings with short bullets.
66
+ - For multi-step tasks:
67
+ - Start with 1 short overview paragraph.
68
+ - Then provide a checklist with statuses: [done], [todo], or [blocked].
69
+ - Avoid repeating the user's request.
70
+ - Prefer compact, information-dense writing.
71
+ </output_verbosity_spec>
72
+ ```
73
+
74
+ ### `default_follow_through_policy`
75
+
76
+ Use when:
77
+
78
+ - the host expects the model to proceed on reversible, low-risk steps
79
+ - the upgraded model becomes too conservative or asks for confirmation too often
80
+
81
+ ```text
82
+ <default_follow_through_policy>
83
+ - If the user's intent is clear and the next step is reversible and low-risk, proceed without asking permission.
84
+ - Only ask permission if the next step is:
85
+ (a) irreversible,
86
+ (b) has external side effects, or
87
+ (c) requires missing sensitive information or a choice that materially changes outcomes.
88
+ - If proceeding, state what you did and what remains optional.
89
+ </default_follow_through_policy>
90
+ ```
91
+
92
+ ### `instruction_priority`
93
+
94
+ Use when:
95
+
96
+ - users often change task shape, format, or tone mid-conversation
97
+ - the host needs an explicit override policy instead of relying on defaults
98
+
99
+ ```text
100
+ <instruction_priority>
101
+ - User instructions override default style, tone, formatting, and initiative preferences.
102
+ - Safety, honesty, privacy, and permission constraints do not yield.
103
+ - If a newer user instruction conflicts with an earlier one, follow the newer instruction.
104
+ - Preserve earlier instructions that do not conflict.
105
+ </instruction_priority>
106
+ ```
107
+
108
+ ### `tool_persistence_rules`
109
+
110
+ Use when:
111
+
112
+ - the workflow needs multiple retrieval or verification steps
113
+ - the model starts stopping too early because it is trying to save tool calls
114
+
115
+ ```text
116
+ <tool_persistence_rules>
117
+ - Use tools whenever they materially improve correctness, completeness, or grounding.
118
+ - Do not stop early just to save tool calls.
119
+ - Keep calling tools until:
120
+ (1) the task is complete, and
121
+ (2) verification passes.
122
+ - If a tool returns empty or partial results, retry with a different strategy.
123
+ </tool_persistence_rules>
124
+ ```
125
+
126
+ ### `dig_deeper_nudge`
127
+
128
+ Use when:
129
+
130
+ - the model is too literal or stops at the first plausible answer
131
+ - the task is safety- or accuracy-sensitive and needs a small initiative nudge before raising reasoning effort
132
+
133
+ ```text
134
+ <dig_deeper_nudge>
135
+ - Do not stop at the first plausible answer.
136
+ - Look for second-order issues, edge cases, and missing constraints.
137
+ - If the task is safety- or accuracy-critical, perform at least one verification step.
138
+ </dig_deeper_nudge>
139
+ ```
140
+
141
+ ### `dependency_checks`
142
+
143
+ Use when:
144
+
145
+ - later actions depend on prerequisite lookup, memory retrieval, or discovery steps
146
+ - the model may be tempted to skip prerequisite work because the intended end state seems obvious
147
+
148
+ ```text
149
+ <dependency_checks>
150
+ - Before taking an action, check whether prerequisite discovery, lookup, or memory retrieval is required.
151
+ - Do not skip prerequisite steps just because the intended final action seems obvious.
152
+ - If a later step depends on the output of an earlier one, resolve that dependency first.
153
+ </dependency_checks>
154
+ ```
155
+
156
+ ### `parallel_tool_calling`
157
+
158
+ Use when:
159
+
160
+ - the workflow has multiple independent retrieval steps
161
+ - wall-clock time matters but some steps still need sequencing
162
+
163
+ ```text
164
+ <parallel_tool_calling>
165
+ - When multiple retrieval or lookup steps are independent, prefer parallel tool calls to reduce wall-clock time.
166
+ - Do not parallelize steps with prerequisite dependencies or where one result determines the next action.
167
+ - After parallel retrieval, pause to synthesize before making more calls.
168
+ - Prefer selective parallelism: parallelize independent evidence gathering, not speculative or redundant tool use.
169
+ </parallel_tool_calling>
170
+ ```
171
+
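As a sketch of the selective-parallelism rule above, the following Python uses stdlib threads to run two independent lookups concurrently and synthesize afterwards. `fetch_profile` and `fetch_orders` are hypothetical stand-ins for host tool calls, not real APIs.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_profile(user_id):   # hypothetical independent lookup tool
    return {"user": user_id}

def fetch_orders(user_id):    # hypothetical independent lookup tool
    return ["order-1", "order-2"]

def gather_context(user_id):
    # The two lookups are independent, so run them in parallel; anything
    # that depends on their results is synthesized after both return.
    with ThreadPoolExecutor(max_workers=2) as pool:
        profile_f = pool.submit(fetch_profile, user_id)
        orders_f = pool.submit(fetch_orders, user_id)
    return profile_f.result(), orders_f.result()
```

A step whose input is determined by one of these results would run sequentially after `gather_context`, per the rule above.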
172
+ ### `completeness_contract`
173
+
174
+ Use when:
175
+
176
+ - the task involves batches, lists, enumerations, or multiple deliverables
177
+ - missing items are a common failure mode
178
+
179
+ ```text
180
+ <completeness_contract>
181
+ - Deliver all requested items.
182
+ - Maintain an itemized checklist of deliverables.
183
+ - For lists or batches:
184
+ - state the expected count,
185
+ - enumerate items 1..N,
186
+ - confirm that none are missing before finalizing.
187
+ - If any item is blocked by missing data, mark it [blocked] and state exactly what is missing.
188
+ </completeness_contract>
189
+ ```
190
+
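The enumerate-and-confirm steps above can be sketched as a small helper; the deliverable names and the `[blocked]` wording are illustrative only.

```python
def check_deliverables(expected, produced):
    """Return status lines: expected count, items 1..N, [blocked] markers."""
    lines = [f"Expected {len(expected)} items."]
    for i, name in enumerate(expected, start=1):
        value = produced.get(name)
        status = value if value is not None else "[blocked] missing input data"
        lines.append(f"{i}. {name}: {status}")
    return lines
```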
191
+ ### `empty_result_handling`
192
+
193
+ Use when:
194
+
195
+ - the workflow frequently performs search, CRM, logs, or retrieval steps
196
+ - no-results failures are often false negatives
197
+
198
+ ```text
199
+ <empty_result_handling>
200
+ If a lookup returns empty or suspiciously small results:
201
+ - Do not immediately conclude that no results exist.
202
+ - Try at least 2 fallback strategies, such as a broader query, alternate filters, or another source.
203
+ - Only then report that no results were found, along with what you tried.
204
+ </empty_result_handling>
205
+ ```
206
+
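A minimal sketch of the fallback behavior above, assuming the host exposes its retrieval strategies as `(name, callable)` pairs; the strategy names are hypothetical.

```python
def search_with_fallbacks(query, strategies):
    """Try each (name, callable) strategy in order; report what was tried."""
    tried = []
    for name, run in strategies:
        tried.append(name)
        results = run(query)
        if results:
            return results, tried
    # Only after every fallback is exhausted do we report "no results",
    # together with the list of attempts.
    return [], tried
```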
207
+ ### `verification_loop`
208
+
209
+ Use when:
210
+
211
+ - the workflow has downstream impact
212
+ - accuracy, formatting, or completeness regressions matter
213
+
214
+ ```text
215
+ <verification_loop>
216
+ Before finalizing:
217
+ - Check correctness: does the output satisfy every requirement?
218
+ - Check grounding: are factual claims backed by retrieved sources or tool output?
219
+ - Check formatting: does the output match the requested schema or style?
220
+ - Check safety and irreversibility: if the next step has external side effects, ask permission first.
221
+ </verification_loop>
222
+ ```
223
+
224
+ ### `missing_context_gating`
225
+
226
+ Use when:
227
+
228
+ - required context is sometimes missing early in the workflow
229
+ - the model should prefer retrieval over guessing
230
+
231
+ ```text
232
+ <missing_context_gating>
233
+ - If required context is missing, do not guess.
234
+ - Prefer the appropriate lookup tool when the context is retrievable; ask a minimal clarifying question only when it is not.
235
+ - If you must proceed, label assumptions explicitly and choose a reversible action.
236
+ </missing_context_gating>
237
+ ```
238
+
239
+ ### `action_safety`
240
+
241
+ Use when:
242
+
243
+ - the agent will actively take actions through tools
244
+ - the host benefits from a short pre-flight and post-flight execution frame
245
+
246
+ ```text
247
+ <action_safety>
248
+ - Pre-flight: summarize the intended action and parameters in 1-2 lines.
249
+ - Execute via tool.
250
+ - Post-flight: confirm the outcome and any validation that was performed.
251
+ </action_safety>
252
+ ```
253
+
254
+ ### `citation_rules`
255
+
256
+ Use when:
257
+
258
+ - the workflow produces cited answers
259
+ - fabricated citations or wrong citation formats are costly
260
+
261
+ ```text
262
+ <citation_rules>
263
+ - Only cite sources that were actually retrieved in this session.
264
+ - Never fabricate citations, URLs, IDs, or quote spans.
265
+ - If you cannot find a source for a claim, say so and either:
266
+ - soften the claim, or
267
+ - explain how to verify it with tools.
268
+ - Use exactly the citation format required by the host application.
269
+ </citation_rules>
270
+ ```
271
+
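One way to enforce the retrieved-only rule above is a post-hoc filter over drafted claims; the claim and source-ID shapes here are assumptions, not a real host schema.

```python
def vet_citations(claims, retrieved_ids):
    """Keep a citation only if its source was retrieved in this session."""
    vetted = []
    for text, source_id in claims:
        if source_id in retrieved_ids:
            vetted.append((text, source_id))
        else:
            # Soften the claim rather than fabricating a citation.
            vetted.append((f"{text} (unverified)", None))
    return vetted
```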
272
+ ### `research_mode`
273
+
274
+ Use when:
275
+
276
+ - the workflow is research-heavy
277
+ - the host uses web search or retrieval tools
278
+
279
+ ```text
280
+ <research_mode>
281
+ - Do research in 3 passes:
282
+ 1) Plan: list 3-6 sub-questions to answer.
283
+ 2) Retrieve: search each sub-question and follow 1-2 second-order leads.
284
+ 3) Synthesize: resolve contradictions and write the final answer with citations.
285
+ - Stop only when more searching is unlikely to change the conclusion.
286
+ </research_mode>
287
+ ```
288
+
289
+ If your host environment uses a specific research tool or requires a submit step, combine this with the host's finalization contract.
290
+
291
+ ### `structured_output_contract`
292
+
293
+ Use when:
294
+
295
+ - the host depends on strict JSON, SQL, or other structured output
296
+
297
+ ```text
298
+ <structured_output_contract>
299
+ - Output only the requested format.
300
+ - Do not add prose or markdown fences unless they were requested.
301
+ - Validate that parentheses and brackets are balanced.
302
+ - Do not invent tables or fields.
303
+ - If required schema information is missing, ask for it or return an explicit error object.
304
+ </structured_output_contract>
305
+ ```
306
+
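The balance check mentioned above can be done with a standard stack scan; this sketch covers `()`, `[]`, and `{}` only and is not a full JSON or SQL validator.

```python
def brackets_balanced(text: str) -> bool:
    """Check that (), [], and {} are balanced before emitting output."""
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack
```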
307
+ ### `bbox_extraction_spec`
308
+
309
+ Use when:
310
+
311
+ - the workflow extracts OCR boxes, document regions, or other coordinates
312
+ - layout drift or missed dense regions are common failure modes
313
+
314
+ ```text
315
+ <bbox_extraction_spec>
316
+ - Use the specified coordinate format exactly, such as [x1,y1,x2,y2] normalized to 0..1.
317
+ - For each box, include page, label, text snippet, and confidence.
318
+ - Add a vertical-drift sanity check so boxes stay aligned with the correct line of text.
319
+ - If the layout is dense, process page by page and do a second pass for missed items.
320
+ </bbox_extraction_spec>
321
+ ```
322
+
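The normalization and drift check above can be sketched as two small helpers; the pixel-to-0..1 convention and the tolerance value are illustrative assumptions.

```python
def normalize_box(box, page_w, page_h):
    """Convert a pixel [x1, y1, x2, y2] box to the 0..1 format used above."""
    x1, y1, x2, y2 = box
    return [x1 / page_w, y1 / page_h, x2 / page_w, y2 / page_h]

def vertical_drift_ok(box, line_y_center, tolerance=0.01):
    """Sanity-check that a normalized box still brackets its text line."""
    _, y1, _, y2 = box
    return y1 - tolerance <= line_y_center <= y2 + tolerance
```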
323
+ ### `terminal_tool_hygiene`
324
+
325
+ Use when:
326
+
327
+ - the prompt belongs to a terminal-based or coding-agent workflow
328
+ - tool misuse or shell misuse has been observed
329
+
330
+ ```text
331
+ <terminal_tool_hygiene>
332
+ - Only run shell commands through the terminal tool.
333
+ - Never try to "run" tool names as shell commands.
334
+ - If a patch or edit tool exists, use it directly instead of emulating it in bash.
335
+ - After changes, run a lightweight verification step such as ls, tests, or a build before declaring the task done.
336
+ </terminal_tool_hygiene>
337
+ ```
338
+
339
+ ### `user_updates_spec`
340
+
341
+ Use when:
342
+
343
+ - the workflow is long-running and user updates matter
344
+
345
+ ```text
346
+ <user_updates_spec>
347
+ - Only update the user when starting a new major phase or when the plan changes.
348
+ - Each update should contain:
349
+ - 1 sentence on what changed,
350
+ - 1 sentence on the next step.
351
+ - Do not narrate routine tool calls.
352
+ - Keep the user-facing update short, even when the actual work is exhaustive.
353
+ </user_updates_spec>
354
+ ```
355
+
356
+ If you are using [Compaction](https://developers.openai.com/api/docs/guides/compaction) in the Responses API, compact after major milestones, treat compacted items as opaque state, and keep prompts functionally identical after compaction.
357
+
358
+ ## Responses `phase` guidance
359
+
360
+ For long-running Responses workflows, preambles, or tool-heavy agents that replay assistant items, review whether `phase` is already preserved.
361
+
362
+ - If the host already round-trips `phase`, keep it intact during the upgrade.
363
+ - If the host uses `previous_response_id` and does not manually replay assistant items, note that this may reduce manual `phase` handling needs.
364
+ - If reliable GPT-5.4 behavior would require adding or preserving `phase` and that would need code edits, treat the case as blocked for prompt-only or model-string-only migration guidance.
365
+
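If the host does replay assistant items manually, one low-risk pattern is to copy items opaquely so fields like `phase` survive the round trip; the item shape shown here is an assumption for illustration, not the documented Responses schema.

```python
def replay_items(assistant_items):
    """Copy assistant items opaquely so fields such as `phase` are preserved.

    The item shape here is an assumption; the point is to round-trip items
    whole instead of rebuilding them from selected fields.
    """
    return [dict(item) for item in assistant_items]
```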
366
+ ## Example upgrade profiles
367
+
368
+ ### GPT-5.2
369
+
370
+ - Use `gpt-5.4`
371
+ - Match the current reasoning effort first
372
+ - Preserve the existing latency and quality profile before tuning prompt blocks
373
+ - If the repo does not expose the exact setting, emit `same` as the starting recommendation
374
+
375
+ ### GPT-5.3-Codex
376
+
377
+ - Use `gpt-5.4`
378
+ - Match the current reasoning effort first
379
+ - If you need Codex-style speed and efficiency, add verification blocks before increasing reasoning effort
380
+ - If the repo does not expose the exact setting, emit `same` as the starting recommendation
381
+
382
+ ### GPT-4o or GPT-4.1 assistant
383
+
384
+ - Use `gpt-5.4`
385
+ - Start with `none` reasoning effort
386
+ - Add `output_verbosity_spec` only if output becomes too verbose
387
+
388
+ ### Long-horizon agent
389
+
390
+ - Use `gpt-5.4`
391
+ - Start with `medium` reasoning effort
392
+ - Add `tool_persistence_rules`
393
+ - Add `completeness_contract`
394
+ - Add `verification_loop`
395
+
396
+ ### Research workflow
397
+
398
+ - Use `gpt-5.4`
399
+ - Start with `medium` reasoning effort
400
+ - Add `research_mode`
401
+ - Add `citation_rules`
402
+ - Add `empty_result_handling`
403
+ - Add `tool_persistence_rules` when the host already uses web or retrieval tools
404
+ - Add `parallel_tool_calling` when the retrieval steps are independent
405
+
406
+ ### Support triage or multi-agent workflow
407
+
408
+ - Use `gpt-5.4`
409
+ - Prefer `model string + light prompt rewrite` over `model string only`
410
+ - Add at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop`
411
+ - Add more only if evals show a real regression
412
+
413
+ ### Coding or terminal workflow
414
+
415
+ - Use `gpt-5.4`
416
+ - Keep the model-string change narrow
417
+ - Match the current reasoning effort first if you are upgrading from GPT-5.3-Codex
418
+ - Add `terminal_tool_hygiene`
419
+ - Add `verification_loop`
420
+ - Add `dependency_checks` when actions depend on prerequisite lookup or discovery
421
+ - Add `tool_persistence_rules` if the agent stops too early
422
+ - Review whether `phase` is already preserved for long-running Responses flows or assistant preambles
423
+ - Do not classify this as blocked just because the workflow uses tools; block only if the upgrade requires changing tool definitions or wiring
424
+ - If the repo already uses Responses plus tools and no required host-side change is shown, prefer `model_string_plus_light_prompt_rewrite` over `blocked`
425
+
426
+ ## Prompt regression checklist
427
+
428
+ - Check whether the upgraded prompt still preserves the original task intent.
429
+ - Check whether the new prompt is leaner, not just longer.
430
+ - Check completeness, citation quality, dependency handling, verification behavior, and verbosity.
431
+ - For long-running Responses agents, check whether `phase` handling is already in place or needs implementation work.
432
+ - Confirm that each added prompt block addresses an observed regression.
433
+ - Remove prompt blocks that are not earning their keep.
skills/.system/openai-docs/references/latest-model.md ADDED
@@ -0,0 +1,35 @@
1
+ # Latest model guide
2
+
3
+ This file is a curated helper. Every recommendation here must be verified against current OpenAI docs before it is repeated to a user.
4
+
5
+ ## Current model map
6
+
7
+ | Model ID | Use for |
8
+ | --- | --- |
9
+ | `gpt-5.4` | Default text plus reasoning for most new apps |
10
+ | `gpt-5.4-pro` | Only when the user explicitly asks for maximum reasoning or quality; substantially slower and more expensive |
11
+ | `gpt-5-mini` | Cheaper and faster reasoning with good quality |
12
+ | `gpt-5-nano` | High-throughput simple tasks and classification |
13
+ | `gpt-5.4` | Explicit no-reasoning text path via `reasoning.effort: none` |
14
+ | `gpt-4.1-mini` | Cheaper no-reasoning text |
15
+ | `gpt-4.1-nano` | Fastest and cheapest no-reasoning text |
16
+ | `gpt-5.3-codex` | Agentic coding, code editing, and tool-heavy coding workflows |
17
+ | `gpt-5.1-codex-mini` | Cheaper coding workflows |
18
+ | `gpt-image-1.5` | Best image generation and edit quality |
19
+ | `gpt-image-1-mini` | Cost-optimized image generation |
20
+ | `gpt-4o-mini-tts` | Text-to-speech |
21
+ | `gpt-4o-mini-transcribe` | Speech-to-text, fast and cost-efficient |
22
+ | `gpt-realtime-1.5` | Realtime voice and multimodal sessions |
23
+ | `gpt-realtime-mini` | Cheaper realtime sessions |
24
+ | `gpt-audio` | Chat Completions audio input and output |
25
+ | `gpt-audio-mini` | Cheaper Chat Completions audio workflows |
26
+ | `sora-2` | Faster iteration and draft video generation |
27
+ | `sora-2-pro` | Higher-quality production video |
28
+ | `omni-moderation-latest` | Text and image moderation |
29
+ | `text-embedding-3-large` | Higher-quality retrieval embeddings; default in this skill because no best-specific row exists |
30
+ | `text-embedding-3-small` | Lower-cost embeddings |
31
+
32
+ ## Maintenance notes
33
+
34
+ - This file will drift unless it is periodically re-verified against current OpenAI docs.
35
+ - If this file conflicts with current docs, the docs win.
skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md ADDED
@@ -0,0 +1,164 @@
1
+ # Upgrading to GPT-5.4
2
+
3
+ Use this guide when the user explicitly asks to upgrade an existing integration to GPT-5.4. Pair it with current OpenAI docs lookups. The default target string is `gpt-5.4`.
4
+
5
+ ## Upgrade posture
6
+
7
+ Upgrade with the narrowest safe change set:
8
+
9
+ - replace the model string first
10
+ - update only the prompts that are directly tied to that model usage
11
+ - prefer prompt-only upgrades when possible
12
+ - if the upgrade would require API-surface changes, parameter rewrites, tool rewiring, or broader code edits, mark it as blocked instead of stretching the scope
13
+
14
+ ## Upgrade workflow
15
+
16
+ 1. Inventory current model usage.
17
+ - Search for model strings, client calls, and prompt-bearing files.
18
+ - Include inline prompts, prompt templates, YAML or JSON configs, Markdown docs, and saved prompts when they are clearly tied to a model usage site.
19
+ 2. Pair each model usage with its prompt surface.
20
+ - Prefer the closest prompt surface first: inline system or developer text, then adjacent prompt files, then shared templates.
21
+ - If you cannot confidently tie a prompt to the model usage, say so instead of guessing.
22
+ 3. Classify the source model family.
23
+ - Common buckets: `gpt-4o` or `gpt-4.1`, `o1` or `o3` or `o4-mini`, early `gpt-5`, later `gpt-5.x`, or mixed and unclear.
24
+ 4. Decide the upgrade class.
25
+ - `model string only`
26
+ - `model string + light prompt rewrite`
27
+ - `blocked without code changes`
28
+ 5. Run the no-code compatibility gate.
29
+ - Check whether the current integration can accept `gpt-5.4` without API-surface changes or implementation changes.
30
+ - For long-running Responses or tool-heavy agents, check whether `phase` is already preserved or round-tripped when the host replays assistant items or uses preambles.
31
+ - If compatibility depends on code changes, return `blocked`.
32
+ - If compatibility is unclear, return `unknown` rather than improvising.
33
+ 6. Recommend the upgrade.
34
+ - Default replacement string: `gpt-5.4`
35
+ - Keep the intervention small and behavior-preserving.
36
+ 7. Deliver a structured recommendation.
37
+ - `Current model usage`
38
+ - `Recommended model-string updates`
39
+ - `Starting reasoning recommendation`
40
+ - `Prompt updates`
41
+ - `Phase assessment` when the flow is long-running, replayed, or tool-heavy
42
+ - `No-code compatibility check`
43
+ - `Validation plan`
44
+ - `Launch-day refresh items`
45
+
46
+ Output rule:
47
+
48
+ - Always emit a starting `reasoning_effort_recommendation` for each usage site.
49
+ - If the repo exposes the current reasoning setting, preserve it first unless the source guide says otherwise.
50
+ - If the repo does not expose the current setting, use the source-family starting mapping instead of returning `null`.
51
+
52
+ ## Upgrade outcomes
53
+
54
+ ### `model string only`
55
+
56
+ Choose this when:
57
+
58
+ - the existing prompts are already short, explicit, and task-bounded
59
+ - the workflow is not strongly research-heavy, tool-heavy, multi-agent, batch or completeness-sensitive, or long-horizon
60
+ - there are no obvious compatibility blockers
61
+
62
+ Default action:
63
+
64
+ - replace the model string with `gpt-5.4`
65
+ - keep prompts unchanged
66
+ - validate behavior with existing evals or spot checks
67
+
68
+ ### `model string + light prompt rewrite`
69
+
70
+ Choose this when:
71
+
72
+ - the old prompt was compensating for weaker instruction following
73
+ - the workflow needs more persistence than the default tool-use behavior will likely provide
74
+ - the task needs stronger completeness, citation discipline, or verification
75
+ - the upgraded model becomes too verbose or under-complete unless instructed otherwise
76
+ - the workflow is research-heavy and needs stronger handling of sparse or empty retrieval results
77
+ - the workflow is coding-oriented, tool-heavy, or multi-agent, but the existing API surface and tool definitions can remain unchanged
78
+
79
+ Default action:
80
+
81
+ - replace the model string with `gpt-5.4`
82
+ - add one or two targeted prompt blocks
83
+ - read `references/gpt-5p4-prompting-guide.md` to choose the smallest prompt changes that recover the old behavior
84
+ - avoid broad prompt cleanup unrelated to the upgrade
85
+ - for research workflows, default to `research_mode` + `citation_rules` + `empty_result_handling`; add `tool_persistence_rules` when the host already uses retrieval tools
86
+ - for dependency-aware or tool-heavy workflows, default to `tool_persistence_rules` + `dependency_checks` + `verification_loop`; add `parallel_tool_calling` only when retrieval steps are truly independent
87
+ - for coding or terminal workflows, default to `terminal_tool_hygiene` + `verification_loop`
88
+ - for multi-agent support or triage workflows, default to at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop`
89
+ - for long-running Responses agents with preambles or multiple assistant messages, explicitly review whether `phase` is already handled; if adding or preserving `phase` would require code edits, mark the path as `blocked`
90
+ - do not classify a coding or tool-using Responses workflow as `blocked` just because the visible snippet is minimal; prefer `model string + light prompt rewrite` unless the repo clearly shows that a safe GPT-5.4 path would require host-side code changes
91
+
92
+ ### `blocked`
93
+
94
+ Choose this when:
95
+
96
+ - the upgrade appears to require API-surface changes
97
+ - the upgrade appears to require parameter rewrites or reasoning-setting changes that are not exposed outside implementation code
98
+ - the upgrade would require changing tool definitions, tool handler wiring, or schema contracts
99
+ - you cannot confidently identify the prompt surface tied to the model usage
100
+
101
+ Default action:
102
+
103
+ - do not improvise a broader upgrade
104
+ - report the blocker and explain that the fix is out of scope for this guide
105
+
106
+ ## No-code compatibility checklist
107
+
108
+ Before recommending a no-code upgrade, check:
109
+
110
+ 1. Can the current host accept the `gpt-5.4` model string without changing client code or API surface?
111
+ 2. Are the related prompts identifiable and editable?
112
+ 3. Does the host depend on behavior that likely needs API-surface changes, parameter rewrites, or tool rewiring?
113
+ 4. Would the likely fix be prompt-only, or would it need implementation changes?
114
+ 5. Is the prompt surface close enough to the model usage that you can make a targeted change instead of a broad cleanup?
115
+ 6. For long-running Responses or tool-heavy agents, is `phase` already preserved if the host relies on preambles, replayed assistant items, or multiple assistant messages?
116
+
117
+ If item 1 is no, if items 3 and 4 point to implementation work, or if item 6 is no and the fix needs code changes, return `blocked`.
118
+
119
+ If item 2 is no, return `unknown` unless the user can point to the prompt location.
120
+
121
+ Important:
122
+
123
+ - Existing use of tools, agents, or multiple usage sites is not by itself a blocker.
124
+ - If the current host can keep the same API surface and the same tool definitions, prefer `model string + light prompt rewrite` over `blocked`.
125
+ - Reserve `blocked` for cases that truly require implementation changes, not cases that only need stronger prompt steering.
126
+
127
+ ## Scope boundaries
128
+
129
+ This guide may:
130
+
131
+ - update or recommend updated model strings
132
+ - update or recommend updated prompts
133
+ - inspect code and prompt files to understand where those changes belong
134
+ - inspect whether existing Responses flows already preserve `phase`
135
+ - flag compatibility blockers
136
+
137
+ This guide may not:
138
+
139
+ - move Chat Completions code to Responses
140
+ - move Responses code to another API surface
141
+ - rewrite parameter shapes
142
+ - change tool definitions or tool-call handling
143
+ - change structured-output wiring
144
+ - add or retrofit `phase` handling in implementation code
145
+ - edit business logic, orchestration logic, or SDK usage beyond a literal model-string replacement
146
+
147
+ If a safe GPT-5.4 upgrade requires any of those changes, mark the path as blocked and out of scope.
148
+
149
+ ## Validation plan
150
+
151
+ - Validate each upgraded usage site with existing evals or realistic spot checks.
152
+ - Check whether the upgraded model still matches expected latency, output shape, and quality.
153
+ - If prompt edits were added, confirm each block is doing real work instead of adding noise.
154
+ - If the workflow has downstream impact, add a lightweight verification pass before finalization.
155
+
156
+ ## Launch-day refresh items
157
+
158
+ When final GPT-5.4 guidance changes:
159
+
160
+ 1. Replace release-candidate assumptions with final GPT-5.4 guidance where appropriate.
161
+ 2. Re-check whether the default target string should stay `gpt-5.4` for all source families.
162
+ 3. Re-check any prompt-block recommendations whose semantics may have changed.
163
+ 4. Re-check research, citation, and compatibility guidance against the final model behavior.
164
+ 5. Re-run the same upgrade scenarios and confirm the blocked-versus-viable boundaries still hold.
skills/.system/skill-creator/SKILL.md ADDED
@@ -0,0 +1,413 @@
1
+ ---
2
+ name: skill-creator
3
+ description: Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Codex's capabilities with specialized knowledge, workflows, or tool integrations.
4
+ metadata:
5
+ short-description: Create or update a skill
6
+ ---
7
+
8
+ # Skill Creator
9
+
10
+ This skill provides guidance for creating effective skills.
11
+
12
+ ## About Skills
13
+
14
+ Skills are modular, self-contained folders that extend Codex's capabilities by providing
15
+ specialized knowledge, workflows, and tools. Think of them as "onboarding guides" for specific
16
+ domains or tasks—they transform Codex from a general-purpose agent into a specialized agent
17
+ equipped with procedural knowledge that no model can fully possess.
18
+
19
+ ### What Skills Provide
20
+
21
+ 1. Specialized workflows - Multi-step procedures for specific domains
22
+ 2. Tool integrations - Instructions for working with specific file formats or APIs
23
+ 3. Domain expertise - Company-specific knowledge, schemas, business logic
24
+ 4. Bundled resources - Scripts, references, and assets for complex and repetitive tasks
25
+
26
+ ## Core Principles
27
+
28
+ ### Concise is Key
29
+
30
+ The context window is a public good. Skills share the context window with everything else Codex needs: system prompt, conversation history, other Skills' metadata, and the actual user request.
31
+
32
+ **Default assumption: Codex is already very smart.** Only add context Codex doesn't already have. Challenge each piece of information: "Does Codex really need this explanation?" and "Does this paragraph justify its token cost?"
33
+
34
+ Prefer concise examples over verbose explanations.
35
+
36
+ ### Set Appropriate Degrees of Freedom
37
+
38
+ Match the level of specificity to the task's fragility and variability:
39
+
40
+ **High freedom (text-based instructions)**: Use when multiple approaches are valid, decisions depend on context, or heuristics guide the approach.
41
+
42
+ **Medium freedom (pseudocode or scripts with parameters)**: Use when a preferred pattern exists, some variation is acceptable, or configuration affects behavior.
43
+
44
+ **Low freedom (specific scripts, few parameters)**: Use when operations are fragile and error-prone, consistency is critical, or a specific sequence must be followed.
45
+
46
+ Think of Codex as exploring a path: a narrow bridge with cliffs needs specific guardrails (low freedom), while an open field allows many routes (high freedom).
47
+
48
+ ### Protect Validation Integrity
49
+
50
+ You may use subagents during iteration to validate whether a skill works on realistic tasks or whether a suspected problem is real. This is most useful when you want an independent pass on the skill's behavior, outputs, or failure modes after a revision. Only do this when it is possible to start new subagents.
51
+
52
+ When using subagents for validation, treat that as an evaluation surface. The goal is to learn whether the skill generalizes, not whether another agent can reconstruct the answer from leaked context.
53
+
54
+ Prefer raw artifacts such as example prompts, outputs, diffs, logs, or traces. Give the minimum task-local context needed to perform the validation. Avoid passing the intended answer, suspected bug, intended fix, or your prior conclusions unless the validation explicitly requires them.
55
+
56
+ ### Anatomy of a Skill
57
+
58
+ Every skill consists of a required SKILL.md file and optional bundled resources:
59
+
60
+ ```
61
+ skill-name/
62
+ ├── SKILL.md (required)
63
+ │ ├── YAML frontmatter metadata (required)
64
+ │ │ ├── name: (required)
65
+ │ │ └── description: (required)
66
+ │ └── Markdown instructions (required)
67
+ ├── agents/ (recommended)
68
+ │ └── openai.yaml - UI metadata for skill lists and chips
69
+ └── Bundled Resources (optional)
70
+ ├── scripts/ - Executable code (Python/Bash/etc.)
71
+ ├── references/ - Documentation intended to be loaded into context as needed
72
+ └── assets/ - Files used in output (templates, icons, fonts, etc.)
73
+ ```
74
+
75
+ #### SKILL.md (required)
76
+
77
+ Every SKILL.md consists of:
78
+
79
+ - **Frontmatter** (YAML): Contains `name` and `description` fields. These are the only fields that Codex reads to determine when the skill gets used, thus it is very important to be clear and comprehensive in describing what the skill is, and when it should be used.
80
+ - **Body** (Markdown): Instructions and guidance for using the skill. Only loaded AFTER the skill triggers (if at all).
81
+
82
+ #### Agents metadata (recommended)
83
+
84
+ - UI-facing metadata for skill lists and chips
85
+ - Read references/openai_yaml.md before generating values and follow its descriptions and constraints
86
+ - Create: human-facing `display_name`, `short_description`, and `default_prompt` by reading the skill
87
+ - Generate deterministically by passing the values as `--interface key=value` to `scripts/generate_openai_yaml.py` or `scripts/init_skill.py`
88
+ - On updates: validate `agents/openai.yaml` still matches SKILL.md; regenerate if stale
89
+ - Only include other optional interface fields (icons, brand color) if explicitly provided
90
+ - See references/openai_yaml.md for field definitions and examples
91
+
92
+ #### Bundled Resources (optional)
93
+
94
+ ##### Scripts (`scripts/`)
95
+
96
+ Executable code (Python/Bash/etc.) for tasks that require deterministic reliability or are repeatedly rewritten.
97
+
98
+ - **When to include**: When the same code is being rewritten repeatedly or deterministic reliability is needed
99
+ - **Example**: `scripts/rotate_pdf.py` for PDF rotation tasks
100
+ - **Benefits**: Token efficient, deterministic, may be executed without loading into context
101
+ - **Note**: Scripts may still need to be read by Codex for patching or environment-specific adjustments
102
+
103
+ ##### References (`references/`)
104
+
105
+ Documentation and reference material intended to be loaded as needed into context to inform Codex's process and thinking.
106
+
107
+ - **When to include**: For documentation that Codex should reference while working
108
+ - **Examples**: `references/finance.md` for financial schemas, `references/mnda.md` for company NDA template, `references/policies.md` for company policies, `references/api_docs.md` for API specifications
109
+ - **Use cases**: Database schemas, API documentation, domain knowledge, company policies, detailed workflow guides
110
+ - **Benefits**: Keeps SKILL.md lean, loaded only when Codex determines it's needed
111
+ - **Best practice**: If files are large (>10k words), include grep search patterns in SKILL.md
112
+ - **Avoid duplication**: Information should live in either SKILL.md or references files, not both. Prefer references files for detailed information unless it's truly core to the skill—this keeps SKILL.md lean while making information discoverable without hogging the context window. Keep only essential procedural instructions and workflow guidance in SKILL.md; move detailed reference material, schemas, and examples to references files.
113
+
114
+ ##### Assets (`assets/`)
115
+
116
+ Files not intended to be loaded into context, but rather used within the output Codex produces.
117
+
118
+ - **When to include**: When the skill needs files that will be used in the final output
119
+ - **Examples**: `assets/logo.png` for brand assets, `assets/slides.pptx` for PowerPoint templates, `assets/frontend-template/` for HTML/React boilerplate, `assets/font.ttf` for typography
120
+ - **Use cases**: Templates, images, icons, boilerplate code, fonts, sample documents that get copied or modified
121
+ - **Benefits**: Separates output resources from documentation, enables Codex to use files without loading them into context
122
+
123
+ #### What Not to Include in a Skill
124
+
125
+ A skill should only contain essential files that directly support its functionality. Do NOT create extraneous documentation or auxiliary files, including:
126
+
127
+ - README.md
128
+ - INSTALLATION_GUIDE.md
129
+ - QUICK_REFERENCE.md
130
+ - CHANGELOG.md
131
+ - etc.
132
+
133
+ The skill should only contain the information needed for an AI agent to do the job at hand. It should not contain auxiliary context about the process that went into creating it, setup and testing procedures, user-facing documentation, etc. Creating additional documentation files just adds clutter and confusion.
134
+
135
+ ### Progressive Disclosure Design Principle
136
+
137
+ Skills use a three-level loading system to manage context efficiently:
138
+
139
+ 1. **Metadata (name + description)** - Always in context (~100 words)
140
+ 2. **SKILL.md body** - When skill triggers (<5k words)
141
+ 3. **Bundled resources** - As needed by Codex (unlimited, because scripts can be executed without being read into the context window)
142
+
143
+ #### Progressive Disclosure Patterns
144
+
145
+ Keep the SKILL.md body to the essentials and under 500 lines to minimize context bloat. Split content into separate files when approaching this limit. When splitting content into other files, reference them from SKILL.md and describe clearly when to read them, so the reader of the skill knows they exist and when to use them.
146
+
147
+ **Key principle:** When a skill supports multiple variations, frameworks, or options, keep only the core workflow and selection guidance in SKILL.md. Move variant-specific details (patterns, examples, configuration) into separate reference files.
148
+
149
+ **Pattern 1: High-level guide with references**
150
+
151
+ ```markdown
152
+ # PDF Processing
153
+
154
+ ## Quick start
155
+
156
+ Extract text with pdfplumber:
157
+ [code example]
158
+
159
+ ## Advanced features
160
+
161
+ - **Form filling**: See [FORMS.md](FORMS.md) for complete guide
162
+ - **API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
163
+ - **Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patterns
164
+ ```
165
+
166
+ Codex loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed.
167
+
168
+ **Pattern 2: Domain-specific organization**
169
+
170
+ For Skills with multiple domains, organize content by domain to avoid loading irrelevant context:
171
+
172
+ ```
173
+ bigquery-skill/
174
+ ├── SKILL.md (overview and navigation)
175
+ └── reference/
176
+ ├── finance.md (revenue, billing metrics)
177
+ ├── sales.md (opportunities, pipeline)
178
+ ├── product.md (API usage, features)
179
+ └── marketing.md (campaigns, attribution)
180
+ ```
181
+
182
+ When a user asks about sales metrics, Codex only reads sales.md.
183
+
184
+ Similarly, for skills supporting multiple frameworks or variants, organize by variant:
185
+
186
+ ```
187
+ cloud-deploy/
188
+ ├── SKILL.md (workflow + provider selection)
189
+ └── references/
190
+ ├── aws.md (AWS deployment patterns)
191
+ ├── gcp.md (GCP deployment patterns)
192
+ └── azure.md (Azure deployment patterns)
193
+ ```
194
+
195
+ When the user chooses AWS, Codex only reads aws.md.
196
+
197
+ **Pattern 3: Conditional details**
198
+
199
+ Show basic content, link to advanced content:
200
+
201
+ ```markdown
202
+ # DOCX Processing
203
+
204
+ ## Creating documents
205
+
206
+ Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).
207
+
208
+ ## Editing documents
209
+
210
+ For simple edits, modify the XML directly.
211
+
212
+ **For tracked changes**: See [REDLINING.md](REDLINING.md)
213
+ **For OOXML details**: See [OOXML.md](OOXML.md)
214
+ ```
215
+
216
+ Codex reads REDLINING.md or OOXML.md only when the user needs those features.
217
+
218
+ **Important guidelines:**
219
+
220
+ - **Avoid deeply nested references** - Keep references one level deep from SKILL.md. All reference files should be linked directly from SKILL.md.
221
+ - **Structure longer reference files** - For files longer than 100 lines, include a table of contents at the top so Codex can see the full scope when previewing.
222
+
223
+ ## Skill Creation Process
224
+
225
+ Skill creation involves these steps:
226
+
227
+ 1. Understand the skill with concrete examples
228
+ 2. Plan reusable skill contents (scripts, references, assets)
229
+ 3. Initialize the skill (run init_skill.py)
230
+ 4. Edit the skill (implement resources and write SKILL.md)
231
+ 5. Validate the skill (run quick_validate.py)
232
+ 6. Iterate based on real usage and forward-test complex skills
233
+
234
+ Follow these steps in order, skipping only if there is a clear reason why they are not applicable.
235
+
236
+ ### Skill Naming
237
+
238
+ - Use lowercase letters, digits, and hyphens only; normalize user-provided titles to hyphen-case (e.g., "Plan Mode" -> `plan-mode`).
239
+ - Keep generated names under 64 characters (letters, digits, hyphens).
240
+ - Prefer short, verb-led phrases that describe the action.
241
+ - Namespace by tool when it improves clarity or triggering (e.g., `gh-address-comments`, `linear-address-issue`).
242
+ - Name the skill folder exactly after the skill name.
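Taken together, the naming rules above amount to a small normalization routine. This sketch is illustrative only and not part of the bundled scripts:

```python
import re


def normalize_skill_name(title: str, max_len: int = 64) -> str:
    """Normalize a user-provided title to a hyphen-case skill name."""
    # Lowercase, then collapse runs of non-alphanumeric characters into hyphens
    name = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    # Enforce the length limit without leaving a trailing hyphen
    return name[:max_len].rstrip("-")


print(normalize_skill_name("Plan Mode"))              # plan-mode
print(normalize_skill_name("GH: Address Comments!"))  # gh-address-comments
```

The normalized name is also what the skill folder should be called.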
243
+
244
+ ### Step 1: Understanding the Skill with Concrete Examples
245
+
246
+ Skip this step only when the skill's usage patterns are already clearly understood. It remains valuable even when working with an existing skill.
247
+
248
+ To create an effective skill, clearly understand concrete examples of how the skill will be used. This understanding can come from either direct user examples or generated examples that are validated with user feedback.
249
+
250
+ For example, when building an image-editor skill, relevant questions include:
251
+
252
+ - "What functionality should the image-editor skill support? Editing, rotating, anything else?"
253
+ - "Can you give some examples of how this skill would be used?"
254
+ - "I can imagine users asking for things like 'Remove the red-eye from this image' or 'Rotate this image'. Are there other ways you imagine this skill being used?"
255
+ - "What would a user say that should trigger this skill?"
256
+
257
+ To avoid overwhelming users, do not ask too many questions in a single message. Start with the most important questions and follow up as needed.
258
+
259
+ Conclude this step when there is a clear sense of the functionality the skill should support.
260
+
261
+ ### Step 2: Planning the Reusable Skill Contents
262
+
263
+ To turn concrete examples into an effective skill, analyze each example by:
264
+
265
+ 1. Considering how to execute on the example from scratch
266
+ 2. Identifying what scripts, references, and assets would be helpful when executing these workflows repeatedly
267
+
268
+ Example: When building a `pdf-editor` skill to handle queries like "Help me rotate this PDF," the analysis shows:
269
+
270
+ 1. Rotating a PDF requires re-writing the same code each time
271
+ 2. A `scripts/rotate_pdf.py` script would be helpful to store in the skill
272
+
273
+ Example: When designing a `frontend-webapp-builder` skill for queries like "Build me a todo app" or "Build me a dashboard to track my steps," the analysis shows:
274
+
275
+ 1. Writing a frontend webapp requires the same boilerplate HTML/React each time
276
+ 2. An `assets/hello-world/` template containing the boilerplate HTML/React project files would be helpful to store in the skill
277
+
278
+ Example: When building a `big-query` skill to handle queries like "How many users have logged in today?" the analysis shows:
279
+
280
+ 1. Querying BigQuery requires re-discovering the table schemas and relationships each time
281
+ 2. A `references/schema.md` file documenting the table schemas would be helpful to store in the skill
282
+
283
+ To establish the skill's contents, analyze each concrete example to create a list of the reusable resources to include: scripts, references, and assets.
284
+
285
+ ### Step 3: Initializing the Skill
286
+
287
+ At this point, it is time to actually create the skill.
288
+
289
+ Skip this step only if the skill being developed already exists. In this case, continue to the next step.
290
+
291
+ When creating a new skill from scratch, always run the `init_skill.py` script. It generates a template skill directory that includes everything a skill requires, making skill creation more efficient and reliable.
292
+
293
+ Usage:
294
+
295
+ ```bash
296
+ scripts/init_skill.py <skill-name> --path <output-directory> [--resources scripts,references,assets] [--examples]
297
+ ```
298
+
299
+ Examples:
300
+
301
+ ```bash
302
+ scripts/init_skill.py my-skill --path skills/public
303
+ scripts/init_skill.py my-skill --path skills/public --resources scripts,references
304
+ scripts/init_skill.py my-skill --path skills/public --resources scripts --examples
305
+ ```
306
+
307
+ The script:
308
+
309
+ - Creates the skill directory at the specified path
310
+ - Generates a SKILL.md template with proper frontmatter and TODO placeholders
311
+ - Creates `agents/openai.yaml` using agent-generated `display_name`, `short_description`, and `default_prompt` passed via `--interface key=value`
312
+ - Optionally creates resource directories based on `--resources`
313
+ - Optionally adds example files when `--examples` is set
314
+
315
+ After initialization, customize the SKILL.md and add resources as needed. If you used `--examples`, replace or delete placeholder files.
316
+
317
+ Generate `display_name`, `short_description`, and `default_prompt` by reading the skill, then pass them as `--interface key=value` to `init_skill.py` or regenerate with:
318
+
319
+ ```bash
320
+ scripts/generate_openai_yaml.py <path/to/skill-folder> --interface key=value
321
+ ```
322
+
323
+ Only include other optional interface fields when the user explicitly provides them. For full field descriptions and examples, see references/openai_yaml.md.
324
+
325
+ ### Step 4: Edit the Skill
326
+
327
+ When editing the (newly-generated or existing) skill, remember that the skill is being created for another instance of Codex to use. Include information that would be beneficial and non-obvious to Codex. Consider what procedural knowledge, domain-specific details, or reusable assets would help another Codex instance execute these tasks more effectively.
328
+
329
+ After substantial revisions, or if the skill is particularly tricky, you should use subagents to forward-test the skill on realistic tasks or artifacts. When doing so, pass the artifact under validation rather than your diagnosis of what is wrong, and keep the prompt generic enough that success depends on transferable reasoning rather than hidden ground truth.
330
+
331
+ #### Start with Reusable Skill Contents
332
+
333
+ To begin implementation, start with the reusable resources identified above: `scripts/`, `references/`, and `assets/` files. Note that this step may require user input. For example, when implementing a `brand-guidelines` skill, the user may need to provide brand assets or templates to store in `assets/`, or documentation to store in `references/`.
334
+
335
+ Added scripts must be tested by actually running them to ensure there are no bugs and that the output matches what is expected. If there are many similar scripts, only a representative sample needs to be tested to ensure confidence that they all work while balancing time to completion.
336
+
337
+ If you used `--examples`, delete any placeholder files that are not needed for the skill. Only create resource directories that are actually required.
338
+
339
+ #### Update SKILL.md
340
+
341
+ **Writing Guidelines:** Always use imperative/infinitive form.
342
+
343
+ ##### Frontmatter
344
+
345
+ Write the YAML frontmatter with `name` and `description`:
346
+
347
+ - `name`: The skill name
348
+ - `description`: This is the primary triggering mechanism for your skill, and helps Codex understand when to use the skill.
349
+ - Include both what the Skill does and specific triggers/contexts for when to use it.
350
+ - Include all "when to use" information here - Not in the body. The body is only loaded after triggering, so "When to Use This Skill" sections in the body are not helpful to Codex.
351
+ - Example description for a `docx` skill: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when Codex needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"
352
+
353
+ Do not include any other fields in YAML frontmatter.
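A minimal frontmatter following these rules might look like this (the `pdf-editor` name and description are illustrative, not a bundled example):

```yaml
---
name: pdf-editor
description: Rotate, merge, and split PDF files. Use when the user asks to modify a PDF, e.g. "rotate this PDF" or "combine these PDFs into one".
---
```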
354
+
355
+ ##### Body
356
+
357
+ Write instructions for using the skill and its bundled resources.
358
+
359
+ ### Step 5: Validate the Skill
360
+
361
+ Once development of the skill is complete, validate the skill folder to catch basic issues early:
362
+
363
+ ```bash
364
+ scripts/quick_validate.py <path/to/skill-folder>
365
+ ```
366
+
367
+ The validation script checks YAML frontmatter format, required fields, and naming rules. If validation fails, fix the reported issues and run the command again.
368
+
369
+ ### Step 6: Iterate
370
+
371
+ After testing the skill, you may find it is complex enough to require forward-testing, or users may request improvements.
372
+
373
+ User testing often happens right after using the skill, while context on how the skill performed is still fresh.
374
+
375
+ **Forward-testing and iteration workflow:**
376
+
377
+ 1. Use the skill on real tasks
378
+ 2. Notice struggles or inefficiencies
379
+ 3. Identify how SKILL.md or bundled resources should be updated
380
+ 4. Implement changes and test again
381
+ 5. Forward-test if it is reasonable and appropriate
382
+
383
+ ## Forward-testing
384
+
385
+ To forward-test, launch subagents as a way to stress test the skill with minimal context.
386
+ Subagents should *not* know that they are being asked to test the skill. They should be treated as
387
+ agents asked by the user to perform a task. Prompts to subagents should look like:
388
+ `Use $skill-x at /path/to/skill-x to solve problem y`
389
+ Not:
390
+ `Review the skill at /path/to/skill-x; pretend a user asks you to...`
391
+
392
+ Decision rule for forward-testing:
393
+ - Err on the side of forward-testing
394
+ - Ask for approval if you think there's a risk that forward-testing would:
395
+ * take a long time,
396
+ * require additional approvals from the user, or
397
+ * modify live production systems
398
+
399
+ In these cases, show the user your proposed prompt and request (1) a yes/no decision, and
400
+ (2) any suggested modifications.
401
+
402
+ Considerations when forward-testing:
403
+ - use fresh threads for independent passes
404
+ - pass the skill and a request phrased the way a user would
405
+ - pass raw artifacts, not your conclusions
406
+ - avoid showing expected answers or intended fixes
407
+ - rebuild context from source artifacts after each iteration
408
+ - review the subagent's output and reasoning and emitted artifacts
409
+ - avoid leaving artifacts the agent can find on disk between iterations;
410
+ clean up subagents' artifacts to avoid additional contamination.
411
+
412
+ If forward-testing only succeeds when subagents see leaked context, tighten the skill or the
413
+ forward-testing setup before trusting the result.
skills/.system/skill-creator/agents/openai.yaml ADDED
@@ -0,0 +1,5 @@
1
+ interface:
2
+ display_name: "Skill Creator"
3
+ short_description: "Create or update a skill"
4
+ icon_small: "./assets/skill-creator-small.svg"
5
+ icon_large: "./assets/skill-creator.png"
skills/.system/skill-creator/assets/skill-creator-small.svg ADDED
skills/.system/skill-creator/assets/skill-creator.png ADDED
skills/.system/skill-creator/license.txt ADDED
@@ -0,0 +1,202 @@
1
+
2
+ Apache License
3
+ Version 2.0, January 2004
4
+ http://www.apache.org/licenses/
5
+
6
+ TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
7
+
8
+ 1. Definitions.
9
+
10
+ "License" shall mean the terms and conditions for use, reproduction,
11
+ and distribution as defined by Sections 1 through 9 of this document.
12
+
13
+ "Licensor" shall mean the copyright owner or entity authorized by
14
+ the copyright owner that is granting the License.
15
+
16
+ "Legal Entity" shall mean the union of the acting entity and all
17
+ other entities that control, are controlled by, or are under common
18
+ control with that entity. For the purposes of this definition,
19
+ "control" means (i) the power, direct or indirect, to cause the
20
+ direction or management of such entity, whether by contract or
21
+ otherwise, or (ii) ownership of fifty percent (50%) or more of the
22
+ outstanding shares, or (iii) beneficial ownership of such entity.
23
+
24
+ "You" (or "Your") shall mean an individual or Legal Entity
25
+ exercising permissions granted by this License.
26
+
27
+ "Source" form shall mean the preferred form for making modifications,
28
+ including but not limited to software source code, documentation
29
+ source, and configuration files.
30
+
31
+ "Object" form shall mean any form resulting from mechanical
32
+ transformation or translation of a Source form, including but
33
+ not limited to compiled object code, generated documentation,
34
+ and conversions to other media types.
35
+
36
+ "Work" shall mean the work of authorship, whether in Source or
37
+ Object form, made available under the License, as indicated by a
38
+ copyright notice that is included in or attached to the work
39
+ (an example is provided in the Appendix below).
40
+
41
+ "Derivative Works" shall mean any work, whether in Source or Object
42
+ form, that is based on (or derived from) the Work and for which the
43
+ editorial revisions, annotations, elaborations, or other modifications
44
+ represent, as a whole, an original work of authorship. For the purposes
45
+ of this License, Derivative Works shall not include works that remain
46
+ separable from, or merely link (or bind by name) to the interfaces of,
47
+ the Work and Derivative Works thereof.
48
+
49
+ "Contribution" shall mean any work of authorship, including
50
+ the original version of the Work and any modifications or additions
51
+ to that Work or Derivative Works thereof, that is intentionally
52
+ submitted to Licensor for inclusion in the Work by the copyright owner
53
+ or by an individual or Legal Entity authorized to submit on behalf of
54
+ the copyright owner. For the purposes of this definition, "submitted"
55
+ means any form of electronic, verbal, or written communication sent
56
+ to the Licensor or its representatives, including but not limited to
57
+ communication on electronic mailing lists, source code control systems,
58
+ and issue tracking systems that are managed by, or on behalf of, the
59
+ Licensor for the purpose of discussing and improving the Work, but
60
+ excluding communication that is conspicuously marked or otherwise
61
+ designated in writing by the copyright owner as "Not a Contribution."
62
+
63
+ "Contributor" shall mean Licensor and any individual or Legal Entity
64
+ on behalf of whom a Contribution has been received by Licensor and
65
+ subsequently incorporated within the Work.
66
+
67
+ 2. Grant of Copyright License. Subject to the terms and conditions of
68
+ this License, each Contributor hereby grants to You a perpetual,
69
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
70
+ copyright license to reproduce, prepare Derivative Works of,
71
+ publicly display, publicly perform, sublicense, and distribute the
72
+ Work and such Derivative Works in Source or Object form.
73
+
74
+ 3. Grant of Patent License. Subject to the terms and conditions of
75
+ this License, each Contributor hereby grants to You a perpetual,
76
+ worldwide, non-exclusive, no-charge, royalty-free, irrevocable
77
+ (except as stated in this section) patent license to make, have made,
78
+ use, offer to sell, sell, import, and otherwise transfer the Work,
79
+ where such license applies only to those patent claims licensable
80
+ by such Contributor that are necessarily infringed by their
81
+ Contribution(s) alone or by combination of their Contribution(s)
82
+ with the Work to which such Contribution(s) was submitted. If You
83
+ institute patent litigation against any entity (including a
84
+ cross-claim or counterclaim in a lawsuit) alleging that the Work
85
+ or a Contribution incorporated within the Work constitutes direct
86
+ or contributory patent infringement, then any patent licenses
87
+ granted to You under this License for that Work shall terminate
88
+ as of the date such litigation is filed.
89
+
90
+ 4. Redistribution. You may reproduce and distribute copies of the
91
+ Work or Derivative Works thereof in any medium, with or without
92
+ modifications, and in Source or Object form, provided that You
93
+ meet the following conditions:
94
+
95
+ (a) You must give any other recipients of the Work or
96
+ Derivative Works a copy of this License; and
97
+
98
+ (b) You must cause any modified files to carry prominent notices
99
+ stating that You changed the files; and
100
+
101
+ (c) You must retain, in the Source form of any Derivative Works
102
+ that You distribute, all copyright, patent, trademark, and
103
+ attribution notices from the Source form of the Work,
104
+ excluding those notices that do not pertain to any part of
105
+ the Derivative Works; and
106
+
107
+ (d) If the Work includes a "NOTICE" text file as part of its
108
+ distribution, then any Derivative Works that You distribute must
109
+ include a readable copy of the attribution notices contained
110
+ within such NOTICE file, excluding those notices that do not
111
+ pertain to any part of the Derivative Works, in at least one
112
+ of the following places: within a NOTICE text file distributed
113
+ as part of the Derivative Works; within the Source form or
114
+ documentation, if provided along with the Derivative Works; or,
115
+ within a display generated by the Derivative Works, if and
116
+ wherever such third-party notices normally appear. The contents
117
+ of the NOTICE file are for informational purposes only and
118
+ do not modify the License. You may add Your own attribution
119
+ notices within Derivative Works that You distribute, alongside
120
+ or as an addendum to the NOTICE text from the Work, provided
121
+ that such additional attribution notices cannot be construed
122
+ as modifying the License.
123
+
124
+ You may add Your own copyright statement to Your modifications and
125
+ may provide additional or different license terms and conditions
126
+ for use, reproduction, or distribution of Your modifications, or
127
+ for any such Derivative Works as a whole, provided Your use,
128
+ reproduction, and distribution of the Work otherwise complies with
129
+ the conditions stated in this License.
130
+
131
+ 5. Submission of Contributions. Unless You explicitly state otherwise,
132
+ any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing the
142
+ origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright [yyyy] [name of copyright owner]
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
skills/.system/skill-creator/references/openai_yaml.md ADDED
@@ -0,0 +1,49 @@
1
+ # openai.yaml fields (full example + descriptions)
2
+
3
+ `agents/openai.yaml` is an extended, product-specific config intended for the machine/harness to read, not the agent. Other product-specific config can also live in the `agents/` folder.
4
+
5
+ ## Full example
6
+
7
+ ```yaml
8
+ interface:
9
+ display_name: "Optional user-facing name"
10
+ short_description: "Optional user-facing description"
11
+ icon_small: "./assets/small-400px.png"
12
+ icon_large: "./assets/large-logo.svg"
13
+ brand_color: "#3B82F6"
14
+ default_prompt: "Optional surrounding prompt to use the skill with"
15
+
16
+ dependencies:
17
+ tools:
18
+ - type: "mcp"
19
+ value: "github"
20
+ description: "GitHub MCP server"
21
+ transport: "streamable_http"
22
+ url: "https://api.githubcopilot.com/mcp/"
23
+
24
+ policy:
25
+ allow_implicit_invocation: true
26
+ ```
27
+
28
+ ## Field descriptions and constraints
29
+
30
+ Top-level constraints:
31
+
32
+ - Quote all string values.
33
+ - Keep keys unquoted.
34
+ - For `interface.default_prompt`: generate a helpful, short (typically 1 sentence) example starting prompt based on the skill. It must explicitly mention the skill as `$skill-name` (e.g., "Use $skill-name-here to draft a concise weekly status update.").
35
+
36
+ - `interface.display_name`: Human-facing title shown in UI skill lists and chips.
37
+ - `interface.short_description`: Human-facing short UI blurb (25–64 chars) for quick scanning.
38
+ - `interface.icon_small`: Path to a small icon asset (relative to skill dir). Default to `./assets/` and place icons in the skill's `assets/` folder.
39
+ - `interface.icon_large`: Path to a larger logo asset (relative to skill dir). Default to `./assets/` and place icons in the skill's `assets/` folder.
40
+ - `interface.brand_color`: Hex color used for UI accents (e.g., badges).
41
+ - `interface.default_prompt`: Default prompt snippet inserted when invoking the skill.
42
+ - `dependencies.tools[].type`: Dependency category. Only `mcp` is supported for now.
43
+ - `dependencies.tools[].value`: Identifier of the tool or dependency.
44
+ - `dependencies.tools[].description`: Human-readable explanation of the dependency.
45
+ - `dependencies.tools[].transport`: Connection type when `type` is `mcp`.
46
+ - `dependencies.tools[].url`: MCP server URL when `type` is `mcp`.
47
+ - `policy.allow_implicit_invocation`: When false, the skill is not injected into
48
+ the model context by default, but can still be invoked explicitly via `$skill`.
49
+ Defaults to true.
skills/.system/skill-creator/scripts/generate_openai_yaml.py ADDED
@@ -0,0 +1,226 @@
1
+ #!/usr/bin/env python3
2
+ """
3
+ OpenAI YAML Generator - Creates agents/openai.yaml for a skill folder.
4
+
5
+ Usage:
6
+ generate_openai_yaml.py <skill_dir> [--name <skill_name>] [--interface key=value]
7
+ """
8
+
9
+ import argparse
10
+ import re
11
+ import sys
12
+ from pathlib import Path
13
+
14
+ ACRONYMS = {
15
+ "GH",
16
+ "MCP",
17
+ "API",
18
+ "CI",
19
+ "CLI",
20
+ "LLM",
21
+ "PDF",
22
+ "PR",
23
+ "UI",
24
+ "URL",
25
+ "SQL",
26
+ }
27
+
28
+ BRANDS = {
29
+ "openai": "OpenAI",
30
+ "openapi": "OpenAPI",
31
+ "github": "GitHub",
32
+ "pagerduty": "PagerDuty",
33
+ "datadog": "DataDog",
34
+ "sqlite": "SQLite",
35
+ "fastapi": "FastAPI",
36
+ }
37
+
38
+ SMALL_WORDS = {"and", "or", "to", "up", "with"}
39
+
40
+ ALLOWED_INTERFACE_KEYS = {
41
+ "display_name",
42
+ "short_description",
43
+ "icon_small",
44
+ "icon_large",
45
+ "brand_color",
46
+ "default_prompt",
47
+ }
48
+
49
+
50
+ def yaml_quote(value):
51
+ escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
52
+ return f'"{escaped}"'
53
+
54
+
55
+ def format_display_name(skill_name):
56
+ words = [word for word in skill_name.split("-") if word]
57
+ formatted = []
58
+ for index, word in enumerate(words):
59
+ lower = word.lower()
60
+ upper = word.upper()
61
+ if upper in ACRONYMS:
62
+ formatted.append(upper)
63
+ continue
64
+ if lower in BRANDS:
65
+ formatted.append(BRANDS[lower])
66
+ continue
67
+ if index > 0 and lower in SMALL_WORDS:
68
+ formatted.append(lower)
69
+ continue
70
+ formatted.append(word.capitalize())
71
+ return " ".join(formatted)
72
+
73
+
74
+ def generate_short_description(display_name):
75
+ description = f"Help with {display_name} tasks"
76
+
77
+ if len(description) < 25:
78
+ description = f"Help with {display_name} tasks and workflows"
79
+ if len(description) < 25:
80
+ description = f"Help with {display_name} tasks with guidance"
81
+
82
+ if len(description) > 64:
83
+ description = f"Help with {display_name}"
84
+ if len(description) > 64:
85
+ description = f"{display_name} helper"
86
+ if len(description) > 64:
87
+ description = f"{display_name} tools"
88
+ if len(description) > 64:
89
+ suffix = " helper"
90
+ max_name_length = 64 - len(suffix)
91
+ trimmed = display_name[:max_name_length].rstrip()
92
+ description = f"{trimmed}{suffix}"
93
+ if len(description) > 64:
94
+ description = description[:64].rstrip()
95
+
96
+ if len(description) < 25:
97
+ description = f"{description} workflows"
98
+ if len(description) > 64:
99
+ description = description[:64].rstrip()
100
+
101
+ return description
102
+
103
+
104
+ def read_frontmatter_name(skill_dir):
105
+ skill_md = Path(skill_dir) / "SKILL.md"
106
+ if not skill_md.exists():
107
+ print(f"[ERROR] SKILL.md not found in {skill_dir}")
108
+ return None
109
+ content = skill_md.read_text()
110
+ match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
111
+ if not match:
112
+ print("[ERROR] Invalid SKILL.md frontmatter format.")
113
+ return None
114
+ frontmatter_text = match.group(1)
115
+
116
+ import yaml
117
+
118
+ try:
119
+ frontmatter = yaml.safe_load(frontmatter_text)
120
+ except yaml.YAMLError as exc:
121
+ print(f"[ERROR] Invalid YAML frontmatter: {exc}")
122
+ return None
123
+ if not isinstance(frontmatter, dict):
124
+ print("[ERROR] Frontmatter must be a YAML dictionary.")
125
+ return None
126
+ name = frontmatter.get("name", "")
127
+ if not isinstance(name, str) or not name.strip():
128
+ print("[ERROR] Frontmatter 'name' is missing or invalid.")
129
+ return None
130
+ return name.strip()
131
+
132
+
133
+ def parse_interface_overrides(raw_overrides):
134
+ overrides = {}
135
+ optional_order = []
136
+ for item in raw_overrides:
137
+ if "=" not in item:
138
+ print(f"[ERROR] Invalid interface override '{item}'. Use key=value.")
139
+ return None, None
140
+ key, value = item.split("=", 1)
141
+ key = key.strip()
142
+ value = value.strip()
143
+ if not key:
144
+ print(f"[ERROR] Invalid interface override '{item}'. Key is empty.")
145
+ return None, None
146
+ if key not in ALLOWED_INTERFACE_KEYS:
147
+ allowed = ", ".join(sorted(ALLOWED_INTERFACE_KEYS))
148
+ print(f"[ERROR] Unknown interface field '{key}'. Allowed: {allowed}")
149
+ return None, None
150
+ overrides[key] = value
151
+ if key not in ("display_name", "short_description") and key not in optional_order:
152
+ optional_order.append(key)
153
+ return overrides, optional_order
154
+
155
+
156
+ def write_openai_yaml(skill_dir, skill_name, raw_overrides):
157
+ overrides, optional_order = parse_interface_overrides(raw_overrides)
158
+ if overrides is None:
159
+ return None
160
+
161
+ display_name = overrides.get("display_name") or format_display_name(skill_name)
162
+ short_description = overrides.get("short_description") or generate_short_description(display_name)
163
+
164
+ if not (25 <= len(short_description) <= 64):
165
+ print(
166
+ "[ERROR] short_description must be 25-64 characters "
167
+ f"(got {len(short_description)})."
168
+ )
169
+ return None
170
+
171
+ interface_lines = [
172
+ "interface:",
173
+ f" display_name: {yaml_quote(display_name)}",
174
+ f" short_description: {yaml_quote(short_description)}",
175
+ ]
176
+
177
+ for key in optional_order:
178
+ value = overrides.get(key)
179
+ if value is not None:
180
+ interface_lines.append(f" {key}: {yaml_quote(value)}")
181
+
182
+ agents_dir = Path(skill_dir) / "agents"
183
+ agents_dir.mkdir(parents=True, exist_ok=True)
184
+ output_path = agents_dir / "openai.yaml"
185
+ output_path.write_text("\n".join(interface_lines) + "\n")
186
+ print(f"[OK] Created agents/openai.yaml")
187
+ return output_path
188
+
189
+
190
+ def main():
191
+ parser = argparse.ArgumentParser(
192
+ description="Create agents/openai.yaml for a skill directory.",
193
+ )
194
+ parser.add_argument("skill_dir", help="Path to the skill directory")
195
+ parser.add_argument(
196
+ "--name",
197
+ help="Skill name override (defaults to SKILL.md frontmatter)",
198
+ )
199
+ parser.add_argument(
200
+ "--interface",
201
+ action="append",
202
+ default=[],
203
+ help="Interface override in key=value format (repeatable)",
204
+ )
205
+ args = parser.parse_args()
206
+
207
+ skill_dir = Path(args.skill_dir).resolve()
208
+ if not skill_dir.exists():
209
+ print(f"[ERROR] Skill directory not found: {skill_dir}")
210
+ sys.exit(1)
211
+ if not skill_dir.is_dir():
212
+ print(f"[ERROR] Path is not a directory: {skill_dir}")
213
+ sys.exit(1)
214
+
215
+ skill_name = args.name or read_frontmatter_name(skill_dir)
216
+ if not skill_name:
217
+ sys.exit(1)
218
+
219
+ result = write_openai_yaml(skill_dir, skill_name, args.interface)
220
+ if result:
221
+ sys.exit(0)
222
+ sys.exit(1)
223
+
224
+
225
+ if __name__ == "__main__":
226
+ main()
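
As a quick sanity check of the naming heuristics above, the display-name precedence (acronym, then brand, then small word, then plain capitalization) can be reproduced in a standalone sketch. The lookup tables here are trimmed subsets of the script's; this is an illustration, not the script itself:

```python
# Trimmed lookup tables, mirroring the script's precedence rules.
ACRONYMS = {"API", "PDF", "SQL"}
BRANDS = {"github": "GitHub", "openai": "OpenAI"}
SMALL_WORDS = {"and", "or", "to", "up", "with"}


def format_display_name(skill_name):
    """Turn a hyphen-case skill name into a human-friendly display name."""
    formatted = []
    for index, word in enumerate(w for w in skill_name.split("-") if w):
        if word.upper() in ACRONYMS:
            formatted.append(word.upper())          # acronyms win first
        elif word.lower() in BRANDS:
            formatted.append(BRANDS[word.lower()])  # then known brand casing
        elif index > 0 and word.lower() in SMALL_WORDS:
            formatted.append(word.lower())          # keep connectors lowercase
        else:
            formatted.append(word.capitalize())
    return " ".join(formatted)


print(format_display_name("github-pdf-to-markdown"))  # GitHub PDF to Markdown
```

Note that a small word in the first position is still capitalized (the `index > 0` guard), so `to-do-list` becomes `To Do List`.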
skills/.system/skill-creator/scripts/init_skill.py ADDED
@@ -0,0 +1,400 @@
+ #!/usr/bin/env python3
+ """
+ Skill Initializer - Creates a new skill from template
+
+ Usage:
+     init_skill.py <skill-name> --path <path> [--resources scripts,references,assets] [--examples] [--interface key=value]
+
+ Examples:
+     init_skill.py my-new-skill --path skills/public
+     init_skill.py my-new-skill --path skills/public --resources scripts,references
+     init_skill.py my-api-helper --path skills/private --resources scripts --examples
+     init_skill.py custom-skill --path /custom/location
+     init_skill.py my-skill --path skills/public --interface short_description="Short UI label"
+ """
+
+ import argparse
+ import re
+ import sys
+ from pathlib import Path
+
+ from generate_openai_yaml import write_openai_yaml
+
+ MAX_SKILL_NAME_LENGTH = 64
+ ALLOWED_RESOURCES = {"scripts", "references", "assets"}
+
+ SKILL_TEMPLATE = """---
+ name: {skill_name}
+ description: [TODO: Complete and informative explanation of what the skill does and when to use it. Include WHEN to use this skill - specific scenarios, file types, or tasks that trigger it.]
+ ---
+
+ # {skill_title}
+
+ ## Overview
+
+ [TODO: 1-2 sentences explaining what this skill enables]
+
+ ## Structuring This Skill
+
+ [TODO: Choose the structure that best fits this skill's purpose. Common patterns:
+
+ **1. Workflow-Based** (best for sequential processes)
+ - Works well when there are clear step-by-step procedures
+ - Example: DOCX skill with "Workflow Decision Tree" -> "Reading" -> "Creating" -> "Editing"
+ - Structure: ## Overview -> ## Workflow Decision Tree -> ## Step 1 -> ## Step 2...
+
+ **2. Task-Based** (best for tool collections)
+ - Works well when the skill offers different operations/capabilities
+ - Example: PDF skill with "Quick Start" -> "Merge PDFs" -> "Split PDFs" -> "Extract Text"
+ - Structure: ## Overview -> ## Quick Start -> ## Task Category 1 -> ## Task Category 2...
+
+ **3. Reference/Guidelines** (best for standards or specifications)
+ - Works well for brand guidelines, coding standards, or requirements
+ - Example: Brand styling with "Brand Guidelines" -> "Colors" -> "Typography" -> "Features"
+ - Structure: ## Overview -> ## Guidelines -> ## Specifications -> ## Usage...
+
+ **4. Capabilities-Based** (best for integrated systems)
+ - Works well when the skill provides multiple interrelated features
+ - Example: Product Management with "Core Capabilities" -> numbered capability list
+ - Structure: ## Overview -> ## Core Capabilities -> ### 1. Feature -> ### 2. Feature...
+
+ Patterns can be mixed and matched as needed. Most skills combine patterns (e.g., start with task-based, add workflow for complex operations).
+
+ Delete this entire "Structuring This Skill" section when done - it's just guidance.]
+
+ ## [TODO: Replace with the first main section based on chosen structure]
+
+ [TODO: Add content here. See examples in existing skills:
+ - Code samples for technical skills
+ - Decision trees for complex workflows
+ - Concrete examples with realistic user requests
+ - References to scripts/templates/references as needed]
+
+ ## Resources (optional)
+
+ Create only the resource directories this skill actually needs. Delete this section if no resources are required.
+
+ ### scripts/
+ Executable code (Python/Bash/etc.) that can be run directly to perform specific operations.
+
+ **Examples from other skills:**
+ - PDF skill: `fill_fillable_fields.py`, `extract_form_field_info.py` - utilities for PDF manipulation
+ - DOCX skill: `document.py`, `utilities.py` - Python modules for document processing
+
+ **Appropriate for:** Python scripts, shell scripts, or any executable code that performs automation, data processing, or specific operations.
+
+ **Note:** Scripts may be executed without loading into context, but can still be read by Codex for patching or environment adjustments.
+
+ ### references/
+ Documentation and reference material intended to be loaded into context to inform Codex's process and thinking.
+
+ **Examples from other skills:**
+ - Product management: `communication.md`, `context_building.md` - detailed workflow guides
+ - BigQuery: API reference documentation and query examples
+ - Finance: Schema documentation, company policies
+
+ **Appropriate for:** In-depth documentation, API references, database schemas, comprehensive guides, or any detailed information that Codex should reference while working.
+
+ ### assets/
+ Files not intended to be loaded into context, but rather used within the output Codex produces.
+
+ **Examples from other skills:**
+ - Brand styling: PowerPoint template files (.pptx), logo files
+ - Frontend builder: HTML/React boilerplate project directories
+ - Typography: Font files (.ttf, .woff2)
+
+ **Appropriate for:** Templates, boilerplate code, document templates, images, icons, fonts, or any files meant to be copied or used in the final output.
+
+ ---
+
+ **Not every skill requires all three types of resources.**
+ """
+
+ EXAMPLE_SCRIPT = '''#!/usr/bin/env python3
+ """
+ Example helper script for {skill_name}
+
+ This is a placeholder script that can be executed directly.
+ Replace with actual implementation or delete if not needed.
+
+ Example real scripts from other skills:
+ - pdf/scripts/fill_fillable_fields.py - Fills PDF form fields
+ - pdf/scripts/convert_pdf_to_images.py - Converts PDF pages to images
+ """
+
+ def main():
+     print("This is an example script for {skill_name}")
+     # TODO: Add actual script logic here
+     # This could be data processing, file conversion, API calls, etc.
+
+ if __name__ == "__main__":
+     main()
+ '''
+
+ EXAMPLE_REFERENCE = """# Reference Documentation for {skill_title}
+
+ This is a placeholder for detailed reference documentation.
+ Replace with actual reference content or delete if not needed.
+
+ Example real reference docs from other skills:
+ - product-management/references/communication.md - Comprehensive guide for status updates
+ - product-management/references/context_building.md - Deep-dive on gathering context
+ - bigquery/references/ - API references and query examples
+
+ ## When Reference Docs Are Useful
+
+ Reference docs are ideal for:
+ - Comprehensive API documentation
+ - Detailed workflow guides
+ - Complex multi-step processes
+ - Information too lengthy for main SKILL.md
+ - Content that's only needed for specific use cases
+
+ ## Structure Suggestions
+
+ ### API Reference Example
+ - Overview
+ - Authentication
+ - Endpoints with examples
+ - Error codes
+ - Rate limits
+
+ ### Workflow Guide Example
+ - Prerequisites
+ - Step-by-step instructions
+ - Common patterns
+ - Troubleshooting
+ - Best practices
+ """
+
+ EXAMPLE_ASSET = """# Example Asset File
+
+ This placeholder represents where asset files would be stored.
+ Replace with actual asset files (templates, images, fonts, etc.) or delete if not needed.
+
+ Asset files are NOT intended to be loaded into context, but rather used within
+ the output Codex produces.
+
+ Example asset files from other skills:
+ - Brand guidelines: logo.png, slides_template.pptx
+ - Frontend builder: hello-world/ directory with HTML/React boilerplate
+ - Typography: custom-font.ttf, font-family.woff2
+ - Data: sample_data.csv, test_dataset.json
+
+ ## Common Asset Types
+
+ - Templates: .pptx, .docx, boilerplate directories
+ - Images: .png, .jpg, .svg, .gif
+ - Fonts: .ttf, .otf, .woff, .woff2
+ - Boilerplate code: Project directories, starter files
+ - Icons: .ico, .svg
+ - Data files: .csv, .json, .xml, .yaml
+
+ Note: This is a text placeholder. Actual assets can be any file type.
+ """
+
+
+ def normalize_skill_name(skill_name):
+     """Normalize a skill name to lowercase hyphen-case."""
+     normalized = skill_name.strip().lower()
+     normalized = re.sub(r"[^a-z0-9]+", "-", normalized)
+     normalized = normalized.strip("-")
+     normalized = re.sub(r"-{2,}", "-", normalized)
+     return normalized
+
+
+ def title_case_skill_name(skill_name):
+     """Convert hyphenated skill name to Title Case for display."""
+     return " ".join(word.capitalize() for word in skill_name.split("-"))
+
+
+ def parse_resources(raw_resources):
+     if not raw_resources:
+         return []
+     resources = [item.strip() for item in raw_resources.split(",") if item.strip()]
+     invalid = sorted({item for item in resources if item not in ALLOWED_RESOURCES})
+     if invalid:
+         allowed = ", ".join(sorted(ALLOWED_RESOURCES))
+         print(f"[ERROR] Unknown resource type(s): {', '.join(invalid)}")
+         print(f"        Allowed: {allowed}")
+         sys.exit(1)
+     deduped = []
+     seen = set()
+     for resource in resources:
+         if resource not in seen:
+             deduped.append(resource)
+             seen.add(resource)
+     return deduped
+
+
+ def create_resource_dirs(skill_dir, skill_name, skill_title, resources, include_examples):
+     for resource in resources:
+         resource_dir = skill_dir / resource
+         resource_dir.mkdir(exist_ok=True)
+         if resource == "scripts":
+             if include_examples:
+                 example_script = resource_dir / "example.py"
+                 example_script.write_text(EXAMPLE_SCRIPT.format(skill_name=skill_name))
+                 example_script.chmod(0o755)
+                 print("[OK] Created scripts/example.py")
+             else:
+                 print("[OK] Created scripts/")
+         elif resource == "references":
+             if include_examples:
+                 example_reference = resource_dir / "api_reference.md"
+                 example_reference.write_text(EXAMPLE_REFERENCE.format(skill_title=skill_title))
+                 print("[OK] Created references/api_reference.md")
+             else:
+                 print("[OK] Created references/")
+         elif resource == "assets":
+             if include_examples:
+                 example_asset = resource_dir / "example_asset.txt"
+                 example_asset.write_text(EXAMPLE_ASSET)
+                 print("[OK] Created assets/example_asset.txt")
+             else:
+                 print("[OK] Created assets/")
+
+
+ def init_skill(skill_name, path, resources, include_examples, interface_overrides):
+     """
+     Initialize a new skill directory with template SKILL.md.
+
+     Args:
+         skill_name: Name of the skill
+         path: Path where the skill directory should be created
+         resources: Resource directories to create
+         include_examples: Whether to create example files in resource directories
+         interface_overrides: Interface overrides in key=value format
+
+     Returns:
+         Path to created skill directory, or None if error
+     """
+     # Determine skill directory path
+     skill_dir = Path(path).resolve() / skill_name
+
+     # Check if directory already exists
+     if skill_dir.exists():
+         print(f"[ERROR] Skill directory already exists: {skill_dir}")
+         return None
+
+     # Create skill directory
+     try:
+         skill_dir.mkdir(parents=True, exist_ok=False)
+         print(f"[OK] Created skill directory: {skill_dir}")
+     except Exception as e:
+         print(f"[ERROR] Error creating directory: {e}")
+         return None
+
+     # Create SKILL.md from template
+     skill_title = title_case_skill_name(skill_name)
+     skill_content = SKILL_TEMPLATE.format(skill_name=skill_name, skill_title=skill_title)
+
+     skill_md_path = skill_dir / "SKILL.md"
+     try:
+         skill_md_path.write_text(skill_content)
+         print("[OK] Created SKILL.md")
+     except Exception as e:
+         print(f"[ERROR] Error creating SKILL.md: {e}")
+         return None
+
+     # Create agents/openai.yaml
+     try:
+         result = write_openai_yaml(skill_dir, skill_name, interface_overrides)
+         if not result:
+             return None
+     except Exception as e:
+         print(f"[ERROR] Error creating agents/openai.yaml: {e}")
+         return None
+
+     # Create resource directories if requested
+     if resources:
+         try:
+             create_resource_dirs(skill_dir, skill_name, skill_title, resources, include_examples)
+         except Exception as e:
+             print(f"[ERROR] Error creating resource directories: {e}")
+             return None
+
+     # Print next steps
+     print(f"\n[OK] Skill '{skill_name}' initialized successfully at {skill_dir}")
+     print("\nNext steps:")
+     print("1. Edit SKILL.md to complete the TODO items and update the description")
+     if resources:
+         if include_examples:
+             print("2. Customize or delete the example files in scripts/, references/, and assets/")
+         else:
+             print("2. Add resources to scripts/, references/, and assets/ as needed")
+     else:
+         print("2. Create resource directories only if needed (scripts/, references/, assets/)")
+     print("3. Update agents/openai.yaml if the UI metadata should differ")
+     print("4. Run the validator when ready to check the skill structure")
+     print(
+         "5. Forward-test complex skills with realistic user requests to ensure they work as intended"
+     )
+
+     return skill_dir
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         description="Create a new skill directory with a SKILL.md template.",
+     )
+     parser.add_argument("skill_name", help="Skill name (normalized to hyphen-case)")
+     parser.add_argument("--path", required=True, help="Output directory for the skill")
+     parser.add_argument(
+         "--resources",
+         default="",
+         help="Comma-separated list: scripts,references,assets",
+     )
+     parser.add_argument(
+         "--examples",
+         action="store_true",
+         help="Create example files inside the selected resource directories",
+     )
+     parser.add_argument(
+         "--interface",
+         action="append",
+         default=[],
+         help="Interface override in key=value format (repeatable)",
+     )
+     args = parser.parse_args()
+
+     raw_skill_name = args.skill_name
+     skill_name = normalize_skill_name(raw_skill_name)
+     if not skill_name:
+         print("[ERROR] Skill name must include at least one letter or digit.")
+         sys.exit(1)
+     if len(skill_name) > MAX_SKILL_NAME_LENGTH:
+         print(
+             f"[ERROR] Skill name '{skill_name}' is too long ({len(skill_name)} characters). "
+             f"Maximum is {MAX_SKILL_NAME_LENGTH} characters."
+         )
+         sys.exit(1)
+     if skill_name != raw_skill_name:
+         print(f"Note: Normalized skill name from '{raw_skill_name}' to '{skill_name}'.")
+
+     resources = parse_resources(args.resources)
+     if args.examples and not resources:
+         print("[ERROR] --examples requires --resources to be set.")
+         sys.exit(1)
+
+     path = args.path
+
+     print(f"Initializing skill: {skill_name}")
+     print(f"  Location: {path}")
+     if resources:
+         print(f"  Resources: {', '.join(resources)}")
+         if args.examples:
+             print("  Examples: enabled")
+     else:
+         print("  Resources: none (create as needed)")
+     print()
+
+     result = init_skill(skill_name, path, resources, args.examples, args.interface)
+
+     if result:
+         sys.exit(0)
+     else:
+         sys.exit(1)
+
+
+ if __name__ == "__main__":
+     main()
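
The name normalization in this script is easy to exercise in isolation. This standalone sketch mirrors its logic (collapse any non-alphanumeric run to a hyphen, trim edge hyphens, deduplicate); the sample inputs are made up:

```python
import re


def normalize_skill_name(skill_name):
    """Normalize a skill name to lowercase hyphen-case."""
    normalized = skill_name.strip().lower()
    # Any run of characters outside [a-z0-9] becomes a single hyphen.
    normalized = re.sub(r"[^a-z0-9]+", "-", normalized)
    normalized = normalized.strip("-")
    # Defensive: collapse any remaining hyphen runs.
    return re.sub(r"-{2,}", "-", normalized)


print(normalize_skill_name("My Cool_Skill!"))      # my-cool-skill
print(normalize_skill_name("--weird---name--"))    # weird-name
```

An empty result (e.g., from a name with no letters or digits) is what triggers the "must include at least one letter or digit" error in `main()`.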
skills/.system/skill-creator/scripts/quick_validate.py ADDED
@@ -0,0 +1,101 @@
+ #!/usr/bin/env python3
+ """
+ Quick validation script for skills - minimal version
+ """
+
+ import re
+ import sys
+ from pathlib import Path
+
+ import yaml
+
+ MAX_SKILL_NAME_LENGTH = 64
+
+
+ def validate_skill(skill_path):
+     """Basic validation of a skill"""
+     skill_path = Path(skill_path)
+
+     skill_md = skill_path / "SKILL.md"
+     if not skill_md.exists():
+         return False, "SKILL.md not found"
+
+     content = skill_md.read_text()
+     if not content.startswith("---"):
+         return False, "No YAML frontmatter found"
+
+     match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
+     if not match:
+         return False, "Invalid frontmatter format"
+
+     frontmatter_text = match.group(1)
+
+     try:
+         frontmatter = yaml.safe_load(frontmatter_text)
+         if not isinstance(frontmatter, dict):
+             return False, "Frontmatter must be a YAML dictionary"
+     except yaml.YAMLError as e:
+         return False, f"Invalid YAML in frontmatter: {e}"
+
+     allowed_properties = {"name", "description", "license", "allowed-tools", "metadata"}
+
+     unexpected_keys = set(frontmatter.keys()) - allowed_properties
+     if unexpected_keys:
+         allowed = ", ".join(sorted(allowed_properties))
+         unexpected = ", ".join(sorted(unexpected_keys))
+         return (
+             False,
+             f"Unexpected key(s) in SKILL.md frontmatter: {unexpected}. Allowed properties are: {allowed}",
+         )
+
+     if "name" not in frontmatter:
+         return False, "Missing 'name' in frontmatter"
+     if "description" not in frontmatter:
+         return False, "Missing 'description' in frontmatter"
+
+     name = frontmatter.get("name", "")
+     if not isinstance(name, str):
+         return False, f"Name must be a string, got {type(name).__name__}"
+     name = name.strip()
+     if name:
+         if not re.match(r"^[a-z0-9-]+$", name):
+             return (
+                 False,
+                 f"Name '{name}' should be hyphen-case (lowercase letters, digits, and hyphens only)",
+             )
+         if name.startswith("-") or name.endswith("-") or "--" in name:
+             return (
+                 False,
+                 f"Name '{name}' cannot start/end with hyphen or contain consecutive hyphens",
+             )
+         if len(name) > MAX_SKILL_NAME_LENGTH:
+             return (
+                 False,
+                 f"Name is too long ({len(name)} characters). "
+                 f"Maximum is {MAX_SKILL_NAME_LENGTH} characters.",
+             )
+
+     description = frontmatter.get("description", "")
+     if not isinstance(description, str):
+         return False, f"Description must be a string, got {type(description).__name__}"
+     description = description.strip()
+     if description:
+         if "<" in description or ">" in description:
+             return False, "Description cannot contain angle brackets (< or >)"
+         if len(description) > 1024:
+             return (
+                 False,
+                 f"Description is too long ({len(description)} characters). Maximum is 1024 characters.",
+             )
+
+     return True, "Skill is valid!"
+
+
+ if __name__ == "__main__":
+     if len(sys.argv) != 2:
+         print("Usage: python quick_validate.py <skill_directory>")
+         sys.exit(1)
+
+     valid, message = validate_skill(sys.argv[1])
+     print(message)
+     sys.exit(0 if valid else 1)
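
For illustration, the frontmatter extraction this validator relies on can be exercised standalone. The sample SKILL.md content below is made up; the regex is the one used by both `quick_validate.py` and `generate_openai_yaml.py`:

```python
import re

# Hypothetical SKILL.md content for demonstration only.
SKILL_MD = """---
name: my-skill
description: Example skill for testing.
---

# My Skill
"""

# Non-greedy match up to the first closing `---`; DOTALL lets `.` span newlines.
match = re.match(r"^---\n(.*?)\n---", SKILL_MD, re.DOTALL)
frontmatter_text = match.group(1) if match else None
print(frontmatter_text)
```

Because `re.match` anchors at the start of the string, any content before the opening `---` (even a blank line) makes the match fail, which is why the validator also checks `content.startswith("---")` first.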
skills/.system/skill-installer/LICENSE.txt ADDED
@@ -0,0 +1,202 @@
+
+                                  Apache License
+                            Version 2.0, January 2004
+                         http://www.apache.org/licenses/
+
+    TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+    1. Definitions.
+
+       "License" shall mean the terms and conditions for use, reproduction,
+       and distribution as defined by Sections 1 through 9 of this document.
+
+       "Licensor" shall mean the copyright owner or entity authorized by
+       the copyright owner that is granting the License.
+
+       "Legal Entity" shall mean the union of the acting entity and all
+       other entities that control, are controlled by, or are under common
+       control with that entity. For the purposes of this definition,
+       "control" means (i) the power, direct or indirect, to cause the
+       direction or management of such entity, whether by contract or
+       otherwise, or (ii) ownership of fifty percent (50%) or more of the
+       outstanding shares, or (iii) beneficial ownership of such entity.
+
+       "You" (or "Your") shall mean an individual or Legal Entity
+       exercising permissions granted by this License.
+
+       "Source" form shall mean the preferred form for making modifications,
+       including but not limited to software source code, documentation
+       source, and configuration files.
+
+       "Object" form shall mean any form resulting from mechanical
+       transformation or translation of a Source form, including but
+       not limited to compiled object code, generated documentation,
+       and conversions to other media types.
+
+       "Work" shall mean the work of authorship, whether in Source or
+       Object form, made available under the License, as indicated by a
+       copyright notice that is included in or attached to the work
+       (an example is provided in the Appendix below).
+
+       "Derivative Works" shall mean any work, whether in Source or Object
+       form, that is based on (or derived from) the Work and for which the
+       editorial revisions, annotations, elaborations, or other modifications
+       represent, as a whole, an original work of authorship. For the purposes
+       of this License, Derivative Works shall not include works that remain
+       separable from, or merely link (or bind by name) to the interfaces of,
+       the Work and Derivative Works thereof.
+
+       "Contribution" shall mean any work of authorship, including
+       the original version of the Work and any modifications or additions
+       to that Work or Derivative Works thereof, that is intentionally
+       submitted to Licensor for inclusion in the Work by the copyright owner
+       or by an individual or Legal Entity authorized to submit on behalf of
+       the copyright owner. For the purposes of this definition, "submitted"
+       means any form of electronic, verbal, or written communication sent
+       to the Licensor or its representatives, including but not limited to
+       communication on electronic mailing lists, source code control systems,
+       and issue tracking systems that are managed by, or on behalf of, the
+       Licensor for the purpose of discussing and improving the Work, but
+       excluding communication that is conspicuously marked or otherwise
+       designated in writing by the copyright owner as "Not a Contribution."
+
+       "Contributor" shall mean Licensor and any individual or Legal Entity
+       on behalf of whom a Contribution has been received by Licensor and
+       subsequently incorporated within the Work.
+
+    2. Grant of Copyright License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       copyright license to reproduce, prepare Derivative Works of,
+       publicly display, publicly perform, sublicense, and distribute the
+       Work and such Derivative Works in Source or Object form.
+
+    3. Grant of Patent License. Subject to the terms and conditions of
+       this License, each Contributor hereby grants to You a perpetual,
+       worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+       (except as stated in this section) patent license to make, have made,
+       use, offer to sell, sell, import, and otherwise transfer the Work,
+       where such license applies only to those patent claims licensable
+       by such Contributor that are necessarily infringed by their
+       Contribution(s) alone or by combination of their Contribution(s)
+       with the Work to which such Contribution(s) was submitted. If You
+       institute patent litigation against any entity (including a
+       cross-claim or counterclaim in a lawsuit) alleging that the Work
+       or a Contribution incorporated within the Work constitutes direct
+       or contributory patent infringement, then any patent licenses
+       granted to You under this License for that Work shall terminate
+       as of the date such litigation is filed.
+
+    4. Redistribution. You may reproduce and distribute copies of the
+       Work or Derivative Works thereof in any medium, with or without
+       modifications, and in Source or Object form, provided that You
+       meet the following conditions:
+
+       (a) You must give any other recipients of the Work or
+           Derivative Works a copy of this License; and
+
+       (b) You must cause any modified files to carry prominent notices
+           stating that You changed the files; and
+
+       (c) You must retain, in the Source form of any Derivative Works
+           that You distribute, all copyright, patent, trademark, and
+           attribution notices from the Source form of the Work,
+           excluding those notices that do not pertain to any part of
+           the Derivative Works; and
+
+       (d) If the Work includes a "NOTICE" text file as part of its
+           distribution, then any Derivative Works that You distribute must
+           include a readable copy of the attribution notices contained
+           within such NOTICE file, excluding those notices that do not
+           pertain to any part of the Derivative Works, in at least one
+           of the following places: within a NOTICE text file distributed
+           as part of the Derivative Works; within the Source form or
+           documentation, if provided along with the Derivative Works; or,
+           within a display generated by the Derivative Works, if and
+           wherever such third-party notices normally appear. The contents
+           of the NOTICE file are for informational purposes only and
+           do not modify the License. You may add Your own attribution
+           notices within Derivative Works that You distribute, alongside
+           or as an addendum to the NOTICE text from the Work, provided
+           that such additional attribution notices cannot be construed
+           as modifying the License.
+
+       You may add Your own copyright statement to Your modifications and
+       may provide additional or different license terms and conditions
+       for use, reproduction, or distribution of Your modifications, or
+       for any such Derivative Works as a whole, provided Your use,
+       reproduction, and distribution of the Work otherwise complies with
+       the conditions stated in this License.
+
+    5. Submission of Contributions. Unless You explicitly state otherwise,
+       any Contribution intentionally submitted for inclusion in the Work
133
+ by You to the Licensor shall be under the terms and conditions of
134
+ this License, without any additional terms or conditions.
135
+ Notwithstanding the above, nothing herein shall supersede or modify
136
+ the terms of any separate license agreement you may have executed
137
+ with Licensor regarding such Contributions.
138
+
139
+ 6. Trademarks. This License does not grant permission to use the trade
140
+ names, trademarks, service marks, or product names of the Licensor,
141
+ except as required for reasonable and customary use in describing the
142
+ origin of the Work and reproducing the content of the NOTICE file.
143
+
144
+ 7. Disclaimer of Warranty. Unless required by applicable law or
145
+ agreed to in writing, Licensor provides the Work (and each
146
+ Contributor provides its Contributions) on an "AS IS" BASIS,
147
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
148
+ implied, including, without limitation, any warranties or conditions
149
+ of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
150
+ PARTICULAR PURPOSE. You are solely responsible for determining the
151
+ appropriateness of using or redistributing the Work and assume any
152
+ risks associated with Your exercise of permissions under this License.
153
+
154
+ 8. Limitation of Liability. In no event and under no legal theory,
155
+ whether in tort (including negligence), contract, or otherwise,
156
+ unless required by applicable law (such as deliberate and grossly
157
+ negligent acts) or agreed to in writing, shall any Contributor be
158
+ liable to You for damages, including any direct, indirect, special,
159
+ incidental, or consequential damages of any character arising as a
160
+ result of this License or out of the use or inability to use the
161
+ Work (including but not limited to damages for loss of goodwill,
162
+ work stoppage, computer failure or malfunction, or any and all
163
+ other commercial damages or losses), even if such Contributor
164
+ has been advised of the possibility of such damages.
165
+
166
+ 9. Accepting Warranty or Additional Liability. While redistributing
167
+ the Work or Derivative Works thereof, You may choose to offer,
168
+ and charge a fee for, acceptance of support, warranty, indemnity,
169
+ or other liability obligations and/or rights consistent with this
170
+ License. However, in accepting such obligations, You may act only
171
+ on Your own behalf and on Your sole responsibility, not on behalf
172
+ of any other Contributor, and only if You agree to indemnify,
173
+ defend, and hold each Contributor harmless for any liability
174
+ incurred by, or claims asserted against, such Contributor by reason
175
+ of your accepting any such warranty or additional liability.
176
+
177
+ END OF TERMS AND CONDITIONS
178
+
179
+ APPENDIX: How to apply the Apache License to your work.
180
+
181
+ To apply the Apache License to your work, attach the following
182
+ boilerplate notice, with the fields enclosed by brackets "[]"
183
+ replaced with your own identifying information. (Don't include
184
+ the brackets!) The text should be enclosed in the appropriate
185
+ comment syntax for the file format. We also recommend that a
186
+ file or class name and description of purpose be included on the
187
+ same "printed page" as the copyright notice for easier
188
+ identification within third-party archives.
189
+
190
+ Copyright [yyyy] [name of copyright owner]
191
+
192
+ Licensed under the Apache License, Version 2.0 (the "License");
193
+ you may not use this file except in compliance with the License.
194
+ You may obtain a copy of the License at
195
+
196
+ http://www.apache.org/licenses/LICENSE-2.0
197
+
198
+ Unless required by applicable law or agreed to in writing, software
199
+ distributed under the License is distributed on an "AS IS" BASIS,
200
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
201
+ See the License for the specific language governing permissions and
202
+ limitations under the License.
skills/.system/skill-installer/SKILL.md ADDED
@@ -0,0 +1,58 @@
1
+ ---
2
+ name: skill-installer
3
+ description: Install Codex skills into $CODEX_HOME/skills from a curated list or a GitHub repo path. Use when a user asks to list installable skills, install a curated skill, or install a skill from another repo (including private repos).
4
+ metadata:
5
+ short-description: Install curated skills from openai/skills or other repos
6
+ ---
7
+
8
+ # Skill Installer
9
+
10
+ Helps install skills. By default, skills come from https://github.com/openai/skills/tree/main/skills/.curated, but users can also provide other locations. Experimental skills live in https://github.com/openai/skills/tree/main/skills/.experimental and can be installed the same way.
11
+
12
+ Use the helper scripts based on the task:
13
+ - List skills when the user asks what is available, or if the user uses this skill without specifying what to do. Default listing is `.curated`, but you can pass `--path skills/.experimental` when they ask about experimental skills.
14
+ - Install from the curated list when the user provides a skill name.
15
+ - Install from another repo when the user provides a GitHub repo/path (including private repos).
16
+
17
+ Install skills with the helper scripts.
18
+
19
+ ## Communication
20
+
21
+ When listing skills, output approximately as follows, depending on the context of the user's request. If they ask about experimental skills, list from `.experimental` instead of `.curated` and label the source accordingly:
22
+ """
23
+ Skills from {repo}:
24
+ 1. skill-1
25
+ 2. skill-2 (already installed)
26
+ 3. ...
27
+ Which ones would you like installed?
28
+ """
29
+
30
+ After installing a skill, tell the user: "Restart Codex to pick up new skills."
31
+
32
+ ## Scripts
33
+
34
+ All of these scripts require network access, so request escalation when running them in the sandbox.
35
+
36
+ - `scripts/list-skills.py` (prints skills list with installed annotations)
37
+ - `scripts/list-skills.py --format json`
38
+ - Example (experimental list): `scripts/list-skills.py --path skills/.experimental`
39
+ - `scripts/install-skill-from-github.py --repo <owner>/<repo> --path <path/to/skill> [<path/to/skill> ...]`
40
+ - `scripts/install-skill-from-github.py --url https://github.com/<owner>/<repo>/tree/<ref>/<path>`
41
+ - Example (experimental skill): `scripts/install-skill-from-github.py --repo openai/skills --path skills/.experimental/<skill-name>`
42
+
43
+ ## Behavior and Options
44
+
45
+ - Defaults to direct download for public GitHub repos.
46
+ - If download fails with auth/permission errors, falls back to git sparse checkout.
47
+ - Aborts if the destination skill directory already exists.
48
+ - Installs into `$CODEX_HOME/skills/<skill-name>` (defaults to `~/.codex/skills`).
49
+ - Multiple `--path` values install multiple skills in one run, each named from the path basename unless `--name` is supplied.
50
+ - Options: `--ref <ref>` (default `main`), `--dest <path>`, `--method auto|download|git`.
51
+
52
+ ## Notes
53
+
54
+ - Curated listing is fetched from `https://github.com/openai/skills/tree/main/skills/.curated` via the GitHub API. If it is unavailable, explain the error and exit.
55
+ - Private GitHub repos can be accessed via existing git credentials or optional `GITHUB_TOKEN`/`GH_TOKEN` for download.
56
+ - Git fallback tries HTTPS first, then SSH.
57
+ - The skills at https://github.com/openai/skills/tree/main/skills/.system are preinstalled, so no need to help users install those. If they ask, just explain this. If they insist, you can download and overwrite.
58
+ - Installed annotations come from `$CODEX_HOME/skills`.
skills/.system/skill-installer/agents/openai.yaml ADDED
@@ -0,0 +1,5 @@
1
+ interface:
2
+ display_name: "Skill Installer"
3
+ short_description: "Install curated skills from openai/skills or other repos"
4
+ icon_small: "./assets/skill-installer-small.svg"
5
+ icon_large: "./assets/skill-installer.png"
skills/.system/skill-installer/assets/skill-installer-small.svg ADDED
skills/.system/skill-installer/assets/skill-installer.png ADDED
skills/.system/skill-installer/scripts/github_utils.py ADDED
@@ -0,0 +1,21 @@
1
+ #!/usr/bin/env python3
2
+ """Shared GitHub helpers for skill install scripts."""
3
+
4
+ from __future__ import annotations
5
+
6
+ import os
7
+ import urllib.request
8
+
9
+
10
+ def github_request(url: str, user_agent: str) -> bytes:
11
+ headers = {"User-Agent": user_agent}
12
+ token = os.environ.get("GITHUB_TOKEN") or os.environ.get("GH_TOKEN")
13
+ if token:
14
+ headers["Authorization"] = f"token {token}"
15
+ req = urllib.request.Request(url, headers=headers)
16
+ with urllib.request.urlopen(req) as resp:
17
+ return resp.read()
18
+
19
+
20
+ def github_api_contents_url(repo: str, path: str, ref: str) -> str:
21
+ return f"https://api.github.com/repos/{repo}/contents/{path}?ref={ref}"
skills/.system/skill-installer/scripts/install-skill-from-github.py ADDED
@@ -0,0 +1,308 @@
1
+ #!/usr/bin/env python3
2
+ """Install a skill from a GitHub repo path into $CODEX_HOME/skills."""
3
+
4
+ from __future__ import annotations
5
+
6
+ import argparse
7
+ from dataclasses import dataclass
8
+ import os
9
+ import shutil
10
+ import subprocess
11
+ import sys
12
+ import tempfile
13
+ import urllib.error
14
+ import urllib.parse
15
+ import zipfile
16
+
17
+ from github_utils import github_request
18
+ DEFAULT_REF = "main"
19
+
20
+
21
+ @dataclass
22
+ class Args:
23
+ url: str | None = None
24
+ repo: str | None = None
25
+ path: list[str] | None = None
26
+ ref: str = DEFAULT_REF
27
+ dest: str | None = None
28
+ name: str | None = None
29
+ method: str = "auto"
30
+
31
+
32
+ @dataclass
33
+ class Source:
34
+ owner: str
35
+ repo: str
36
+ ref: str
37
+ paths: list[str]
38
+ repo_url: str | None = None
39
+
40
+
41
+ class InstallError(Exception):
42
+ pass
43
+
44
+
45
+ def _codex_home() -> str:
46
+ return os.environ.get("CODEX_HOME", os.path.expanduser("~/.codex"))
47
+
48
+
49
+ def _tmp_root() -> str:
50
+ base = os.path.join(tempfile.gettempdir(), "codex")
51
+ os.makedirs(base, exist_ok=True)
52
+ return base
53
+
54
+
55
+ def _request(url: str) -> bytes:
56
+ return github_request(url, "codex-skill-install")
57
+
58
+
59
+ def _parse_github_url(url: str, default_ref: str) -> tuple[str, str, str, str | None]:
60
+ parsed = urllib.parse.urlparse(url)
61
+ if parsed.netloc != "github.com":
62
+ raise InstallError("Only GitHub URLs are supported for download mode.")
63
+ parts = [p for p in parsed.path.split("/") if p]
64
+ if len(parts) < 2:
65
+ raise InstallError("Invalid GitHub URL.")
66
+ owner, repo = parts[0], parts[1]
67
+ ref = default_ref
68
+ subpath = ""
69
+ if len(parts) > 2:
70
+ if parts[2] in ("tree", "blob"):
71
+ if len(parts) < 4:
72
+ raise InstallError("GitHub URL missing ref or path.")
73
+ ref = parts[3]
74
+ subpath = "/".join(parts[4:])
75
+ else:
76
+ subpath = "/".join(parts[2:])
77
+ return owner, repo, ref, subpath or None
78
+
79
+
80
+ def _download_repo_zip(owner: str, repo: str, ref: str, dest_dir: str) -> str:
81
+ zip_url = f"https://codeload.github.com/{owner}/{repo}/zip/{ref}"
82
+ zip_path = os.path.join(dest_dir, "repo.zip")
83
+ try:
84
+ payload = _request(zip_url)
85
+ except urllib.error.HTTPError as exc:
86
+ raise InstallError(f"Download failed: HTTP {exc.code}") from exc
87
+ with open(zip_path, "wb") as file_handle:
88
+ file_handle.write(payload)
89
+ with zipfile.ZipFile(zip_path, "r") as zip_file:
90
+ _safe_extract_zip(zip_file, dest_dir)
91
+ top_levels = {name.split("/")[0] for name in zip_file.namelist() if name}
92
+ if not top_levels:
93
+ raise InstallError("Downloaded archive was empty.")
94
+ if len(top_levels) != 1:
95
+ raise InstallError("Unexpected archive layout.")
96
+ return os.path.join(dest_dir, next(iter(top_levels)))
97
+
98
+
99
+ def _run_git(args: list[str]) -> None:
100
+ result = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
101
+ if result.returncode != 0:
102
+ raise InstallError(result.stderr.strip() or "Git command failed.")
103
+
104
+
105
+ def _safe_extract_zip(zip_file: zipfile.ZipFile, dest_dir: str) -> None:
106
+ dest_root = os.path.realpath(dest_dir)
107
+ for info in zip_file.infolist():
108
+ extracted_path = os.path.realpath(os.path.join(dest_dir, info.filename))
109
+ if extracted_path == dest_root or extracted_path.startswith(dest_root + os.sep):
110
+ continue
111
+ raise InstallError("Archive contains files outside the destination.")
112
+ zip_file.extractall(dest_dir)
113
+
114
+
115
+ def _validate_relative_path(path: str) -> None:
116
+ if os.path.isabs(path) or os.path.normpath(path).startswith(".."):
117
+ raise InstallError("Skill path must be a relative path inside the repo.")
118
+
119
+
120
+ def _validate_skill_name(name: str) -> None:
121
+ altsep = os.path.altsep
122
+ if not name or os.path.sep in name or (altsep and altsep in name):
123
+ raise InstallError("Skill name must be a single path segment.")
124
+ if name in (".", ".."):
125
+ raise InstallError("Invalid skill name.")
126
+
127
+
128
+ def _git_sparse_checkout(repo_url: str, ref: str, paths: list[str], dest_dir: str) -> str:
129
+ repo_dir = os.path.join(dest_dir, "repo")
130
+ clone_cmd = [
131
+ "git",
132
+ "clone",
133
+ "--filter=blob:none",
134
+ "--depth",
135
+ "1",
136
+ "--sparse",
137
+ "--single-branch",
138
+ "--branch",
139
+ ref,
140
+ repo_url,
141
+ repo_dir,
142
+ ]
143
+ try:
144
+ _run_git(clone_cmd)
145
+ except InstallError:
146
+ _run_git(
147
+ [
148
+ "git",
149
+ "clone",
150
+ "--filter=blob:none",
151
+ "--depth",
152
+ "1",
153
+ "--sparse",
154
+ "--single-branch",
155
+ repo_url,
156
+ repo_dir,
157
+ ]
158
+ )
159
+ _run_git(["git", "-C", repo_dir, "sparse-checkout", "set", *paths])
160
+ _run_git(["git", "-C", repo_dir, "checkout", ref])
161
+ return repo_dir
162
+
163
+
164
+ def _validate_skill(path: str) -> None:
165
+ if not os.path.isdir(path):
166
+ raise InstallError(f"Skill path not found: {path}")
167
+ skill_md = os.path.join(path, "SKILL.md")
168
+ if not os.path.isfile(skill_md):
169
+ raise InstallError("SKILL.md not found in selected skill directory.")
170
+
171
+
172
+ def _copy_skill(src: str, dest_dir: str) -> None:
173
+ os.makedirs(os.path.dirname(dest_dir), exist_ok=True)
174
+ if os.path.exists(dest_dir):
175
+ raise InstallError(f"Destination already exists: {dest_dir}")
176
+ shutil.copytree(src, dest_dir)
177
+
178
+
179
+ def _build_repo_url(owner: str, repo: str) -> str:
180
+ return f"https://github.com/{owner}/{repo}.git"
181
+
182
+
183
+ def _build_repo_ssh(owner: str, repo: str) -> str:
184
+ return f"git@github.com:{owner}/{repo}.git"
185
+
186
+
187
+ def _prepare_repo(source: Source, method: str, tmp_dir: str) -> str:
188
+ if method in ("download", "auto"):
189
+ try:
190
+ return _download_repo_zip(source.owner, source.repo, source.ref, tmp_dir)
191
+ except InstallError as exc:
192
+ if method == "download":
193
+ raise
194
+ err_msg = str(exc)
195
+ if "HTTP 401" in err_msg or "HTTP 403" in err_msg or "HTTP 404" in err_msg:
196
+ pass
197
+ else:
198
+ raise
199
+ if method in ("git", "auto"):
200
+ repo_url = source.repo_url or _build_repo_url(source.owner, source.repo)
201
+ try:
202
+ return _git_sparse_checkout(repo_url, source.ref, source.paths, tmp_dir)
203
+ except InstallError:
204
+ repo_url = _build_repo_ssh(source.owner, source.repo)
205
+ return _git_sparse_checkout(repo_url, source.ref, source.paths, tmp_dir)
206
+ raise InstallError("Unsupported method.")
207
+
208
+
209
+ def _resolve_source(args: Args) -> Source:
210
+ if args.url:
211
+ owner, repo, ref, url_path = _parse_github_url(args.url, args.ref)
212
+ if args.path is not None:
213
+ paths = list(args.path)
214
+ elif url_path:
215
+ paths = [url_path]
216
+ else:
217
+ paths = []
218
+ if not paths:
219
+ raise InstallError("Missing --path for GitHub URL.")
220
+ return Source(owner=owner, repo=repo, ref=ref, paths=paths)
221
+
222
+ if not args.repo:
223
+ raise InstallError("Provide --repo or --url.")
224
+ if "://" in args.repo:
225
+ return _resolve_source(
226
+ Args(url=args.repo, repo=None, path=args.path, ref=args.ref)
227
+ )
228
+
229
+ repo_parts = [p for p in args.repo.split("/") if p]
230
+ if len(repo_parts) != 2:
231
+ raise InstallError("--repo must be in owner/repo format.")
232
+ if not args.path:
233
+ raise InstallError("Missing --path for --repo.")
234
+ paths = list(args.path)
235
+ return Source(
236
+ owner=repo_parts[0],
237
+ repo=repo_parts[1],
238
+ ref=args.ref,
239
+ paths=paths,
240
+ )
241
+
242
+
243
+ def _default_dest() -> str:
244
+ return os.path.join(_codex_home(), "skills")
245
+
246
+
247
+ def _parse_args(argv: list[str]) -> Args:
248
+ parser = argparse.ArgumentParser(description="Install a skill from GitHub.")
249
+ parser.add_argument("--repo", help="owner/repo")
250
+ parser.add_argument("--url", help="https://github.com/owner/repo[/tree/ref/path]")
251
+ parser.add_argument(
252
+ "--path",
253
+ nargs="+",
254
+ help="Path(s) to skill(s) inside repo",
255
+ )
256
+ parser.add_argument("--ref", default=DEFAULT_REF)
257
+ parser.add_argument("--dest", help="Destination skills directory")
258
+ parser.add_argument(
259
+ "--name", help="Destination skill name (defaults to basename of path)"
260
+ )
261
+ parser.add_argument(
262
+ "--method",
263
+ choices=["auto", "download", "git"],
264
+ default="auto",
265
+ )
266
+ return parser.parse_args(argv, namespace=Args())
267
+
268
+
269
+ def main(argv: list[str]) -> int:
270
+ args = _parse_args(argv)
271
+ try:
272
+ source = _resolve_source(args)
273
+ source.ref = source.ref or args.ref
274
+ if not source.paths:
275
+ raise InstallError("No skill paths provided.")
276
+ for path in source.paths:
277
+ _validate_relative_path(path)
278
+ dest_root = args.dest or _default_dest()
279
+ tmp_dir = tempfile.mkdtemp(prefix="skill-install-", dir=_tmp_root())
280
+ try:
281
+ repo_root = _prepare_repo(source, args.method, tmp_dir)
282
+ installed = []
283
+ for path in source.paths:
284
+ skill_name = args.name if len(source.paths) == 1 else None
285
+ skill_name = skill_name or os.path.basename(path.rstrip("/"))
286
+ _validate_skill_name(skill_name)
287
+ if not skill_name:
288
+ raise InstallError("Unable to derive skill name.")
289
+ dest_dir = os.path.join(dest_root, skill_name)
290
+ if os.path.exists(dest_dir):
291
+ raise InstallError(f"Destination already exists: {dest_dir}")
292
+ skill_src = os.path.join(repo_root, path)
293
+ _validate_skill(skill_src)
294
+ _copy_skill(skill_src, dest_dir)
295
+ installed.append((skill_name, dest_dir))
296
+ finally:
297
+ if os.path.isdir(tmp_dir):
298
+ shutil.rmtree(tmp_dir, ignore_errors=True)
299
+ for skill_name, dest_dir in installed:
300
+ print(f"Installed {skill_name} to {dest_dir}")
301
+ return 0
302
+ except InstallError as exc:
303
+ print(f"Error: {exc}", file=sys.stderr)
304
+ return 1
305
+
306
+
307
+ if __name__ == "__main__":
308
+ raise SystemExit(main(sys.argv[1:]))
skills/.system/skill-installer/scripts/list-skills.py ADDED
@@ -0,0 +1,107 @@
1
+ #!/usr/bin/env python3
2
+ """List skills from a GitHub repo path."""
3
+
4
+ from __future__ import annotations
5
+
6
+ import argparse
7
+ import json
8
+ import os
9
+ import sys
10
+ import urllib.error
11
+
12
+ from github_utils import github_api_contents_url, github_request
13
+
14
+ DEFAULT_REPO = "openai/skills"
15
+ DEFAULT_PATH = "skills/.curated"
16
+ DEFAULT_REF = "main"
17
+
18
+
19
+ class ListError(Exception):
20
+ pass
21
+
22
+
23
+ class Args(argparse.Namespace):
24
+ repo: str
25
+ path: str
26
+ ref: str
27
+ format: str
28
+
29
+
30
+ def _request(url: str) -> bytes:
31
+ return github_request(url, "codex-skill-list")
32
+
33
+
34
+ def _codex_home() -> str:
35
+ return os.environ.get("CODEX_HOME", os.path.expanduser("~/.codex"))
36
+
37
+
38
+ def _installed_skills() -> set[str]:
39
+ root = os.path.join(_codex_home(), "skills")
40
+ if not os.path.isdir(root):
41
+ return set()
42
+ entries = set()
43
+ for name in os.listdir(root):
44
+ path = os.path.join(root, name)
45
+ if os.path.isdir(path):
46
+ entries.add(name)
47
+ return entries
48
+
49
+
50
+ def _list_skills(repo: str, path: str, ref: str) -> list[str]:
51
+ api_url = github_api_contents_url(repo, path, ref)
52
+ try:
53
+ payload = _request(api_url)
54
+ except urllib.error.HTTPError as exc:
55
+ if exc.code == 404:
56
+ raise ListError(
57
+ "Skills path not found: "
58
+ f"https://github.com/{repo}/tree/{ref}/{path}"
59
+ ) from exc
60
+ raise ListError(f"Failed to fetch skills: HTTP {exc.code}") from exc
61
+ data = json.loads(payload.decode("utf-8"))
62
+ if not isinstance(data, list):
63
+ raise ListError("Unexpected skills listing response.")
64
+ skills = [item["name"] for item in data if item.get("type") == "dir"]
65
+ return sorted(skills)
66
+
67
+
68
+ def _parse_args(argv: list[str]) -> Args:
69
+ parser = argparse.ArgumentParser(description="List skills.")
70
+ parser.add_argument("--repo", default=DEFAULT_REPO)
71
+ parser.add_argument(
72
+ "--path",
73
+ default=DEFAULT_PATH,
74
+ help="Repo path to list (default: skills/.curated)",
75
+ )
76
+ parser.add_argument("--ref", default=DEFAULT_REF)
77
+ parser.add_argument(
78
+ "--format",
79
+ choices=["text", "json"],
80
+ default="text",
81
+ help="Output format",
82
+ )
83
+ return parser.parse_args(argv, namespace=Args())
84
+
85
+
86
+ def main(argv: list[str]) -> int:
87
+ args = _parse_args(argv)
88
+ try:
89
+ skills = _list_skills(args.repo, args.path, args.ref)
90
+ installed = _installed_skills()
91
+ if args.format == "json":
92
+ payload = [
93
+ {"name": name, "installed": name in installed} for name in skills
94
+ ]
95
+ print(json.dumps(payload))
96
+ else:
97
+ for idx, name in enumerate(skills, start=1):
98
+ suffix = " (already installed)" if name in installed else ""
99
+ print(f"{idx}. {name}{suffix}")
100
+ return 0
101
+ except ListError as exc:
102
+ print(f"Error: {exc}", file=sys.stderr)
103
+ return 1
104
+
105
+
106
+ if __name__ == "__main__":
107
+ raise SystemExit(main(sys.argv[1:]))
skills/agent-kernel/SKILL.md ADDED
@@ -0,0 +1,379 @@
1
+ ---
2
+ name: agentkernel
3
+ description: >
4
+ Spawn and orchestrate agents as local subprocesses or Kubernetes pods.
5
+ Each agent runs with an independent runtime, conversation, tools,
6
+ and skills. Use when a task benefits from parallel work, role
7
+ specialization, persistent agent state, or sandboxed execution.
8
+ metadata:
9
+ version: "2.1"
10
+ pre-condition: "0"
11
+ ---
12
+
13
+ # AgentKernel
14
+
15
+ Spawn and orchestrate agents from `<helpers>` blocks. Each agent runs in its own process (local subprocess or Kubernetes pod) with an independent runtime, conversation state, tools, and skills. You decide what agents to create and what to say to them. The kernel handles process lifecycle, networking, image management, and health checks.
16
+
17
+ ## AgentKernel vs `agents` Skill
18
+
19
+ | | `agents` skill | `agentkernel` |
20
+ |---|---|---|
21
+ | **Backends** | Local subprocesses only | Local subprocesses or k8s pods |
22
+ | **Addressing** | By name (`call_async("my-agent", ...)`) | By UUID + secret nonce |
23
+ | **Protocol** | Anthropic Messages API | Custom SSE (TurnRequest/TurnResponse) |
24
+ | **Access control** | Open — any caller can talk to any agent | Nonce-secured single-owner |
25
+ | **Teams / capacity** | No | Yes |
26
+ | **Image packaging** | No | Yes (OCI images for k8s) |
27
+ | **AgentBus** | No | Yes |
28
+ | **Dependencies** | API server skill | None |
29
+
30
+ **Use `agents`** for lightweight local agent workflows where convenience matters — create by name, call by name, check event logs.
31
+
32
+ **Use `agentkernel`** when you need k8s deployment, container isolation, capacity management, nonce-secured access, or agentbus observability.
33
+
34
+ ## Core Concepts
35
+
36
+ **Kernel**: `AgentKernel` is the entry point. It wires together the spawner, agent client, and storage. The backend determines where agents run.
37
+
38
+ **Backends**: Two backends are available:
39
+ - **local** — agents run as subprocesses on the same machine. No isolation, no config file needed. Good for development and quick experiments.
40
+ - **kubernetes** — agents run as pods in a k8s cluster. Full container isolation. Requires a config file with cluster details.
41
+
42
+ The entire API after initialization is identical across backends.
43
+
44
+ **SpawnRequest + SpawnInfo**: A `SpawnRequest` defines the agent identity (name, team, metadata). The `spawn_info` field carries agent-type-specific config (system prompt, model, tools, etc.) — e.g. `OpenClawSpawnInfo`.
45
+
46
+ **Nonce**: Each spawn returns a `SpawnResult` containing the agent record and a secret nonce. The nonce is required for all communication — it enforces single-owner access. Only the entity that spawned an agent can talk to it.
47
+
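The single-owner check is easy to picture with a standalone sketch. This is hypothetical code, not the kernel's implementation; it only shows the pattern of comparing a presented nonce against the stored secret in constant time:

```python
import hmac
import secrets

# Kernel side: a fresh secret is minted for each spawned agent.
spawn_nonce = secrets.token_hex(16)

def is_owner(presented: str, expected: str) -> bool:
    # Constant-time comparison avoids leaking the secret via timing.
    return hmac.compare_digest(presented, expected)
```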
48
+ **AgentBus**: Optional observability/safety layer. When enabled, all LLM inference and code execution events are logged to an agent bus that can be inspected externally.
49
+
50
+ **Teams**: Logical groups with capacity limits. Spawning into a full team raises an error.
51
+
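The capacity rule can be sketched as a small bookkeeping function (illustrative only; the kernel enforces this internally when you spawn):

```python
class TeamFullError(Exception):
    pass

def reserve_slot(team_counts: dict, team_id: str, capacity: int) -> None:
    # Reject the spawn if the team is already at capacity,
    # otherwise record the new agent against the team.
    if team_counts.get(team_id, 0) >= capacity:
        raise TeamFullError(f"team {team_id!r} is at capacity ({capacity})")
    team_counts[team_id] = team_counts.get(team_id, 0) + 1
```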
52
+ ## Initialization
53
+
54
+ ### Local backend
55
+
56
+ No config file needed. Agents run as subprocesses with the same permissions as the parent process.
57
+
58
+ <helpers>
59
+ from agentic.kernel import AgentKernel
60
+ from agentic.kernel.plugins.openclaw import OpenClawPlugin
61
+
62
+ kernel = AgentKernel(backend="local", plugins=[OpenClawPlugin()])
63
+ </helpers>
64
+
65
+ ### Kubernetes backend
66
+
67
+ Agents run as pods in the `agentkernel-0` namespace. The config file is at `agentkernel/examples/agentkernel.yaml`:
68
+
69
+ ```yaml
70
+ backend: kubernetes
71
+ namespace: agentkernel-0
72
+ base_image: your-registry.example.com/agentkernel:latest
73
+ kubeconfig: ~/.kube/config
74
+ registry_url: your-registry.example.com
75
+ debug: true
76
+ ```
77
+
78
+ - `debug: true` preserves pods on failure for inspection (otherwise they are cleaned up automatically).
79
+
80
+ <helpers>
81
+ from agentic.kernel import AgentKernel
82
+ from agentic.kernel.plugins.openclaw import OpenClawPlugin
83
+
84
+ kernel = AgentKernel.from_config("agentkernel/examples/agentkernel.yaml", plugins=[OpenClawPlugin()])
85
+ </helpers>
86
+
87
+ ## API
88
+
89
+ All API calls below work identically regardless of backend.
90
+
91
+ ### Spawn an Agent
92
+
93
+ <helpers>
94
+ import os
95
+ from agentic.kernel import SpawnRequest
96
+ from agentic.kernel.plugins.openclaw import OpenClawSpawnInfo
97
+
98
+ result = await kernel.spawner.spawn(SpawnRequest(
99
+ name="researcher",
100
+ agent_type="openclaw",
101
+ metadata={"role": "research"},
102
+ spawn_info=OpenClawSpawnInfo(
103
+ system_prompt="You are a research specialist. Be thorough and cite sources.",
104
+ model="claude-sonnet-4-5",
105
+ api_key=os.environ.get("LLM_API_KEY", ""),
106
+ ),
107
+ ))
108
+
109
+ agent = result.agent # Agent(id, name, team_id, state, metadata, ...)
110
+ nonce = result.nonce # Secret — required for all communication
111
+ print(f"Spawned: {agent.id} ({agent.name})")
112
+ </helpers>
113
+
114
+ `SpawnRequest` fields:
115
+ - `name` — agent name (also used in k8s pod naming)
116
+ - `team_id` — team for capacity tracking (optional, default: "")
117
+ - `metadata` — arbitrary labels for discovery (e.g. `{"role": "worker"}`)
118
+ - `image_id` — custom image from packaging (optional, defaults to base_image in k8s)
119
+ - `spawn_info` — agent-type-specific config (e.g. `OpenClawSpawnInfo`)
120
+ - `env` — extra environment variables forwarded to the agent process
121
+
122
+ `OpenClawSpawnInfo` fields:
123
+ - `system_prompt` — system prompt for the agent
124
+ - `model` — LLM model name (default: `"claude-sonnet-4-5"`)
125
+ - `provider` — LLM provider (default: `"anthropic"`)
126
+ - `tools` — list of tool names to enable (default: `["bash"]`)
127
+ - `thinking_level` — thinking level: `"none"`, `"low"`, `"medium"`, `"high"`
128
+ - `api_key` — LLM API key (also forwarded from host `LLM_API_KEY` env var)
129
+ - `base_url` — override LLM API base URL
130
+
131
+ ### Send a Message (Turn)
132
+
133
+ Use the `ask()` helper to send a message and get the full response:
134
+
135
+ <helpers>
136
+ response = await ask(kernel, agent.id, nonce, "What are the latest findings on topic X?")
137
+ print(response)
138
+ </helpers>
139
+
140
+ The agent maintains conversation state — subsequent turns see the full history.
141
+
142
+ For manual streaming (e.g. to display progress), use `kernel.agent_client.turn()` directly — note `end=""` to avoid extra newlines between tokens:
143
+
144
+ <helpers>
145
+ import json
146
+ from agentic.kernel import TurnRequest
147
+
148
+ request = TurnRequest(
149
+ agent_id=agent.id,
150
+ nonce=nonce,
151
+ body=json.dumps({
152
+ "messages": [{"role": "user", "content": "What are the latest findings on topic X?"}]
153
+ }).encode(),
154
+ )
155
+
156
+ response_text = []
157
+ async for chunk in kernel.agent_client.turn(request):
158
+ if chunk.body:
159
+ print(chunk.body, end="", flush=True)
160
+ response_text.append(chunk.body)
161
+ if chunk.error:
162
+ print(f"\nError: {chunk.error}")
163
+ full_response = "".join(response_text)
164
+ </helpers>
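
If `ask()` is not already defined in your environment, a minimal sketch of it can be built on the streaming `turn()` call above. `TurnRequest` is stubbed locally here so the sketch stands alone; in real use import it from `agentic.kernel` instead. This is an illustration of the pattern, not the skill's actual helper.

```python
import json
from dataclasses import dataclass

# Local stand-in for agentic.kernel.TurnRequest, only so this sketch runs
# without the kernel installed.
@dataclass
class TurnRequest:
    agent_id: str
    nonce: str
    body: bytes

async def ask(kernel, agent_id, nonce, text):
    """Send one user message and collect the streamed chunks into a string."""
    request = TurnRequest(
        agent_id=agent_id,
        nonce=nonce,
        body=json.dumps({"messages": [{"role": "user", "content": text}]}).encode(),
    )
    chunks = []
    async for chunk in kernel.agent_client.turn(request):
        if chunk.error:
            raise RuntimeError(chunk.error)
        if chunk.body:
            chunks.append(chunk.body)
    return "".join(chunks)
```

The helper raises on the first error chunk; the manual loop above instead prints errors and keeps the partial text.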

### Get History

<helpers>
history = await kernel.agent_client.get_history(agent.id, last_n=5)
for entry in history:
    print(f"[{entry['role']}] {entry['content'][:100]}")
</helpers>

### Get Agent Info

<helpers>
info = await kernel.agent_client.get_info(agent.id)
print(f"pid={info['pid']} cwd={info['cwd']} uid={info['uid']}")
</helpers>

### Check Status

<helpers>
statuses = await kernel.status()
for s in statuses:
    line = f"{s['name']}: state={s['state']} live={s['live']}"
    if s.get('pod_phase'):  # k8s backend
        line += f" pod={s['pod_phase']}"
    if s.get('process_alive') is not None:  # local backend
        line += f" process_alive={s['process_alive']}"
    print(line)
</helpers>

### Kill an Agent

<helpers>
await kernel.spawner.kill(agent.id)
</helpers>

### Clean Up All Agents

<helpers>
await kernel.cleanup()
</helpers>

## Teams

Teams reserve capacity and group agents together.

<helpers>
from agentic.kernel import CreateTeamRequest

# Reserve capacity
await kernel.spawner.create_team(CreateTeamRequest(
    team_id="analysis-team",
    resources={"cpu": 4},
))

# Spawn into the team
result = await kernel.spawner.spawn(SpawnRequest(
    name="analyst",
    team_id="analysis-team",
    agent_type="openclaw",
    spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a data analyst.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
))

# Delete team (kills all agents first)
await kernel.spawner.delete_team("analysis-team")
</helpers>

## AgentBus

AgentBus adds observability and safety to agent execution. When enabled, the agent logs all LLM inference and code execution events to a bus that can be inspected via the agentbus CLI.

<helpers>
from agentic.kernel import AgentBusConfig

result = await kernel.spawner.spawn(SpawnRequest(
    name="worker",
    agent_type="openclaw",
    spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a helpful worker.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
    agentbus=AgentBusConfig(
        port=8095,
        disable_safety=False,
    ),
))
</helpers>

To inspect the bus, you can use the agentbus skill.

```bash
# Kubernetes backend — port-forward first
kubectl --kubeconfig ~/.kube/config \
    -n agentkernel-0 port-forward pod/agent-<id-prefix> 8095:8095
# Then inspect the bus as usual
```

The bus ID follows the pattern `{agent_name}.{agent_uuid}`.

## Patterns

### Fan-out / Fan-in

Spawn specialists, query them in parallel, synthesize results.

<helpers>
import asyncio

# Spawn specialists
researcher_r = await kernel.spawner.spawn(SpawnRequest(
    name="researcher", agent_type="openclaw", spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a research specialist.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
))
analyst_r = await kernel.spawner.spawn(SpawnRequest(
    name="analyst", agent_type="openclaw", spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a data analyst.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
))

# Fan out — ask() collects streaming chunks into a single string
research_task = asyncio.create_task(
    ask(kernel, researcher_r.agent.id, researcher_r.nonce, "Find papers on quantum error correction")
)
analysis_task = asyncio.create_task(
    ask(kernel, analyst_r.agent.id, analyst_r.nonce, "Run cost-benefit analysis on approach X")
)
research, analysis = await asyncio.gather(research_task, analysis_task)

print(f"Research: {research[:200]}")
print(f"Analysis: {analysis[:200]}")
</helpers>

### Pipeline

One agent's output feeds the next.

<helpers>
raw_data = await ask(kernel, researcher_r.agent.id, researcher_r.nonce, "Gather data on topic X")
analysis = await ask(kernel, analyst_r.agent.id, analyst_r.nonce, f"Analyze this data:\n{raw_data}")
print(analysis)
</helpers>

### Image Packaging

Bundle custom code into agent images. On local backend, bundles are copied to a directory. On k8s, an OCI image is built and pushed to the registry.

<helpers>
from agentic.kernel import SourceBundle

# Upload code to blob storage
helpers_uri = kernel.blob_store.upload_dir("./my_helpers/")

# Build an agent image with the bundle
job = await kernel.packaging.create_agent_image(
    name="custom-worker",
    bundles=[SourceBundle(uri=helpers_uri, labels={"name": "my_helpers"})],
)
if job.image:
    # Spawn an agent using the custom image
    result = await kernel.spawner.spawn(SpawnRequest(
        name="custom-agent",
        agent_type="openclaw",
        image_id=job.image.id,
        spawn_info=OpenClawSpawnInfo(
            system_prompt="You have custom tools available.",
            api_key=os.environ.get("LLM_API_KEY", ""),
        ),
    ))
</helpers>

## Lifecycle

- Agents persist (as subprocesses or pods) until explicitly killed. Always clean up when done.
- Each agent has one conversation and one owner. The nonce enforces this — only the spawner can communicate with its agent.
- Teams have capacity limits. Spawning into a full team raises `ValueError`.
- The `LLM_API_KEY` and `OPENAI_API_KEY` environment variables are automatically forwarded to agent processes.
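
The "always clean up" rule can be enforced with a small wrapper so agents are released even when a turn raises. `with_cleanup` is a hypothetical name; the only kernel method it assumes is the `kernel.cleanup()` shown above.

```python
# Illustrative wrapper: run a coroutine against the kernel and guarantee
# kernel.cleanup() executes afterwards, even if the work fails.
async def with_cleanup(kernel, work):
    try:
        return await work(kernel)
    finally:
        await kernel.cleanup()
```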

## Operations

**Note**: If behind a proxy, configure `HTTP_PROXY`/`HTTPS_PROXY` environment variables.

### Run examples locally

```bash
# Single agent, local backend (no config file needed)
LLM_API_KEY=... uv run python -m agentkernel.examples.simple_agent

# Team scenario, local backend
LLM_API_KEY=... uv run python -m agentkernel.examples.team_scenario
```

### Run examples on Kubernetes

```bash
# run_k8s_scenario.sh runs the scenario against the configured k8s cluster
LLM_API_KEY=... ./agentkernel/scripts/run_k8s_scenario.sh simple_agent
LLM_API_KEY=... ./agentkernel/scripts/run_k8s_scenario.sh team_scenario
```

### Build and push the base image (k8s only)

```bash
./scripts/build_base_image.sh --force-base
```

### Clean up cluster resources (k8s only)

```bash
./agentkernel/scripts/cleanup_k8s.sh            # delete all agentkernel pods/svc/cm
./agentkernel/scripts/cleanup_k8s.sh --dry-run  # preview what would be deleted
```
skills/hugging-face-evaluation/SKILL.md ADDED
@@ -0,0 +1,262 @@
---
name: hugging-face-evaluation
description: Add evaluation results to Hugging Face model repositories using the .eval_results/ format. Uses HF CLI for PR management and manual YAML creation.
---

# Overview

This skill adds structured evaluation results to Hugging Face model repositories using the [`.eval_results/` format](https://huggingface.co/docs/hub/eval-results).

**What This Enables:**
- Results appear on model pages with benchmark links
- Scores are aggregated into benchmark dataset leaderboards
- Community contributions via Pull Requests

# Important

Evaluation PRs can only be opened on the Hugging Face Hub. They cannot be opened on the GitHub repository.

# Version
3.0.0

# Workflow Overview

The actual workflow uses:
1. **HF CLI** (`hf upload`, `hf download`) for PR operations
2. **Manual YAML creation** in `/tmp/pr-reviews/`
3. **`check_prs.py`** script to check for existing PRs
4. **curl** to fetch model cards and leaderboard data

See `references/hf_cli_for_prs.md` for detailed CLI instructions.

---

# CRITICAL: Multiple Scores for One Benchmark

Models can have multiple scores for the same benchmark (with/without tools). **Each variant MUST be in a separate file.**

## File Naming Convention

| Condition | File Name | Notes Field |
|-----------|-----------|-------------|
| Default (no tools) | `hle.yaml` | None (omit notes) |
| With tools | `hle_with_tools.yaml` | `notes: "With tools"` |

## Notes Field Rules

1. **No tools = No notes field** - Default assumption is "without tools"
2. **With tools = Add notes** - Only add when tools ARE used
3. **Standardized format** - Always use `notes: "With tools"` (capital W)

**CORRECT:**
```yaml
# hle.yaml (no tools - DEFAULT)
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
```

```yaml
# hle_with_tools.yaml (with tools)
- dataset:
    id: cais/hle
    task_id: hle
  value: 44.9
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
  notes: "With tools"
```

**INCORRECT:**
```yaml
notes: "Without tools"   # Don't add notes for default
notes: "w/ tools"        # Use standardized format
notes: "with tools"      # Capital W required
```
+
84
+ ---
85
+
86
+ # Core Workflow
87
+
88
+ ## Step 1: Check for Existing PRs
89
+
90
+ **ALWAYS check before creating new PRs:**
91
+
92
+ ```bash
93
+ uv run scripts/check_prs.py --repo-id "org/model-name"
94
+ ```
95
+
96
+ If PRs exist, update them instead of creating new ones.
97
+
98
+ ## Step 2: Fetch Model Card and Extract Scores
99
+
100
+ ```bash
101
+ # Get model README
102
+ curl -s "https://huggingface.co/org/model-name/raw/main/README.md" | grep -i -A10 "HLE\|GPQA\|MMLU"
103
+ ```
104
+
105
+ Or use MCP tools:
106
+ ```
107
+ mcp__hf-mcp-server__hub_repo_details
108
+ repo_ids: ["org/model-name"]
109
+ include_readme: true
110
+ ```
111
+
112
+ ## Step 3: Create YAML File
113
+
114
+ ```bash
115
+ mkdir -p /tmp/pr-reviews/new-prs
116
+ cd /tmp/pr-reviews/new-prs
117
+
118
+ cat > hle.yaml << 'EOF'
119
+ - dataset:
120
+ id: cais/hle
121
+ task_id: hle
122
+ value: 22.1
123
+ date: '2026-02-03'
124
+ source:
125
+ url: https://huggingface.co/org/model-name
126
+ name: Model Card
127
+ user: burtenshaw
128
+ EOF
129
+ ```
130
+
131
+ ## Step 4: Create PR
132
+
133
+ ```bash
134
+ hf upload org/model-name hle.yaml .eval_results/hle.yaml \
135
+ --repo-type model --create-pr \
136
+ --commit-message "Add HLE evaluation result"
137
+ ```
138
+
139
+ ## Step 5: Get PR Number
140
+
141
+ ```bash
142
+ uv run scripts/check_prs.py --repo-id "org/model-name"
143
+ ```
144
+
145
+ ---
146
+
147
+ # Updating Existing PRs
148
+
149
+ ```bash
150
+ # Download PR contents
151
+ hf download org/model-name --repo-type model \
152
+ --revision refs/pr/<PR_NUMBER> \
153
+ --include ".eval_results/*" \
154
+ --local-dir /tmp/pr-reviews/model-pr<PR_NUMBER>
155
+
156
+ # Edit the YAML file, then upload
157
+ hf upload org/model-name /tmp/pr-reviews/updated.yaml .eval_results/hle.yaml \
158
+ --repo-type model \
159
+ --revision refs/pr/<PR_NUMBER> \
160
+ --commit-message "Update evaluation result"
161
+ ```
162
+
163
+ ---
164
+
165
+ # Deleting Files from PRs
166
+
167
+ Use Python API:
168
+ ```bash
169
+ uv run --with huggingface_hub python3 << 'EOF'
170
+ from huggingface_hub import HfApi
171
+ api = HfApi()
172
+ api.delete_file(
173
+ path_in_repo=".eval_results/old_file.yaml",
174
+ repo_id="org/model-name",
175
+ repo_type="model",
176
+ revision="refs/pr/<PR_NUMBER>",
177
+ commit_message="Remove file"
178
+ )
179
+ EOF
180
+ ```
181
+
182
+ ---
183
+
184
+ # Fetching Leaderboard Data
185
+
186
+ ```bash
187
+ # HLE leaderboard (requires auth for private datasets)
188
+ curl -s "https://huggingface.co/api/datasets/cais/hle/leaderboard" \
189
+ -H "Authorization: Bearer $HF_TOKEN"
190
+
191
+ # MMLU-Pro leaderboard (public)
192
+ curl -s "https://huggingface.co/api/datasets/TIGER-Lab/MMLU-Pro/leaderboard"
193
+
194
+ # Model eval results
195
+ curl -s "https://huggingface.co/api/models/org/model?expand[]=evalResults"
196
+ ```
197
+
198
+ ---
199
+
200
+ # .eval_results/ Format
201
+
202
+ ```yaml
203
+ # .eval_results/hle.yaml
204
+ - dataset:
205
+ id: cais/hle # Required: Hub Benchmark dataset ID
206
+ task_id: hle # Required: task id from dataset's eval.yaml
207
+ value: 22.2 # Required: metric value
208
+ date: "2026-01-14" # Optional: ISO-8601 date
209
+ source: # Optional: attribution
210
+ url: https://huggingface.co/org/model
211
+ name: Model Card
212
+ user: username
213
+ ```
214
+
215
+ ---
216
+
217
+ # Supported Benchmarks
218
+
219
+ | Benchmark | Hub Dataset ID | Task ID |
220
+ |-----------|---------------|---------|
221
+ | HLE | cais/hle | hle |
222
+ | GPQA | Idavidrein/gpqa | diamond |
223
+ | MMLU-Pro | TIGER-Lab/MMLU-Pro | mmlu_pro |
224
+
225
+ ---
226
+
227
+ # Tool-Using Agent Models
228
+
229
+ Models like MiroThinker, Nemotron-Orchestrator are inherently tool-using agents. For these:
230
+
231
+ 1. Use `hle_with_tools.yaml` as filename
232
+ 2. Add `notes: "With tools"`
233
+ 3. Look for terms: "search agent", "agentic", "orchestrator", "code-interpreter"
234
+
235
+ ---
236
+
237
+ # Environment Setup
238
+
239
+ ```bash
240
+ export HF_TOKEN="your-huggingface-token"
241
+ ```
242
+
243
+ ---
244
+
245
+ # Scripts Reference
246
+
247
+ ```bash
248
+ # Check for existing PRs (ALWAYS do this first)
249
+ uv run scripts/check_prs.py --repo-id "org/model-name"
250
+ ```
251
+
252
+ See `references/hf_cli_for_prs.md` for complete HF CLI workflow documentation.
253
+
254
+ ---
255
+
256
+ # Best Practices
257
+
258
+ 1. **Always check for existing PRs** before creating new ones
259
+ 2. **Separate files for variants** - `hle.yaml` for default, `hle_with_tools.yaml` for tools
260
+ 3. **Notes only for non-default** - Omit notes for standard evaluations
261
+ 4. **Standardized format** - Use `"With tools"` exactly (capital W)
262
+ 5. **Verify scores** - Compare YAML against model card before submitting
skills/hugging-face-evaluation/examples/.env.example ADDED
@@ -0,0 +1,7 @@
# Hugging Face Token (required for all operations)
# Get your token at: https://huggingface.co/settings/tokens
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Artificial Analysis API Key (required for import-aa command)
# Get your key at: https://artificialanalysis.ai/
AA_API_KEY=aa_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md ADDED
@@ -0,0 +1,380 @@
# Usage Examples

Practical examples for adding evaluations to Hugging Face model repositories using the `.eval_results/` format.

## Table of Contents
1. [Setup](#setup)
2. [Add Single Benchmark (Recommended)](#add-single-benchmark-recommended)
3. [Batch Process Trending Models](#batch-process-trending-models)
4. [Extract from README Tables](#extract-from-readme-tables)
5. [Import from Artificial Analysis](#import-from-artificial-analysis)
6. [Common Workflows](#common-workflows)

---

## Setup

### Environment Variables

```bash
# Required for creating PRs
export HF_TOKEN="hf_your_write_token_here"

# Optional: for Artificial Analysis source
export AA_API_KEY="your_aa_api_key_here"
```

Or use a `.env` file:
```bash
cp examples/.env.example .env
# Edit .env with your tokens
```

### Verify Installation

```bash
uv run scripts/evaluation_manager.py --help
```

---

## Add Single Benchmark (Recommended)

The simplest way to add a specific benchmark score to a model.

### Basic Usage

```bash
# Preview (default - prints YAML without uploading)
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "moonshotai/Kimi-K2-Thinking"
```

Output:
```
Looking up HLE score for moonshotai/Kimi-K2-Thinking from model_card...
Found: HLE = 23.9
Generated YAML:
- dataset:
    id: cais/hle
    task_id: default
  value: 23.9
  date: "2026-01-14"
  source:
    url: https://huggingface.co/moonshotai/Kimi-K2-Thinking
    name: Model Card
```

### From Artificial Analysis

```bash
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "MiniMaxAI/MiniMax-M2.1" \
  --source aa
```

### Create PR

```bash
# Always check for existing PRs first!
uv run scripts/evaluation_manager.py get-prs --repo-id "model/name"

# If no PRs exist, create one
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "model/name" \
  --create-pr
```

### Push Directly (Your Own Model)

```bash
uv run scripts/evaluation_manager.py add-eval \
  --benchmark GPQA \
  --repo-id "your-username/your-model" \
  --apply
```

### Provide Score Manually

```bash
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "model/name" \
  --value 84.5 \
  --create-pr
```

---

## Batch Process Trending Models

Process multiple trending models at once.

### Preview Mode (Dry Run)

```bash
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --dry-run
```

Output:
```
==================================================
Batch Evaluation PR Creator
==================================================
Benchmark: HLE
Source: model_card
Pipeline tag: text-generation
Limit: 10
Sort: trending
Dry run: True
==================================================

Processing: LiquidAI/LFM2.5-1.2B-Instruct
  Not found: HLE score not available
Processing: MiniMaxAI/MiniMax-M2.1
  Found: HLE = 22.2
  Status: Would create PR (dry run)
...

Summary:
  Success: 3
  Not found: 7
```

### Create PRs

```bash
# From model cards
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE

# From Artificial Analysis
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --source aa
```

### Sort Options

```bash
# By downloads (more established models)
uv run scripts/batch_eval_prs.py --limit 20 --sort downloads --benchmark GPQA

# By likes
uv run scripts/batch_eval_prs.py --limit 10 --sort likes --benchmark MMLU-Pro
```

### Filter by Pipeline Tag

```bash
# Only text-generation models (default)
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --pipeline-tag text-generation

# Image generation models
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --pipeline-tag text-to-image
```

### Results Tracking

Results are saved to `runs/{benchmark}_{date}_{hash}.json`:

```bash
cat runs/hle_20260114_abc123.json
```

```json
{
  "benchmark": "HLE",
  "source": "aa",
  "source_url": "https://artificialanalysis.ai",
  "created": "2026-01-14T08:00:00Z",
  "results": [
    {
      "repo_id": "MiniMaxAI/MiniMax-M2.1",
      "value": 22.2,
      "status": "pr_created",
      "source_url": "https://artificialanalysis.ai"
    }
  ]
}
```
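
A tracking file of this shape is easy to summarize without `jq`. This helper is a sketch assuming only the field names visible in the example above (`results`, `status`); it is not part of the skill's scripts.

```python
import json

def summarize_run(path):
    """Count batch results by status in a runs/*.json file, e.g. {'pr_created': 1}."""
    with open(path) as f:
        run = json.load(f)
    counts = {}
    for result in run.get("results", []):
        status = result.get("status", "unknown")
        counts[status] = counts.get(status, 0) + 1
    return counts
```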

---

## Extract from README Tables

For models with evaluation tables in their README.

### Step 1: Inspect Tables

```bash
uv run scripts/evaluation_manager.py inspect-tables \
  --repo-id "deepseek-ai/DeepSeek-V3"
```

This shows all tables with their structure, helping you identify which table to extract.

### Step 2: Preview Extraction

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "deepseek-ai/DeepSeek-V3" \
  --table 1
```

### Step 3: Create PR

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "deepseek-ai/DeepSeek-V3" \
  --table 1 \
  --create-pr
```

---

## Import from Artificial Analysis

Import all available benchmarks from the Artificial Analysis API.

### Preview

```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "your-username/claude-mirror"
```

### Create PR

```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "your-username/claude-mirror" \
  --apply --create-pr
```

### Finding Creator Slug and Model Name

Visit [Artificial Analysis](https://artificialanalysis.ai/) and check the URL:
- URL: `https://artificialanalysis.ai/models/{creator-slug}/{model-name}`

Common examples:
- Anthropic: `--creator-slug "anthropic" --model-name "claude-sonnet-4"`
- OpenAI: `--creator-slug "openai" --model-name "gpt-4-turbo"`
- Meta: `--creator-slug "meta" --model-name "llama-3-70b"`
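
Given the URL pattern above, the two flags can be extracted programmatically. This small parser is illustrative (the function name is made up) and assumes only the `/models/{creator-slug}/{model-name}` path shape stated here.

```python
from urllib.parse import urlparse

def parse_aa_model_url(url):
    """Split an Artificial Analysis model URL into (creator_slug, model_name)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) != 3 or parts[0] != "models":
        raise ValueError(f"unexpected URL shape: {url}")
    return parts[1], parts[2]
```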

---

## Common Workflows

### Workflow 1: Add Missing Benchmark to Popular Model

```bash
# 1. Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "meta-llama/Llama-3.1-8B-Instruct"

# 2. Preview what we'd add
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "meta-llama/Llama-3.1-8B-Instruct"

# 3. Create PR if score found
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "meta-llama/Llama-3.1-8B-Instruct" \
  --create-pr
```

### Workflow 2: Batch Update Trending Models

```bash
# 1. Dry run to see which models have HLE scores
uv run scripts/batch_eval_prs.py --limit 20 --benchmark HLE --source aa --dry-run

# 2. Create PRs for models with scores
uv run scripts/batch_eval_prs.py --limit 20 --benchmark HLE --source aa

# 3. Check results
cat runs/hle_*.json | jq '.results[] | select(.status == "pr_created")'
```

### Workflow 3: Update Your Own Model

```bash
# 1. Add HLE score from your model card
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "your-username/your-model" \
  --apply

# 2. Add GPQA score manually
uv run scripts/evaluation_manager.py add-eval \
  --benchmark GPQA \
  --repo-id "your-username/your-model" \
  --value 84.5 \
  --apply
```

---

## Output Format

Results are stored in `.eval_results/*.yaml`:

```yaml
- dataset:
    id: cais/hle       # Hub Benchmark dataset ID
    task_id: default   # Optional task ID
  value: 23.9          # Metric value
  date: "2026-01-14"   # ISO-8601 date
  source:              # Attribution
    url: https://huggingface.co/model/name
    name: Model Card
```

---

## Supported Benchmarks

| Benchmark | Hub Dataset ID |
|-----------|---------------|
| HLE | cais/hle |
| GPQA | Idavidrein/gpqa |
| MMLU-Pro | TIGER-Lab/MMLU-Pro |
| GSM8K | openai/gsm8k |

To add a new benchmark, update `examples/metric_mapping.json`.

---

## Troubleshooting

### "AA_API_KEY not set"
```bash
export AA_API_KEY="your-key"
# or add to .env file
```

### "Could not find benchmark in model card"
The benchmark name may be formatted differently in the README. Check the model card manually.

### "Token does not have write access"
Generate a new token at https://huggingface.co/settings/tokens with Write scope.

---

## Getting Help

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py add-eval --help
uv run scripts/batch_eval_prs.py --help
```

For more information:
- [HuggingFace Eval Results Documentation](https://huggingface.co/docs/hub/eval-results)
- [SKILL.md](../SKILL.md) - Complete skill documentation
skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py ADDED
@@ -0,0 +1,220 @@
# /// script
# requires-python = ">=3.13"
# dependencies = [
#     "huggingface-hub>=1.1.4",
#     "python-dotenv>=1.2.1",
#     "pyyaml>=6.0.3",
#     "requests>=2.32.5",
# ]
# ///

"""
Add Artificial Analysis evaluations to a Hugging Face model repository.

This script outputs evaluation results in the new .eval_results/ format
as documented at https://huggingface.co/docs/hub/eval-results

NOTE: This is a standalone reference script. For integrated functionality
with additional features (README extraction, validation, etc.), use:
    ../scripts/evaluation_manager.py import-aa [options]

STANDALONE USAGE:
    AA_API_KEY="<your-api-key>" HF_TOKEN="<your-huggingface-token>" \
    python artificial_analysis_to_hub.py \
        --creator-slug <artificial-analysis-creator-slug> \
        --model-name <artificial-analysis-model-name> \
        --repo-id <huggingface-repo-id>

INTEGRATED USAGE (Recommended):
    python ../scripts/evaluation_manager.py import-aa \
        --creator-slug <creator-slug> \
        --model-name <model-name> \
        --repo-id <repo-id> \
        [--create-pr]
"""

import argparse
import json
import os
from datetime import date
from pathlib import Path

import requests
import yaml
import dotenv
from huggingface_hub import HfApi

dotenv.load_dotenv()

API_KEY = os.getenv("AA_API_KEY")
HF_TOKEN = os.getenv("HF_TOKEN")
URL = "https://artificialanalysis.ai/api/v2/data/llms/models"
HEADERS = {"x-api-key": API_KEY}

if not API_KEY:
    raise ValueError("AA_API_KEY is not set")
if not HF_TOKEN:
    raise ValueError("HF_TOKEN is not set")


def load_benchmark_mapping():
    """Load the benchmark-to-dataset mapping from metric_mapping.json."""
    mapping_file = Path(__file__).parent / "metric_mapping.json"

    if not mapping_file.exists():
        # Fallback to minimal mapping
        return {
            "MMLU": {"dataset_id": "cais/mmlu", "aliases": ["mmlu"]},
            "GPQA": {"dataset_id": "Idavidrein/gpqa", "task_id": "gpqa_diamond", "aliases": ["gpqa"]},
            "GSM8K": {"dataset_id": "openai/gsm8k", "aliases": ["gsm8k"]},
        }

    with open(mapping_file) as f:
        mapping = json.load(f)

    mapping.pop("_comment", None)
    return mapping


def find_benchmark_dataset(benchmark_name, mapping):
    """Find the Hub dataset ID for a benchmark name."""
    normalized = benchmark_name.lower().replace(" ", "_").replace("-", "_")

    # Try exact match
    if benchmark_name in mapping:
        entry = mapping[benchmark_name]
        return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    # Try case-insensitive match
    for key, entry in mapping.items():
        if key.lower() == benchmark_name.lower():
            return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    # Try matching aliases
    for key, entry in mapping.items():
        aliases = entry.get("aliases", [])
        if normalized in [a.lower().replace(" ", "_").replace("-", "_") for a in aliases]:
            return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    return None


def get_model_evaluations_data(creator_slug, model_name):
    response = requests.get(URL, headers=HEADERS)
    response.raise_for_status()
    response_data = response.json()["data"]
    for model in response_data:
        if (
            model["model_creator"]["slug"] == creator_slug
            and model["slug"] == model_name
        ):
            return model
    raise ValueError(f"Model {model_name} not found")


def aa_evaluations_to_eval_results(model):
    """
    Convert Artificial Analysis model data to .eval_results/ format.

    Returns a list of evaluation entries ready for YAML output.
    """
    if not model:
        raise ValueError("Model data is required")

    evaluations = model.get("evaluations", {})
    mapping = load_benchmark_mapping()
    results = []
    today = date.today().isoformat()

    for key, value in evaluations.items():
        if value is None:
            continue

        # Convert key to title case for matching
        benchmark_name = key.replace("_", " ").title()
        dataset_info = find_benchmark_dataset(benchmark_name, mapping)

        if not dataset_info:
            # Try the original key as well
            dataset_info = find_benchmark_dataset(key, mapping)

        if not dataset_info:
            print(f"Warning: Could not find Hub dataset ID for '{benchmark_name}'. Skipping.")
            continue

        entry = {
            "dataset": {
                "id": dataset_info["dataset_id"],
            },
            "value": value,
            "date": today,
            "source": {
                "url": "https://artificialanalysis.ai",
                "name": "Artificial Analysis",
            },
        }

        # Add task_id if not default
        if dataset_info.get("task_id") and dataset_info["task_id"] != "default":
            entry["dataset"]["task_id"] = dataset_info["task_id"]

        results.append(entry)

    return results


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--creator-slug", type=str, required=True)
    parser.add_argument("--model-name", type=str, required=True)
    parser.add_argument("--repo-id", type=str, required=True)
    parser.add_argument("--filename", type=str, default="artificial_analysis.yaml",
                        help="Output filename in .eval_results/")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print YAML without uploading")
    args = parser.parse_args()

    aa_evaluations_data = get_model_evaluations_data(
177
+ creator_slug=args.creator_slug, model_name=args.model_name
178
+ )
179
+
180
+ eval_results = aa_evaluations_to_eval_results(model=aa_evaluations_data)
181
+
182
+ if not eval_results:
183
+ print("No evaluations could be mapped to Hub dataset IDs")
184
+ return
185
+
186
+ # Generate YAML content
187
+ yaml_content = yaml.dump(eval_results, sort_keys=False, allow_unicode=True)
188
+
189
+ if args.dry_run:
190
+ print("\nGenerated .eval_results/ YAML:")
191
+ print(yaml_content)
192
+ return
193
+
194
+ # Upload to .eval_results/ folder
195
+ api = HfApi(token=HF_TOKEN)
196
+ file_path = f".eval_results/{args.filename}"
197
+
198
+ commit_message = f"Add Artificial Analysis evaluations for {args.model_name}"
199
+ commit_description = (
200
+ "This commit adds structured evaluation results in the new .eval_results/ format. "
201
+ "Results will appear on the model page and linked benchmark leaderboards. "
202
+ "See https://huggingface.co/docs/hub/eval-results for format details."
203
+ )
204
+
205
+ api.upload_file(
206
+ path_or_fileobj=yaml_content.encode("utf-8"),
207
+ path_in_repo=file_path,
208
+ repo_id=args.repo_id,
209
+ repo_type="model",
210
+ commit_message=commit_message,
211
+ commit_description=commit_description,
212
+ create_pr=True,
213
+ )
214
+
215
+ print(f"✓ Pull request created for {args.repo_id}")
216
+ print(f" File: {file_path}")
217
+
218
+
219
+ if __name__ == "__main__":
220
+ main()
skills/hugging-face-evaluation/examples/eval.example.yaml ADDED
@@ -0,0 +1,11 @@
- dataset:
    id: cais/hle        # Required. Hub dataset ID (must be a Benchmark)
    task_id: default    # Optional, in case there are multiple tasks or leaderboards for this dataset.
    revision: <hash>    # Optional. Dataset revision hash
  value: 20.90          # Required. Metric value
  verifyToken: <token>  # Optional. Cryptographic proof of auditable evaluation
  date: "2025-01-15"    # Optional. ISO-8601 date or datetime (defaults to git commit time)
  source:               # Optional. Attribution for this result, for instance a repo containing output traces or a Paper
    url: https://huggingface.co/spaces/SaylorTwift/smollm3-mmlu-pro  # Required if source provided
    name: Eval traces   # Optional. Display name
    user: SaylorTwift   # Optional. HF username/org
skills/hugging-face-evaluation/examples/example_readme_tables.md ADDED
@@ -0,0 +1,135 @@
# Example Evaluation Table Formats

This file shows various formats of evaluation tables that can be extracted from model README files.

## Format 1: Benchmarks as Rows (Most Common)

```markdown
| Benchmark | Score |
|-----------|-------|
| MMLU | 85.2 |
| HumanEval | 72.5 |
| GSM8K | 91.3 |
| HellaSwag | 88.9 |
```

## Format 2: Multiple Metric Columns

```markdown
| Benchmark | Accuracy | F1 Score |
|-----------|----------|----------|
| MMLU | 85.2 | 0.84 |
| GSM8K | 91.3 | 0.91 |
| DROP | 78.5 | 0.77 |
```

## Format 3: Benchmarks as Columns

```markdown
| MMLU | HumanEval | GSM8K | HellaSwag |
|------|-----------|-------|-----------|
| 85.2 | 72.5 | 91.3 | 88.9 |
```

## Format 4: Percentage Values

```markdown
| Benchmark | Score |
|---------------|----------|
| MMLU | 85.2% |
| HumanEval | 72.5% |
| GSM8K | 91.3% |
| TruthfulQA | 68.7% |
```

## Format 5: Mixed Format with Categories

```markdown
### Reasoning

| Benchmark | Score |
|-----------|-------|
| MMLU | 85.2 |
| BBH | 82.4 |
| GPQA | 71.3 |

### Coding

| Benchmark | Score |
|-----------|-------|
| HumanEval | 72.5 |
| MBPP | 78.9 |

### Math

| Benchmark | Score |
|-----------|-------|
| GSM8K | 91.3 |
| MATH | 65.8 |
```

## Format 6: With Additional Columns

```markdown
| Benchmark | Score | Rank | Notes |
|-----------|-------|------|--------------------|
| MMLU | 85.2 | #5 | 5-shot |
| HumanEval | 72.5 | #8 | pass@1 |
| GSM8K | 91.3 | #3 | 8-shot, maj@1 |
```

## How the Extractor Works

The script will:
1. Find all markdown tables in the README
2. Identify which tables contain evaluation results
3. Parse the table structure (rows vs columns)
4. Extract numeric values as scores
5. Convert to model-index YAML format
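The parsing step can be sketched in a few lines. This is only a minimal illustration of the Format 1 case (benchmarks as rows), not the actual `evaluation_manager.py` implementation; the helper name `parse_benchmark_table` is hypothetical.

```python
import re

def parse_benchmark_table(markdown: str) -> dict:
    """Extract {benchmark: score} pairs from a two-column 'Benchmark | Score' table."""
    scores = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # not a two-column row
        name, value = cells
        # Accept plain numbers and percentage values like "91.3%"
        match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*%?", value)
        if match:  # header and separator rows have no numeric score, so they are skipped
            scores[name] = float(match.group(1))
    return scores

table = """
| Benchmark | Score |
|-----------|-------|
| MMLU      | 85.2  |
| GSM8K     | 91.3% |
"""
print(parse_benchmark_table(table))  # {'MMLU': 85.2, 'GSM8K': 91.3}
```

Transposed tables (Format 3) and extra columns (Format 6) would need additional handling, which is why keeping tables simple helps extraction.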

## Tips for README Authors

To ensure your evaluation tables are properly extracted:

1. **Use clear headers**: Include "Benchmark", "Score", or similar terms
2. **Keep it simple**: Stick to benchmark name + score columns
3. **Use standard formats**: Follow markdown table syntax
4. **Include numeric values**: Ensure scores are parseable numbers
5. **Be consistent**: Use the same format across multiple tables

## Example Complete README Section

```markdown
# Model Card for MyModel-7B

## Evaluation Results

Our model was evaluated on several standard benchmarks:

| Benchmark | Score |
|---------------|-------|
| MMLU | 85.2 |
| HumanEval | 72.5 |
| GSM8K | 91.3 |
| HellaSwag | 88.9 |
| ARC-Challenge | 81.7 |
| TruthfulQA | 68.7 |

### Detailed Results

For more detailed results and methodology, see our [paper](link).
```

## Running the Extractor

```bash
# Extract from this example
python scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --dry-run

# Apply to your model card
python scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation"
```
skills/hugging-face-evaluation/examples/metric_mapping.json ADDED
@@ -0,0 +1,118 @@
{
  "_comment": "Maps benchmark names to Hub dataset IDs for .eval_results/ format. Dataset IDs must be registered Benchmarks on HuggingFace Hub.",
  "MMLU": {
    "dataset_id": "cais/mmlu",
    "task_id": "default",
    "aliases": ["mmlu", "massive_multitask_language_understanding"]
  },
  "MMLU-Pro": {
    "dataset_id": "TIGER-Lab/MMLU-Pro",
    "task_id": "mmlu_pro",
    "aliases": ["mmlu_pro", "mmlu-pro"]
  },
  "MMLU-Redux": {
    "dataset_id": "edinburgh-dawg/mmlu-redux",
    "task_id": "default",
    "aliases": ["mmlu_redux", "mmlu-redux"]
  },
  "HumanEval": {
    "dataset_id": "openai/openai_humaneval",
    "task_id": "default",
    "aliases": ["humaneval", "human_eval"]
  },
  "GSM8K": {
    "dataset_id": "openai/gsm8k",
    "task_id": "default",
    "aliases": ["gsm8k", "gsm_8k", "grade_school_math"]
  },
  "HellaSwag": {
    "dataset_id": "Rowan/hellaswag",
    "task_id": "default",
    "aliases": ["hellaswag", "hella_swag"]
  },
  "ARC-Challenge": {
    "dataset_id": "allenai/ai2_arc",
    "task_id": "ARC-Challenge",
    "aliases": ["arc_challenge", "arc-c", "arc_c"]
  },
  "ARC-Easy": {
    "dataset_id": "allenai/ai2_arc",
    "task_id": "ARC-Easy",
    "aliases": ["arc_easy", "arc-e", "arc_e"]
  },
  "Winogrande": {
    "dataset_id": "allenai/winogrande",
    "task_id": "default",
    "aliases": ["winogrande", "wino_grande"]
  },
  "TruthfulQA": {
    "dataset_id": "truthfulqa/truthful_qa",
    "task_id": "default",
    "aliases": ["truthfulqa", "truthful_qa"]
  },
  "GPQA": {
    "dataset_id": "Idavidrein/gpqa",
    "task_id": "gpqa_diamond",
    "aliases": ["gpqa", "gpqa_diamond"]
  },
  "DROP": {
    "dataset_id": "ucinlp/drop",
    "task_id": "default",
    "aliases": ["drop"]
  },
  "BBH": {
    "dataset_id": "lukaemon/bbh",
    "task_id": "default",
    "aliases": ["bbh", "big_bench_hard"]
  },
  "MATH": {
    "dataset_id": "lighteval/MATH",
    "task_id": "default",
    "aliases": ["math"]
  },
  "HLE": {
    "dataset_id": "cais/hle",
    "task_id": "default",
    "aliases": ["hle", "human_level_evaluation", "hle_text_only", "hle_(text_only)"]
  },
  "AIME25": {
    "dataset_id": "OpenEvals/aime_2025",
    "task_id": "default",
    "aliases": ["aime25", "aime_25", "aime_2025"]
  },
  "SWE-bench": {
    "dataset_id": "princeton-nlp/SWE-bench_Verified",
    "task_id": "default",
    "aliases": ["swe_bench", "swe-bench", "swe_bench_verified", "swe-bench_verified"]
  },
  "LiveCodeBench": {
    "dataset_id": "livecodebench/livecodebench",
    "task_id": "default",
    "aliases": ["livecodebench", "live_code_bench", "livecodebenchv6"]
  },
  "SimpleQA": {
    "dataset_id": "OpenEvals/SimpleQA",
    "task_id": "default",
    "aliases": ["simpleqa", "simple_qa"]
  },
  "AIME": {
    "dataset_id": "OpenEvals/aime_24",
    "task_id": "default",
    "aliases": ["aime", "aime_24", "aime24"]
  },
  "IFEval": {
    "dataset_id": "google/IFEval",
    "task_id": "default",
    "aliases": ["ifeval", "if_eval"]
  },
  "MBPP": {
    "dataset_id": "google-research-datasets/mbpp",
    "task_id": "default",
    "aliases": ["mbpp"]
  },
  "MuSR": {
    "dataset_id": "OpenEvals/MuSR",
    "task_id": "default",
    "aliases": ["musr"]
  }
}
skills/hugging-face-evaluation/references/hf_cli_for_prs.md ADDED
@@ -0,0 +1,258 @@
# HF CLI Workflow for Evaluation PRs

This document explains how to manage evaluation result PRs using the `hf` CLI and temporary directories.

## Directory Structure

Use `/tmp/pr-reviews/` as the working directory for PR operations:

```
/tmp/pr-reviews/
├── updates/          # YAML files for updating existing PRs
├── new-prs/          # YAML files for new PRs
├── <model-name>/     # Model-specific directories
└── check-<model>/    # Directories for verifying PR contents
```

## Creating New PRs

### Step 1: Create a YAML file

```bash
mkdir -p /tmp/pr-reviews/new-prs
cd /tmp/pr-reviews/new-prs

cat > hle.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
EOF
```

### Step 2: Upload and create the PR

```bash
hf upload org/model-name hle.yaml .eval_results/hle.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result"
```

### Step 3: Get the PR number

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "org/model-name"
```

## Updating Existing PRs

### Step 1: Download the current PR contents

```bash
hf download org/model-name --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --include ".eval_results/*" \
  --local-dir /tmp/pr-reviews/<model-name>-pr<PR_NUMBER>
```

### Step 2: Review the current contents

```bash
cat /tmp/pr-reviews/<model-name>-pr<PR_NUMBER>/.eval_results/*.yaml
```

### Step 3: Create the updated YAML

```bash
cat > /tmp/pr-reviews/updates/updated.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
  notes: "With tools"
EOF
```

### Step 4: Push the update to the existing PR

```bash
hf upload org/model-name /tmp/pr-reviews/updates/updated.yaml .eval_results/hle.yaml \
  --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --commit-message "Update evaluation result"
```

## Deleting Files from PRs

Use the `huggingface_hub` Python API to delete files:

```bash
uv run --with huggingface_hub python3 << 'EOF'
from huggingface_hub import HfApi

api = HfApi()
api.delete_file(
    path_in_repo=".eval_results/old_file.yaml",
    repo_id="org/model-name",
    repo_type="model",
    revision="refs/pr/<PR_NUMBER>",
    commit_message="Remove duplicate file",
)
EOF
```

## Verifying PR Contents

### Check what files are in a PR

```bash
rm -rf /tmp/check-<model>
hf download org/model-name --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --include ".eval_results/*" \
  --local-dir /tmp/check-<model>

ls -la /tmp/check-<model>/.eval_results/
```

### Compare a PR to the main branch

```bash
# Download main
hf download org/model-name --repo-type model \
  --revision main \
  --include ".eval_results/*" \
  --local-dir /tmp/<model>-main

# Download the PR
hf download org/model-name --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --include ".eval_results/*" \
  --local-dir /tmp/<model>-pr<PR_NUMBER>

# Compare
diff /tmp/<model>-main/.eval_results/ /tmp/<model>-pr<PR_NUMBER>/.eval_results/
```

## Multiple Score Variants

When a model has multiple scores for the same benchmark (e.g., with/without tools), create separate files:

```bash
cd /tmp/pr-reviews/new-prs

# Default (no tools) - no notes field
cat > hle.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 10.2
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
EOF

# With tools - add a notes field
cat > hle_with_tools.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 15.5
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
  notes: "With tools"
EOF

# Create separate PRs
hf upload org/model-name hle.yaml .eval_results/hle.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result"

hf upload org/model-name hle_with_tools.yaml .eval_results/hle_with_tools.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result (with tools)"
```

## Restoring Accidentally Deleted Files

If a PR shows a file as deleted (because it was removed from the PR branch), restore it from main:

```bash
# Download the file from main
hf download org/model-name --repo-type model \
  --revision main \
  --include ".eval_results/hle.yaml" \
  --local-dir /tmp/<model>-main

# Re-upload it to the PR to restore it
hf upload org/model-name /tmp/<model>-main/.eval_results/hle.yaml .eval_results/hle.yaml \
  --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --commit-message "Restore original file"
```

## Common Patterns

### Batch-create YAML files

```bash
cd /tmp/pr-reviews/updates

# Create multiple files in one script
for model in "org/model1" "org/model2"; do
  cat > "${model//\//-}-hle.yaml" << EOF
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  source:
    url: https://huggingface.co/$model
    name: Model Card
    user: burtenshaw
EOF
done
```

### Check for existing PRs before creating new ones

Always check first:

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "org/model-name"
```

If PRs exist, update them instead of creating new ones.

## File Naming Convention

| Condition | File Name | Notes Field |
|-----------|-----------|-------------|
| Default (no tools) | `hle.yaml` | None (omit) |
| With tools | `hle_with_tools.yaml` | `notes: "With tools"` |
| Different task | `gpqa_diamond.yaml` | Based on task_id |

## Cleanup

After PRs are merged or work is complete:

```bash
rm -rf /tmp/pr-reviews/
rm -rf /tmp/check-*
rm -rf /tmp/*-main
rm -rf /tmp/*-pr*
```
skills/hugging-face-evaluation/references/hf_papers_extraction.md ADDED
@@ -0,0 +1,297 @@
# Paper Score Extraction via HF MCP Server

This document provides instructions for extracting benchmark scores from academic papers linked to HuggingFace models using the HF MCP Server tools.

---

## Overview

Papers linked to HuggingFace models often contain comprehensive benchmark results that aren't in the model card. This guide shows how to:

1. Use `hub_repo_details` to discover papers linked to a model
2. Use `paper_search` to find and retrieve paper abstracts
3. Extract benchmark scores from paper abstracts/content
4. Use `WebFetch` on arXiv PDFs for detailed scores not in abstracts
5. Format results for `.eval_results/`

---

## Step 1: Discover Linked Papers

### Get Model Details with README

Use `hub_repo_details` to fetch model metadata including linked papers:

```
mcp__hf-mcp-server__hub_repo_details
  repo_ids: ["org/model-name"]
  include_readme: true
```

Look for arXiv references in:
- `tags` array: entries starting with `arxiv:` (e.g., `"arxiv:2411.15124"`)
- README content: arXiv links or paper references
- Model metadata: `paperInfo` or `cardData.arxiv` fields

### Example Response Fields

The response will include:
- Model metadata (downloads, likes, tags)
- README content (if `include_readme: true`)
- Any linked paper IDs in tags

---

## Step 2: Search for Papers

Once you have an arXiv ID or want to find related papers, use `paper_search`:

```
mcp__hf-mcp-server__paper_search
  query: "OLMo-2 evaluation benchmark"
  results_limit: 5
  concise_only: false  # Get full abstracts for score extraction
```

### Search Strategies

**By model name:**
```
query: "Llama 3.1 benchmark evaluation"
```

**By arXiv ID (if known):**
```
query: "2411.15124"
```

**By benchmark + model family:**
```
query: "MMLU GPQA Qwen2.5"
```

---

## Step 3: Extract Benchmark Scores

The paper search returns abstracts and paper content. Look for:

### Common Benchmark Mentions

Papers typically report headline numbers in abstracts:
- "achieves **85.2%** on MMLU"
- "state-of-the-art results on GPQA Diamond (72.1%)"
- "HLE score of 12.3%"

### Benchmark Name Variations

| Standard Name | Paper Variations |
|---------------|------------------|
| HLE | Humanity's Last Exam, HLE (Text Only) |
| GPQA | GPQA Diamond, GPQA-Diamond |
| MMLU | MMLU, MMLU-Pro, Massive Multitask |
| GSM8K | GSM8K, GSM-8K, Grade School Math |
| MATH | MATH, MATH-500 |
| HumanEval | HumanEval, human_eval |
| SWE-bench | SWE-bench, SWE-bench Verified |

### Score Format Normalization

- **Percentages**: `85.2%` → use `85.2`
- **Decimals**: `0.852` → convert to `85.2` if context shows percentages
- **Accuracy vs Error Rate**: Ensure you're extracting accuracy, not error rate
+
104
+ ---
105
+
106
+ ## Step 4: Format for .eval_results/
107
+
108
+ Once you have extracted scores, format them as YAML:
109
+
110
+ ```yaml
111
+ # .eval_results/{benchmark_name}.yaml
112
+ - dataset:
113
+ id: {hub_dataset_id}
114
+ task_id: {task_variant}
115
+ value: {score}
116
+ date: "{extraction_date}"
117
+ source:
118
+ url: https://arxiv.org/abs/{arxiv_id}
119
+ name: Paper
120
+ ```
121
+
122
+ ### Dataset ID Reference
123
+
124
+ | Benchmark | Dataset ID | Task ID |
125
+ |-----------|------------|---------|
126
+ | HLE | `cais/hle` | `default` |
127
+ | GPQA | `Idavidrein/gpqa` | `gpqa_diamond` |
128
+ | MMLU | `cais/mmlu` | `default` |
129
+ | MMLU-Pro | `TIGER-Lab/MMLU-Pro` | `default` |
130
+ | GSM8K | `openai/gsm8k` | `default` |
131
+ | MATH | `lighteval/MATH` | `default` |
132
+ | HumanEval | `openai/openai_humaneval` | `default` |
133
+ | DROP | `ucinlp/drop` | `default` |
134
+ | ARC-Challenge | `allenai/ai2_arc` | `ARC-Challenge` |
135
+ | HellaSwag | `Rowan/hellaswag` | `default` |
136
+ | TruthfulQA | `truthfulqa/truthful_qa` | `default` |
137
+ | IFEval | `google/IFEval` | `default` |
138
+ | SWE-bench | `princeton-nlp/SWE-bench_Verified` | `default` |
139
+ | AIME24 | `OpenEvals/aime_24` | `default` |
140
+ | AIME25 | `OpenEvals/aime_2025` | `default` |
141
+ | LiveCodeBench | `livecodebench/livecodebench` | `default` |
142
+
143
+ ---
144
+
145
+ ## Complete Example Workflow
146
+
147
+ ### Scenario: Extract MMLU score for OLMo-2 from its paper
148
+
149
+ ```
150
+ 1. Get model details:
151
+ mcp__hf-mcp-server__hub_repo_details
152
+ repo_ids: ["allenai/OLMo-2-1124-7B-Instruct"]
153
+ include_readme: true
154
+
155
+ → Found tags: ["arxiv:2411.15124", "arxiv:2501.00656"]
156
+
157
+ 2. Search for the paper:
158
+ mcp__hf-mcp-server__paper_search
159
+ query: "OLMo-2 2501.00656"
160
+ results_limit: 3
161
+
162
+ → Returns paper abstract with benchmark scores
163
+
164
+ 3. Extract from abstract:
165
+ "OLMo-2-7B-Instruct achieves 61.3 on MMLU..."
166
+
167
+ If score not in abstract, fetch PDF:
168
+ WebFetch
169
+ url: "https://arxiv.org/pdf/2501.00656"
170
+ prompt: "Find MMLU score for OLMo-2-7B-Instruct in the evaluation tables."
171
+
172
+ 4. Create the eval result:
173
+ $ uv run scripts/evaluation_manager.py add-eval \
174
+ --benchmark MMLU \
175
+ --repo-id "allenai/OLMo-2-1124-7B-Instruct" \
176
+ --value 61.3 \
177
+ --create-pr
178
+ ```
179
+
180
+ ---
181
+
182
+ ## Tips for Better Extraction
183
+
184
+ ### 1. Check Multiple Papers
185
+ Models may have multiple linked papers. Use `hub_repo_details` to find all arXiv tags, then search for each.
186
+
187
+ ### 2. Use Concise Mode for Broad Searches
188
+ ```
189
+ mcp__hf-mcp-server__paper_search
190
+ query: "large language model evaluation"
191
+ concise_only: true # 2-sentence summaries
192
+ results_limit: 10
193
+ ```
194
+
195
+ ### 3. Prefer Primary Sources
196
+ Use the model's own release paper rather than papers that cite it.
197
+
198
+ ### 4. Note Evaluation Settings
199
+ Papers may report different settings (0-shot vs 5-shot, with/without CoT). Document which setting you're extracting.
200
+
201
+ ### 5. Cross-Reference Model Card
202
+ If both paper and model card have scores, prefer the paper as the authoritative source but verify they match.
203
+
204
+ ---
205
+
206
+ ## Step 5: Extract Scores from Paper PDFs
207
+
208
+ The `paper_search` tool only returns abstracts, which often miss detailed benchmark tables. For comprehensive score extraction, fetch the full paper PDF.
209
+
210
+ ### URL Pattern
211
+
212
+ HuggingFace paper links map directly to arxiv PDFs:
213
+
214
+ | Source | URL Pattern |
215
+ |--------|-------------|
216
+ | HF Paper Page | `https://huggingface.co/papers/{arxiv_id}` |
217
+ | arxiv Abstract | `https://arxiv.org/abs/{arxiv_id}` |
218
+ | arxiv PDF | `https://arxiv.org/pdf/{arxiv_id}` |
219
+
220
+ **Example**: `2601.01739` → `https://arxiv.org/pdf/2601.01739`
221
+
222
+ ### Fetching PDF Content
223
+
224
+ Use `WebFetch` to retrieve and search the PDF:
225
+
226
+ ```
227
+ WebFetch
228
+ url: "https://arxiv.org/pdf/{arxiv_id}"
229
+ prompt: "Extract all benchmark evaluation scores and results tables. Look for metrics like accuracy, F1, BLEU, pass@k, or percentage scores. List each benchmark name and its corresponding score."
230
+ ```
231
+
232
+ ### Targeted Extraction Prompts
233
+
234
+ For specific benchmarks:
235
+
236
+ ```
237
+ prompt: "Find the HLE (Humanity's Last Exam) score in this paper. Look in results tables and the evaluation section."
238
+ ```
239
+
240
+ ```
241
+ prompt: "Extract all scores from the main results table. Include benchmark names, model variants, and numerical scores."
242
+ ```
243
+
244
+ ```
245
+ prompt: "Find MMLU, GPQA, GSM8K, and MATH scores for the main model in this paper."
246
+ ```
247
+
248
+ ### When to Use PDF Extraction
249
+
250
+ Use PDF fetching when:
251
+ - Abstract doesn't contain specific benchmark scores
252
+ - You need scores for multiple benchmarks
253
+ - Paper mentions "see Table X for full results"
254
+ - Model card references paper but lacks detailed numbers
255
+
256
+ ### Example: Full PDF Extraction Workflow
257
+
258
+ ```
259
+ 1. Get arxiv ID from model:
260
+ mcp__hf-mcp-server__hub_repo_details
261
+ repo_ids: ["meta-llama/Llama-3.1-70B-Instruct"]
262
+ include_readme: true
263
+
264
+ → Found: arxiv:2407.21783
265
+
266
+ 2. Fetch PDF for detailed scores:
267
+ WebFetch
268
+ url: "https://arxiv.org/pdf/2407.21783"
269
+ prompt: "Extract benchmark scores for Llama 3.1 70B Instruct from all evaluation tables. Include MMLU, GPQA Diamond, HumanEval, GSM8K, MATH, and any other benchmarks."
270
+
271
+ 3. Parse extracted scores and create eval results
272
+ ```
273
+
274
+ ---
275
+
276
+ ## Common Issues
277
+
278
+ ### Paper Score Differs from Model Card
279
+ - Paper may report different model size/variant
280
+ - Evaluation settings may differ
281
+ - Paper may be pre-release; model card updated post-release
282
+
283
+ **Solution**: Note the discrepancy and prefer the most recent source.
284
+
285
+ ### Score Not Found in Paper Search
286
+ - Paper abstracts rarely contain full benchmark tables
287
+ - Paper may not have evaluated that benchmark
288
+ - Try searching with different query terms
289
+ - Check if benchmark uses a different name
290
+
291
+ **Solution**: Use `WebFetch` to fetch the full PDF (`https://arxiv.org/pdf/{arxiv_id}`) - detailed scores are typically in results tables within the paper body, not the abstract. Also try alternative query terms or benchmark aliases.
292
+
293
+ ### Multiple Models in Paper
294
+ - Paper describes a family of models (7B, 13B, 70B)
295
+ - Results may combine scores across sizes
296
+
297
+ **Solution**: Carefully match the exact model variant to the HuggingFace repo.
skills/hugging-face-evaluation/references/model_card_extraction.md ADDED
@@ -0,0 +1,244 @@
1
+ # Model Card Score Extraction via HF MCP Server
2
+
3
+ This document provides instructions for extracting benchmark scores from HuggingFace model cards using the HF MCP Server tools.
4
+
5
+ ---
6
+
7
+ ## Overview
8
+
9
+ Model cards often contain evaluation tables with benchmark scores. This guide shows how to:
10
+
11
+ 1. Use `hub_repo_details` to fetch model card content
12
+ 2. Search for benchmark variations in the README
13
+ 3. Extract and normalize scores
14
+ 4. Format results for `.eval_results/`
15
+
16
+ ---
17
+
18
+ ## Step 1: Fetch the Model Card
19
+
20
+ Use `hub_repo_details` to get the model's README content:
21
+
22
+ ```
23
+ mcp__hf-mcp-server__hub_repo_details
24
+ repo_ids: ["org/model-name"]
25
+ include_readme: true
26
+ ```
27
+
28
+ This returns:
29
+ - Model metadata (downloads, likes, tags, pipeline_tag)
30
+ - Full README content (when `include_readme: true`)
31
+ - Linked papers and datasets
32
+
33
+ ### Batch Fetching
34
+
35
+ You can fetch multiple models at once:
36
+
37
+ ```
38
+ mcp__hf-mcp-server__hub_repo_details
39
+ repo_ids: ["meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen2.5-7B-Instruct"]
40
+ include_readme: true
41
+ ```
42
+
43
+ ---
44
+
45
+ ## Step 2: Search for Benchmark Scores
46
+
47
+ ### Benchmark Name Variations
48
+
49
+ Model cards use inconsistent naming. Search for these variations:
50
+
51
+ | Benchmark | Variations to Search |
52
+ |-----------|---------------------|
53
+ | HLE | `HLE`, `hle`, `Humanity's Last Exam`, `HLE (Text Only)` |
54
+ | GPQA | `GPQA`, `GPQA Diamond`, `gpqa_diamond`, `GPQA-Diamond` |
55
+ | MMLU-Pro | `MMLU-Pro`, `MMLU Pro`, `mmlu_pro`, `MMLU-PRO` |
56
+ | MMLU | `MMLU`, `mmlu`, `Massive Multitask Language Understanding` |
57
+ | GSM8K | `GSM8K`, `gsm8k`, `GSM-8K`, `Grade School Math` |
58
+ | HumanEval | `HumanEval`, `humaneval`, `human_eval` |
59
+ | HellaSwag | `HellaSwag`, `hellaswag`, `hella_swag` |
60
+ | ARC-Challenge | `ARC-Challenge`, `ARC-C`, `arc_challenge` |
61
+ | TruthfulQA | `TruthfulQA`, `truthful_qa`, `TruthfulQA MC` |
62
+ | MATH | `MATH`, `math`, `MATH-500` |
63
+ | AIME | `AIME`, `AIME24`, `AIME 2024`, `aime_24` |
64
+ | SWE-bench | `SWE-bench`, `SWE-bench Verified`, `swe_bench` |
65
+ | LiveCodeBench | `LiveCodeBench`, `LCB`, `LiveCodeBenchV6` |
66
+ | IFEval | `IFEval`, `IF-Eval`, `ifeval` |
67
+
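The variation table above can be applied mechanically with a case-insensitive scan. A minimal sketch (the table subset and `find_benchmarks` name are illustrative, not part of this skill's scripts):

```python
import re

# Illustrative subset of the variation table above.
BENCHMARK_VARIATIONS = {
    "GPQA": ["GPQA", "GPQA Diamond", "gpqa_diamond", "GPQA-Diamond"],
    "MMLU-Pro": ["MMLU-Pro", "MMLU Pro", "mmlu_pro", "MMLU-PRO"],
}

def find_benchmarks(readme: str) -> set[str]:
    """Return canonical benchmark names whose variations appear in the README."""
    found = set()
    for canonical, variations in BENCHMARK_VARIATIONS.items():
        for variant in variations:
            # Word-boundary match: "GPQA" alone should not fire inside "gpqa_diamond".
            if re.search(rf"(?<![\w-]){re.escape(variant)}(?![\w-])", readme, re.IGNORECASE):
                found.add(canonical)
                break
    return found
```

The negative lookaround on `[\w-]` keeps short names from matching inside longer, hyphenated or underscored variants.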
68
+ ---
69
+
70
+ ## Step 3: Identify Table Formats
71
+
72
+ Model cards typically present scores in these formats:
73
+
74
+ ### Format A: Model-Column Table (most common)
75
+ ```markdown
76
+ | Model | MMLU | GPQA | HLE |
77
+ |-------|------|------|-----|
78
+ | This Model | 85.2 | 72.1 | 12.3 |
79
+ | GPT-4 | 86.4 | 74.2 | 15.1 |
80
+ ```
81
+
82
+ ### Format B: Benchmark-Column Table
83
+ ```markdown
84
+ | Benchmark | Score |
85
+ |-----------|-------|
86
+ | MMLU | 85.2 |
87
+ | GPQA | 72.1 |
88
+ | HLE | 12.3 |
89
+ ```
90
+
91
+ ### Format C: Inline Text
92
+ ```markdown
93
+ Our model achieves **85.2%** on MMLU, **72.1%** on GPQA Diamond, and **12.3%** on HLE.
94
+ ```
95
+
96
+ ### Format D: Nested/Grouped Tables
97
+ ```markdown
98
+ | Category | Benchmark | Score |
99
+ |----------|-----------|-------|
100
+ | Reasoning | GPQA | 72.1 |
101
+ | Knowledge | MMLU | 85.2 |
102
+ ```
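For Format B tables, a simple row parser is often enough. A sketch, assuming scores are plain numbers with an optional `%` sign (non-numeric rows such as the header and `|---|` separator are skipped):

```python
def parse_benchmark_table(markdown: str) -> dict[str, float]:
    """Parse a Format B table (| Benchmark | Score |) into {name: score}."""
    scores = {}
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2:
            continue  # not a two-column row (e.g. Format A or D)
        name, value = cells
        try:
            scores[name] = float(value.rstrip("%"))
        except ValueError:
            continue  # header or |---| separator row
    return scores
```

Formats A and D need column-aware handling (matching the model's row, or carrying the category column), so this sketch deliberately ignores rows with a different cell count.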
103
+
104
+ ---
105
+
106
+ ## Step 4: Extract and Normalize Scores
107
+
108
+ ### Score Format Normalization
109
+
110
+ Scores may be presented as:
111
+ - **Percentages**: `85.2%` or `85.2` (when context implies %)
112
+ - **Decimals**: `0.852` (multiply by 100 for percentage)
113
+ - **Fractions**: `85.2/100`
114
+
115
+ **Important**: The `.eval_results/` format expects values matching the benchmark's standard scale. Most benchmarks use percentage scale (0-100).
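That normalization can be sketched as follows; note the decimal heuristic misreads genuine sub-1% scores, so treat it as a heuristic, not a rule:

```python
def normalize_score(raw: str) -> float:
    """Normalize a raw score string to the 0-100 percentage scale (heuristic)."""
    raw = raw.strip().rstrip("%")
    if "/" in raw:                      # fraction form, e.g. "85.2/100"
        numerator, denominator = raw.split("/")
        return float(numerator) / float(denominator) * 100
    value = float(raw)
    if 0 < value <= 1:                  # decimal form, e.g. "0.852"
        return value * 100              # NOTE: misreads genuine sub-1% scores
    return value                        # already on the 0-100 scale
```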
116
+
117
+ ---
118
+
119
+ ## Step 5: Format for .eval_results/
120
+
121
+ Once you have the score, format it for `.eval_results/`:
122
+
123
+ ```yaml
124
+ # .eval_results/{benchmark}.yaml
125
+ - dataset:
126
+     id: cais/hle              # Hub dataset ID (see mapping below)
127
+     task_id: default          # Task variant if applicable
128
+   value: 12.3                 # Score value
129
+   date: "2026-01-14"          # ISO date of extraction
130
+   source:
131
+     url: https://huggingface.co/{org}/{model}
132
+     name: Model Card
133
+ ```
134
+
135
+ ### Dataset ID Reference
136
+
137
+ | Benchmark | Dataset ID | Task ID |
138
+ |-----------|------------|---------|
139
+ | HLE | `cais/hle` | `default` |
140
+ | GPQA | `Idavidrein/gpqa` | `gpqa_diamond` |
141
+ | MMLU-Pro | `TIGER-Lab/MMLU-Pro` | `default` |
142
+ | MMLU | `cais/mmlu` | `default` |
143
+ | GSM8K | `openai/gsm8k` | `default` |
144
+ | HumanEval | `openai/openai_humaneval` | `default` |
145
+ | MATH | `lighteval/MATH` | `default` |
146
+ | ARC-Challenge | `allenai/ai2_arc` | `ARC-Challenge` |
147
+ | HellaSwag | `Rowan/hellaswag` | `default` |
148
+ | TruthfulQA | `truthfulqa/truthful_qa` | `default` |
149
+ | SWE-bench | `princeton-nlp/SWE-bench_Verified` | `default` |
150
+ | AIME24 | `OpenEvals/aime_24` | `default` |
151
+ | AIME25 | `OpenEvals/aime_2025` | `default` |
152
+ | LiveCodeBench | `livecodebench/livecodebench` | `default` |
153
+ | IFEval | `google/IFEval` | `default` |
154
+
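Putting the schema and the mapping table together, one entry can be assembled in code before serializing to YAML. A sketch (the dict subset and `make_eval_entry` name are illustrative, not part of this skill's scripts):

```python
from datetime import date

# Illustrative subset of the dataset ID table above.
DATASET_IDS = {
    "HLE": ("cais/hle", "default"),
    "GPQA": ("Idavidrein/gpqa", "gpqa_diamond"),
}

def make_eval_entry(benchmark: str, value: float, model_repo: str) -> dict:
    """Build one .eval_results/ entry matching the YAML schema shown above."""
    dataset_id, task_id = DATASET_IDS[benchmark]
    return {
        "dataset": {"id": dataset_id, "task_id": task_id},
        "value": value,
        "date": date.today().isoformat(),
        "source": {
            "url": f"https://huggingface.co/{model_repo}",
            "name": "Model Card",
        },
    }
```

A list of such entries dumped with `yaml.dump(..., sort_keys=False)` matches the file layout expected under `.eval_results/`.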
155
+ ---
156
+
157
+ ## Complete Example Workflow
158
+
159
+ ### Scenario: Extract HLE score from a model card
160
+
161
+ ```
162
+ 1. Fetch model card:
163
+ mcp__hf-mcp-server__hub_repo_details
164
+ repo_ids: ["Qwen/Qwen2.5-72B-Instruct"]
165
+ include_readme: true
166
+
167
+ 2. Search README for HLE variations:
168
+ Found: "| HLE | 18.5 |" in evaluation table
169
+
170
+ 3. Create the eval result:
171
+ $ uv run scripts/evaluation_manager.py add-eval \
172
+ --benchmark HLE \
173
+ --repo-id "Qwen/Qwen2.5-72B-Instruct" \
174
+ --value 18.5 \
175
+ --create-pr
176
+ ```
177
+
178
+ ---
179
+
180
+ ## Finding Models with Evaluations
181
+
182
+ Use `model_search` to find models that might have benchmark scores:
183
+
184
+ ### Search for Trending Models
185
+ ```
186
+ mcp__hf-mcp-server__model_search
187
+ task: "text-generation"
188
+ sort: "trendingScore"
189
+ limit: 20
190
+ ```
191
+
192
+ ### Search by Author
193
+ ```
194
+ mcp__hf-mcp-server__model_search
195
+ author: "meta-llama"
196
+ task: "text-generation"
197
+ limit: 10
198
+ ```
199
+
200
+ ### Search by Query
201
+ ```
202
+ mcp__hf-mcp-server__model_search
203
+ query: "instruct chat"
204
+ task: "text-generation"
205
+ limit: 20
206
+ ```
207
+
208
+ Then use `hub_repo_details` on promising results to check their model cards.
209
+
210
+ ---
211
+
212
+ ## Tips for Better Extraction
213
+
214
+ ### 1. Check the Full README
215
+ Model cards may have scores in different sections (Overview, Evaluation, Benchmarks, Results).
216
+
217
+ ### 2. Look for Multiple Tables
218
+ Some model cards have separate tables for different benchmark categories.
219
+
220
+ ### 3. Note Evaluation Settings
221
+ Papers may report different settings (0-shot vs 5-shot, with/without CoT). Document which setting you're extracting.
222
+
223
+ ### 4. Verify Against Papers
224
+ If both paper and model card have scores, prefer the paper as the authoritative source but verify they match.
225
+
226
+ ---
227
+
228
+ ## Common Issues
229
+
230
+ ### Score in Image/Figure Only
231
+ **Solution**: Check if there's a linked technical report or paper with tabular data. Use `paper_search` to find it.
232
+
233
+ ### Benchmark Name Differs Significantly
234
+ **Solution**: Search for the underlying task name (e.g., "graduate-level science" for GPQA).
235
+
236
+ ### Multiple Scores for Same Benchmark
237
+ **Solution**: Prefer "0-shot" or "standard" settings; note the configuration in source attribution.
238
+
239
+ ### Score Not Found
240
+ - Score genuinely not present in model card
241
+ - Benchmark not evaluated by model authors
242
+ - Try `paper_search` to find scores in linked papers
243
+
244
+ **Solution**: Document why extraction failed to distinguish "not found by automation" from "truly not available".
skills/hugging-face-evaluation/scripts/check_prs.py ADDED
@@ -0,0 +1,98 @@
1
+ # /// script
2
+ # requires-python = ">=3.10"
3
+ # dependencies = [
4
+ # "requests>=2.31.0",
5
+ # ]
6
+ # ///
7
+
8
+ """
9
+ Check for open pull requests on a Hugging Face model repository.
10
+
11
+ Usage:
12
+ uv run scripts/check_prs.py --repo-id "org/model-name"
13
+ """
14
+
15
+ import argparse
16
+ import sys
17
+ from typing import Any
18
+
19
+ import requests
20
+
21
+
22
+ def get_open_prs(repo_id: str) -> list[dict[str, Any]]:
23
+ """
24
+ Fetch open pull requests for a Hugging Face model repository.
25
+
26
+ Args:
27
+ repo_id: Hugging Face model repository ID (e.g., "nvidia/model-name")
28
+
29
+ Returns:
30
+ List of open PR dictionaries with num, title, author, and createdAt
31
+ """
32
+ url = f"https://huggingface.co/api/models/{repo_id}/discussions"
33
+
34
+ try:
35
+ response = requests.get(url, timeout=30, allow_redirects=True)
36
+ response.raise_for_status()
37
+
38
+ data = response.json()
39
+ discussions = data.get("discussions", [])
40
+
41
+ open_prs = [
42
+ {
43
+ "num": d["num"],
44
+ "title": d["title"],
45
+ "author": d["author"]["name"],
46
+ "createdAt": d.get("createdAt", "unknown"),
47
+ }
48
+ for d in discussions
49
+ if d.get("status") == "open" and d.get("isPullRequest")
50
+ ]
51
+
52
+ return open_prs
53
+
54
+ except requests.RequestException as e:
55
+ print(f"Error fetching PRs from Hugging Face: {e}", file=sys.stderr)
56
+ return []
57
+
58
+
59
+ def list_open_prs(repo_id: str) -> None:
60
+ """Display open pull requests for a model repository."""
61
+ prs = get_open_prs(repo_id)
62
+
63
+ print(f"\n{'='*70}")
64
+ print(f"Open Pull Requests for: {repo_id}")
65
+ print(f"{'='*70}")
66
+
67
+ if not prs:
68
+ print("\nNo open pull requests found.")
69
+ else:
70
+ print(f"\nFound {len(prs)} open PR(s):\n")
71
+ for pr in prs:
72
+ print(f" PR #{pr['num']} - {pr['title']}")
73
+ print(f" Author: {pr['author']}")
74
+ print(f" Created: {pr['createdAt']}")
75
+ print(f" URL: https://huggingface.co/{repo_id}/discussions/{pr['num']}")
76
+ print()
77
+
78
+ print(f"{'='*70}\n")
79
+
80
+
81
+ def main():
82
+ parser = argparse.ArgumentParser(
83
+ description="Check for open pull requests on a Hugging Face model repository.",
84
+ epilog="Always run this before creating new PRs to avoid duplicates.",
85
+ )
86
+ parser.add_argument(
87
+ "--repo-id",
88
+ type=str,
89
+ required=True,
90
+ help="HF repository ID (e.g., 'nvidia/model-name')",
91
+ )
92
+
93
+ args = parser.parse_args()
94
+ list_open_prs(args.repo_id)
95
+
96
+
97
+ if __name__ == "__main__":
98
+ main()
skills/hugging-face-evaluation/scripts/import_aa.py ADDED
@@ -0,0 +1,353 @@
1
+ # /// script
2
+ # requires-python = ">=3.11"
3
+ # dependencies = [
4
+ # "huggingface-hub>=1.1.4",
5
+ # "python-dotenv>=1.0.0",
6
+ # "pyyaml>=6.0.0",
7
+ # "requests>=2.31.0",
8
+ # ]
9
+ # ///
10
+ """
11
+ Import evaluation results from Artificial Analysis API.
12
+
13
+ Usage:
14
+ # Look up a specific benchmark (dry run - prints YAML)
15
+ AA_API_KEY=... uv run scripts/import_aa.py --repo-id "org/model" --benchmark HLE
16
+
17
+ # Look up a benchmark and create PR
18
+ AA_API_KEY=... uv run scripts/import_aa.py --repo-id "org/model" --benchmark GPQA --create-pr
19
+
20
+ # Import all available benchmarks
21
+ AA_API_KEY=... uv run scripts/import_aa.py --repo-id "org/model" --all
22
+
23
+ # Provide value manually (skip lookup)
24
+ uv run scripts/import_aa.py --repo-id "org/model" --benchmark HLE --value 22.5 --create-pr
25
+ """
26
+
27
+ from __future__ import annotations
28
+
29
+ import argparse
30
+ import json
31
+ import os
32
+ import re
33
+ import sys
34
+ from datetime import date
35
+ from pathlib import Path
36
+ from typing import Any
37
+
38
+ import requests
39
+
40
+
41
+ AA_INDEX_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"
42
+
43
+
44
+ def load_env() -> None:
45
+ try:
46
+ import dotenv
47
+ dotenv.load_dotenv()
48
+ except ModuleNotFoundError:
49
+ pass
50
+
51
+
52
+ def load_benchmark_mapping() -> dict[str, Any]:
53
+ script_dir = Path(__file__).parent
54
+ mapping_file = script_dir.parent / "examples" / "metric_mapping.json"
55
+
56
+ if not mapping_file.exists():
57
+ return {
58
+ "GPQA": {"dataset_id": "Idavidrein/gpqa", "task_id": "gpqa_diamond", "aliases": ["gpqa"]},
59
+ "HLE": {"dataset_id": "cais/hle", "task_id": "default", "aliases": ["hle"]},
60
+ "SimpleQA": {"dataset_id": "OpenEvals/SimpleQA", "task_id": "default", "aliases": ["simpleqa"]},
61
+ "MMLU": {"dataset_id": "cais/mmlu", "task_id": "default", "aliases": ["mmlu"]},
62
+ "GSM8K": {"dataset_id": "openai/gsm8k", "task_id": "default", "aliases": ["gsm8k"]},
63
+ }
64
+
65
+ with open(mapping_file) as f:
66
+ mapping = json.load(f)
67
+ mapping.pop("_comment", None)
68
+ return mapping
69
+
70
+
71
+ def find_benchmark_dataset(benchmark_name: str, mapping: dict[str, Any]) -> dict[str, str] | None:
72
+ cleaned = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', benchmark_name)
73
+ cleaned = re.sub(r'\*\*([^\*]+)\*\*', r'\1', cleaned)
74
+ cleaned = re.sub(r'\*([^\*]+)\*', r'\1', cleaned)
75
+ cleaned = cleaned.strip()
76
+
77
+ normalized = cleaned.lower().replace(" ", "_").replace("-", "_")
78
+ base_name = re.sub(r'\s*\([^)]*\)\s*$', '', cleaned).strip()
79
+ base_normalized = base_name.lower().replace(" ", "_").replace("-", "_")
80
+
81
+ if cleaned in mapping:
82
+ entry = mapping[cleaned]
83
+ return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}
84
+
85
+ for key, entry in mapping.items():
86
+ if key.lower() == cleaned.lower():
87
+ return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}
88
+
89
+ for key, entry in mapping.items():
90
+ aliases = entry.get("aliases", [])
91
+ normalized_aliases = [a.lower().replace(" ", "_").replace("-", "_") for a in aliases]
92
+ if normalized in normalized_aliases:
93
+ return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}
94
+
95
+ for key, entry in mapping.items():
96
+ key_normalized = key.lower().replace(" ", "_").replace("-", "_")
97
+ if normalized == key_normalized:
98
+ return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}
99
+
100
+ if base_normalized != normalized:
101
+ for key, entry in mapping.items():
102
+ if key.lower() == base_name.lower():
103
+ return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}
104
+ key_normalized = key.lower().replace(" ", "_").replace("-", "_")
105
+ if base_normalized == key_normalized:
106
+ return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}
107
+
108
+ return None
109
+
110
+
111
+ def fetch_aa_models(api_key: str) -> list[dict[str, Any]]:
112
+ response = requests.get(
113
+ AA_INDEX_URL,
114
+ headers={"x-api-key": api_key},
115
+ timeout=30,
116
+ )
117
+ response.raise_for_status()
118
+ data = response.json()
119
+ return list(data.get("data", []))
120
+
121
+
122
+ def find_model_in_aa(models: list[dict[str, Any]], repo_id: str) -> dict[str, Any] | None:
123
+ model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
124
+ model_name_normalized = model_name.lower().replace("-", " ").replace("_", " ")
125
+
126
+ for model in models:
127
+ aa_name = model.get("name", "").lower().replace("-", " ").replace("_", " ")
128
+ aa_slug = model.get("slug", "").lower().replace("-", " ").replace("_", " ")
129
+ if model_name_normalized in aa_name or model_name_normalized in aa_slug:
130
+ return model
131
+
132
+ return None
133
+
134
+
135
+ def lookup_benchmark_from_aa(
136
+ models: list[dict[str, Any]],
137
+ repo_id: str,
138
+ benchmark_name: str,
139
+ ) -> float | None:
140
+ model = find_model_in_aa(models, repo_id)
141
+ if not model:
142
+ return None
143
+
144
+ evaluations = model.get("evaluations", {})
145
+ benchmark_normalized = benchmark_name.lower().replace(" ", "_").replace("-", "_")
146
+
147
+ for key, value in evaluations.items():
148
+ key_normalized = key.lower().replace(" ", "_").replace("-", "_")
149
+ if benchmark_normalized == key_normalized or benchmark_normalized in key_normalized:
150
+ if value is not None:
151
+ return float(value)
152
+
153
+ return None
154
+
155
+
156
+ def get_all_benchmarks_from_aa(
157
+ models: list[dict[str, Any]],
158
+ repo_id: str,
159
+ ) -> list[dict[str, Any]]:
160
+ model = find_model_in_aa(models, repo_id)
161
+ if not model:
162
+ return []
163
+
164
+ evaluations = model.get("evaluations", {})
165
+ metrics = []
166
+
167
+ for key, value in evaluations.items():
168
+ if value is not None:
169
+ metrics.append({
170
+ "name": key.replace("_", " ").title(),
171
+ "type": key,
172
+ "value": float(value),
173
+ })
174
+
175
+ return metrics
176
+
177
+
178
+ def convert_to_eval_results_format(
179
+ metrics: list[dict[str, Any]],
180
+ source_url: str | None = None,
181
+ source_name: str | None = None,
182
+ source_user: str | None = None,
183
+ ) -> list[dict[str, Any]]:
184
+ mapping = load_benchmark_mapping()
185
+ results = []
186
+ today = date.today().isoformat()
187
+
188
+ for metric in metrics:
189
+ benchmark_name = metric.get("name", "")
190
+ value = metric.get("value")
191
+
192
+ if value is None:
193
+ continue
194
+
195
+ dataset_info = find_benchmark_dataset(benchmark_name, mapping)
196
+ if not dataset_info:
197
+ print(f"Warning: Could not find Hub dataset ID for benchmark '{benchmark_name}'. Skipping.", file=sys.stderr)
198
+ continue
199
+
200
+ entry: dict[str, Any] = {
201
+ "dataset": {"id": dataset_info["dataset_id"]},
202
+ "value": value,
203
+ "date": today,
204
+ }
205
+
206
+ if dataset_info.get("task_id") and dataset_info["task_id"] != "default":
207
+ entry["dataset"]["task_id"] = dataset_info["task_id"]
208
+
209
+ if source_url:
210
+ entry["source"] = {"url": source_url}
211
+ if source_name:
212
+ entry["source"]["name"] = source_name
213
+ if source_user:
214
+ entry["source"]["user"] = source_user
215
+
216
+ results.append(entry)
217
+
218
+ return results
219
+
220
+
221
+ def upload_eval_results(
222
+ repo_id: str,
223
+ results: list[dict[str, Any]],
224
+ filename: str = "evaluations.yaml",
225
+ create_pr: bool = False,
226
+ commit_message: str | None = None,
227
+ ) -> bool:
228
+ import yaml
229
+ from huggingface_hub import HfApi
230
+
231
+ load_env()
232
+ hf_token = os.getenv("HF_TOKEN")
233
+ if not hf_token:
234
+ print("Error: HF_TOKEN environment variable is not set", file=sys.stderr)
235
+ return False
236
+
237
+ api = HfApi(token=hf_token)
238
+ yaml_content = yaml.dump(results, sort_keys=False, allow_unicode=True, default_flow_style=False)
239
+ file_path = f".eval_results/{filename}"
240
+
241
+ if not commit_message:
242
+ model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
243
+ commit_message = f"Add Artificial Analysis evaluation results for {model_name}"
244
+
245
+ pr_description = """## Evaluation Results
246
+
247
+ This PR adds structured evaluation results using the new [`.eval_results/` format](https://huggingface.co/docs/hub/eval-results).
248
+
249
+ **Source:** [Artificial Analysis](https://artificialanalysis.ai)
250
+
251
+ ### What This Enables
252
+
253
+ - **Model Page**: Results appear on the model page with benchmark links
254
+ - **Leaderboards**: Scores are aggregated into benchmark dataset leaderboards
255
+ - **Verification**: Support for cryptographic verification of evaluation runs
256
+
257
+ ---
258
+ *Generated by [community-evals](https://github.com/huggingface/community-evals)*"""
259
+
260
+ try:
261
+ api.upload_file(
262
+ path_or_fileobj=yaml_content.encode("utf-8"),
263
+ path_in_repo=file_path,
264
+ repo_id=repo_id,
265
+ repo_type="model",
266
+ commit_message=commit_message,
267
+ commit_description=pr_description,
268
+ create_pr=create_pr,
269
+ )
270
+
271
+ action = "Pull request created" if create_pr else "Evaluation results uploaded"
272
+ print(f"✓ {action} successfully for {repo_id}")
273
+ print(f" File: {file_path}")
274
+ return True
275
+
276
+ except Exception as e:
277
+ print(f"Error uploading evaluation results: {e}", file=sys.stderr)
278
+ return False
279
+
280
+
281
+ def main() -> None:
282
+ parser = argparse.ArgumentParser(
283
+ description="Import evaluation results from Artificial Analysis API.",
284
+ )
285
+ parser.add_argument("--repo-id", required=True, help="HuggingFace repository ID")
286
+ parser.add_argument("--benchmark", help="Specific benchmark to look up (e.g., HLE, GPQA)")
287
+ parser.add_argument("--value", type=float, help="Manually provide the score (skips AA lookup)")
288
+ parser.add_argument("--all", action="store_true", help="Import all available benchmarks")
289
+ parser.add_argument("--source-user", help="HF username/org for attribution")
290
+ parser.add_argument("--filename", default="artificial_analysis.yaml", help="Output filename")
291
+ parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push")
292
+ parser.add_argument("--apply", action="store_true", help="Apply changes (default is dry run)")
293
+ parser.add_argument("--pretty", action="store_true", help="Pretty-print YAML output")
294
+ parser.add_argument("--verbose", action="store_true", help="Print progress to stderr")
295
+ args = parser.parse_args()
296
+
297
+ load_env()
298
+
299
+ if args.value is not None and args.benchmark:
300
+ metrics = [{"name": args.benchmark, "type": args.benchmark.lower(), "value": args.value}]
301
+ else:
302
+ api_key = os.getenv("AA_API_KEY")
303
+ if not api_key:
304
+ print("Error: AA_API_KEY is required to query Artificial Analysis.", file=sys.stderr)
305
+ sys.exit(1)
306
+
307
+ if args.verbose:
308
+ print("Fetching models from Artificial Analysis...", file=sys.stderr)
309
+
310
+ models = fetch_aa_models(api_key)
311
+
312
+ if args.all:
313
+ metrics = get_all_benchmarks_from_aa(models, args.repo_id)
314
+ if not metrics:
315
+ print(f"No benchmarks found for {args.repo_id} in Artificial Analysis", file=sys.stderr)
316
+ sys.exit(1)
317
+ elif args.benchmark:
318
+ value = lookup_benchmark_from_aa(models, args.repo_id, args.benchmark)
319
+ if value is None:
320
+ print(f"Could not find {args.benchmark} score for {args.repo_id} in Artificial Analysis", file=sys.stderr)
321
+ sys.exit(1)
322
+ print(f"Found: {args.benchmark} = {value}")
323
+ metrics = [{"name": args.benchmark, "type": args.benchmark.lower(), "value": value}]
324
+ else:
325
+ print("Error: Either --benchmark or --all is required", file=sys.stderr)
326
+ sys.exit(1)
327
+
328
+ eval_results = convert_to_eval_results_format(
329
+ metrics=metrics,
330
+ source_url="https://artificialanalysis.ai",
331
+ source_name="Artificial Analysis",
332
+ source_user=args.source_user,
333
+ )
334
+
335
+ if not eval_results:
336
+ print("No benchmarks could be mapped to Hub dataset IDs", file=sys.stderr)
337
+ sys.exit(1)
338
+
339
+ import yaml
340
+ print("\nImported evaluations (.eval_results/ format):")
341
+ print(yaml.dump(eval_results, sort_keys=False, allow_unicode=True, default_flow_style=False))
342
+
343
+ if args.apply or args.create_pr:
344
+ upload_eval_results(
345
+ repo_id=args.repo_id,
346
+ results=eval_results,
347
+ filename=args.filename,
348
+ create_pr=args.create_pr,
349
+ )
350
+
351
+
352
+ if __name__ == "__main__":
353
+ main()
skills/hugging-face-model-trainer/SKILL.md ADDED
@@ -0,0 +1,718 @@
1
+ ---
2
+ name: hugging-face-model-trainer
3
+ description: This skill should be used when users want to train or fine-tune language models using TRL (Transformer Reinforcement Learning) on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO and reward modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, and model persistence. Should be invoked for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.
4
+ license: Complete terms in LICENSE.txt
5
+ ---
6
+
7
+ # TRL Training on Hugging Face Jobs
8
+
9
+ ## Overview
10
+
11
+ Train language models using TRL (Transformer Reinforcement Learning) on fully managed Hugging Face infrastructure. No local GPU setup required—models train on cloud GPUs and results are automatically saved to the Hugging Face Hub.
12
+
13
+ **TRL provides multiple training methods:**
14
+ - **SFT** (Supervised Fine-Tuning) - Standard instruction tuning
15
+ - **DPO** (Direct Preference Optimization) - Alignment from preference data
16
+ - **GRPO** (Group Relative Policy Optimization) - Online RL training
17
+ - **Reward Modeling** - Train reward models for RLHF
18
+
19
+ **For detailed TRL method documentation:**
20
+ ```python
21
+ hf_doc_search("your query", product="trl")
22
+ hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer") # SFT
23
+ hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer") # DPO
24
+ # etc.
25
+ ```
26
+
27
+ **See also:** `references/training_methods.md` for method overviews and selection guidance
28
+
29
+ ## When to Use This Skill
30
+
31
+ Use this skill when users want to:
32
+ - Fine-tune language models on cloud GPUs without local infrastructure
33
+ - Train with TRL methods (SFT, DPO, GRPO, etc.)
34
+ - Run training jobs on Hugging Face Jobs infrastructure
35
+ - Convert trained models to GGUF for local deployment (Ollama, LM Studio, llama.cpp)
36
+ - Ensure trained models are permanently saved to the Hub
37
+ - Use modern workflows with optimized defaults
38
+
39
+ ### When to Use Unsloth
40
+
41
+ Use **Unsloth** (`references/unsloth.md`) instead of standard TRL when:
42
+ - **Limited GPU memory** - Unsloth uses ~60% less VRAM
43
+ - **Speed matters** - Unsloth is ~2x faster
44
+ - Training **large models (>13B)** - memory efficiency is critical
45
+ - Training **Vision-Language Models (VLMs)** - Unsloth has `FastVisionModel` support
46
+
47
+ See `references/unsloth.md` for complete Unsloth documentation and `scripts/unsloth_sft_example.py` for a production-ready training script.
48
+
49
+ ## Key Directives
50
+
51
+ When assisting with training jobs:
52
+
53
+ 1. **ALWAYS use `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})`, NOT bash `trl-jobs` commands. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it. Pass the script content as a string to `hf_jobs()`. If user asks to "train a model", "fine-tune", or similar requests, you MUST create the training script AND submit the job immediately using `hf_jobs()`.
54
+
55
+ 2. **Always include Trackio** - Every training script should include Trackio for real-time monitoring. Use example scripts in `scripts/` as templates.
56
+
57
+ 3. **Provide job details after submission** - After submitting, provide job ID, monitoring URL, estimated time, and note that the user can request status checks later.
58
+
59
+ 4. **Use example scripts as templates** - Reference `scripts/train_sft_example.py`, `scripts/train_dpo_example.py`, etc. as starting points.
60
+
61
+ ## Local Script Dependencies
62
+
63
+ To run scripts locally (like `estimate_cost.py`), install dependencies:
64
+ ```bash
65
+ pip install -r requirements.txt
66
+ ```
67
+
68
+ ## Prerequisites Checklist
69
+
70
+ Before starting any training job, verify:
71
+
72
+ ### ✅ **Account & Authentication**
73
+ - Hugging Face Account with [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require paid plan)
74
+ - Authenticated login: Check with `hf_whoami()`
75
+ - **HF_TOKEN for Hub Push** ⚠️ CRITICAL - Training environment is ephemeral, must push to Hub or ALL training results are lost
76
+ - Token must have write permissions
77
+ - **MUST pass `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config** to make the token available (the `$HF_TOKEN` syntax references your actual token value)
78
79
+
80
+ ### ✅ **Dataset Requirements**
81
+ - Dataset must exist on Hub or be loadable via `datasets.load_dataset()`
82
+ - Format must match training method (SFT: "messages"/text/prompt-completion; DPO: chosen/rejected; GRPO: prompt-only)
83
+ - **ALWAYS validate unknown datasets** before GPU training to prevent format failures (see Dataset Validation section below)
84
+ - Size appropriate for hardware (Demo: 50-100 examples on t4-small; Production: 1K-10K+ on a10g-large/a100-large)
85
+
86
+ ### ⚠️ **Critical Settings**
87
+ - **Timeout must exceed expected training time** - Default 30min is TOO SHORT for most training. Minimum recommended: 1-2 hours. Job fails and loses all progress if timeout is exceeded.
88
+ - **Hub push must be enabled** - Config: `push_to_hub=True`, `hub_model_id="username/model-name"`; Job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
89
+
90
+ ## Asynchronous Job Guidelines
91
+
92
+ **⚠️ IMPORTANT: Training jobs run asynchronously and can take hours**
93
+
94
+ ### Action Required
95
+
96
+ **When user requests training:**
97
+ 1. **Create the training script** with Trackio included (use `scripts/train_sft_example.py` as template)
98
+ 2. **Submit immediately** using `hf_jobs()` MCP tool with script content inline - don't save to file unless user requests
99
+ 3. **Report submission** with job ID, monitoring URL, and estimated time
100
+ 4. **Wait for user** to request status checks - don't poll automatically
101
+
102
+ ### Ground Rules
103
+ - **Jobs run in background** - Submission returns immediately; training continues independently
104
+ - **Initial logs delayed** - Can take 30-60 seconds for logs to appear
105
+ - **User checks status** - Wait for user to request status updates
106
+ - **Avoid polling** - Check logs only on user request; provide monitoring links instead
107
+
108
+ ### After Submission
109
+
110
+ **Provide to user:**
111
+ - ✅ Job ID and monitoring URL
112
+ - ✅ Expected completion time
113
+ - ✅ Trackio dashboard URL
114
+ - ✅ Note that user can request status checks later
115
+
116
+ **Example Response:**
117
+ ```
118
+ ✅ Job submitted successfully!
119
+
120
+ Job ID: abc123xyz
121
+ Monitor: https://huggingface.co/jobs/username/abc123xyz
122
+
123
+ Expected time: ~2 hours
124
+ Estimated cost: ~$10
125
+
126
+ The job is running in the background. Ask me to check status/logs when ready!
127
+ ```
128
+
129
+ ## Quick Start: Three Approaches
130
+
131
+ **💡 Tip for Demos:** For quick demos on smaller GPUs (t4-small), omit `eval_dataset` and `eval_strategy` to save ~40% memory. You'll still see training loss and learning progress.
132
+
133
+ ### Sequence Length Configuration
134
+
135
+ **TRL config classes use `max_length` (not `max_seq_length`)** to control tokenized sequence length:
136
+
137
+ ```python
138
+ # ✅ CORRECT - If you need to set sequence length
139
+ SFTConfig(max_length=512) # Truncate sequences to 512 tokens
140
+ DPOConfig(max_length=2048) # Longer context (2048 tokens)
141
+
142
+ # ❌ WRONG - This parameter doesn't exist
143
+ SFTConfig(max_seq_length=512) # TypeError!
144
+ ```
145
+
146
+ **Default behavior:** `max_length=1024` (truncates from right). This works well for most training.
147
+
148
+ **When to override:**
149
+ - **Longer context**: Set higher (e.g., `max_length=2048`)
150
+ - **Memory constraints**: Set lower (e.g., `max_length=512`)
151
+ - **Vision models**: Set `max_length=None` (prevents cutting image tokens)
152
+
153
+ **Usually you don't need to set this parameter at all** - the examples below use the sensible default.
154
+
155
+ ### Approach 1: UV Scripts (Recommended—Default Choice)
156
+
157
+ UV scripts use PEP 723 inline dependencies for clean, self-contained training. **This is the primary approach for Claude Code.**
158
+
159
+ ```python
160
+ hf_jobs("uv", {
161
+ "script": """
162
+ # /// script
163
+ # dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio"]
164
+ # ///
165
+
166
+ from datasets import load_dataset
167
+ from peft import LoraConfig
168
+ from trl import SFTTrainer, SFTConfig
169
+ import trackio
170
+
171
+ dataset = load_dataset("trl-lib/Capybara", split="train")
172
+
173
+ # Create train/eval split for monitoring
174
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
175
+
176
+ trainer = SFTTrainer(
177
+ model="Qwen/Qwen2.5-0.5B",
178
+ train_dataset=dataset_split["train"],
179
+ eval_dataset=dataset_split["test"],
180
+ peft_config=LoraConfig(r=16, lora_alpha=32),
181
+ args=SFTConfig(
182
+ output_dir="my-model",
183
+ push_to_hub=True,
184
+ hub_model_id="username/my-model",
185
+ num_train_epochs=3,
186
+ eval_strategy="steps",
187
+ eval_steps=50,
188
+ report_to="trackio",
189
+ project="meaningful_prject_name", # project name for the training name (trackio)
190
+ run_name="meaningful_run_name", # descriptive name for the specific training run (trackio)
191
+ )
192
+ )
193
+
194
+ trainer.train()
195
+ trainer.push_to_hub()
196
+ """,
197
+ "flavor": "a10g-large",
198
+ "timeout": "2h",
199
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
200
+ })
201
+ ```
202
+
203
+ **Benefits:** Direct MCP tool usage, clean code, dependencies declared inline (PEP 723), no file saving required, full control
204
+ **When to use:** Default choice for all training tasks in Claude Code, custom training logic, any scenario requiring `hf_jobs()`
205
+
206
+ #### Working with Scripts
207
+
208
+ ⚠️ **Important:** The `script` parameter accepts either inline code (as shown above) OR a URL. **Local file paths do NOT work.**
209
+
210
+ **Why local paths don't work:**
211
+ Jobs run in isolated Docker containers without access to your local filesystem. Scripts must be:
212
+ - Inline code (recommended for custom training)
213
+ - Publicly accessible URLs
214
+ - Private repo URLs (with HF_TOKEN)
215
+
216
+ **Common mistakes:**
217
+ ```python
218
+ # ❌ These will all fail
219
+ hf_jobs("uv", {"script": "train.py"})
220
+ hf_jobs("uv", {"script": "./scripts/train.py"})
221
+ hf_jobs("uv", {"script": "/path/to/train.py"})
222
+ ```
223
+
224
+ **Correct approaches:**
225
+ ```python
226
+ # ✅ Inline code (recommended)
227
+ hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<your code>"})
228
+
229
+ # ✅ From Hugging Face Hub
230
+ hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/train.py"})
231
+
232
+ # ✅ From GitHub
233
+ hf_jobs("uv", {"script": "https://raw.githubusercontent.com/user/repo/main/train.py"})
234
+
235
+ # ✅ From Gist
236
+ hf_jobs("uv", {"script": "https://gist.githubusercontent.com/user/id/raw/train.py"})
237
+ ```
238
+
239
+ **To use local scripts:** Upload to HF Hub first:
240
+ ```bash
241
+ huggingface-cli repo create my-training-scripts --type model
242
+ huggingface-cli upload my-training-scripts ./train.py train.py
243
+ # Use: https://huggingface.co/USERNAME/my-training-scripts/resolve/main/train.py
244
+ ```
245
+
246
+ ### Approach 2: TRL Maintained Scripts (Official Examples)
247
+
248
+ TRL provides battle-tested scripts for all methods. Can be run from URLs:
249
+
250
+ ```python
251
+ hf_jobs("uv", {
252
+ "script": "https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py",
253
+ "script_args": [
254
+ "--model_name_or_path", "Qwen/Qwen2.5-0.5B",
255
+ "--dataset_name", "trl-lib/Capybara",
256
+ "--output_dir", "my-model",
257
+ "--push_to_hub",
258
+ "--hub_model_id", "username/my-model"
259
+ ],
260
+ "flavor": "a10g-large",
261
+ "timeout": "2h",
262
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
263
+ })
264
+ ```
265
+
266
+ **Benefits:** No code to write, maintained by TRL team, production-tested
267
+ **When to use:** Standard TRL training, quick experiments, don't need custom code
268
+ **Available:** Scripts are available from https://github.com/huggingface/trl/tree/main/examples/scripts
269
+
270
+ ### Finding More UV Scripts on Hub
271
+
272
+ The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:
273
+
274
+ ```python
275
+ # Discover available UV script collections
276
+ dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})
277
+
278
+ # Explore a specific collection
279
+ hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
280
+ ```
281
+
282
+ **Popular collections:** ocr, classification, synthetic-data, vllm, dataset-creation
283
+
284
+ ### Approach 3: HF Jobs CLI (Direct Terminal Commands)
285
+
286
+ When the `hf_jobs()` MCP tool is unavailable, use the `hf jobs` CLI directly.
287
+
288
+ **⚠️ CRITICAL: CLI Syntax Rules**
289
+
290
+ ```bash
291
+ # ✅ CORRECT syntax - flags BEFORE script URL
292
+ hf jobs uv run --flavor a10g-large --timeout 2h --secrets HF_TOKEN "https://example.com/train.py"
293
+
294
+ # ❌ WRONG - "run uv" instead of "uv run"
295
+ hf jobs run uv "https://example.com/train.py" --flavor a10g-large
296
+
297
+ # ❌ WRONG - flags AFTER script URL (will be ignored!)
298
+ hf jobs uv run "https://example.com/train.py" --flavor a10g-large
299
+
300
+ # ❌ WRONG - "--secret" instead of "--secrets" (plural)
301
+ hf jobs uv run --secret HF_TOKEN "https://example.com/train.py"
302
+ ```
303
+
304
+ **Key syntax rules:**
305
+ 1. Command order is `hf jobs uv run` (NOT `hf jobs run uv`)
306
+ 2. All flags (`--flavor`, `--timeout`, `--secrets`) must come BEFORE the script URL
307
+ 3. Use `--secrets` (plural), not `--secret`
308
+ 4. Script URL must be the last positional argument
309
+
310
+ **Complete CLI example:**
311
+ ```bash
312
+ hf jobs uv run \
313
+ --flavor a10g-large \
314
+ --timeout 2h \
315
+ --secrets HF_TOKEN \
316
+ "https://huggingface.co/user/repo/resolve/main/train.py"
317
+ ```
318
+
319
+ **Check job status via CLI:**
320
+ ```bash
321
+ hf jobs ps # List all jobs
322
+ hf jobs logs <job-id> # View logs
323
+ hf jobs inspect <job-id> # Job details
324
+ hf jobs cancel <job-id> # Cancel a job
325
+ ```
326
+
327
+ ### Approach 4: TRL Jobs Package (Simplified Training)
328
+
329
+ The `trl-jobs` package provides optimized defaults and one-liner training.
330
+
331
+ ```bash
332
+ # Install
333
+ pip install trl-jobs
334
+
335
+ # Train with SFT (simplest possible)
336
+ trl-jobs sft \
337
+ --model_name Qwen/Qwen2.5-0.5B \
338
+ --dataset_name trl-lib/Capybara
339
+ ```
340
+
341
+ **Benefits:** Pre-configured settings, automatic Trackio integration, automatic Hub push, one-line commands
342
+ **When to use:** User working in terminal directly (not Claude Code context), quick local experimentation
343
+ **Repository:** https://github.com/huggingface/trl-jobs
344
+
345
+ ⚠️ **In Claude Code context, prefer using `hf_jobs()` MCP tool (Approach 1) when available.**
346
+
347
+ ## Hardware Selection
348
+
349
+ | Model Size | Recommended Hardware | Cost (approx/hr) | Use Case |
350
+ |------------|---------------------|------------------|----------|
351
+ | <1B params | `t4-small` | ~$0.75 | Demos and quick tests (omit eval steps) |
352
+ | 1-3B params | `t4-medium`, `l4x1` | ~$1.50-2.50 | Development |
353
+ | 3-7B params | `a10g-small`, `a10g-large` | ~$3.50-5.00 | Production training |
354
+ | 7-13B params | `a10g-large`, `a100-large` | ~$5-10 | Large models (use LoRA) |
355
+ | 13B+ params | `a100-large`, `a10g-largex2` | ~$10-20 | Very large (use LoRA) |
356
+
357
+ **GPU Flavors:** cpu-basic/upgrade/performance/xl, t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8
358
+
359
+ **Guidelines:**
360
+ - Use **LoRA/PEFT** for models >7B to reduce memory
361
+ - Multi-GPU automatically handled by TRL/Accelerate
362
+ - Start with smaller hardware for testing
363
+
364
+ **See:** `references/hardware_guide.md` for detailed specifications
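As a rough sketch, the table above can be turned into a small lookup helper (the function name and exact size thresholds are assumptions; the flavor choices come from the table):

```python
def pick_flavor(params_billions: float) -> str:
    """Suggest a Jobs GPU flavor for a given model size (rough heuristic)."""
    if params_billions < 1:
        return "t4-small"       # demos and quick tests
    if params_billions < 3:
        return "l4x1"           # development
    if params_billions < 13:
        return "a10g-large"     # production; use LoRA above 7B
    return "a100-large"         # very large; use LoRA

print(pick_flavor(0.5))  # t4-small
print(pick_flavor(7))    # a10g-large
```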
365
+
366
+ ## Critical: Saving Results to Hub
367
+
368
+ **⚠️ EPHEMERAL ENVIRONMENT—MUST PUSH TO HUB**
369
+
370
+ The Jobs environment is temporary. All files are deleted when the job ends. If the model isn't pushed to Hub, **ALL TRAINING IS LOST**.
371
+
372
+ ### Required Configuration
373
+
374
+ **In training script/config:**
375
+ ```python
376
+ SFTConfig(
377
+ push_to_hub=True,
378
+ hub_model_id="username/model-name", # MUST specify
379
+ hub_strategy="every_save", # Optional: push checkpoints
380
+ )
381
+ ```
382
+
383
+ **In job submission:**
384
+ ```python
385
+ {
386
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Enables authentication
387
+ }
388
+ ```
389
+
390
+ ### Verification Checklist
391
+
392
+ Before submitting:
393
+ - [ ] `push_to_hub=True` set in config
394
+ - [ ] `hub_model_id` includes username/repo-name
395
+ - [ ] `secrets` parameter includes HF_TOKEN
396
+ - [ ] User has write access to target repo
397
+
398
+ **See:** `references/hub_saving.md` for detailed troubleshooting
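The checklist can also be run mechanically before submission. A minimal sketch (the helper is hypothetical; the dict keys mirror the config and job fields shown above):

```python
def preflight(config: dict, job: dict) -> list[str]:
    """Return a list of problems that would cause training results to be lost."""
    problems = []
    if not config.get("push_to_hub"):
        problems.append("push_to_hub is not True")
    if "/" not in config.get("hub_model_id", ""):
        problems.append("hub_model_id must be 'username/repo-name'")
    if "HF_TOKEN" not in job.get("secrets", {}):
        problems.append("job secrets missing HF_TOKEN")
    return problems

issues = preflight(
    {"push_to_hub": True, "hub_model_id": "username/my-model"},
    {"secrets": {"HF_TOKEN": "$HF_TOKEN"}},
)
print(issues)  # []
```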
399
+
400
+ ## Timeout Management
401
+
402
+ **⚠️ DEFAULT: 30 MINUTES—TOO SHORT FOR TRAINING**
403
+
404
+ ### Setting Timeouts
405
+
406
+ ```python
407
+ {
408
+ "timeout": "2h" # 2 hours (formats: "90m", "2h", "1.5h", or seconds as integer)
409
+ }
410
+ ```
411
+
412
+ ### Timeout Guidelines
413
+
414
+ | Scenario | Recommended | Notes |
415
+ |----------|-------------|-------|
416
+ | Quick demo (50-100 examples) | 10-30 min | Verify setup |
417
+ | Development training | 1-2 hours | Small datasets |
418
+ | Production (3-7B model) | 4-6 hours | Full datasets |
419
+ | Large model with LoRA | 3-6 hours | Depends on dataset |
420
+
421
+ **Always add 20-30% buffer** for model/dataset loading, checkpoint saving, Hub push operations, and network delays.
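The buffer rule can be written out as a tiny helper (a sketch; `timeout_with_buffer` is a hypothetical name, and the `"...m"` string matches the timeout formats listed above):

```python
import math

def timeout_with_buffer(estimated_minutes: int, buffer_percent: int = 30) -> str:
    """Add a safety buffer (default 30%) and round up to whole minutes."""
    total = math.ceil(estimated_minutes * (100 + buffer_percent) / 100)
    return f"{total}m"

print(timeout_with_buffer(90))   # "117m" (90 min + 30% buffer)
print(timeout_with_buffer(45))   # "59m"
```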
422
+
423
+ **On timeout:** Job killed immediately, all unsaved progress lost, must restart from beginning
424
+
425
+ ## Cost Estimation
426
+
427
+ **Offer to estimate cost when planning jobs with known parameters.** Use `scripts/estimate_cost.py`:
428
+
429
+ ```bash
430
+ uv run scripts/estimate_cost.py \
431
+ --model meta-llama/Llama-2-7b-hf \
432
+ --dataset trl-lib/Capybara \
433
+ --hardware a10g-large \
434
+ --dataset-size 16000 \
435
+ --epochs 3
436
+ ```
437
+
438
+ Output includes estimated time, cost, recommended timeout (with buffer), and optimization suggestions.
439
+
440
+ **When to offer:** User planning a job, asks about cost/time, choosing hardware, job will run >1 hour or cost >$5
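For a back-of-envelope figure without running the script, the arithmetic is simply hourly rate x estimated hours x timeout buffer (a sketch; the rate comes from the hardware table above and is approximate):

```python
def rough_cost(hourly_rate_usd: float, estimated_hours: float, buffer: float = 0.3) -> float:
    """Back-of-envelope job cost, including the recommended timeout buffer."""
    return round(hourly_rate_usd * estimated_hours * (1 + buffer), 2)

# a10g-large at ~$5/hr for an estimated 2-hour run
print(rough_cost(5.0, 2.0))  # 13.0
```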
441
+
442
+ ## Example Training Scripts
443
+
444
+ **Production-ready templates with all best practices:**
445
+
446
+ Use these scripts as starting points:
447
+
448
+ - **`scripts/train_sft_example.py`** - Complete SFT training with Trackio, LoRA, checkpoints
449
+ - **`scripts/train_dpo_example.py`** - DPO training for preference learning
450
+ - **`scripts/train_grpo_example.py`** - GRPO training for online RL
451
+
452
+ These scripts demonstrate proper Hub saving, Trackio integration, checkpoint management, and optimized parameters. Pass their content inline to `hf_jobs()` or use as templates for custom scripts.
453
+
454
+ ## Monitoring and Tracking
455
+
456
+ **Trackio** provides real-time metrics visualization. See `references/trackio_guide.md` for complete setup guide.
457
+
458
+ **Key points:**
459
+ - Add `trackio` to dependencies
460
+ - Configure trainer with `report_to="trackio"` and `run_name="meaningful_name"`
461
+
462
+ ### Trackio Configuration Defaults
463
+
464
+ **Use sensible defaults unless user specifies otherwise.** When generating training scripts with Trackio:
465
+
466
+ **Default Configuration:**
467
+ - **Space ID**: `{username}/trackio` (use "trackio" as default space name)
468
+ - **Run naming**: Unless otherwise specified, name the run in a way the user will recognize (e.g., descriptive of the task, model, or purpose)
469
+ - **Config**: Keep minimal - only include hyperparameters and model/dataset info
470
+ - **Project Name**: Use a Project Name to associate runs with a particular Project
471
+
472
+ **User overrides:** If user requests specific trackio configuration (custom space, run naming, grouping, or additional config), apply their preferences instead of defaults.
473
+
474
+
475
+ Keeping these settings in the training config (rather than hardcoding them per job) is useful for managing multiple jobs with the same configuration and for keeping training scripts portable.
476
+
477
+ See `references/trackio_guide.md` for complete documentation including grouping runs for experiments.
478
+
479
+ ### Check Job Status
480
+
481
+ ```python
482
+ # List all jobs
483
+ hf_jobs("ps")
484
+
485
+ # Inspect specific job
486
+ hf_jobs("inspect", {"job_id": "your-job-id"})
487
+
488
+ # View logs
489
+ hf_jobs("logs", {"job_id": "your-job-id"})
490
+ ```
491
+
492
+ **Remember:** Wait for user to request status checks. Avoid polling repeatedly.
493
+
494
+ ## Dataset Validation
495
+
496
+ **Validate dataset format BEFORE launching GPU training to prevent the #1 cause of training failures: format mismatches.**
497
+
498
+ ### Why Validate
499
+
500
+ - 50%+ of training failures are due to dataset format issues
501
+ - DPO especially strict: requires exact column names (`prompt`, `chosen`, `rejected`)
502
+ - Failed GPU jobs waste $1-10 and 30-60 minutes
503
+ - Validation on CPU costs ~$0.01 and takes <1 minute
504
+
505
+ ### When to Validate
506
+
507
+ **ALWAYS validate for:**
508
+ - Unknown or custom datasets
509
+ - DPO training (CRITICAL - 90% of datasets need mapping)
510
+ - Any dataset not explicitly TRL-compatible
511
+
512
+ **Skip validation for known TRL datasets:**
513
+ - `trl-lib/ultrachat_200k`, `trl-lib/Capybara`, `HuggingFaceH4/ultrachat_200k`, etc.
514
+
515
+ ### Usage
516
+
517
+ ```python
518
+ hf_jobs("uv", {
519
+ "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
520
+ "script_args": ["--dataset", "username/dataset-name", "--split", "train"]
521
+ })
522
+ ```
523
+
524
+ The script is fast and usually completes in under a minute.
525
+
526
+ ### Reading Results
527
+
528
+ The output shows compatibility for each training method:
529
+
530
+ - **`✓ READY`** - Dataset is compatible, use directly
531
+ - **`✗ NEEDS MAPPING`** - Compatible but needs preprocessing (mapping code provided)
532
+ - **`✗ INCOMPATIBLE`** - Cannot be used for this method
533
+
534
+ When mapping is needed, the output includes a **"MAPPING CODE"** section with copy-paste ready Python code.
535
+
536
+ ### Example Workflow
537
+
538
+ ```python
539
+ # 1. Inspect dataset (costs ~$0.01, <1 min on CPU)
540
+ hf_jobs("uv", {
541
+ "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
542
+ "script_args": ["--dataset", "argilla/distilabel-math-preference-dpo", "--split", "train"]
543
+ })
544
+
545
+ # 2. Check output markers:
546
+ # ✓ READY → proceed with training
547
+ # ✗ NEEDS MAPPING → apply mapping code below
548
+ # ✗ INCOMPATIBLE → choose different method/dataset
549
+
550
+ # 3. If mapping needed, apply before training:
551
+ def format_for_dpo(example):
552
+ return {
553
+ 'prompt': example['instruction'],
554
+ 'chosen': example['chosen_response'],
555
+ 'rejected': example['rejected_response'],
556
+ }
557
+ dataset = dataset.map(format_for_dpo, remove_columns=dataset.column_names)
558
+
559
+ # 4. Launch training job with confidence
560
+ ```
561
+
562
+ ### Common Scenario: DPO Format Mismatch
563
+
564
+ Most DPO datasets use non-standard column names. Example:
565
+
566
+ ```
567
+ Dataset has: instruction, chosen_response, rejected_response
568
+ DPO expects: prompt, chosen, rejected
569
+ ```
570
+
571
+ The validator detects this and provides exact mapping code to fix it.
572
+
573
+ ## Converting Models to GGUF
574
+
575
+ After training, convert models to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.
576
+
577
+ **What is GGUF:**
578
+ - Optimized for CPU/GPU inference with llama.cpp
579
+ - Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
580
+ - Compatible with Ollama, LM Studio, Jan, GPT4All, llama.cpp
581
+ - Typically 2-8GB for 7B models (vs 14GB unquantized)
582
+
583
+ **When to convert:**
584
+ - Running models locally with Ollama or LM Studio
585
+ - Reducing model size with quantization
586
+ - Deploying to edge devices
587
+ - Sharing models for local-first use
588
+
589
+ **See:** `references/gguf_conversion.md` for complete conversion guide, including production-ready conversion script, quantization options, hardware requirements, usage examples, and troubleshooting.
590
+
591
+ **Quick conversion:**
592
+ ```python
593
+ hf_jobs("uv", {
594
+ "script": "<see references/gguf_conversion.md for complete script>",
595
+ "flavor": "a10g-large",
596
+ "timeout": "45m",
597
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
598
+ "env": {
599
+ "ADAPTER_MODEL": "username/my-finetuned-model",
600
+ "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
601
+ "OUTPUT_REPO": "username/my-model-gguf"
602
+ }
603
+ })
604
+ ```
605
+
606
+ ## Common Training Patterns
607
+
608
+ See `references/training_patterns.md` for detailed examples including:
609
+ - Quick demo (5-10 minutes)
610
+ - Production with checkpoints
611
+ - Multi-GPU training
612
+ - DPO training (preference learning)
613
+ - GRPO training (online RL)
614
+
615
+ ## Common Failure Modes
616
+
617
+ ### Out of Memory (OOM)
618
+
619
+ **Fix (try in order):**
620
+ 1. Reduce batch size: set `per_device_train_batch_size=1` and raise `gradient_accumulation_steps` (e.g., 8). Effective batch size is `per_device_train_batch_size` x `gradient_accumulation_steps`; for best performance keep it close to 128.
621
+ 2. Enable: `gradient_checkpointing=True`
622
+ 3. Upgrade hardware: t4-small → l4x1, a10g-small → a10g-large etc.
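The effective-batch-size arithmetic from step 1 can be sketched as a small helper (hypothetical function; the target of ~128 comes from the guideline above):

```python
def grad_accum_steps(target_effective: int, per_device: int, num_gpus: int = 1) -> int:
    """Choose gradient_accumulation_steps so that
    per_device * num_gpus * accum ~= target_effective."""
    return max(1, round(target_effective / (per_device * num_gpus)))

# per_device_train_batch_size=1 on a single GPU, targeting an effective batch of 128
print(grad_accum_steps(128, 1))              # 128
print(grad_accum_steps(128, 4, num_gpus=2))  # 16
```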
623
+
624
+ ### Dataset Misformatted
625
+
626
+ **Fix:**
627
+ 1. Validate first with dataset inspector:
628
+ ```bash
629
+ uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
630
+ --dataset name --split train
631
+ ```
632
+ 2. Check output for compatibility markers (✓ READY, ✗ NEEDS MAPPING, ✗ INCOMPATIBLE)
633
+ 3. Apply mapping code from inspector output if needed
634
+
635
+ ### Job Timeout
636
+
637
+ **Fix:**
638
+ 1. Check logs for actual runtime: `hf_jobs("logs", {"job_id": "..."})`
639
+ 2. Increase timeout with buffer: `"timeout": "3h"` (add 30% to estimated time)
640
+ 3. Or reduce training: lower `num_train_epochs`, use smaller dataset, enable `max_steps`
641
+ 4. Save checkpoints: `save_strategy="steps"`, `save_steps=500`, `hub_strategy="every_save"`
642
+
643
+ **Note:** Default 30min is insufficient for real training. Minimum 1-2 hours.
644
+
645
+ ### Hub Push Failures
646
+
647
+ **Fix:**
648
+ 1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
649
+ 2. Add to config: `push_to_hub=True`, `hub_model_id="username/model-name"`
650
+ 3. Verify auth: `mcp__huggingface__hf_whoami()`
651
+ 4. Check token has write permissions and repo exists (or set `hub_private_repo=True`)
652
+
653
+ ### Missing Dependencies
654
+
655
+ **Fix:**
656
+ Add to PEP 723 header:
657
+ ```python
658
+ # /// script
659
+ # dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio", "missing-package"]
660
+ # ///
661
+ ```
662
+
663
+ ## Troubleshooting
664
+
665
+ **Common issues:**
666
+ - Job times out → Increase timeout, reduce epochs/dataset, use smaller model/LoRA
667
+ - Model not saved to Hub → Check push_to_hub=True, hub_model_id, secrets=HF_TOKEN
668
+ - Out of Memory (OOM) → Reduce batch size, increase gradient accumulation, enable LoRA, use larger GPU
669
+ - Dataset format error → Validate with dataset inspector (see Dataset Validation section)
670
+ - Import/module errors → Add PEP 723 header with dependencies, verify format
671
+ - Authentication errors → Check `mcp__huggingface__hf_whoami()`, token permissions, secrets parameter
672
+
673
+ **See:** `references/troubleshooting.md` for complete troubleshooting guide
674
+
675
+ ## Resources
676
+
677
+ ### References (In This Skill)
678
+ - `references/training_methods.md` - Overview of SFT, DPO, GRPO, KTO, PPO, Reward Modeling
679
+ - `references/training_patterns.md` - Common training patterns and examples
680
+ - `references/unsloth.md` - Unsloth for fast VLM training (~2x speed, 60% less VRAM)
681
+ - `references/gguf_conversion.md` - Complete GGUF conversion guide
682
+ - `references/trackio_guide.md` - Trackio monitoring setup
683
+ - `references/hardware_guide.md` - Hardware specs and selection
684
+ - `references/hub_saving.md` - Hub authentication troubleshooting
685
+ - `references/troubleshooting.md` - Common issues and solutions
686
+
687
+ ### Scripts (In This Skill)
688
+ - `scripts/train_sft_example.py` - Production SFT template
689
+ - `scripts/train_dpo_example.py` - Production DPO template
690
+ - `scripts/train_grpo_example.py` - Production GRPO template
691
+ - `scripts/unsloth_sft_example.py` - Unsloth text LLM training template (faster, less VRAM)
692
+ - `scripts/estimate_cost.py` - Estimate time and cost (offer when appropriate)
693
+ - `scripts/convert_to_gguf.py` - Complete GGUF conversion script
694
+
695
+ ### External Scripts
696
+ - [Dataset Inspector](https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py) - Validate dataset format before training (use via `uv run` or `hf_jobs`)
697
+
698
+ ### External Links
699
+ - [TRL Documentation](https://huggingface.co/docs/trl)
700
+ - [TRL Jobs Training Guide](https://huggingface.co/docs/trl/en/jobs_training)
701
+ - [TRL Jobs Package](https://github.com/huggingface/trl-jobs)
702
+ - [HF Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
703
+ - [TRL Example Scripts](https://github.com/huggingface/trl/tree/main/examples/scripts)
704
+ - [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/)
705
+ - [UV Scripts Organization](https://huggingface.co/uv-scripts)
706
+
707
+ ## Key Takeaways
708
+
709
+ 1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless user requests
710
+ 2. **Jobs are asynchronous** - Don't wait/poll; let user check when ready
711
+ 3. **Always set timeout** - Default 30 min is insufficient; minimum 1-2 hours recommended
712
+ 4. **Always enable Hub push** - Environment is ephemeral; without push, all results lost
713
+ 5. **Include Trackio** - Use example scripts as templates for real-time monitoring
714
+ 6. **Offer cost estimation** - When parameters are known, use `scripts/estimate_cost.py`
715
+ 7. **Use UV scripts (Approach 1)** - Default to `hf_jobs("uv", {...})` with inline scripts; TRL maintained scripts for standard training; avoid bash `trl-jobs` commands in Claude Code
716
+ 8. **Use hf_doc_fetch/hf_doc_search** for latest TRL documentation
717
+ 9. **Validate dataset format** before training with dataset inspector (see Dataset Validation section)
718
+ 10. **Choose appropriate hardware** for model size; use LoRA for models >7B
skills/hugging-face-model-trainer/references/gguf_conversion.md ADDED
@@ -0,0 +1,296 @@
1
+ # GGUF Conversion Guide
2
+
3
+ After training models with TRL on Hugging Face Jobs, convert them to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.
4
+
5
+ **This guide provides production-ready, tested code based on successful conversions.** All critical dependencies and build steps are included.
6
+
7
+ ## What is GGUF?
8
+
9
+ **GGUF** (GPT-Generated Unified Format):
10
+ - Optimized format for CPU/GPU inference with llama.cpp
11
+ - Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
12
+ - Compatible with: Ollama, LM Studio, Jan, GPT4All, llama.cpp
13
+ - Typically 2-8GB for 7B models (vs 14GB unquantized)
14
+
15
+ ## When to Convert to GGUF
16
+
17
+ **Convert when:**
18
+ - Running models locally with Ollama or LM Studio
19
+ - Using CPU-optimized inference
20
+ - Reducing model size with quantization
21
+ - Deploying to edge devices
22
+ - Sharing models for local-first use
23
+
24
+ ## Critical Success Factors
25
+
26
+ Based on production testing, these are **essential** for reliable conversion:
27
+
28
+ ### 1. ✅ Install Build Tools FIRST
29
+ **Before cloning llama.cpp**, install build dependencies:
30
+ ```python
31
+ subprocess.run(["apt-get", "update", "-qq"], check=True, capture_output=True)
32
+ subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True, capture_output=True)
33
+ ```
34
+
35
+ **Why:** The quantization tool requires gcc and cmake. Installing after cloning doesn't help.
36
+
37
+ ### 2. ✅ Use CMake (Not Make)
38
+ **Build the quantize tool with CMake:**
39
+ ```python
40
+ # Create build directory
41
+ os.makedirs("/tmp/llama.cpp/build", exist_ok=True)
42
+
43
+ # Configure
44
+ subprocess.run([
45
+ "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
46
+ "-DGGML_CUDA=OFF" # Faster build, CUDA not needed for quantization
47
+ ], check=True, capture_output=True, text=True)
48
+
49
+ # Build
50
+ subprocess.run([
51
+ "cmake", "--build", "/tmp/llama.cpp/build",
52
+ "--target", "llama-quantize", "-j", "4"
53
+ ], check=True, capture_output=True, text=True)
54
+
55
+ # Binary path
56
+ quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
57
+ ```
58
+
59
+ **Why:** CMake is more reliable than `make` and produces consistent binary paths.
60
+
61
+ ### 3. ✅ Include All Dependencies
62
+ **PEP 723 header must include:**
63
+ ```python
64
+ # /// script
65
+ # dependencies = [
66
+ # "transformers>=4.36.0",
67
+ # "peft>=0.7.0",
68
+ # "torch>=2.0.0",
69
+ # "accelerate>=0.24.0",
70
+ # "huggingface_hub>=0.20.0",
71
+ # "sentencepiece>=0.1.99", # Required for tokenizer
72
+ # "protobuf>=3.20.0", # Required for tokenizer
73
+ # "numpy",
74
+ # "gguf",
75
+ # ]
76
+ # ///
77
+ ```
78
+
79
+ **Why:** `sentencepiece` and `protobuf` are critical for tokenizer conversion. Missing them causes silent failures.
80
+
81
+ ### 4. ✅ Verify Names Before Use
82
+ **Always verify repos exist:**
83
+ ```python
84
+ # Before submitting job, verify:
85
+ hub_repo_details([ADAPTER_MODEL], repo_type="model")
86
+ hub_repo_details([BASE_MODEL], repo_type="model")
87
+ ```
88
+
89
+ **Why:** Non-existent dataset/model names cause job failures that could be caught in seconds.
90
+
91
+ ## Complete Conversion Script
92
+
93
+ See `scripts/convert_to_gguf.py` for the complete, production-ready script.
94
+
95
+ **Key features:**
96
+ - ✅ All dependencies in PEP 723 header
97
+ - ✅ Build tools installed automatically
98
+ - ✅ CMake build process (reliable)
99
+ - ✅ Comprehensive error handling
100
+ - ✅ Environment variable configuration
101
+ - ✅ Automatic README generation
102
+
103
+ ## Quick Conversion Job
104
+
105
+ ```python
106
+ # Before submitting: VERIFY MODELS EXIST
107
+ hub_repo_details(["username/my-finetuned-model"], repo_type="model")
108
+ hub_repo_details(["Qwen/Qwen2.5-0.5B"], repo_type="model")
109
+
110
+ # Submit conversion job
111
+ hf_jobs("uv", {
112
+ "script": open("trl/scripts/convert_to_gguf.py").read(), # Or inline the script
113
+ "flavor": "a10g-large",
114
+ "timeout": "45m",
115
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"},
116
+ "env": {
117
+ "ADAPTER_MODEL": "username/my-finetuned-model",
118
+ "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
119
+ "OUTPUT_REPO": "username/my-model-gguf",
120
+ "HF_USERNAME": "username" # Optional, for README
121
+ }
122
+ })
123
+ ```
124
+
125
+ ## Conversion Process
126
+
127
+ The script performs these steps:
128
+
129
+ 1. **Load and Merge** - Load base model and LoRA adapter, merge them
130
+ 2. **Install Build Tools** - Install gcc, cmake (CRITICAL: before cloning llama.cpp)
131
+ 3. **Setup llama.cpp** - Clone repo, install Python dependencies
132
+ 4. **Convert to GGUF** - Create FP16 GGUF using llama.cpp converter
133
+ 5. **Build Quantize Tool** - Use CMake to build `llama-quantize`
134
+ 6. **Quantize** - Create Q4_K_M, Q5_K_M, Q8_0 versions
135
+ 7. **Upload** - Upload all versions + README to Hub
136
+
137
+ ## Quantization Options
138
+
139
+ Common quantization formats (from smallest to largest):
140
+
141
+ | Format | Size (0.5B model) | Quality | Use Case |
142
+ |--------|------|---------|----------|
143
+ | **Q4_K_M** | ~300MB | Good | **Recommended** - best balance of size/quality |
144
+ | **Q5_K_M** | ~350MB | Better | Higher quality, slightly larger |
145
+ | **Q8_0** | ~500MB | Very High | Near-original quality |
146
+ | **F16** | ~1GB | Original | Full precision, largest file |
147
+
148
+ **Recommendation:** Create Q4_K_M, Q5_K_M, and Q8_0 versions to give users options.
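The sizes in the table follow from simple arithmetic: file size is roughly parameters x bits-per-weight / 8. A sketch (the bits-per-weight values are approximations, and real GGUF files add some metadata overhead):

```python
def gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough quantized file size in GB: params * bits / 8, ignoring metadata."""
    return round(params_billions * 1e9 * bits_per_weight / 8 / 1e9, 2)

# 7B model at ~4.5 bits/weight (roughly Q4_K_M) vs 16-bit (F16)
print(gguf_size_gb(7, 4.5))  # 3.94
print(gguf_size_gb(7, 16))   # 14.0
```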
149
+
150
+ ## Hardware Requirements
151
+
152
+ **For conversion:**
153
+ - Small models (<1B): CPU-basic works, but slow
154
+ - Medium models (1-7B): a10g-large recommended
155
+ - Large models (7B+): a10g-large or a100-large
156
+
157
+ **Time estimates:**
158
+ - 0.5B model: ~15-25 minutes on A10G
159
+ - 3B model: ~30-45 minutes on A10G
160
+ - 7B model: ~45-60 minutes on A10G
161
+
162
+ ## Using GGUF Models
163
+
164
+ **GGUF models work on both CPU and GPU.** They're optimized for CPU inference but can also leverage GPU acceleration when available.
165
+
166
+ ### With Ollama (auto-detects GPU)
167
+ ```bash
168
+ # Download GGUF
169
+ huggingface-cli download username/my-model-gguf model-q4_k_m.gguf
170
+
171
+ # Create Modelfile
172
+ echo "FROM ./model-q4_k_m.gguf" > Modelfile
173
+
174
+ # Create and run (uses GPU automatically if available)
175
+ ollama create my-model -f Modelfile
176
+ ollama run my-model
177
+ ```
178
+
179
+ ### With llama.cpp
180
+ ```bash
181
+ # CPU only
182
+ ./llama-cli -m model-q4_k_m.gguf -p "Your prompt"
183
+
184
+ # With GPU acceleration (offload 32 layers to GPU)
185
+ ./llama-cli -m model-q4_k_m.gguf -ngl 32 -p "Your prompt"
186
+ ```
187
+
188
+ ### With LM Studio
189
+ 1. Download the `.gguf` file
190
+ 2. Import into LM Studio
191
+ 3. Start chatting
192
+
193
+ ## Best Practices
194
+
195
+ ### ✅ DO:
196
+ 1. **Verify repos exist** before submitting jobs (use `hub_repo_details`)
197
+ 2. **Install build tools FIRST** before cloning llama.cpp
198
+ 3. **Use CMake** for building quantize tool (not make)
199
+ 4. **Include all dependencies** in PEP 723 header (especially sentencepiece, protobuf)
200
+ 5. **Create multiple quantizations** - Give users choice
201
+ 6. **Test on known models** before production use
202
+ 7. **Use A10G GPU** for faster conversion
203
+
204
+ ### ❌ DON'T:
205
+ 1. **Assume repos exist** - Always verify with hub tools
206
+ 2. **Use make** instead of CMake - Less reliable
207
+ 3. **Remove dependencies** to "simplify" - They're all needed
208
+ 4. **Skip build tools** - Quantization will fail silently
209
+ 5. **Use default paths** - CMake puts binaries in build/bin/
210
+
211
+ ## Common Issues
212
+
213
+ ### Out of memory during merge
214
+ **Fix:**
215
+ - Use larger GPU (a10g-large or a100-large)
216
+ - Ensure `device_map="auto"` for automatic placement
217
+ - Use `dtype=torch.float16` or `torch.bfloat16`
218
+
219
+ ### Conversion fails with architecture error
220
+ **Fix:**
221
+ - Ensure llama.cpp supports the model architecture
222
+ - Check for standard architecture (Qwen, Llama, Mistral, etc.)
223
+ - Re-clone the latest llama.cpp: `git clone --depth 1 https://github.com/ggerganov/llama.cpp.git`
224
+ - Check llama.cpp documentation for model support
225
+
226
+ ### Quantization fails
227
+ **Fix:**
228
+ - Verify build tools installed: `apt-get install build-essential cmake`
229
+ - Use CMake (not make) to build quantize tool
230
+ - Check binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
231
+ - Verify FP16 GGUF exists before quantizing
232
+
233
+ ### Missing sentencepiece error
234
+ **Fix:**
235
+ - Add to PEP 723 header: `"sentencepiece>=0.1.99", "protobuf>=3.20.0"`
236
+ - Don't remove dependencies to "simplify" - all are required
237
+
238
+ ### Upload fails or times out
239
+ **Fix:**
240
+ - Large models (>2GB) need longer timeout: `"timeout": "1h"`
241
+ - Upload quantized versions separately if needed
242
+ - Check network/Hub status
243
+
244
+ ## Lessons Learned
245
+
246
+ These are from production testing and real failures:
247
+
248
+ ### 1. Always Verify Before Use
249
+ **Lesson:** Don't assume repos/datasets exist. Check first.
250
+ ```python
251
+ # BEFORE submitting job
252
+ hub_repo_details(["trl-lib/argilla-dpo-mix-7k"], repo_type="dataset") # Would catch error
253
+ ```
254
+ **Prevented failures:** Non-existent dataset names, typos in model names
255
+
256
+ ### 2. Prioritize Reliability Over Performance
257
+ **Lesson:** Default to what's most likely to succeed.
258
+ - Use CMake (not make) - more reliable
259
+ - Disable CUDA in build - faster, not needed
260
+ - Include all dependencies - don't "simplify"
261
+
262
+ **Prevented failures:** Build failures, missing binaries
263
+
264
+ ### 3. Create Atomic, Self-Contained Scripts
265
+ **Lesson:** Don't remove dependencies or steps. Scripts should work as a unit.
266
+ - All dependencies in PEP 723 header
267
+ - All build steps included
268
+ - Clear error messages
269
+
270
+ **Prevented failures:** Missing tokenizer libraries, build tool failures
271
+
272
+ ## References
273
+
274
+ **In this skill:**
275
+ - `scripts/convert_to_gguf.py` - Complete, production-ready script
276
+
277
+ **External:**
278
+ - [llama.cpp Repository](https://github.com/ggerganov/llama.cpp)
279
+ - [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
280
+ - [Ollama Documentation](https://ollama.ai)
281
+ - [LM Studio](https://lmstudio.ai)
282
+
283
+ ## Summary
284
+
285
+ **Critical checklist for GGUF conversion:**
286
+ - [ ] Verify adapter and base models exist on Hub
287
+ - [ ] Use production script from `scripts/convert_to_gguf.py`
288
+ - [ ] All dependencies in PEP 723 header (including sentencepiece, protobuf)
289
+ - [ ] Build tools installed before cloning llama.cpp
290
+ - [ ] CMake used for building quantize tool (not make)
291
+ - [ ] Correct binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
292
+ - [ ] A10G GPU selected for reasonable conversion time
293
+ - [ ] Timeout set to 45m minimum
294
+ - [ ] HF_TOKEN in secrets for Hub upload
295
+
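Putting the checklist together, a submission might look like the following sketch (the model-specific configuration is a placeholder; the real interface is defined by `scripts/convert_to_gguf.py`):

```python
# Sketch of a checklist-compliant job submission.
hf_jobs("uv", {
    "script": "scripts/convert_to_gguf.py",
    "flavor": "a10g-large",                # reasonable conversion time
    "timeout": "45m",                      # minimum recommended
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},  # required for Hub upload
})
```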
296
+ **The script in `scripts/convert_to_gguf.py` incorporates all these lessons and has been tested successfully in production.**
skills/hugging-face-model-trainer/references/hardware_guide.md ADDED
@@ -0,0 +1,283 @@
1
+ # Hardware Selection Guide
2
+
3
+ Choosing the right hardware (flavor) is critical for cost-effective training.
4
+
5
+ ## Available Hardware
6
+
7
+ ### CPU
8
+ - `cpu-basic` - Basic CPU, testing only
9
+ - `cpu-upgrade` - Enhanced CPU
10
+
11
+ **Use cases:** Dataset validation, preprocessing, testing scripts
12
+ **Not recommended for training:** Too slow for any meaningful training
13
+
14
+ ### GPU Options
15
+
16
+ | Flavor | GPU | Memory | Use Case | Cost/hour |
17
+ |--------|-----|--------|----------|-----------|
18
+ | `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
19
+ | `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
20
+ | `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
21
+ | `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
22
+ | `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
23
+ | `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
24
+ | `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
25
+ | `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
26
+ | `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
27
+
28
+ ### TPU Options
29
+
30
+ | Flavor | Type | Use Case |
31
+ |--------|------|----------|
32
+ | `v5e-1x1` | TPU v5e | Small TPU workloads |
33
+ | `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
34
+ | `v5e-2x4` | 8x TPU v5e | Large TPU workloads |
35
+
36
+ **Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.
37
+
38
+ ## Selection Guidelines
39
+
40
+ ### By Model Size
41
+
42
+ **Tiny Models (<1B parameters)**
43
+ - **Recommended:** `t4-small`
44
+ - **Example:** Qwen2.5-0.5B, TinyLlama
45
+ - **Batch size:** 4-8
46
+ - **Training time:** 1-2 hours for 1K examples
47
+
48
+ **Small Models (1-3B parameters)**
49
+ - **Recommended:** `t4-medium` or `a10g-small`
50
+ - **Example:** Qwen2.5-1.5B, Phi-2
51
+ - **Batch size:** 2-4
52
+ - **Training time:** 2-4 hours for 10K examples
53
+
54
+ **Medium Models (3-7B parameters)**
55
+ - **Recommended:** `a10g-small` or `a10g-large`
56
+ - **Example:** Qwen2.5-7B, Mistral-7B
57
+ - **Batch size:** 1-2 (or LoRA with 4-8)
58
+ - **Training time:** 4-8 hours for 10K examples
59
+
60
+ **Large Models (7-13B parameters)**
61
+ - **Recommended:** `a10g-large` or `a100-large`
62
+ - **Example:** Llama-3-8B, Llama-2-13B (with LoRA)
63
+ - **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
64
+ - **Training time:** 6-12 hours for 10K examples
65
+ - **Note:** Always use LoRA/PEFT
66
+
67
+ **Very Large Models (13B+ parameters)**
68
+ - **Recommended:** `a100-large` with LoRA
69
+ - **Example:** Llama-2-13B, Llama-3-70B (LoRA only)
70
+ - **Batch size:** 1-2 with LoRA
71
+ - **Training time:** 8-24 hours for 10K examples
72
+ - **Note:** Full fine-tuning not feasible, use LoRA/PEFT
73
+
74
+ ### By Budget
75
+
76
+ **Minimal Budget (<$5 total)**
77
+ - Use `t4-small`
78
+ - Train on subset of data (100-500 examples)
79
+ - Limit to 1-2 epochs
80
+ - Use small model (<1B)
81
+
82
+ **Small Budget ($5-20)**
83
+ - Use `t4-medium` or `a10g-small`
84
+ - Train on 1K-5K examples
85
+ - 2-3 epochs
86
+ - Model up to 3B parameters
87
+
88
+ **Medium Budget ($20-50)**
89
+ - Use `a10g-small` or `a10g-large`
90
+ - Train on 5K-20K examples
91
+ - 3-5 epochs
92
+ - Model up to 7B parameters
93
+
94
+ **Large Budget ($50-200)**
95
+ - Use `a10g-large` or `a100-large`
96
+ - Full dataset training
97
+ - Multiple epochs
98
+ - Model up to 13B parameters with LoRA
99
+
100
+ ### By Training Type
101
+
102
+ **Quick Demo/Experiment**
103
+ - `t4-small`
104
+ - 50-100 examples
105
+ - 5-10 steps
106
+ - ~10-15 minutes
107
+
108
+ **Development/Iteration**
109
+ - `t4-medium` or `a10g-small`
110
+ - 1K examples
111
+ - 1 epoch
112
+ - ~30-60 minutes
113
+
114
+ **Production Training**
115
+ - `a10g-large` or `a100-large`
116
+ - Full dataset
117
+ - 3-5 epochs
118
+ - 4-12 hours
119
+
120
+ **Research/Experimentation**
121
+ - `a100-large`
122
+ - Multiple runs
123
+ - Various hyperparameters
124
+ - Budget for 20-50 hours
125
+
126
+ ## Memory Considerations
127
+
128
+ ### Estimating Memory Requirements
129
+
130
+ **Full fine-tuning:**
131
+ ```
132
+ Memory (GB) ≈ (Model params in billions) × 20
133
+ ```
134
+
135
+ **LoRA fine-tuning:**
136
+ ```
137
+ Memory (GB) ≈ (Model params in billions) × 4
138
+ ```
139
+
140
+ **Examples:**
141
+ - Qwen2.5-0.5B full: ~10GB ✅ fits t4-small
142
+ - Qwen2.5-1.5B full: ~30GB ❌ exceeds most GPUs
143
+ - Qwen2.5-1.5B LoRA: ~6GB ✅ fits t4-small
144
+ - Qwen2.5-7B full: ~140GB ❌ not feasible
145
+ - Qwen2.5-7B LoRA: ~28GB ✅ fits a100-large (40GB)
146
+
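The two rules of thumb above are easy to encode; a minimal sketch (the `fits` helper is illustrative, not a real API):

```python
def estimate_memory_gb(params_billion, lora=False):
    """Rule of thumb from above: ~20 GB per B params full, ~4 GB with LoRA."""
    return params_billion * (4 if lora else 20)

def fits(params_billion, gpu_memory_gb, lora=False):
    """Does the rough estimate fit in a given GPU's memory?"""
    return estimate_memory_gb(params_billion, lora=lora) <= gpu_memory_gb

print(estimate_memory_gb(0.5))             # 10.0 -> fits a 16GB T4
print(estimate_memory_gb(1.5, lora=True))  # 6.0
print(fits(7, 40, lora=True))              # True: ~28GB fits a 40GB A100
```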
147
+ ### Memory Optimization
148
+
149
+ If hitting memory limits:
150
+
151
+ 1. **Use LoRA/PEFT**
152
+ ```python
153
+ peft_config=LoraConfig(r=16, lora_alpha=32)
154
+ ```
155
+
156
+ 2. **Reduce batch size**
157
+ ```python
158
+ per_device_train_batch_size=1
159
+ ```
160
+
161
+ 3. **Increase gradient accumulation**
162
+ ```python
163
+ gradient_accumulation_steps=8 # Effective batch size = 1×8
164
+ ```
165
+
166
+ 4. **Enable gradient checkpointing**
167
+ ```python
168
+ gradient_checkpointing=True
169
+ ```
170
+
171
+ 5. **Use mixed precision**
172
+ ```python
173
+ bf16=True # or fp16=True
174
+ ```
175
+
176
+ 6. **Upgrade to larger GPU**
177
+ - t4 → a10g → a100
178
+
179
+ ## Cost Estimation
180
+
181
+ ### Formula
182
+
183
+ ```
184
+ Total Cost = (Hours of training) × (Cost per hour)
185
+ ```
186
+
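The formula is trivial but worth scripting when comparing flavors; a quick sketch using the rough hourly rates from the hardware table above:

```python
def training_cost(hours, rate_per_hour):
    """Total Cost = hours of training x cost per hour (rounded to cents)."""
    return round(hours * rate_per_hour, 2)

print(training_cost(0.25, 0.75))  # 0.19 -- 15-minute demo on t4-small
print(training_cost(6, 5.00))     # 30.0 -- 6h production run on a10g-large
```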
187
+ ### Example Calculations
188
+
189
+ **Quick demo:**
190
+ - Hardware: t4-small ($0.75/hour)
191
+ - Time: 15 minutes (0.25 hours)
192
+ - Cost: $0.19
193
+
194
+ **Development training:**
195
+ - Hardware: a10g-small ($3.50/hour)
196
+ - Time: 2 hours
197
+ - Cost: $7.00
198
+
199
+ **Production training:**
200
+ - Hardware: a10g-large ($5/hour)
201
+ - Time: 6 hours
202
+ - Cost: $30.00
203
+
204
+ **Large model with LoRA:**
205
+ - Hardware: a100-large ($10/hour)
206
+ - Time: 8 hours
207
+ - Cost: $80.00
208
+
209
+ ### Cost Optimization Tips
210
+
211
+ 1. **Start small:** Test on t4-small with subset
212
+ 2. **Use LoRA:** 4-5x cheaper than full fine-tuning
213
+ 3. **Optimize hyperparameters:** Fewer epochs if possible
214
+ 4. **Set appropriate timeout:** Don't waste compute on stalled jobs
215
+ 5. **Use checkpointing:** Resume if job fails
216
+ 6. **Monitor costs:** Check running jobs regularly
217
+
218
+ ## Multi-GPU Training
219
+
220
+ TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.
221
+
222
+ **Multi-GPU flavors:**
223
+ - `l4x4` - 4x L4 GPUs
224
+ - `a10g-largex2` - 2x A10G GPUs
225
+ - `a10g-largex4` - 4x A10G GPUs
226
+
227
+ **When to use:**
228
+ - Models >13B parameters
229
+ - Need faster training (linear speedup)
230
+ - Large datasets (>50K examples)
231
+
232
+ **Example:**
233
+ ```python
234
+ hf_jobs("uv", {
235
+ "script": "train.py",
236
+ "flavor": "a10g-largex2", # 2 GPUs
237
+ "timeout": "4h",
238
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
239
+ })
240
+ ```
241
+
242
+ No code changes needed—TRL/Accelerate handles distribution automatically.
243
+
244
+ ## Choosing Between Options
245
+
246
+ ### a10g vs a100
247
+
248
+ **Choose a10g when:**
249
+ - Model <13B parameters
250
+ - Budget conscious
251
+ - Training time not critical
252
+
253
+ **Choose a100 when:**
254
+ - Model 13B+ parameters
255
+ - Need fastest training
256
+ - Memory requirements high
257
+ - Budget allows
258
+
259
+ ### Single vs Multi-GPU
260
+
261
+ **Choose single GPU when:**
262
+ - Model <7B parameters
263
+ - Budget constrained
264
+ - Simpler debugging
265
+
266
+ **Choose multi-GPU when:**
267
+ - Model >13B parameters
268
+ - Need faster training
269
+ - Large batch sizes required
270
+ - Cost-effective for large jobs
271
+
272
+ ## Quick Reference
273
+
274
+ ```python
275
+ # Model size → Hardware selection
276
+ HARDWARE_MAP = {
277
+ "<1B": "t4-small",
278
+ "1-3B": "a10g-small",
279
+ "3-7B": "a10g-large",
280
+ "7-13B": "a10g-large (LoRA) or a100-large",
281
+ ">13B": "a100-large (LoRA required)"
282
+ }
283
+ ```
skills/hugging-face-model-trainer/references/hub_saving.md ADDED
@@ -0,0 +1,364 @@
1
+ # Saving Training Results to Hugging Face Hub
2
+
3
+ **⚠️ CRITICAL:** Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.
4
+
5
+ ## Why Hub Push is Required
6
+
7
+ When running on Hugging Face Jobs:
8
+ - Environment is temporary
9
+ - All files deleted on job completion
10
+ - No local disk persistence
11
+ - Cannot access results after job ends
12
+
13
+ **Without Hub push, training is completely wasted.**
14
+
15
+ ## Required Configuration
16
+
17
+ ### 1. Training Configuration
18
+
19
+ In your SFTConfig or trainer config:
20
+
21
+ ```python
22
+ SFTConfig(
23
+ push_to_hub=True, # Enable Hub push
24
+ hub_model_id="username/model-name", # Target repository
25
+ )
26
+ ```
27
+
28
+ ### 2. Job Configuration
29
+
30
+ When submitting the job:
31
+
32
+ ```python
33
+ hf_jobs("uv", {
34
+ "script": "train.py",
35
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # Provide authentication
36
+ })
37
+ ```
38
+
39
+ **The `$HF_TOKEN` placeholder is automatically replaced with your Hugging Face token.**
40
+
41
+ ## Complete Example
42
+
43
+ ```python
44
+ # train.py
45
+ # /// script
46
+ # dependencies = ["trl"]
47
+ # ///
48
+
49
+ from trl import SFTTrainer, SFTConfig
50
+ from datasets import load_dataset
51
+
52
+ dataset = load_dataset("trl-lib/Capybara", split="train")
53
+
54
+ # Configure with Hub push
55
+ config = SFTConfig(
56
+ output_dir="my-model",
57
+ num_train_epochs=3,
58
+
59
+ # ✅ CRITICAL: Hub push configuration
60
+ push_to_hub=True,
61
+ hub_model_id="myusername/my-trained-model",
62
+
63
+ # Optional: Push strategy
64
+ push_to_hub_model_id="myusername/my-trained-model",
65
+ push_to_hub_organization=None,
66
+ push_to_hub_token=None, # Uses environment token
67
+ )
68
+
69
+ trainer = SFTTrainer(
70
+ model="Qwen/Qwen2.5-0.5B",
71
+ train_dataset=dataset,
72
+ args=config,
73
+ )
74
+
75
+ trainer.train()
76
+
77
+ # ✅ Push final model
78
+ trainer.push_to_hub()
79
+
80
+ print("✅ Model saved to: https://huggingface.co/myusername/my-trained-model")
81
+ ```
82
+
83
+ **Submit with authentication:**
84
+
85
+ ```python
86
+ hf_jobs("uv", {
87
+ "script": "train.py",
88
+ "flavor": "a10g-large",
89
+ "timeout": "2h",
90
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Required!
91
+ })
92
+ ```
93
+
94
+ ## What Gets Saved
95
+
96
+ When `push_to_hub=True`:
97
+
98
+ 1. **Model weights** - Final trained parameters
99
+ 2. **Tokenizer** - Associated tokenizer
100
+ 3. **Configuration** - Model config (config.json)
101
+ 4. **Training arguments** - Hyperparameters used
102
+ 5. **Model card** - Auto-generated documentation
103
+ 6. **Checkpoints** - If `save_strategy="steps"` enabled
104
+
105
+ ## Checkpoint Saving
106
+
107
+ Save intermediate checkpoints during training:
108
+
109
+ ```python
110
+ SFTConfig(
111
+ output_dir="my-model",
112
+ push_to_hub=True,
113
+ hub_model_id="username/my-model",
114
+
115
+ # Checkpoint configuration
116
+ save_strategy="steps",
117
+ save_steps=100, # Save every 100 steps
118
+ save_total_limit=3, # Keep only last 3 checkpoints
119
+ )
120
+ ```
121
+
122
+ **Benefits:**
123
+ - Resume training if job fails
124
+ - Compare checkpoint performance
125
+ - Use intermediate models
126
+
127
+ **Checkpoints are pushed to:** `username/my-model` (same repo)
128
+
129
+ ## Authentication Methods
130
+
131
+ ### Method 1: Automatic Token (Recommended)
132
+
133
+ ```python
134
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
135
+ ```
136
+
137
+ Uses your logged-in Hugging Face token automatically.
138
+
139
+ ### Method 2: Explicit Token
140
+
141
+ ```python
142
+ "secrets": {"HF_TOKEN": "hf_abc123..."}
143
+ ```
144
+
145
+ Provide token explicitly (not recommended for security).
146
+
147
+ ### Method 3: Environment Variable
148
+
149
+ ```python
150
+ "env": {"HF_TOKEN": "hf_abc123..."}
151
+ ```
152
+
153
+ Pass as regular environment variable (less secure than secrets).
154
+
155
+ **Always prefer Method 1** for security and convenience.
156
+
157
+ ## Verification Checklist
158
+
159
+ Before submitting any training job, verify:
160
+
161
+ - [ ] `push_to_hub=True` in training config
162
+ - [ ] `hub_model_id` is specified (format: `username/model-name`)
163
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
164
+ - [ ] Repository name doesn't conflict with existing repos
165
+ - [ ] You have write access to the target namespace
166
+
167
+ ## Repository Setup
168
+
169
+ ### Automatic Creation
170
+
171
+ If repository doesn't exist, it's created automatically when first pushing.
172
+
173
+ ### Manual Creation
174
+
175
+ Create repository before training:
176
+
177
+ ```python
178
+ from huggingface_hub import HfApi
179
+
180
+ api = HfApi()
181
+ api.create_repo(
182
+ repo_id="username/model-name",
183
+ repo_type="model",
184
+ private=False, # or True for private repo
185
+ )
186
+ ```
187
+
188
+ ### Repository Naming
189
+
190
+ **Valid names:**
191
+ - `username/my-model`
192
+ - `username/model-name`
193
+ - `organization/model-name`
194
+
195
+ **Invalid names:**
196
+ - `model-name` (missing username)
197
+ - `username/model name` (spaces not allowed)
198
+ - `username/MODEL` (uppercase discouraged)
199
+
200
+ ## Troubleshooting
201
+
202
+ ### Error: 401 Unauthorized
203
+
204
+ **Cause:** HF_TOKEN not provided or invalid
205
+
206
+ **Solutions:**
207
+ 1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
208
+ 2. Check you're logged in: `huggingface-cli whoami`
209
+ 3. Re-login: `huggingface-cli login`
210
+
211
+ ### Error: 403 Forbidden
212
+
213
+ **Cause:** No write access to repository
214
+
215
+ **Solutions:**
216
+ 1. Check repository namespace matches your username
217
+ 2. Verify you're a member of organization (if using org namespace)
218
+ 3. Check repository isn't private (if accessing org repo)
219
+
220
+ ### Error: Repository not found
221
+
222
+ **Cause:** Repository doesn't exist and auto-creation failed
223
+
224
+ **Solutions:**
225
+ 1. Manually create repository first
226
+ 2. Check repository name format
227
+ 3. Verify namespace exists
228
+
229
+ ### Error: Push failed during training
230
+
231
+ **Cause:** Network issues or Hub unavailable
232
+
233
+ **Solutions:**
234
+ 1. Training continues but final push fails
235
+ 2. Checkpoints may be saved
236
+ 3. Re-run push manually after job completes
237
+
238
+ ### Issue: Model saved but not visible
239
+
240
+ **Possible causes:**
241
+ 1. Repository is private—check https://huggingface.co/username
242
+ 2. Wrong namespace—verify `hub_model_id` matches login
243
+ 3. Push still in progress—wait a few minutes
244
+
245
+ ## Manual Push After Training
246
+
247
+ If training completes but push fails, push manually:
248
+
249
+ ```python
250
+ from transformers import AutoModel, AutoTokenizer
251
+
252
+ # Load from local checkpoint
253
+ model = AutoModel.from_pretrained("./output_dir")
254
+ tokenizer = AutoTokenizer.from_pretrained("./output_dir")
255
+
256
+ # Push to Hub
257
+ model.push_to_hub("username/model-name", token="hf_abc123...")
258
+ tokenizer.push_to_hub("username/model-name", token="hf_abc123...")
259
+ ```
260
+
261
+ **Note:** Only possible if job hasn't completed (files still exist).
262
+
263
+ ## Best Practices
264
+
265
+ 1. **Always enable `push_to_hub=True`**
266
+ 2. **Use checkpoint saving** for long training runs
267
+ 3. **Verify Hub push** in logs before job completes
268
+ 4. **Set appropriate `save_total_limit`** to avoid excessive checkpoints
269
+ 5. **Use descriptive repo names** (e.g., `qwen-capybara-sft` not `model1`)
270
+ 6. **Add model card** with training details
271
+ 7. **Tag models** with relevant tags (e.g., `text-generation`, `fine-tuned`)
272
+
273
+ ## Monitoring Push Progress
274
+
275
+ Check logs for push progress:
276
+
277
+ ```python
278
+ hf_jobs("logs", {"job_id": "your-job-id"})
279
+ ```
280
+
281
+ **Look for:**
282
+ ```
283
+ Pushing model to username/model-name...
284
+ Upload file pytorch_model.bin: 100%
285
+ ✅ Model pushed successfully
286
+ ```
287
+
288
+ ## Example: Full Production Setup
289
+
290
+ ```python
291
+ # production_train.py
292
+ # /// script
293
+ # dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
294
+ # ///
295
+
296
+ from datasets import load_dataset
297
+ from peft import LoraConfig
298
+ from trl import SFTTrainer, SFTConfig
299
+ import os
300
+
301
+ # Verify token is available
302
+ assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"
303
+
304
+ # Load dataset
305
+ dataset = load_dataset("trl-lib/Capybara", split="train")
306
+ print(f"✅ Dataset loaded: {len(dataset)} examples")
307
+
308
+ # Configure with comprehensive Hub settings
309
+ config = SFTConfig(
310
+ output_dir="qwen-capybara-sft",
311
+
312
+ # Hub configuration
313
+ push_to_hub=True,
314
+ hub_model_id="myusername/qwen-capybara-sft",
315
+ hub_strategy="checkpoint", # Push checkpoints
316
+
317
+ # Checkpoint configuration
318
+ save_strategy="steps",
319
+ save_steps=100,
320
+ save_total_limit=3,
321
+
322
+ # Training settings
323
+ num_train_epochs=3,
324
+ per_device_train_batch_size=4,
325
+
326
+ # Logging
327
+ logging_steps=10,
328
+ logging_first_step=True,
329
+ )
330
+
331
+ # Train with LoRA
332
+ trainer = SFTTrainer(
333
+ model="Qwen/Qwen2.5-0.5B",
334
+ train_dataset=dataset,
335
+ args=config,
336
+ peft_config=LoraConfig(r=16, lora_alpha=32),
337
+ )
338
+
339
+ print("🚀 Starting training...")
340
+ trainer.train()
341
+
342
+ print("💾 Pushing final model to Hub...")
343
+ trainer.push_to_hub()
344
+
345
+ print("✅ Training complete!")
346
+ print(f"Model available at: https://huggingface.co/myusername/qwen-capybara-sft")
347
+ ```
348
+
349
+ **Submit:**
350
+
351
+ ```python
352
+ hf_jobs("uv", {
353
+ "script": "production_train.py",
354
+ "flavor": "a10g-large",
355
+ "timeout": "6h",
356
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
357
+ })
358
+ ```
359
+
360
+ ## Key Takeaway
361
+
362
+ **Without `push_to_hub=True` and `secrets={"HF_TOKEN": "$HF_TOKEN"}`, all training results are permanently lost.**
363
+
364
+ Always verify both are configured before submitting any training job.
skills/hugging-face-model-trainer/references/reliability_principles.md ADDED
@@ -0,0 +1,371 @@
1
+ # Reliability Principles for Training Jobs
2
+
3
+ These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution.
4
+
5
+ ## Principle 1: Always Verify Before Use
6
+
7
+ **Rule:** Never assume repos, datasets, or resources exist. Verify with tools first.
8
+
9
+ ### What It Prevents
10
+
11
+ - **Non-existent datasets** - Jobs fail immediately when dataset doesn't exist
12
+ - **Typos in names** - Simple mistakes like "argilla-dpo-mix-7k" vs "ultrafeedback_binarized"
13
+ - **Incorrect paths** - Old or moved repos, renamed files
14
+ - **Missing dependencies** - Undocumented requirements
15
+
16
+ ### How to Apply
17
+
18
+ **Before submitting ANY job:**
19
+
20
+ ```python
21
+ # Verify dataset exists
22
+ dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
23
+ hub_repo_details(["author/dataset-name"], repo_type="dataset")
24
+
25
+ # Verify model exists
26
+ hub_repo_details(["org/model-name"], repo_type="model")
27
+
28
+ # Check script/file paths (for URL-based scripts)
29
+ # Verify before using: https://github.com/user/repo/blob/main/script.py
30
+ ```
31
+
32
+ **Examples that would have caught errors:**
33
+
34
+ ```python
35
+ # ❌ WRONG: Assumed dataset exists
36
+ hf_jobs("uv", {
37
+ "script": """...""",
38
+ "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"} # Doesn't exist!
39
+ })
40
+
41
+ # ✅ CORRECT: Verify first
42
+ dataset_search({"query": "argilla dpo", "author": "trl-lib"})
43
+ # Would show: "trl-lib/ultrafeedback_binarized" is the correct name
44
+
45
+ hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
46
+ # Confirms it exists before using
47
+ ```
48
+
49
+ ### Implementation Checklist
50
+
51
+ - [ ] Check dataset exists before training
52
+ - [ ] Verify base model exists before fine-tuning
53
+ - [ ] Confirm adapter model exists before GGUF conversion
54
+ - [ ] Test script URLs are valid before submitting
55
+ - [ ] Validate file paths in repositories
56
+ - [ ] Check for recent updates/renames of resources
57
+
58
+ **Time cost:** 5-10 seconds
59
+ **Time saved:** Hours of failed job time + debugging
60
+
61
+ ---
62
+
63
+ ## Principle 2: Prioritize Reliability Over Performance
64
+
65
+ **Rule:** Default to what is most likely to succeed, not what is theoretically fastest.
66
+
67
+ ### What It Prevents
68
+
69
+ - **Hardware incompatibilities** - Features that fail on certain GPUs
70
+ - **Unstable optimizations** - Speed-ups that cause crashes
71
+ - **Complex configurations** - More failure points
72
+ - **Build system issues** - Unreliable compilation methods
73
+
74
+ ### How to Apply
75
+
76
+ **Choose reliability:**
77
+
78
+ ```python
79
+ # ❌ RISKY: Aggressive optimization that may fail
80
+ SFTConfig(
81
+ torch_compile=True, # Can fail on T4, A10G GPUs
82
+ optim="adamw_bnb_8bit", # Requires specific setup
83
+ fp16=False, # May cause training instability
84
+ ...
85
+ )
86
+
87
+ # ✅ SAFE: Proven defaults
88
+ SFTConfig(
89
+ # torch_compile=True, # Commented with note: "Enable on H100 for 20% speedup"
90
+ optim="adamw_torch", # Standard, always works
91
+ fp16=True, # Stable and fast
92
+ ...
93
+ )
94
+ ```
95
+
96
+ **For build processes:**
97
+
98
+ ```python
99
+ # ❌ UNRELIABLE: Uses make (platform-dependent)
100
+ subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"], check=True)
101
+
102
+ # ✅ RELIABLE: Uses CMake (consistent, documented)
103
+ subprocess.run([
104
+ "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
105
+ "-DGGML_CUDA=OFF" # Disable CUDA for faster, more reliable build
106
+ ], check=True)
107
+
108
+ subprocess.run([
109
+ "cmake", "--build", "/tmp/llama.cpp/build",
110
+ "--target", "llama-quantize", "-j", "4"
111
+ ], check=True)
112
+ ```
113
+
114
+ ### Real-World Example
115
+
116
+ **The `torch.compile` failure:**
117
+ - Added for "20% speedup" on H100
118
+ - **Failed fatally on T4-medium** with cryptic error
119
+ - Misdiagnosed as dataset issue (cost hours)
120
+ - **Fix:** Disable by default, add as optional comment
121
+
122
+ **Result:** Reliability > 20% performance gain
123
+
124
+ ### Implementation Checklist
125
+
126
+ - [ ] Use proven, standard configurations by default
127
+ - [ ] Comment out performance optimizations with hardware notes
128
+ - [ ] Use stable build systems (CMake > make)
129
+ - [ ] Test on target hardware before production
130
+ - [ ] Document known incompatibilities
131
+ - [ ] Provide "safe" and "fast" variants when needed
132
+
133
+ **Performance loss:** 10-20% in best case
134
+ **Reliability gain:** 95%+ success rate vs 60-70%
135
+
136
+ ---
137
+
138
+ ## Principle 3: Create Atomic, Self-Contained Scripts
139
+
140
+ **Rule:** Scripts should work as complete, independent units. Don't remove parts to "simplify."
141
+
142
+ ### What It Prevents
143
+
144
+ - **Missing dependencies** - Removed "unnecessary" packages that are actually required
145
+ - **Incomplete processes** - Skipped steps that seem redundant
146
+ - **Environment assumptions** - Scripts that need pre-setup
147
+ - **Partial failures** - Some parts work, others fail silently
148
+
149
+ ### How to Apply
150
+
151
+ **Complete dependency specifications:**
152
+
153
+ ```python
154
+ # ❌ INCOMPLETE: "Simplified" by removing dependencies
155
+ # /// script
156
+ # dependencies = [
157
+ # "transformers",
158
+ # "peft",
159
+ # "torch",
160
+ # ]
161
+ # ///
162
+
163
+ # ✅ COMPLETE: All dependencies explicit
164
+ # /// script
165
+ # dependencies = [
166
+ # "transformers>=4.36.0",
167
+ # "peft>=0.7.0",
168
+ # "torch>=2.0.0",
169
+ # "accelerate>=0.24.0",
170
+ # "huggingface_hub>=0.20.0",
171
+ # "sentencepiece>=0.1.99", # Required for tokenizers
172
+ # "protobuf>=3.20.0", # Required for tokenizers
173
+ # "numpy",
174
+ # "gguf",
175
+ # ]
176
+ # ///
177
+ ```
178
+
179
+ **Complete build processes:**
180
+
181
+ ```python
182
+ # ❌ INCOMPLETE: Assumes build tools exist
183
+ subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
184
+ subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"]) # FAILS: no gcc/make
185
+
186
+ # ✅ COMPLETE: Installs all requirements
187
+ subprocess.run(["apt-get", "update", "-qq"], check=True)
188
+ subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True)
189
+ subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
190
+ # ... then build
191
+ ```
192
+
193
+ ### Real-World Example
194
+
195
+ **The `sentencepiece` failure:**
196
+ - Original script had it: worked fine
197
+ - "Simplified" version removed it: "doesn't look necessary"
198
+ - **GGUF conversion failed silently** - tokenizer couldn't convert
199
+ - Hard to debug: no obvious error message
200
+ - **Fix:** Restore all original dependencies
201
+
202
+ **Result:** Don't remove dependencies without thorough testing
203
+
204
+ ### Implementation Checklist
205
+
206
+ - [ ] All dependencies in PEP 723 header with version pins
207
+ - [ ] All system packages installed by script
208
+ - [ ] No assumptions about pre-existing environment
209
+ - [ ] No "optional" steps that are actually required
210
+ - [ ] Test scripts in clean environment
211
+ - [ ] Document why each dependency is needed
212
+
213
+ **Complexity:** Slightly longer scripts
214
+ **Reliability:** Scripts "just work" every time
215
+
216
+ ---
217
+
218
+ ## Principle 4: Provide Clear Error Context
219
+
220
+ **Rule:** When things fail, make it obvious what went wrong and how to fix it.
221
+
222
+ ### How to Apply
223
+
224
+ **Wrap subprocess calls:**
225
+
226
+ ```python
+ # ❌ UNCLEAR: Silent failure
+ subprocess.run([...], check=True, capture_output=True)
+
+ # ✅ CLEAR: Shows what failed
+ try:
+     result = subprocess.run(
+         [...],
+         check=True,
+         capture_output=True,
+         text=True,
+     )
+     print(result.stdout)
+     if result.stderr:
+         print("Warnings:", result.stderr)
+ except subprocess.CalledProcessError as e:
+     print("❌ Command failed!")
+     print("STDOUT:", e.stdout)
+     print("STDERR:", e.stderr)
+     raise
+ ```
247
+
248
+ **Validate inputs:**
249
+
250
+ ```python
+ # ❌ UNCLEAR: Fails later with cryptic error
+ model = load_model(MODEL_NAME)
+
+ # ✅ CLEAR: Fails fast with clear message
+ if not MODEL_NAME:
+     raise ValueError("MODEL_NAME environment variable not set!")
+
+ print(f"Loading model: {MODEL_NAME}")
+ try:
+     model = load_model(MODEL_NAME)
+     print("✅ Model loaded successfully")
+ except Exception as e:
+     print(f"❌ Failed to load model: {MODEL_NAME}")
+     print(f"Error: {e}")
+     print("Hint: Check that model exists on Hub")
+     raise
+ ```
268
+
269
+ ### Implementation Checklist
270
+
271
+ - [ ] Wrap external calls with try/except
272
+ - [ ] Print stdout/stderr on failure
273
+ - [ ] Validate environment variables early
274
+ - [ ] Add progress indicators (✅, ❌, 🔄)
275
+ - [ ] Include hints for common failures
276
+ - [ ] Log configuration at start
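The last checklist item can be as simple as printing the run's settings before anything else happens; a minimal sketch (field names are illustrative):

```python
# Log the full configuration up front so a failed job can be
# diagnosed from its logs alone (field names are illustrative).
config = {
    "model": "Qwen/Qwen2.5-0.5B",
    "dataset": "trl-lib/Capybara",
    "learning_rate": 2e-5,
    "timeout": "2h",
}

print("=== Job configuration ===")
for key, value in sorted(config.items()):
    print(f"{key:>15}: {value}")
```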
277
+
278
+ ---
279
+
280
+ ## Principle 5: Test the Happy Path on Known-Good Inputs
281
+
282
+ **Rule:** Before using new code in production, test with inputs you know work.
283
+
284
+ ### How to Apply
285
+
286
+ **Known-good test inputs:**
287
+
288
+ ```python
289
+ # For training
290
+ TEST_DATASET = "trl-lib/Capybara" # Small, well-formatted, widely used
291
+ TEST_MODEL = "Qwen/Qwen2.5-0.5B" # Small, fast, reliable
292
+
293
+ # For GGUF conversion
294
+ TEST_ADAPTER = "evalstate/qwen-capybara-medium" # Known working model
295
+ TEST_BASE = "Qwen/Qwen2.5-0.5B" # Compatible base
296
+ ```
297
+
298
+ **Testing workflow:**
299
+
300
+ 1. Test with known-good inputs first
301
+ 2. If that works, try production inputs
302
+ 3. If production fails, you know it's the inputs (not code)
303
+ 4. Isolate the difference
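As a sketch, the workflow above amounts to swapping input sets, not code (all names here are illustrative):

```python
# Known-good inputs stay fixed; only the input set changes between
# a test run and a production run (all names are illustrative).
KNOWN_GOOD = {"model": "Qwen/Qwen2.5-0.5B", "dataset": "trl-lib/Capybara"}

def choose_inputs(stage: str, production: dict) -> dict:
    """Use known-good inputs for the first run, production inputs after."""
    return KNOWN_GOOD if stage == "test" else production

production = {"model": "my-org/custom-model", "dataset": "my-org/private-data"}
assert choose_inputs("test", production) == KNOWN_GOOD
assert choose_inputs("production", production) is production
```

If the test run passes and the production run fails, the difference between the two input sets is where to look.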
304
+
305
+ ### Implementation Checklist
306
+
307
+ - [ ] Maintain list of known-good test models/datasets
308
+ - [ ] Test new scripts with test inputs first
309
+ - [ ] Document what makes inputs "good"
310
+ - [ ] Keep test jobs cheap (small models, short timeouts)
311
+ - [ ] Only move to production after test succeeds
312
+
313
+ **Time cost:** 5-10 minutes for test run
314
+ **Debugging time saved:** Hours
315
+
316
+ ---
317
+
318
+ ## Summary: The Reliability Checklist
319
+
320
+ Before submitting ANY job:
321
+
322
+ ### Pre-Flight Checks
323
+ - [ ] **Verified** all repos/datasets exist (hub_repo_details)
324
+ - [ ] **Tested** with known-good inputs if new code
325
+ - [ ] **Using** proven hardware/configuration
326
+ - [ ] **Included** all dependencies in PEP 723 header
327
+ - [ ] **Installed** system requirements (build tools, etc.)
328
+ - [ ] **Set** appropriate timeout (not default 30m)
329
+ - [ ] **Configured** Hub push with HF_TOKEN
330
+ - [ ] **Added** clear error handling
331
+
332
+ ### Script Quality
333
+ - [ ] Self-contained (no external setup needed)
334
+ - [ ] Complete dependencies listed
335
+ - [ ] Build tools installed by script
336
+ - [ ] Progress indicators included
337
+ - [ ] Error messages are clear
338
+ - [ ] Configuration logged at start
339
+
340
+ ### Job Configuration
341
+ - [ ] Timeout > expected runtime + 30% buffer
342
+ - [ ] Hardware appropriate for model size
343
+ - [ ] Secrets include HF_TOKEN
344
+ - [ ] Environment variables set correctly
345
+ - [ ] Cost estimated and acceptable
346
+
347
+ **Following these principles transforms job success rate from ~60-70% to ~95%+**
348
+
349
+ ---
350
+
351
+ ## When Principles Conflict
352
+
353
+ Sometimes reliability and performance conflict. Here's how to choose:
354
+
355
+ | Scenario | Choose | Rationale |
356
+ |----------|--------|-----------|
357
+ | Demo/test | Reliability | Fast failure is worse than slow success |
358
+ | Production (first run) | Reliability | Prove it works before optimizing |
359
+ | Production (proven) | Performance | Safe to optimize after validation |
360
+ | Time-critical | Reliability | Failures cause more delay than slow runs |
361
+ | Cost-critical | Balanced | Test with small model, then optimize |
362
+
363
+ **General rule:** Reliability first, optimize second.
364
+
365
+ ---
366
+
367
+ ## Further Reading
368
+
369
+ - `troubleshooting.md` - Common issues and fixes
370
+ - `training_patterns.md` - Proven training configurations
371
+ - `gguf_conversion.md` - Production GGUF workflow
skills/hugging-face-model-trainer/references/trackio_guide.md ADDED
@@ -0,0 +1,189 @@
1
+ # Trackio Integration for TRL Training
2
+
3
+ **Trackio** is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure.
4
+
5
+ ⚠️ **IMPORTANT**: For Jobs training (remote cloud GPUs):
6
+ - Training happens on ephemeral cloud runners (not your local machine)
7
+ - Trackio syncs metrics to a Hugging Face Space for real-time monitoring
8
+ - Without a Space, metrics are lost when the job completes
9
+ - The Space dashboard persists your training metrics permanently
10
+
11
+ ## Setting Up Trackio for Jobs
12
+
13
+ **Step 1: Add trackio dependency**
14
+ ```python
15
+ # /// script
16
+ # dependencies = [
17
+ # "trl>=0.12.0",
18
+ # "trackio", # Required!
19
+ # ]
20
+ # ///
21
+ ```
22
+
23
+ **Step 2: Create a Trackio Space (one-time setup)**
24
+
25
+ **Option A: Let Trackio auto-create (Recommended)**
26
+ Pass a `space_id` to `trackio.init()` and Trackio will automatically create the Space if it doesn't exist.
27
+
28
+ **Option B: Create manually**
29
+ - Create Space via Hub UI at https://huggingface.co/new-space
30
+ - Select Gradio SDK
31
+ - OR use command: `huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio`
32
+
33
+ **Step 3: Initialize Trackio with space_id**
34
+ ```python
+ import trackio
+
+ trackio.init(
+     project="my-training",
+     space_id="username/trackio",  # CRITICAL for Jobs! Replace 'username' with your HF username
+     config={
+         "model": "Qwen/Qwen2.5-0.5B",
+         "dataset": "trl-lib/Capybara",
+         "learning_rate": 2e-5,
+     }
+ )
+ ```
47
+
48
+ **Step 4: Configure TRL to use Trackio**
49
+ ```python
50
+ SFTConfig(
51
+ report_to="trackio",
52
+ # ... other config
53
+ )
54
+ ```
55
+
56
+ **Step 5: Finish tracking**
57
+ ```python
58
+ trainer.train()
59
+ trackio.finish() # Ensures final metrics are synced
60
+ ```
61
+
62
+ ## What Trackio Tracks
63
+
64
+ Trackio automatically logs:
65
+ - ✅ Training loss
66
+ - ✅ Learning rate
67
+ - ✅ GPU utilization
68
+ - ✅ Memory usage
69
+ - ✅ Training throughput
70
+ - ✅ Custom metrics
71
+
72
+ ## How It Works with Jobs
73
+
74
+ 1. **Training runs** → Metrics logged to local SQLite DB
75
+ 2. **Every 5 minutes** → Trackio syncs DB to HF Dataset (Parquet)
76
+ 3. **Space dashboard** → Reads from Dataset, displays metrics in real-time
77
+ 4. **Job completes** → Final sync ensures all metrics persisted
78
+
79
+ ## Default Configuration Pattern
80
+
81
+ **Use sensible defaults for trackio configuration unless user requests otherwise.**
82
+
83
+ ### Recommended Defaults
84
+
85
+ ```python
+ import trackio
+
+ trackio.init(
+     project="qwen-capybara-sft",
+     name="baseline-run",          # Descriptive name user will recognize
+     space_id="username/trackio",  # Default space: {username}/trackio
+     config={
+         # Keep config minimal - hyperparameters and model/dataset info only
+         "model": "Qwen/Qwen2.5-0.5B",
+         "dataset": "trl-lib/Capybara",
+         "learning_rate": 2e-5,
+         "num_epochs": 3,
+     }
+ )
+ ```
101
+
102
+ **Key principles:**
103
+ - **Space ID**: Use `{username}/trackio` with "trackio" as default space name
104
+ - **Run naming**: Unless otherwise specified, name the run in a way the user will recognize
105
+ - **Config**: Keep minimal - don't automatically capture job metadata unless requested
106
+ - **Grouping**: Optional - only use if user requests organizing related experiments
107
+
108
+ ## Grouping Runs (Optional)
109
+
110
+ The `group` parameter helps organize related runs together in the dashboard sidebar. This is useful when the user is running multiple experiments with different configurations but wants to compare them together:
111
+
112
+ ```python
113
+ # Example: Group runs by experiment type
114
+ trackio.init(project="my-project", run_name="baseline-run-1", group="baseline")
115
+ trackio.init(project="my-project", run_name="augmented-run-1", group="augmented")
116
+ trackio.init(project="my-project", run_name="tuned-run-1", group="tuned")
117
+ ```
118
+
119
+ Runs with the same group name can be grouped together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter:
120
+
121
+ ```python
122
+ # Hyperparameter sweep - group by learning rate
123
+ trackio.init(project="hyperparam-sweep", run_name="lr-0.001-run", group="lr_0.001")
124
+ trackio.init(project="hyperparam-sweep", run_name="lr-0.01-run", group="lr_0.01")
125
+ ```
126
+
127
+ ## Environment Variables for Jobs
128
+
129
+ You can configure trackio using environment variables instead of passing parameters to `trackio.init()`. This is useful for managing configuration across multiple jobs.
130
+
131
+
132
+
133
+ **`HF_TOKEN`**
134
+ Required for creating Spaces and writing to datasets (passed via `secrets`):
135
+ ```python
136
+ hf_jobs("uv", {
137
+ "script": "...",
138
+ "secrets": {
139
+ "HF_TOKEN": "$HF_TOKEN" # Enables Space creation and Hub push
140
+ }
141
+ })
142
+ ```
143
+
144
+ ### Example with Environment Variables
145
+
146
+ ```python
147
+ hf_jobs("uv", {
148
+ "script": """
149
+ # Training script - trackio config from environment
150
+ import trackio
151
+ from datetime import datetime
152
+
153
+ # Auto-generate run name
154
+ timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
155
+ run_name = f"sft_qwen25_{timestamp}"
156
+
157
+ # Project and space_id can come from environment variables
158
+ trackio.init(run_name=run_name, group="SFT")
159
+
160
+ # ... training code ...
161
+ trackio.finish()
162
+ """,
163
+ "flavor": "a10g-large",
164
+ "timeout": "2h",
165
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
166
+ })
167
+ ```
168
+
169
+ **When to use environment variables:**
170
+ - Managing multiple jobs with same configuration
171
+ - Keeping training scripts portable across projects
172
+ - Separating configuration from code
173
+
174
+ **When to use direct parameters:**
175
+ - Single job with specific configuration
176
+ - When clarity in code is preferred
177
+ - When each job has different project/space
178
+
179
+ ## Viewing the Dashboard
180
+
181
+ After starting training:
182
+ 1. Navigate to the Space: `https://huggingface.co/spaces/username/trackio`
183
+ 2. The Gradio dashboard shows all tracked experiments
184
+ 3. Filter by project, compare runs, view charts with smoothing
185
+
186
+ ## Recommendation
187
+
188
+ - **Trackio**: Best for real-time monitoring during long training runs
189
+ - **Weights & Biases**: Best for team collaboration, requires account
skills/hugging-face-model-trainer/references/training_methods.md ADDED
@@ -0,0 +1,150 @@
1
+ # TRL Training Methods Overview
2
+
3
+ TRL (Transformer Reinforcement Learning) provides multiple training methods for fine-tuning and aligning language models. This reference provides a brief overview of each method.
4
+
5
+ ## Supervised Fine-Tuning (SFT)
6
+
7
+ **What it is:** Standard instruction tuning with supervised learning on demonstration data.
8
+
9
+ **When to use:**
10
+ - Initial fine-tuning of base models on task-specific data
11
+ - Teaching new capabilities or domains
12
+ - Most common starting point for fine-tuning
13
+
14
+ **Dataset format:** Conversational format with "messages" field, OR text field, OR prompt/completion pairs
15
+
16
+ **Example:**
17
+ ```python
+ from trl import SFTTrainer, SFTConfig
+
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset,
+     args=SFTConfig(
+         output_dir="my-model",
+         push_to_hub=True,
+         hub_model_id="username/my-model",
+         eval_strategy="no",  # Disable eval for simple example
+         # max_length=1024 is the default - only set if you need different length
+     )
+ )
+ trainer.train()
+ ```
33
+
34
+ **Note:** For production training with evaluation monitoring, see `scripts/train_sft_example.py`
35
+
36
+ **Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer")`
37
+
38
+ ## Direct Preference Optimization (DPO)
39
+
40
+ **What it is:** Alignment method that trains directly on preference pairs (chosen vs rejected responses) without requiring a reward model.
41
+
42
+ **When to use:**
43
+ - Aligning models to human preferences
44
+ - Improving response quality after SFT
45
+ - Have paired preference data (chosen/rejected responses)
46
+
47
+ **Dataset format:** Preference pairs with "chosen" and "rejected" fields
48
+
49
+ **Example:**
50
+ ```python
+ from trl import DPOTrainer, DPOConfig
+
+ trainer = DPOTrainer(
+     model="Qwen/Qwen2.5-0.5B-Instruct",  # Use instruct model
+     train_dataset=dataset,
+     args=DPOConfig(
+         output_dir="dpo-model",
+         beta=0.1,  # KL penalty coefficient
+         eval_strategy="no",  # Disable eval for simple example
+         # max_length=1024 is the default - only set if you need different length
+     )
+ )
+ trainer.train()
+ ```
65
+
66
+ **Note:** For production training with evaluation monitoring, see `scripts/train_dpo_example.py`
67
+
68
+ **Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`
69
+
70
+ ## Group Relative Policy Optimization (GRPO)
71
+
72
+ **What it is:** Online RL method that optimizes relative to group performance, useful for tasks with verifiable rewards.
73
+
74
+ **When to use:**
75
+ - Tasks with automatic reward signals (code execution, math verification)
76
+ - Online learning scenarios
77
+ - When DPO offline data is insufficient
78
+
79
+ **Dataset format:** Prompt-only format (model generates responses, reward computed online)
80
+
81
+ **Example:**
82
+ ```python
83
+ # Use TRL maintained script
84
+ hf_jobs("uv", {
85
+ "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
86
+ "script_args": [
87
+ "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
88
+ "--dataset_name", "trl-lib/math_shepherd",
89
+ "--output_dir", "grpo-model"
90
+ ],
91
+ "flavor": "a10g-large",
92
+ "timeout": "4h",
93
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
94
+ })
95
+ ```
96
+
97
+ **Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`
98
+
99
+ ## Reward Modeling
100
+
101
+ **What it is:** Train a reward model to score responses, used as a component in RLHF pipelines.
102
+
103
+ **When to use:**
104
+ - Building RLHF pipeline
105
+ - Need automatic quality scoring
106
+ - Creating reward signals for PPO training
107
+
108
+ **Dataset format:** Preference pairs with "chosen" and "rejected" responses
109
+
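**Example** (a hedged sketch: as with GRPO, a maintained TRL example script can be submitted directly; the script path and arguments below are assumptions - check the TRL repo for the current reward modeling example):

```python
# Assumed path to TRL's maintained reward modeling example script -
# verify against the TRL repository before running.
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/reward_modeling.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/ultrafeedback_binarized",
        "--output_dir", "reward-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```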
110
+ **Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/reward_trainer")`
111
+
112
+ ## Method Selection Guide
113
+
114
+ | Method | Complexity | Data Required | Use Case |
115
+ |--------|-----------|---------------|----------|
116
+ | **SFT** | Low | Demonstrations | Initial fine-tuning |
117
+ | **DPO** | Medium | Paired preferences | Post-SFT alignment |
118
+ | **GRPO** | Medium | Prompts + reward fn | Online RL with automatic rewards |
119
+ | **Reward** | Medium | Paired preferences | Building RLHF pipeline |
120
+
121
+ ## Recommended Pipeline
122
+
123
+ **For most use cases:**
124
+ 1. **Start with SFT** - Fine-tune base model on task data
125
+ 2. **Follow with DPO** - Align to preferences using paired data
126
+ 3. **Optional: GGUF conversion** - Deploy for local inference
127
+
128
+ **For advanced RL scenarios:**
129
+ 1. **Start with SFT** - Fine-tune base model
130
+ 2. **Train reward model** - On preference data
131
+
132
+ ## Dataset Format Reference
133
+
134
+ For complete dataset format specifications, use:
135
+ ```python
136
+ hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
137
+ ```
138
+
139
+ Or validate your dataset:
140
+ ```bash
141
+ uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
142
+ --dataset your/dataset --split train
143
+ ```
144
+
145
+ ## See Also
146
+
147
+ - `references/training_patterns.md` - Common training patterns and examples
148
+ - `scripts/train_sft_example.py` - Complete SFT template
149
+ - `scripts/train_dpo_example.py` - Complete DPO template
150
+ - [Dataset Inspector](https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py) - Dataset format validation tool
skills/hugging-face-model-trainer/references/training_patterns.md ADDED
@@ -0,0 +1,203 @@
1
+ # Common Training Patterns
2
+
3
+ This guide provides common training patterns and use cases for TRL on Hugging Face Jobs.
4
+
5
+ ## Multi-GPU Training
6
+
7
+ Automatic distributed training across multiple GPUs. TRL/Accelerate handles distribution automatically:
8
+
9
+ ```python
10
+ hf_jobs("uv", {
11
+ "script": """
12
+ # Your training script here (same as single GPU)
13
+ # No changes needed - Accelerate detects multiple GPUs
14
+ """,
15
+ "flavor": "a10g-largex2", # 2x A10G GPUs
16
+ "timeout": "4h",
17
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
18
+ })
19
+ ```
20
+
21
+ **Tips for multi-GPU:**
22
+ - No code changes needed
23
+ - Use `per_device_train_batch_size` (per GPU, not total)
24
+ - Effective batch size = `per_device_train_batch_size` × `num_gpus` × `gradient_accumulation_steps`
25
+ - Monitor GPU utilization to ensure both GPUs are being used
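The effective batch size formula above works out as follows (numbers are illustrative):

```python
# Effective batch size on a 2-GPU flavor (numbers are illustrative).
per_device_train_batch_size = 4
num_gpus = 2                      # e.g. a10g-largex2
gradient_accumulation_steps = 4

effective_batch_size = (
    per_device_train_batch_size * num_gpus * gradient_accumulation_steps
)
print(effective_batch_size)  # 32
```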
26
+
27
+ ## DPO Training (Preference Learning)
28
+
29
+ Train with preference data for alignment:
30
+
31
+ ```python
32
+ hf_jobs("uv", {
33
+ "script": """
34
+ # /// script
35
+ # dependencies = ["trl>=0.12.0", "trackio"]
36
+ # ///
37
+
38
+ from datasets import load_dataset
39
+ from trl import DPOTrainer, DPOConfig
40
+ import trackio
41
+
42
+ dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")
43
+
44
+ # Create train/eval split
45
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
46
+
47
+ config = DPOConfig(
48
+ output_dir="dpo-model",
49
+ push_to_hub=True,
50
+ hub_model_id="username/dpo-model",
51
+ num_train_epochs=1,
52
+ beta=0.1, # KL penalty coefficient
53
+ eval_strategy="steps",
54
+ eval_steps=50,
55
+ report_to="trackio",
56
+ run_name="baseline_run", # use a meaningful run name
57
+ # max_length=1024, # Default - only set if you need different sequence length
58
+ )
59
+
60
+ trainer = DPOTrainer(
61
+ model="Qwen/Qwen2.5-0.5B-Instruct", # Use instruct model as base
62
+ train_dataset=dataset_split["train"],
63
+ eval_dataset=dataset_split["test"], # IMPORTANT: Provide eval_dataset when eval_strategy is enabled
64
+ args=config,
65
+ )
66
+
67
+ trainer.train()
68
+ trainer.push_to_hub()
69
+ trackio.finish()
70
+ """,
71
+ "flavor": "a10g-large",
72
+ "timeout": "3h",
73
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
74
+ })
75
+ ```
76
+
77
+ **For DPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`
78
+
79
+ ## GRPO Training (Online RL)
80
+
81
+ Group Relative Policy Optimization for online reinforcement learning:
82
+
83
+ ```python
84
+ hf_jobs("uv", {
85
+ "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
86
+ "script_args": [
87
+ "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
88
+ "--dataset_name", "trl-lib/math_shepherd",
89
+ "--output_dir", "grpo-model",
90
+ "--push_to_hub",
91
+ "--hub_model_id", "username/grpo-model"
92
+ ],
93
+ "flavor": "a10g-large",
94
+ "timeout": "4h",
95
+ "secrets": {"HF_TOKEN": "$HF_TOKEN"}
96
+ })
97
+ ```
98
+
99
+ **For GRPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`
100
+
101
+ ## Trackio Configuration
102
+
103
+ **Use sensible defaults for trackio setup.** See `references/trackio_guide.md` for complete documentation including grouping runs for experiments.
104
+
105
+ ### Basic Pattern
106
+
107
+ ```python
108
+ import trackio
109
+
110
+ trackio.init(
111
+ project="my-training",
112
+ run_name="baseline-run", # Descriptive name user will recognize
113
+ space_id="username/trackio", # Default space: {username}/trackio
114
+ config={
115
+ # Keep config minimal - hyperparameters and model/dataset info only
116
+ "model": "Qwen/Qwen2.5-0.5B",
117
+ "dataset": "trl-lib/Capybara",
118
+ "learning_rate": 2e-5,
119
+ }
120
+ )
121
+
122
+ # Your training code...
123
+
124
+ trackio.finish()
125
+ ```
126
+
127
+ ### Grouping for Experiments (Optional)
128
+
129
+ When the user wants to compare related runs, use the `group` parameter:
130
+
131
+ ```python
132
+ # Hyperparameter sweep
133
+ trackio.init(project="hyperparam-sweep", run_name="lr-0.001", group="lr_0.001")
134
+ trackio.init(project="hyperparam-sweep", run_name="lr-0.01", group="lr_0.01")
135
+ ```
136
+
137
+ ## Pattern Selection Guide
138
+
139
+ | Use Case | Pattern | Hardware | Time |
140
+ |----------|---------|----------|------|
141
+ | SFT training | `scripts/train_sft_example.py` | a10g-large | 2-6 hours |
142
+ | Large dataset (>10K) | Multi-GPU | a10g-largex2 | 4-12 hours |
143
+ | Preference learning | DPO Training | a10g-large | 2-4 hours |
144
+ | Online RL | GRPO Training | a10g-large | 3-6 hours |
145
+
146
+ ## Critical: Evaluation Dataset Requirements
147
+
148
+ **⚠️ IMPORTANT**: If you set `eval_strategy="steps"` or `eval_strategy="epoch"`, you **MUST** provide an `eval_dataset` to the trainer, or the training will hang.
149
+
150
+ ### ✅ CORRECT - With eval dataset:
151
+ ```python
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
+
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset_split["train"],
+     eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
+     args=SFTConfig(eval_strategy="steps", ...),
+ )
+ ```
161
+
162
+ ### ❌ WRONG - Will hang:
163
+ ```python
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset,
+     # NO eval_dataset but eval_strategy="steps" ← WILL HANG
+     args=SFTConfig(eval_strategy="steps", ...),
+ )
+ ```
171
+
172
+ ### Option: Disable evaluation if no eval dataset
173
+ ```python
+ config = SFTConfig(
+     eval_strategy="no",  # ← Explicitly disable evaluation
+     # ... other config
+ )
+
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset,
+     # No eval_dataset needed
+     args=config,
+ )
+ ```
186
+
187
+ ## Best Practices
188
+
189
+ 1. **Use train/eval splits** - Create evaluation split for monitoring progress
190
+ 2. **Enable Trackio** - Monitor progress in real-time
191
+ 3. **Add 20-30% buffer to timeout** - Account for loading/saving overhead
192
+ 4. **Test with TRL official scripts first** - Use maintained examples before custom code
193
+ 5. **Always provide eval_dataset** - When using eval_strategy, or set to "no"
194
+ 6. **Use multi-GPU for large models** - 7B+ models benefit significantly
195
+
196
+ ## See Also
197
+
198
+ - `scripts/train_sft_example.py` - Complete SFT template with Trackio and eval split
199
+ - `scripts/train_dpo_example.py` - Complete DPO template
200
+ - `scripts/train_grpo_example.py` - Complete GRPO template
201
+ - `references/hardware_guide.md` - Detailed hardware specifications
202
+ - `references/training_methods.md` - Overview of all TRL training methods
203
+ - `references/troubleshooting.md` - Common issues and solutions
skills/hugging-face-model-trainer/references/troubleshooting.md ADDED
@@ -0,0 +1,282 @@
1
+ # Troubleshooting TRL Training Jobs
2
+
3
+ Common issues and solutions when training with TRL on Hugging Face Jobs.
4
+
5
+ ## Training Hangs at "Starting training..." Step
6
+
7
+ **Problem:** Job starts but hangs at the training step - never progresses, never times out, just sits there.
8
+
9
+ **Root Cause:** Using `eval_strategy="steps"` or `eval_strategy="epoch"` without providing an `eval_dataset` to the trainer.
10
+
11
+ **Solution:**
12
+
13
+ **Option A: Provide eval_dataset (recommended)**
14
+ ```python
+ # Create train/eval split
+ dataset_split = dataset.train_test_split(test_size=0.1, seed=42)
+
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset_split["train"],
+     eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
+     args=SFTConfig(
+         eval_strategy="steps",
+         eval_steps=50,
+         ...
+     ),
+ )
+ ```
29
+
30
+ **Option B: Disable evaluation**
31
+ ```python
+ trainer = SFTTrainer(
+     model="Qwen/Qwen2.5-0.5B",
+     train_dataset=dataset,
+     # No eval_dataset
+     args=SFTConfig(
+         eval_strategy="no",  # ← Explicitly disable
+         ...
+     ),
+ )
+ ```
42
+
43
+ **Prevention:**
44
+ - Always create train/eval split for better monitoring
45
+ - Use `dataset.train_test_split(test_size=0.1, seed=42)`
46
+ - Check example scripts: `scripts/train_sft_example.py` includes proper eval setup
47
+
48
+ ## Job Times Out
49
+
50
+ **Problem:** Job terminates before training completes, all progress lost.
51
+
52
+ **Solutions:**
53
+ - Increase timeout parameter (e.g., `"timeout": "4h"`)
54
+ - Reduce `num_train_epochs` or use smaller dataset slice
55
+ - Use smaller model or enable LoRA/PEFT to speed up training
56
+ - Add 20-30% buffer to estimated time for loading/saving overhead
57
+
58
+ **Prevention:**
59
+ - Always start with a quick demo run to estimate timing
60
+ - Use `scripts/estimate_cost.py` to get time estimates
61
+ - Monitor first runs closely via Trackio or logs
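One way to apply the buffer rule consistently is to compute the timeout from the estimate; a sketch (the helper name and buffer are illustrative):

```python
# Pad an estimated runtime by a percentage buffer and round up to
# whole minutes before passing it as the job's "timeout"
# (helper name and default buffer are illustrative).
def padded_timeout(estimated_minutes: int, buffer_pct: int = 30) -> str:
    minutes = -(-estimated_minutes * (100 + buffer_pct) // 100)  # ceiling division
    return f"{minutes}m"

print(padded_timeout(90))   # "117m" for a 90-minute estimate
```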
62
+
63
+ ## Model Not Saved to Hub
64
+
65
+ **Problem:** Training completes but model doesn't appear on Hub - all work lost.
66
+
67
+ **Check:**
68
+ - [ ] `push_to_hub=True` in training config
69
+ - [ ] `hub_model_id` specified with username (e.g., `"username/model-name"`)
70
+ - [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job submission
71
+ - [ ] User has write access to target repo
72
+ - [ ] Token has write permissions (check at https://huggingface.co/settings/tokens)
73
+ - [ ] Training script calls `trainer.push_to_hub()` at the end
74
+
75
+ **See:** `references/hub_saving.md` for detailed Hub authentication troubleshooting
76
+
77
+ ## Out of Memory (OOM)
78
+
79
+ **Problem:** Job fails with CUDA out of memory error.
80
+
81
+ **Solutions (in order of preference):**
82
+ 1. **Reduce batch size:** Lower `per_device_train_batch_size` (try 4 → 2 → 1)
83
+ 2. **Increase gradient accumulation:** Raise `gradient_accumulation_steps` to maintain effective batch size
84
+ 3. **Disable evaluation:** Remove `eval_dataset` and `eval_strategy` (saves ~40% memory, good for demos)
85
+ 4. **Enable LoRA/PEFT:** Use `peft_config=LoraConfig(r=8, lora_alpha=16)` to train adapters only (smaller rank = less memory)
86
+ 5. **Use larger GPU:** Switch from `t4-small` → `l4x1` → `a10g-large` → `a100-large`
87
+ 6. **Enable gradient checkpointing:** Set `gradient_checkpointing=True` in config (slower but saves memory)
88
+ 7. **Use smaller model:** Try a smaller variant (e.g., 0.5B instead of 3B)
89
+
90
+ **Memory guidelines:**
91
+ - T4 (16GB): <1B models with LoRA
92
+ - A10G (24GB): 1-3B models with LoRA, <1B full fine-tune
93
+ - A100 (40GB/80GB): 7B+ models with LoRA, 3B full fine-tune
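Several of the solutions above compose; a minimal sketch combining reduced batch size, gradient accumulation, disabled eval, checkpointing, and LoRA (assumes `trl` and `peft` are in the script's PEP 723 header and `dataset` is already loaded):

```python
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,                        # assumed loaded earlier
    peft_config=LoraConfig(r=8, lora_alpha=16),   # solution 4: train adapters only
    args=SFTConfig(
        output_dir="oom-safe-model",
        per_device_train_batch_size=1,            # solution 1
        gradient_accumulation_steps=8,            # solution 2: effective batch of 8
        eval_strategy="no",                       # solution 3
        gradient_checkpointing=True,              # solution 6
    ),
)
trainer.train()
```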
94
+
95
+ ## Parameter Naming Issues
96
+
97
+ **Problem:** `TypeError: SFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'`
98
+
99
+ **Cause:** TRL config classes use `max_length`, not `max_seq_length`.
100
+
101
+ **Solution:**
102
+ ```python
103
+ # ✅ CORRECT - TRL uses max_length
104
+ SFTConfig(max_length=512)
105
+ DPOConfig(max_length=512)
106
+
107
+ # ❌ WRONG - This will fail
108
+ SFTConfig(max_seq_length=512)
109
+ ```
110
+
111
+ **Note:** Most TRL configs don't require explicit max_length - the default (1024) works well. Only set if you need a specific value.
112
+
113
+ ## Dataset Format Error
114
+
115
+ **Problem:** Training fails with dataset format errors or missing fields.
116
+
117
+ **Solutions:**
118
+ 1. **Check format documentation:**
119
+ ```python
120
+ hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
121
+ ```
122
+
123
+ 2. **Validate dataset before training:**
124
+ ```bash
125
+ uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
126
+ --dataset <dataset-name> --split train
127
+ ```
128
+ Or via hf_jobs:
129
+ ```python
130
+ hf_jobs("uv", {
131
+ "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
132
+ "script_args": ["--dataset", "dataset-name", "--split", "train"]
133
+ })
134
+ ```
135
+
136
+ 3. **Verify field names:**
137
+ - **SFT:** Needs "messages" field (conversational), OR "text" field, OR "prompt"/"completion"
138
+ - **DPO:** Needs "chosen" and "rejected" fields
139
+ - **GRPO:** Needs prompt-only format
140
+
141
+ 4. **Check dataset split:**
142
+ - Ensure split exists (e.g., `split="train"`)
143
+ - Preview dataset: `load_dataset("name", split="train[:5]")`
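The field-name check in step 3 can be automated before submitting the job; a sketch (the accepted field combinations are illustrative - confirm against the TRL dataset formats docs):

```python
# Accepted column combinations per method (illustrative - confirm
# against the TRL dataset formats documentation).
REQUIRED_FIELDS = {
    "sft": [["messages"], ["text"], ["prompt", "completion"]],
    "dpo": [["chosen", "rejected"]],
}

def has_required_fields(method: str, columns: list) -> bool:
    """True if any accepted combination of fields is present."""
    return any(all(field in columns for field in combo)
               for combo in REQUIRED_FIELDS[method])

assert has_required_fields("sft", ["messages", "source"])
assert not has_required_fields("dpo", ["chosen"])  # missing "rejected"
```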
144

## Import/Module Errors

**Problem:** Job fails with "ModuleNotFoundError" or import errors.

**Solutions:**
1. **Add PEP 723 header with dependencies:**
   ```python
   # /// script
   # dependencies = [
   #     "trl>=0.12.0",
   #     "peft>=0.7.0",
   #     "transformers>=4.36.0",
   # ]
   # ///
   ```

2. **Verify exact format:**
   - Must have `# ///` delimiters (with space after `#`)
   - Dependencies must be valid PyPI package names
   - Check spelling and version constraints

3. **Test locally first:**
   ```bash
   uv run train.py  # Tests if dependencies are correct
   ```
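
The delimiter rules in step 2 are easy to check mechanically. A rough sketch (this helper is illustrative, not a full PEP 723 parser):

```python
def has_pep723_header(script_text: str) -> bool:
    """Rough check for a `# /// script` ... `# ///` metadata block."""
    lines = script_text.splitlines()
    if "# /// script" not in lines:
        return False  # opener missing or malformed (e.g. no space after '#')
    start = lines.index("# /// script")
    return "# ///" in lines[start + 1:]

good = '# /// script\n# dependencies = ["trl"]\n# ///\nprint("training")\n'
bad = '#/// script\n# ///\n'  # missing space after '#'
print(has_pep723_header(good), has_pep723_header(bad))  # True False
```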

## Authentication Errors

**Problem:** Job fails with authentication or permission errors when pushing to Hub.

**Solutions:**
1. **Verify authentication:**
   ```python
   mcp__huggingface__hf_whoami()  # Check who's authenticated
   ```

2. **Check token permissions:**
   - Go to https://huggingface.co/settings/tokens
   - Ensure token has "write" permission
   - Token must not be "read-only"

3. **Verify token in job:**
   ```python
   "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Must be in job config
   ```

4. **Check repo permissions:**
   - User must have write access to target repo
   - If org repo, user must be member with write access
   - Repo must exist or user must have permission to create
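
Before digging into permissions, confirm the token is even reaching the environment. A stdlib-only sketch (the function name and categories are illustrative):

```python
import os

def hf_token_status(environ=None) -> str:
    """Classify the HF_TOKEN environment variable for quick debugging."""
    env = os.environ if environ is None else environ
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return "missing"
    if not token.startswith("hf_"):
        return "suspicious"  # user access tokens normally start with 'hf_'
    return "present"

print(hf_token_status({"HF_TOKEN": "hf_example"}))  # present
```

Note that "present" only confirms the variable is set; write permission still has to be verified on the tokens settings page.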

## Job Stuck or Not Starting

**Problem:** Job shows "pending" or "starting" for extended period.

**Solutions:**
- Check Jobs dashboard for status: https://huggingface.co/jobs
- Verify hardware availability (some GPU types may have queues)
- Try different hardware flavor if one is heavily utilized
- Check for account billing issues (Jobs requires paid plan)

**Typical startup times:**
- CPU jobs: 10-30 seconds
- GPU jobs: 30-90 seconds
- If >3 minutes: likely queued or stuck

## Training Loss Not Decreasing

**Problem:** Training runs but loss stays flat or doesn't improve.

**Solutions:**
1. **Check learning rate:** May be too low (try 2e-5 to 5e-5) or too high (reduce toward 1e-6)
2. **Verify dataset quality:** Inspect examples to ensure they're reasonable
3. **Check model size:** Very small models may not have capacity for the task
4. **Increase training steps:** May need more epochs or a larger dataset
5. **Verify dataset format:** Wrong format may cause degraded training

## Logs Not Appearing

**Problem:** Cannot see training logs or progress.

**Solutions:**
1. **Wait 30-60 seconds:** Initial logs can be delayed
2. **Check logs via MCP tool:**
   ```python
   hf_jobs("logs", {"job_id": "your-job-id"})
   ```
3. **Use Trackio for real-time monitoring:** See `references/trackio_guide.md`
4. **Verify job is actually running:**
   ```python
   hf_jobs("inspect", {"job_id": "your-job-id"})
   ```

## Checkpoint/Resume Issues

**Problem:** Cannot resume from checkpoint or checkpoint not saved.

**Solutions:**
1. **Enable checkpoint saving:**
   ```python
   SFTConfig(
       save_strategy="steps",
       save_steps=100,
       hub_strategy="every_save",  # Push each checkpoint
   )
   ```

2. **Verify checkpoints pushed to Hub:** Check model repo for checkpoint folders

3. **Resume from checkpoint:**
   ```python
   trainer = SFTTrainer(
       model="username/model-name",  # Can be a local checkpoint path
   )
   # resume_from_checkpoint is an argument to train(), not the trainer
   # constructor, and it expects a local checkpoint directory (download
   # Hub checkpoints first, e.g. with huggingface_hub.snapshot_download).
   trainer.train(resume_from_checkpoint="path/to/checkpoint-1000")
   ```
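
Transformers saves checkpoints as `checkpoint-<step>` folders, so picking the latest one to resume from can be automated. A small sketch (the helper name is illustrative):

```python
import re

def latest_checkpoint(folder_names):
    """Return the highest-step `checkpoint-<N>` name, or None if there is none."""
    steps = [(int(m.group(1)), name)
             for name in folder_names
             if (m := re.fullmatch(r"checkpoint-(\d+)", name))]
    return max(steps)[1] if steps else None

print(latest_checkpoint(["checkpoint-500", "checkpoint-1000", "logs"]))  # checkpoint-1000
```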

## Getting Help

If issues persist:

1. **Check TRL documentation:**
   ```python
   hf_doc_search("your issue", product="trl")
   ```

2. **Check Jobs documentation:**
   ```python
   hf_doc_fetch("https://huggingface.co/docs/huggingface_hub/guides/jobs")
   ```

3. **Review related guides:**
   - `references/hub_saving.md` - Hub authentication issues
   - `references/hardware_guide.md` - Hardware selection and specs
   - `references/training_patterns.md` - Eval dataset requirements
   - SKILL.md "Working with Scripts" section - Script format and URL issues

4. **Ask in HF forums:** https://discuss.huggingface.co/