Publish codex workspace
This view is limited to 50 files because it contains too many changes. See raw diff.
- AGENTS.md +0 -0
- README.md +32 -0
- skills/.system/.codex-system-skills.marker +1 -0
- skills/.system/openai-docs/LICENSE.txt +201 -0
- skills/.system/openai-docs/SKILL.md +69 -0
- skills/.system/openai-docs/agents/openai.yaml +14 -0
- skills/.system/openai-docs/assets/openai-small.svg +3 -0
- skills/.system/openai-docs/assets/openai.png +0 -0
- skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md +433 -0
- skills/.system/openai-docs/references/latest-model.md +35 -0
- skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md +164 -0
- skills/.system/skill-creator/SKILL.md +413 -0
- skills/.system/skill-creator/agents/openai.yaml +5 -0
- skills/.system/skill-creator/assets/skill-creator-small.svg +3 -0
- skills/.system/skill-creator/assets/skill-creator.png +0 -0
- skills/.system/skill-creator/license.txt +202 -0
- skills/.system/skill-creator/references/openai_yaml.md +49 -0
- skills/.system/skill-creator/scripts/generate_openai_yaml.py +226 -0
- skills/.system/skill-creator/scripts/init_skill.py +400 -0
- skills/.system/skill-creator/scripts/quick_validate.py +101 -0
- skills/.system/skill-installer/LICENSE.txt +202 -0
- skills/.system/skill-installer/SKILL.md +58 -0
- skills/.system/skill-installer/agents/openai.yaml +5 -0
- skills/.system/skill-installer/assets/skill-installer-small.svg +3 -0
- skills/.system/skill-installer/assets/skill-installer.png +0 -0
- skills/.system/skill-installer/scripts/github_utils.py +21 -0
- skills/.system/skill-installer/scripts/install-skill-from-github.py +308 -0
- skills/.system/skill-installer/scripts/list-skills.py +107 -0
- skills/agent-kernel/SKILL.md +379 -0
- skills/hugging-face-evaluation/SKILL.md +262 -0
- skills/hugging-face-evaluation/examples/.env.example +7 -0
- skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +380 -0
- skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +220 -0
- skills/hugging-face-evaluation/examples/eval.example.yaml +11 -0
- skills/hugging-face-evaluation/examples/example_readme_tables.md +135 -0
- skills/hugging-face-evaluation/examples/metric_mapping.json +118 -0
- skills/hugging-face-evaluation/references/hf_cli_for_prs.md +258 -0
- skills/hugging-face-evaluation/references/hf_papers_extraction.md +297 -0
- skills/hugging-face-evaluation/references/model_card_extraction.md +244 -0
- skills/hugging-face-evaluation/scripts/check_prs.py +98 -0
- skills/hugging-face-evaluation/scripts/import_aa.py +353 -0
- skills/hugging-face-model-trainer/SKILL.md +718 -0
- skills/hugging-face-model-trainer/references/gguf_conversion.md +296 -0
- skills/hugging-face-model-trainer/references/hardware_guide.md +283 -0
- skills/hugging-face-model-trainer/references/hub_saving.md +364 -0
- skills/hugging-face-model-trainer/references/reliability_principles.md +371 -0
- skills/hugging-face-model-trainer/references/trackio_guide.md +189 -0
- skills/hugging-face-model-trainer/references/training_methods.md +150 -0
- skills/hugging-face-model-trainer/references/training_patterns.md +203 -0
- skills/hugging-face-model-trainer/references/troubleshooting.md +282 -0
AGENTS.md
ADDED
File without changes
README.md
ADDED
@@ -0,0 +1,32 @@
---
tags:
- codex
- agent
library_name: codex
agent_name: ".codex"
agent_emoji: "🤖"
---

# 🤖 .codex

A Codex workspace published to the Hugging Face Hub.

## Quick Start

```bash
hf claw install --codex burtenshaw/codex

# Or install and launch it in one shot
hf claw run --codex burtenshaw/codex
```

## AGENTS.md


## Included Directories

- `skills/`

---

*Published with [hf-claw](https://github.com/huggingface/harness)*
skills/.system/.codex-system-skills.marker
ADDED
@@ -0,0 +1 @@
415286eb412224fe
skills/.system/openai-docs/LICENSE.txt
ADDED
@@ -0,0 +1,201 @@
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf of
      any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
skills/.system/openai-docs/SKILL.md
ADDED
@@ -0,0 +1,69 @@
---
name: "openai-docs"
description: "Use when the user asks how to build with OpenAI products or APIs and needs up-to-date official documentation with citations, help choosing the latest model for a use case, or explicit GPT-5.4 upgrade and prompt-upgrade guidance; prioritize OpenAI docs MCP tools, use bundled references only as helper context, and restrict any fallback browsing to official OpenAI domains."
---

# OpenAI Docs

Provide authoritative, current guidance from OpenAI developer docs using the developers.openai.com MCP server. Always prioritize the developer docs MCP tools over web.run for OpenAI-related questions. This skill may also load targeted files from `references/` for model-selection and GPT-5.4-specific requests, but current OpenAI docs remain authoritative. Only if the MCP server is installed and returns no meaningful results should you fall back to web search.

## Quick start

- Use `mcp__openaiDeveloperDocs__search_openai_docs` to find the most relevant doc pages.
- Use `mcp__openaiDeveloperDocs__fetch_openai_doc` to pull exact sections and quote/paraphrase accurately.
- Use `mcp__openaiDeveloperDocs__list_openai_docs` only when you need to browse or discover pages without a clear query.
- Load only the relevant file from `references/` when the question is about model selection or a GPT-5.4 upgrade.

## OpenAI product snapshots

1. Apps SDK: Build ChatGPT apps by providing a web component UI and an MCP server that exposes your app's tools to ChatGPT.
2. Responses API: A unified endpoint designed for stateful, multimodal, tool-using interactions in agentic workflows.
3. Chat Completions API: Generate a model response from a list of messages comprising a conversation.
4. Codex: OpenAI's coding agent for software development that can write, understand, review, and debug code.
5. gpt-oss: Open-weight OpenAI reasoning models (gpt-oss-120b and gpt-oss-20b) released under the Apache 2.0 license.
6. Realtime API: Build low-latency, multimodal experiences including natural speech-to-speech conversations.
7. Agents SDK: A toolkit for building agentic apps where a model can use tools and context, hand off to other agents, stream partial results, and keep a full trace.

## If MCP server is missing

If MCP tools fail or no OpenAI docs resources are available:

1. Run the install command yourself: `codex mcp add openaiDeveloperDocs --url https://developers.openai.com/mcp`
2. If it fails due to permissions/sandboxing, immediately retry the same command with escalated permissions and include a 1-sentence justification for approval. Do not ask the user to run it yet.
3. Only if the escalated attempt fails, ask the user to run the install command.
4. Ask the user to restart Codex.
5. Re-run the doc search/fetch after restart.

## Workflow

1. Clarify the product scope and whether the request is general docs lookup, model selection, a GPT-5.4 upgrade, or a GPT-5.4 prompt upgrade.
2. If it is a model-selection request, load `references/latest-model.md`.
3. If it is an explicit GPT-5.4 upgrade request, load `references/upgrading-to-gpt-5p4.md`.
4. If the upgrade may require prompt changes, or the workflow is research-heavy, tool-heavy, coding-oriented, multi-agent, or long-running, also load `references/gpt-5p4-prompting-guide.md`.
5. Search docs with a precise query.
6. Fetch the best page and the exact section needed (use `anchor` when possible).
7. For GPT-5.4 upgrade reviews, always make the per-usage-site output explicit: target model, starting reasoning recommendation, `phase` assessment when relevant, prompt blocks, and compatibility status.
8. Answer with concise guidance and cite the doc source, using the reference files only as helper context.

## Reference map

Read only what you need:

- `references/latest-model.md` -> model-selection and "best/latest/current model" questions; verify every recommendation against current OpenAI docs before answering.
- `references/upgrading-to-gpt-5p4.md` -> only for explicit GPT-5.4 upgrade and upgrade-planning requests; verify the checklist and compatibility guidance against current OpenAI docs before answering.
- `references/gpt-5p4-prompting-guide.md` -> prompt rewrites and prompt-behavior upgrades for GPT-5.4; verify prompting guidance against current OpenAI docs before answering.

## Quality rules

- Treat OpenAI docs as the source of truth; avoid speculation.
- Keep quotes short and within policy limits; prefer paraphrase with citations.
- If multiple pages differ, call out the difference and cite both.
- Reference files are convenience guides only; for volatile guidance such as recommended models, upgrade instructions, or prompting advice, current OpenAI docs always win.
- If docs do not cover the user's need, say so and offer next steps.

## Tooling notes

- Always use MCP doc tools before any web search for OpenAI-related questions.
- If the MCP server is installed but returns no meaningful results, then use web search as a fallback.
- When falling back to web search, restrict to official OpenAI domains (developers.openai.com, platform.openai.com) and cite sources.
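The fallback restriction in the tooling notes can be sketched as a simple allow-list filter. This is an illustrative sketch only: the function name is hypothetical, and the two hosts are exactly the ones the skill lists.

```python
from urllib.parse import urlparse

# Official OpenAI docs hosts named in the skill's tooling notes.
ALLOWED_HOSTS = {"developers.openai.com", "platform.openai.com"}


def filter_official(urls):
    """Keep only search results hosted on official OpenAI docs domains."""
    return [u for u in urls if urlparse(u).hostname in ALLOWED_HOSTS]
```

Applied to a mixed result list, only the `developers.openai.com` and `platform.openai.com` entries survive; everything else is dropped before citing.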
skills/.system/openai-docs/agents/openai.yaml
ADDED
@@ -0,0 +1,14 @@
interface:
  display_name: "OpenAI Docs"
  short_description: "Reference official OpenAI docs, including upgrade guidance"
  icon_small: "./assets/openai-small.svg"
  icon_large: "./assets/openai.png"
  default_prompt: "Look up official OpenAI docs, load relevant GPT-5.4 upgrade references when applicable, and answer with concise, cited guidance."

dependencies:
  tools:
    - type: "mcp"
      value: "openaiDeveloperDocs"
      description: "OpenAI Developer Docs MCP server"
      transport: "streamable_http"
      url: "https://developers.openai.com/mcp"
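A quick sanity check for a manifest like the one above is a required-field test on the parsed mapping. This is a minimal sketch: which `interface` fields are actually mandatory is an assumption inferred from this file, not a documented schema, and the helper name is hypothetical.

```python
# Interface fields every manifest in this workspace appears to provide;
# treated here as required (an assumption, not an official schema).
REQUIRED_INTERFACE_FIELDS = {"display_name", "short_description", "icon_small", "icon_large"}


def missing_interface_fields(config):
    """Return required interface fields absent from a parsed openai.yaml mapping."""
    return REQUIRED_INTERFACE_FIELDS - set(config.get("interface", {}))


# A parsed form of the manifest shown above (as a plain dict).
manifest = {
    "interface": {
        "display_name": "OpenAI Docs",
        "short_description": "Reference official OpenAI docs, including upgrade guidance",
        "icon_small": "./assets/openai-small.svg",
        "icon_large": "./assets/openai.png",
    },
    "dependencies": {"tools": [{"type": "mcp", "value": "openaiDeveloperDocs"}]},
}
```

Running the check on the manifest above returns an empty set; a manifest that omits icons or the description would report exactly the missing keys.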
skills/.system/openai-docs/assets/openai-small.svg
ADDED
skills/.system/openai-docs/assets/openai.png
ADDED
skills/.system/openai-docs/references/gpt-5p4-prompting-guide.md
ADDED
@@ -0,0 +1,433 @@
# GPT-5.4 prompting upgrade guide

Use this guide when prompts written for older models need to be adapted for GPT-5.4 during an upgrade. Start lean: keep the model-string change narrow, preserve the original task intent, and add only the smallest prompt changes needed to recover behavior.

## Default upgrade posture

- Start with `model string only` whenever the old prompt is already short, explicit, and task-bounded.
- Move to `model string + light prompt rewrite` only when regressions appear in completeness, persistence, citation quality, verification, or verbosity.
- Prefer one or two targeted prompt additions over a broad rewrite.
- Treat reasoning effort as a last-mile knob. Start lower, then increase only after prompt-level fixes and evals.
- Before increasing reasoning effort, first add a completeness contract, a verification loop, and tool persistence rules, depending on the use case.
- If the workflow clearly depends on implementation changes rather than prompt changes, treat it as blocked for prompt-only upgrade guidance.
- Do not classify a case as blocked just because the workflow uses tools; block only if the upgrade requires changing tool definitions, wiring, or other implementation details.
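The posture rules above can be condensed into a small decision helper. This is illustrative only: the argument names, regression categories, and return labels are assumptions made for the sketch, not part of any official guidance.

```python
def upgrade_posture(prompt_is_lean, regressions, needs_impl_changes):
    """Pick the smallest GPT-5.4 upgrade step for one usage site."""
    if needs_impl_changes:
        # Requires tool definition or wiring changes, so it is blocked
        # for prompt-only upgrade guidance.
        return "blocked"
    if prompt_is_lean and not regressions:
        return "model string only"
    # Observed regressions (completeness, persistence, citation quality,
    # verification, verbosity) call for one or two targeted additions.
    return "model string + light prompt rewrite"
```

A lean prompt with no observed regressions keeps the model-string-only path; any regression tips the site into a light rewrite, and implementation dependencies block prompt-only guidance entirely.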

## Behavioral differences to account for

Current GPT-5.4 upgrade guidance suggests these strengths:

- stronger personality and tone adherence, with less drift over long answers
- better long-horizon and agentic workflow stamina
- stronger spreadsheet, finance, and formatting tasks
- more efficient tool selection and fewer unnecessary calls by default
- stronger structured generation and classification reliability

The main places where prompt guidance still helps are:

- retrieval-heavy workflows that need persistent tool use and explicit completeness
- research and citation discipline
- verification before irreversible or high-impact actions
- terminal and tool workflow hygiene
- defaults and implied follow-through
- verbosity control for compact, information-dense answers

Start with the smallest set of instructions that preserves correctness. Add the prompt blocks below only for workflows that actually need them.

## Prompt rewrite patterns

| Older prompt pattern | GPT-5.4 adjustment | Why | Example addition |
| --- | --- | --- | --- |
| Long, repetitive instructions that compensate for weaker instruction following | Remove duplicate scaffolding and keep only the constraints that materially change behavior | GPT-5.4 usually needs less repeated steering | Replace repeated reminders with one concise rule plus a verification block |
| Fast assistant prompt with no verbosity control | Keep the prompt as-is first; add a verbosity clamp only if outputs become too long | Many GPT-4o or GPT-4.1 upgrades work with just a model-string swap | Add `output_verbosity_spec` only after a verbosity regression |
| Tool-heavy agent prompt that assumes the model will keep searching until complete | Add persistence and verification rules | GPT-5.4 may use fewer tool calls by default for efficiency | Add `tool_persistence_rules` and `verification_loop` |
| Tool-heavy workflow where later actions depend on earlier lookup or retrieval | Add prerequisite and missing-context rules before action steps | GPT-5.4 benefits from explicit dependency-aware routing when context is still thin | Add `dependency_checks` and `missing_context_gating` |
| Retrieval workflow with several independent lookups | Add selective parallelism guidance | GPT-5.4 is strong at parallel tool use, but should not parallelize dependent steps | Add `parallel_tool_calling` |
| Batch workflow prompt that often misses items | Add an explicit completeness contract | Item accounting benefits from direct instruction | Add `completeness_contract` |
| Research prompt that needs grounding and citation discipline | Add research, citation, and empty-result recovery blocks | Multi-pass retrieval is stronger when the model is told how to react to weak or empty search results | Add `research_mode`, `citation_rules`, and `empty_result_handling`; add `tool_persistence_rules` when retrieval tools are already in use |
| Coding or terminal prompt with shell misuse or early stop failures | Keep the same tool surface and add terminal hygiene and verification instructions | Tool-using coding workflows are not blocked just because tools exist; they usually need better prompt steering, not host rewiring | Add `terminal_tool_hygiene` and `verification_loop`, optionally `tool_persistence_rules` |
| Multi-agent or support-triage workflow with escalation or completeness requirements | Add one lightweight control block for persistence, completeness, or verification | GPT-5.4 can be more efficient by default, so multi-step support flows benefit from an explicit completion or verification contract | Add at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop` |

## Prompt blocks

Use these selectively. Do not add all of them by default.

### `output_verbosity_spec`

Use when:

- the upgraded model gets too wordy
- the host needs compact, information-dense answers
- the workflow benefits from a short overview plus a checklist

```text
+
<output_verbosity_spec>
|
| 64 |
+
- Default: 3-6 sentences or up to 6 bullets.
|
| 65 |
+
- If the user asked for a doc or report, use headings with short bullets.
|
| 66 |
+
- For multi-step tasks:
|
| 67 |
+
- Start with 1 short overview paragraph.
|
| 68 |
+
- Then provide a checklist with statuses: [done], [todo], or [blocked].
|
| 69 |
+
- Avoid repeating the user's request.
|
| 70 |
+
- Prefer compact, information-dense writing.
|
| 71 |
+
</output_verbosity_spec>
|
| 72 |
+
```
|
| 73 |
+
|
| 74 |
+
### `default_follow_through_policy`
|
| 75 |
+
|
| 76 |
+
Use when:
|
| 77 |
+
|
| 78 |
+
- the host expects the model to proceed on reversible, low-risk steps
|
| 79 |
+
- the upgraded model becomes too conservative or asks for confirmation too often
|
| 80 |
+
|
| 81 |
+
```text
|
| 82 |
+
<default_follow_through_policy>
|
| 83 |
+
- If the user's intent is clear and the next step is reversible and low-risk, proceed without asking permission.
|
| 84 |
+
- Only ask permission if the next step is:
|
| 85 |
+
(a) irreversible,
|
| 86 |
+
(b) has external side effects, or
|
| 87 |
+
(c) requires missing sensitive information or a choice that materially changes outcomes.
|
| 88 |
+
- If proceeding, state what you did and what remains optional.
|
| 89 |
+
</default_follow_through_policy>
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
+
### `instruction_priority`
|
| 93 |
+
|
| 94 |
+
Use when:
|
| 95 |
+
|
| 96 |
+
- users often change task shape, format, or tone mid-conversation
|
| 97 |
+
- the host needs an explicit override policy instead of relying on defaults
|
| 98 |
+
|
| 99 |
+
```text
|
| 100 |
+
<instruction_priority>
|
| 101 |
+
- User instructions override default style, tone, formatting, and initiative preferences.
|
| 102 |
+
- Safety, honesty, privacy, and permission constraints do not yield.
|
| 103 |
+
- If a newer user instruction conflicts with an earlier one, follow the newer instruction.
|
| 104 |
+
- Preserve earlier instructions that do not conflict.
|
| 105 |
+
</instruction_priority>
|
| 106 |
+
```
|
| 107 |
+
|
| 108 |
+
### `tool_persistence_rules`
|
| 109 |
+
|
| 110 |
+
Use when:
|
| 111 |
+
|
| 112 |
+
- the workflow needs multiple retrieval or verification steps
|
| 113 |
+
- the model starts stopping too early because it is trying to save tool calls
|
| 114 |
+
|
| 115 |
+
```text
|
| 116 |
+
<tool_persistence_rules>
|
| 117 |
+
- Use tools whenever they materially improve correctness, completeness, or grounding.
|
| 118 |
+
- Do not stop early just to save tool calls.
|
| 119 |
+
- Keep calling tools until:
|
| 120 |
+
(1) the task is complete, and
|
| 121 |
+
(2) verification passes.
|
| 122 |
+
- If a tool returns empty or partial results, retry with a different strategy.
|
| 123 |
+
</tool_persistence_rules>
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
### `dig_deeper_nudge`
|
| 127 |
+
|
| 128 |
+
Use when:
|
| 129 |
+
|
| 130 |
+
- the model is too literal or stops at the first plausible answer
|
| 131 |
+
- the task is safety- or accuracy-sensitive and needs a small initiative nudge before raising reasoning effort
|
| 132 |
+
|
| 133 |
+
```text
|
| 134 |
+
<dig_deeper_nudge>
|
| 135 |
+
- Do not stop at the first plausible answer.
|
| 136 |
+
- Look for second-order issues, edge cases, and missing constraints.
|
| 137 |
+
- If the task is safety- or accuracy-critical, perform at least one verification step.
|
| 138 |
+
</dig_deeper_nudge>
|
| 139 |
+
```
|
| 140 |
+
|
| 141 |
+
### `dependency_checks`
|
| 142 |
+
|
| 143 |
+
Use when:
|
| 144 |
+
|
| 145 |
+
- later actions depend on prerequisite lookup, memory retrieval, or discovery steps
|
| 146 |
+
- the model may be tempted to skip prerequisite work because the intended end state seems obvious
|
| 147 |
+
|
| 148 |
+
```text
|
| 149 |
+
<dependency_checks>
|
| 150 |
+
- Before taking an action, check whether prerequisite discovery, lookup, or memory retrieval is required.
|
| 151 |
+
- Do not skip prerequisite steps just because the intended final action seems obvious.
|
| 152 |
+
- If a later step depends on the output of an earlier one, resolve that dependency first.
|
| 153 |
+
</dependency_checks>
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
### `parallel_tool_calling`
|
| 157 |
+
|
| 158 |
+
Use when:
|
| 159 |
+
|
| 160 |
+
- the workflow has multiple independent retrieval steps
|
| 161 |
+
- wall-clock time matters but some steps still need sequencing
|
| 162 |
+
|
| 163 |
+
```text
|
| 164 |
+
<parallel_tool_calling>
|
| 165 |
+
- When multiple retrieval or lookup steps are independent, prefer parallel tool calls to reduce wall-clock time.
|
| 166 |
+
- Do not parallelize steps with prerequisite dependencies or where one result determines the next action.
|
| 167 |
+
- After parallel retrieval, pause to synthesize before making more calls.
|
| 168 |
+
- Prefer selective parallelism: parallelize independent evidence gathering, not speculative or redundant tool use.
|
| 169 |
+
</parallel_tool_calling>
|
| 170 |
+
```
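The same selective-parallelism rule applies host-side when the application itself fans out independent lookups before handing results to the model. A minimal sketch; the two lookup coroutines are hypothetical placeholders:

```python
import asyncio

async def fetch_orders(user_id: str) -> list[str]:
    # Hypothetical independent lookup.
    await asyncio.sleep(0)
    return [f"order-1-{user_id}", f"order-2-{user_id}"]

async def fetch_tickets(user_id: str) -> list[str]:
    # Hypothetical independent lookup.
    await asyncio.sleep(0)
    return [f"ticket-9-{user_id}"]

async def gather_context(user_id: str) -> dict:
    # Independent retrievals run in parallel; anything that depends on
    # these results waits until after the gather (the synthesis pause).
    orders, tickets = await asyncio.gather(
        fetch_orders(user_id), fetch_tickets(user_id)
    )
    return {"orders": orders, "tickets": tickets}

context = asyncio.run(gather_context("u42"))
```

Dependent steps, such as a lookup keyed on one of these results, stay sequential after the gather.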
### `completeness_contract`

Use when:

- the task involves batches, lists, enumerations, or multiple deliverables
- missing items are a common failure mode

```text
<completeness_contract>
- Deliver all requested items.
- Maintain an itemized checklist of deliverables.
- For lists or batches:
  - state the expected count,
  - enumerate items 1..N,
  - confirm that none are missing before finalizing.
- If any item is blocked by missing data, mark it [blocked] and state exactly what is missing.
</completeness_contract>
```
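The same item accounting can be enforced host-side before accepting a batched answer. A minimal sketch, assuming deliverables are keyed by item name; the names here are illustrative:

```python
def audit_batch(expected: list[str], delivered: dict[str, str]) -> dict[str, list[str]]:
    # Compare the expected item list against what was actually returned,
    # mirroring the contract's "enumerate 1..N, confirm none are missing" step.
    missing = [item for item in expected if item not in delivered]
    blocked = [item for item, value in delivered.items() if value == "[blocked]"]
    return {"missing": missing, "blocked": blocked}

report = audit_batch(
    expected=["summary", "risks", "next-steps"],
    delivered={"summary": "...", "risks": "[blocked]"},
)
```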
### `empty_result_handling`

Use when:

- the workflow frequently performs search, CRM, log, or retrieval steps
- no-results failures are often false negatives

```text
<empty_result_handling>
If a lookup returns empty or suspiciously small results:
- Do not immediately conclude that no results exist.
- Try at least 2 fallback strategies, such as a broader query, alternate filters, or another source.
- Only then report that no results were found, along with what you tried.
</empty_result_handling>
```
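The same recovery loop can be sketched host-side. Here `search` is a hypothetical retrieval callable and the rewrites stand in for broader queries or alternate filters:

```python
def search_with_fallbacks(query, search, rewrites, min_results=1):
    # Try the original query first, then each fallback rewrite,
    # and report what was tried alongside the final results.
    tried = [query]
    results = search(query)
    for rewrite in rewrites:
        if len(results) >= min_results:
            break
        fallback_query = rewrite(query)
        tried.append(fallback_query)
        results = search(fallback_query)
    return results, tried

# Toy index: only the broadened query matches anything.
index = {"error 500": [], "500": ["incident-231"]}
results, tried = search_with_fallbacks(
    "error 500",
    search=lambda q: index.get(q, []),
    rewrites=[lambda q: q.split()[-1]],
)
```

Reporting `tried` alongside the results matches the block's "along with what you tried" requirement.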
### `verification_loop`

Use when:

- the workflow has downstream impact
- accuracy, formatting, or completeness regressions matter

```text
<verification_loop>
Before finalizing:
- Check correctness: does the output satisfy every requirement?
- Check grounding: are factual claims backed by retrieved sources or tool output?
- Check formatting: does the output match the requested schema or style?
- Check safety and irreversibility: if the next step has external side effects, ask permission first.
</verification_loop>
```

### `missing_context_gating`

Use when:

- required context is sometimes missing early in the workflow
- the model should prefer retrieval over guessing

```text
<missing_context_gating>
- If required context is missing, do not guess.
- Prefer the appropriate lookup tool when the context is retrievable; ask a minimal clarifying question only when it is not.
- If you must proceed, label assumptions explicitly and choose a reversible action.
</missing_context_gating>
```

### `action_safety`

Use when:

- the agent will actively take actions through tools
- the host benefits from a short pre-flight and post-flight execution frame

```text
<action_safety>
- Pre-flight: summarize the intended action and parameters in 1-2 lines.
- Execute via tool.
- Post-flight: confirm the outcome and any validation that was performed.
</action_safety>
```
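A host can mirror the same pre-flight/post-flight frame around its own tool handlers. A minimal sketch; the wrapped action is a hypothetical placeholder:

```python
def with_action_safety(action):
    # Wrap a tool handler so every call records a pre-flight summary
    # and a post-flight confirmation alongside the result.
    def run(**params):
        log = [f"pre-flight: {action.__name__} {params}"]
        result = action(**params)
        log.append(f"post-flight: {action.__name__} -> {result}")
        return result, log
    return run

@with_action_safety
def close_ticket(ticket_id: str) -> str:
    # Hypothetical side-effecting action.
    return f"closed {ticket_id}"

result, log = close_ticket(ticket_id="T-17")
```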
### `citation_rules`

Use when:

- the workflow produces cited answers
- fabricated citations or wrong citation formats are costly

```text
<citation_rules>
- Only cite sources that were actually retrieved in this session.
- Never fabricate citations, URLs, IDs, or quote spans.
- If you cannot find a source for a claim, say so and either:
  - soften the claim, or
  - explain how to verify it with tools.
- Use exactly the citation format required by the host application.
</citation_rules>
```

### `research_mode`

Use when:

- the workflow is research-heavy
- the host uses web search or retrieval tools

```text
<research_mode>
- Do research in 3 passes:
  1) Plan: list 3-6 sub-questions to answer.
  2) Retrieve: search each sub-question and follow 1-2 second-order leads.
  3) Synthesize: resolve contradictions and write the final answer with citations.
- Stop only when more searching is unlikely to change the conclusion.
</research_mode>
```

If your host environment uses a specific research tool or requires a submit step, combine this with the host's finalization contract.

### `structured_output_contract`

Use when:

- the host depends on strict JSON, SQL, or other structured output

```text
<structured_output_contract>
- Output only the requested format.
- Do not add prose or markdown fences unless they were requested.
- Validate that parentheses and brackets are balanced.
- Do not invent tables or fields.
- If required schema information is missing, ask for it or return an explicit error object.
</structured_output_contract>
```
### `bbox_extraction_spec`

Use when:

- the workflow extracts OCR boxes, document regions, or other coordinates
- layout drift or missed dense regions are common failure modes

```text
<bbox_extraction_spec>
- Use the specified coordinate format exactly, such as [x1,y1,x2,y2] normalized to 0..1.
- For each box, include page, label, text snippet, and confidence.
- Add a vertical-drift sanity check so boxes stay aligned with the correct line of text.
- If the layout is dense, process page by page and do a second pass for missed items.
</bbox_extraction_spec>
```

### `terminal_tool_hygiene`

Use when:

- the prompt belongs to a terminal-based or coding-agent workflow
- tool misuse or shell misuse has been observed

```text
<terminal_tool_hygiene>
- Only run shell commands through the terminal tool.
- Never try to "run" tool names as shell commands.
- If a patch or edit tool exists, use it directly instead of emulating it in bash.
- After changes, run a lightweight verification step such as ls, tests, or a build before declaring the task done.
</terminal_tool_hygiene>
```

### `user_updates_spec`

Use when:

- the workflow is long-running and user updates matter

```text
<user_updates_spec>
- Only update the user when starting a new major phase or when the plan changes.
- Each update should contain:
  - 1 sentence on what changed,
  - 1 sentence on the next step.
- Do not narrate routine tool calls.
- Keep the user-facing update short, even when the actual work is exhaustive.
</user_updates_spec>
```

If you are using [Compaction](https://developers.openai.com/api/docs/guides/compaction) in the Responses API, compact after major milestones, treat compacted items as opaque state, and keep prompts functionally identical after compaction.

## Responses `phase` guidance

For long-running Responses workflows, preambles, or tool-heavy agents that replay assistant items, review whether `phase` is already preserved.

- If the host already round-trips `phase`, keep it intact during the upgrade.
- If the host uses `previous_response_id` and does not manually replay assistant items, note that this may reduce manual `phase` handling needs.
- If reliable GPT-5.4 behavior would require adding or preserving `phase` and that would need code edits, treat the case as blocked for prompt-only or model-string-only migration guidance.
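Whether `phase` survives usually comes down to how the host rebuilds the next request's input. A minimal sketch, assuming the host replays assistant items as plain dicts; the item shapes here are illustrative, not the exact API schema:

```python
def replay_assistant_items(items: list[dict]) -> list[dict]:
    # Round-trip items verbatim: copy every field, including "phase",
    # instead of reconstructing items from just role and content.
    return [dict(item) for item in items]

prior_items = [
    {"role": "assistant", "content": "Scanning the repo first.", "phase": "preamble"},
    {"role": "assistant", "content": "Here is the plan.", "phase": "final"},
]
replayed = replay_assistant_items(prior_items)
```

A host that rebuilds items from selected fields instead of copying them whole is the typical place `phase` gets dropped.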
## Example upgrade profiles

### GPT-5.2

- Use `gpt-5.4`
- Match the current reasoning effort first
- Preserve the existing latency and quality profile before tuning prompt blocks
- If the repo does not expose the exact setting, emit `same` as the starting recommendation

### GPT-5.3-Codex

- Use `gpt-5.4`
- Match the current reasoning effort first
- If you need Codex-style speed and efficiency, add verification blocks before increasing reasoning effort
- If the repo does not expose the exact setting, emit `same` as the starting recommendation

### GPT-4o or GPT-4.1 assistant

- Use `gpt-5.4`
- Start with `none` reasoning effort
- Add `output_verbosity_spec` only if output becomes too verbose
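For this profile the change set is small enough to express as a request diff. A minimal sketch, with illustrative payload shapes rather than the exact client signature:

```python
# Old GPT-4o-style request, as the host might already build it.
old_request = {
    "model": "gpt-4o",
    "input": "Summarize this ticket in two sentences.",
}

# Narrowest upgrade: swap the model string and start with no reasoning,
# keeping the prompt untouched.
upgraded_request = {
    **old_request,
    "model": "gpt-5.4",
    "reasoning": {"effort": "none"},
}
```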
### Long-horizon agent

- Use `gpt-5.4`
- Start with `medium` reasoning effort
- Add `tool_persistence_rules`
- Add `completeness_contract`
- Add `verification_loop`

### Research workflow

- Use `gpt-5.4`
- Start with `medium` reasoning effort
- Add `research_mode`
- Add `citation_rules`
- Add `empty_result_handling`
- Add `tool_persistence_rules` when the host already uses web or retrieval tools
- Add `parallel_tool_calling` when the retrieval steps are independent

### Support triage or multi-agent workflow

- Use `gpt-5.4`
- Prefer `model string + light prompt rewrite` over `model string only`
- Add at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop`
- Add more only if evals show a real regression

### Coding or terminal workflow

- Use `gpt-5.4`
- Keep the model-string change narrow
- Match the current reasoning effort first if you are upgrading from GPT-5.3-Codex
- Add `terminal_tool_hygiene`
- Add `verification_loop`
- Add `dependency_checks` when actions depend on prerequisite lookup or discovery
- Add `tool_persistence_rules` if the agent stops too early
- Review whether `phase` is already preserved for long-running Responses flows or assistant preambles
- Do not classify this as blocked just because the workflow uses tools; block only if the upgrade requires changing tool definitions or wiring
- If the repo already uses Responses plus tools and no required host-side change is shown, prefer `model_string_plus_light_prompt_rewrite` over `blocked`

## Prompt regression checklist

- Check whether the upgraded prompt still preserves the original task intent.
- Check whether the new prompt is leaner, not just longer.
- Check completeness, citation quality, dependency handling, verification behavior, and verbosity.
- For long-running Responses agents, check whether `phase` handling is already in place or needs implementation work.
- Confirm that each added prompt block addresses an observed regression.
- Remove prompt blocks that are not earning their keep.
skills/.system/openai-docs/references/latest-model.md
ADDED
@@ -0,0 +1,35 @@

# Latest model guide

This file is a curated helper. Every recommendation here must be verified against current OpenAI docs before it is repeated to a user.

## Current model map

| Model ID | Use for |
| --- | --- |
| `gpt-5.4` | Default text plus reasoning for most new apps |
| `gpt-5.4-pro` | Only when the user explicitly asks for maximum reasoning or quality; substantially slower and more expensive |
| `gpt-5-mini` | Cheaper and faster reasoning with good quality |
| `gpt-5-nano` | High-throughput simple tasks and classification |
| `gpt-5.4` | Explicit no-reasoning text path via `reasoning.effort: none` |
| `gpt-4.1-mini` | Cheaper no-reasoning text |
| `gpt-4.1-nano` | Fastest and cheapest no-reasoning text |
| `gpt-5.3-codex` | Agentic coding, code editing, and tool-heavy coding workflows |
| `gpt-5.1-codex-mini` | Cheaper coding workflows |
| `gpt-image-1.5` | Best image generation and edit quality |
| `gpt-image-1-mini` | Cost-optimized image generation |
| `gpt-4o-mini-tts` | Text-to-speech |
| `gpt-4o-mini-transcribe` | Speech-to-text, fast and cost-efficient |
| `gpt-realtime-1.5` | Realtime voice and multimodal sessions |
| `gpt-realtime-mini` | Cheaper realtime sessions |
| `gpt-audio` | Chat Completions audio input and output |
| `gpt-audio-mini` | Cheaper Chat Completions audio workflows |
| `sora-2` | Faster iteration and draft video generation |
| `sora-2-pro` | Higher-quality production video |
| `omni-moderation-latest` | Text and image moderation |
| `text-embedding-3-large` | Higher-quality retrieval embeddings; default in this skill because no best-specific row exists |
| `text-embedding-3-small` | Lower-cost embeddings |

## Maintenance notes

- This file will drift unless it is periodically re-verified against current OpenAI docs.
- If this file conflicts with current docs, the docs win.
skills/.system/openai-docs/references/upgrading-to-gpt-5p4.md
ADDED
@@ -0,0 +1,164 @@

# Upgrading to GPT-5.4

Use this guide when the user explicitly asks to upgrade an existing integration to GPT-5.4. Pair it with current OpenAI docs lookups. The default target string is `gpt-5.4`.

## Upgrade posture

Upgrade with the narrowest safe change set:

- replace the model string first
- update only the prompts that are directly tied to that model usage
- prefer prompt-only upgrades when possible
- if the upgrade would require API-surface changes, parameter rewrites, tool rewiring, or broader code edits, mark it as blocked instead of stretching the scope

## Upgrade workflow

1. Inventory current model usage.
   - Search for model strings, client calls, and prompt-bearing files.
   - Include inline prompts, prompt templates, YAML or JSON configs, Markdown docs, and saved prompts when they are clearly tied to a model usage site.
2. Pair each model usage with its prompt surface.
   - Prefer the closest prompt surface first: inline system or developer text, then adjacent prompt files, then shared templates.
   - If you cannot confidently tie a prompt to the model usage, say so instead of guessing.
3. Classify the source model family.
   - Common buckets: `gpt-4o` or `gpt-4.1`, `o1` or `o3` or `o4-mini`, early `gpt-5`, later `gpt-5.x`, or mixed and unclear.
4. Decide the upgrade class.
   - `model string only`
   - `model string + light prompt rewrite`
   - `blocked without code changes`
5. Run the no-code compatibility gate.
   - Check whether the current integration can accept `gpt-5.4` without API-surface changes or implementation changes.
   - For long-running Responses or tool-heavy agents, check whether `phase` is already preserved or round-tripped when the host replays assistant items or uses preambles.
   - If compatibility depends on code changes, return `blocked`.
   - If compatibility is unclear, return `unknown` rather than improvising.
6. Recommend the upgrade.
   - Default replacement string: `gpt-5.4`
   - Keep the intervention small and behavior-preserving.
7. Deliver a structured recommendation.
   - `Current model usage`
   - `Recommended model-string updates`
   - `Starting reasoning recommendation`
   - `Prompt updates`
   - `Phase assessment` when the flow is long-running, replayed, or tool-heavy
   - `No-code compatibility check`
   - `Validation plan`
   - `Launch-day refresh items`
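Step 1 of the workflow above can be sketched as a quick repo scan. The file extensions and model-string pattern are assumptions to adapt per repo:

```python
import re
from pathlib import Path

# Assumed pattern for likely model-string usage sites; widen or narrow per repo.
MODEL_PATTERN = re.compile(r"gpt-4o|gpt-4\.1|gpt-5[.\d-]*|o[134]-?\w*")
PROMPT_BEARING_SUFFIXES = {".py", ".ts", ".yaml", ".yml", ".json", ".md"}

def inventory_model_usage(root: str) -> list[tuple[str, int, str]]:
    # Return (path, line number, line) for every candidate usage site.
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in PROMPT_BEARING_SUFFIXES or not path.is_file():
            continue
        for lineno, line in enumerate(
            path.read_text(errors="ignore").splitlines(), start=1
        ):
            if MODEL_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

The output doubles as the `Current model usage` section of the structured recommendation.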
Output rule:

- Always emit a starting `reasoning_effort_recommendation` for each usage site.
- If the repo exposes the current reasoning setting, preserve it first unless the source guide says otherwise.
- If the repo does not expose the current setting, use the source-family starting mapping instead of returning `null`.

## Upgrade outcomes

### `model string only`

Choose this when:

- the existing prompts are already short, explicit, and task-bounded
- the workflow is not strongly research-heavy, tool-heavy, multi-agent, batch- or completeness-sensitive, or long-horizon
- there are no obvious compatibility blockers

Default action:

- replace the model string with `gpt-5.4`
- keep prompts unchanged
- validate behavior with existing evals or spot checks

### `model string + light prompt rewrite`

Choose this when:

- the old prompt was compensating for weaker instruction following
- the workflow needs more persistence than the default tool-use behavior will likely provide
- the task needs stronger completeness, citation discipline, or verification
- the upgraded model becomes too verbose or under-complete unless instructed otherwise
- the workflow is research-heavy and needs stronger handling of sparse or empty retrieval results
- the workflow is coding-oriented, tool-heavy, or multi-agent, but the existing API surface and tool definitions can remain unchanged

Default action:

- replace the model string with `gpt-5.4`
- add one or two targeted prompt blocks
- read `references/gpt-5p4-prompting-guide.md` to choose the smallest prompt changes that recover the old behavior
- avoid broad prompt cleanup unrelated to the upgrade
- for research workflows, default to `research_mode` + `citation_rules` + `empty_result_handling`; add `tool_persistence_rules` when the host already uses retrieval tools
- for dependency-aware or tool-heavy workflows, default to `tool_persistence_rules` + `dependency_checks` + `verification_loop`; add `parallel_tool_calling` only when retrieval steps are truly independent
- for coding or terminal workflows, default to `terminal_tool_hygiene` + `verification_loop`
- for multi-agent support or triage workflows, default to at least one of `tool_persistence_rules`, `completeness_contract`, or `verification_loop`
- for long-running Responses agents with preambles or multiple assistant messages, explicitly review whether `phase` is already handled; if adding or preserving `phase` would require code edits, mark the path as `blocked`
- do not classify a coding or tool-using Responses workflow as `blocked` just because the visible snippet is minimal; prefer `model string + light prompt rewrite` unless the repo clearly shows that a safe GPT-5.4 path would require host-side code changes

### `blocked`

Choose this when:

- the upgrade appears to require API-surface changes
- the upgrade appears to require parameter rewrites or reasoning-setting changes that are not exposed outside implementation code
- the upgrade would require changing tool definitions, tool handler wiring, or schema contracts
- you cannot confidently identify the prompt surface tied to the model usage

Default action:

- do not improvise a broader upgrade
- report the blocker and explain that the fix is out of scope for this guide

## No-code compatibility checklist

Before recommending a no-code upgrade, check:

1. Can the current host accept the `gpt-5.4` model string without changing client code or API surface?
2. Are the related prompts identifiable and editable?
3. Does the host depend on behavior that likely needs API-surface changes, parameter rewrites, or tool rewiring?
4. Would the likely fix be prompt-only, or would it need implementation changes?
5. Is the prompt surface close enough to the model usage that you can make a targeted change instead of a broad cleanup?
6. For long-running Responses or tool-heavy agents, is `phase` already preserved if the host relies on preambles, replayed assistant items, or multiple assistant messages?

If item 1 is no, if items 3 or 4 point to implementation work, or if item 6 is no and the fix needs code changes, return `blocked`.

If item 2 is no, return `unknown` unless the user can point to the prompt location.
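The checklist collapses into a small decision function. This sketches only the routing; the answers themselves still come from inspecting the repo:

```python
def classify_upgrade(
    accepts_model_string: bool,   # item 1
    prompts_identifiable: bool,   # item 2
    needs_implementation: bool,   # items 3-4
    phase_preserved: bool,        # item 6 (treat as True when not applicable)
) -> str:
    # Route checklist answers to an upgrade outcome.
    if not accepts_model_string or needs_implementation or not phase_preserved:
        return "blocked"
    if not prompts_identifiable:
        return "unknown"
    return "model string only / light prompt rewrite"

outcome = classify_upgrade(
    accepts_model_string=True,
    prompts_identifiable=True,
    needs_implementation=False,
    phase_preserved=True,
)
```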
Important:

- Existing use of tools, agents, or multiple usage sites is not by itself a blocker.
- If the current host can keep the same API surface and the same tool definitions, prefer `model string + light prompt rewrite` over `blocked`.
- Reserve `blocked` for cases that truly require implementation changes, not cases that only need stronger prompt steering.

## Scope boundaries

This guide may:

- update or recommend updated model strings
- update or recommend updated prompts
- inspect code and prompt files to understand where those changes belong
- inspect whether existing Responses flows already preserve `phase`
- flag compatibility blockers

This guide may not:

- move Chat Completions code to Responses
- move Responses code to another API surface
- rewrite parameter shapes
- change tool definitions or tool-call handling
- change structured-output wiring
- add or retrofit `phase` handling in implementation code
- edit business logic, orchestration logic, or SDK usage beyond a literal model-string replacement

If a safe GPT-5.4 upgrade requires any of those changes, mark the path as blocked and out of scope.

## Validation plan

- Validate each upgraded usage site with existing evals or realistic spot checks.
- Check whether the upgraded model still matches expected latency, output shape, and quality.
- If prompt edits were added, confirm each block is doing real work instead of adding noise.
- If the workflow has downstream impact, add a lightweight verification pass before finalization.

## Launch-day refresh items

When final GPT-5.4 guidance changes:

1. Replace release-candidate assumptions with final GPT-5.4 guidance where appropriate.
2. Re-check whether the default target string should stay `gpt-5.4` for all source families.
3. Re-check any prompt-block recommendations whose semantics may have changed.
4. Re-check research, citation, and compatibility guidance against the final model behavior.
5. Re-run the same upgrade scenarios and confirm the blocked-versus-viable boundaries still hold.

skills/.system/skill-creator/SKILL.md
ADDED
@@ -0,0 +1,413 @@
---
name: skill-creator
description: Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Codex's capabilities with specialized knowledge, workflows, or tool integrations.
metadata:
  short-description: Create or update a skill
---

# Skill Creator

This skill provides guidance for creating effective skills.

## About Skills

Skills are modular, self-contained folders that extend Codex's capabilities by providing specialized knowledge, workflows, and tools. Think of them as "onboarding guides" for specific domains or tasks—they transform Codex from a general-purpose agent into a specialized agent equipped with procedural knowledge that no model can fully possess.

### What Skills Provide

1. Specialized workflows - Multi-step procedures for specific domains
2. Tool integrations - Instructions for working with specific file formats or APIs
3. Domain expertise - Company-specific knowledge, schemas, business logic
4. Bundled resources - Scripts, references, and assets for complex and repetitive tasks

## Core Principles

### Concise is Key

The context window is a public good. Skills share the context window with everything else Codex needs: system prompt, conversation history, other Skills' metadata, and the actual user request.

**Default assumption: Codex is already very smart.** Only add context Codex doesn't already have. Challenge each piece of information: "Does Codex really need this explanation?" and "Does this paragraph justify its token cost?"

Prefer concise examples over verbose explanations.

### Set Appropriate Degrees of Freedom

Match the level of specificity to the task's fragility and variability:

**High freedom (text-based instructions)**: Use when multiple approaches are valid, decisions depend on context, or heuristics guide the approach.

**Medium freedom (pseudocode or scripts with parameters)**: Use when a preferred pattern exists, some variation is acceptable, or configuration affects behavior.

**Low freedom (specific scripts, few parameters)**: Use when operations are fragile and error-prone, consistency is critical, or a specific sequence must be followed.

Think of Codex as exploring a path: a narrow bridge with cliffs needs specific guardrails (low freedom), while an open field allows many routes (high freedom).

### Protect Validation Integrity

You may use subagents during iteration to validate whether a skill works on realistic tasks or whether a suspected problem is real. This is most useful when you want an independent pass on the skill's behavior, outputs, or failure modes after a revision. Only do this when it is possible to start new subagents.

When using subagents for validation, treat that as an evaluation surface. The goal is to learn whether the skill generalizes, not whether another agent can reconstruct the answer from leaked context.

Prefer raw artifacts such as example prompts, outputs, diffs, logs, or traces. Give the minimum task-local context needed to perform the validation. Avoid passing the intended answer, suspected bug, intended fix, or your prior conclusions unless the validation explicitly requires them.

### Anatomy of a Skill

Every skill consists of a required SKILL.md file and optional bundled resources:

```
skill-name/
├── SKILL.md (required)
│   ├── YAML frontmatter metadata (required)
│   │   ├── name: (required)
│   │   └── description: (required)
│   └── Markdown instructions (required)
├── agents/ (recommended)
│   └── openai.yaml - UI metadata for skill lists and chips
└── Bundled Resources (optional)
    ├── scripts/ - Executable code (Python/Bash/etc.)
    ├── references/ - Documentation intended to be loaded into context as needed
    └── assets/ - Files used in output (templates, icons, fonts, etc.)
```

#### SKILL.md (required)

Every SKILL.md consists of:

- **Frontmatter** (YAML): Contains `name` and `description` fields. These are the only fields that Codex reads to determine when the skill gets used, so it is very important to describe clearly and comprehensively what the skill is and when it should be used.
- **Body** (Markdown): Instructions and guidance for using the skill. Only loaded AFTER the skill triggers (if at all).

#### Agents metadata (recommended)

- UI-facing metadata for skill lists and chips
- Read references/openai_yaml.md before generating values and follow its descriptions and constraints
- Create: human-facing `display_name`, `short_description`, and `default_prompt` by reading the skill
- Generate deterministically by passing the values as `--interface key=value` to `scripts/generate_openai_yaml.py` or `scripts/init_skill.py`
- On updates: validate `agents/openai.yaml` still matches SKILL.md; regenerate if stale
- Only include other optional interface fields (icons, brand color) if explicitly provided
- See references/openai_yaml.md for field definitions and examples

#### Bundled Resources (optional)

##### Scripts (`scripts/`)

Executable code (Python/Bash/etc.) for tasks that require deterministic reliability or are repeatedly rewritten.

- **When to include**: When the same code is being rewritten repeatedly or deterministic reliability is needed
- **Example**: `scripts/rotate_pdf.py` for PDF rotation tasks
- **Benefits**: Token efficient, deterministic, may be executed without loading into context
- **Note**: Scripts may still need to be read by Codex for patching or environment-specific adjustments

##### References (`references/`)

Documentation and reference material intended to be loaded as needed into context to inform Codex's process and thinking.

- **When to include**: For documentation that Codex should reference while working
- **Examples**: `references/finance.md` for financial schemas, `references/mnda.md` for company NDA template, `references/policies.md` for company policies, `references/api_docs.md` for API specifications
- **Use cases**: Database schemas, API documentation, domain knowledge, company policies, detailed workflow guides
- **Benefits**: Keeps SKILL.md lean, loaded only when Codex determines it's needed
- **Best practice**: If files are large (>10k words), include grep search patterns in SKILL.md
- **Avoid duplication**: Information should live in either SKILL.md or references files, not both. Prefer references files for detailed information unless it's truly core to the skill—this keeps SKILL.md lean while making information discoverable without hogging the context window. Keep only essential procedural instructions and workflow guidance in SKILL.md; move detailed reference material, schemas, and examples to references files.

##### Assets (`assets/`)

Files not intended to be loaded into context, but rather used within the output Codex produces.

- **When to include**: When the skill needs files that will be used in the final output
- **Examples**: `assets/logo.png` for brand assets, `assets/slides.pptx` for PowerPoint templates, `assets/frontend-template/` for HTML/React boilerplate, `assets/font.ttf` for typography
- **Use cases**: Templates, images, icons, boilerplate code, fonts, sample documents that get copied or modified
- **Benefits**: Separates output resources from documentation, enables Codex to use files without loading them into context

#### What Not to Include in a Skill

A skill should only contain essential files that directly support its functionality. Do NOT create extraneous documentation or auxiliary files, including:

- README.md
- INSTALLATION_GUIDE.md
- QUICK_REFERENCE.md
- CHANGELOG.md
- etc.

The skill should only contain the information needed for an AI agent to do the job at hand. It should not contain auxiliary context about the process that went into creating it, setup and testing procedures, user-facing documentation, etc. Creating additional documentation files just adds clutter and confusion.

### Progressive Disclosure Design Principle

Skills use a three-level loading system to manage context efficiently:

1. **Metadata (name + description)** - Always in context (~100 words)
2. **SKILL.md body** - When skill triggers (<5k words)
3. **Bundled resources** - As needed by Codex (unlimited, because scripts can be executed without being read into the context window)

#### Progressive Disclosure Patterns

Keep SKILL.md body to the essentials and under 500 lines to minimize context bloat. Split content into separate files when approaching this limit. When splitting out content into other files, it is very important to reference them from SKILL.md and describe clearly when to read them, to ensure the reader of the skill knows they exist and when to use them.

**Key principle:** When a skill supports multiple variations, frameworks, or options, keep only the core workflow and selection guidance in SKILL.md. Move variant-specific details (patterns, examples, configuration) into separate reference files.

**Pattern 1: High-level guide with references**

```markdown
# PDF Processing

## Quick start

Extract text with pdfplumber:
[code example]

## Advanced features

- **Form filling**: See [FORMS.md](FORMS.md) for complete guide
- **API reference**: See [REFERENCE.md](REFERENCE.md) for all methods
- **Examples**: See [EXAMPLES.md](EXAMPLES.md) for common patterns
```

Codex loads FORMS.md, REFERENCE.md, or EXAMPLES.md only when needed.

**Pattern 2: Domain-specific organization**

For Skills with multiple domains, organize content by domain to avoid loading irrelevant context:

```
bigquery-skill/
├── SKILL.md (overview and navigation)
└── reference/
    ├── finance.md (revenue, billing metrics)
    ├── sales.md (opportunities, pipeline)
    ├── product.md (API usage, features)
    └── marketing.md (campaigns, attribution)
```

When a user asks about sales metrics, Codex only reads sales.md.

Similarly, for skills supporting multiple frameworks or variants, organize by variant:

```
cloud-deploy/
├── SKILL.md (workflow + provider selection)
└── references/
    ├── aws.md (AWS deployment patterns)
    ├── gcp.md (GCP deployment patterns)
    └── azure.md (Azure deployment patterns)
```

When the user chooses AWS, Codex only reads aws.md.

**Pattern 3: Conditional details**

Show basic content, link to advanced content:

```markdown
# DOCX Processing

## Creating documents

Use docx-js for new documents. See [DOCX-JS.md](DOCX-JS.md).

## Editing documents

For simple edits, modify the XML directly.

**For tracked changes**: See [REDLINING.md](REDLINING.md)
**For OOXML details**: See [OOXML.md](OOXML.md)
```

Codex reads REDLINING.md or OOXML.md only when the user needs those features.

**Important guidelines:**

- **Avoid deeply nested references** - Keep references one level deep from SKILL.md. All reference files should link directly from SKILL.md.
- **Structure longer reference files** - For files longer than 100 lines, include a table of contents at the top so Codex can see the full scope when previewing.

## Skill Creation Process

Skill creation involves these steps:

1. Understand the skill with concrete examples
2. Plan reusable skill contents (scripts, references, assets)
3. Initialize the skill (run init_skill.py)
4. Edit the skill (implement resources and write SKILL.md)
5. Validate the skill (run quick_validate.py)
6. Iterate based on real usage and forward-test complex skills

Follow these steps in order, skipping only if there is a clear reason why they are not applicable.

### Skill Naming

- Use lowercase letters, digits, and hyphens only; normalize user-provided titles to hyphen-case (e.g., "Plan Mode" -> `plan-mode`).
- When generating names, keep them under 64 characters (letters, digits, hyphens).
- Prefer short, verb-led phrases that describe the action.
- Namespace by tool when it improves clarity or triggering (e.g., `gh-address-comments`, `linear-address-issue`).
- Name the skill folder exactly after the skill name.
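The normalization rule above can be sketched as a small helper. This is a hypothetical illustration, not one of the bundled scripts:

```python
import re


def normalize_skill_name(title: str) -> str:
    """Hyphen-case a user-provided title, e.g. "Plan Mode" -> "plan-mode".

    Hypothetical helper illustrating the naming rules: lowercase letters,
    digits, and hyphens only, at most 64 characters.
    """
    name = title.strip().lower()
    # Collapse every run of disallowed characters into a single hyphen.
    name = re.sub(r"[^a-z0-9]+", "-", name).strip("-")
    if not name or len(name) > 64:
        raise ValueError(f"cannot derive a valid skill name from {title!r}")
    return name
```

For example, `normalize_skill_name("GH: Address Comments!")` yields `gh-address-comments`.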

### Step 1: Understanding the Skill with Concrete Examples

Skip this step only when the skill's usage patterns are already clearly understood. It remains valuable even when working with an existing skill.

To create an effective skill, clearly understand concrete examples of how the skill will be used. This understanding can come from either direct user examples or generated examples that are validated with user feedback.

For example, when building an image-editor skill, relevant questions include:

- "What functionality should the image-editor skill support? Editing, rotating, anything else?"
- "Can you give some examples of how this skill would be used?"
- "I can imagine users asking for things like 'Remove the red-eye from this image' or 'Rotate this image'. Are there other ways you imagine this skill being used?"
- "What would a user say that should trigger this skill?"

To avoid overwhelming users, avoid asking too many questions in a single message. Start with the most important questions and follow up as needed.

Conclude this step when there is a clear sense of the functionality the skill should support.

### Step 2: Planning the Reusable Skill Contents

To turn concrete examples into an effective skill, analyze each example by:

1. Considering how to execute on the example from scratch
2. Identifying what scripts, references, and assets would be helpful when executing these workflows repeatedly

Example: When building a `pdf-editor` skill to handle queries like "Help me rotate this PDF," the analysis shows:

1. Rotating a PDF requires re-writing the same code each time
2. A `scripts/rotate_pdf.py` script would be helpful to store in the skill

Example: When designing a `frontend-webapp-builder` skill for queries like "Build me a todo app" or "Build me a dashboard to track my steps," the analysis shows:

1. Writing a frontend webapp requires the same boilerplate HTML/React each time
2. An `assets/hello-world/` template containing the boilerplate HTML/React project files would be helpful to store in the skill

Example: When building a `big-query` skill to handle queries like "How many users have logged in today?" the analysis shows:

1. Querying BigQuery requires re-discovering the table schemas and relationships each time
2. A `references/schema.md` file documenting the table schemas would be helpful to store in the skill

To establish the skill's contents, analyze each concrete example to create a list of the reusable resources to include: scripts, references, and assets.

### Step 3: Initializing the Skill

At this point, it is time to actually create the skill.

Skip this step only if the skill being developed already exists; in that case, continue to the next step.

When creating a new skill from scratch, always run the `init_skill.py` script. The script generates a new template skill directory that automatically includes everything a skill requires, making the skill creation process much more efficient and reliable.

Usage:

```bash
scripts/init_skill.py <skill-name> --path <output-directory> [--resources scripts,references,assets] [--examples]
```

Examples:

```bash
scripts/init_skill.py my-skill --path skills/public
scripts/init_skill.py my-skill --path skills/public --resources scripts,references
scripts/init_skill.py my-skill --path skills/public --resources scripts --examples
```

The script:

- Creates the skill directory at the specified path
- Generates a SKILL.md template with proper frontmatter and TODO placeholders
- Creates `agents/openai.yaml` using agent-generated `display_name`, `short_description`, and `default_prompt` passed via `--interface key=value`
- Optionally creates resource directories based on `--resources`
- Optionally adds example files when `--examples` is set

After initialization, customize the SKILL.md and add resources as needed. If you used `--examples`, replace or delete placeholder files.

Generate `display_name`, `short_description`, and `default_prompt` by reading the skill, then pass them as `--interface key=value` to `init_skill.py` or regenerate with:

```bash
scripts/generate_openai_yaml.py <path/to/skill-folder> --interface key=value
```

Only include other optional interface fields when the user explicitly provides them. For full field descriptions and examples, see references/openai_yaml.md.

### Step 4: Edit the Skill

When editing the (newly-generated or existing) skill, remember that the skill is being created for another instance of Codex to use. Include information that would be beneficial and non-obvious to Codex. Consider what procedural knowledge, domain-specific details, or reusable assets would help another Codex instance execute these tasks more effectively.

After substantial revisions, or if the skill is particularly tricky, use subagents to forward-test the skill on realistic tasks or artifacts. When doing so, pass the artifact under validation rather than your diagnosis of what is wrong, and keep the prompt generic enough that success depends on transferable reasoning rather than hidden ground truth.

#### Start with Reusable Skill Contents

To begin implementation, start with the reusable resources identified above: `scripts/`, `references/`, and `assets/` files. Note that this step may require user input. For example, when implementing a `brand-guidelines` skill, the user may need to provide brand assets or templates to store in `assets/`, or documentation to store in `references/`.

Added scripts must be tested by actually running them to ensure there are no bugs and that the output matches what is expected. If there are many similar scripts, only a representative sample needs to be tested to ensure confidence that they all work while balancing time to completion.

If you used `--examples`, delete any placeholder files that are not needed for the skill. Only create resource directories that are actually required.

#### Update SKILL.md

**Writing Guidelines:** Always use imperative/infinitive form.

##### Frontmatter

Write the YAML frontmatter with `name` and `description`:

- `name`: The skill name
- `description`: This is the primary triggering mechanism for your skill, and helps Codex understand when to use the skill.
  - Include both what the Skill does and specific triggers/contexts for when to use it.
  - Include all "when to use" information here - not in the body. The body is only loaded after triggering, so "When to Use This Skill" sections in the body are not helpful to Codex.
  - Example description for a `docx` skill: "Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when Codex needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks"

Do not include any other fields in the YAML frontmatter.
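Put together, a complete frontmatter block following these rules (reusing the `docx` description above) looks like:

```markdown
---
name: docx
description: Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. Use when Codex needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks
---
```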

##### Body

Write instructions for using the skill and its bundled resources.

### Step 5: Validate the Skill

Once development of the skill is complete, validate the skill folder to catch basic issues early:

```bash
scripts/quick_validate.py <path/to/skill-folder>
```

The validation script checks YAML frontmatter format, required fields, and naming rules. If validation fails, fix the reported issues and run the command again.
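A minimal sketch of the kind of checks such a validator performs (an illustrative stand-in, not the actual `quick_validate.py`):

```python
import re


def validate_skill_md(text: str) -> list:
    """Return a list of problems found in a SKILL.md string (empty = valid).

    Illustrative sketch only: checks the YAML frontmatter delimiters,
    the required name/description fields, and the naming rules.
    """
    errors = []
    lines = text.splitlines()
    stripped = [line.strip() for line in lines[1:]]
    # Frontmatter must be delimited by a pair of "---" lines.
    if not lines or lines[0].strip() != "---" or "---" not in stripped:
        errors.append("missing YAML frontmatter delimited by ---")
        return errors
    end = stripped.index("---") + 1
    # Naive key: value parse of the frontmatter block (sketch only).
    front = {}
    for line in lines[1:end]:
        if ":" in line:
            key, _, value = line.partition(":")
            front[key.strip()] = value.strip()
    for field in ("name", "description"):
        if not front.get(field):
            errors.append(f"missing required field: {field}")
    name = front.get("name", "")
    if name and not re.fullmatch(r"[a-z0-9][a-z0-9-]{0,63}", name):
        errors.append("name must be lowercase letters, digits, hyphens (max 64 chars)")
    return errors
```

Running it on a well-formed SKILL.md string returns an empty list; each violation adds a human-readable error.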

### Step 6: Iterate

After testing the skill, you may find that it is complex enough to require forward-testing, or users may request improvements.

User testing often happens right after using the skill, while the context of how the skill performed is still fresh.

**Forward-testing and iteration workflow:**

1. Use the skill on real tasks
2. Notice struggles or inefficiencies
3. Identify how SKILL.md or bundled resources should be updated
4. Implement changes and test again
5. Forward-test if it is reasonable and appropriate

## Forward-testing

To forward-test, launch subagents as a way to stress test the skill with minimal context. Subagents should *not* know that they are being asked to test the skill. They should be treated as an agent asked to perform a task by the user. Prompts to subagents should look like:

`Use $skill-x at /path/to/skill-x to solve problem y`

Not:

`Review the skill at /path/to/skill-x; pretend a user asks you to...`

Decision rule for forward-testing:

- Err on the side of forward-testing
- Ask for approval if you think there's a risk that forward-testing would:
  * take a long time,
  * require additional approvals from the user, or
  * modify live production systems

In these cases, show the user your proposed prompt and request (1) a yes/no decision, and (2) any suggested modifications.

Considerations when forward-testing:

- use fresh threads for independent passes
- pass the skill and a request the same way a user would
- pass raw artifacts, not your conclusions
- avoid showing expected answers or intended fixes
- rebuild context from source artifacts after each iteration
- review the subagent's output, reasoning, and emitted artifacts
- avoid leaving artifacts the agent can find on disk between iterations; clean up subagents' artifacts to avoid additional contamination

If forward-testing only succeeds when subagents see leaked context, tighten the skill or the forward-testing setup before trusting the result.

skills/.system/skill-creator/agents/openai.yaml
ADDED
@@ -0,0 +1,5 @@
interface:
  display_name: "Skill Creator"
  short_description: "Create or update a skill"
  icon_small: "./assets/skill-creator-small.svg"
  icon_large: "./assets/skill-creator.png"

skills/.system/skill-creator/assets/skill-creator-small.svg
ADDED

skills/.system/skill-creator/assets/skill-creator.png
ADDED

skills/.system/skill-creator/license.txt
ADDED
@@ -0,0 +1,202 @@

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
skills/.system/skill-creator/references/openai_yaml.md
ADDED
@@ -0,0 +1,49 @@
# openai.yaml fields (full example + descriptions)

`agents/openai.yaml` is an extended, product-specific config intended for the machine/harness to read, not the agent. Other product-specific config can also live in the `agents/` folder.

## Full example

```yaml
interface:
  display_name: "Optional user-facing name"
  short_description: "Optional user-facing description"
  icon_small: "./assets/small-400px.png"
  icon_large: "./assets/large-logo.svg"
  brand_color: "#3B82F6"
  default_prompt: "Optional surrounding prompt to use the skill with"

dependencies:
  tools:
    - type: "mcp"
      value: "github"
      description: "GitHub MCP server"
      transport: "streamable_http"
      url: "https://api.githubcopilot.com/mcp/"

policy:
  allow_implicit_invocation: true
```

## Field descriptions and constraints

Top-level constraints:

- Quote all string values.
- Keep keys unquoted.
- For `interface.default_prompt`: generate a helpful, short (typically 1 sentence) example starting prompt based on the skill. It must explicitly mention the skill as `$skill-name` (e.g., "Use $skill-name-here to draft a concise weekly status update.").

- `interface.display_name`: Human-facing title shown in UI skill lists and chips.
- `interface.short_description`: Human-facing short UI blurb (25-64 chars) for quick scanning.
- `interface.icon_small`: Path to a small icon asset (relative to skill dir). Default to `./assets/` and place icons in the skill's `assets/` folder.
- `interface.icon_large`: Path to a larger logo asset (relative to skill dir). Default to `./assets/` and place icons in the skill's `assets/` folder.
- `interface.brand_color`: Hex color used for UI accents (e.g., badges).
- `interface.default_prompt`: Default prompt snippet inserted when invoking the skill.
- `dependencies.tools[].type`: Dependency category. Only `mcp` is supported for now.
- `dependencies.tools[].value`: Identifier of the tool or dependency.
- `dependencies.tools[].description`: Human-readable explanation of the dependency.
- `dependencies.tools[].transport`: Connection type when `type` is `mcp`.
- `dependencies.tools[].url`: MCP server URL when `type` is `mcp`.
- `policy.allow_implicit_invocation`: When false, the skill is not injected into
  the model context by default, but can still be invoked explicitly via `$skill`.
  Defaults to true.
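As a quick illustration of the constraints above, a minimal hand-written file could contain only the two interface fields the generator script always emits. Field values here are made up for illustration; note the `short_description` is 35 characters, inside the 25-64 range:

```yaml
interface:
  display_name: "Weekly Status Helper"
  short_description: "Draft concise weekly status updates"
```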
skills/.system/skill-creator/scripts/generate_openai_yaml.py
ADDED
@@ -0,0 +1,226 @@
#!/usr/bin/env python3
"""
OpenAI YAML Generator - Creates agents/openai.yaml for a skill folder.

Usage:
    generate_openai_yaml.py <skill_dir> [--name <skill_name>] [--interface key=value]
"""

import argparse
import re
import sys
from pathlib import Path

ACRONYMS = {
    "GH",
    "MCP",
    "API",
    "CI",
    "CLI",
    "LLM",
    "PDF",
    "PR",
    "UI",
    "URL",
    "SQL",
}

BRANDS = {
    "openai": "OpenAI",
    "openapi": "OpenAPI",
    "github": "GitHub",
    "pagerduty": "PagerDuty",
    "datadog": "DataDog",
    "sqlite": "SQLite",
    "fastapi": "FastAPI",
}

SMALL_WORDS = {"and", "or", "to", "up", "with"}

ALLOWED_INTERFACE_KEYS = {
    "display_name",
    "short_description",
    "icon_small",
    "icon_large",
    "brand_color",
    "default_prompt",
}


def yaml_quote(value):
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def format_display_name(skill_name):
    words = [word for word in skill_name.split("-") if word]
    formatted = []
    for index, word in enumerate(words):
        lower = word.lower()
        upper = word.upper()
        if upper in ACRONYMS:
            formatted.append(upper)
            continue
        if lower in BRANDS:
            formatted.append(BRANDS[lower])
            continue
        if index > 0 and lower in SMALL_WORDS:
            formatted.append(lower)
            continue
        formatted.append(word.capitalize())
    return " ".join(formatted)


def generate_short_description(display_name):
    description = f"Help with {display_name} tasks"

    if len(description) < 25:
        description = f"Help with {display_name} tasks and workflows"
    if len(description) < 25:
        description = f"Help with {display_name} tasks with guidance"

    if len(description) > 64:
        description = f"Help with {display_name}"
    if len(description) > 64:
        description = f"{display_name} helper"
    if len(description) > 64:
        description = f"{display_name} tools"
    if len(description) > 64:
        suffix = " helper"
        max_name_length = 64 - len(suffix)
        trimmed = display_name[:max_name_length].rstrip()
        description = f"{trimmed}{suffix}"
        if len(description) > 64:
            description = description[:64].rstrip()

    if len(description) < 25:
        description = f"{description} workflows"
        if len(description) > 64:
            description = description[:64].rstrip()

    return description


def read_frontmatter_name(skill_dir):
    skill_md = Path(skill_dir) / "SKILL.md"
    if not skill_md.exists():
        print(f"[ERROR] SKILL.md not found in {skill_dir}")
        return None
    content = skill_md.read_text()
    match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
    if not match:
        print("[ERROR] Invalid SKILL.md frontmatter format.")
        return None
    frontmatter_text = match.group(1)

    import yaml

    try:
        frontmatter = yaml.safe_load(frontmatter_text)
    except yaml.YAMLError as exc:
        print(f"[ERROR] Invalid YAML frontmatter: {exc}")
        return None
    if not isinstance(frontmatter, dict):
        print("[ERROR] Frontmatter must be a YAML dictionary.")
        return None
    name = frontmatter.get("name", "")
    if not isinstance(name, str) or not name.strip():
        print("[ERROR] Frontmatter 'name' is missing or invalid.")
        return None
    return name.strip()


def parse_interface_overrides(raw_overrides):
    overrides = {}
    optional_order = []
    for item in raw_overrides:
        if "=" not in item:
            print(f"[ERROR] Invalid interface override '{item}'. Use key=value.")
            return None, None
        key, value = item.split("=", 1)
        key = key.strip()
        value = value.strip()
        if not key:
            print(f"[ERROR] Invalid interface override '{item}'. Key is empty.")
            return None, None
        if key not in ALLOWED_INTERFACE_KEYS:
            allowed = ", ".join(sorted(ALLOWED_INTERFACE_KEYS))
            print(f"[ERROR] Unknown interface field '{key}'. Allowed: {allowed}")
            return None, None
        overrides[key] = value
        if key not in ("display_name", "short_description") and key not in optional_order:
            optional_order.append(key)
    return overrides, optional_order


def write_openai_yaml(skill_dir, skill_name, raw_overrides):
    overrides, optional_order = parse_interface_overrides(raw_overrides)
    if overrides is None:
        return None

    display_name = overrides.get("display_name") or format_display_name(skill_name)
    short_description = overrides.get("short_description") or generate_short_description(display_name)

    if not (25 <= len(short_description) <= 64):
        print(
            "[ERROR] short_description must be 25-64 characters "
            f"(got {len(short_description)})."
        )
        return None

    interface_lines = [
        "interface:",
        f"  display_name: {yaml_quote(display_name)}",
        f"  short_description: {yaml_quote(short_description)}",
    ]

    for key in optional_order:
        value = overrides.get(key)
        if value is not None:
            interface_lines.append(f"  {key}: {yaml_quote(value)}")

    agents_dir = Path(skill_dir) / "agents"
    agents_dir.mkdir(parents=True, exist_ok=True)
    output_path = agents_dir / "openai.yaml"
    output_path.write_text("\n".join(interface_lines) + "\n")
    print("[OK] Created agents/openai.yaml")
    return output_path


def main():
    parser = argparse.ArgumentParser(
        description="Create agents/openai.yaml for a skill directory.",
    )
    parser.add_argument("skill_dir", help="Path to the skill directory")
    parser.add_argument(
        "--name",
        help="Skill name override (defaults to SKILL.md frontmatter)",
    )
    parser.add_argument(
        "--interface",
        action="append",
        default=[],
        help="Interface override in key=value format (repeatable)",
    )
    args = parser.parse_args()

    skill_dir = Path(args.skill_dir).resolve()
    if not skill_dir.exists():
        print(f"[ERROR] Skill directory not found: {skill_dir}")
        sys.exit(1)
    if not skill_dir.is_dir():
        print(f"[ERROR] Path is not a directory: {skill_dir}")
        sys.exit(1)

    skill_name = args.name or read_frontmatter_name(skill_dir)
    if not skill_name:
        sys.exit(1)

    result = write_openai_yaml(skill_dir, skill_name, args.interface)
    if result:
        sys.exit(0)
    sys.exit(1)


if __name__ == "__main__":
    main()
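The display-name rules in this generator can be sanity-checked in isolation. Below is a trimmed-down sketch of the same logic (lookup tables cut to a few entries), not the shipped module:

```python
# Standalone sketch of generate_openai_yaml.py's name formatting.
# Tables are reduced samples of the full ACRONYMS/BRANDS/SMALL_WORDS sets.
ACRONYMS = {"PR", "API", "MCP"}
BRANDS = {"github": "GitHub", "openai": "OpenAI"}
SMALL_WORDS = {"and", "or", "to", "up", "with"}


def format_display_name(skill_name: str) -> str:
    words = [w for w in skill_name.split("-") if w]
    out = []
    for i, word in enumerate(words):
        if word.upper() in ACRONYMS:
            out.append(word.upper())          # acronyms stay uppercase
        elif word.lower() in BRANDS:
            out.append(BRANDS[word.lower()])  # brand casing wins
        elif i > 0 and word.lower() in SMALL_WORDS:
            out.append(word.lower())          # small words lowercase mid-name
        else:
            out.append(word.capitalize())     # default: Title Case
    return " ".join(out)


print(format_display_name("github-pr-helper"))   # GitHub PR Helper
print(format_display_name("sync-up-with-api"))   # Sync up with API
```

Note that a leading word is capitalized even when it is in `SMALL_WORDS`, since the `i > 0` check skips the first word.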
skills/.system/skill-creator/scripts/init_skill.py
ADDED
@@ -0,0 +1,400 @@
#!/usr/bin/env python3
"""
Skill Initializer - Creates a new skill from template

Usage:
    init_skill.py <skill-name> --path <path> [--resources scripts,references,assets] [--examples] [--interface key=value]

Examples:
    init_skill.py my-new-skill --path skills/public
    init_skill.py my-new-skill --path skills/public --resources scripts,references
    init_skill.py my-api-helper --path skills/private --resources scripts --examples
    init_skill.py custom-skill --path /custom/location
    init_skill.py my-skill --path skills/public --interface short_description="Short UI label"
"""

import argparse
import re
import sys
from pathlib import Path

from generate_openai_yaml import write_openai_yaml

MAX_SKILL_NAME_LENGTH = 64
ALLOWED_RESOURCES = {"scripts", "references", "assets"}

SKILL_TEMPLATE = """---
name: {skill_name}
description: [TODO: Complete and informative explanation of what the skill does and when to use it. Include WHEN to use this skill - specific scenarios, file types, or tasks that trigger it.]
---

# {skill_title}

## Overview

[TODO: 1-2 sentences explaining what this skill enables]

## Structuring This Skill

[TODO: Choose the structure that best fits this skill's purpose. Common patterns:

**1. Workflow-Based** (best for sequential processes)
- Works well when there are clear step-by-step procedures
- Example: DOCX skill with "Workflow Decision Tree" -> "Reading" -> "Creating" -> "Editing"
- Structure: ## Overview -> ## Workflow Decision Tree -> ## Step 1 -> ## Step 2...

**2. Task-Based** (best for tool collections)
- Works well when the skill offers different operations/capabilities
- Example: PDF skill with "Quick Start" -> "Merge PDFs" -> "Split PDFs" -> "Extract Text"
- Structure: ## Overview -> ## Quick Start -> ## Task Category 1 -> ## Task Category 2...

**3. Reference/Guidelines** (best for standards or specifications)
- Works well for brand guidelines, coding standards, or requirements
- Example: Brand styling with "Brand Guidelines" -> "Colors" -> "Typography" -> "Features"
- Structure: ## Overview -> ## Guidelines -> ## Specifications -> ## Usage...

**4. Capabilities-Based** (best for integrated systems)
- Works well when the skill provides multiple interrelated features
- Example: Product Management with "Core Capabilities" -> numbered capability list
- Structure: ## Overview -> ## Core Capabilities -> ### 1. Feature -> ### 2. Feature...

Patterns can be mixed and matched as needed. Most skills combine patterns (e.g., start with task-based, add workflow for complex operations).

Delete this entire "Structuring This Skill" section when done - it's just guidance.]

## [TODO: Replace with the first main section based on chosen structure]

[TODO: Add content here. See examples in existing skills:
- Code samples for technical skills
- Decision trees for complex workflows
- Concrete examples with realistic user requests
- References to scripts/templates/references as needed]

## Resources (optional)

Create only the resource directories this skill actually needs. Delete this section if no resources are required.

### scripts/
Executable code (Python/Bash/etc.) that can be run directly to perform specific operations.

**Examples from other skills:**
- PDF skill: `fill_fillable_fields.py`, `extract_form_field_info.py` - utilities for PDF manipulation
- DOCX skill: `document.py`, `utilities.py` - Python modules for document processing

**Appropriate for:** Python scripts, shell scripts, or any executable code that performs automation, data processing, or specific operations.

**Note:** Scripts may be executed without loading into context, but can still be read by Codex for patching or environment adjustments.

### references/
Documentation and reference material intended to be loaded into context to inform Codex's process and thinking.

**Examples from other skills:**
- Product management: `communication.md`, `context_building.md` - detailed workflow guides
- BigQuery: API reference documentation and query examples
- Finance: Schema documentation, company policies

**Appropriate for:** In-depth documentation, API references, database schemas, comprehensive guides, or any detailed information that Codex should reference while working.

### assets/
Files not intended to be loaded into context, but rather used within the output Codex produces.

**Examples from other skills:**
- Brand styling: PowerPoint template files (.pptx), logo files
- Frontend builder: HTML/React boilerplate project directories
- Typography: Font files (.ttf, .woff2)

**Appropriate for:** Templates, boilerplate code, document templates, images, icons, fonts, or any files meant to be copied or used in the final output.

---

**Not every skill requires all three types of resources.**
"""

EXAMPLE_SCRIPT = '''#!/usr/bin/env python3
"""
Example helper script for {skill_name}

This is a placeholder script that can be executed directly.
Replace with actual implementation or delete if not needed.

Example real scripts from other skills:
- pdf/scripts/fill_fillable_fields.py - Fills PDF form fields
- pdf/scripts/convert_pdf_to_images.py - Converts PDF pages to images
"""

def main():
    print("This is an example script for {skill_name}")
    # TODO: Add actual script logic here
    # This could be data processing, file conversion, API calls, etc.

if __name__ == "__main__":
    main()
'''

EXAMPLE_REFERENCE = """# Reference Documentation for {skill_title}
|
| 135 |
+
|
| 136 |
+
This is a placeholder for detailed reference documentation.
|
| 137 |
+
Replace with actual reference content or delete if not needed.
|
| 138 |
+
|
| 139 |
+
Example real reference docs from other skills:
|
| 140 |
+
- product-management/references/communication.md - Comprehensive guide for status updates
|
| 141 |
+
- product-management/references/context_building.md - Deep-dive on gathering context
|
| 142 |
+
- bigquery/references/ - API references and query examples
|
| 143 |
+
|
| 144 |
+
## When Reference Docs Are Useful
|
| 145 |
+
|
| 146 |
+
Reference docs are ideal for:
|
| 147 |
+
- Comprehensive API documentation
|
| 148 |
+
- Detailed workflow guides
|
| 149 |
+
- Complex multi-step processes
|
| 150 |
+
- Information too lengthy for main SKILL.md
|
| 151 |
+
- Content that's only needed for specific use cases
|
| 152 |
+
|
| 153 |
+
## Structure Suggestions
|
| 154 |
+
|
| 155 |
+
### API Reference Example
|
| 156 |
+
- Overview
|
| 157 |
+
- Authentication
|
| 158 |
+
- Endpoints with examples
|
| 159 |
+
- Error codes
|
| 160 |
+
- Rate limits
|
| 161 |
+
|
| 162 |
+
### Workflow Guide Example
|
| 163 |
+
- Prerequisites
|
| 164 |
+
- Step-by-step instructions
|
| 165 |
+
- Common patterns
|
| 166 |
+
- Troubleshooting
|
| 167 |
+
- Best practices
|
| 168 |
+
"""
|
| 169 |
+
|
| 170 |
+
EXAMPLE_ASSET = """# Example Asset File
|
| 171 |
+
|
| 172 |
+
This placeholder represents where asset files would be stored.
|
| 173 |
+
Replace with actual asset files (templates, images, fonts, etc.) or delete if not needed.
|
| 174 |
+
|
| 175 |
+
Asset files are NOT intended to be loaded into context, but rather used within
|
| 176 |
+
the output Codex produces.
|
| 177 |
+
|
| 178 |
+
Example asset files from other skills:
|
| 179 |
+
- Brand guidelines: logo.png, slides_template.pptx
|
| 180 |
+
- Frontend builder: hello-world/ directory with HTML/React boilerplate
|
| 181 |
+
- Typography: custom-font.ttf, font-family.woff2
|
| 182 |
+
- Data: sample_data.csv, test_dataset.json
|
| 183 |
+
|
| 184 |
+
## Common Asset Types
|
| 185 |
+
|
| 186 |
+
- Templates: .pptx, .docx, boilerplate directories
|
| 187 |
+
- Images: .png, .jpg, .svg, .gif
|
| 188 |
+
- Fonts: .ttf, .otf, .woff, .woff2
|
| 189 |
+
- Boilerplate code: Project directories, starter files
|
| 190 |
+
- Icons: .ico, .svg
|
| 191 |
+
- Data files: .csv, .json, .xml, .yaml
|
| 192 |
+
|
| 193 |
+
Note: This is a text placeholder. Actual assets can be any file type.
|
| 194 |
+
"""
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
def normalize_skill_name(skill_name):
|
| 198 |
+
"""Normalize a skill name to lowercase hyphen-case."""
|
| 199 |
+
normalized = skill_name.strip().lower()
|
| 200 |
+
normalized = re.sub(r"[^a-z0-9]+", "-", normalized)
|
| 201 |
+
normalized = normalized.strip("-")
|
| 202 |
+
normalized = re.sub(r"-{2,}", "-", normalized)
|
| 203 |
+
return normalized
|
| 204 |
+
|
| 205 |
+
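To illustrate what this normalization produces, here is a small standalone sketch (it duplicates the helper so the snippet runs on its own; the input names are invented):

```python
import re

def normalize_skill_name(skill_name):
    # Same transform as the helper above: lowercase, collapse runs of
    # non-alphanumeric characters into single hyphens, trim edge hyphens.
    normalized = skill_name.strip().lower()
    normalized = re.sub(r"[^a-z0-9]+", "-", normalized)
    normalized = normalized.strip("-")
    return re.sub(r"-{2,}", "-", normalized)

print(normalize_skill_name("  My PDF Skill!  "))  # my-pdf-skill
print(normalize_skill_name("Data__Export v2"))    # data-export-v2
```

The trailing `re.sub(r"-{2,}", "-", ...)` pass is defensive: the first substitution already collapses runs, so it mainly guards against future changes to the character class.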

def title_case_skill_name(skill_name):
    """Convert hyphenated skill name to Title Case for display."""
    return " ".join(word.capitalize() for word in skill_name.split("-"))


def parse_resources(raw_resources):
    if not raw_resources:
        return []
    resources = [item.strip() for item in raw_resources.split(",") if item.strip()]
    invalid = sorted({item for item in resources if item not in ALLOWED_RESOURCES})
    if invalid:
        allowed = ", ".join(sorted(ALLOWED_RESOURCES))
        print(f"[ERROR] Unknown resource type(s): {', '.join(invalid)}")
        print(f"        Allowed: {allowed}")
        sys.exit(1)
    deduped = []
    seen = set()
    for resource in resources:
        if resource not in seen:
            deduped.append(resource)
            seen.add(resource)
    return deduped
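As a usage sketch for the parser above (reproduced here so it runs standalone; `ALLOWED_RESOURCES` is assumed to be the set this script defines elsewhere, and the error handling is simplified to a bare `sys.exit`):

```python
import sys

ALLOWED_RESOURCES = {"scripts", "references", "assets"}  # assumed definition

def parse_resources(raw_resources):
    # Mirror of the parser above: split on commas, validate against the
    # allowed set, and de-duplicate while preserving first-seen order.
    if not raw_resources:
        return []
    resources = [item.strip() for item in raw_resources.split(",") if item.strip()]
    invalid = sorted({item for item in resources if item not in ALLOWED_RESOURCES})
    if invalid:
        sys.exit(f"Unknown resource type(s): {', '.join(invalid)}")
    deduped = []
    seen = set()
    for resource in resources:
        if resource not in seen:
            deduped.append(resource)
            seen.add(resource)
    return deduped

print(parse_resources("scripts, assets,scripts"))  # ['scripts', 'assets']
```

Order-preserving de-duplication matters here because the resource list is echoed back to the user in the order they typed it.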

def create_resource_dirs(skill_dir, skill_name, skill_title, resources, include_examples):
    for resource in resources:
        resource_dir = skill_dir / resource
        resource_dir.mkdir(exist_ok=True)
        if resource == "scripts":
            if include_examples:
                example_script = resource_dir / "example.py"
                example_script.write_text(EXAMPLE_SCRIPT.format(skill_name=skill_name))
                example_script.chmod(0o755)
                print("[OK] Created scripts/example.py")
            else:
                print("[OK] Created scripts/")
        elif resource == "references":
            if include_examples:
                example_reference = resource_dir / "api_reference.md"
                example_reference.write_text(EXAMPLE_REFERENCE.format(skill_title=skill_title))
                print("[OK] Created references/api_reference.md")
            else:
                print("[OK] Created references/")
        elif resource == "assets":
            if include_examples:
                example_asset = resource_dir / "example_asset.txt"
                example_asset.write_text(EXAMPLE_ASSET)
                print("[OK] Created assets/example_asset.txt")
            else:
                print("[OK] Created assets/")


def init_skill(skill_name, path, resources, include_examples, interface_overrides):
    """
    Initialize a new skill directory with template SKILL.md.

    Args:
        skill_name: Name of the skill
        path: Path where the skill directory should be created
        resources: Resource directories to create
        include_examples: Whether to create example files in resource directories

    Returns:
        Path to created skill directory, or None if error
    """
    # Determine skill directory path
    skill_dir = Path(path).resolve() / skill_name

    # Check if directory already exists
    if skill_dir.exists():
        print(f"[ERROR] Skill directory already exists: {skill_dir}")
        return None

    # Create skill directory
    try:
        skill_dir.mkdir(parents=True, exist_ok=False)
        print(f"[OK] Created skill directory: {skill_dir}")
    except Exception as e:
        print(f"[ERROR] Error creating directory: {e}")
        return None

    # Create SKILL.md from template
    skill_title = title_case_skill_name(skill_name)
    skill_content = SKILL_TEMPLATE.format(skill_name=skill_name, skill_title=skill_title)

    skill_md_path = skill_dir / "SKILL.md"
    try:
        skill_md_path.write_text(skill_content)
        print("[OK] Created SKILL.md")
    except Exception as e:
        print(f"[ERROR] Error creating SKILL.md: {e}")
        return None

    # Create agents/openai.yaml
    try:
        result = write_openai_yaml(skill_dir, skill_name, interface_overrides)
        if not result:
            return None
    except Exception as e:
        print(f"[ERROR] Error creating agents/openai.yaml: {e}")
        return None

    # Create resource directories if requested
    if resources:
        try:
            create_resource_dirs(skill_dir, skill_name, skill_title, resources, include_examples)
        except Exception as e:
            print(f"[ERROR] Error creating resource directories: {e}")
            return None

    # Print next steps
    print(f"\n[OK] Skill '{skill_name}' initialized successfully at {skill_dir}")
    print("\nNext steps:")
    print("1. Edit SKILL.md to complete the TODO items and update the description")
    if resources:
        if include_examples:
            print("2. Customize or delete the example files in scripts/, references/, and assets/")
        else:
            print("2. Add resources to scripts/, references/, and assets/ as needed")
    else:
        print("2. Create resource directories only if needed (scripts/, references/, assets/)")
    print("3. Update agents/openai.yaml if the UI metadata should differ")
    print("4. Run the validator when ready to check the skill structure")
    print(
        "5. Forward-test complex skills with realistic user requests to ensure they work as intended"
    )

    return skill_dir


def main():
    parser = argparse.ArgumentParser(
        description="Create a new skill directory with a SKILL.md template.",
    )
    parser.add_argument("skill_name", help="Skill name (normalized to hyphen-case)")
    parser.add_argument("--path", required=True, help="Output directory for the skill")
    parser.add_argument(
        "--resources",
        default="",
        help="Comma-separated list: scripts,references,assets",
    )
    parser.add_argument(
        "--examples",
        action="store_true",
        help="Create example files inside the selected resource directories",
    )
    parser.add_argument(
        "--interface",
        action="append",
        default=[],
        help="Interface override in key=value format (repeatable)",
    )
    args = parser.parse_args()

    raw_skill_name = args.skill_name
    skill_name = normalize_skill_name(raw_skill_name)
    if not skill_name:
        print("[ERROR] Skill name must include at least one letter or digit.")
        sys.exit(1)
    if len(skill_name) > MAX_SKILL_NAME_LENGTH:
        print(
            f"[ERROR] Skill name '{skill_name}' is too long ({len(skill_name)} characters). "
            f"Maximum is {MAX_SKILL_NAME_LENGTH} characters."
        )
        sys.exit(1)
    if skill_name != raw_skill_name:
        print(f"Note: Normalized skill name from '{raw_skill_name}' to '{skill_name}'.")

    resources = parse_resources(args.resources)
    if args.examples and not resources:
        print("[ERROR] --examples requires --resources to be set.")
        sys.exit(1)

    path = args.path

    print(f"Initializing skill: {skill_name}")
    print(f"  Location: {path}")
    if resources:
        print(f"  Resources: {', '.join(resources)}")
        if args.examples:
            print("  Examples: enabled")
    else:
        print("  Resources: none (create as needed)")
    print()

    result = init_skill(skill_name, path, resources, args.examples, args.interface)

    if result:
        sys.exit(0)
    else:
        sys.exit(1)


if __name__ == "__main__":
    main()
skills/.system/skill-creator/scripts/quick_validate.py
ADDED
@@ -0,0 +1,101 @@
#!/usr/bin/env python3
"""
Quick validation script for skills - minimal version
"""

import re
import sys
from pathlib import Path

import yaml

MAX_SKILL_NAME_LENGTH = 64


def validate_skill(skill_path):
    """Basic validation of a skill"""
    skill_path = Path(skill_path)

    skill_md = skill_path / "SKILL.md"
    if not skill_md.exists():
        return False, "SKILL.md not found"

    content = skill_md.read_text()
    if not content.startswith("---"):
        return False, "No YAML frontmatter found"

    match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
    if not match:
        return False, "Invalid frontmatter format"

    frontmatter_text = match.group(1)

    try:
        frontmatter = yaml.safe_load(frontmatter_text)
        if not isinstance(frontmatter, dict):
            return False, "Frontmatter must be a YAML dictionary"
    except yaml.YAMLError as e:
        return False, f"Invalid YAML in frontmatter: {e}"

    allowed_properties = {"name", "description", "license", "allowed-tools", "metadata"}

    unexpected_keys = set(frontmatter.keys()) - allowed_properties
    if unexpected_keys:
        allowed = ", ".join(sorted(allowed_properties))
        unexpected = ", ".join(sorted(unexpected_keys))
        return (
            False,
            f"Unexpected key(s) in SKILL.md frontmatter: {unexpected}. Allowed properties are: {allowed}",
        )

    if "name" not in frontmatter:
        return False, "Missing 'name' in frontmatter"
    if "description" not in frontmatter:
        return False, "Missing 'description' in frontmatter"

    name = frontmatter.get("name", "")
    if not isinstance(name, str):
        return False, f"Name must be a string, got {type(name).__name__}"
    name = name.strip()
    if name:
        if not re.match(r"^[a-z0-9-]+$", name):
            return (
                False,
                f"Name '{name}' should be hyphen-case (lowercase letters, digits, and hyphens only)",
            )
        if name.startswith("-") or name.endswith("-") or "--" in name:
            return (
                False,
                f"Name '{name}' cannot start/end with hyphen or contain consecutive hyphens",
            )
        if len(name) > MAX_SKILL_NAME_LENGTH:
            return (
                False,
                f"Name is too long ({len(name)} characters). "
                f"Maximum is {MAX_SKILL_NAME_LENGTH} characters.",
            )

    description = frontmatter.get("description", "")
    if not isinstance(description, str):
        return False, f"Description must be a string, got {type(description).__name__}"
    description = description.strip()
    if description:
        if "<" in description or ">" in description:
            return False, "Description cannot contain angle brackets (< or >)"
        if len(description) > 1024:
            return (
                False,
                f"Description is too long ({len(description)} characters). Maximum is 1024 characters.",
            )

    return True, "Skill is valid!"


if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python quick_validate.py <skill_directory>")
        sys.exit(1)

    valid, message = validate_skill(sys.argv[1])
    print(message)
    sys.exit(0 if valid else 1)
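To illustrate the frontmatter extraction and hyphen-case checks this validator performs, here is a minimal standalone sketch (the sample SKILL.md content is invented, and the PyYAML parse is skipped so the snippet has no third-party dependency):

```python
import re

content = """---
name: my-skill
description: Creates and validates skill scaffolding.
---

# My Skill
"""

# Same pattern quick_validate.py uses to capture the frontmatter block:
# everything between the opening and the first closing '---' fence.
match = re.match(r"^---\n(.*?)\n---", content, re.DOTALL)
frontmatter_text = match.group(1)
print(frontmatter_text.splitlines()[0])  # name: my-skill

# Same hyphen-case rule applied to the 'name' field
print(bool(re.match(r"^[a-z0-9-]+$", "my-skill")))  # True
print(bool(re.match(r"^[a-z0-9-]+$", "My Skill")))  # False
```

Note that the non-greedy `(.*?)` with `re.DOTALL` stops at the first closing fence, so a stray `---` later in the document body cannot leak into the captured frontmatter.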
skills/.system/skill-installer/LICENSE.txt
ADDED
@@ -0,0 +1,202 @@

                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!) The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.
skills/.system/skill-installer/SKILL.md
ADDED
@@ -0,0 +1,58 @@
---
name: skill-installer
description: Install Codex skills into $CODEX_HOME/skills from a curated list or a GitHub repo path. Use when a user asks to list installable skills, install a curated skill, or install a skill from another repo (including private repos).
metadata:
  short-description: Install curated skills from openai/skills or other repos
---

# Skill Installer

Helps install skills. By default these come from https://github.com/openai/skills/tree/main/skills/.curated, but users can also provide other locations. Experimental skills live in https://github.com/openai/skills/tree/main/skills/.experimental and can be installed the same way.

Use the helper scripts based on the task:
- List skills when the user asks what is available, or when they invoke this skill without specifying what to do. The default listing is `.curated`; pass `--path skills/.experimental` when they ask about experimental skills.
- Install from the curated list when the user provides a skill name.
- Install from another repo when the user provides a GitHub repo/path (including private repos).

## Communication

When listing skills, output approximately as follows, depending on the context of the user's request. If they ask about experimental skills, list from `.experimental` instead of `.curated` and label the source accordingly:
"""
Skills from {repo}:
1. skill-1
2. skill-2 (already installed)
3. ...
Which ones would you like installed?
"""

After installing a skill, tell the user: "Restart Codex to pick up new skills."

## Scripts

All of these scripts use the network, so when running in the sandbox, request escalation to run them.

- `scripts/list-skills.py` (prints the skills list with installed annotations)
- `scripts/list-skills.py --format json`
- Example (experimental list): `scripts/list-skills.py --path skills/.experimental`
- `scripts/install-skill-from-github.py --repo <owner>/<repo> --path <path/to/skill> [<path/to/skill> ...]`
- `scripts/install-skill-from-github.py --url https://github.com/<owner>/<repo>/tree/<ref>/<path>`
- Example (experimental skill): `scripts/install-skill-from-github.py --repo openai/skills --path skills/.experimental/<skill-name>`

## Behavior and Options

- Defaults to direct download for public GitHub repos.
- If the download fails with auth/permission errors, falls back to a git sparse checkout.
- Aborts if the destination skill directory already exists.
- Installs into `$CODEX_HOME/skills/<skill-name>` (defaults to `~/.codex/skills`).
- Multiple `--path` values install multiple skills in one run, each named from the path basename unless `--name` is supplied.
- Options: `--ref <ref>` (default `main`), `--dest <path>`, `--method auto|download|git`.

## Notes

- The curated listing is fetched from `https://github.com/openai/skills/tree/main/skills/.curated` via the GitHub API. If it is unavailable, explain the error and exit.
- Private GitHub repos can be accessed via existing git credentials, or via an optional `GITHUB_TOKEN`/`GH_TOKEN` for download.
- The git fallback tries HTTPS first, then SSH.
- The skills at https://github.com/openai/skills/tree/main/skills/.system are preinstalled, so there is no need to help users install those. If they ask, just explain this. If they insist, you can download and overwrite.
- Installed annotations come from `$CODEX_HOME/skills`.
skills/.system/skill-installer/agents/openai.yaml
ADDED
@@ -0,0 +1,5 @@
interface:
  display_name: "Skill Installer"
  short_description: "Install curated skills from openai/skills or other repos"
  icon_small: "./assets/skill-installer-small.svg"
  icon_large: "./assets/skill-installer.png"
skills/.system/skill-installer/assets/skill-installer-small.svg
ADDED
skills/.system/skill-installer/assets/skill-installer.png
ADDED
skills/.system/skill-installer/scripts/github_utils.py
ADDED
@@ -0,0 +1,21 @@
#!/usr/bin/env python3
"""Shared GitHub helpers for skill install scripts."""

from __future__ import annotations

import os
import urllib.request


def github_request(url: str, user_agent: str) -> bytes:
    headers = {"User-Agent": user_agent}
    token = os.environ.get("GITHUB_TOKEN") or os.environ.get("GH_TOKEN")
    if token:
        headers["Authorization"] = f"token {token}"
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def github_api_contents_url(repo: str, path: str, ref: str) -> str:
    return f"https://api.github.com/repos/{repo}/contents/{path}?ref={ref}"
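As a quick illustration of the contents-URL builder in `github_utils.py`, the sketch below re-states the one-line helper standalone (copied for illustration, not imported from the script) and shows the URL it produces for the default curated listing:

```python
# Standalone re-statement of github_api_contents_url, for illustration only.
def github_api_contents_url(repo: str, path: str, ref: str) -> str:
    return f"https://api.github.com/repos/{repo}/contents/{path}?ref={ref}"

# The URL list-skills.py fetches by default:
url = github_api_contents_url("openai/skills", "skills/.curated", "main")
print(url)
# https://api.github.com/repos/openai/skills/contents/skills/.curated?ref=main
```

A `GITHUB_TOKEN`/`GH_TOKEN` header is added by `github_request` when set, which is what makes the same URL work for private repos.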
skills/.system/skill-installer/scripts/install-skill-from-github.py
ADDED
@@ -0,0 +1,308 @@
#!/usr/bin/env python3
"""Install a skill from a GitHub repo path into $CODEX_HOME/skills."""

from __future__ import annotations

import argparse
import os
import shutil
import subprocess
import sys
import tempfile
import urllib.error
import urllib.parse
import zipfile
from dataclasses import dataclass

from github_utils import github_request

DEFAULT_REF = "main"


@dataclass
class Args:
    url: str | None = None
    repo: str | None = None
    path: list[str] | None = None
    ref: str = DEFAULT_REF
    dest: str | None = None
    name: str | None = None
    method: str = "auto"


@dataclass
class Source:
    owner: str
    repo: str
    ref: str
    paths: list[str]
    repo_url: str | None = None


class InstallError(Exception):
    pass


def _codex_home() -> str:
    return os.environ.get("CODEX_HOME", os.path.expanduser("~/.codex"))


def _tmp_root() -> str:
    base = os.path.join(tempfile.gettempdir(), "codex")
    os.makedirs(base, exist_ok=True)
    return base


def _request(url: str) -> bytes:
    return github_request(url, "codex-skill-install")


def _parse_github_url(url: str, default_ref: str) -> tuple[str, str, str, str | None]:
    parsed = urllib.parse.urlparse(url)
    if parsed.netloc != "github.com":
        raise InstallError("Only GitHub URLs are supported for download mode.")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2:
        raise InstallError("Invalid GitHub URL.")
    owner, repo = parts[0], parts[1]
    ref = default_ref
    subpath = ""
    if len(parts) > 2:
        if parts[2] in ("tree", "blob"):
            if len(parts) < 4:
                raise InstallError("GitHub URL missing ref or path.")
            ref = parts[3]
            subpath = "/".join(parts[4:])
        else:
            subpath = "/".join(parts[2:])
    return owner, repo, ref, subpath or None


def _download_repo_zip(owner: str, repo: str, ref: str, dest_dir: str) -> str:
    zip_url = f"https://codeload.github.com/{owner}/{repo}/zip/{ref}"
    zip_path = os.path.join(dest_dir, "repo.zip")
    try:
        payload = _request(zip_url)
    except urllib.error.HTTPError as exc:
        raise InstallError(f"Download failed: HTTP {exc.code}") from exc
    with open(zip_path, "wb") as file_handle:
        file_handle.write(payload)
    with zipfile.ZipFile(zip_path, "r") as zip_file:
        _safe_extract_zip(zip_file, dest_dir)
        top_levels = {name.split("/")[0] for name in zip_file.namelist() if name}
    if not top_levels:
        raise InstallError("Downloaded archive was empty.")
    if len(top_levels) != 1:
        raise InstallError("Unexpected archive layout.")
    return os.path.join(dest_dir, next(iter(top_levels)))


def _run_git(args: list[str]) -> None:
    result = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    if result.returncode != 0:
        raise InstallError(result.stderr.strip() or "Git command failed.")


def _safe_extract_zip(zip_file: zipfile.ZipFile, dest_dir: str) -> None:
    dest_root = os.path.realpath(dest_dir)
    for info in zip_file.infolist():
        extracted_path = os.path.realpath(os.path.join(dest_dir, info.filename))
        if extracted_path == dest_root or extracted_path.startswith(dest_root + os.sep):
            continue
        raise InstallError("Archive contains files outside the destination.")
    zip_file.extractall(dest_dir)


def _validate_relative_path(path: str) -> None:
    if os.path.isabs(path) or os.path.normpath(path).startswith(".."):
        raise InstallError("Skill path must be a relative path inside the repo.")


def _validate_skill_name(name: str) -> None:
    altsep = os.path.altsep
    if not name or os.path.sep in name or (altsep and altsep in name):
        raise InstallError("Skill name must be a single path segment.")
    if name in (".", ".."):
        raise InstallError("Invalid skill name.")


def _git_sparse_checkout(repo_url: str, ref: str, paths: list[str], dest_dir: str) -> str:
    repo_dir = os.path.join(dest_dir, "repo")
    clone_cmd = [
        "git",
        "clone",
        "--filter=blob:none",
        "--depth",
        "1",
        "--sparse",
        "--single-branch",
        "--branch",
        ref,
        repo_url,
        repo_dir,
    ]
    try:
        _run_git(clone_cmd)
    except InstallError:
        _run_git(
            [
                "git",
                "clone",
                "--filter=blob:none",
                "--depth",
                "1",
                "--sparse",
                "--single-branch",
                repo_url,
                repo_dir,
            ]
        )
    _run_git(["git", "-C", repo_dir, "sparse-checkout", "set", *paths])
    _run_git(["git", "-C", repo_dir, "checkout", ref])
    return repo_dir


def _validate_skill(path: str) -> None:
    if not os.path.isdir(path):
        raise InstallError(f"Skill path not found: {path}")
    skill_md = os.path.join(path, "SKILL.md")
    if not os.path.isfile(skill_md):
        raise InstallError("SKILL.md not found in selected skill directory.")


def _copy_skill(src: str, dest_dir: str) -> None:
    os.makedirs(os.path.dirname(dest_dir), exist_ok=True)
    if os.path.exists(dest_dir):
        raise InstallError(f"Destination already exists: {dest_dir}")
    shutil.copytree(src, dest_dir)


def _build_repo_url(owner: str, repo: str) -> str:
    return f"https://github.com/{owner}/{repo}.git"


def _build_repo_ssh(owner: str, repo: str) -> str:
    return f"git@github.com:{owner}/{repo}.git"


def _prepare_repo(source: Source, method: str, tmp_dir: str) -> str:
    if method in ("download", "auto"):
        try:
            return _download_repo_zip(source.owner, source.repo, source.ref, tmp_dir)
        except InstallError as exc:
            if method == "download":
                raise
            err_msg = str(exc)
            if "HTTP 401" in err_msg or "HTTP 403" in err_msg or "HTTP 404" in err_msg:
                pass
            else:
                raise
    if method in ("git", "auto"):
        repo_url = source.repo_url or _build_repo_url(source.owner, source.repo)
        try:
            return _git_sparse_checkout(repo_url, source.ref, source.paths, tmp_dir)
        except InstallError:
            repo_url = _build_repo_ssh(source.owner, source.repo)
            return _git_sparse_checkout(repo_url, source.ref, source.paths, tmp_dir)
    raise InstallError("Unsupported method.")


def _resolve_source(args: Args) -> Source:
    if args.url:
        owner, repo, ref, url_path = _parse_github_url(args.url, args.ref)
        if args.path is not None:
            paths = list(args.path)
        elif url_path:
            paths = [url_path]
        else:
            paths = []
        if not paths:
            raise InstallError("Missing --path for GitHub URL.")
        return Source(owner=owner, repo=repo, ref=ref, paths=paths)

    if not args.repo:
        raise InstallError("Provide --repo or --url.")
    if "://" in args.repo:
        return _resolve_source(
            Args(url=args.repo, repo=None, path=args.path, ref=args.ref)
        )

    repo_parts = [p for p in args.repo.split("/") if p]
    if len(repo_parts) != 2:
        raise InstallError("--repo must be in owner/repo format.")
    if not args.path:
        raise InstallError("Missing --path for --repo.")
    paths = list(args.path)
    return Source(
        owner=repo_parts[0],
        repo=repo_parts[1],
        ref=args.ref,
        paths=paths,
    )


def _default_dest() -> str:
    return os.path.join(_codex_home(), "skills")


def _parse_args(argv: list[str]) -> Args:
    parser = argparse.ArgumentParser(description="Install a skill from GitHub.")
    parser.add_argument("--repo", help="owner/repo")
    parser.add_argument("--url", help="https://github.com/owner/repo[/tree/ref/path]")
    parser.add_argument(
        "--path",
        nargs="+",
        help="Path(s) to skill(s) inside repo",
    )
    parser.add_argument("--ref", default=DEFAULT_REF)
    parser.add_argument("--dest", help="Destination skills directory")
    parser.add_argument(
        "--name", help="Destination skill name (defaults to basename of path)"
    )
    parser.add_argument(
        "--method",
        choices=["auto", "download", "git"],
        default="auto",
    )
    return parser.parse_args(argv, namespace=Args())


def main(argv: list[str]) -> int:
    args = _parse_args(argv)
    try:
        source = _resolve_source(args)
        source.ref = source.ref or args.ref
        if not source.paths:
            raise InstallError("No skill paths provided.")
        for path in source.paths:
            _validate_relative_path(path)
        dest_root = args.dest or _default_dest()
        tmp_dir = tempfile.mkdtemp(prefix="skill-install-", dir=_tmp_root())
        try:
            repo_root = _prepare_repo(source, args.method, tmp_dir)
            installed = []
            for path in source.paths:
                skill_name = args.name if len(source.paths) == 1 else None
                skill_name = skill_name or os.path.basename(path.rstrip("/"))
                _validate_skill_name(skill_name)
                if not skill_name:
                    raise InstallError("Unable to derive skill name.")
                dest_dir = os.path.join(dest_root, skill_name)
                if os.path.exists(dest_dir):
                    raise InstallError(f"Destination already exists: {dest_dir}")
                skill_src = os.path.join(repo_root, path)
                _validate_skill(skill_src)
                _copy_skill(skill_src, dest_dir)
                installed.append((skill_name, dest_dir))
        finally:
            if os.path.isdir(tmp_dir):
                shutil.rmtree(tmp_dir, ignore_errors=True)
        for skill_name, dest_dir in installed:
            print(f"Installed {skill_name} to {dest_dir}")
        return 0
    except InstallError as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    raise SystemExit(main(sys.argv[1:]))
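To make the `--url` handling in `install-skill-from-github.py` concrete, here is a standalone sketch of the same parsing logic as `_parse_github_url` (re-stated for illustration, minus the error handling, not imported from the script): a `/tree/<ref>/` URL yields the owner, repo, ref, and in-repo skill path.

```python
from urllib.parse import urlparse

# Illustrative mirror of _parse_github_url: split the URL path and peel off
# an optional /tree/<ref>/ (or /blob/<ref>/) segment before the skill path.
def parse_github_url(url: str, default_ref: str = "main"):
    parts = [p for p in urlparse(url).path.split("/") if p]
    owner, repo = parts[0], parts[1]
    ref, subpath = default_ref, ""
    if len(parts) > 2 and parts[2] in ("tree", "blob"):
        ref = parts[3]
        subpath = "/".join(parts[4:])
    elif len(parts) > 2:
        subpath = "/".join(parts[2:])
    return owner, repo, ref, subpath or None

print(parse_github_url("https://github.com/openai/skills/tree/main/skills/.experimental/demo"))
# ('openai', 'skills', 'main', 'skills/.experimental/demo')
```

When the URL names only `owner/repo`, the subpath comes back as `None`, which is why the script then requires an explicit `--path`.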
skills/.system/skill-installer/scripts/list-skills.py
ADDED
@@ -0,0 +1,107 @@
#!/usr/bin/env python3
"""List skills from a GitHub repo path."""

from __future__ import annotations

import argparse
import json
import os
import sys
import urllib.error

from github_utils import github_api_contents_url, github_request

DEFAULT_REPO = "openai/skills"
DEFAULT_PATH = "skills/.curated"
DEFAULT_REF = "main"


class ListError(Exception):
    pass


class Args(argparse.Namespace):
    repo: str
    path: str
    ref: str
    format: str


def _request(url: str) -> bytes:
    return github_request(url, "codex-skill-list")


def _codex_home() -> str:
    return os.environ.get("CODEX_HOME", os.path.expanduser("~/.codex"))


def _installed_skills() -> set[str]:
    root = os.path.join(_codex_home(), "skills")
    if not os.path.isdir(root):
        return set()
    entries = set()
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isdir(path):
            entries.add(name)
    return entries


def _list_skills(repo: str, path: str, ref: str) -> list[str]:
    api_url = github_api_contents_url(repo, path, ref)
    try:
        payload = _request(api_url)
    except urllib.error.HTTPError as exc:
        if exc.code == 404:
            raise ListError(
                "Skills path not found: "
                f"https://github.com/{repo}/tree/{ref}/{path}"
            ) from exc
        raise ListError(f"Failed to fetch skills: HTTP {exc.code}") from exc
    data = json.loads(payload.decode("utf-8"))
    if not isinstance(data, list):
        raise ListError("Unexpected skills listing response.")
    skills = [item["name"] for item in data if item.get("type") == "dir"]
    return sorted(skills)


def _parse_args(argv: list[str]) -> Args:
    parser = argparse.ArgumentParser(description="List skills.")
    parser.add_argument("--repo", default=DEFAULT_REPO)
    parser.add_argument(
        "--path",
        default=DEFAULT_PATH,
        help="Repo path to list (default: skills/.curated)",
    )
    parser.add_argument("--ref", default=DEFAULT_REF)
    parser.add_argument(
        "--format",
        choices=["text", "json"],
        default="text",
        help="Output format",
    )
    return parser.parse_args(argv, namespace=Args())


def main(argv: list[str]) -> int:
    args = _parse_args(argv)
    try:
        skills = _list_skills(args.repo, args.path, args.ref)
        installed = _installed_skills()
        if args.format == "json":
            payload = [
                {"name": name, "installed": name in installed} for name in skills
            ]
            print(json.dumps(payload))
        else:
            for idx, name in enumerate(skills, start=1):
                suffix = " (already installed)" if name in installed else ""
                print(f"{idx}. {name}{suffix}")
        return 0
    except ListError as exc:
        print(f"Error: {exc}", file=sys.stderr)
        return 1


if __name__ == "__main__":
    raise SystemExit(main(sys.argv[1:]))
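The shape of the `--format json` output above can be sketched without touching the network by feeding the same filtering logic a sample GitHub contents-API response (sample data, hypothetical skill names):

```python
import json

# Sample of what the GitHub contents API returns for a directory listing.
api_response = [
    {"name": "skill-b", "type": "dir"},
    {"name": "README.md", "type": "file"},
    {"name": "skill-a", "type": "dir"},
]
installed = {"skill-a"}  # directory names found under $CODEX_HOME/skills

# Same steps as _list_skills + the json branch of main: keep directories,
# sort, and annotate each with installed status.
skills = sorted(item["name"] for item in api_response if item.get("type") == "dir")
payload = [{"name": name, "installed": name in installed} for name in skills]
print(json.dumps(payload))
# [{"name": "skill-a", "installed": true}, {"name": "skill-b", "installed": false}]
```

Files like `README.md` are dropped because only `type == "dir"` entries count as skills.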
skills/agent-kernel/SKILL.md
ADDED
@@ -0,0 +1,379 @@
| 1 |
+
---
|
| 2 |
+
name: agentkernel
|
| 3 |
+
description: >
|
| 4 |
+
Spawn and orchestrate agents as local subprocesses or Kubernetes pods.
|
| 5 |
+
Each agent runs with an independent runtime, conversation, tools,
|
| 6 |
+
and skills. Use when a task benefits from parallel work, role
|
| 7 |
+
specialization, persistent agent state, or sandboxed execution.
|
| 8 |
+
metadata:
|
| 9 |
+
version: "2.1"
|
| 10 |
+
pre-condition: "0"
|
| 11 |
+
---
|
| 12 |
+
|
| 13 |
+
# AgentKernel
|
| 14 |
+
|
| 15 |
+
Spawn and orchestrate agents from `<helpers>` blocks. Each agent runs in its own process (local subprocess or Kubernetes pod) with an independent runtime, conversation state, tools, and skills. You decide what agents to create and what to say to them. The kernel handles process lifecycle, networking, image management, and health checks.
|
| 16 |
+
|
| 17 |
+
## AgentKernel vs `agents` Skill
|
| 18 |
+
|
| 19 |
+
| | `agents` skill | `agentkernel` |
|
| 20 |
+
|---|---|---|
|
| 21 |
+
| **Backends** | Local subprocesses only | Local subprocesses or k8s pods |
|
| 22 |
+
| **Addressing** | By name (`call_async("my-agent", ...)`) | By UUID + secret nonce |
|
| 23 |
+
| **Protocol** | Anthropic Messages API | Custom SSE (TurnRequest/TurnResponse) |
|
| 24 |
+
| **Access control** | Open — any caller can talk to any agent | Nonce-secured single-owner |
|
| 25 |
+
| **Teams / capacity** | No | Yes |
|
| 26 |
+
| **Image packaging** | No | Yes (OCI images for k8s) |
|
| 27 |
+
| **AgentBus** | No | Yes |
|
| 28 |
+
| **Dependencies** | API server skill | None |
|
| 29 |
+
|
| 30 |
+
**Use `agents`** for lightweight local agent workflows where convenience matters — create by name, call by name, check event logs.
|
| 31 |
+
|
| 32 |
+
**Use `agentkernel`** when you need k8s deployment, container isolation, capacity management, nonce-secured access, or agentbus observability.
|
| 33 |
+
|
| 34 |
+
## Core Concepts
|
| 35 |
+
|
| 36 |
+
**Kernel**: `AgentKernel` is the entry point. It wires together the spawner, agent client, and storage. The backend determines where agents run.
|
| 37 |
+
|
| 38 |
+
**Backends**: Two backends are available:
|
| 39 |
+
- **local** — agents run as subprocesses on the same machine. No isolation, no config file needed. Good for development and quick experiments.
|
| 40 |
+
- **kubernetes** — agents run as pods in a k8s cluster. Full container isolation. Requires a config file with cluster details.
|
| 41 |
+
|
| 42 |
+
The entire API after initialization is identical across backends.
|
| 43 |
+
|
| 44 |
+
**SpawnRequest + SpawnInfo**: A `SpawnRequest` defines the agent identity (name, team, metadata). The `spawn_info` field carries agent-type-specific config (system prompt, model, tools, etc.) — e.g. `OpenClawSpawnInfo`.
|
| 45 |
+
|
| 46 |
+
**Nonce**: Each spawn returns a `SpawnResult` containing the agent record and a secret nonce. The nonce is required for all communication — it enforces single-owner access. Only the entity that spawned an agent can talk to it.
|
| 47 |
+
|
| 48 |
+
**AgentBus**: Optional observability/safety layer. When enabled, all LLM inference and code execution events are logged to an agent bus that can be inspected externally.
|
| 49 |
+
|
| 50 |
+
**Teams**: Logical groups with capacity limits. Spawning into a full team raises an error.
|
| 51 |
+
|
| 52 |
+
## Initialization
|
| 53 |
+
|
| 54 |
+
### Local backend
|
| 55 |
+
|
| 56 |
+
No config file needed. Agents run as subprocesses with the same permissions as the parent process.
|
| 57 |
+
|
| 58 |
+
<helpers>
|
| 59 |
+
from agentic.kernel import AgentKernel
|
| 60 |
+
from agentic.kernel.plugins.openclaw import OpenClawPlugin
|
| 61 |
+
|
| 62 |
+
kernel = AgentKernel(backend="local", plugins=[OpenClawPlugin()])
|
| 63 |
+
</helpers>
|
| 64 |
+
|
| 65 |
+
### Kubernetes backend
|
| 66 |
+
|
| 67 |
+
Agents run as pods in the `agentkernel-0` namespace. The config file is at `agentkernel/examples/agentkernel.yaml`:
|
| 68 |
+
|
| 69 |
+
```yaml
|
| 70 |
+
backend: kubernetes
|
| 71 |
+
namespace: agentkernel-0
|
| 72 |
+
base_image: your-registry.example.com/agentkernel:latest
|
| 73 |
+
kubeconfig: ~/.kube/config
|
| 74 |
+
registry_url: your-registry.example.com
|
| 75 |
+
debug: true
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
- `debug: true` preserves pods on failure for inspection (otherwise they are cleaned up automatically).
|
| 79 |
+
|
| 80 |
+
<helpers>
|
| 81 |
+
from agentic.kernel import AgentKernel
|
| 82 |
+
from agentic.kernel.plugins.openclaw import OpenClawPlugin
|
| 83 |
+
|
| 84 |
+
kernel = AgentKernel.from_config("agentkernel/examples/agentkernel.yaml", plugins=[OpenClawPlugin()])
|
| 85 |
+
</helpers>
|
| 86 |
+
|
| 87 |
+
## API
|
| 88 |
+
|
| 89 |
+
All API calls below work identically regardless of backend.
|
| 90 |
+
|
| 91 |
+
### Spawn an Agent
|
| 92 |
+
|
| 93 |
+
<helpers>
|
| 94 |
+
import os
|
| 95 |
+
from agentic.kernel import SpawnRequest
|
| 96 |
+
from agentic.kernel.plugins.openclaw import OpenClawSpawnInfo
|
| 97 |
+
|
| 98 |
+
result = await kernel.spawner.spawn(SpawnRequest(
|
| 99 |
+
name="researcher",
|
| 100 |
+
agent_type="openclaw",
|
| 101 |
+
metadata={"role": "research"},
|
| 102 |
+
spawn_info=OpenClawSpawnInfo(
|
| 103 |
+
system_prompt="You are a research specialist. Be thorough and cite sources.",
|
| 104 |
+
model="claude-sonnet-4-5",
|
| 105 |
+
api_key=os.environ.get("LLM_API_KEY", ""),
|
| 106 |
+
),
|
| 107 |
+
))
|
| 108 |
+
|
| 109 |
+
agent = result.agent # Agent(id, name, team_id, state, metadata, ...)
|
| 110 |
+
nonce = result.nonce # Secret — required for all communication
|
| 111 |
+
print(f"Spawned: {agent.id} ({agent.name})")
|
| 112 |
+
</helpers>
|
| 113 |
+
|
| 114 |
+
`SpawnRequest` fields:
- `name` — agent name (also used in k8s pod naming)
- `team_id` — team for capacity tracking (optional, default: "")
- `metadata` — arbitrary labels for discovery (e.g. `{"role": "worker"}`)
- `image_id` — custom image from packaging (optional, defaults to base_image in k8s)
- `spawn_info` — agent-type-specific config (e.g. `OpenClawSpawnInfo`)
- `env` — extra environment variables forwarded to the agent process

`OpenClawSpawnInfo` fields:
- `system_prompt` — system prompt for the agent
- `model` — LLM model name (default: `"claude-sonnet-4-5"`)
- `provider` — LLM provider (default: `"anthropic"`)
- `tools` — list of tool names to enable (default: `["bash"]`)
- `thinking_level` — thinking level: `"none"`, `"low"`, `"medium"`, `"high"`
- `api_key` — LLM API key (also forwarded from host `LLM_API_KEY` env var)
- `base_url` — override LLM API base URL

### Send a Message (Turn)

Use the `ask()` helper to send a message and get the full response:

<helpers>
response = await ask(kernel, agent.id, nonce, "What are the latest findings on topic X?")
print(response)
</helpers>

The agent maintains conversation state — subsequent turns see the full history.

For manual streaming (e.g. to display progress), use `kernel.agent_client.turn()` directly — note `end=""` to avoid extra newlines between tokens:

<helpers>
import json
from agentic.kernel import TurnRequest

request = TurnRequest(
    agent_id=agent.id,
    nonce=nonce,
    body=json.dumps({
        "messages": [{"role": "user", "content": "What are the latest findings on topic X?"}]
    }).encode(),
)

response_text = []
async for chunk in kernel.agent_client.turn(request):
    if chunk.body:
        print(chunk.body, end="", flush=True)
        response_text.append(chunk.body)
    if chunk.error:
        print(f"\nError: {chunk.error}")
full_response = "".join(response_text)
</helpers>

### Get History

<helpers>
history = await kernel.agent_client.get_history(agent.id, last_n=5)
for entry in history:
    print(f"[{entry['role']}] {entry['content'][:100]}")
</helpers>

### Get Agent Info

<helpers>
info = await kernel.agent_client.get_info(agent.id)
print(f"pid={info['pid']} cwd={info['cwd']} uid={info['uid']}")
</helpers>

### Check Status

<helpers>
statuses = await kernel.status()
for s in statuses:
    line = f"{s['name']}: state={s['state']} live={s['live']}"
    if s.get('pod_phase'):  # k8s backend
        line += f" pod={s['pod_phase']}"
    if s.get('process_alive') is not None:  # local backend
        line += f" process_alive={s['process_alive']}"
    print(line)
</helpers>

### Kill an Agent

<helpers>
await kernel.spawner.kill(agent.id)
</helpers>

### Clean Up All Agents

<helpers>
await kernel.cleanup()
</helpers>

## Teams

Teams reserve capacity and group agents together.

<helpers>
from agentic.kernel import CreateTeamRequest

# Reserve capacity
await kernel.spawner.create_team(CreateTeamRequest(
    team_id="analysis-team",
    resources={"cpu": 4},
))

# Spawn into the team
result = await kernel.spawner.spawn(SpawnRequest(
    name="analyst",
    team_id="analysis-team",
    agent_type="openclaw",
    spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a data analyst.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
))

# Delete team (kills all agents first)
await kernel.spawner.delete_team("analysis-team")
</helpers>

## AgentBus

AgentBus adds observability and safety to agent execution. When enabled, the agent logs all LLM inference and code execution events to a bus that can be inspected via the agentbus CLI.

<helpers>
from agentic.kernel import AgentBusConfig

result = await kernel.spawner.spawn(SpawnRequest(
    name="worker",
    agent_type="openclaw",
    spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a helpful worker.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
    agentbus=AgentBusConfig(
        port=8095,
        disable_safety=False,
    ),
))
</helpers>

To inspect the bus, you can use the agentbus skill.

```bash
# Kubernetes backend — port-forward first
kubectl --kubeconfig ~/.kube/config \
  -n agentkernel-0 port-forward pod/agent-<id-prefix> 8095:8095
# Then poll as above
```

The bus ID follows the pattern `{agent_name}.{agent_uuid}`.

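The stated ID pattern can be sketched concretely. Both values below are hypothetical placeholders for illustration; real names and UUIDs come from the spawn result:

```python
# Build a bus ID following the documented pattern "{agent_name}.{agent_uuid}".
# Both values here are hypothetical placeholders.
agent_name = "worker"
agent_uuid = "3f2a1b9c-0d4e-4f61-8a7b-2c5d6e7f8a9b"

bus_id = f"{agent_name}.{agent_uuid}"
print(bus_id)  # worker.3f2a1b9c-0d4e-4f61-8a7b-2c5d6e7f8a9b

# Split it back apart at the first dot (assumes the agent name contains no dots).
name, _, uuid = bus_id.partition(".")
assert (name, uuid) == (agent_name, agent_uuid)
```
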
## Patterns

### Fan-out / Fan-in

Spawn specialists, query them in parallel, synthesize results.

<helpers>
import asyncio

# Spawn specialists
researcher_r = await kernel.spawner.spawn(SpawnRequest(
    name="researcher", agent_type="openclaw", spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a research specialist.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
))
analyst_r = await kernel.spawner.spawn(SpawnRequest(
    name="analyst", agent_type="openclaw", spawn_info=OpenClawSpawnInfo(
        system_prompt="You are a data analyst.",
        api_key=os.environ.get("LLM_API_KEY", ""),
    ),
))

# Fan out — ask() collects streaming chunks into a single string
research_task = asyncio.create_task(
    ask(kernel, researcher_r.agent.id, researcher_r.nonce, "Find papers on quantum error correction")
)
analysis_task = asyncio.create_task(
    ask(kernel, analyst_r.agent.id, analyst_r.nonce, "Run cost-benefit analysis on approach X")
)
research, analysis = await asyncio.gather(research_task, analysis_task)

print(f"Research: {research[:200]}")
print(f"Analysis: {analysis[:200]}")
</helpers>

### Pipeline

One agent's output feeds the next.

<helpers>
raw_data = await ask(kernel, researcher_r.agent.id, researcher_r.nonce, "Gather data on topic X")
analysis = await ask(kernel, analyst_r.agent.id, analyst_r.nonce, f"Analyze this data:\n{raw_data}")
print(analysis)
</helpers>

### Image Packaging

Bundle custom code into agent images. On the local backend, bundles are copied to a directory. On k8s, an OCI image is built and pushed to the registry.

<helpers>
from agentic.kernel import SourceBundle

# Upload code to blob storage
helpers_uri = kernel.blob_store.upload_dir("./my_helpers/")

# Build an agent image with the bundle
job = await kernel.packaging.create_agent_image(
    name="custom-worker",
    bundles=[SourceBundle(uri=helpers_uri, labels={"name": "my_helpers"})],
)
if job.image:
    # Spawn an agent using the custom image
    result = await kernel.spawner.spawn(SpawnRequest(
        name="custom-agent",
        agent_type="openclaw",
        image_id=job.image.id,
        spawn_info=OpenClawSpawnInfo(
            system_prompt="You have custom tools available.",
            api_key=os.environ.get("LLM_API_KEY", ""),
        ),
    ))
</helpers>

## Lifecycle

- Agents persist (as subprocesses or pods) until explicitly killed. Always clean up when done.
- Each agent has one conversation and one owner. The nonce enforces this — only the spawner can communicate with its agent.
- Teams have capacity limits. Spawning into a full team raises `ValueError`.
- The `LLM_API_KEY` and `OPENAI_API_KEY` environment variables are automatically forwarded to agent processes.

## Operations

**Note**: If behind a proxy, configure `HTTP_PROXY`/`HTTPS_PROXY` environment variables.

### Run examples locally

```bash
# Single agent, local backend (no config file needed)
LLM_API_KEY=... uv run python -m agentkernel.examples.simple_agent

# Team scenario, local backend
LLM_API_KEY=... uv run python -m agentkernel.examples.team_scenario
```

### Run examples on Kubernetes

```bash
# run_k8s_scenario.sh runs the scenario against the configured k8s cluster
LLM_API_KEY=... ./agentkernel/scripts/run_k8s_scenario.sh simple_agent
LLM_API_KEY=... ./agentkernel/scripts/run_k8s_scenario.sh team_scenario
```

### Build and push the base image (k8s only)

```bash
./scripts/build_base_image.sh --force-base
```

### Clean up cluster resources (k8s only)

```bash
./agentkernel/scripts/cleanup_k8s.sh            # delete all agentkernel pods/svc/cm
./agentkernel/scripts/cleanup_k8s.sh --dry-run  # preview what would be deleted
```

skills/hugging-face-evaluation/SKILL.md
ADDED
---
name: hugging-face-evaluation
description: Add evaluation results to Hugging Face model repositories using the .eval_results/ format. Uses HF CLI for PR management and manual YAML creation.
---

# Overview

This skill adds structured evaluation results to HuggingFace model repositories using the [`.eval_results/` format](https://huggingface.co/docs/hub/eval-results).

**What This Enables:**
- Results appear on model pages with benchmark links
- Scores are aggregated into benchmark dataset leaderboards
- Community contributions via Pull Requests

# Important

Evaluation PRs can only be opened on the Hugging Face Hub. They cannot be opened on the GitHub repository.

# Version

3.0.0

# Workflow Overview

The actual workflow uses:
1. **HF CLI** (`hf upload`, `hf download`) for PR operations
2. **Manual YAML creation** in `/tmp/pr-reviews/`
3. **`check_prs.py`** script to check for existing PRs
4. **curl** to fetch model cards and leaderboard data

See `references/hf_cli_for_prs.md` for detailed CLI instructions.

---

# CRITICAL: Multiple Scores for One Benchmark

Models can have multiple scores for the same benchmark (with/without tools). **Each variant MUST be in a separate file.**

## File Naming Convention

| Condition | File Name | Notes Field |
|-----------|-----------|-------------|
| Default (no tools) | `hle.yaml` | None (omit notes) |
| With tools | `hle_with_tools.yaml` | `notes: "With tools"` |

## Notes Field Rules

1. **No tools = No notes field** - Default assumption is "without tools"
2. **With tools = Add notes** - Only add when tools ARE used
3. **Standardized format** - Always use `notes: "With tools"` (capital W)

**CORRECT:**
```yaml
# hle.yaml (no tools - DEFAULT)
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
```

```yaml
# hle_with_tools.yaml (with tools)
- dataset:
    id: cais/hle
    task_id: hle
  value: 44.9
  source:
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
  notes: "With tools"
```

**INCORRECT:**
```yaml
notes: "Without tools"  # Don't add notes for default
notes: "w/ tools"       # Use standardized format
notes: "with tools"     # Capital W required
```

---

# Core Workflow

## Step 1: Check for Existing PRs

**ALWAYS check before creating new PRs:**

```bash
uv run scripts/check_prs.py --repo-id "org/model-name"
```

If PRs exist, update them instead of creating new ones.

## Step 2: Fetch Model Card and Extract Scores

```bash
# Get model README
curl -s "https://huggingface.co/org/model-name/raw/main/README.md" | grep -i -A10 "HLE\|GPQA\|MMLU"
```

Or use MCP tools:
```
mcp__hf-mcp-server__hub_repo_details
  repo_ids: ["org/model-name"]
  include_readme: true
```

## Step 3: Create YAML File

```bash
mkdir -p /tmp/pr-reviews/new-prs
cd /tmp/pr-reviews/new-prs

cat > hle.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
EOF
```

## Step 4: Create PR

```bash
hf upload org/model-name hle.yaml .eval_results/hle.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result"
```

## Step 5: Get PR Number

```bash
uv run scripts/check_prs.py --repo-id "org/model-name"
```

---

# Updating Existing PRs

```bash
# Download PR contents
hf download org/model-name --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --include ".eval_results/*" \
  --local-dir /tmp/pr-reviews/model-pr<PR_NUMBER>

# Edit the YAML file, then upload
hf upload org/model-name /tmp/pr-reviews/updated.yaml .eval_results/hle.yaml \
  --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --commit-message "Update evaluation result"
```

---

# Deleting Files from PRs

Use the Python API:
```bash
uv run --with huggingface_hub python3 << 'EOF'
from huggingface_hub import HfApi
api = HfApi()
api.delete_file(
    path_in_repo=".eval_results/old_file.yaml",
    repo_id="org/model-name",
    repo_type="model",
    revision="refs/pr/<PR_NUMBER>",
    commit_message="Remove file"
)
EOF
```

---

# Fetching Leaderboard Data

```bash
# HLE leaderboard (requires auth for private datasets)
curl -s "https://huggingface.co/api/datasets/cais/hle/leaderboard" \
  -H "Authorization: Bearer $HF_TOKEN"

# MMLU-Pro leaderboard (public)
curl -s "https://huggingface.co/api/datasets/TIGER-Lab/MMLU-Pro/leaderboard"

# Model eval results
curl -s "https://huggingface.co/api/models/org/model?expand[]=evalResults"
```

---

# .eval_results/ Format

```yaml
# .eval_results/hle.yaml
- dataset:
    id: cais/hle        # Required: Hub Benchmark dataset ID
    task_id: hle        # Required: task id from dataset's eval.yaml
  value: 22.2           # Required: metric value
  date: "2026-01-14"    # Optional: ISO-8601 date
  source:               # Optional: attribution
    url: https://huggingface.co/org/model
    name: Model Card
    user: username
```

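Before uploading, the required fields above can be sanity-checked programmatically. A minimal sketch (not one of the skill's scripts) that operates on an already-parsed entry with nested `dataset.id` / `dataset.task_id` and a top-level `value`; in practice you would first parse the YAML file, e.g. with PyYAML:

```python
def check_entry(entry: dict) -> list[str]:
    # Validate one parsed .eval_results entry against the required fields:
    # dataset.id, dataset.task_id, and a numeric top-level value.
    problems = []
    dataset = entry.get("dataset") or {}
    for key in ("id", "task_id"):
        if key not in dataset:
            problems.append(f"missing dataset.{key}")
    if "value" not in entry:
        problems.append("missing value")
    elif not isinstance(entry["value"], (int, float)):
        problems.append("value must be numeric")
    return problems

ok = {"dataset": {"id": "cais/hle", "task_id": "hle"}, "value": 22.2}
print(check_entry(ok))                               # []
print(check_entry({"dataset": {"id": "cais/hle"}}))  # ['missing dataset.task_id', 'missing value']
```
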
---

# Supported Benchmarks

| Benchmark | Hub Dataset ID | Task ID |
|-----------|---------------|---------|
| HLE | cais/hle | hle |
| GPQA | Idavidrein/gpqa | diamond |
| MMLU-Pro | TIGER-Lab/MMLU-Pro | mmlu_pro |

---

# Tool-Using Agent Models

Models like MiroThinker and Nemotron-Orchestrator are inherently tool-using agents. For these:

1. Use `hle_with_tools.yaml` as the filename
2. Add `notes: "With tools"`
3. Look for terms: "search agent", "agentic", "orchestrator", "code-interpreter"

---

# Environment Setup

```bash
export HF_TOKEN="your-huggingface-token"
```

---

# Scripts Reference

```bash
# Check for existing PRs (ALWAYS do this first)
uv run scripts/check_prs.py --repo-id "org/model-name"
```

See `references/hf_cli_for_prs.md` for complete HF CLI workflow documentation.

---

# Best Practices

1. **Always check for existing PRs** before creating new ones
2. **Separate files for variants** - `hle.yaml` for default, `hle_with_tools.yaml` for tools
3. **Notes only for non-default** - Omit notes for standard evaluations
4. **Standardized format** - Use `"With tools"` exactly (capital W)
5. **Verify scores** - Compare YAML against model card before submitting
skills/hugging-face-evaluation/examples/.env.example
ADDED
# Hugging Face Token (required for all operations)
# Get your token at: https://huggingface.co/settings/tokens
HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx

# Artificial Analysis API Key (required for import-aa command)
# Get your key at: https://artificialanalysis.ai/
AA_API_KEY=aa_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
|
skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md
ADDED
# Usage Examples

Practical examples for adding evaluations to HuggingFace model repositories using the `.eval_results/` format.

## Table of Contents
1. [Setup](#setup)
2. [Add Single Benchmark (Recommended)](#add-single-benchmark-recommended)
3. [Batch Process Trending Models](#batch-process-trending-models)
4. [Extract from README Tables](#extract-from-readme-tables)
5. [Import from Artificial Analysis](#import-from-artificial-analysis)
6. [Common Workflows](#common-workflows)

---

## Setup

### Environment Variables

```bash
# Required for creating PRs
export HF_TOKEN="hf_your_write_token_here"

# Optional: for Artificial Analysis source
export AA_API_KEY="your_aa_api_key_here"
```

Or use a `.env` file:
```bash
cp examples/.env.example .env
# Edit .env with your tokens
```

### Verify Installation

```bash
uv run scripts/evaluation_manager.py --help
```

---

## Add Single Benchmark (Recommended)

The simplest way to add a specific benchmark score to a model.

### Basic Usage

```bash
# Preview (default - prints YAML without uploading)
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "moonshotai/Kimi-K2-Thinking"
```

Output:
```
Looking up HLE score for moonshotai/Kimi-K2-Thinking from model_card...
Found: HLE = 23.9
Generated YAML:
- dataset:
    id: cais/hle
    task_id: default
  value: 23.9
  date: "2026-01-14"
  source:
    url: https://huggingface.co/moonshotai/Kimi-K2-Thinking
    name: Model Card
```

### From Artificial Analysis

```bash
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "MiniMaxAI/MiniMax-M2.1" \
  --source aa
```

### Create PR

```bash
# Always check for existing PRs first!
uv run scripts/evaluation_manager.py get-prs --repo-id "model/name"

# If no PRs exist, create one
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "model/name" \
  --create-pr
```

### Push Directly (Your Own Model)

```bash
uv run scripts/evaluation_manager.py add-eval \
  --benchmark GPQA \
  --repo-id "your-username/your-model" \
  --apply
```

### Provide Score Manually

```bash
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "model/name" \
  --value 84.5 \
  --create-pr
```

---

## Batch Process Trending Models

Process multiple trending models at once.

### Preview Mode (Dry Run)

```bash
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --dry-run
```

Output:
```
==================================================
Batch Evaluation PR Creator
==================================================
Benchmark: HLE
Source: model_card
Pipeline tag: text-generation
Limit: 10
Sort: trending
Dry run: True
==================================================

Processing: LiquidAI/LFM2.5-1.2B-Instruct
  Not found: HLE score not available
Processing: MiniMaxAI/MiniMax-M2.1
  Found: HLE = 22.2
  Status: Would create PR (dry run)
...

Summary:
  Success: 3
  Not found: 7
```

### Create PRs

```bash
# From model cards
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE

# From Artificial Analysis
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --source aa
```

### Sort Options

```bash
# By downloads (more established models)
uv run scripts/batch_eval_prs.py --limit 20 --sort downloads --benchmark GPQA

# By likes
uv run scripts/batch_eval_prs.py --limit 10 --sort likes --benchmark MMLU-Pro
```

### Filter by Pipeline Tag

```bash
# Only text-generation models (default)
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --pipeline-tag text-generation

# Image generation models
uv run scripts/batch_eval_prs.py --limit 10 --benchmark HLE --pipeline-tag text-to-image
```

### Results Tracking

Results are saved to `runs/{benchmark}_{date}_{hash}.json`:

```bash
cat runs/hle_20260114_abc123.json
```

```json
{
  "benchmark": "HLE",
  "source": "aa",
  "source_url": "https://artificialanalysis.ai",
  "created": "2026-01-14T08:00:00Z",
  "results": [
    {
      "repo_id": "MiniMaxAI/MiniMax-M2.1",
      "value": 22.2,
      "status": "pr_created",
      "source_url": "https://artificialanalysis.ai"
    }
  ]
}
```

---

## Extract from README Tables

For models with evaluation tables in their README.

### Step 1: Inspect Tables

```bash
uv run scripts/evaluation_manager.py inspect-tables \
  --repo-id "deepseek-ai/DeepSeek-V3"
```

This shows all tables with their structure, helping you identify which table to extract.

### Step 2: Preview Extraction

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "deepseek-ai/DeepSeek-V3" \
  --table 1
```

### Step 3: Create PR

```bash
uv run scripts/evaluation_manager.py extract-readme \
  --repo-id "deepseek-ai/DeepSeek-V3" \
  --table 1 \
  --create-pr
```

---

## Import from Artificial Analysis

Import all available benchmarks from the Artificial Analysis API.

### Preview

```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "your-username/claude-mirror"
```

### Create PR

```bash
uv run scripts/evaluation_manager.py import-aa \
  --creator-slug "anthropic" \
  --model-name "claude-sonnet-4" \
  --repo-id "your-username/claude-mirror" \
  --apply --create-pr
```

### Finding Creator Slug and Model Name

Visit [Artificial Analysis](https://artificialanalysis.ai/) and check the URL:
- URL: `https://artificialanalysis.ai/models/{creator-slug}/{model-name}`

Common examples:
- Anthropic: `--creator-slug "anthropic" --model-name "claude-sonnet-4"`
- OpenAI: `--creator-slug "openai" --model-name "gpt-4-turbo"`
- Meta: `--creator-slug "meta" --model-name "llama-3-70b"`

---
|
| 270 |
+
|
| 271 |
+
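The `{creator-slug}/{model-name}` pair can also be pulled out of a pasted URL programmatically. A minimal Python sketch (`parse_aa_model_url` is a hypothetical helper, not part of the skill's scripts; it assumes the `/models/{creator-slug}/{model-name}` path shape shown above):

```python
from urllib.parse import urlparse


def parse_aa_model_url(url: str) -> tuple[str, str]:
    """Split an Artificial Analysis model URL into (creator_slug, model_name)."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expected path shape: /models/{creator-slug}/{model-name}
    if len(parts) != 3 or parts[0] != "models":
        raise ValueError(f"Unexpected Artificial Analysis model URL: {url}")
    return parts[1], parts[2]


creator, model = parse_aa_model_url(
    "https://artificialanalysis.ai/models/anthropic/claude-sonnet-4"
)
print(f'--creator-slug "{creator}" --model-name "{model}"')
```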
## Common Workflows

### Workflow 1: Add a Missing Benchmark to a Popular Model

```bash
# 1. Check for existing PRs
uv run scripts/evaluation_manager.py get-prs \
  --repo-id "meta-llama/Llama-3.1-8B-Instruct"

# 2. Preview what we'd add
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "meta-llama/Llama-3.1-8B-Instruct"

# 3. Create a PR if a score is found
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "meta-llama/Llama-3.1-8B-Instruct" \
  --create-pr
```

### Workflow 2: Batch Update Trending Models

```bash
# 1. Dry run to see which models have HLE scores
uv run scripts/batch_eval_prs.py --limit 20 --benchmark HLE --source aa --dry-run

# 2. Create PRs for models with scores
uv run scripts/batch_eval_prs.py --limit 20 --benchmark HLE --source aa

# 3. Check results
cat runs/hle_*.json | jq '.results[] | select(.status == "pr_created")'
```
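The `jq` filter in step 3 of Workflow 2 has a straightforward Python equivalent, useful when `jq` is unavailable. A sketch assuming only the run-report shape implied by that filter (a top-level `results` list whose items carry a `status` field; the repo IDs below are placeholders):

```python
import json


def created_prs(run_report: dict) -> list[dict]:
    """Python equivalent of: jq '.results[] | select(.status == "pr_created")'"""
    return [r for r in run_report.get("results", []) if r.get("status") == "pr_created"]


# Hypothetical run report, as it might be parsed from runs/hle_*.json
report = json.loads("""
{
  "results": [
    {"repo_id": "org/model-a", "status": "pr_created"},
    {"repo_id": "org/model-b", "status": "no_score_found"}
  ]
}
""")

for r in created_prs(report):
    print(r["repo_id"])
```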
### Workflow 3: Update Your Own Model

```bash
# 1. Add an HLE score from your model card
uv run scripts/evaluation_manager.py add-eval \
  --benchmark HLE \
  --repo-id "your-username/your-model" \
  --apply

# 2. Add a GPQA score manually
uv run scripts/evaluation_manager.py add-eval \
  --benchmark GPQA \
  --repo-id "your-username/your-model" \
  --value 84.5 \
  --apply
```

---

## Output Format

Results are stored in `.eval_results/*.yaml`:

```yaml
- dataset:
    id: cais/hle        # Hub benchmark dataset ID
    task_id: default    # Optional task ID
  value: 23.9           # Metric value
  date: "2026-01-14"    # ISO-8601 date
  source:               # Attribution
    url: https://huggingface.co/model/name
    name: Model Card
```
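An entry like the one above can also be generated with PyYAML rather than written by hand. A sketch (the values are the same placeholders as in the example; this is not how the skill's scripts are required to do it, though `evaluation_manager.py` works similarly):

```python
import yaml  # PyYAML

entry = {
    "dataset": {"id": "cais/hle", "task_id": "default"},
    "value": 23.9,
    "date": "2026-01-14",
    "source": {"url": "https://huggingface.co/model/name", "name": "Model Card"},
}

# A .eval_results/ file holds a YAML list of such entries
print(yaml.dump([entry], sort_keys=False, allow_unicode=True))
```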
---

## Supported Benchmarks

| Benchmark | Hub Dataset ID     |
|-----------|--------------------|
| HLE       | cais/hle           |
| GPQA      | Idavidrein/gpqa    |
| MMLU-Pro  | TIGER-Lab/MMLU-Pro |
| GSM8K     | openai/gsm8k       |

To add a new benchmark, update `examples/metric_mapping.json`.

---

## Troubleshooting

### "AA_API_KEY not set"

```bash
export AA_API_KEY="your-key"
# or add it to a .env file
```

### "Could not find benchmark in model card"

The benchmark name may be formatted differently in the README. Check the model card manually.

### "Token does not have write access"

Generate a new token at https://huggingface.co/settings/tokens with the Write scope.

---

## Getting Help

```bash
uv run scripts/evaluation_manager.py --help
uv run scripts/evaluation_manager.py add-eval --help
uv run scripts/batch_eval_prs.py --help
```

For more information:

- [Hugging Face eval results documentation](https://huggingface.co/docs/hub/eval-results)
- [SKILL.md](../SKILL.md) - Complete skill documentation
skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py
ADDED
@@ -0,0 +1,220 @@
# /// script
# requires-python = ">=3.13"
# dependencies = [
#   "huggingface-hub>=1.1.4",
#   "python-dotenv>=1.2.1",
#   "pyyaml>=6.0.3",
#   "requests>=2.32.5",
# ]
# ///

"""
Add Artificial Analysis evaluations to a Hugging Face model repository.

This script outputs evaluation results in the new .eval_results/ format
as documented at https://huggingface.co/docs/hub/eval-results

NOTE: This is a standalone reference script. For integrated functionality
with additional features (README extraction, validation, etc.), use:
    ../scripts/evaluation_manager.py import-aa [options]

STANDALONE USAGE:
    AA_API_KEY="<your-api-key>" HF_TOKEN="<your-huggingface-token>" \
    python artificial_analysis_to_hub.py \
        --creator-slug <artificial-analysis-creator-slug> \
        --model-name <artificial-analysis-model-name> \
        --repo-id <huggingface-repo-id>

INTEGRATED USAGE (Recommended):
    python ../scripts/evaluation_manager.py import-aa \
        --creator-slug <creator-slug> \
        --model-name <model-name> \
        --repo-id <repo-id> \
        [--create-pr]
"""

import argparse
import json
import os
from datetime import date
from pathlib import Path

import requests
import yaml
import dotenv
from huggingface_hub import HfApi

dotenv.load_dotenv()

API_KEY = os.getenv("AA_API_KEY")
HF_TOKEN = os.getenv("HF_TOKEN")

if not API_KEY:
    raise ValueError("AA_API_KEY is not set")
if not HF_TOKEN:
    raise ValueError("HF_TOKEN is not set")

URL = "https://artificialanalysis.ai/api/v2/data/llms/models"
HEADERS = {"x-api-key": API_KEY}


def load_benchmark_mapping():
    """Load the benchmark-to-dataset mapping from metric_mapping.json."""
    mapping_file = Path(__file__).parent / "metric_mapping.json"

    if not mapping_file.exists():
        # Fall back to a minimal built-in mapping
        return {
            "MMLU": {"dataset_id": "cais/mmlu", "aliases": ["mmlu"]},
            "GPQA": {"dataset_id": "Idavidrein/gpqa", "task_id": "gpqa_diamond", "aliases": ["gpqa"]},
            "GSM8K": {"dataset_id": "openai/gsm8k", "aliases": ["gsm8k"]},
        }

    with open(mapping_file) as f:
        mapping = json.load(f)

    mapping.pop("_comment", None)
    return mapping


def find_benchmark_dataset(benchmark_name, mapping):
    """Find the Hub dataset ID for a benchmark name."""
    normalized = benchmark_name.lower().replace(" ", "_").replace("-", "_")

    # Try an exact match
    if benchmark_name in mapping:
        entry = mapping[benchmark_name]
        return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    # Try a case-insensitive match
    for key, entry in mapping.items():
        if key.lower() == benchmark_name.lower():
            return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    # Try matching aliases
    for key, entry in mapping.items():
        aliases = entry.get("aliases", [])
        if normalized in [a.lower().replace(" ", "_").replace("-", "_") for a in aliases]:
            return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    return None


def get_model_evaluations_data(creator_slug, model_name):
    response = requests.get(URL, headers=HEADERS, timeout=30)
    response.raise_for_status()
    response_data = response.json()["data"]
    for model in response_data:
        if (
            model["model_creator"]["slug"] == creator_slug
            and model["slug"] == model_name
        ):
            return model
    raise ValueError(f"Model {creator_slug}/{model_name} not found")


def aa_evaluations_to_eval_results(model):
    """
    Convert Artificial Analysis model data to the .eval_results/ format.

    Returns a list of evaluation entries ready for YAML output.
    """
    if not model:
        raise ValueError("Model data is required")

    evaluations = model.get("evaluations", {})
    mapping = load_benchmark_mapping()
    results = []
    today = date.today().isoformat()

    for key, value in evaluations.items():
        if value is None:
            continue

        # Convert the key to title case for matching
        benchmark_name = key.replace("_", " ").title()
        dataset_info = find_benchmark_dataset(benchmark_name, mapping)

        if not dataset_info:
            # Try the original key as well
            dataset_info = find_benchmark_dataset(key, mapping)

        if not dataset_info:
            print(f"Warning: Could not find Hub dataset ID for '{benchmark_name}'. Skipping.")
            continue

        entry = {
            "dataset": {
                "id": dataset_info["dataset_id"],
            },
            "value": value,
            "date": today,
            "source": {
                "url": "https://artificialanalysis.ai",
                "name": "Artificial Analysis",
            },
        }

        # Add task_id if it is not the default
        if dataset_info.get("task_id") and dataset_info["task_id"] != "default":
            entry["dataset"]["task_id"] = dataset_info["task_id"]

        results.append(entry)

    return results


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--creator-slug", type=str, required=True)
    parser.add_argument("--model-name", type=str, required=True)
    parser.add_argument("--repo-id", type=str, required=True)
    parser.add_argument("--filename", type=str, default="artificial_analysis.yaml",
                        help="Output filename in .eval_results/")
    parser.add_argument("--dry-run", action="store_true",
                        help="Print YAML without uploading")
    args = parser.parse_args()

    aa_evaluations_data = get_model_evaluations_data(
        creator_slug=args.creator_slug, model_name=args.model_name
    )

    eval_results = aa_evaluations_to_eval_results(model=aa_evaluations_data)

    if not eval_results:
        print("No evaluations could be mapped to Hub dataset IDs")
        return

    # Generate the YAML content
    yaml_content = yaml.dump(eval_results, sort_keys=False, allow_unicode=True)

    if args.dry_run:
        print("\nGenerated .eval_results/ YAML:")
        print(yaml_content)
        return

    # Upload to the .eval_results/ folder
    api = HfApi(token=HF_TOKEN)
    file_path = f".eval_results/{args.filename}"

    commit_message = f"Add Artificial Analysis evaluations for {args.model_name}"
    commit_description = (
        "This commit adds structured evaluation results in the new .eval_results/ format. "
        "Results will appear on the model page and linked benchmark leaderboards. "
        "See https://huggingface.co/docs/hub/eval-results for format details."
    )

    api.upload_file(
        path_or_fileobj=yaml_content.encode("utf-8"),
        path_in_repo=file_path,
        repo_id=args.repo_id,
        repo_type="model",
        commit_message=commit_message,
        commit_description=commit_description,
        create_pr=True,
    )

    print(f"✓ Pull request created for {args.repo_id}")
    print(f"  File: {file_path}")


if __name__ == "__main__":
    main()
skills/hugging-face-evaluation/examples/eval.example.yaml
ADDED
@@ -0,0 +1,11 @@
- dataset:
    id: cais/hle        # Required. Hub dataset ID (must be a Benchmark)
    task_id: default    # Optional, in case there are multiple tasks or leaderboards for this dataset
    revision: <hash>    # Optional. Dataset revision hash
  value: 20.90          # Required. Metric value
  verifyToken: <token>  # Optional. Cryptographic proof of an auditable evaluation
  date: "2025-01-15"    # Optional. ISO-8601 date or datetime (defaults to the git commit time)
  source:               # Optional. Attribution for this result, for instance a repo containing output traces or a Paper
    url: https://huggingface.co/spaces/SaylorTwift/smollm3-mmlu-pro  # Required if a source is provided
    name: Eval traces   # Optional. Display name
    user: SaylorTwift   # Optional. HF username/org
skills/hugging-face-evaluation/examples/example_readme_tables.md
ADDED
@@ -0,0 +1,135 @@
# Example Evaluation Table Formats

This file shows various formats of evaluation tables that can be extracted from model README files.

## Format 1: Benchmarks as Rows (Most Common)

```markdown
| Benchmark | Score |
|-----------|-------|
| MMLU      | 85.2  |
| HumanEval | 72.5  |
| GSM8K     | 91.3  |
| HellaSwag | 88.9  |
```

## Format 2: Multiple Metric Columns

```markdown
| Benchmark | Accuracy | F1 Score |
|-----------|----------|----------|
| MMLU      | 85.2     | 0.84     |
| GSM8K     | 91.3     | 0.91     |
| DROP      | 78.5     | 0.77     |
```

## Format 3: Benchmarks as Columns

```markdown
| MMLU | HumanEval | GSM8K | HellaSwag |
|------|-----------|-------|-----------|
| 85.2 | 72.5      | 91.3  | 88.9      |
```

## Format 4: Percentage Values

```markdown
| Benchmark  | Score |
|------------|-------|
| MMLU       | 85.2% |
| HumanEval  | 72.5% |
| GSM8K      | 91.3% |
| TruthfulQA | 68.7% |
```

## Format 5: Mixed Format with Categories

```markdown
### Reasoning

| Benchmark | Score |
|-----------|-------|
| MMLU      | 85.2  |
| BBH       | 82.4  |
| GPQA      | 71.3  |

### Coding

| Benchmark | Score |
|-----------|-------|
| HumanEval | 72.5  |
| MBPP      | 78.9  |

### Math

| Benchmark | Score |
|-----------|-------|
| GSM8K     | 91.3  |
| MATH      | 65.8  |
```

## Format 6: With Additional Columns

```markdown
| Benchmark | Score | Rank | Notes         |
|-----------|-------|------|---------------|
| MMLU      | 85.2  | #5   | 5-shot        |
| HumanEval | 72.5  | #8   | pass@1        |
| GSM8K     | 91.3  | #3   | 8-shot, maj@1 |
```

## How the Extractor Works

The script will:
1. Find all markdown tables in the README
2. Identify which tables contain evaluation results
3. Parse the table structure (rows vs. columns)
4. Extract numeric values as scores
5. Convert them to model-index YAML format

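The steps above can be sketched as a minimal parser for Format 1 tables. This is an illustration only (`parse_benchmark_table` is a hypothetical helper, not the extractor itself), and it ignores the multi-column and transposed layouts shown earlier:

```python
import re


def parse_benchmark_table(markdown: str) -> dict[str, float]:
    """Extract {benchmark: score} pairs from a two-column markdown table."""
    scores = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        name, value = cells
        # Skip the header row and the |---| separator row; keep numeric scores only
        if re.fullmatch(r":?-{3,}:?", value) or not re.fullmatch(r"\d+(\.\d+)?%?", value):
            continue
        scores[name] = float(value.rstrip("%"))
    return scores


table = """
| Benchmark | Score |
|-----------|-------|
| MMLU      | 85.2  |
| GSM8K     | 91.3  |
"""
print(parse_benchmark_table(table))  # → {'MMLU': 85.2, 'GSM8K': 91.3}
```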
## Tips for README Authors

To ensure your evaluation tables are properly extracted:

1. **Use clear headers**: Include "Benchmark", "Score", or similar terms
2. **Keep it simple**: Stick to benchmark name + score columns
3. **Use standard formats**: Follow markdown table syntax
4. **Include numeric values**: Ensure scores are parseable numbers
5. **Be consistent**: Use the same format across multiple tables

## Example Complete README Section

```markdown
# Model Card for MyModel-7B

## Evaluation Results

Our model was evaluated on several standard benchmarks:

| Benchmark     | Score |
|---------------|-------|
| MMLU          | 85.2  |
| HumanEval     | 72.5  |
| GSM8K         | 91.3  |
| HellaSwag     | 88.9  |
| ARC-Challenge | 81.7  |
| TruthfulQA    | 68.7  |

### Detailed Results

For more detailed results and methodology, see our [paper](link).
```

## Running the Extractor

```bash
# Extract from this example
python scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --dry-run

# Apply to your model card
python scripts/evaluation_manager.py extract-readme \
  --repo-id "your-username/your-model" \
  --task-type "text-generation"
```
skills/hugging-face-evaluation/examples/metric_mapping.json
ADDED
@@ -0,0 +1,118 @@
{
  "_comment": "Maps benchmark names to Hub dataset IDs for .eval_results/ format. Dataset IDs must be registered Benchmarks on HuggingFace Hub.",
  "MMLU": {
    "dataset_id": "cais/mmlu",
    "task_id": "default",
    "aliases": ["mmlu", "massive_multitask_language_understanding"]
  },
  "MMLU-Pro": {
    "dataset_id": "TIGER-Lab/MMLU-Pro",
    "task_id": "mmlu_pro",
    "aliases": ["mmlu_pro", "mmlu-pro"]
  },
  "MMLU-Redux": {
    "dataset_id": "edinburgh-dawg/mmlu-redux",
    "task_id": "default",
    "aliases": ["mmlu_redux", "mmlu-redux"]
  },
  "HumanEval": {
    "dataset_id": "openai/openai_humaneval",
    "task_id": "default",
    "aliases": ["humaneval", "human_eval"]
  },
  "GSM8K": {
    "dataset_id": "openai/gsm8k",
    "task_id": "default",
    "aliases": ["gsm8k", "gsm_8k", "grade_school_math"]
  },
  "HellaSwag": {
    "dataset_id": "Rowan/hellaswag",
    "task_id": "default",
    "aliases": ["hellaswag", "hella_swag"]
  },
  "ARC-Challenge": {
    "dataset_id": "allenai/ai2_arc",
    "task_id": "ARC-Challenge",
    "aliases": ["arc_challenge", "arc-c", "arc_c"]
  },
  "ARC-Easy": {
    "dataset_id": "allenai/ai2_arc",
    "task_id": "ARC-Easy",
    "aliases": ["arc_easy", "arc-e", "arc_e"]
  },
  "Winogrande": {
    "dataset_id": "allenai/winogrande",
    "task_id": "default",
    "aliases": ["winogrande", "wino_grande"]
  },
  "TruthfulQA": {
    "dataset_id": "truthfulqa/truthful_qa",
    "task_id": "default",
    "aliases": ["truthfulqa", "truthful_qa"]
  },
  "GPQA": {
    "dataset_id": "Idavidrein/gpqa",
    "task_id": "gpqa_diamond",
    "aliases": ["gpqa", "gpqa_diamond"]
  },
  "DROP": {
    "dataset_id": "ucinlp/drop",
    "task_id": "default",
    "aliases": ["drop"]
  },
  "BBH": {
    "dataset_id": "lukaemon/bbh",
    "task_id": "default",
    "aliases": ["bbh", "big_bench_hard"]
  },
  "MATH": {
    "dataset_id": "lighteval/MATH",
    "task_id": "default",
    "aliases": ["math"]
  },
  "HLE": {
    "dataset_id": "cais/hle",
    "task_id": "default",
    "aliases": ["hle", "human_level_evaluation", "hle_text_only", "hle_(text_only)"]
  },
  "AIME25": {
    "dataset_id": "OpenEvals/aime_2025",
    "task_id": "default",
    "aliases": ["aime25", "aime_25", "aime_2025"]
  },
  "SWE-bench": {
    "dataset_id": "princeton-nlp/SWE-bench_Verified",
    "task_id": "default",
    "aliases": ["swe_bench", "swe-bench", "swe_bench_verified", "swe-bench_verified"]
  },
  "LiveCodeBench": {
    "dataset_id": "livecodebench/livecodebench",
    "task_id": "default",
    "aliases": ["livecodebench", "live_code_bench", "livecodebenchv6"]
  },
  "SimpleQA": {
    "dataset_id": "OpenEvals/SimpleQA",
    "task_id": "default",
    "aliases": ["simpleqa", "simple_qa"]
  },
  "AIME": {
    "dataset_id": "OpenEvals/aime_24",
    "task_id": "default",
    "aliases": ["aime", "aime_24", "aime24"]
  },
  "IFEval": {
    "dataset_id": "google/IFEval",
    "task_id": "default",
    "aliases": ["ifeval", "if_eval"]
  },
  "MBPP": {
    "dataset_id": "google-research-datasets/mbpp",
    "task_id": "default",
    "aliases": ["mbpp"]
  },
  "MuSR": {
    "dataset_id": "OpenEvals/MuSR",
    "task_id": "default",
    "aliases": ["musr"]
  }
}
skills/hugging-face-evaluation/references/hf_cli_for_prs.md
ADDED
@@ -0,0 +1,258 @@
# HF CLI Workflow for Evaluation PRs

This document explains how to manage evaluation result PRs using the `hf` CLI and temporary directories.

## Directory Structure

Use `/tmp/pr-reviews/` as the working directory for PR operations:

```
/tmp/pr-reviews/
├── updates/          # YAML files for updating existing PRs
├── new-prs/          # YAML files for new PRs
├── <model-name>/     # Model-specific directories
└── check-<model>/    # Directories for verifying PR contents
```

## Creating New PRs

### Step 1: Create the YAML file

```bash
mkdir -p /tmp/pr-reviews/new-prs
cd /tmp/pr-reviews/new-prs

cat > hle.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
EOF
```

### Step 2: Upload and create the PR

```bash
hf upload org/model-name hle.yaml .eval_results/hle.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result"
```

### Step 3: Get the PR number

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "org/model-name"
```

## Updating Existing PRs

### Step 1: Download the current PR contents

```bash
hf download org/model-name --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --include ".eval_results/*" \
  --local-dir /tmp/pr-reviews/<model-name>-pr<PR_NUMBER>
```

### Step 2: Review the current contents

```bash
cat /tmp/pr-reviews/<model-name>-pr<PR_NUMBER>/.eval_results/*.yaml
```

### Step 3: Create the updated YAML

```bash
cat > /tmp/pr-reviews/updates/updated.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
  notes: "With tools"
EOF
```

### Step 4: Push the update to the existing PR

```bash
hf upload org/model-name /tmp/pr-reviews/updates/updated.yaml .eval_results/hle.yaml \
  --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --commit-message "Update evaluation result"
```

## Deleting Files from PRs
|
| 96 |
+
|
| 97 |
+
Use the `huggingface_hub` Python API to delete files:
|
| 98 |
+
|
| 99 |
+
```bash
|
| 100 |
+
uv run --with huggingface_hub python3 << 'EOF'
|
| 101 |
+
from huggingface_hub import HfApi
|
| 102 |
+
api = HfApi()
|
| 103 |
+
|
| 104 |
+
api.delete_file(
|
| 105 |
+
path_in_repo=".eval_results/old_file.yaml",
|
| 106 |
+
repo_id="org/model-name",
|
| 107 |
+
repo_type="model",
|
| 108 |
+
revision="refs/pr/<PR_NUMBER>",
|
| 109 |
+
commit_message="Remove duplicate file"
|
| 110 |
+
)
|
| 111 |
+
EOF
|
| 112 |
+
```
|
| 113 |
+
|
| 114 |
+
## Verifying PR Contents
|
| 115 |
+
|
| 116 |
+
### Check what files are in a PR
|
| 117 |
+
|
| 118 |
+
```bash
|
| 119 |
+
rm -rf /tmp/check-<model>
|
| 120 |
+
hf download org/model-name --repo-type model \
|
| 121 |
+
--revision refs/pr/<PR_NUMBER> \
|
| 122 |
+
--include ".eval_results/*" \
|
| 123 |
+
--local-dir /tmp/check-<model>
|
| 124 |
+
|
| 125 |
+
ls -la /tmp/check-<model>/.eval_results/
|
| 126 |
+
```
|
| 127 |
+
|
| 128 |
+
### Compare PR to main branch
|
| 129 |
+
|
| 130 |
+
```bash
|
| 131 |
+
# Download main
|
| 132 |
+
hf download org/model-name --repo-type model \
|
| 133 |
+
--revision main \
|
| 134 |
+
--include ".eval_results/*" \
|
| 135 |
+
--local-dir /tmp/<model>-main
|
| 136 |
+
|
| 137 |
+
# Download PR
|
| 138 |
+
hf download org/model-name --repo-type model \
|
| 139 |
+
--revision refs/pr/<PR_NUMBER> \
|
| 140 |
+
--include ".eval_results/*" \
|
| 141 |
+
--local-dir /tmp/<model>-pr<PR_NUMBER>
|
| 142 |
+
|
| 143 |
+
# Compare
|
| 144 |
+
diff /tmp/<model>-main/.eval_results/ /tmp/<model>-pr<PR_NUMBER>/.eval_results/
|
| 145 |
+
```

## Multiple Score Variants

When a model has multiple scores for the same benchmark (e.g., with/without tools), create separate files:

```bash
cd /tmp/pr-reviews/new-prs

# Default (no tools) - no notes field
cat > hle.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 10.2
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
EOF

# With tools - add notes field
cat > hle_with_tools.yaml << 'EOF'
- dataset:
    id: cais/hle
    task_id: hle
  value: 15.5
  date: '2026-02-03'
  source:
    url: https://huggingface.co/org/model-name
    name: Model Card
    user: burtenshaw
  notes: "With tools"
EOF

# Create separate PRs
hf upload org/model-name hle.yaml .eval_results/hle.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result"

hf upload org/model-name hle_with_tools.yaml .eval_results/hle_with_tools.yaml \
  --repo-type model --create-pr \
  --commit-message "Add HLE evaluation result (with tools)"
```
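When generating many variant files, building the entries in Python is less error-prone than heredocs. A sketch mirroring the field layout of the `hle.yaml` / `hle_with_tools.yaml` examples above; the helper names are hypothetical, and `write_variant` assumes PyYAML is available (e.g. via `uv run --with pyyaml`):

```python
def make_entry(value, date, model_url, user, notes=None):
    """Build one .eval_results entry matching the YAML shape above."""
    entry = {
        "dataset": {"id": "cais/hle", "task_id": "hle"},
        "value": value,
        "date": date,
        "source": {"url": model_url, "name": "Model Card", "user": user},
    }
    if notes is not None:
        entry["notes"] = notes  # only the tool-assisted variant carries notes
    return [entry]  # the file is a YAML list


def write_variant(path, entry):
    """Serialize with PyYAML (assumed installed)."""
    import yaml

    with open(path, "w") as f:
        yaml.safe_dump(entry, f, sort_keys=False)
```

Usage: `write_variant("hle_with_tools.yaml", make_entry(15.5, "2026-02-03", "https://huggingface.co/org/model-name", "burtenshaw", notes="With tools"))`, then upload with `hf upload ... --create-pr` as above.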

## Restoring Accidentally Deleted Files

If a PR shows a file as deleted (because it was removed from the PR branch), restore it from main:

```bash
# Download the file from main
hf download org/model-name --repo-type model \
  --revision main \
  --include ".eval_results/hle.yaml" \
  --local-dir /tmp/<model>-main

# Re-upload to the PR to restore it
hf upload org/model-name /tmp/<model>-main/.eval_results/hle.yaml .eval_results/hle.yaml \
  --repo-type model \
  --revision refs/pr/<PR_NUMBER> \
  --commit-message "Restore original file"
```

## Common Patterns

### Batch create YAML files

```bash
cd /tmp/pr-reviews/updates

# Create multiple files in one script
for model in "org/model1" "org/model2"; do
cat > "${model//\//-}-hle.yaml" << EOF
- dataset:
    id: cais/hle
    task_id: hle
  value: 22.1
  source:
    url: https://huggingface.co/$model
    name: Model Card
    user: burtenshaw
EOF
done
```

### Check for existing PRs before creating

Always check first:

```bash
uv run scripts/evaluation_manager.py get-prs --repo-id "org/model-name"
```

If PRs exist, update them instead of creating new ones.

## File Naming Convention

| Condition | File Name | Notes Field |
|-----------|-----------|-------------|
| Default (no tools) | `hle.yaml` | None (omit) |
| With tools | `hle_with_tools.yaml` | `notes: "With tools"` |
| Different task | `gpqa_diamond.yaml` | Based on task_id |

## Cleanup

After PRs are merged or work is complete:

```bash
rm -rf /tmp/pr-reviews/
rm -rf /tmp/check-*
rm -rf /tmp/*-main
rm -rf /tmp/*-pr*
```

**File: skills/hugging-face-evaluation/references/hf_papers_extraction.md** (added, 297 lines)
# Paper Score Extraction via HF MCP Server

This document provides instructions for extracting benchmark scores from academic papers linked to HuggingFace models using the HF MCP Server tools.

---

## Overview

Papers linked to HuggingFace models often contain comprehensive benchmark results that aren't in the model card. This guide shows how to:

1. Use `hub_repo_details` to discover papers linked to a model
2. Use `paper_search` to find and retrieve paper abstracts
3. Extract benchmark scores from paper abstracts/content
4. Use `WebFetch` on arXiv PDFs for detailed scores not in abstracts
5. Format results for `.eval_results/`

---

## Step 1: Discover Linked Papers

### Get Model Details with README

Use `hub_repo_details` to fetch model metadata, including linked papers:

```
mcp__hf-mcp-server__hub_repo_details
  repo_ids: ["org/model-name"]
  include_readme: true
```

Look for arXiv references in:
- `tags` array: entries starting with `arxiv:` (e.g., `"arxiv:2411.15124"`)
- README content: arXiv links or paper references
- Model metadata: `paperInfo` or `cardData.arxiv` fields

### Example Response Fields

The response will include:
- Model metadata (downloads, likes, tags)
- README content (if `include_readme: true`)
- Any linked paper IDs in tags

---

## Step 2: Search for Papers

Once you have an arXiv ID or want to find related papers, use `paper_search`:

```
mcp__hf-mcp-server__paper_search
  query: "OLMo-2 evaluation benchmark"
  results_limit: 5
  concise_only: false  # Get full abstracts for score extraction
```

### Search Strategies

**By model name:**
```
query: "Llama 3.1 benchmark evaluation"
```

**By arXiv ID (if known):**
```
query: "2411.15124"
```

**By benchmark + model family:**
```
query: "MMLU GPQA Qwen2.5"
```

---

## Step 3: Extract Benchmark Scores

The paper search returns abstracts and paper content. Look for:

### Common Benchmark Mentions

Papers typically report headline numbers in abstracts:
- "achieves **85.2%** on MMLU"
- "state-of-the-art results on GPQA Diamond (72.1%)"
- "HLE score of 12.3%"

### Benchmark Name Variations

| Standard Name | Paper Variations |
|---------------|------------------|
| HLE | Humanity's Last Exam, HLE (Text Only) |
| GPQA | GPQA Diamond, GPQA-Diamond |
| MMLU | MMLU, MMLU-Pro, Massive Multitask |
| GSM8K | GSM8K, GSM-8K, Grade School Math |
| MATH | MATH, MATH-500 |
| HumanEval | HumanEval, human_eval |
| SWE-bench | SWE-bench, SWE-bench Verified |

### Score Format Normalization

- **Percentages**: `85.2%` → use `85.2`
- **Decimals**: `0.852` → convert to `85.2` if context shows percentages
- **Accuracy vs error rate**: Ensure you're extracting accuracy, not error rate

---
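The normalization rules above can be sketched as a small helper. This is a hypothetical function for illustration; it assumes a percentage-scale benchmark (0-100), and the decimal-scaling branch is a heuristic that should be verified against the paper's context:

```python
def normalize_score(raw: str) -> float:
    """Normalize a raw score string ('85.2%', '0.852', '61.3') to a 0-100 scale."""
    s = raw.strip().rstrip("%")
    value = float(s)
    # Heuristic: a bare decimal in (0, 1] is almost certainly a fraction of 1
    # when the surrounding context reports percentages; scale it to 0-100.
    # Double-check against context -- a genuine 0.9% score would be misread.
    if 0.0 < value <= 1.0 and "." in s:
        value *= 100.0
    return round(value, 2)
```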

## Step 4: Format for .eval_results/

Once you have extracted scores, format them as YAML:

```yaml
# .eval_results/{benchmark_name}.yaml
- dataset:
    id: {hub_dataset_id}
    task_id: {task_variant}
  value: {score}
  date: "{extraction_date}"
  source:
    url: https://arxiv.org/abs/{arxiv_id}
    name: Paper
```

### Dataset ID Reference

| Benchmark | Dataset ID | Task ID |
|-----------|------------|---------|
| HLE | `cais/hle` | `default` |
| GPQA | `Idavidrein/gpqa` | `gpqa_diamond` |
| MMLU | `cais/mmlu` | `default` |
| MMLU-Pro | `TIGER-Lab/MMLU-Pro` | `default` |
| GSM8K | `openai/gsm8k` | `default` |
| MATH | `lighteval/MATH` | `default` |
| HumanEval | `openai/openai_humaneval` | `default` |
| DROP | `ucinlp/drop` | `default` |
| ARC-Challenge | `allenai/ai2_arc` | `ARC-Challenge` |
| HellaSwag | `Rowan/hellaswag` | `default` |
| TruthfulQA | `truthfulqa/truthful_qa` | `default` |
| IFEval | `google/IFEval` | `default` |
| SWE-bench | `princeton-nlp/SWE-bench_Verified` | `default` |
| AIME24 | `OpenEvals/aime_24` | `default` |
| AIME25 | `OpenEvals/aime_2025` | `default` |
| LiveCodeBench | `livecodebench/livecodebench` | `default` |

---
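For scripting, the table above is convenient as a lookup dict. This is an illustrative excerpt only (the canonical mapping lives in the skill's scripts); extend it with the remaining rows as needed:

```python
# (dataset_id, task_id) pairs, copied from the Dataset ID Reference table.
DATASET_IDS = {
    "HLE": ("cais/hle", "default"),
    "GPQA": ("Idavidrein/gpqa", "gpqa_diamond"),
    "MMLU": ("cais/mmlu", "default"),
    "MMLU-Pro": ("TIGER-Lab/MMLU-Pro", "default"),
    "GSM8K": ("openai/gsm8k", "default"),
    "MATH": ("lighteval/MATH", "default"),
    "HumanEval": ("openai/openai_humaneval", "default"),
    "SWE-bench": ("princeton-nlp/SWE-bench_Verified", "default"),
}


def dataset_for(benchmark: str) -> tuple[str, str]:
    """Resolve a benchmark name to its Hub dataset ID and task ID."""
    return DATASET_IDS[benchmark]
```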

## Complete Example Workflow

### Scenario: Extract MMLU score for OLMo-2 from its paper

```
1. Get model details:
   mcp__hf-mcp-server__hub_repo_details
     repo_ids: ["allenai/OLMo-2-1124-7B-Instruct"]
     include_readme: true

   → Found tags: ["arxiv:2411.15124", "arxiv:2501.00656"]

2. Search for the paper:
   mcp__hf-mcp-server__paper_search
     query: "OLMo-2 2501.00656"
     results_limit: 3

   → Returns paper abstract with benchmark scores

3. Extract from the abstract:
   "OLMo-2-7B-Instruct achieves 61.3 on MMLU..."

   If the score is not in the abstract, fetch the PDF:
   WebFetch
     url: "https://arxiv.org/pdf/2501.00656"
     prompt: "Find MMLU score for OLMo-2-7B-Instruct in the evaluation tables."

4. Create the eval result:
   $ uv run scripts/evaluation_manager.py add-eval \
       --benchmark MMLU \
       --repo-id "allenai/OLMo-2-1124-7B-Instruct" \
       --value 61.3 \
       --create-pr
```

---

## Tips for Better Extraction

### 1. Check Multiple Papers
Models may have multiple linked papers. Use `hub_repo_details` to find all arXiv tags, then search for each.

### 2. Use Concise Mode for Broad Searches
```
mcp__hf-mcp-server__paper_search
  query: "large language model evaluation"
  concise_only: true  # 2-sentence summaries
  results_limit: 10
```

### 3. Prefer Primary Sources
Use the model's own release paper rather than papers that cite it.

### 4. Note Evaluation Settings
Papers may report different settings (0-shot vs 5-shot, with/without CoT). Document which setting you're extracting.

### 5. Cross-Reference the Model Card
If both the paper and the model card have scores, prefer the paper as the authoritative source, but verify that they match.

---

## Step 5: Extract Scores from Paper PDFs

The `paper_search` tool only returns abstracts, which often miss detailed benchmark tables. For comprehensive score extraction, fetch the full paper PDF.

### URL Pattern

HuggingFace paper links map directly to arXiv PDFs:

| Source | URL Pattern |
|--------|-------------|
| HF Paper Page | `https://huggingface.co/papers/{arxiv_id}` |
| arXiv Abstract | `https://arxiv.org/abs/{arxiv_id}` |
| arXiv PDF | `https://arxiv.org/pdf/{arxiv_id}` |

**Example**: `2601.01739` → `https://arxiv.org/pdf/2601.01739`
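The URL patterns above reduce to simple string templates; a hypothetical helper for scripts that need all three forms:

```python
def arxiv_urls(arxiv_id: str) -> dict[str, str]:
    """Map an arXiv ID (e.g. '2501.00656') to the three URL forms in the table."""
    return {
        "hf_paper": f"https://huggingface.co/papers/{arxiv_id}",
        "abstract": f"https://arxiv.org/abs/{arxiv_id}",
        "pdf": f"https://arxiv.org/pdf/{arxiv_id}",
    }
```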

### Fetching PDF Content

Use `WebFetch` to retrieve and search the PDF:

```
WebFetch
  url: "https://arxiv.org/pdf/{arxiv_id}"
  prompt: "Extract all benchmark evaluation scores and results tables. Look for metrics like accuracy, F1, BLEU, pass@k, or percentage scores. List each benchmark name and its corresponding score."
```

### Targeted Extraction Prompts

For specific benchmarks:

```
prompt: "Find the HLE (Humanity's Last Exam) score in this paper. Look in results tables and the evaluation section."
```

```
prompt: "Extract all scores from the main results table. Include benchmark names, model variants, and numerical scores."
```

```
prompt: "Find MMLU, GPQA, GSM8K, and MATH scores for the main model in this paper."
```

### When to Use PDF Extraction

Use PDF fetching when:
- The abstract doesn't contain specific benchmark scores
- You need scores for multiple benchmarks
- The paper mentions "see Table X for full results"
- The model card references the paper but lacks detailed numbers

### Example: Full PDF Extraction Workflow

```
1. Get the arXiv ID from the model:
   mcp__hf-mcp-server__hub_repo_details
     repo_ids: ["meta-llama/Llama-3.1-70B-Instruct"]
     include_readme: true

   → Found: arxiv:2407.21783

2. Fetch the PDF for detailed scores:
   WebFetch
     url: "https://arxiv.org/pdf/2407.21783"
     prompt: "Extract benchmark scores for Llama 3.1 70B Instruct from all evaluation tables. Include MMLU, GPQA Diamond, HumanEval, GSM8K, MATH, and any other benchmarks."

3. Parse the extracted scores and create eval results
```

---

## Common Issues

### Paper Score Differs from Model Card
- The paper may report a different model size/variant
- Evaluation settings may differ
- The paper may be pre-release, with the model card updated post-release

**Solution**: Note the discrepancy and prefer the most recent source.

### Score Not Found via Paper Search
- Paper abstracts rarely contain full benchmark tables
- The paper may not have evaluated that benchmark
- Try searching with different query terms
- Check whether the benchmark uses a different name

**Solution**: Use `WebFetch` to fetch the full PDF (`https://arxiv.org/pdf/{arxiv_id}`); detailed scores are typically in results tables within the paper body, not the abstract. Also try alternative query terms or benchmark aliases.

### Multiple Models in Paper
- The paper describes a family of models (7B, 13B, 70B)
- Results may combine scores across sizes

**Solution**: Carefully match the exact model variant to the HuggingFace repo.

**File: skills/hugging-face-evaluation/references/model_card_extraction.md** (added, 244 lines)
# Model Card Score Extraction via HF MCP Server

This document provides instructions for extracting benchmark scores from HuggingFace model cards using the HF MCP Server tools.

---

## Overview

Model cards often contain evaluation tables with benchmark scores. This guide shows how to:

1. Use `hub_repo_details` to fetch model card content
2. Search for benchmark variations in the README
3. Extract and normalize scores
4. Format results for `.eval_results/`

---

## Step 1: Fetch the Model Card

Use `hub_repo_details` to get the model's README content:

```
mcp__hf-mcp-server__hub_repo_details
  repo_ids: ["org/model-name"]
  include_readme: true
```

This returns:
- Model metadata (downloads, likes, tags, pipeline_tag)
- Full README content (when `include_readme: true`)
- Linked papers and datasets

### Batch Fetching

You can fetch multiple models at once:

```
mcp__hf-mcp-server__hub_repo_details
  repo_ids: ["meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen2.5-7B-Instruct"]
  include_readme: true
```

---

## Step 2: Search for Benchmark Scores

### Benchmark Name Variations

Model cards use inconsistent naming. Search for these variations:

| Benchmark | Variations to Search |
|-----------|---------------------|
| HLE | `HLE`, `hle`, `Humanity's Last Exam`, `HLE (Text Only)` |
| GPQA | `GPQA`, `GPQA Diamond`, `gpqa_diamond`, `GPQA-Diamond` |
| MMLU-Pro | `MMLU-Pro`, `MMLU Pro`, `mmlu_pro`, `MMLU-PRO` |
| MMLU | `MMLU`, `mmlu`, `Massive Multitask Language Understanding` |
| GSM8K | `GSM8K`, `gsm8k`, `GSM-8K`, `Grade School Math` |
| HumanEval | `HumanEval`, `humaneval`, `human_eval` |
| HellaSwag | `HellaSwag`, `hellaswag`, `hella_swag` |
| ARC-Challenge | `ARC-Challenge`, `ARC-C`, `arc_challenge` |
| TruthfulQA | `TruthfulQA`, `truthful_qa`, `TruthfulQA MC` |
| MATH | `MATH`, `math`, `MATH-500` |
| AIME | `AIME`, `AIME24`, `AIME 2024`, `aime_24` |
| SWE-bench | `SWE-bench`, `SWE-bench Verified`, `swe_bench` |
| LiveCodeBench | `LiveCodeBench`, `LCB`, `LiveCodeBenchV6` |
| IFEval | `IFEval`, `IF-Eval`, `ifeval` |

---
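The variation search can be automated with a regex scan over the README text. A sketch with an abbreviated variation map (extend it per the table above; plain `MMLU` is omitted here to avoid matching inside `MMLU-Pro`); the helper name is hypothetical:

```python
import re

# Excerpt of the variations table; longer/more specific names come first.
VARIATIONS = {
    "HLE": ["Humanity's Last Exam", "HLE"],
    "GPQA": ["GPQA Diamond", "GPQA-Diamond", "gpqa_diamond", "GPQA"],
    "MMLU-Pro": ["MMLU-Pro", "MMLU Pro", "mmlu_pro"],
}


def find_benchmarks(readme: str) -> set[str]:
    """Return canonical benchmark names whose variations appear in the text."""
    found = set()
    for canonical, names in VARIATIONS.items():
        for name in names:
            # Word boundaries keep 'HLE' from matching inside e.g. 'athletes'.
            if re.search(rf"\b{re.escape(name)}\b", readme, flags=re.IGNORECASE):
                found.add(canonical)
                break
    return found
```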

## Step 3: Identify Table Formats

Model cards typically present scores in these formats:

### Format A: Model-Column Table (most common)
```markdown
| Model | MMLU | GPQA | HLE |
|-------|------|------|-----|
| This Model | 85.2 | 72.1 | 12.3 |
| GPT-4 | 86.4 | 74.2 | 15.1 |
```

### Format B: Benchmark-Column Table
```markdown
| Benchmark | Score |
|-----------|-------|
| MMLU | 85.2 |
| GPQA | 72.1 |
| HLE | 12.3 |
```

### Format C: Inline Text
```markdown
Our model achieves **85.2%** on MMLU, **72.1%** on GPQA Diamond, and **12.3%** on HLE.
```

### Format D: Nested/Grouped Tables
```markdown
| Category | Benchmark | Score |
|----------|-----------|-------|
| Reasoning | GPQA | 72.1 |
| Knowledge | MMLU | 85.2 |
```

---
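For the simplest case, Format B, extraction is a short parse. This sketch handles two-column tables only (Formats A and D need per-column handling) and assumes scores are plain numbers, optionally with a trailing `%`; the function name is hypothetical:

```python
def parse_score_table(markdown: str) -> dict[str, float]:
    """Parse a Format B (| Benchmark | Score |) markdown table into a dict."""
    scores = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue  # not a two-column row (skips Formats A and D)
        name, raw = cells
        try:
            scores[name] = float(raw.rstrip("%"))
        except ValueError:
            continue  # header and |---| separator rows fall through here
    return scores
```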

## Step 4: Extract and Normalize Scores

### Score Format Normalization

Scores may be presented as:
- **Percentages**: `85.2%` or `85.2` (when context implies %)
- **Decimals**: `0.852` (multiply by 100 for a percentage)
- **Fractions**: `85.2/100`

**Important**: The `.eval_results/` format expects values matching the benchmark's standard scale. Most benchmarks use a percentage scale (0-100).

---

## Step 5: Format for .eval_results/

Once you have the score, format it for `.eval_results/`:

```yaml
# .eval_results/{benchmark}.yaml
- dataset:
    id: cais/hle        # Hub dataset ID (see mapping below)
    task_id: default    # Task variant, if applicable
  value: 12.3           # Score value
  date: "2026-01-14"    # ISO date of extraction
  source:
    url: https://huggingface.co/{org}/{model}
    name: Model Card
```

### Dataset ID Reference

| Benchmark | Dataset ID | Task ID |
|-----------|------------|---------|
| HLE | `cais/hle` | `default` |
| GPQA | `Idavidrein/gpqa` | `gpqa_diamond` |
| MMLU-Pro | `TIGER-Lab/MMLU-Pro` | `default` |
| MMLU | `cais/mmlu` | `default` |
| GSM8K | `openai/gsm8k` | `default` |
| HumanEval | `openai/openai_humaneval` | `default` |
| MATH | `lighteval/MATH` | `default` |
| ARC-Challenge | `allenai/ai2_arc` | `ARC-Challenge` |
| HellaSwag | `Rowan/hellaswag` | `default` |
| TruthfulQA | `truthfulqa/truthful_qa` | `default` |
| SWE-bench | `princeton-nlp/SWE-bench_Verified` | `default` |
| AIME24 | `OpenEvals/aime_24` | `default` |
| AIME25 | `OpenEvals/aime_2025` | `default` |
| LiveCodeBench | `livecodebench/livecodebench` | `default` |
| IFEval | `google/IFEval` | `default` |

---

## Complete Example Workflow

### Scenario: Extract HLE score from a model card

```
1. Fetch the model card:
   mcp__hf-mcp-server__hub_repo_details
     repo_ids: ["Qwen/Qwen2.5-72B-Instruct"]
     include_readme: true

2. Search the README for HLE variations:
   Found: "| HLE | 18.5 |" in the evaluation table

3. Create the eval result:
   $ uv run scripts/evaluation_manager.py add-eval \
       --benchmark HLE \
       --repo-id "Qwen/Qwen2.5-72B-Instruct" \
       --value 18.5 \
       --create-pr
```

---

## Finding Models with Evaluations

Use `model_search` to find models that might have benchmark scores:

### Search for Trending Models
```
mcp__hf-mcp-server__model_search
  task: "text-generation"
  sort: "trendingScore"
  limit: 20
```

### Search by Author
```
mcp__hf-mcp-server__model_search
  author: "meta-llama"
  task: "text-generation"
  limit: 10
```

### Search by Query
```
mcp__hf-mcp-server__model_search
  query: "instruct chat"
  task: "text-generation"
  limit: 20
```

Then use `hub_repo_details` on promising results to check their model cards.

---

## Tips for Better Extraction

### 1. Check the Full README
Model cards may have scores in different sections (Overview, Evaluation, Benchmarks, Results).

### 2. Look for Multiple Tables
Some model cards have separate tables for different benchmark categories.

### 3. Note Evaluation Settings
Model cards may report different settings (0-shot vs 5-shot, with/without CoT). Document which setting you're extracting.

### 4. Verify Against Papers
If both the paper and the model card have scores, prefer the paper as the authoritative source, but verify that they match.

---

## Common Issues

### Score in Image/Figure Only
**Solution**: Check whether there's a linked technical report or paper with tabular data. Use `paper_search` to find it.

### Benchmark Name Differs Significantly
**Solution**: Search for the underlying task name (e.g., "graduate-level science" for GPQA).

### Multiple Scores for the Same Benchmark
**Solution**: Prefer "0-shot" or "standard" settings; note the configuration in the source attribution.

### Score Not Found
- The score is genuinely not present in the model card
- The benchmark was not evaluated by the model authors
- Try `paper_search` to find scores in linked papers

**Solution**: Document why extraction failed, to distinguish "not found by automation" from "truly not available".

**File: skills/hugging-face-evaluation/scripts/check_prs.py** (added, 98 lines)
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "requests>=2.31.0",
# ]
# ///

"""
Check for open pull requests on a Hugging Face model repository.

Usage:
    uv run scripts/check_prs.py --repo-id "org/model-name"
"""

import argparse
import sys
from typing import Any

import requests


def get_open_prs(repo_id: str) -> list[dict[str, Any]]:
    """
    Fetch open pull requests for a Hugging Face model repository.

    Args:
        repo_id: Hugging Face model repository ID (e.g., "nvidia/model-name")

    Returns:
        List of open PR dictionaries with num, title, author, and createdAt
    """
    url = f"https://huggingface.co/api/models/{repo_id}/discussions"

    try:
        response = requests.get(url, timeout=30, allow_redirects=True)
        response.raise_for_status()

        data = response.json()
        discussions = data.get("discussions", [])

        open_prs = [
            {
                "num": d["num"],
                "title": d["title"],
                "author": d["author"]["name"],
                "createdAt": d.get("createdAt", "unknown"),
            }
            for d in discussions
            if d.get("status") == "open" and d.get("isPullRequest")
        ]

        return open_prs

    except requests.RequestException as e:
        print(f"Error fetching PRs from Hugging Face: {e}", file=sys.stderr)
        return []


def list_open_prs(repo_id: str) -> None:
    """Display open pull requests for a model repository."""
    prs = get_open_prs(repo_id)

    print(f"\n{'=' * 70}")
    print(f"Open Pull Requests for: {repo_id}")
    print(f"{'=' * 70}")

    if not prs:
        print("\nNo open pull requests found.")
    else:
        print(f"\nFound {len(prs)} open PR(s):\n")
        for pr in prs:
            print(f"  PR #{pr['num']} - {pr['title']}")
            print(f"    Author: {pr['author']}")
            print(f"    Created: {pr['createdAt']}")
            print(f"    URL: https://huggingface.co/{repo_id}/discussions/{pr['num']}")
            print()

    print(f"{'=' * 70}\n")


def main():
    parser = argparse.ArgumentParser(
        description="Check for open pull requests on a Hugging Face model repository.",
        epilog="Always run this before creating new PRs to avoid duplicates.",
    )
    parser.add_argument(
        "--repo-id",
        type=str,
        required=True,
        help="HF repository ID (e.g., 'nvidia/model-name')",
    )

    args = parser.parse_args()
    list_open_prs(args.repo_id)


if __name__ == "__main__":
    main()
skills/hugging-face-evaluation/scripts/import_aa.py
ADDED
@@ -0,0 +1,353 @@
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "huggingface-hub>=1.1.4",
#     "python-dotenv>=1.0.0",
#     "pyyaml>=6.0.0",
#     "requests>=2.31.0",
# ]
# ///
"""
Import evaluation results from Artificial Analysis API.

Usage:
    # Look up a specific benchmark (dry run - prints YAML)
    AA_API_KEY=... uv run scripts/import_aa.py --repo-id "org/model" --benchmark HLE

    # Look up a benchmark and create PR
    AA_API_KEY=... uv run scripts/import_aa.py --repo-id "org/model" --benchmark GPQA --create-pr

    # Import all available benchmarks
    AA_API_KEY=... uv run scripts/import_aa.py --repo-id "org/model" --all

    # Provide value manually (skip lookup)
    uv run scripts/import_aa.py --repo-id "org/model" --benchmark HLE --value 22.5 --create-pr
"""

from __future__ import annotations

import argparse
import json
import os
import re
import sys
from datetime import date
from pathlib import Path
from typing import Any

import requests


AA_INDEX_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"


def load_env() -> None:
    try:
        import dotenv

        dotenv.load_dotenv()
    except ModuleNotFoundError:
        pass


def load_benchmark_mapping() -> dict[str, Any]:
    script_dir = Path(__file__).parent
    mapping_file = script_dir.parent / "examples" / "metric_mapping.json"

    # Fall back to a small built-in mapping when metric_mapping.json is absent.
    if not mapping_file.exists():
        return {
            "GPQA": {"dataset_id": "Idavidrein/gpqa", "task_id": "gpqa_diamond", "aliases": ["gpqa"]},
            "HLE": {"dataset_id": "cais/hle", "task_id": "default", "aliases": ["hle"]},
            "SimpleQA": {"dataset_id": "OpenEvals/SimpleQA", "task_id": "default", "aliases": ["simpleqa"]},
            "MMLU": {"dataset_id": "cais/mmlu", "task_id": "default", "aliases": ["mmlu"]},
            "GSM8K": {"dataset_id": "openai/gsm8k", "task_id": "default", "aliases": ["gsm8k"]},
        }

    with open(mapping_file) as f:
        mapping = json.load(f)
    mapping.pop("_comment", None)
    return mapping


def find_benchmark_dataset(benchmark_name: str, mapping: dict[str, Any]) -> dict[str, str] | None:
    # Strip Markdown links, bold, and italics from the benchmark name.
    cleaned = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', benchmark_name)
    cleaned = re.sub(r'\*\*([^\*]+)\*\*', r'\1', cleaned)
    cleaned = re.sub(r'\*([^\*]+)\*', r'\1', cleaned)
    cleaned = cleaned.strip()

    normalized = cleaned.lower().replace(" ", "_").replace("-", "_")
    base_name = re.sub(r'\s*\([^)]*\)\s*$', '', cleaned).strip()
    base_normalized = base_name.lower().replace(" ", "_").replace("-", "_")

    if cleaned in mapping:
        entry = mapping[cleaned]
        return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    for key, entry in mapping.items():
        if key.lower() == cleaned.lower():
            return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    for key, entry in mapping.items():
        aliases = entry.get("aliases", [])
        normalized_aliases = [a.lower().replace(" ", "_").replace("-", "_") for a in aliases]
        if normalized in normalized_aliases:
            return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    for key, entry in mapping.items():
        key_normalized = key.lower().replace(" ", "_").replace("-", "_")
        if normalized == key_normalized:
            return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    if base_normalized != normalized:
        for key, entry in mapping.items():
            if key.lower() == base_name.lower():
                return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}
            key_normalized = key.lower().replace(" ", "_").replace("-", "_")
            if base_normalized == key_normalized:
                return {"dataset_id": entry["dataset_id"], "task_id": entry.get("task_id", "default")}

    return None


def fetch_aa_models(api_key: str) -> list[dict[str, Any]]:
    response = requests.get(
        AA_INDEX_URL,
        headers={"x-api-key": api_key},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    return list(data.get("data", []))


def find_model_in_aa(models: list[dict[str, Any]], repo_id: str) -> dict[str, Any] | None:
    model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
    model_name_normalized = model_name.lower().replace("-", " ").replace("_", " ")

    for model in models:
        aa_name = model.get("name", "").lower().replace("-", " ").replace("_", " ")
        aa_slug = model.get("slug", "").lower().replace("-", " ").replace("_", " ")
        if model_name_normalized in aa_name or model_name_normalized in aa_slug:
            return model

    return None


def lookup_benchmark_from_aa(
    models: list[dict[str, Any]],
    repo_id: str,
    benchmark_name: str,
) -> float | None:
    model = find_model_in_aa(models, repo_id)
    if not model:
        return None

    evaluations = model.get("evaluations", {})
    benchmark_normalized = benchmark_name.lower().replace(" ", "_").replace("-", "_")

    for key, value in evaluations.items():
        key_normalized = key.lower().replace(" ", "_").replace("-", "_")
        if benchmark_normalized == key_normalized or benchmark_normalized in key_normalized:
            if value is not None:
                return float(value)

    return None


def get_all_benchmarks_from_aa(
    models: list[dict[str, Any]],
    repo_id: str,
) -> list[dict[str, Any]]:
    model = find_model_in_aa(models, repo_id)
    if not model:
        return []

    evaluations = model.get("evaluations", {})
    metrics = []

    for key, value in evaluations.items():
        if value is not None:
            metrics.append({
                "name": key.replace("_", " ").title(),
                "type": key,
                "value": float(value),
            })

    return metrics


def convert_to_eval_results_format(
    metrics: list[dict[str, Any]],
    source_url: str | None = None,
    source_name: str | None = None,
    source_user: str | None = None,
) -> list[dict[str, Any]]:
    mapping = load_benchmark_mapping()
    results = []
    today = date.today().isoformat()

    for metric in metrics:
        benchmark_name = metric.get("name", "")
        value = metric.get("value")

        if value is None:
            continue

        dataset_info = find_benchmark_dataset(benchmark_name, mapping)
        if not dataset_info:
            print(f"Warning: Could not find Hub dataset ID for benchmark '{benchmark_name}'. Skipping.", file=sys.stderr)
            continue

        entry: dict[str, Any] = {
            "dataset": {"id": dataset_info["dataset_id"]},
            "value": value,
            "date": today,
        }

        if dataset_info.get("task_id") and dataset_info["task_id"] != "default":
            entry["dataset"]["task_id"] = dataset_info["task_id"]

        if source_url:
            entry["source"] = {"url": source_url}
            if source_name:
                entry["source"]["name"] = source_name
            if source_user:
                entry["source"]["user"] = source_user

        results.append(entry)

    return results
def upload_eval_results(
    repo_id: str,
    results: list[dict[str, Any]],
    filename: str = "evaluations.yaml",
    create_pr: bool = False,
    commit_message: str | None = None,
) -> bool:
    import yaml
    from huggingface_hub import HfApi

    load_env()
    hf_token = os.getenv("HF_TOKEN")
    if not hf_token:
        print("Error: HF_TOKEN environment variable is not set", file=sys.stderr)
        return False

    api = HfApi(token=hf_token)
    yaml_content = yaml.dump(results, sort_keys=False, allow_unicode=True, default_flow_style=False)
    file_path = f".eval_results/{filename}"

    if not commit_message:
        model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id
        commit_message = f"Add Artificial Analysis evaluation results for {model_name}"

    pr_description = """## Evaluation Results

This PR adds structured evaluation results using the new [`.eval_results/` format](https://huggingface.co/docs/hub/eval-results).

**Source:** [Artificial Analysis](https://artificialanalysis.ai)

### What This Enables

- **Model Page**: Results appear on the model page with benchmark links
- **Leaderboards**: Scores are aggregated into benchmark dataset leaderboards
- **Verification**: Support for cryptographic verification of evaluation runs

---
*Generated by [community-evals](https://github.com/huggingface/community-evals)*"""

    try:
        api.upload_file(
            path_or_fileobj=yaml_content.encode("utf-8"),
            path_in_repo=file_path,
            repo_id=repo_id,
            repo_type="model",
            commit_message=commit_message,
            commit_description=pr_description,
            create_pr=create_pr,
        )

        action = "Pull request created" if create_pr else "Evaluation results uploaded"
        print(f"✓ {action} successfully for {repo_id}")
        print(f"  File: {file_path}")
        return True

    except Exception as e:
        print(f"Error uploading evaluation results: {e}", file=sys.stderr)
        return False


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Import evaluation results from Artificial Analysis API.",
    )
    parser.add_argument("--repo-id", required=True, help="HuggingFace repository ID")
    parser.add_argument("--benchmark", help="Specific benchmark to look up (e.g., HLE, GPQA)")
    parser.add_argument("--value", type=float, help="Manually provide the score (skips AA lookup)")
    parser.add_argument("--all", action="store_true", help="Import all available benchmarks")
    parser.add_argument("--source-user", help="HF username/org for attribution")
    parser.add_argument("--filename", default="artificial_analysis.yaml", help="Output filename")
    parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push")
    parser.add_argument("--apply", action="store_true", help="Apply changes (default is dry run)")
    parser.add_argument("--pretty", action="store_true", help="Pretty-print YAML output")
    parser.add_argument("--verbose", action="store_true", help="Print progress to stderr")
    args = parser.parse_args()

    load_env()

    if args.value is not None and args.benchmark:
        metrics = [{"name": args.benchmark, "type": args.benchmark.lower(), "value": args.value}]
    else:
        api_key = os.getenv("AA_API_KEY")
        if not api_key:
            print("Error: AA_API_KEY is required to query Artificial Analysis.", file=sys.stderr)
            sys.exit(1)

        if args.verbose:
            print("Fetching models from Artificial Analysis...", file=sys.stderr)

        models = fetch_aa_models(api_key)

        if args.all:
            metrics = get_all_benchmarks_from_aa(models, args.repo_id)
            if not metrics:
                print(f"No benchmarks found for {args.repo_id} in Artificial Analysis", file=sys.stderr)
                sys.exit(1)
        elif args.benchmark:
            value = lookup_benchmark_from_aa(models, args.repo_id, args.benchmark)
            if value is None:
                print(f"Could not find {args.benchmark} score for {args.repo_id} in Artificial Analysis", file=sys.stderr)
                sys.exit(1)
            print(f"Found: {args.benchmark} = {value}")
            metrics = [{"name": args.benchmark, "type": args.benchmark.lower(), "value": value}]
        else:
            print("Error: Either --benchmark or --all is required", file=sys.stderr)
            sys.exit(1)

    eval_results = convert_to_eval_results_format(
        metrics=metrics,
        source_url="https://artificialanalysis.ai",
        source_name="Artificial Analysis",
        source_user=args.source_user,
    )

    if not eval_results:
        print("No benchmarks could be mapped to Hub dataset IDs", file=sys.stderr)
        sys.exit(1)

    import yaml
    print("\nImported evaluations (.eval_results/ format):")
    print(yaml.dump(eval_results, sort_keys=False, allow_unicode=True, default_flow_style=False))

    if args.apply or args.create_pr:
        upload_eval_results(
            repo_id=args.repo_id,
            results=eval_results,
            filename=args.filename,
            create_pr=args.create_pr,
        )


if __name__ == "__main__":
    main()
skills/hugging-face-model-trainer/SKILL.md
ADDED
@@ -0,0 +1,718 @@
---
name: hugging-face-model-trainer
description: This skill should be used when users want to train or fine-tune language models using TRL (Transformer Reinforcement Learning) on Hugging Face Jobs infrastructure. Covers SFT, DPO, GRPO, and reward-modeling training methods, plus GGUF conversion for local deployment. Includes guidance on the TRL Jobs package, UV scripts with PEP 723 format, dataset preparation and validation, hardware selection, cost estimation, Trackio monitoring, Hub authentication, and model persistence. Should be invoked for tasks involving cloud GPU training, GGUF conversion, or when users mention training on Hugging Face Jobs without local GPU setup.
license: Complete terms in LICENSE.txt
---

# TRL Training on Hugging Face Jobs

## Overview

Train language models using TRL (Transformer Reinforcement Learning) on fully managed Hugging Face infrastructure. No local GPU setup required—models train on cloud GPUs and results are automatically saved to the Hugging Face Hub.

**TRL provides multiple training methods:**
- **SFT** (Supervised Fine-Tuning) - Standard instruction tuning
- **DPO** (Direct Preference Optimization) - Alignment from preference data
- **GRPO** (Group Relative Policy Optimization) - Online RL training
- **Reward Modeling** - Train reward models for RLHF

**For detailed TRL method documentation:**
```python
hf_doc_search("your query", product="trl")
hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer")  # SFT
hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")  # DPO
# etc.
```

**See also:** `references/training_methods.md` for method overviews and selection guidance

## When to Use This Skill

Use this skill when users want to:
- Fine-tune language models on cloud GPUs without local infrastructure
- Train with TRL methods (SFT, DPO, GRPO, etc.)
- Run training jobs on Hugging Face Jobs infrastructure
- Convert trained models to GGUF for local deployment (Ollama, LM Studio, llama.cpp)
- Ensure trained models are permanently saved to the Hub
- Use modern workflows with optimized defaults

### When to Use Unsloth

Use **Unsloth** (`references/unsloth.md`) instead of standard TRL when:
- **Limited GPU memory** - Unsloth uses ~60% less VRAM
- **Speed matters** - Unsloth is ~2x faster
- Training **large models (>13B)** - memory efficiency is critical
- Training **Vision-Language Models (VLMs)** - Unsloth has `FastVisionModel` support

See `references/unsloth.md` for complete Unsloth documentation and `scripts/unsloth_sft_example.py` for a production-ready training script.

## Key Directives

When assisting with training jobs:

1. **ALWAYS use the `hf_jobs()` MCP tool** - Submit jobs using `hf_jobs("uv", {...})`, NOT bash `trl-jobs` commands. The `script` parameter accepts Python code directly. Do NOT save to local files unless the user explicitly requests it; pass the script content as a string to `hf_jobs()`. If the user asks to "train a model", "fine-tune", or similar, you MUST create the training script AND submit the job immediately using `hf_jobs()`.

2. **Always include Trackio** - Every training script should include Trackio for real-time monitoring. Use the example scripts in `scripts/` as templates.

3. **Provide job details after submission** - After submitting, provide the job ID, monitoring URL, and estimated time, and note that the user can request status checks later.

4. **Use example scripts as templates** - Reference `scripts/train_sft_example.py`, `scripts/train_dpo_example.py`, etc. as starting points.

## Local Script Dependencies

To run scripts locally (like `estimate_cost.py`), install dependencies:
```bash
pip install -r requirements.txt
```

## Prerequisites Checklist

Before starting any training job, verify:

### ✅ **Account & Authentication**
- Hugging Face account with a [Pro](https://hf.co/pro), [Team](https://hf.co/enterprise), or [Enterprise](https://hf.co/enterprise) plan (Jobs require a paid plan)
- Authenticated login: check with `hf_whoami()`
- **HF_TOKEN for Hub push** ⚠️ CRITICAL - The training environment is ephemeral; you must push to the Hub or ALL training results are lost
  - Token must have write permissions
  - **MUST pass `secrets={"HF_TOKEN": "$HF_TOKEN"}` in the job config** to make the token available (the `$HF_TOKEN` syntax references your actual token value)

### ✅ **Dataset Requirements**
- Dataset must exist on the Hub or be loadable via `datasets.load_dataset()`
- Format must match the training method (SFT: "messages"/text/prompt-completion; DPO: chosen/rejected; GRPO: prompt-only)
- **ALWAYS validate unknown datasets** before GPU training to prevent format failures (see the Dataset Validation section below)
- Size appropriate for hardware (demo: 50-100 examples on t4-small; production: 1K-10K+ on a10g-large/a100-large)

### ⚠️ **Critical Settings**
- **Timeout must exceed expected training time** - The default 30 min is TOO SHORT for most training. Minimum recommended: 1-2 hours. The job fails and loses all progress if the timeout is exceeded.
- **Hub push must be enabled** - Config: `push_to_hub=True`, `hub_model_id="username/model-name"`; Job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
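The dataset-format requirement above can be checked cheaply before burning GPU time. A minimal sketch of such a pre-flight check (the field names per method follow TRL's dataset conventions; the helper itself is hypothetical, not part of this skill's scripts):

```python
# Hypothetical pre-flight check: verify a dataset record has the fields the
# chosen TRL method expects, before submitting a GPU job.

EXPECTED_FIELDS = {
    # SFT accepts any one of these column sets
    "sft": [{"messages"}, {"text"}, {"prompt", "completion"}],
    "dpo": [{"chosen", "rejected"}],
    "grpo": [{"prompt"}],
}


def check_record(method: str, record: dict) -> bool:
    """Return True if the record contains one accepted column set for the method."""
    keys = set(record)
    return any(required <= keys for required in EXPECTED_FIELDS[method])


sft_record = {"messages": [{"role": "user", "content": "hi"}]}
print(check_record("sft", sft_record))  # True
print(check_record("dpo", sft_record))  # False: no chosen/rejected fields
```

Running this against the first record of `load_dataset(...)` output catches a mismatched format in seconds instead of minutes into a paid job.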
## Asynchronous Job Guidelines

**⚠️ IMPORTANT: Training jobs run asynchronously and can take hours**

### Action Required

**When user requests training:**
1. **Create the training script** with Trackio included (use `scripts/train_sft_example.py` as a template)
2. **Submit immediately** using the `hf_jobs()` MCP tool with the script content inline - don't save to a file unless the user requests it
3. **Report submission** with the job ID, monitoring URL, and estimated time
4. **Wait for user** to request status checks - don't poll automatically

### Ground Rules
- **Jobs run in background** - Submission returns immediately; training continues independently
- **Initial logs delayed** - It can take 30-60 seconds for logs to appear
- **User checks status** - Wait for the user to request status updates
- **Avoid polling** - Check logs only on user request; provide monitoring links instead

### After Submission

**Provide to user:**
- ✅ Job ID and monitoring URL
- ✅ Expected completion time
- ✅ Trackio dashboard URL
- ✅ Note that the user can request status checks later

**Example Response:**
```
✅ Job submitted successfully!

Job ID: abc123xyz
Monitor: https://huggingface.co/jobs/username/abc123xyz

Expected time: ~2 hours
Estimated cost: ~$10

The job is running in the background. Ask me to check status/logs when ready!
```
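The time and cost figures in the example response come from simple arithmetic: expected runtime times the flavor's hourly rate, plus a timeout with headroom. A sketch, where the rates are illustrative placeholders rather than official Jobs pricing:

```python
# Back-of-envelope job budgeting: hours x hourly rate, plus a timeout with
# headroom. The rates below are illustrative placeholders, NOT official
# Hugging Face Jobs pricing - check current pricing before quoting costs.

HOURLY_RATE_USD = {"t4-small": 0.75, "a10g-large": 3.00, "a100-large": 8.00}


def estimate(flavor: str, expected_hours: float, headroom: float = 1.5) -> dict:
    return {
        "cost_usd": round(expected_hours * HOURLY_RATE_USD[flavor], 2),
        # Set the job timeout above the expected runtime so it never kills a
        # healthy run (the default 30 min is too short for most training).
        "timeout": f"{expected_hours * headroom:.1f}h",
    }


print(estimate("a10g-large", 2.0))  # {'cost_usd': 6.0, 'timeout': '3.0h'}
```

The `timeout` value plugs directly into the job payload alongside `flavor`.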
## Quick Start: Three Approaches
|
| 130 |
+
|
| 131 |
+
**💡 Tip for Demos:** For quick demos on smaller GPUs (t4-small), omit `eval_dataset` and `eval_strategy` to save ~40% memory. You'll still see training loss and learning progress.
|
| 132 |
+
|
| 133 |
+
### Sequence Length Configuration
|
| 134 |
+
|
| 135 |
+
**TRL config classes use `max_length` (not `max_seq_length`)** to control tokenized sequence length:
|
| 136 |
+
|
| 137 |
+
```python
|
| 138 |
+
# ✅ CORRECT - If you need to set sequence length
|
| 139 |
+
SFTConfig(max_length=512) # Truncate sequences to 512 tokens
|
| 140 |
+
DPOConfig(max_length=2048) # Longer context (2048 tokens)
|
| 141 |
+
|
| 142 |
+
# ❌ WRONG - This parameter doesn't exist
|
| 143 |
+
SFTConfig(max_seq_length=512) # TypeError!
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
**Default behavior:** `max_length=1024` (truncates from right). This works well for most training.
|
| 147 |
+
|
| 148 |
+
**When to override:**
|
| 149 |
+
- **Longer context**: Set higher (e.g., `max_length=2048`)
|
| 150 |
+
- **Memory constraints**: Set lower (e.g., `max_length=512`)
|
| 151 |
+
- **Vision models**: Set `max_length=None` (prevents cutting image tokens)
|
| 152 |
+
|
| 153 |
+
**Usually you don't need to set this parameter at all** - the examples below use the sensible default.
|
| 154 |
+
|
| 155 |
+
### Approach 1: UV Scripts (Recommended—Default Choice)
|
| 156 |
+
|
| 157 |
+
UV scripts use PEP 723 inline dependencies for clean, self-contained training. **This is the primary approach for Claude Code.**
|
| 158 |
+
|
| 159 |
+
```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio"]
# ///

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
import trackio

dataset = load_dataset("trl-lib/Capybara", split="train")

# Create train/eval split for monitoring
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],
    peft_config=LoraConfig(r=16, lora_alpha=32),
    args=SFTConfig(
        output_dir="my-model",
        push_to_hub=True,
        hub_model_id="username/my-model",
        num_train_epochs=3,
        eval_strategy="steps",
        eval_steps=50,
        report_to="trackio",
        project="meaningful_project_name",  # project name for the training run (trackio)
        run_name="meaningful_run_name",  # descriptive name for the specific training run (trackio)
    ),
)

trainer.train()
trainer.push_to_hub()
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**Benefits:** Direct MCP tool usage, clean code, dependencies declared inline (PEP 723), no file saving required, full control
**When to use:** Default choice for all training tasks in Claude Code, custom training logic, any scenario requiring `hf_jobs()`

#### Working with Scripts

⚠️ **Important:** The `script` parameter accepts either inline code (as shown above) OR a URL. **Local file paths do NOT work.**

**Why local paths don't work:**
Jobs run in isolated Docker containers without access to your local filesystem. Scripts must be:
- Inline code (recommended for custom training)
- Publicly accessible URLs
- Private repo URLs (with HF_TOKEN)

**Common mistakes:**
```python
# ❌ These will all fail
hf_jobs("uv", {"script": "train.py"})
hf_jobs("uv", {"script": "./scripts/train.py"})
hf_jobs("uv", {"script": "/path/to/train.py"})
```

**Correct approaches:**
```python
# ✅ Inline code (recommended)
hf_jobs("uv", {"script": "# /// script\n# dependencies = [...]\n# ///\n\n<your code>"})

# ✅ From Hugging Face Hub
hf_jobs("uv", {"script": "https://huggingface.co/user/repo/resolve/main/train.py"})

# ✅ From GitHub
hf_jobs("uv", {"script": "https://raw.githubusercontent.com/user/repo/main/train.py"})

# ✅ From Gist
hf_jobs("uv", {"script": "https://gist.githubusercontent.com/user/id/raw/train.py"})
```

**To use local scripts:** Upload to HF Hub first:
```bash
huggingface-cli repo create my-training-scripts --type model
huggingface-cli upload my-training-scripts ./train.py train.py
# Use: https://huggingface.co/USERNAME/my-training-scripts/resolve/main/train.py
```

### Approach 2: TRL Maintained Scripts (Official Examples)

TRL provides battle-tested scripts for all methods, which can be run directly from raw URLs:

```python
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B",
        "--dataset_name", "trl-lib/Capybara",
        "--output_dir", "my-model",
        "--push_to_hub",
        "--hub_model_id", "username/my-model"
    ],
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**Benefits:** No code to write, maintained by TRL team, production-tested
**When to use:** Standard TRL training, quick experiments, don't need custom code
**Available:** Scripts are available from https://github.com/huggingface/trl/tree/main/examples/scripts

### Finding More UV Scripts on Hub

The `uv-scripts` organization provides ready-to-use UV scripts stored as datasets on Hugging Face Hub:

```python
# Discover available UV script collections
dataset_search({"author": "uv-scripts", "sort": "downloads", "limit": 20})

# Explore a specific collection
hub_repo_details(["uv-scripts/classification"], repo_type="dataset", include_readme=True)
```

**Popular collections:** ocr, classification, synthetic-data, vllm, dataset-creation

### Approach 3: HF Jobs CLI (Direct Terminal Commands)

When the `hf_jobs()` MCP tool is unavailable, use the `hf jobs` CLI directly.

**⚠️ CRITICAL: CLI Syntax Rules**

```bash
# ✅ CORRECT syntax - flags BEFORE script URL
hf jobs uv run --flavor a10g-large --timeout 2h --secrets HF_TOKEN "https://example.com/train.py"

# ❌ WRONG - "run uv" instead of "uv run"
hf jobs run uv "https://example.com/train.py" --flavor a10g-large

# ❌ WRONG - flags AFTER script URL (will be ignored!)
hf jobs uv run "https://example.com/train.py" --flavor a10g-large

# ❌ WRONG - "--secret" instead of "--secrets" (plural)
hf jobs uv run --secret HF_TOKEN "https://example.com/train.py"
```

**Key syntax rules:**
1. Command order is `hf jobs uv run` (NOT `hf jobs run uv`)
2. All flags (`--flavor`, `--timeout`, `--secrets`) must come BEFORE the script URL
3. Use `--secrets` (plural), not `--secret`
4. Script URL must be the last positional argument

**Complete CLI example:**
```bash
hf jobs uv run \
  --flavor a10g-large \
  --timeout 2h \
  --secrets HF_TOKEN \
  "https://huggingface.co/user/repo/resolve/main/train.py"
```

**Check job status via CLI:**
```bash
hf jobs ps                # List all jobs
hf jobs logs <job-id>     # View logs
hf jobs inspect <job-id>  # Job details
hf jobs cancel <job-id>   # Cancel a job
```

### Approach 4: TRL Jobs Package (Simplified Training)

The `trl-jobs` package provides optimized defaults and one-liner training.

```bash
# Install
pip install trl-jobs

# Train with SFT (simplest possible)
trl-jobs sft \
  --model_name Qwen/Qwen2.5-0.5B \
  --dataset_name trl-lib/Capybara
```

**Benefits:** Pre-configured settings, automatic Trackio integration, automatic Hub push, one-line commands
**When to use:** User working in terminal directly (not Claude Code context), quick local experimentation
**Repository:** https://github.com/huggingface/trl-jobs

⚠️ **In Claude Code context, prefer using the `hf_jobs()` MCP tool (Approach 1) when available.**

## Hardware Selection

| Model Size | Recommended Hardware | Cost (approx/hr) | Use Case |
|------------|---------------------|------------------|----------|
| <1B params | `t4-small` | ~$0.75 | Demos, quick tests only without eval steps |
| 1-3B params | `t4-medium`, `l4x1` | ~$1.50-2.50 | Development |
| 3-7B params | `a10g-small`, `a10g-large` | ~$3.50-5.00 | Production training |
| 7-13B params | `a10g-large`, `a100-large` | ~$5-10 | Large models (use LoRA) |
| 13B+ params | `a100-large`, `a10g-largex2` | ~$10-20 | Very large (use LoRA) |

**GPU Flavors:** cpu-basic/upgrade/performance/xl, t4-small/medium, l4x1/x4, a10g-small/large/largex2/largex4, a100-large, h100/h100x8

**Guidelines:**
- Use **LoRA/PEFT** for models >7B to reduce memory
- Multi-GPU automatically handled by TRL/Accelerate
- Start with smaller hardware for testing
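A back-of-envelope cost check can be done from the table's approximate per-hour rates. A sketch (rates are the rough figures from the table above; actual billing may differ):

```python
# Approximate per-hour rates from the hardware table (illustrative).
RATE_PER_HOUR = {"t4-small": 0.75, "a10g-large": 5.00}

def estimated_cost(flavor, hours):
    # Simple rate * duration estimate; no startup or storage overhead.
    return RATE_PER_HOUR[flavor] * hours

print(estimated_cost("a10g-large", 2))  # → 10.0
```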

**See:** `references/hardware_guide.md` for detailed specifications
## Critical: Saving Results to Hub

**⚠️ EPHEMERAL ENVIRONMENT—MUST PUSH TO HUB**

The Jobs environment is temporary. All files are deleted when the job ends. If the model isn't pushed to Hub, **ALL TRAINING IS LOST**.

### Required Configuration

**In training script/config:**
```python
SFTConfig(
    push_to_hub=True,
    hub_model_id="username/model-name",  # MUST specify
    hub_strategy="every_save",  # Optional: push checkpoints
)
```

**In job submission:**
```python
{
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Enables authentication
}
```

### Verification Checklist

Before submitting:
- [ ] `push_to_hub=True` set in config
- [ ] `hub_model_id` includes username/repo-name
- [ ] `secrets` parameter includes HF_TOKEN
- [ ] User has write access to target repo
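The second checklist item can be checked mechanically. A hypothetical pre-flight helper (not part of TRL or huggingface_hub):

```python
def valid_hub_model_id(model_id):
    # hub_model_id must have the "username/repo-name" shape:
    # exactly one slash with non-empty parts on both sides.
    parts = model_id.split("/")
    return len(parts) == 2 and all(parts)

print(valid_hub_model_id("username/my-model"))  # → True
print(valid_hub_model_id("my-model"))           # → False
```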

**See:** `references/hub_saving.md` for detailed troubleshooting
## Timeout Management

**⚠️ DEFAULT: 30 MINUTES—TOO SHORT FOR TRAINING**

### Setting Timeouts

```python
{
    "timeout": "2h"  # 2 hours (formats: "90m", "2h", "1.5h", or seconds as an integer)
}
```

### Timeout Guidelines

| Scenario | Recommended | Notes |
|----------|-------------|-------|
| Quick demo (50-100 examples) | 10-30 min | Verify setup |
| Development training | 1-2 hours | Small datasets |
| Production (3-7B model) | 4-6 hours | Full datasets |
| Large model with LoRA | 3-6 hours | Depends on dataset |

**Always add 20-30% buffer** for model/dataset loading, checkpoint saving, Hub push operations, and network delays.
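The buffer arithmetic can be sketched as a small helper (hypothetical, not part of any library; the `timeout` parameter also accepts seconds as an integer):

```python
def buffered_timeout(estimated_seconds, buffer=0.3):
    # Add a ~30% safety margin on top of the estimated runtime.
    return round(estimated_seconds * (1 + buffer))

print(buffered_timeout(2 * 3600))  # → 9360 (a 2 h estimate becomes ~2.6 h)
```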

**On timeout:** Job killed immediately, all unsaved progress lost, must restart from beginning
## Cost Estimation

**Offer to estimate cost when planning jobs with known parameters.** Use `scripts/estimate_cost.py`:

```bash
uv run scripts/estimate_cost.py \
  --model meta-llama/Llama-2-7b-hf \
  --dataset trl-lib/Capybara \
  --hardware a10g-large \
  --dataset-size 16000 \
  --epochs 3
```

Output includes estimated time, cost, recommended timeout (with buffer), and optimization suggestions.

**When to offer:** User planning a job, asks about cost/time, choosing hardware, job will run >1 hour or cost >$5

## Example Training Scripts

**Production-ready templates with all best practices:**

Load these scripts as needed:

- **`scripts/train_sft_example.py`** - Complete SFT training with Trackio, LoRA, checkpoints
- **`scripts/train_dpo_example.py`** - DPO training for preference learning
- **`scripts/train_grpo_example.py`** - GRPO training for online RL

These scripts demonstrate proper Hub saving, Trackio integration, checkpoint management, and optimized parameters. Pass their content inline to `hf_jobs()` or use as templates for custom scripts.

## Monitoring and Tracking

**Trackio** provides real-time metrics visualization. See `references/trackio_guide.md` for the complete setup guide.

**Key points:**
- Add `trackio` to dependencies
- Configure trainer with `report_to="trackio"` and `run_name="meaningful_name"`
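The two settings above look like this in an SFTConfig (config fragment; the project and run names are illustrative):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="my-model",
    report_to="trackio",
    project="capybara-sft",        # groups related runs under one project
    run_name="qwen0.5b-lora-r16",  # identifies this specific run
)
```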

### Trackio Configuration Defaults

**Use sensible defaults unless user specifies otherwise.** When generating training scripts with Trackio:

**Default Configuration:**
- **Space ID**: `{username}/trackio` (use "trackio" as default space name)
- **Run naming**: Unless otherwise specified, name the run in a way the user will recognize (e.g., descriptive of the task, model, or purpose)
- **Config**: Keep minimal - only include hyperparameters and model/dataset info
- **Project name**: Use a project name to associate runs with a particular project

**User overrides:** If the user requests a specific trackio configuration (custom space, run naming, grouping, or additional config), apply their preferences instead of the defaults. This is useful for managing multiple jobs with the same configuration or keeping training scripts portable.

See `references/trackio_guide.md` for complete documentation, including grouping runs for experiments.

### Check Job Status

```python
# List all jobs
hf_jobs("ps")

# Inspect specific job
hf_jobs("inspect", {"job_id": "your-job-id"})

# View logs
hf_jobs("logs", {"job_id": "your-job-id"})
```

**Remember:** Wait for the user to request status checks. Avoid polling repeatedly.

## Dataset Validation

**Validate dataset format BEFORE launching GPU training to prevent the #1 cause of training failures: format mismatches.**

### Why Validate

- 50%+ of training failures are due to dataset format issues
- DPO is especially strict: it requires exact column names (`prompt`, `chosen`, `rejected`)
- Failed GPU jobs waste $1-10 and 30-60 minutes
- Validation on CPU costs ~$0.01 and takes <1 minute

### When to Validate

**ALWAYS validate for:**
- Unknown or custom datasets
- DPO training (CRITICAL - 90% of datasets need mapping)
- Any dataset not explicitly TRL-compatible

**Skip validation for known TRL datasets:**
- `trl-lib/ultrachat_200k`, `trl-lib/Capybara`, `HuggingFaceH4/ultrachat_200k`, etc.

### Usage

```python
hf_jobs("uv", {
    "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
    "script_args": ["--dataset", "username/dataset-name", "--split", "train"]
})
```

The script is fast and will usually complete synchronously.

### Reading Results

The output shows compatibility for each training method:

- **`✓ READY`** - Dataset is compatible, use directly
- **`✗ NEEDS MAPPING`** - Compatible but needs preprocessing (mapping code provided)
- **`✗ INCOMPATIBLE`** - Cannot be used for this method

When mapping is needed, the output includes a **"MAPPING CODE"** section with copy-paste ready Python code.

### Example Workflow

```python
# 1. Inspect dataset (costs ~$0.01, <1 min on CPU)
hf_jobs("uv", {
    "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
    "script_args": ["--dataset", "argilla/distilabel-math-preference-dpo", "--split", "train"]
})

# 2. Check output markers:
#    ✓ READY → proceed with training
#    ✗ NEEDS MAPPING → apply mapping code below
#    ✗ INCOMPATIBLE → choose different method/dataset

# 3. If mapping needed, apply before training:
def format_for_dpo(example):
    return {
        'prompt': example['instruction'],
        'chosen': example['chosen_response'],
        'rejected': example['rejected_response'],
    }

dataset = dataset.map(format_for_dpo, remove_columns=dataset.column_names)

# 4. Launch training job with confidence
```

### Common Scenario: DPO Format Mismatch

Most DPO datasets use non-standard column names. Example:

```
Dataset has: instruction, chosen_response, rejected_response
DPO expects: prompt, chosen, rejected
```

The validator detects this and provides exact mapping code to fix it.

## Converting Models to GGUF

After training, convert models to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.

**What is GGUF:**
- Optimized for CPU/GPU inference with llama.cpp
- Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
- Compatible with Ollama, LM Studio, Jan, GPT4All, llama.cpp
- Typically 2-8GB for 7B models (vs 14GB unquantized)

**When to convert:**
- Running models locally with Ollama or LM Studio
- Reducing model size with quantization
- Deploying to edge devices
- Sharing models for local-first use

**See:** `references/gguf_conversion.md` for the complete conversion guide, including a production-ready conversion script, quantization options, hardware requirements, usage examples, and troubleshooting.

**Quick conversion:**
```python
hf_jobs("uv", {
    "script": "<see references/gguf_conversion.md for complete script>",
    "flavor": "a10g-large",
    "timeout": "45m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    "env": {
        "ADAPTER_MODEL": "username/my-finetuned-model",
        "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
        "OUTPUT_REPO": "username/my-model-gguf"
    }
})
```

## Common Training Patterns

See `references/training_patterns.md` for detailed examples including:
- Quick demo (5-10 minutes)
- Production with checkpoints
- Multi-GPU training
- DPO training (preference learning)
- GRPO training (online RL)

## Common Failure Modes

### Out of Memory (OOM)

**Fix (try in order):**
1. Reduce batch size: `per_device_train_batch_size=1`, increase `gradient_accumulation_steps=8`. Effective batch size is `per_device_train_batch_size` × `gradient_accumulation_steps`. For best performance, keep the effective batch size close to 128.
2. Enable: `gradient_checkpointing=True`
3. Upgrade hardware: t4-small → l4x1, a10g-small → a10g-large, etc.
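The effective-batch-size arithmetic from step 1 can be sketched as a helper (hypothetical, not part of TRL):

```python
def accumulation_steps(target_effective, per_device_batch, n_gpus=1):
    # Ceiling division: pick gradient_accumulation_steps so that
    # per_device_batch * n_gpus * steps is at least the target.
    return max(1, -(-target_effective // (per_device_batch * n_gpus)))

print(accumulation_steps(128, per_device_batch=1))  # → 128
print(accumulation_steps(128, per_device_batch=4))  # → 32
```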

### Dataset Misformatted

**Fix:**
1. Validate first with dataset inspector:
   ```bash
   uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
     --dataset name --split train
   ```
2. Check output for compatibility markers (✓ READY, ✗ NEEDS MAPPING, ✗ INCOMPATIBLE)
3. Apply mapping code from inspector output if needed

### Job Timeout

**Fix:**
1. Check logs for actual runtime: `hf_jobs("logs", {"job_id": "..."})`
2. Increase timeout with buffer: `"timeout": "3h"` (add 30% to estimated time)
3. Or reduce training: lower `num_train_epochs`, use smaller dataset, enable `max_steps`
4. Save checkpoints: `save_strategy="steps"`, `save_steps=500`, `hub_strategy="every_save"`

**Note:** Default 30min is insufficient for real training. Minimum 1-2 hours.

### Hub Push Failures

**Fix:**
1. Add to job: `secrets={"HF_TOKEN": "$HF_TOKEN"}`
2. Add to config: `push_to_hub=True`, `hub_model_id="username/model-name"`
3. Verify auth: `mcp__huggingface__hf_whoami()`
4. Check token has write permissions and repo exists (or set `hub_private_repo=True`)

### Missing Dependencies

**Fix:**
Add to PEP 723 header:
```python
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0", "trackio", "missing-package"]
# ///
```

## Troubleshooting

**Common issues:**
- Job times out → Increase timeout, reduce epochs/dataset, use smaller model/LoRA
- Model not saved to Hub → Check `push_to_hub=True`, `hub_model_id`, `secrets=HF_TOKEN`
- Out of Memory (OOM) → Reduce batch size, increase gradient accumulation, enable LoRA, use larger GPU
- Dataset format error → Validate with dataset inspector (see Dataset Validation section)
- Import/module errors → Add PEP 723 header with dependencies, verify format
- Authentication errors → Check `mcp__huggingface__hf_whoami()`, token permissions, secrets parameter

**See:** `references/troubleshooting.md` for complete troubleshooting guide

## Resources

### References (In This Skill)
- `references/training_methods.md` - Overview of SFT, DPO, GRPO, KTO, PPO, Reward Modeling
- `references/training_patterns.md` - Common training patterns and examples
- `references/unsloth.md` - Unsloth for fast VLM training (~2x speed, 60% less VRAM)
- `references/gguf_conversion.md` - Complete GGUF conversion guide
- `references/trackio_guide.md` - Trackio monitoring setup
- `references/hardware_guide.md` - Hardware specs and selection
- `references/hub_saving.md` - Hub authentication troubleshooting
- `references/troubleshooting.md` - Common issues and solutions

### Scripts (In This Skill)
- `scripts/train_sft_example.py` - Production SFT template
- `scripts/train_dpo_example.py` - Production DPO template
- `scripts/train_grpo_example.py` - Production GRPO template
- `scripts/unsloth_sft_example.py` - Unsloth text LLM training template (faster, less VRAM)
- `scripts/estimate_cost.py` - Estimate time and cost (offer when appropriate)
- `scripts/convert_to_gguf.py` - Complete GGUF conversion script

### External Scripts
- [Dataset Inspector](https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py) - Validate dataset format before training (use via `uv run` or `hf_jobs`)

### External Links
- [TRL Documentation](https://huggingface.co/docs/trl)
- [TRL Jobs Training Guide](https://huggingface.co/docs/trl/en/jobs_training)
- [TRL Jobs Package](https://github.com/huggingface/trl-jobs)
- [HF Jobs Documentation](https://huggingface.co/docs/huggingface_hub/guides/jobs)
- [TRL Example Scripts](https://github.com/huggingface/trl/tree/main/examples/scripts)
- [UV Scripts Guide](https://docs.astral.sh/uv/guides/scripts/)
- [UV Scripts Organization](https://huggingface.co/uv-scripts)

## Key Takeaways

1. **Submit scripts inline** - The `script` parameter accepts Python code directly; no file saving required unless user requests
2. **Jobs are asynchronous** - Don't wait/poll; let user check when ready
3. **Always set timeout** - Default 30 min is insufficient; minimum 1-2 hours recommended
4. **Always enable Hub push** - Environment is ephemeral; without push, all results lost
5. **Include Trackio** - Use example scripts as templates for real-time monitoring
6. **Offer cost estimation** - When parameters are known, use `scripts/estimate_cost.py`
7. **Use UV scripts (Approach 1)** - Default to `hf_jobs("uv", {...})` with inline scripts; TRL maintained scripts for standard training; avoid bash `trl-jobs` commands in Claude Code
8. **Use hf_doc_fetch/hf_doc_search** for latest TRL documentation
9. **Validate dataset format** before training with dataset inspector (see Dataset Validation section)
10. **Choose appropriate hardware** for model size; use LoRA for models >7B
skills/hugging-face-model-trainer/references/gguf_conversion.md
ADDED
# GGUF Conversion Guide

After training models with TRL on Hugging Face Jobs, convert them to **GGUF format** for use with llama.cpp, Ollama, LM Studio, and other local inference tools.

**This guide provides production-ready, tested code based on successful conversions.** All critical dependencies and build steps are included.

## What is GGUF?

**GGUF** (GPT-Generated Unified Format):
- Optimized format for CPU/GPU inference with llama.cpp
- Supports quantization (4-bit, 5-bit, 8-bit) to reduce model size
- Compatible with: Ollama, LM Studio, Jan, GPT4All, llama.cpp
- Typically 2-8GB for 7B models (vs 14GB unquantized)

## When to Convert to GGUF

**Convert when:**
- Running models locally with Ollama or LM Studio
- Using CPU-optimized inference
- Reducing model size with quantization
- Deploying to edge devices
- Sharing models for local-first use
## Critical Success Factors

Based on production testing, these are **essential** for reliable conversion:

### 1. ✅ Install Build Tools FIRST
**Before cloning llama.cpp**, install build dependencies:
```python
import subprocess

subprocess.run(["apt-get", "update", "-qq"], check=True, capture_output=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True, capture_output=True)
```

**Why:** The quantization tool requires gcc and cmake. Installing after cloning doesn't help.

### 2. ✅ Use CMake (Not Make)
**Build the quantize tool with CMake:**
```python
import os
import subprocess

# Create build directory
os.makedirs("/tmp/llama.cpp/build", exist_ok=True)

# Configure
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Faster build, CUDA not needed for quantization
], check=True, capture_output=True, text=True)

# Build
subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True, capture_output=True, text=True)

# Binary path
quantize_bin = "/tmp/llama.cpp/build/bin/llama-quantize"
```

**Why:** CMake is more reliable than `make` and produces consistent binary paths.

### 3. ✅ Include All Dependencies
**PEP 723 header must include:**
```python
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizer
#     "protobuf>=3.20.0",  # Required for tokenizer
#     "numpy",
#     "gguf",
# ]
# ///
```

**Why:** `sentencepiece` and `protobuf` are critical for tokenizer conversion. Missing them causes silent failures.

### 4. ✅ Verify Names Before Use
**Always verify repos exist:**
```python
# Before submitting job, verify:
hub_repo_details([ADAPTER_MODEL], repo_type="model")
hub_repo_details([BASE_MODEL], repo_type="model")
```

**Why:** Non-existent dataset/model names cause job failures that could be caught in seconds.

## Complete Conversion Script

See `scripts/convert_to_gguf.py` for the complete, production-ready script.

**Key features:**
- ✅ All dependencies in PEP 723 header
- ✅ Build tools installed automatically
- ✅ CMake build process (reliable)
- ✅ Comprehensive error handling
- ✅ Environment variable configuration
- ✅ Automatic README generation

## Quick Conversion Job

```python
# Before submitting: VERIFY MODELS EXIST
hub_repo_details(["username/my-finetuned-model"], repo_type="model")
hub_repo_details(["Qwen/Qwen2.5-0.5B"], repo_type="model")

# Submit conversion job
hf_jobs("uv", {
    "script": open("trl/scripts/convert_to_gguf.py").read(),  # Or inline the script
    "flavor": "a10g-large",
    "timeout": "45m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"},
    "env": {
        "ADAPTER_MODEL": "username/my-finetuned-model",
        "BASE_MODEL": "Qwen/Qwen2.5-0.5B",
        "OUTPUT_REPO": "username/my-model-gguf",
        "HF_USERNAME": "username"  # Optional, for README
    }
})
```

## Conversion Process
|
| 126 |
+
|
| 127 |
+
The script performs these steps:
|
| 128 |
+
|
| 129 |
+
1. **Load and Merge** - Load base model and LoRA adapter, merge them
|
| 130 |
+
2. **Install Build Tools** - Install gcc, cmake (CRITICAL: before cloning llama.cpp)
|
| 131 |
+
3. **Setup llama.cpp** - Clone repo, install Python dependencies
|
| 132 |
+
4. **Convert to GGUF** - Create FP16 GGUF using llama.cpp converter
|
| 133 |
+
5. **Build Quantize Tool** - Use CMake to build `llama-quantize`
|
| 134 |
+
6. **Quantize** - Create Q4_K_M, Q5_K_M, Q8_0 versions
|
| 135 |
+
7. **Upload** - Upload all versions + README to Hub
|
| 136 |
+
|
| 137 |
+
## Quantization Options
|
| 138 |
+
|
| 139 |
+
Common quantization formats (from smallest to largest):
|
| 140 |
+
|
| 141 |
+
| Format | Size | Quality | Use Case |
|
| 142 |
+
|--------|------|---------|----------|
|
| 143 |
+
| **Q4_K_M** | ~300MB | Good | **Recommended** - best balance of size/quality |
|
| 144 |
+
| **Q5_K_M** | ~350MB | Better | Higher quality, slightly larger |
|
| 145 |
+
| **Q8_0** | ~500MB | Very High | Near-original quality |
|
| 146 |
+
| **F16** | ~1GB | Original | Full precision, largest file |
|
| 147 |
+
|
| 148 |
+
**Recommendation:** Create Q4_K_M, Q5_K_M, and Q8_0 versions to give users options.
|
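As a sanity check on the sizes above, file size scales roughly with parameter count times bits per weight. A small illustrative estimator (the bits-per-weight figures are approximate community numbers, not exact llama.cpp values):

```python
# Rough GGUF size estimate: params (billions) x bits-per-weight / 8 = GB.
# Bits-per-weight values below are approximations, not exact llama.cpp figures.
BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billions: float, quant: str) -> float:
    """Approximate GGUF file size in GB for a given quantization format."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

# A 0.5B model, as in the table above
for fmt in BITS_PER_WEIGHT:
    print(fmt, round(gguf_size_gb(0.5, fmt), 2), "GB")
```

For a 0.5B model this reproduces the table's rough sizes (~0.3GB for Q4_K_M, ~1GB for F16).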
| 149 |
+
|
| 150 |
+
## Hardware Requirements
|
| 151 |
+
|
| 152 |
+
**For conversion:**
|
| 153 |
+
- Small models (<1B): `cpu-basic` works, but slow
|
| 154 |
+
- Medium models (1-7B): a10g-large recommended
|
| 155 |
+
- Large models (7B+): a10g-large or a100-large
|
| 156 |
+
|
| 157 |
+
**Time estimates:**
|
| 158 |
+
- 0.5B model: ~15-25 minutes on A10G
|
| 159 |
+
- 3B model: ~30-45 minutes on A10G
|
| 160 |
+
- 7B model: ~45-60 minutes on A10G
|
| 161 |
+
|
| 162 |
+
## Using GGUF Models
|
| 163 |
+
|
| 164 |
+
**GGUF models work on both CPU and GPU.** They're optimized for CPU inference but can also leverage GPU acceleration when available.
|
| 165 |
+
|
| 166 |
+
### With Ollama (auto-detects GPU)
|
| 167 |
+
```bash
|
| 168 |
+
# Download GGUF
|
| 169 |
+
huggingface-cli download username/my-model-gguf model-q4_k_m.gguf
|
| 170 |
+
|
| 171 |
+
# Create Modelfile
|
| 172 |
+
echo "FROM ./model-q4_k_m.gguf" > Modelfile
|
| 173 |
+
|
| 174 |
+
# Create and run (uses GPU automatically if available)
|
| 175 |
+
ollama create my-model -f Modelfile
|
| 176 |
+
ollama run my-model
|
| 177 |
+
```
|
| 178 |
+
|
| 179 |
+
### With llama.cpp
|
| 180 |
+
```bash
|
| 181 |
+
# CPU only
|
| 182 |
+
./llama-cli -m model-q4_k_m.gguf -p "Your prompt"
|
| 183 |
+
|
| 184 |
+
# With GPU acceleration (offload 32 layers to GPU)
|
| 185 |
+
./llama-cli -m model-q4_k_m.gguf -ngl 32 -p "Your prompt"
|
| 186 |
+
```
|
| 187 |
+
|
| 188 |
+
### With LM Studio
|
| 189 |
+
1. Download the `.gguf` file
|
| 190 |
+
2. Import into LM Studio
|
| 191 |
+
3. Start chatting
|
| 192 |
+
|
| 193 |
+
## Best Practices
|
| 194 |
+
|
| 195 |
+
### ✅ DO:
|
| 196 |
+
1. **Verify repos exist** before submitting jobs (use `hub_repo_details`)
|
| 197 |
+
2. **Install build tools FIRST** before cloning llama.cpp
|
| 198 |
+
3. **Use CMake** for building quantize tool (not make)
|
| 199 |
+
4. **Include all dependencies** in PEP 723 header (especially sentencepiece, protobuf)
|
| 200 |
+
5. **Create multiple quantizations** - Give users choice
|
| 201 |
+
6. **Test on known models** before production use
|
| 202 |
+
7. **Use A10G GPU** for faster conversion
|
| 203 |
+
|
| 204 |
+
### ❌ DON'T:
|
| 205 |
+
1. **Assume repos exist** - Always verify with hub tools
|
| 206 |
+
2. **Use make** instead of CMake - Less reliable
|
| 207 |
+
3. **Remove dependencies** to "simplify" - They're all needed
|
| 208 |
+
4. **Skip build tools** - Quantization will fail silently
|
| 209 |
+
5. **Assume default binary paths** - CMake puts binaries in `build/bin/`, not the repo root
|
| 210 |
+
|
| 211 |
+
## Common Issues
|
| 212 |
+
|
| 213 |
+
### Out of memory during merge
|
| 214 |
+
**Fix:**
|
| 215 |
+
- Use larger GPU (a10g-large or a100-large)
|
| 216 |
+
- Ensure `device_map="auto"` for automatic placement
|
| 217 |
+
- Use `dtype=torch.float16` or `torch.bfloat16`
|
| 218 |
+
|
| 219 |
+
### Conversion fails with architecture error
|
| 220 |
+
**Fix:**
|
| 221 |
+
- Ensure llama.cpp supports the model architecture
|
| 222 |
+
- Check for standard architecture (Qwen, Llama, Mistral, etc.)
|
| 223 |
+
- Update llama.cpp to latest: `git clone --depth 1 https://github.com/ggerganov/llama.cpp.git`
|
| 224 |
+
- Check llama.cpp documentation for model support
|
| 225 |
+
|
| 226 |
+
### Quantization fails
|
| 227 |
+
**Fix:**
|
| 228 |
+
- Verify build tools installed: `apt-get install build-essential cmake`
|
| 229 |
+
- Use CMake (not make) to build quantize tool
|
| 230 |
+
- Check binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
|
| 231 |
+
- Verify FP16 GGUF exists before quantizing
|
| 232 |
+
|
| 233 |
+
### Missing sentencepiece error
|
| 234 |
+
**Fix:**
|
| 235 |
+
- Add to PEP 723 header: `"sentencepiece>=0.1.99", "protobuf>=3.20.0"`
|
| 236 |
+
- Don't remove dependencies to "simplify" - all are required
|
| 237 |
+
|
| 238 |
+
### Upload fails or times out
|
| 239 |
+
**Fix:**
|
| 240 |
+
- Large models (>2GB) need longer timeout: `"timeout": "1h"`
|
| 241 |
+
- Upload quantized versions separately if needed
|
| 242 |
+
- Check network/Hub status
|
| 243 |
+
|
| 244 |
+
## Lessons Learned
|
| 245 |
+
|
| 246 |
+
These are from production testing and real failures:
|
| 247 |
+
|
| 248 |
+
### 1. Always Verify Before Use
|
| 249 |
+
**Lesson:** Don't assume repos/datasets exist. Check first.
|
| 250 |
+
```python
|
| 251 |
+
# BEFORE submitting job
|
| 252 |
+
hub_repo_details(["trl-lib/argilla-dpo-mix-7k"], repo_type="dataset") # Would catch error
|
| 253 |
+
```
|
| 254 |
+
**Prevented failures:** Non-existent dataset names, typos in model names
|
| 255 |
+
|
| 256 |
+
### 2. Prioritize Reliability Over Performance
|
| 257 |
+
**Lesson:** Default to what's most likely to succeed.
|
| 258 |
+
- Use CMake (not make) - more reliable
|
| 259 |
+
- Disable CUDA in build - faster, not needed
|
| 260 |
+
- Include all dependencies - don't "simplify"
|
| 261 |
+
|
| 262 |
+
**Prevented failures:** Build failures, missing binaries
|
| 263 |
+
|
| 264 |
+
### 3. Create Atomic, Self-Contained Scripts
|
| 265 |
+
**Lesson:** Don't remove dependencies or steps. Scripts should work as a unit.
|
| 266 |
+
- All dependencies in PEP 723 header
|
| 267 |
+
- All build steps included
|
| 268 |
+
- Clear error messages
|
| 269 |
+
|
| 270 |
+
**Prevented failures:** Missing tokenizer libraries, build tool failures
|
| 271 |
+
|
| 272 |
+
## References
|
| 273 |
+
|
| 274 |
+
**In this skill:**
|
| 275 |
+
- `scripts/convert_to_gguf.py` - Complete, production-ready script
|
| 276 |
+
|
| 277 |
+
**External:**
|
| 278 |
+
- [llama.cpp Repository](https://github.com/ggerganov/llama.cpp)
|
| 279 |
+
- [GGUF Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
|
| 280 |
+
- [Ollama Documentation](https://ollama.ai)
|
| 281 |
+
- [LM Studio](https://lmstudio.ai)
|
| 282 |
+
|
| 283 |
+
## Summary
|
| 284 |
+
|
| 285 |
+
**Critical checklist for GGUF conversion:**
|
| 286 |
+
- [ ] Verify adapter and base models exist on Hub
|
| 287 |
+
- [ ] Use production script from `scripts/convert_to_gguf.py`
|
| 288 |
+
- [ ] All dependencies in PEP 723 header (including sentencepiece, protobuf)
|
| 289 |
+
- [ ] Build tools installed before cloning llama.cpp
|
| 290 |
+
- [ ] CMake used for building quantize tool (not make)
|
| 291 |
+
- [ ] Correct binary path: `/tmp/llama.cpp/build/bin/llama-quantize`
|
| 292 |
+
- [ ] A10G GPU selected for reasonable conversion time
|
| 293 |
+
- [ ] Timeout set to 45m minimum
|
| 294 |
+
- [ ] HF_TOKEN in secrets for Hub upload
|
| 295 |
+
|
| 296 |
+
**The script in `scripts/convert_to_gguf.py` incorporates all these lessons and has been tested successfully in production.**
|
skills/hugging-face-model-trainer/references/hardware_guide.md
ADDED
|
@@ -0,0 +1,283 @@
|
| 1 |
+
# Hardware Selection Guide
|
| 2 |
+
|
| 3 |
+
Choosing the right hardware (flavor) is critical for cost-effective training.
|
| 4 |
+
|
| 5 |
+
## Available Hardware
|
| 6 |
+
|
| 7 |
+
### CPU
|
| 8 |
+
- `cpu-basic` - Basic CPU, testing only
|
| 9 |
+
- `cpu-upgrade` - Enhanced CPU
|
| 10 |
+
|
| 11 |
+
**Use cases:** Dataset validation, preprocessing, testing scripts
|
| 12 |
+
**Not recommended for training:** Too slow for any meaningful training
|
| 13 |
+
|
| 14 |
+
### GPU Options
|
| 15 |
+
|
| 16 |
+
| Flavor | GPU | Memory | Use Case | Cost/hour |
|
| 17 |
+
|--------|-----|--------|----------|-----------|
|
| 18 |
+
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
|
| 19 |
+
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
|
| 20 |
+
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
|
| 21 |
+
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
|
| 22 |
+
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
|
| 23 |
+
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
|
| 24 |
+
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
|
| 25 |
+
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
|
| 26 |
+
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
|
| 27 |
+
|
| 28 |
+
### TPU Options
|
| 29 |
+
|
| 30 |
+
| Flavor | Type | Use Case |
|
| 31 |
+
|--------|------|----------|
|
| 32 |
+
| `v5e-1x1` | TPU v5e | Small TPU workloads |
|
| 33 |
+
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
|
| 34 |
+
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |
|
| 35 |
+
|
| 36 |
+
**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.
|
| 37 |
+
|
| 38 |
+
## Selection Guidelines
|
| 39 |
+
|
| 40 |
+
### By Model Size
|
| 41 |
+
|
| 42 |
+
**Tiny Models (<1B parameters)**
|
| 43 |
+
- **Recommended:** `t4-small`
|
| 44 |
+
- **Example:** Qwen2.5-0.5B, TinyLlama
|
| 45 |
+
- **Batch size:** 4-8
|
| 46 |
+
- **Training time:** 1-2 hours for 1K examples
|
| 47 |
+
|
| 48 |
+
**Small Models (1-3B parameters)**
|
| 49 |
+
- **Recommended:** `t4-medium` or `a10g-small`
|
| 50 |
+
- **Example:** Qwen2.5-1.5B, Phi-2
|
| 51 |
+
- **Batch size:** 2-4
|
| 52 |
+
- **Training time:** 2-4 hours for 10K examples
|
| 53 |
+
|
| 54 |
+
**Medium Models (3-7B parameters)**
|
| 55 |
+
- **Recommended:** `a10g-small` or `a10g-large`
|
| 56 |
+
- **Example:** Qwen2.5-7B, Mistral-7B
|
| 57 |
+
- **Batch size:** 1-2 (or LoRA with 4-8)
|
| 58 |
+
- **Training time:** 4-8 hours for 10K examples
|
| 59 |
+
|
| 60 |
+
**Large Models (7-13B parameters)**
|
| 61 |
+
- **Recommended:** `a10g-large` or `a100-large`
|
| 62 |
+
- **Example:** Llama-3-8B, Llama-2-13B
|
| 63 |
+
- **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
|
| 64 |
+
- **Training time:** 6-12 hours for 10K examples
|
| 65 |
+
- **Note:** Always use LoRA/PEFT
|
| 66 |
+
|
| 67 |
+
**Very Large Models (13B+ parameters)**
|
| 68 |
+
- **Recommended:** `a100-large` with LoRA
|
| 69 |
+
- **Example:** Mixtral-8x7B, Llama-3-70B (LoRA only)
|
| 70 |
+
- **Batch size:** 1-2 with LoRA
|
| 71 |
+
- **Training time:** 8-24 hours for 10K examples
|
| 72 |
+
- **Note:** Full fine-tuning not feasible, use LoRA/PEFT
|
| 73 |
+
|
| 74 |
+
### By Budget
|
| 75 |
+
|
| 76 |
+
**Minimal Budget (<$5 total)**
|
| 77 |
+
- Use `t4-small`
|
| 78 |
+
- Train on subset of data (100-500 examples)
|
| 79 |
+
- Limit to 1-2 epochs
|
| 80 |
+
- Use small model (<1B)
|
| 81 |
+
|
| 82 |
+
**Small Budget ($5-20)**
|
| 83 |
+
- Use `t4-medium` or `a10g-small`
|
| 84 |
+
- Train on 1K-5K examples
|
| 85 |
+
- 2-3 epochs
|
| 86 |
+
- Model up to 3B parameters
|
| 87 |
+
|
| 88 |
+
**Medium Budget ($20-50)**
|
| 89 |
+
- Use `a10g-small` or `a10g-large`
|
| 90 |
+
- Train on 5K-20K examples
|
| 91 |
+
- 3-5 epochs
|
| 92 |
+
- Model up to 7B parameters
|
| 93 |
+
|
| 94 |
+
**Large Budget ($50-200)**
|
| 95 |
+
- Use `a10g-large` or `a100-large`
|
| 96 |
+
- Full dataset training
|
| 97 |
+
- Multiple epochs
|
| 98 |
+
- Model up to 13B parameters with LoRA
|
| 99 |
+
|
| 100 |
+
### By Training Type
|
| 101 |
+
|
| 102 |
+
**Quick Demo/Experiment**
|
| 103 |
+
- `t4-small`
|
| 104 |
+
- 50-100 examples
|
| 105 |
+
- 5-10 steps
|
| 106 |
+
- ~10-15 minutes
|
| 107 |
+
|
| 108 |
+
**Development/Iteration**
|
| 109 |
+
- `t4-medium` or `a10g-small`
|
| 110 |
+
- 1K examples
|
| 111 |
+
- 1 epoch
|
| 112 |
+
- ~30-60 minutes
|
| 113 |
+
|
| 114 |
+
**Production Training**
|
| 115 |
+
- `a10g-large` or `a100-large`
|
| 116 |
+
- Full dataset
|
| 117 |
+
- 3-5 epochs
|
| 118 |
+
- 4-12 hours
|
| 119 |
+
|
| 120 |
+
**Research/Experimentation**
|
| 121 |
+
- `a100-large`
|
| 122 |
+
- Multiple runs
|
| 123 |
+
- Various hyperparameters
|
| 124 |
+
- Budget for 20-50 hours
|
| 125 |
+
|
| 126 |
+
## Memory Considerations
|
| 127 |
+
|
| 128 |
+
### Estimating Memory Requirements
|
| 129 |
+
|
| 130 |
+
**Full fine-tuning:**
|
| 131 |
+
```
|
| 132 |
+
Memory (GB) ≈ (Model params in billions) × 20
|
| 133 |
+
```
|
| 134 |
+
|
| 135 |
+
**LoRA fine-tuning:**
|
| 136 |
+
```
|
| 137 |
+
Memory (GB) ≈ (Model params in billions) × 4
|
| 138 |
+
```
|
| 139 |
+
|
| 140 |
+
**Examples:**
|
| 141 |
+
- Qwen2.5-0.5B full: ~10GB ✅ fits t4-small
|
| 142 |
+
- Qwen2.5-1.5B full: ~30GB ❌ exceeds most GPUs
|
| 143 |
+
- Qwen2.5-1.5B LoRA: ~6GB ✅ fits t4-small
|
| 144 |
+
- Qwen2.5-7B full: ~140GB ❌ not feasible
|
| 145 |
+
- Qwen2.5-7B LoRA: ~28GB ⚠️ exceeds a10g-large's 24GB; use a100-large or the optimizations below
|
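The two rules of thumb above can be written as a tiny helper (illustrative only; real memory usage also depends on sequence length, batch size, and optimizer choice):

```python
def training_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rule-of-thumb GPU memory: ~20 GB per billion params for full
    fine-tuning, ~4 GB per billion with LoRA."""
    return params_billions * (4 if lora else 20)

print(training_memory_gb(0.5))             # 10.0 -> fits t4-small
print(training_memory_gb(7.0, lora=True))  # 28.0
```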
| 146 |
+
|
| 147 |
+
### Memory Optimization
|
| 148 |
+
|
| 149 |
+
If hitting memory limits:
|
| 150 |
+
|
| 151 |
+
1. **Use LoRA/PEFT**
|
| 152 |
+
```python
|
| 153 |
+
peft_config=LoraConfig(r=16, lora_alpha=32)
|
| 154 |
+
```
|
| 155 |
+
|
| 156 |
+
2. **Reduce batch size**
|
| 157 |
+
```python
|
| 158 |
+
per_device_train_batch_size=1
|
| 159 |
+
```
|
| 160 |
+
|
| 161 |
+
3. **Increase gradient accumulation**
|
| 162 |
+
```python
|
| 163 |
+
gradient_accumulation_steps=8 # Effective batch size = 1×8
|
| 164 |
+
```
|
| 165 |
+
|
| 166 |
+
4. **Enable gradient checkpointing**
|
| 167 |
+
```python
|
| 168 |
+
gradient_checkpointing=True
|
| 169 |
+
```
|
| 170 |
+
|
| 171 |
+
5. **Use mixed precision**
|
| 172 |
+
```python
|
| 173 |
+
bf16=True # or fp16=True
|
| 174 |
+
```
|
| 175 |
+
|
| 176 |
+
6. **Upgrade to larger GPU**
|
| 177 |
+
- t4 → a10g → a100
|
| 178 |
+
|
| 179 |
+
## Cost Estimation
|
| 180 |
+
|
| 181 |
+
### Formula
|
| 182 |
+
|
| 183 |
+
```
|
| 184 |
+
Total Cost = (Hours of training) × (Cost per hour)
|
| 185 |
+
```
|
| 186 |
+
|
| 187 |
+
### Example Calculations
|
| 188 |
+
|
| 189 |
+
**Quick demo:**
|
| 190 |
+
- Hardware: t4-small ($0.75/hour)
|
| 191 |
+
- Time: 15 minutes (0.25 hours)
|
| 192 |
+
- Cost: $0.19
|
| 193 |
+
|
| 194 |
+
**Development training:**
|
| 195 |
+
- Hardware: a10g-small ($3.50/hour)
|
| 196 |
+
- Time: 2 hours
|
| 197 |
+
- Cost: $7.00
|
| 198 |
+
|
| 199 |
+
**Production training:**
|
| 200 |
+
- Hardware: a10g-large ($5/hour)
|
| 201 |
+
- Time: 6 hours
|
| 202 |
+
- Cost: $30.00
|
| 203 |
+
|
| 204 |
+
**Large model with LoRA:**
|
| 205 |
+
- Hardware: a100-large ($10/hour)
|
| 206 |
+
- Time: 8 hours
|
| 207 |
+
- Cost: $80.00
|
| 208 |
+
|
| 209 |
+
### Cost Optimization Tips
|
| 210 |
+
|
| 211 |
+
1. **Start small:** Test on t4-small with subset
|
| 212 |
+
2. **Use LoRA:** 4-5x cheaper than full fine-tuning
|
| 213 |
+
3. **Optimize hyperparameters:** Fewer epochs if possible
|
| 214 |
+
4. **Set appropriate timeout:** Don't waste compute on stalled jobs
|
| 215 |
+
5. **Use checkpointing:** Resume if job fails
|
| 216 |
+
6. **Monitor costs:** Check running jobs regularly
|
| 217 |
+
|
| 218 |
+
## Multi-GPU Training
|
| 219 |
+
|
| 220 |
+
TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.
|
| 221 |
+
|
| 222 |
+
**Multi-GPU flavors:**
|
| 223 |
+
- `l4x4` - 4x L4 GPUs
|
| 224 |
+
- `a10g-largex2` - 2x A10G GPUs
|
| 225 |
+
- `a10g-largex4` - 4x A10G GPUs
|
| 226 |
+
|
| 227 |
+
**When to use:**
|
| 228 |
+
- Models >13B parameters
|
| 229 |
+
- Need faster training (linear speedup)
|
| 230 |
+
- Large datasets (>50K examples)
|
| 231 |
+
|
| 232 |
+
**Example:**
|
| 233 |
+
```python
|
| 234 |
+
hf_jobs("uv", {
|
| 235 |
+
"script": "train.py",
|
| 236 |
+
"flavor": "a10g-largex2", # 2 GPUs
|
| 237 |
+
"timeout": "4h",
|
| 238 |
+
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
|
| 239 |
+
})
|
| 240 |
+
```
|
| 241 |
+
|
| 242 |
+
No code changes needed—TRL/Accelerate handles distribution automatically.
|
| 243 |
+
|
| 244 |
+
## Choosing Between Options
|
| 245 |
+
|
| 246 |
+
### a10g vs a100
|
| 247 |
+
|
| 248 |
+
**Choose a10g when:**
|
| 249 |
+
- Model <13B parameters
|
| 250 |
+
- Budget conscious
|
| 251 |
+
- Training time not critical
|
| 252 |
+
|
| 253 |
+
**Choose a100 when:**
|
| 254 |
+
- Model 13B+ parameters
|
| 255 |
+
- Need fastest training
|
| 256 |
+
- Memory requirements high
|
| 257 |
+
- Budget allows
|
| 258 |
+
|
| 259 |
+
### Single vs Multi-GPU
|
| 260 |
+
|
| 261 |
+
**Choose single GPU when:**
|
| 262 |
+
- Model <7B parameters
|
| 263 |
+
- Budget constrained
|
| 264 |
+
- Simpler debugging
|
| 265 |
+
|
| 266 |
+
**Choose multi-GPU when:**
|
| 267 |
+
- Model >13B parameters
|
| 268 |
+
- Need faster training
|
| 269 |
+
- Large batch sizes required
|
| 270 |
+
- Cost-effective for large jobs
|
| 271 |
+
|
| 272 |
+
## Quick Reference
|
| 273 |
+
|
| 274 |
+
```python
|
| 275 |
+
# Model size → Hardware selection
|
| 276 |
+
HARDWARE_MAP = {
|
| 277 |
+
"<1B": "t4-small",
|
| 278 |
+
"1-3B": "a10g-small",
|
| 279 |
+
"3-7B": "a10g-large",
|
| 280 |
+
"7-13B": "a10g-large (LoRA) or a100-large",
|
| 281 |
+
">13B": "a100-large (LoRA required)"
|
| 282 |
+
}
|
| 283 |
+
```
|
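The map above can be turned into a small lookup helper (thresholds mirror the table; the function name is illustrative):

```python
def pick_flavor(params_billions: float) -> str:
    """Suggest a Jobs hardware flavor from model size, per the map above."""
    if params_billions < 1:
        return "t4-small"
    if params_billions < 3:
        return "a10g-small"
    if params_billions < 7:
        return "a10g-large"
    if params_billions < 13:
        return "a10g-large (LoRA) or a100-large"
    return "a100-large (LoRA required)"

print(pick_flavor(0.5))  # t4-small
print(pick_flavor(7))    # a10g-large (LoRA) or a100-large
```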
skills/hugging-face-model-trainer/references/hub_saving.md
ADDED
|
@@ -0,0 +1,364 @@
|
| 1 |
+
# Saving Training Results to Hugging Face Hub
|
| 2 |
+
|
| 3 |
+
**⚠️ CRITICAL:** Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.
|
| 4 |
+
|
| 5 |
+
## Why Hub Push is Required
|
| 6 |
+
|
| 7 |
+
When running on Hugging Face Jobs:
|
| 8 |
+
- Environment is temporary
|
| 9 |
+
- All files deleted on job completion
|
| 10 |
+
- No local disk persistence
|
| 11 |
+
- Cannot access results after job ends
|
| 12 |
+
|
| 13 |
+
**Without Hub push, training is completely wasted.**
|
| 14 |
+
|
| 15 |
+
## Required Configuration
|
| 16 |
+
|
| 17 |
+
### 1. Training Configuration
|
| 18 |
+
|
| 19 |
+
In your SFTConfig or trainer config:
|
| 20 |
+
|
| 21 |
+
```python
|
| 22 |
+
SFTConfig(
|
| 23 |
+
push_to_hub=True, # Enable Hub push
|
| 24 |
+
hub_model_id="username/model-name", # Target repository
|
| 25 |
+
)
|
| 26 |
+
```
|
| 27 |
+
|
| 28 |
+
### 2. Job Configuration
|
| 29 |
+
|
| 30 |
+
When submitting the job:
|
| 31 |
+
|
| 32 |
+
```python
|
| 33 |
+
hf_jobs("uv", {
|
| 34 |
+
"script": "train.py",
|
| 35 |
+
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # Provide authentication
|
| 36 |
+
})
|
| 37 |
+
```
|
| 38 |
+
|
| 39 |
+
**The `$HF_TOKEN` placeholder is automatically replaced with your Hugging Face token.**
|
| 40 |
+
|
| 41 |
+
## Complete Example
|
| 42 |
+
|
| 43 |
+
```python
|
| 44 |
+
# train.py
|
| 45 |
+
# /// script
|
| 46 |
+
# dependencies = ["trl"]
|
| 47 |
+
# ///
|
| 48 |
+
|
| 49 |
+
from trl import SFTTrainer, SFTConfig
|
| 50 |
+
from datasets import load_dataset
|
| 51 |
+
|
| 52 |
+
dataset = load_dataset("trl-lib/Capybara", split="train")
|
| 53 |
+
|
| 54 |
+
# Configure with Hub push
|
| 55 |
+
config = SFTConfig(
|
| 56 |
+
output_dir="my-model",
|
| 57 |
+
num_train_epochs=3,
|
| 58 |
+
|
| 59 |
+
# ✅ CRITICAL: Hub push configuration
|
| 60 |
+
push_to_hub=True,
|
| 61 |
+
hub_model_id="myusername/my-trained-model",
|
| 62 |
+
|
| 63 |
+
# Optional: additional Hub settings
|
| 64 |
+
hub_strategy="every_save",  # when to push: "end", "every_save", or "checkpoint"
|
| 65 |
+
hub_private_repo=False,  # set True to create a private repo
|
| 66 |
+
hub_token=None,  # defaults to the HF_TOKEN in the environment
|
| 67 |
+
)
|
| 68 |
+
|
| 69 |
+
trainer = SFTTrainer(
|
| 70 |
+
model="Qwen/Qwen2.5-0.5B",
|
| 71 |
+
train_dataset=dataset,
|
| 72 |
+
args=config,
|
| 73 |
+
)
|
| 74 |
+
|
| 75 |
+
trainer.train()
|
| 76 |
+
|
| 77 |
+
# ✅ Push final model
|
| 78 |
+
trainer.push_to_hub()
|
| 79 |
+
|
| 80 |
+
print("✅ Model saved to: https://huggingface.co/myusername/my-trained-model")
|
| 81 |
+
```
|
| 82 |
+
|
| 83 |
+
**Submit with authentication:**
|
| 84 |
+
|
| 85 |
+
```python
|
| 86 |
+
hf_jobs("uv", {
|
| 87 |
+
"script": "train.py",
|
| 88 |
+
"flavor": "a10g-large",
|
| 89 |
+
"timeout": "2h",
|
| 90 |
+
"secrets": {"HF_TOKEN": "$HF_TOKEN"} # ✅ Required!
|
| 91 |
+
})
|
| 92 |
+
```
|
| 93 |
+
|
| 94 |
+
## What Gets Saved
|
| 95 |
+
|
| 96 |
+
When `push_to_hub=True`:
|
| 97 |
+
|
| 98 |
+
1. **Model weights** - Final trained parameters
|
| 99 |
+
2. **Tokenizer** - Associated tokenizer
|
| 100 |
+
3. **Configuration** - Model config (config.json)
|
| 101 |
+
4. **Training arguments** - Hyperparameters used
|
| 102 |
+
5. **Model card** - Auto-generated documentation
|
| 103 |
+
6. **Checkpoints** - If `save_strategy="steps"` enabled
|
| 104 |
+
|
| 105 |
+
## Checkpoint Saving
|
| 106 |
+
|
| 107 |
+
Save intermediate checkpoints during training:
|
| 108 |
+
|
| 109 |
+
```python
|
| 110 |
+
SFTConfig(
|
| 111 |
+
output_dir="my-model",
|
| 112 |
+
push_to_hub=True,
|
| 113 |
+
hub_model_id="username/my-model",
|
| 114 |
+
|
| 115 |
+
# Checkpoint configuration
|
| 116 |
+
save_strategy="steps",
|
| 117 |
+
save_steps=100, # Save every 100 steps
|
| 118 |
+
save_total_limit=3, # Keep only last 3 checkpoints
|
| 119 |
+
)
|
| 120 |
+
```
|
| 121 |
+
|
| 122 |
+
**Benefits:**
|
| 123 |
+
- Resume training if job fails
|
| 124 |
+
- Compare checkpoint performance
|
| 125 |
+
- Use intermediate models
|
| 126 |
+
|
| 127 |
+
**Checkpoints are pushed to:** `username/my-model` (same repo)
|
| 128 |
+
|
| 129 |
+
## Authentication Methods
|
| 130 |
+
|
| 131 |
+
### Method 1: Automatic Token (Recommended)
|
| 132 |
+
|
| 133 |
+
```python
|
| 134 |
+
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
|
| 135 |
+
```
|
| 136 |
+
|
| 137 |
+
Uses your logged-in Hugging Face token automatically.
|
| 138 |
+
|
| 139 |
+
### Method 2: Explicit Token
|
| 140 |
+
|
| 141 |
+
```python
|
| 142 |
+
"secrets": {"HF_TOKEN": "hf_abc123..."}
|
| 143 |
+
```
|
| 144 |
+
|
| 145 |
+
Provide token explicitly (not recommended for security).
|
| 146 |
+
|
| 147 |
+
### Method 3: Environment Variable
|
| 148 |
+
|
| 149 |
+
```python
|
| 150 |
+
"env": {"HF_TOKEN": "hf_abc123..."}
|
| 151 |
+
```
|
| 152 |
+
|
| 153 |
+
Pass as regular environment variable (less secure than secrets).
|
| 154 |
+
|
| 155 |
+
**Always prefer Method 1** for security and convenience.
|
| 156 |
+
|
| 157 |
+
## Verification Checklist
|
| 158 |
+
|
| 159 |
+
Before submitting any training job, verify:
|
| 160 |
+
|
| 161 |
+
- [ ] `push_to_hub=True` in training config
|
| 162 |
+
- [ ] `hub_model_id` is specified (format: `username/model-name`)
|
| 163 |
+
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
|
| 164 |
+
- [ ] Repository name doesn't conflict with existing repos
|
| 165 |
+
- [ ] You have write access to the target namespace
|
| 166 |
+
|
| 167 |
+
## Repository Setup
|
| 168 |
+
|
| 169 |
+
### Automatic Creation
|
| 170 |
+
|
| 171 |
+
If repository doesn't exist, it's created automatically when first pushing.
|
| 172 |
+
|
| 173 |
+
### Manual Creation
|
| 174 |
+
|
| 175 |
+
Create repository before training:
|
| 176 |
+
|
| 177 |
+
```python
|
| 178 |
+
from huggingface_hub import HfApi
|
| 179 |
+
|
| 180 |
+
api = HfApi()
|
| 181 |
+
api.create_repo(
|
| 182 |
+
repo_id="username/model-name",
|
| 183 |
+
repo_type="model",
|
| 184 |
+
private=False, # or True for private repo
|
| 185 |
+
)
|
| 186 |
+
```
|
| 187 |
+
|
| 188 |
+
### Repository Naming
|
| 189 |
+
|
| 190 |
+
**Valid names:**
|
| 191 |
+
- `username/my-model`
|
| 192 |
+
- `username/model-name`
|
| 193 |
+
- `organization/model-name`
|
| 194 |
+
|
| 195 |
+
**Invalid names:**
|
| 196 |
+
- `model-name` (missing username)
|
| 197 |
+
- `username/model name` (spaces not allowed)
|
| 198 |
+
- `username/MODEL` (uppercase discouraged)
|
| 199 |
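A quick client-side sanity check for these rules (illustrative; the Hub applies its own, stricter validation on push):

```python
import re

# Requires "namespace/name" using only letters, digits, '.', '_', '-'.
_REPO_ID = re.compile(r"[A-Za-z0-9._-]+/[A-Za-z0-9._-]+")

def looks_like_repo_id(repo_id: str) -> bool:
    """True if repo_id has the username/model-name shape described above."""
    return _REPO_ID.fullmatch(repo_id) is not None

print(looks_like_repo_id("username/my-model"))    # True
print(looks_like_repo_id("model-name"))           # False: missing username
print(looks_like_repo_id("username/model name"))  # False: space not allowed
```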
+
|
| 200 |
+
## Troubleshooting
|
| 201 |
+
|
| 202 |
+
### Error: 401 Unauthorized
|
| 203 |
+
|
| 204 |
+
**Cause:** HF_TOKEN not provided or invalid
|
| 205 |
+
|
| 206 |
+
**Solutions:**
|
| 207 |
+
1. Verify `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job config
|
| 208 |
+
2. Check you're logged in: `huggingface-cli whoami`
|
| 209 |
+
3. Re-login: `huggingface-cli login`
|
| 210 |
+
|
| 211 |
+
### Error: 403 Forbidden
|
| 212 |
+
|
| 213 |
+
**Cause:** No write access to repository
|
| 214 |
+
|
| 215 |
+
**Solutions:**
|
| 216 |
+
1. Check repository namespace matches your username
|
| 217 |
+
2. Verify you're a member of organization (if using org namespace)
|
| 218 |
+
3. Check repository isn't private (if accessing org repo)
|
| 219 |
+
|
| 220 |
+
### Error: Repository not found
|
| 221 |
+
|
| 222 |
+
**Cause:** Repository doesn't exist and auto-creation failed
|
| 223 |
+
|
| 224 |
+
**Solutions:**
|
| 225 |
+
1. Manually create repository first
|
| 226 |
+
2. Check repository name format
|
| 227 |
+
3. Verify namespace exists
|
| 228 |
+
|
| 229 |
+
### Error: Push failed during training
|
| 230 |
+
|
| 231 |
+
**Cause:** Network issues or Hub unavailable
|
| 232 |
+
|
| 233 |
+
**Solutions:**
|
| 234 |
+
1. Training itself usually completes; only the final upload fails
|
| 235 |
+
2. Checkpoints pushed earlier may already be on the Hub
|
| 236 |
+
3. Re-run the push manually while the job is still running
|
| 237 |
+
|
| 238 |
+
### Issue: Model saved but not visible
|
| 239 |
+
|
| 240 |
+
**Possible causes:**
|
| 241 |
+
1. Repository is private—check https://huggingface.co/username
|
| 242 |
+
2. Wrong namespace—verify `hub_model_id` matches login
|
| 243 |
+
3. Push still in progress—wait a few minutes
|
| 244 |
+
|
| 245 |
+
## Manual Push After Training
|
| 246 |
+
|
| 247 |
+
If training completes but push fails, push manually:
|
| 248 |
+
|
| 249 |
+
```python
|
| 250 |
+
from transformers import AutoModel, AutoTokenizer
|
| 251 |
+
|
| 252 |
+
# Load from local checkpoint
|
| 253 |
+
model = AutoModel.from_pretrained("./output_dir")
|
| 254 |
+
tokenizer = AutoTokenizer.from_pretrained("./output_dir")
|
| 255 |
+
|
| 256 |
+
# Push to Hub
|
| 257 |
+
model.push_to_hub("username/model-name", token="hf_abc123...")
|
| 258 |
+
tokenizer.push_to_hub("username/model-name", token="hf_abc123...")
|
| 259 |
+
```
|
| 260 |
+
|
| 261 |
+
**Note:** Only possible while the job is still running (local files are deleted once it completes).
|
| 262 |
+
|
| 263 |
+
## Best Practices
|
| 264 |
+
|
| 265 |
+
1. **Always enable `push_to_hub=True`**
|
| 266 |
+
2. **Use checkpoint saving** for long training runs
|
| 267 |
+
3. **Verify Hub push** in logs before job completes
|
| 268 |
+
4. **Set appropriate `save_total_limit`** to avoid excessive checkpoints
|
| 269 |
+
5. **Use descriptive repo names** (e.g., `qwen-capybara-sft` not `model1`)
|
| 270 |
+
6. **Add model card** with training details
|
| 271 |
+
7. **Tag models** with relevant tags (e.g., `text-generation`, `fine-tuned`)
|
| 272 |
+
|
| 273 |
+
## Monitoring Push Progress
|
| 274 |
+
|
| 275 |
+
Check logs for push progress:
|
| 276 |
+
|
| 277 |
+
```python
|
| 278 |
+
hf_jobs("logs", {"job_id": "your-job-id"})
|
| 279 |
+
```
|
| 280 |
+
|
| 281 |
+
**Look for:**
|
| 282 |
+
```
|
| 283 |
+
Pushing model to username/model-name...
|
| 284 |
+
Upload file pytorch_model.bin: 100%
|
| 285 |
+
✅ Model pushed successfully
|
| 286 |
+
```
|
| 287 |
+
|
| 288 |
+
## Example: Full Production Setup

```python
# production_train.py
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
# ///

import os

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"

# Load dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Configure with comprehensive Hub settings
config = SFTConfig(
    output_dir="qwen-capybara-sft",

    # Hub configuration
    push_to_hub=True,
    hub_model_id="myusername/qwen-capybara-sft",
    hub_strategy="checkpoint",  # Push checkpoints as they are saved

    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,

    # Training settings
    num_train_epochs=3,
    per_device_train_batch_size=4,

    # Logging
    logging_steps=10,
    logging_first_step=True,
)

# Train with LoRA
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub()

print("✅ Training complete!")
print("Model available at: https://huggingface.co/myusername/qwen-capybara-sft")
```

**Submit:**

```python
hf_jobs("uv", {
    "script": "production_train.py",
    "flavor": "a10g-large",
    "timeout": "6h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
## Key Takeaway

**Without `push_to_hub=True` and `secrets={"HF_TOKEN": "$HF_TOKEN"}`, all training results are permanently lost.**

Always verify both are configured before submitting any training job.
skills/hugging-face-model-trainer/references/reliability_principles.md ADDED
@@ -0,0 +1,371 @@
# Reliability Principles for Training Jobs

These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution.

## Principle 1: Always Verify Before Use

**Rule:** Never assume repos, datasets, or resources exist. Verify with tools first.

### What It Prevents

- **Non-existent datasets** - Jobs fail immediately when the dataset doesn't exist
- **Typos in names** - Simple mistakes like "argilla-dpo-mix-7k" vs "ultrafeedback_binarized"
- **Incorrect paths** - Old or moved repos, renamed files
- **Missing dependencies** - Undocumented requirements
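Typos of this kind can often be caught locally before any job is submitted. A stdlib-only sketch; the hard-coded candidate list stands in for what a real Hub search would return:

```python
import difflib

def suggest_dataset_name(requested: str, known_names: list[str]) -> list[str]:
    """Suggest close matches for a possibly misspelled dataset name."""
    return difflib.get_close_matches(requested, known_names, n=3, cutoff=0.4)

# Candidates would normally come from a Hub search; illustrative values here.
known = ["trl-lib/ultrafeedback_binarized", "trl-lib/Capybara", "argilla/dpo-mix-7k"]
suggestions = suggest_dataset_name("trl-lib/argilla-dpo-mix-7k", known)
print(suggestions)
```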
### How to Apply

**Before submitting ANY job:**

```python
# Verify dataset exists
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
hub_repo_details(["author/dataset-name"], repo_type="dataset")

# Verify model exists
hub_repo_details(["org/model-name"], repo_type="model")

# Check script/file paths (for URL-based scripts)
# Verify before using: https://github.com/user/repo/blob/main/script.py
```

**Examples that would have caught errors:**

```python
# ❌ WRONG: Assumed dataset exists
hf_jobs("uv", {
    "script": """...""",
    "env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"}  # Doesn't exist!
})

# ✅ CORRECT: Verify first
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name

hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
# Confirms it exists before using
```

### Implementation Checklist

- [ ] Check the dataset exists before training
- [ ] Verify the base model exists before fine-tuning
- [ ] Confirm the adapter model exists before GGUF conversion
- [ ] Test that script URLs are valid before submitting
- [ ] Validate file paths in repositories
- [ ] Check for recent updates/renames of resources

**Time cost:** 5-10 seconds
**Time saved:** Hours of failed job time + debugging

---
## Principle 2: Prioritize Reliability Over Performance

**Rule:** Default to what is most likely to succeed, not what is theoretically fastest.

### What It Prevents

- **Hardware incompatibilities** - Features that fail on certain GPUs
- **Unstable optimizations** - Speed-ups that cause crashes
- **Complex configurations** - More failure points
- **Build system issues** - Unreliable compilation methods

### How to Apply

**Choose reliability:**

```python
# ❌ RISKY: Aggressive optimization that may fail
SFTConfig(
    torch_compile=True,      # Can fail on T4, A10G GPUs
    optim="adamw_bnb_8bit",  # Requires specific setup
    fp16=False,              # May cause training instability
    ...
)

# ✅ SAFE: Proven defaults
SFTConfig(
    # torch_compile=True,    # Enable on H100 for ~20% speedup
    optim="adamw_torch",     # Standard, always works
    fp16=True,               # Stable and fast
    ...
)
```

**For build processes:**

```python
# ❌ UNRELIABLE: Uses make (platform-dependent)
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"], check=True)

# ✅ RELIABLE: Uses CMake (consistent, documented)
subprocess.run([
    "cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
    "-DGGML_CUDA=OFF"  # Disable CUDA for a faster, more reliable build
], check=True)

subprocess.run([
    "cmake", "--build", "/tmp/llama.cpp/build",
    "--target", "llama-quantize", "-j", "4"
], check=True)
```

### Real-World Example

**The `torch.compile` failure:**
- Added for a "20% speedup" on H100
- **Failed fatally on T4-medium** with a cryptic error
- Misdiagnosed as a dataset issue (cost hours)
- **Fix:** Disable by default, add as an optional comment

**Result:** Reliability > 20% performance gain

### Implementation Checklist

- [ ] Use proven, standard configurations by default
- [ ] Comment out performance optimizations with hardware notes
- [ ] Use stable build systems (CMake > make)
- [ ] Test on target hardware before production
- [ ] Document known incompatibilities
- [ ] Provide "safe" and "fast" variants when needed

**Performance loss:** 10-20% in the best case
**Reliability gain:** 95%+ success rate vs 60-70%

---
## Principle 3: Create Atomic, Self-Contained Scripts

**Rule:** Scripts should work as complete, independent units. Don't remove parts to "simplify."

### What It Prevents

- **Missing dependencies** - Removed "unnecessary" packages that are actually required
- **Incomplete processes** - Skipped steps that seem redundant
- **Environment assumptions** - Scripts that need pre-setup
- **Partial failures** - Some parts work, others fail silently

### How to Apply

**Complete dependency specifications:**

```python
# ❌ INCOMPLETE: "Simplified" by removing dependencies
# /// script
# dependencies = [
#     "transformers",
#     "peft",
#     "torch",
# ]
# ///

# ✅ COMPLETE: All dependencies explicit
# /// script
# dependencies = [
#     "transformers>=4.36.0",
#     "peft>=0.7.0",
#     "torch>=2.0.0",
#     "accelerate>=0.24.0",
#     "huggingface_hub>=0.20.0",
#     "sentencepiece>=0.1.99",  # Required for tokenizers
#     "protobuf>=3.20.0",       # Required for tokenizers
#     "numpy",
#     "gguf",
# ]
# ///
```

**Complete build processes:**

```python
# ❌ INCOMPLETE: Assumes build tools exist
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"])  # FAILS: no gcc/make

# ✅ COMPLETE: Installs all requirements
subprocess.run(["apt-get", "update", "-qq"], check=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True)
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
# ... then build
```

### Real-World Example

**The `sentencepiece` failure:**
- The original script had it: worked fine
- A "simplified" version removed it: "doesn't look necessary"
- **GGUF conversion failed silently** - the tokenizer couldn't convert
- Hard to debug: no obvious error message
- **Fix:** Restore all original dependencies

**Result:** Don't remove dependencies without thorough testing

### Implementation Checklist

- [ ] All dependencies in the PEP 723 header with version pins
- [ ] All system packages installed by the script
- [ ] No assumptions about a pre-existing environment
- [ ] No "optional" steps that are actually required
- [ ] Test scripts in a clean environment
- [ ] Document why each dependency is needed

**Complexity:** Slightly longer scripts
**Reliability:** Scripts "just work" every time

---
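The first checklist item can be enforced mechanically. A stdlib-only sketch that checks whether a script's text carries a PEP 723 metadata block declaring dependencies; the regex is a deliberate simplification, not a full PEP 723 parser:

```python
import re

def has_pep723_dependencies(script_text: str) -> bool:
    """Check for a `# /// script` block that declares a dependencies list.

    Simplified check, not a full PEP 723 parser.
    """
    block = re.search(
        r"^# /// script\n(?P<body>(?:^#.*\n)*?)^# ///$",
        script_text,
        re.MULTILINE,
    )
    return bool(block) and "dependencies" in block.group("body")

script = """\
# /// script
# dependencies = ["trl>=0.12.0"]
# ///
print("hello")
"""
print(has_pep723_dependencies(script))             # → True
print(has_pep723_dependencies('x = 1\n'))          # → False
```

A check like this could run in CI or as a pre-submit hook before any `hf_jobs` call.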
## Principle 4: Provide Clear Error Context

**Rule:** When things fail, make it obvious what went wrong and how to fix it.

### How to Apply

**Wrap subprocess calls:**

```python
# ❌ UNCLEAR: Silent failure
subprocess.run([...], check=True, capture_output=True)

# ✅ CLEAR: Shows what failed
try:
    result = subprocess.run(
        [...],
        check=True,
        capture_output=True,
        text=True
    )
    print(result.stdout)
    if result.stderr:
        print("Warnings:", result.stderr)
except subprocess.CalledProcessError as e:
    print("❌ Command failed!")
    print("STDOUT:", e.stdout)
    print("STDERR:", e.stderr)
    raise
```

**Validate inputs:**

```python
# ❌ UNCLEAR: Fails later with a cryptic error
model = load_model(MODEL_NAME)

# ✅ CLEAR: Fails fast with a clear message
if not MODEL_NAME:
    raise ValueError("MODEL_NAME environment variable not set!")

print(f"Loading model: {MODEL_NAME}")
try:
    model = load_model(MODEL_NAME)
    print("✅ Model loaded successfully")
except Exception as e:
    print(f"❌ Failed to load model: {MODEL_NAME}")
    print(f"Error: {e}")
    print("Hint: Check that the model exists on the Hub")
    raise
```

### Implementation Checklist

- [ ] Wrap external calls with try/except
- [ ] Print stdout/stderr on failure
- [ ] Validate environment variables early
- [ ] Add progress indicators (✅, ❌, 🔄)
- [ ] Include hints for common failures
- [ ] Log configuration at start

---
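The wrapper pattern above can be exercised end to end with a harmless command; `python -c` stands in for the real build or conversion step:

```python
import subprocess
import sys

def run_with_context(cmd: list[str]) -> str:
    """Run a command, surfacing stdout/stderr clearly on failure."""
    try:
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print("❌ Command failed:", cmd)
        print("STDOUT:", e.stdout)
        print("STDERR:", e.stderr)
        raise
    return result.stdout

out = run_with_context([sys.executable, "-c", "print('ok')"])
print(out.strip())  # → ok
```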
## Principle 5: Test the Happy Path on Known-Good Inputs

**Rule:** Before using new code in production, test with inputs you know work.

### How to Apply

**Known-good test inputs:**

```python
# For training
TEST_DATASET = "trl-lib/Capybara"  # Small, well-formatted, widely used
TEST_MODEL = "Qwen/Qwen2.5-0.5B"   # Small, fast, reliable

# For GGUF conversion
TEST_ADAPTER = "evalstate/qwen-capybara-medium"  # Known working model
TEST_BASE = "Qwen/Qwen2.5-0.5B"                  # Compatible base
```

**Testing workflow:**

1. Test with known-good inputs first
2. If that works, try production inputs
3. If production fails, you know it's the inputs (not the code)
4. Isolate the difference

### Implementation Checklist

- [ ] Maintain a list of known-good test models/datasets
- [ ] Test new scripts with test inputs first
- [ ] Document what makes inputs "good"
- [ ] Keep test jobs cheap (small models, short timeouts)
- [ ] Only move to production after the test succeeds

**Time cost:** 5-10 minutes for a test run
**Debugging time saved:** Hours

---
## Summary: The Reliability Checklist

Before submitting ANY job:

### Pre-Flight Checks
- [ ] **Verified** all repos/datasets exist (hub_repo_details)
- [ ] **Tested** with known-good inputs if new code
- [ ] **Using** proven hardware/configuration
- [ ] **Included** all dependencies in the PEP 723 header
- [ ] **Installed** system requirements (build tools, etc.)
- [ ] **Set** an appropriate timeout (not the default 30m)
- [ ] **Configured** Hub push with HF_TOKEN
- [ ] **Added** clear error handling

### Script Quality
- [ ] Self-contained (no external setup needed)
- [ ] Complete dependencies listed
- [ ] Build tools installed by the script
- [ ] Progress indicators included
- [ ] Error messages are clear
- [ ] Configuration logged at start

### Job Configuration
- [ ] Timeout > expected runtime + 30% buffer
- [ ] Hardware appropriate for model size
- [ ] Secrets include HF_TOKEN
- [ ] Environment variables set correctly
- [ ] Cost estimated and acceptable

**Following these principles transforms the job success rate from ~60-70% to ~95%+.**

---
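The timeout rule in the job-configuration checklist is easy to compute up front. A small sketch; the 30% buffer matches the checklist, while rounding up to 15-minute steps is an assumption added for convenience:

```python
import math

def job_timeout_minutes(expected_runtime_min: int, buffer: float = 0.30) -> int:
    """Expected runtime plus a safety buffer, rounded up to 15-minute steps."""
    padded = expected_runtime_min * (1 + buffer)
    return math.ceil(padded / 15) * 15

print(job_timeout_minutes(120))  # → 165
print(job_timeout_minutes(30))   # → 45
```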
## When Principles Conflict

Sometimes reliability and performance conflict. Here's how to choose:

| Scenario | Choose | Rationale |
|----------|--------|-----------|
| Demo/test | Reliability | A fast failure is worse than a slow success |
| Production (first run) | Reliability | Prove it works before optimizing |
| Production (proven) | Performance | Safe to optimize after validation |
| Time-critical | Reliability | Failures cause more delay than slow runs |
| Cost-critical | Balanced | Test with a small model, then optimize |

**General rule:** Reliability first, optimize second.

---
## Further Reading

- `troubleshooting.md` - Common issues and fixes
- `training_patterns.md` - Proven training configurations
- `gguf_conversion.md` - Production GGUF workflow
skills/hugging-face-model-trainer/references/trackio_guide.md ADDED
@@ -0,0 +1,189 @@
# Trackio Integration for TRL Training

**Trackio** is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure.

⚠️ **IMPORTANT**: For Jobs training (remote cloud GPUs):
- Training happens on ephemeral cloud runners (not your local machine)
- Trackio syncs metrics to a Hugging Face Space for real-time monitoring
- Without a Space, metrics are lost when the job completes
- The Space dashboard persists your training metrics permanently
## Setting Up Trackio for Jobs

**Step 1: Add the trackio dependency**

```python
# /// script
# dependencies = [
#     "trl>=0.12.0",
#     "trackio",  # Required!
# ]
# ///
```

**Step 2: Create a Trackio Space (one-time setup)**

**Option A: Let Trackio auto-create (recommended)**
Pass a `space_id` to `trackio.init()` and Trackio will automatically create the Space if it doesn't exist.

**Option B: Create manually**
- Create the Space via the Hub UI at https://huggingface.co/new-space
- Select the Gradio SDK
- OR use the command: `huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio`

**Step 3: Initialize Trackio with a space_id**

```python
import trackio

trackio.init(
    project="my-training",
    space_id="username/trackio",  # CRITICAL for Jobs! Replace 'username' with your HF username
    config={
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
    }
)
```

**Step 4: Configure TRL to use Trackio**

```python
SFTConfig(
    report_to="trackio",
    # ... other config
)
```

**Step 5: Finish tracking**

```python
trainer.train()
trackio.finish()  # Ensures final metrics are synced
```
## What Trackio Tracks

Trackio automatically logs:
- ✅ Training loss
- ✅ Learning rate
- ✅ GPU utilization
- ✅ Memory usage
- ✅ Training throughput
- ✅ Custom metrics
## How It Works with Jobs

1. **Training runs** → Metrics are logged to a local SQLite DB
2. **Every 5 minutes** → Trackio syncs the DB to an HF Dataset (Parquet)
3. **Space dashboard** → Reads from the Dataset, displays metrics in real time
4. **Job completes** → A final sync ensures all metrics are persisted
## Default Configuration Pattern

**Use sensible defaults for the trackio configuration unless the user requests otherwise.**

### Recommended Defaults

```python
import trackio

trackio.init(
    project="qwen-capybara-sft",
    name="baseline-run",          # Descriptive name the user will recognize
    space_id="username/trackio",  # Default space: {username}/trackio
    config={
        # Keep config minimal: hyperparameters and model/dataset info only
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
        "num_epochs": 3,
    }
)
```

**Key principles:**
- **Space ID**: Use `{username}/trackio`, with "trackio" as the default space name
- **Run naming**: Unless otherwise specified, name the run in a way the user will recognize
- **Config**: Keep it minimal; don't automatically capture job metadata unless requested
- **Grouping**: Optional; only use it if the user wants related experiments organized together
## Grouping Runs (Optional)

The `group` parameter helps organize related runs together in the dashboard sidebar. This is useful when the user is running multiple experiments with different configurations but wants to compare them together:

```python
# Example: Group runs by experiment type
trackio.init(project="my-project", run_name="baseline-run-1", group="baseline")
trackio.init(project="my-project", run_name="augmented-run-1", group="augmented")
trackio.init(project="my-project", run_name="tuned-run-1", group="tuned")
```

Runs with the same group name can be grouped together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter:

```python
# Hyperparameter sweep: group by learning rate
trackio.init(project="hyperparam-sweep", run_name="lr-0.001-run", group="lr_0.001")
trackio.init(project="hyperparam-sweep", run_name="lr-0.01-run", group="lr_0.01")
```
## Environment Variables for Jobs

You can configure trackio using environment variables instead of passing parameters to `trackio.init()`. This is useful for managing configuration across multiple jobs.

**`HF_TOKEN`**
Required for creating Spaces and writing to datasets (passed via `secrets`):

```python
hf_jobs("uv", {
    "script": "...",
    "secrets": {
        "HF_TOKEN": "$HF_TOKEN"  # Enables Space creation and Hub push
    }
})
```

### Example with Environment Variables

```python
hf_jobs("uv", {
    "script": """
# Training script - trackio config from environment
from datetime import datetime

import trackio

# Auto-generate a run name
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
run_name = f"sft_qwen25_{timestamp}"

# Project and space_id can come from environment variables
trackio.init(run_name=run_name, group="SFT")

# ... training code ...
trackio.finish()
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**When to use environment variables:**
- Managing multiple jobs with the same configuration
- Keeping training scripts portable across projects
- Separating configuration from code

**When to use direct parameters:**
- A single job with a specific configuration
- When clarity in the code is preferred
- When each job has a different project/space
## Viewing the Dashboard

After starting training:
1. Navigate to the Space: `https://huggingface.co/spaces/username/trackio`
2. The Gradio dashboard shows all tracked experiments
3. Filter by project, compare runs, view charts with smoothing
## Recommendation

- **Trackio**: Best for real-time monitoring during long training runs
- **Weights & Biases**: Best for team collaboration; requires an account
skills/hugging-face-model-trainer/references/training_methods.md ADDED
@@ -0,0 +1,150 @@
# TRL Training Methods Overview

TRL (Transformer Reinforcement Learning) provides multiple training methods for fine-tuning and aligning language models. This reference gives a brief overview of each method.
## Supervised Fine-Tuning (SFT)

**What it is:** Standard instruction tuning with supervised learning on demonstration data.

**When to use:**
- Initial fine-tuning of base models on task-specific data
- Teaching new capabilities or domains
- The most common starting point for fine-tuning

**Dataset format:** Conversational format with a "messages" field, OR a text field, OR prompt/completion pairs

**Example:**
```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="my-model",
        push_to_hub=True,
        hub_model_id="username/my-model",
        eval_strategy="no",  # Disable eval for this simple example
        # max_length=1024 is the default - only set it if you need a different length
    )
)
trainer.train()
```

**Note:** For production training with evaluation monitoring, see `scripts/train_sft_example.py`

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer")`
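The conversational "messages" format above can be illustrated with a single hand-built record; the role/content field names follow the common chat convention, and the content itself is made up:

```python
# One SFT training example in conversational format
example = {
    "messages": [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

roles = [m["role"] for m in example["messages"]]
print(roles)  # → ['user', 'assistant']
```

A dataset in this format is simply a collection of such records; SFTTrainer applies the model's chat template to each one.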
## Direct Preference Optimization (DPO)

**What it is:** Alignment method that trains directly on preference pairs (chosen vs rejected responses) without requiring a reward model.

**When to use:**
- Aligning models to human preferences
- Improving response quality after SFT
- You have paired preference data (chosen/rejected responses)

**Dataset format:** Preference pairs with "chosen" and "rejected" fields

**Example:**
```python
from datasets import load_dataset
from trl import DPOTrainer, DPOConfig

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use instruct model
    train_dataset=dataset,
    args=DPOConfig(
        output_dir="dpo-model",
        beta=0.1,  # KL penalty coefficient
        eval_strategy="no",  # Disable eval for simple example
        # max_length=1024 is the default - only set if you need a different length
    ),
)
trainer.train()
```

**Note:** For production training with evaluation monitoring, see `scripts/train_dpo_example.py`

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`
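A minimal preference-pair record (field names as required above; content illustrative) looks like:

```python
# One row of a DPO dataset in the conversational preference format.
preference_example = {
    "chosen": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is blue."},
    ],
    "rejected": [
        {"role": "user", "content": "What color is the sky?"},
        {"role": "assistant", "content": "It is green."},
    ],
}
```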
## Group Relative Policy Optimization (GRPO)

**What it is:** Online RL method that optimizes relative to group performance, useful for tasks with verifiable rewards.

**When to use:**
- Tasks with automatic reward signals (code execution, math verification)
- Online learning scenarios
- When offline DPO data is insufficient

**Dataset format:** Prompt-only format (model generates responses, reward computed online)

**Example:**
```python
# Use the TRL-maintained script
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/math_shepherd",
        "--output_dir", "grpo-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`
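For tasks with verifiable rewards, the reward signal is usually just a Python function returning one score per generated completion (the pattern TRL's `GRPOTrainer` accepts via `reward_funcs`). A minimal sketch, with an illustrative exact-match check and a hypothetical `answer` dataset column:

```python
def exact_answer_reward(completions, answer, **kwargs):
    """Return 1.0 when the gold answer appears in the completion, else 0.0.

    One score per completion; `answer` is assumed to be a column of the
    prompt-only dataset (illustrative field name).
    """
    rewards = []
    for completion, gold in zip(completions, answer):
        # Conversational completions arrive as [{"role": ..., "content": ...}]
        text = completion[0]["content"] if isinstance(completion, list) else completion
        rewards.append(1.0 if str(gold).strip() in text else 0.0)
    return rewards

print(exact_answer_reward(["The answer is 42."], answer=["42"]))  # [1.0]
```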
## Reward Modeling

**What it is:** Train a reward model to score responses, used as a component in RLHF pipelines.

**When to use:**
- Building an RLHF pipeline
- Need automatic quality scoring
- Creating reward signals for PPO training

**Dataset format:** Preference pairs with "chosen" and "rejected" responses

**Documentation:** `hf_doc_fetch("https://huggingface.co/docs/trl/reward_trainer")`
## Method Selection Guide

| Method | Complexity | Data Required | Use Case |
|--------|-----------|---------------|----------|
| **SFT** | Low | Demonstrations | Initial fine-tuning |
| **DPO** | Medium | Paired preferences | Post-SFT alignment |
| **GRPO** | Medium | Prompts + reward fn | Online RL with automatic rewards |
| **Reward** | Medium | Paired preferences | Building RLHF pipeline |

## Recommended Pipeline

**For most use cases:**
1. **Start with SFT** - Fine-tune base model on task data
2. **Follow with DPO** - Align to preferences using paired data
3. **Optional: GGUF conversion** - Deploy for local inference

**For advanced RL scenarios:**
1. **Start with SFT** - Fine-tune base model
2. **Train reward model** - On preference data

## Dataset Format Reference

For complete dataset format specifications, use:
```python
hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
```

Or validate your dataset:
```bash
uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
  --dataset your/dataset --split train
```

## See Also

- `references/training_patterns.md` - Common training patterns and examples
- `scripts/train_sft_example.py` - Complete SFT template
- `scripts/train_dpo_example.py` - Complete DPO template
- [Dataset Inspector](https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py) - Dataset format validation tool
skills/hugging-face-model-trainer/references/training_patterns.md
ADDED
@@ -0,0 +1,203 @@
# Common Training Patterns

This guide provides common training patterns and use cases for TRL on Hugging Face Jobs.

## Multi-GPU Training

Automatic distributed training across multiple GPUs. TRL/Accelerate handles distribution automatically:

```python
hf_jobs("uv", {
    "script": """
# Your training script here (same as single GPU)
# No changes needed - Accelerate detects multiple GPUs
""",
    "flavor": "a10g-largex2",  # 2x A10G GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**Tips for multi-GPU:**
- No code changes needed
- Use `per_device_train_batch_size` (per GPU, not total)
- Effective batch size = `per_device_train_batch_size` × `num_gpus` × `gradient_accumulation_steps`
- Monitor GPU utilization to ensure both GPUs are being used
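For example, the effective batch size for a 2×A10G job (values illustrative):

```python
per_device_train_batch_size = 4
num_gpus = 2
gradient_accumulation_steps = 8

# Effective batch size = per-device batch × GPUs × accumulation steps
effective_batch_size = per_device_train_batch_size * num_gpus * gradient_accumulation_steps
print(effective_batch_size)  # 64
```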
## DPO Training (Preference Learning)

Train with preference data for alignment:

```python
hf_jobs("uv", {
    "script": """
# /// script
# dependencies = ["trl>=0.12.0", "trackio"]
# ///

from datasets import load_dataset
from trl import DPOTrainer, DPOConfig
import trackio

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Create train/eval split
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

config = DPOConfig(
    output_dir="dpo-model",
    push_to_hub=True,
    hub_model_id="username/dpo-model",
    num_train_epochs=1,
    beta=0.1,  # KL penalty coefficient
    eval_strategy="steps",
    eval_steps=50,
    report_to="trackio",
    run_name="baseline_run",  # use a meaningful run name
    # max_length=1024,  # Default - only set if you need a different sequence length
)

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use instruct model as base
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # IMPORTANT: Provide eval_dataset when eval_strategy is enabled
    args=config,
)

trainer.train()
trainer.push_to_hub()
trackio.finish()
""",
    "flavor": "a10g-large",
    "timeout": "3h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**For DPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`
## GRPO Training (Online RL)

Group Relative Policy Optimization for online reinforcement learning:

```python
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/math_shepherd",
        "--output_dir", "grpo-model",
        "--push_to_hub",
        "--hub_model_id", "username/grpo-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

**For GRPO documentation:** Use `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`
## Trackio Configuration

**Use sensible defaults for trackio setup.** See `references/trackio_guide.md` for complete documentation, including grouping runs for experiments.

### Basic Pattern

```python
import trackio

trackio.init(
    project="my-training",
    run_name="baseline-run",  # Descriptive name the user will recognize
    space_id="username/trackio",  # Default space: {username}/trackio
    config={
        # Keep config minimal - hyperparameters and model/dataset info only
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
    }
)

# Your training code...

trackio.finish()
```

### Grouping for Experiments (Optional)

When the user wants to compare related runs, use the `group` parameter:

```python
# Hyperparameter sweep
trackio.init(project="hyperparam-sweep", run_name="lr-0.001", group="lr_0.001")
trackio.init(project="hyperparam-sweep", run_name="lr-0.01", group="lr_0.01")
```
## Pattern Selection Guide

| Use Case | Pattern | Hardware | Time |
|----------|---------|----------|------|
| SFT training | `scripts/train_sft_example.py` | a10g-large | 2-6 hours |
| Large dataset (>10K) | Multi-GPU | a10g-largex2 | 4-12 hours |
| Preference learning | DPO Training | a10g-large | 2-4 hours |
| Online RL | GRPO Training | a10g-large | 3-6 hours |
## Critical: Evaluation Dataset Requirements

**⚠️ IMPORTANT**: If you set `eval_strategy="steps"` or `eval_strategy="epoch"`, you **MUST** provide an `eval_dataset` to the trainer, or training will hang.

### ✅ CORRECT - With eval dataset:
```python
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
    args=SFTConfig(eval_strategy="steps", ...),
)
```

### ❌ WRONG - Will hang:
```python
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # NO eval_dataset but eval_strategy="steps" ← WILL HANG
    args=SFTConfig(eval_strategy="steps", ...),
)
```

### Option: Disable evaluation if no eval dataset
```python
config = SFTConfig(
    eval_strategy="no",  # ← Explicitly disable evaluation
    # ... other config
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # No eval_dataset needed
    args=config,
)
```
## Best Practices

1. **Use train/eval splits** - Create an evaluation split for monitoring progress
2. **Enable Trackio** - Monitor progress in real time
3. **Add a 20-30% buffer to timeouts** - Account for loading/saving overhead
4. **Test with official TRL scripts first** - Use maintained examples before custom code
5. **Always provide eval_dataset when eval_strategy is enabled** - Otherwise set `eval_strategy="no"`
6. **Use multi-GPU for large models** - 7B+ models benefit significantly
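The 20-30% timeout buffer rule above, in code (numbers illustrative):

```python
estimated_training_hours = 3.0  # e.g. from a quick demo run
buffer_fraction = 0.25          # 20-30% buffer for loading/saving overhead

timeout_hours = estimated_training_hours * (1 + buffer_fraction)
print(f'"timeout": "{timeout_hours:.2f}h"')  # "timeout": "3.75h"
```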
## See Also

- `scripts/train_sft_example.py` - Complete SFT template with Trackio and eval split
- `scripts/train_dpo_example.py` - Complete DPO template
- `scripts/train_grpo_example.py` - Complete GRPO template
- `references/hardware_guide.md` - Detailed hardware specifications
- `references/training_methods.md` - Overview of all TRL training methods
- `references/troubleshooting.md` - Common issues and solutions
skills/hugging-face-model-trainer/references/troubleshooting.md
ADDED
@@ -0,0 +1,282 @@
# Troubleshooting TRL Training Jobs

Common issues and solutions when training with TRL on Hugging Face Jobs.

## Training Hangs at "Starting training..." Step

**Problem:** Job starts but hangs at the training step - never progresses, never times out, just sits there.

**Root Cause:** Using `eval_strategy="steps"` or `eval_strategy="epoch"` without providing an `eval_dataset` to the trainer.

**Solution:**

**Option A: Provide eval_dataset (recommended)**
```python
# Create train/eval split
dataset_split = dataset.train_test_split(test_size=0.1, seed=42)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset_split["train"],
    eval_dataset=dataset_split["test"],  # ← MUST provide when eval_strategy is enabled
    args=SFTConfig(
        eval_strategy="steps",
        eval_steps=50,
        ...
    ),
)
```

**Option B: Disable evaluation**
```python
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    # No eval_dataset
    args=SFTConfig(
        eval_strategy="no",  # ← Explicitly disable
        ...
    ),
)
```

**Prevention:**
- Always create a train/eval split for better monitoring
- Use `dataset.train_test_split(test_size=0.1, seed=42)`
- Check example scripts: `scripts/train_sft_example.py` includes proper eval setup
## Job Times Out

**Problem:** Job terminates before training completes; all progress is lost.

**Solutions:**
- Increase the timeout parameter (e.g., `"timeout": "4h"`)
- Reduce `num_train_epochs` or use a smaller dataset slice
- Use a smaller model or enable LoRA/PEFT to speed up training
- Add a 20-30% buffer to the estimated time for loading/saving overhead

**Prevention:**
- Always start with a quick demo run to estimate timing
- Use `scripts/estimate_cost.py` to get time estimates
- Monitor first runs closely via Trackio or logs

## Model Not Saved to Hub

**Problem:** Training completes but the model doesn't appear on the Hub - all work lost.

**Check:**
- [ ] `push_to_hub=True` in training config
- [ ] `hub_model_id` specified with username (e.g., `"username/model-name"`)
- [ ] `secrets={"HF_TOKEN": "$HF_TOKEN"}` in job submission
- [ ] User has write access to the target repo
- [ ] Token has write permissions (check at https://huggingface.co/settings/tokens)
- [ ] Training script calls `trainer.push_to_hub()` at the end

**See:** `references/hub_saving.md` for detailed Hub authentication troubleshooting
## Out of Memory (OOM)

**Problem:** Job fails with a CUDA out-of-memory error.

**Solutions (in order of preference):**
1. **Reduce batch size:** Lower `per_device_train_batch_size` (try 4 → 2 → 1)
2. **Increase gradient accumulation:** Raise `gradient_accumulation_steps` to maintain effective batch size
3. **Disable evaluation:** Remove `eval_dataset` and `eval_strategy` (saves ~40% memory, good for demos)
4. **Enable LoRA/PEFT:** Use `peft_config=LoraConfig(r=8, lora_alpha=16)` to train adapters only (smaller rank = less memory)
5. **Use a larger GPU:** Switch from `t4-small` → `l4x1` → `a10g-large` → `a100-large`
6. **Enable gradient checkpointing:** Set `gradient_checkpointing=True` in config (slower but saves memory)
7. **Use a smaller model:** Try a smaller variant (e.g., 0.5B instead of 3B)

**Memory guidelines:**
- T4 (16GB): <1B models with LoRA
- A10G (24GB): 1-3B models with LoRA, <1B full fine-tune
- A100 (40GB/80GB): 7B+ models with LoRA, 3B full fine-tune
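Several of the mitigations above can be combined in one config. A sketch (not a tuned recipe; values are illustrative and `dataset` is assumed to be loaded already):

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="my-model",
    per_device_train_batch_size=1,   # smallest per-device batch
    gradient_accumulation_steps=16,  # keeps the effective batch size at 16
    gradient_checkpointing=True,     # trade compute for memory
    eval_strategy="no",              # skip eval to save memory
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,                        # your SFT-format dataset
    peft_config=LoraConfig(r=8, lora_alpha=16),   # train LoRA adapters only
    args=config,
)
```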
## Parameter Naming Issues

**Problem:** `TypeError: SFTConfig.__init__() got an unexpected keyword argument 'max_seq_length'`

**Cause:** TRL config classes use `max_length`, not `max_seq_length`.

**Solution:**
```python
# ✅ CORRECT - TRL uses max_length
SFTConfig(max_length=512)
DPOConfig(max_length=512)

# ❌ WRONG - This will fail
SFTConfig(max_seq_length=512)
```

**Note:** Most TRL configs don't require an explicit `max_length` - the default (1024) works well. Only set it if you need a specific value.
## Dataset Format Error

**Problem:** Training fails with dataset format errors or missing fields.

**Solutions:**
1. **Check format documentation:**
   ```python
   hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")
   ```

2. **Validate dataset before training:**
   ```bash
   uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
     --dataset <dataset-name> --split train
   ```
   Or via hf_jobs:
   ```python
   hf_jobs("uv", {
       "script": "https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py",
       "script_args": ["--dataset", "dataset-name", "--split", "train"]
   })
   ```

3. **Verify field names:**
   - **SFT:** Needs a "messages" field (conversational), OR a "text" field, OR "prompt"/"completion"
   - **DPO:** Needs "chosen" and "rejected" fields
   - **GRPO:** Needs prompt-only format

4. **Check dataset split:**
   - Ensure the split exists (e.g., `split="train"`)
   - Preview the dataset: `load_dataset("name", split="train[:5]")`
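A quick heuristic for checking which format a dataset row matches (a sketch; the key checks follow the field requirements listed above):

```python
def detect_trl_format(example: dict) -> str:
    """Classify a single dataset row by the TRL format its fields match."""
    keys = set(example)
    if {"chosen", "rejected"} <= keys:
        return "preference (DPO / reward modeling)"
    if "messages" in keys:
        return "conversational (SFT)"
    if {"prompt", "completion"} <= keys:
        return "prompt-completion (SFT)"
    if "text" in keys:
        return "text (SFT)"
    if "prompt" in keys:
        return "prompt-only (GRPO)"
    return "unknown"

print(detect_trl_format({"chosen": "...", "rejected": "..."}))  # preference (DPO / reward modeling)
```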
## Import/Module Errors

**Problem:** Job fails with "ModuleNotFoundError" or other import errors.

**Solutions:**
1. **Add a PEP 723 header with dependencies:**
   ```python
   # /// script
   # dependencies = [
   #     "trl>=0.12.0",
   #     "peft>=0.7.0",
   #     "transformers>=4.36.0",
   # ]
   # ///
   ```

2. **Verify the exact format:**
   - Must have `# ///` delimiters (with a space after `#`)
   - Dependencies must be valid PyPI package names
   - Check spelling and version constraints

3. **Test locally first:**
   ```bash
   uv run train.py  # Tests if dependencies are correct
   ```

## Authentication Errors

**Problem:** Job fails with authentication or permission errors when pushing to the Hub.

**Solutions:**
1. **Verify authentication:**
   ```python
   mcp__huggingface__hf_whoami()  # Check who's authenticated
   ```

2. **Check token permissions:**
   - Go to https://huggingface.co/settings/tokens
   - Ensure the token has "write" permission
   - Token must not be read-only

3. **Verify the token in the job:**
   ```python
   "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Must be in job config
   ```

4. **Check repo permissions:**
   - User must have write access to the target repo
   - If an org repo, user must be a member with write access
   - Repo must exist, or user must have permission to create it

## Job Stuck or Not Starting

**Problem:** Job shows "pending" or "starting" for an extended period.

**Solutions:**
- Check the Jobs dashboard for status: https://huggingface.co/jobs
- Verify hardware availability (some GPU types may have queues)
- Try a different hardware flavor if one is heavily utilized
- Check for account billing issues (Jobs requires a paid plan)

**Typical startup times:**
- CPU jobs: 10-30 seconds
- GPU jobs: 30-90 seconds
- If >3 minutes: likely queued or stuck
## Training Loss Not Decreasing

**Problem:** Training runs but loss stays flat or doesn't improve.

**Solutions:**
1. **Check the learning rate:** May be too low (try 2e-5 to 5e-5) or too high (try 1e-6)
2. **Verify dataset quality:** Inspect examples to ensure they're reasonable
3. **Check model size:** Very small models may not have the capacity for the task
4. **Increase training steps:** May need more epochs or a larger dataset
5. **Verify the dataset format:** A wrong format may cause degraded training

## Logs Not Appearing

**Problem:** Cannot see training logs or progress.

**Solutions:**
1. **Wait 30-60 seconds:** Initial logs can be delayed
2. **Check logs via the MCP tool:**
   ```python
   hf_jobs("logs", {"job_id": "your-job-id"})
   ```
3. **Use Trackio for real-time monitoring:** See `references/trackio_guide.md`
4. **Verify the job is actually running:**
   ```python
   hf_jobs("inspect", {"job_id": "your-job-id"})
   ```
## Checkpoint/Resume Issues

**Problem:** Cannot resume from a checkpoint, or checkpoints are not saved.

**Solutions:**
1. **Enable checkpoint saving:**
   ```python
   SFTConfig(
       save_strategy="steps",
       save_steps=100,
       hub_strategy="every_save",  # Push each checkpoint
   )
   ```

2. **Verify checkpoints are pushed to the Hub:** Check the model repo for checkpoint folders

3. **Resume from a checkpoint:** `resume_from_checkpoint` is an argument to `trainer.train()`, not to the trainer constructor:
   ```python
   trainer = SFTTrainer(
       model="username/model-name",
       train_dataset=dataset,
       args=config,
   )
   trainer.train(resume_from_checkpoint="path/to/checkpoint-1000")
   ```
## Getting Help

If issues persist:

1. **Check TRL documentation:**
   ```python
   hf_doc_search("your issue", product="trl")
   ```

2. **Check Jobs documentation:**
   ```python
   hf_doc_fetch("https://huggingface.co/docs/huggingface_hub/guides/jobs")
   ```

3. **Review related guides:**
   - `references/hub_saving.md` - Hub authentication issues
   - `references/hardware_guide.md` - Hardware selection and specs
   - `references/training_patterns.md` - Eval dataset requirements
   - SKILL.md "Working with Scripts" section - Script format and URL issues

4. **Ask in the HF forums:** https://discuss.huggingface.co/