license: mit
tags:
  - tool-use
  - evaluation
  - play2prompt
  - stabletoolbench

Play2Prompt (P2P) StableToolBench Evaluation Pipeline

Replicates the Play2Prompt paper conditions on StableToolBench using Llama-3.1-8B-Instruct.

Designed for extensibility: the four conditions are controlled by two pluggable components, tool descriptions and in-context examples. To test your own description types, drop replacement files into p2p_data/descriptions/ and p2p_data/examples/.
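As a rough illustration of how two pluggable components can yield four conditions, the sketch below enumerates every pairing of a description variant with an example variant. It assumes a 2x2 layout. The variant labels ("original", "p2p", "none"), the per-variant subdirectories, and the build_conditions helper are hypothetical and are not taken from the pipeline source. Only the two p2p_data/ directory names come from this README.

```python
from itertools import product
from pathlib import Path

# Directories named in this README.
DESCRIPTIONS_DIR = Path("p2p_data/descriptions")
EXAMPLES_DIR = Path("p2p_data/examples")

# Hypothetical variant labels; the pipeline's real names may differ.
DESCRIPTION_VARIANTS = ("original", "p2p")
EXAMPLE_VARIANTS = ("none", "p2p")


def build_conditions():
    """Return one dict per (description variant, example variant) pairing."""
    return [
        {
            "name": f"desc-{d}_ex-{e}",
            # Assumed layout: one subdirectory per variant.
            "descriptions": DESCRIPTIONS_DIR / d,
            "examples": None if e == "none" else EXAMPLES_DIR / e,
        }
        for d, e in product(DESCRIPTION_VARIANTS, EXAMPLE_VARIANTS)
    ]


conditions = build_conditions()
for condition in conditions:
    print(condition["name"])
```

Under this assumed layout, adding a new description type would amount to adding one more label to DESCRIPTION_VARIANTS and placing the matching files under p2p_data/descriptions/.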

See the pipeline/ directory for all source code.