🚀 Starting Agent Loop Tool Efficiency Test

#42

by bukit - opened Mar 21

Mar 21

📊 Configuration:
Base URL: http://localhost:9099/v1
Model: gpt-oss-20b-UD-Q6_K_XL_1
Test Cases: 17
Output: results/agent_test_results_gpt-oss-20b-UD-Q6_K_XL_1_20260321_100412.json
Log File: logs/agent_test_logs_gpt-oss-20b-UD-Q6_K_XL_1_20260321_100412.log

🔄 Running agent tests...
Starting agent test suite with 17 test cases
Running agent test: zero_greeting
Running agent test: medium_search_and_add
Running agent test: zero_general_question
Running agent test: zero_weather_question
Running agent test: simple_search_electronics
Running agent test: simple_add_iphone
Running agent test: simple_view_cart
Running agent test: simple_remove_product
Running agent test: simple_checkout
Running agent test: medium_search_category_and_add
Running agent test: medium_remove_and_add
Running agent test: complex_cart_management
Running agent test: complex_shopping_workflow
Running agent test: complex_gift_shopping
Running agent test: zero_thank_you
Running agent test: zero_capabilities
Running agent test: medium_view_and_add
✅ Tests completed in 51.2178944s

📈 Agent Test Results

Total Tests: 17
✅ Passed: 16
❌ Failed: 1
⏱️ Total LLM Time: 9m8.6337151s
⏱️ Average Time per Request: 13.381310124s

📋 Test Case Results:

Test Case: zero_capabilities
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 13.6273255s
Tool Calls: 0

Test Case: simple_remove_product
Status: ✅ PASSED
Matched Path: direct_remove
Response Time: 14.2621763s
Tool Calls: 1
Tools Used: remove_from_cart

Test Case: simple_checkout
Status: ✅ PASSED
Matched Path: direct_checkout
Response Time: 18.5823717s
Tool Calls: 1
Tools Used: checkout

Test Case: simple_view_cart
Status: ✅ PASSED
Matched Path: view_cart
Response Time: 19.0528687s
Tool Calls: 1
Tools Used: view_cart

Test Case: simple_search_electronics
Status: ✅ PASSED
Matched Path: search_by_category
Response Time: 25.4073783s
Tool Calls: 1
Tools Used: search_products

Test Case: medium_view_and_add
Status: ✅ PASSED
Matched Path: view_then_add
Response Time: 26.3762938s
Tool Calls: 2
Tools Used: view_cart, add_to_cart

Test Case: zero_general_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 28.739193s
Tool Calls: 0

Test Case: zero_thank_you
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 29.2047859s
Tool Calls: 0

Test Case: medium_search_and_add
Status: ✅ PASSED
Matched Path: search_by_query
Response Time: 31.9444122s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: medium_search_category_and_add
Status: ✅ PASSED
Matched Path: search_then_add
Response Time: 32.1428478s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: zero_greeting
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 36.3653798s
Tool Calls: 0

Test Case: simple_add_iphone
Status: ✅ PASSED
Matched Path: direct_add
Response Time: 38.3027008s
Tool Calls: 1
Tools Used: add_to_cart

Test Case: zero_weather_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 39.286728s
Tool Calls: 0

Test Case: medium_remove_and_add
Status: ✅ PASSED
Matched Path: remove_then_add
Response Time: 46.4043367s
Tool Calls: 2
Tools Used: remove_from_cart, add_to_cart

Test Case: complex_shopping_workflow
Status: ✅ PASSED
Matched Path: full_workflow_with_headphones
Response Time: 48.4718223s
Tool Calls: 4
Tools Used: search_products, add_to_cart, view_cart, checkout

Test Case: complex_gift_shopping
Status: ❌ FAILED
Response Time: 49.3239259s
Tool Calls: 5
Tools Used: search_products, search_products, search_products, add_to_cart, add_to_cart

Test Case: complex_cart_management
Status: ✅ PASSED
Matched Path: cart_organization
Response Time: 51.215769s
Tool Calls: 3
Tools Used: view_cart, remove_from_cart, add_to_cart

❌ Failed Tests Details:

Test Case: complex_gift_shopping
Expected Tool Variants: 2
Variant 1 (gift_shopping_workflow): 5 tools
Variant 2 (gift_shopping_workflow): 5 tools
Actual Tool Calls: 5
1. search_products
2. search_products
3. search_products
4. add_to_cart
5. add_to_cart
Response Time: 49.3239259s

📊 Overall Success Rate: 94.12% // https://github.com/docker/model-test

llama-server --port 9099 -ngl 99 -fa on -c 16384 --temp 1 --top-p 1.0 --top-k 0 --jinja -m X:\path\to\gpt-oss-20b-UD-Q6_K_XL.gguf

65 tokens/s on RTX 3060 12GB | RAM 32 GB

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

🚀 Starting Agent Loop Tool Efficiency Test

📈 Agent Test Results

📋 Test Case Results:

Test Case: zero_capabilities Status: ✅ PASSED Matched Path: no_tools Response Time: 13.6273255s Tool Calls: 0

Test Case: simple_remove_product Status: ✅ PASSED Matched Path: direct_remove Response Time: 14.2621763s Tool Calls: 1 Tools Used: remove_from_cart

Test Case: simple_checkout Status: ✅ PASSED Matched Path: direct_checkout Response Time: 18.5823717s Tool Calls: 1 Tools Used: checkout

Test Case: simple_view_cart Status: ✅ PASSED Matched Path: view_cart Response Time: 19.0528687s Tool Calls: 1 Tools Used: view_cart

Test Case: simple_search_electronics Status: ✅ PASSED Matched Path: search_by_category Response Time: 25.4073783s Tool Calls: 1 Tools Used: search_products

Test Case: medium_view_and_add Status: ✅ PASSED Matched Path: view_then_add Response Time: 26.3762938s Tool Calls: 2 Tools Used: view_cart, add_to_cart

Test Case: zero_general_question Status: ✅ PASSED Matched Path: no_tools Response Time: 28.739193s Tool Calls: 0

Test Case: zero_thank_you Status: ✅ PASSED Matched Path: no_tools Response Time: 29.2047859s Tool Calls: 0

Test Case: medium_search_and_add Status: ✅ PASSED Matched Path: search_by_query Response Time: 31.9444122s Tool Calls: 2 Tools Used: search_products, add_to_cart

Test Case: medium_search_category_and_add Status: ✅ PASSED Matched Path: search_then_add Response Time: 32.1428478s Tool Calls: 2 Tools Used: search_products, add_to_cart

Test Case: zero_greeting Status: ✅ PASSED Matched Path: no_tools Response Time: 36.3653798s Tool Calls: 0

Test Case: simple_add_iphone Status: ✅ PASSED Matched Path: direct_add Response Time: 38.3027008s Tool Calls: 1 Tools Used: add_to_cart

Test Case: zero_weather_question Status: ✅ PASSED Matched Path: no_tools Response Time: 39.286728s Tool Calls: 0

Test Case: medium_remove_and_add Status: ✅ PASSED Matched Path: remove_then_add Response Time: 46.4043367s Tool Calls: 2 Tools Used: remove_from_cart, add_to_cart

Test Case: complex_shopping_workflow Status: ✅ PASSED Matched Path: full_workflow_with_headphones Response Time: 48.4718223s Tool Calls: 4 Tools Used: search_products, add_to_cart, view_cart, checkout

Test Case: complex_gift_shopping Status: ❌ FAILED Response Time: 49.3239259s Tool Calls: 5 Tools Used: search_products, search_products, search_products, add_to_cart, add_to_cart

Test Case: complex_cart_management Status: ✅ PASSED Matched Path: cart_organization Response Time: 51.215769s Tool Calls: 3 Tools Used: view_cart, remove_from_cart, add_to_cart