
Interesting model...

#2
by bukit - opened

🚀 Starting Agent Loop Tool Efficiency Test
📊 Configuration:
Base URL: http://localhost:9099/v1
Model: Youtu-LLM-2B-Q8_0
Test Cases: 17
Output: results/agent_test_results_Youtu-LLM-2B-Q8_0_20260109_135041.json
Log File: logs/agent_test_logs_Youtu-LLM-2B-Q8_0_20260109_135041.log

🔄 Running agent tests...
Starting agent test suite with 17 test cases
Running agent test: zero_greeting
Running agent test: zero_weather_question
Running agent test: simple_search_electronics
Running agent test: zero_thank_you
Running agent test: zero_capabilities
Running agent test: simple_remove_product
Running agent test: simple_add_iphone
Running agent test: simple_view_cart
Running agent test: medium_search_and_add
Running agent test: simple_checkout
Running agent test: medium_search_category_and_add
Running agent test: medium_remove_and_add
Running agent test: medium_view_and_add
Running agent test: complex_shopping_workflow
Running agent test: complex_cart_management
Running agent test: complex_gift_shopping
Running agent test: zero_general_question
✅ Tests completed in 7m25.2517067s

📈 Agent Test Results

Total Tests: 17
✅ Passed: 15
❌ Failed: 2
⏱️ Total LLM Time: 1h35m33.5856949s
⏱️ Average Time per Request: 2m53.745021057s
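
A note on reading the timings above: the total LLM time (1h35m) is much larger than the wall-clock time (7m25s), which suggests the per-request times were summed across tests that ran concurrently. That is an inference from the numbers, not something the log states. The sketch below, assuming the durations follow Go's `time.Duration` string format, converts them to seconds and derives the implied request count. The helper name `parse_go_duration` is made up for this example.

```python
import re

# Seconds per unit for the subset of Go duration units seen in the log.
_UNITS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}


def parse_go_duration(text: str) -> float:
    """Convert a Go-style duration such as '1h35m33.5856949s' to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", text)
    # Reject input that is not made up entirely of number+unit pairs.
    if not parts or "".join(num + unit for num, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(float(num) * _UNITS[unit] for num, unit in parts)


total_llm = parse_go_duration("1h35m33.5856949s")   # Total LLM Time
avg_request = parse_go_duration("2m53.745021057s")  # Average Time per Request
wall_clock = parse_go_duration("7m25.2517067s")     # Tests completed in

# Total divided by average gives the number of LLM requests behind the summary.
requests = total_llm / avg_request
print(round(requests))                    # 33
print(round(total_llm / wall_clock, 1))   # 12.9
```

So the 17 test cases appear to have issued 33 LLM requests in total, and the summed LLM time is roughly 13 times the wall-clock time.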

📋 Test Case Results:

Test Case: zero_capabilities
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 1m32.3953871s
Tool Calls: 0

Test Case: zero_thank_you
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 2m18.4499702s
Tool Calls: 0

Test Case: complex_gift_shopping
Status: ❌ FAILED
Response Time: 4m30.8619954s
Tool Calls: 0

Test Case: simple_view_cart
Status: ✅ PASSED
Matched Path: view_cart
Response Time: 4m55.1403006s
Tool Calls: 1
Tools Used: view_cart

Test Case: simple_checkout
Status: ✅ PASSED
Matched Path: direct_checkout
Response Time: 5m25.7924959s
Tool Calls: 1
Tools Used: checkout

Test Case: complex_shopping_workflow
Status: ❌ FAILED
Response Time: 5m43.437793s
Tool Calls: 0

Test Case: zero_general_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 5m52.5598254s
Tool Calls: 0

Test Case: simple_search_electronics
Status: ✅ PASSED
Matched Path: search_by_category
Response Time: 6m1.0488439s
Tool Calls: 1
Tools Used: search_products

Test Case: zero_greeting
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 6m5.646711s
Tool Calls: 0

Test Case: zero_weather_question
Status: ✅ PASSED
Matched Path: no_tools
Response Time: 6m6.3095409s
Tool Calls: 0

Test Case: simple_remove_product
Status: ✅ PASSED
Matched Path: direct_remove
Response Time: 6m18.0800179s
Tool Calls: 1
Tools Used: remove_from_cart

Test Case: medium_view_and_add
Status: ✅ PASSED
Matched Path: view_then_add
Response Time: 6m34.1837379s
Tool Calls: 2
Tools Used: view_cart, add_to_cart

Test Case: simple_add_iphone
Status: ✅ PASSED
Matched Path: direct_add
Response Time: 6m38.3669481s
Tool Calls: 1
Tools Used: add_to_cart

Test Case: medium_search_category_and_add
Status: ✅ PASSED
Matched Path: search_then_add
Response Time: 6m38.7526495s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: medium_remove_and_add
Status: ✅ PASSED
Matched Path: remove_then_add
Response Time: 6m39.5141649s
Tool Calls: 2
Tools Used: remove_from_cart, add_to_cart

Test Case: medium_search_and_add
Status: ✅ PASSED
Matched Path: search_by_query
Response Time: 6m47.8384886s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: complex_cart_management
Status: ✅ PASSED
Matched Path: cart_organization
Response Time: 7m25.2484402s
Tool Calls: 3
Tools Used: view_cart, remove_from_cart, add_to_cart

❌ Failed Tests Details:

Test Case: complex_gift_shopping
Expected Tool Variants: 2
Variant 1 (gift_shopping_workflow): 5 tools
Variant 2 (gift_shopping_workflow): 5 tools
Actual Tool Calls: 0
Response Time: 4m30.8619954s

Test Case: complex_shopping_workflow
Expected Tool Variants: 4
Variant 1 (full_workflow_with_iphone): 4 tools
Variant 2 (full_workflow_with_headphones): 4 tools
Variant 3 (full_workflow_with_headphones_and_iphone): 5 tools
Variant 4 (full_workflow_with_iphone_and_headphones): 5 tools
Actual Tool Calls: 0
Response Time: 5m43.437793s
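
For anyone unfamiliar with the "Expected Tool Variants" wording: each test seems to accept several tool-call sequences, and it passes if the model's actual calls match one of them. The log does not show how the harness compares them, so the sketch below assumes an exact, ordered match. The function `match_variant` is hypothetical, and the variant contents are reconstructed from the "Matched Path" and "Tools Used" lines of passing tests, not taken from the harness configuration.

```python
from typing import Mapping, Optional, Sequence


def match_variant(
    actual: Sequence[str],
    variants: Mapping[str, Sequence[str]],
) -> Optional[str]:
    """Return the name of the first variant whose tool sequence equals `actual`.

    Assumption: a variant matches only on an exact, ordered sequence.
    The real harness may be more lenient.
    """
    for name, expected in variants.items():
        if list(actual) == list(expected):
            return name
    return None


# Reconstructed from medium_search_and_add (Matched Path: search_by_query).
search_variants = {"search_by_query": ["search_products", "add_to_cart"]}

# Reconstructed from the zero_* tests (Matched Path: no_tools).
no_tool_variants = {"no_tools": []}

print(match_variant(["search_products", "add_to_cart"], search_variants))  # search_by_query
print(match_variant([], no_tool_variants))                                 # no_tools
print(match_variant([], search_variants))                                  # None
```

The last call mirrors the two failures: with 0 actual tool calls, no variant that expects 4 or 5 tools can match.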

📊 Overall Success Rate: 88.24%
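
The overall rate hides where the failures fall. A small sketch that groups the results above by the prefix of each test name, assuming that prefix (`zero`, `simple`, `medium`, `complex`) marks a difficulty tier. The log does not say so explicitly, and the names `RESULTS`, `tier_of` and `summarize` are made up for this example.

```python
from collections import defaultdict

# Pass/fail status per test case, copied from the results listed above.
RESULTS = {
    "zero_greeting": True,
    "zero_weather_question": True,
    "zero_thank_you": True,
    "zero_capabilities": True,
    "zero_general_question": True,
    "simple_search_electronics": True,
    "simple_remove_product": True,
    "simple_add_iphone": True,
    "simple_view_cart": True,
    "simple_checkout": True,
    "medium_search_and_add": True,
    "medium_search_category_and_add": True,
    "medium_remove_and_add": True,
    "medium_view_and_add": True,
    "complex_shopping_workflow": False,
    "complex_cart_management": True,
    "complex_gift_shopping": False,
}


def tier_of(test_name: str) -> str:
    """Assumed tier: the part of the test name before the first underscore."""
    return test_name.split("_", 1)[0]


def summarize(results):
    """Return {tier: (passed, total)} for the given name -> passed mapping."""
    by_tier = defaultdict(lambda: [0, 0])
    for name, passed in results.items():
        counts = by_tier[tier_of(name)]
        counts[0] += int(passed)
        counts[1] += 1
    return {tier: tuple(counts) for tier, counts in by_tier.items()}


summary = summarize(RESULTS)
for tier in ("zero", "simple", "medium", "complex"):
    passed, total = summary[tier]
    print(f"{tier:8s} {passed}/{total}")

overall = 100 * sum(RESULTS.values()) / len(RESULTS)
print(f"overall  {overall:.2f}%")  # overall  88.24%
```

Under that grouping, zero, simple and medium all pass (5/5, 5/5, 4/4), and both failures are in the complex tier (1/3).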

https://github.com/docker/model-test

RTX 3060 12GB, temperature 0
