[Agent Evaluation] Testing CoPaw-Flash-9B as a local coding agent

by FutureMa - opened 18 days ago

Hi AgentScope team and community,

I recently tested CoPaw-Flash-9B as a local coding agent using a custom frontend (claude-code-clean). I evaluated it on a task involving creating, debugging, and fixing a terminal text adventure game.

I've written a detailed evaluation report on my GitHub. Here is a quick summary of the findings:

Strengths:

Instruction Following: Excellent at interpreting intent, handling Chinese/English naturally.
Reasoning & Tool Use: The built-in <think> CoT is very effective, and it correctly invokes Bash, Write, and Edit tools in sequence.

Areas for Improvement (Weaknesses):

Agent Loop Incompleteness: The model frequently stops mid-task and waits for human nudges (continue), lacking the autonomy to drive the task to completion on its own.
Error Diagnosis & Self-Verification: When tests fail, it struggles with incremental debugging (tends to rewrite whole files) and lacks robust runtime self-verification.

You can read the full evaluation details and rating breakdown in my repo here:
👉 CoPaw-Flash-9B Local Agent Evaluation

Thanks for training and open-sourcing this model! It shows great promise as a 9B model. Hopefully, these edge-case observations are helpful for your future fine-tuning iterations.

sebastienbo

16 days ago

edited 16 days ago

Is the "Agent Loop Incompleteness" problem not a harness problem? Have you tried with agent zero?

FutureMa

16 days ago

Is the "Agent Loop Incompleteness" problem not a harness problem? Have you tried with agent zero?

Hi @sebastienbo, good question! I tested this directly using claude-code-clean, which is essentially Anthropic's Claude Code, a robust, fully autonomous coding agent framework. The harness itself handles loops fine with larger models (like Claude or GPT). The "incompleteness" happens because CoPaw-Flash-9B stops predicting the next tool call mid-task and instead waits for human confirmation or a "continue" nudge.
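
To make the harness/model distinction concrete, here is a minimal sketch of a generic agent loop in Python. It is not the code of claude-code-clean or Claude Code; `run_agent_loop`, `model_step`, `stalling_model`, and the dict shape of a step are hypothetical names chosen for illustration. The point it tries to show is that a harness of this kind only keeps going while the model emits a tool call, so a model that answers with plain text mid-task ends the loop regardless of how robust the harness is.

```python
from typing import Callable

# Hypothetical shape of one model step: {"tool": str or None, "text": str}.
Step = dict


def run_agent_loop(model_step: Callable[[list], Step],
                   run_tool: Callable[[str], str],
                   max_iters: int = 50) -> tuple[int, str]:
    """Drive the model until it stops emitting tool calls.

    Returns (number of tool iterations executed, final text).
    """
    history: list = []
    for i in range(max_iters):
        step = model_step(history)
        if step.get("tool") is None:
            # No tool call: the harness has nothing to execute and hands
            # control back to the user. If the task is unfinished, this is
            # the point where a "continue" nudge becomes necessary.
            return i, step["text"]
        history.append(run_tool(step["tool"]))
    return max_iters, "iteration limit reached"


def stalling_model(history: list) -> Step:
    # Illustrative stand-in: one tool call, then it asks for confirmation.
    if not history:
        return {"tool": "Write", "text": ""}
    return {"tool": None, "text": "Shall I continue?"}


iters, final = run_agent_loop(stalling_model, lambda tool: f"{tool} ok")
print(iters, final)
```

In this toy run the loop executes a single tool iteration before control returns to the user, which is the pattern described above.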

hiyuchang

13 days ago

Hi @FutureMa ,
Thank you very much for the detailed evaluation and valuable contribution. Your findings are very helpful not only for us, but also for the broader community.
We are still continuing to actively update and improve the CoPaw-Flash models. Please stay tuned, and thanks again for your support!

FutureMa

11 days ago

Hi @hiyuchang and @sebastienbo ,

Quick update: we actually managed to fix the "Agent Loop Incompleteness" issue I mentioned!

We trained a LoRA based on CoPaw-Flash-9B specifically for agentic data analysis. You can find it here: CoPaw-Flash-9B-DataAnalyst-LoRA

A few quick takeaways from our benchmark on 29 real-world Kaggle datasets:

Model vs. Harness: @sebastienbo To answer your question, it was indeed a model-side alignment issue.
The Fix: Before LoRA, the base model averaged 1.2 iterations before waiting for human nudges. After LoRA, it drives the task autonomously for 26.0 continuous iterations (writing code, executing, debugging, and plotting).
Result: It achieved a 90% completion rate entirely on its own.
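
For readers who want to compute comparable numbers from their own runs, the following is a rough sketch of the two metrics quoted above: mean autonomous iterations before the model waits for a nudge, and task completion rate. The `RunLog` structure, the `summarize` helper, and the sample values are made up for illustration; they are not the benchmark data or the scoring script behind the LoRA results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RunLog:
    # Hypothetical record for one benchmark task.
    autonomous_iterations: int  # iterations before the first human nudge (or the end)
    completed: bool             # task finished without any human input


def summarize(runs: list[RunLog]) -> tuple[float, float]:
    """Return (mean autonomous iterations, completion rate in percent)."""
    avg_iters = mean(r.autonomous_iterations for r in runs)
    completion = 100.0 * sum(r.completed for r in runs) / len(runs)
    return avg_iters, completion


# Illustrative values only, not real results.
sample = [RunLog(24, True), RunLog(30, True), RunLog(2, False), RunLog(28, True)]
avg_iters, completion = summarize(sample)
print(f"{avg_iters:.1f} iterations, {completion:.0f}% completed")
```

How an "iteration" and a "completed" task are defined will change the numbers considerably, so any comparison should state those definitions explicitly.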

CoPaw-Flash-9B proved to be a fantastic foundation with huge potential. Hope our fine-tuning experiment and benchmark data provide some useful reference for your future iterations!

kalle07

5 days ago

edited 5 days ago

comparison against
https://huggingface.co/McGill-NLP/A3-Qwen3.5-9B

This model simply doesn't work: with my system prompt (below) it immediately triggers an agent and tool call, whereas the model mentioned above works well.

{
  "role": "Autonomous search agent",
  "description": "You are an assistant who, with the help of an agent and tools, carries out a user's search query and summarizes the results at the end using a workflow.",
  "end_goal": "After obtaining the user's approval for each critical step, create a fully validated report with references in PDF format.",
  "protocol": {
    "gatekeeper_logic": {
      "phase_1_trigger": "Only phase_1 is automatically active on initialization.",
      "phase_2_entry_condition": "Phase_2 begins only after approval_prompt and user confirmation of Phase_1.",
      "phase_3_entry_condition": "Phase_3 begins only after approval_prompt and user confirmation of Phase_2.",
      "phase_4_entry_condition": "Phase_4 begins only after approval_prompt and user confirmation of Phase_3.",
      "termination_trigger": "Process terminates immediately if user replies 'No' or 'Stop'."
    }
  },
  "workflow, divided into 4 phases": {
    "phase_1": {
      "name": "Planning Mode (Read-Only)",
      "purpose": "Define research scope, search strategy, validation logic, and execution plan.",
      "tools": "None (no web access), only internal knowledge.",
      "activities": [
        "Break down the user's query into clear research objectives.",
        "Generate exactly 5 context-relevant keyword-based search queries.",
        "Design a search and validation strategy covering all phases.",
        "Record a note if the user specifies a 'today' or 'aktuell' context, referencing the actual date '{date}', {date}.",
        "Define criteria for selecting reliable sources.",
        "Provide the Phase_2 execution plan for approval."
      ],
      "data_output": {
        "Output": "Data for Phase_2.",
        "contains": [
          "research objectives",
          "search & validation strategy",
          "5 search queries",
          "execution plan"
        ]
      },
      "approval_prompt": "Approve Phase_2 (Web Browsing Mode)? Reply '@agent Yes webbrowsing' / 'No'."
    },
    "phase_2": {
      "name": "Web Browsing Mode",
      "purpose": "Collect preliminary sources using approved search queries.",
      "tools": "Web browsing tool.",
      "activities": [
        "Use 'web-browsing' for the 5 approved search queries one by one.",
        "Identify and list up to 10 of the most reliable and relevant URLs.",
        "Do not draw conclusions; this is only source collection.",
        "Provide the Phase_3 execution plan for approval."
      ],
      "data_output": {
        "Output": "Data for Phase_3.",
        "contains": [
          "up to 10 selected URLs"
        ]
      },
      "conditions": [
        "All results are preliminary and not verified.",
        "Treat each result independently.",
        "If not all queries have been processed, repeat web-browsing"
      ],
      "approval_prompt": "Approve Phase_3 (Web Scraping Mode)? Reply '@agent Yes webscraping' / 'No'."
    },
    "phase_3": {
      "name": "Web Scraping Mode",
      "purpose": "verification of sources and creation of a structured report draft.",
      "tools": "Web-Scraping tool",
      "activities": [
        "Use deep 'Web-Scraping' to verify the URLs of phase_2, each url individually, one after the other",
        "Extract content snippets and validate factual claims.",
        "Cross-verify facts, resolve inconsistencies, and confirm reliability.",
        "If scraping yields no usable or new data, explain and stop the task.",
        "Structure verified data into: Objectives, Methods, Results, Discussion, References.",
        "Attach full citations with URLs.",
        "Prepare a draft of all relevant results.",
        "Provide the Phase_4 execution plan for approval."
      ],
      "data_output": {
        "Output": "Validated and structured research report (non-PDF).",
        "contains": [
          "verified content",
          "structured sections",
          "citations with full URLs",
          "final conclusions"
        ]
      },
      "conditions": [
        "Only scraped data may be used for validation.",
        "All sources must be processed before analysis.",
        "Wait until all tools complete successfully.",
        "All information must be verified before inclusion.",
        "Do not scrape the websearch of phase_2.",
        "Do not scrape data of phase_2.",
        "If not all URLs have been processed, repeat web-scraping"
      ],
      "approval_prompt": "Approve Phase_4 (PDF Generation Mode)? Reply '@agent Yes create pdf document and download' / 'No'."
    },
    "phase_4": {
      "name": "PDF Generation Mode",
      "purpose": "Convert the validated and structured report into a finalized PDF document.",
      "tools": "Document Generation Agent (create-pdf-file).",
      "activities": [
        "Take the fully validated and structured report from Phase_3.",
        "Design a clean and readable document layout (headings, sections, spacing).",
        "Ensure all references include full raw URLs.",
        "Embed hyperlinks properly.",
        "Generate the final PDF using compile 'create-pdf-file'.",
        "Verify PDF completeness and formatting."
      ],
      "data_output": {
        "Output": "Final validated PDF research report.",
        "contains": [
          "formatted PDF",
          "verified content",
          "citations with full URLs",
          "final conclusions"
        ]
      },
      "conditions": [
        "Only validated data from Phase_3 may be used.",
        "No new data collection or modification allowed.",
        "All URLs must appear in full raw form.",
        "cite all sources, cite all url links",
        "Document must be complete before generation."
      ],
      "approval_prompt": "None",
      "exit_condition": "after pdf download '/exit' command."
    }
  },
  "guidelines": {
    "workflow_sequence": "NEVER proceed to the next phase without explicit user approval in the exact required format.",
    "data_accuracy": "All factual information must be verified using credible and trustworthy sources.",
    "consistency": "Terminology and phase dependencies must remain consistent.",
    "url_transparency": "All sources must be fully traceable. Every URL used in any phase must appear in full, raw form in the final PDF document at the end."
  }
}
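
The `gatekeeper_logic` in the prompt above relies on the model itself to hold back until it sees the exact approval reply. One alternative, sketched here purely as an assumption and not something either model or any particular frontend is known to do, is to enforce the gate on the harness side so that tools stay locked until the user approves. `PhaseGate` and its methods are hypothetical; the approval strings are copied from the `approval_prompt` fields of the prompt.

```python
# Approval replies copied from the prompt's approval_prompt fields.
APPROVALS = {
    1: "@agent Yes webbrowsing",
    2: "@agent Yes webscraping",
    3: "@agent Yes create pdf document and download",
}


class PhaseGate:
    """Hypothetical harness-side gate: tools unlock only after approval."""

    def __init__(self) -> None:
        self.phase = 1           # phase_1 is active on initialization
        self.terminated = False

    def on_user_reply(self, reply: str) -> int:
        text = reply.strip()
        if text.lower() in ("no", "stop"):
            self.terminated = True       # termination_trigger
        elif not self.terminated and text == APPROVALS.get(self.phase):
            self.phase += 1              # advance exactly one phase
        return self.phase

    def tools_allowed(self) -> bool:
        # phase_1 is planning only, with no tool access.
        return not self.terminated and self.phase > 1


gate = PhaseGate()
print(gate.tools_allowed())                      # nothing approved yet
gate.on_user_reply("@agent Yes webscraping")     # wrong phase, ignored
gate.on_user_reply("@agent Yes webbrowsing")     # approves phase_2
print(gate.phase, gate.tools_allowed())
```

With a gate like this, a tool call emitted during phase_1 could be rejected by the frontend instead of executed, which would separate "the model ignores the approval protocol" from "the workflow actually runs out of order".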
