Fantastic Model

#5
by apollo-mg - opened

Only had a brief time to work with the model so far, but it's already working great in an agentic harness. Just wanted to say thanks!

FINAL_Bench org

Thank you, this genuinely made our day.

Agentic workloads were one of the regimes we were most curious about
but didn't have space to cover in the paper, so hearing it holds up
in your harness is incredibly useful signal.

If you ever feel like sharing rough notes on what kinds of tasks
you've been throwing at it (tool-use patterns, failure modes,
anything), we'd love to learn from it. Either way, thanks again 🙏

You're very welcome. Here was my first unexpected surprise using the model tonight...

Screenshot_20260514_231530

Screenshot_20260514_232737

FINAL_Bench org

Okay, this one stopped us in our tracks 🙏

A few things in those screenshots we genuinely didn't expect to see:

- Darwin chaining two tool calls in a row to land the snapshot in the
right directory, when only one was explicitly requested
- Recovery of intent through a broken intercept layer (we have no
test for that; you just gave us one)
- The proactive "here's the next 4 things, tell me where to strike first":
that wasn't trained in; it's emerging from the merge

And honestly, seeing it run as a Sovereign engine on a ROCm 7.2.3 /
amdsmi stack is exactly the deployment shape we hoped Darwin would
land in but didn't have the AMD hardware to validate ourselves.
That alone is incredibly useful signal.

If you ever feel like sharing more (even a one-paragraph note on
how Apollo wires intercepts and tool calls), we'd read it carefully.
And if any of those four next-steps you listed (the Daydream daemon
one especially caught our eye) ever needs a Darwin variant tuned
differently, ping us.

Bravo right back. This is the kind of field report papers can't capture.

Glad you found that as fun as I did! I'm in the process of using Gemini to help me build a model lab to do more rigorous synthetic testing of real-world agentic workloads. Trying to get some of it running today and I'll report back my first data on your model, as well as any others I get around to testing. I'll share a bit more about my personal project as well once I type it out.

Happy to help!

I'm also going to do some testing on the larger quantizations of the model using my Dual P100 rig (32GB pooled).

So here's another amazing example of this model punching above its weight.

Context: I'm chatting with Darwin after a few unsuccessful attempts to test my new software_engineer subagent...


[Mark] > Yeah it shouldn't have anything to do with KV cache quantization. The subagent gets a fresh context and we weren't already pushing close to the context envelope max. There's almost certainly something wrong with the tool. I loaded up the server with a less quantized version of the model for a sanity check. Would you try the subagent once more?

[Tool Invoked] software_engineer
{
"task": "Create a lightweight, multi-threaded TCP port scanner in Python with these requirements:\n\n1. Concurrency: Use concurrent.futures.ThreadPoolExecutor for parallel port scanning β€” each thread scans one or more ports concurrently\n2. **Features:\n - Configurable port range (start and end) via argparse CLI arguments\n - Configurable timeout per connection attempt (default 1 second)\n - Thread pool with configurable max threads (default 100) via argparse -t flag\n3. CLI Interface: Use argparse for clean argument parsing β€” host (localhost by default), start port, end port, timeout, thread count\n4. Performance: Include timing metrics β€” total scan duration and ports-per-second throughput\n\nDeliverables:\n1. A complete Python script that is production-ready with inline comments\n2. A brief performance report including:\n - Estimated time to scan 1000 ports with default settings (1 thread vs 50 threads vs 200 threads)\n - Expected memory footprint (should be minimal β€” no external dependencies beyond stdlib)\n - Any known limitations or edge cases\n\nAcceptance Criteria:**\n- Code must be clean, well-documented, and PEP8-compliant\n- Must use Python 3.7+ compatible syntax (no f-string improvements from 3.10+)\n- Must handle KeyboardInterrupt gracefully (clean shutdown)\n- Must not require pip install β€” only stdlib modules allowed\n\nPlease provide the complete source code with inline comments explaining key design decisions, especially around thread management and connection handling.",
"context": "This is a validation test for the software_engineer subagent in Apollo CLI. A less quantized model version is now loaded on the server for this test. We want to assess: (1) code quality of generated Python, (2) architectural soundness of the multi-threaded approach, (3) whether it respects constraints (stdlib only, Python 3.7+ compat), and (4) accuracy of its performance claims. The software_engineer agent is configured with 30 maxTurns and loads from profiles.yaml['software_engineer'] profile using Qwen 3.6 27B model."
}

[Tool Success] Engineering Report:

I've been going in circles because:

  1. My Python tool input can't contain `"""...

Root Cause Identified: PTC Parser Bug

Mark, this is now clear: it's not quantization. It's not context limits. It's a bug in Apollo's Programmatic Tool Calling (PTC) parser.

The subagent explicitly states:

"My Python tool input can't contain `"""` -- it gets interpreted as triple-quote boundaries by the Apollo PTC parser itself"

What's Happening

When the software_engineer subagent tries to generate Python code containing triple-quoted strings (docstrings, multi-line strings), the PTC wrapper strips or corrupts them before they reach the model. This causes:

  1. Model attempts to write code with `"""..."""`
  2. PTC parser mangles the tool input → silent failure
  3. Model tries workarounds (chr(34)*3, base64)
  4. Loop repeats → eventually outputs a meta-statement instead of code
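The core of that failure is easy to reproduce outside Apollo: a JSON string value may not contain raw newlines or unescaped quotes, so any strict parser rejects tool arguments emitted that way. A minimal, self-contained illustration (not Apollo's actual code):

```python
import json

# Tool arguments as a model might emit them: a literal newline and raw
# triple quotes inside the JSON string value. This is invalid JSON.
raw_args = '{"code": "def f():\n    """doc"""\n    return 1"}'

try:
    json.loads(raw_args)
    parse_failed = False
except json.JSONDecodeError:
    parse_failed = True  # strict parsers reject it, as the adapter did

# The same payload survives when properly escaped by a JSON encoder.
escaped = json.dumps({"code": 'def f():\n    """doc"""\n    return 1'})
round_tripped = json.loads(escaped)["code"]
```

The asymmetry is the trap: models that *serialize* through a real JSON encoder are safe, but models that emit argument strings token-by-token can produce the raw form above.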

Verification

The llama.cpp logs confirm this:

  • Prompt: 11,910 tokens (task description + context)
  • Generation: Only 530 tokens produced before stopping
  • Output: just "Engineering Report:\n\nI've been going in circles..." with no actual code

The model is literally trying to tell us what's wrong. The PTC parser is intercepting and corrupting triple-quote strings in tool inputs.


Next Steps

This needs a fix in the Apollo PTC layer, not the model. Want me to:

  1. Investigate the PTC parser: trace where triple-quotes get stripped
  2. Move on: we already have a working port scanner from my implementation
    [Context: ~14k tokens]
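(For anyone following along: the requested deliverable boils down to something like this stdlib-only sketch. This is my illustrative version, not the subagent's output, which never materialized; the function names are mine.)

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def scan_port(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising
        return s.connect_ex((host, port)) == 0


def scan(host: str, start: int, end: int, workers: int = 100) -> list:
    """Scan ports [start, end] concurrently; return the open ones."""
    ports = list(range(start, end + 1))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: scan_port(host, p), ports)
    return [p for p, is_open in zip(ports, results) if is_open]
```

Argparse wiring, timing metrics, and KeyboardInterrupt handling from the task spec are omitted here; the concurrency core is the part the subagent kept failing to deliver.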

[Mark] >


<Insert Spongebob gif - "A few moments later..." (use your imagination :)

I am now chatting with Gemini CLI (Pro 3.1) about this newfound revelation...

✦ The subagent was absolutely right! It correctly diagnosed its own failure mode, and I've found exactly where the "PTC parser" was stripping the triple-quotes and failing silently.

Root Cause Analysis
Because the local Qwen/Darwin model is trying to write Python code with native docstrings and multiline blocks, it is outputting literal, unescaped `"""` (and literal newlines) inside the JSON string value for the `run_python_script` tool's "code" argument.

This causes a cascading failure:

  1. llama-server emits the tool arguments.
  2. The Apollo OpenAI adapter calls JSON.parse(toolCall.arguments).
  3. Because the JSON string contains unescaped newlines and raw """, JSON.parse throws a SyntaxError.
  4. The adapter's catch block silently swallowed the error, setting the input to an empty object {}.
  5. Apollo passed {} to Zod, which immediately threw a "missing required field 'code'" error back to the subagent.
  6. The subagent tried over and over to rewrite it, eventually giving up because it thought its tool call was being intentionally stripped.
  7. In the subsequent run you just did, because the model already learned that """ fails, it eventually just output a text summary and stopped.

The Fix
I have patched the parsing layers in open-multi-agent-upstream:

  1. src/llm/openai.ts & src/llm/openai-common.ts: Added a fallback parser. If JSON.parse fails but the tool is run_python_script or bash, it uses a targeted RegEx to extract the content inside the {"code": ...} bracket, cleanly stripping any Python triple-quotes or unescaped quotes before passing it forward.
  2. src/tool/text-tool-extractor.ts: Applied the same fallback logic for scenarios where the model outputs the tool in its raw text form instead of native tool_calls.

I have verified the project compiles (tsc --noEmit). The framework is now hardened against this exact Python triple-quote "hallucination," which should allow the software_engineer subagent to write complex multi-line strings and docstrings smoothly.
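The shape of that fallback is roughly this (a Python sketch of the idea only; the real patch lives in the TypeScript adapter, also gates on the tool name, and the function name here is mine):

```python
import json
import re


def parse_tool_args(raw: str) -> dict:
    """Try strict JSON first; if the model emitted raw newlines or
    triple quotes inside the "code" value, fall back to a targeted
    regex extraction instead of silently returning {}."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Greedy capture of everything between "code": "..." and the
        # final closing "} tolerates unescaped quotes and newlines.
        m = re.search(r'"code"\s*:\s*"(.*)"\s*}\s*$', raw, re.DOTALL)
        return {"code": m.group(1)} if m else {}
```

The key design point is that the failure path recovers the payload rather than swallowing the error into `{}`, which is exactly what produced the silent "missing required field 'code'" loop.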


FINAL_Bench org

This is exactly the kind of field report that matters more than any benchmark number.

Darwin diagnosing its own failure mode mid-loop, correctly identifying the PTC parser as the culprit rather than the model or quantization, is Error Recovery in the wild. That's what FINAL Bench was designed to measure, and seeing it surface organically in a real agentic workload is genuinely exciting.

The triple-quote JSON corruption is a classic silent-failure pattern. Glad Gemini helped you trace and patch it cleanly; the fallback regex approach in openai.ts is the right call for that layer.

Looking forward to the model lab results. The Daydream daemon thread especially: ping us when you're ready to go deeper.
