Benchmarking agents in the context of full tools, that do not live under A2A. #1

BarrySmith · 2025-11-24T16:37:56Z

Often, we wish to benchmark a set of related agents in the context of "higher-level" tools, such as gemini-cli or warp. Since these tools are not A2A agents, how would one fit that type of benchmarking into AgentBeats?

Say I write an agent that helps with a particular aspect of code development, for example an agent that "knows CUDA programming tricks for high performance". Say this agent doesn't know how to compile or run CUDA code, so I would like an assessor that evaluates this agent when it is used with, for example, gemini-cli, warp, or similar tools (with a variety of different LLMs).
What are everyone's thoughts on how to fit this into AgentBeats? For example, wrap gemini-cli and warp as A2A agents? Some other approach?

Thanks!

camelop (Maintainer) · 2025-11-26T19:36:57Z

Hi Barry, thanks for the question. I'm not sure I completely get the second part of the question, but here are some relevant thoughts:

Benchmarking gemini-cli / warp terminal as separate agents:
- if the goal is to benchmark these tools themselves (which can also be seen as separate agentic systems), ideally one should add a thin wrapper layer on top of these tools to make them A2A compatible. E.g. open an A2A task interface and route the programming task to gemini-cli, or use terminal-bench instructions to instantiate a request to the warp terminal. Then the tool + wrapper can be seen as a complete AgentBeats assessee agent and be compatible with various benchmarks.
- if the goal is to call these tools to build a new agent (e.g. if I want an agent that runs gemini-cli and claude-code simultaneously and delivers whichever finishes faster), then as long as the top-level orchestrator receives A2A messages, there isn't any specification or recommended practice so far for how the integration should be done (e.g. from the simplest os.system call to some level of SDK-based integration, whatever works).
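
A minimal sketch of the "thin wrapper" idea from the first bullet, under stated assumptions: `run_tool` is a hypothetical helper, not part of AgentBeats or any A2A SDK, and the command it runs is supplied by the caller. The real command line for gemini-cli or warp is not shown here, and neither is the A2A server that would receive the task and call this function; a stand-in Python one-liner plays the role of the tool so the example runs anywhere.

```python
import subprocess
import sys


def run_tool(command, task_text, timeout_s=600):
    """Pass task_text to a CLI tool on stdin and return its result as a dict.

    `command` is the argv list for the tool. It is an assumption of this
    sketch: fill in the real invocation for the tool you are wrapping.
    """
    try:
        proc = subprocess.run(
            command,
            input=task_text,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run before this is raised.
        return {"ok": False, "output": "", "error": f"timed out after {timeout_s}s"}
    return {
        "ok": proc.returncode == 0,
        "output": proc.stdout.strip(),
        "error": proc.stderr.strip(),
    }


if __name__ == "__main__":
    # Stand-in "tool": upper-cases whatever task text it receives on stdin.
    stand_in = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
    print(run_tool(stand_in, "write a CUDA kernel"))
```

An A2A-facing handler would extract the task text from the incoming message, call something like `run_tool`, and send the returned output back as the task result. Keeping the tool invocation behind one function also means the assessor never needs to know which tool is underneath.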
Building the assessor agent to evaluate the above tools: I believe the key here is what instructions to pass to the agent, in what modality, and with what feedback. One should assume that these tools have already been converted into A2A agents that understand A2A task specifications, and describe the tasks to the agent as naturally as possible.
What about "using higher-level tools", say gemini-cli provided via an MCP server hosted by the assessor agent: while personally I would prefer to avoid the complexity by just leaving the agentic part to the assessee, if this is really the most reasonable setup in some scenarios, I would recommend rethinking it as a question of how to use MCP to expose the gemini-cli operation interface. But most of the time, I would just recommend making it a collaborative task, with one helper agent being gemini-cli or warp.

I'm not sure if any of these thoughts are directly relevant to your question, but happy to follow up if you have further thoughts or clarifications.

1 reply

BarrySmith Dec 3, 2025
Author

Thanks for the response. I would say that for me the interest is mostly "the goal is to benchmark these tools themselves", but more than just that. Since these tools can use MCP agents as helpers, one would actually be benchmarking gemini-cli PLUS a chosen set of "helper agents". One can even ask questions like: as I improve helper-agent-x (but keep gemini-cli and the other helper agents fixed), how does the entire process (using gemini-cli plus multiple helper agents, including helper-agent-x) improve?

Now why do I want to think about it this way? Because helper-agent-x may be very specialized to help with ONE tiny aspect of a problem, and thus it may be difficult (even impossible) to measure the improvement one gets by improving helper-agent-x in a standalone way, while measuring the improvement in gemini-cli PLUS helper-agent-x will often be trivial.
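
The comparison described above can be sketched as a small harness: run the same task set through the whole pipeline twice, changing only the helper, and report the difference. Everything here is hypothetical and for illustration only. `run_pipeline` stands for "host tool plus helpers attempts one task and reports success", and the toy pipeline at the bottom (a helper modelled as a set of tasks it can unblock) is a stand-in, not a real gemini-cli run.

```python
from statistics import mean


def success_rate(run_pipeline, tasks, helper):
    """Fraction of tasks solved by the whole pipeline when using `helper`."""
    return mean(1.0 if run_pipeline(task, helper) else 0.0 for task in tasks)


def helper_improvement(run_pipeline, tasks, helper_old, helper_new):
    """Compare two versions of one helper with everything else held fixed."""
    old = success_rate(run_pipeline, tasks, helper_old)
    new = success_rate(run_pipeline, tasks, helper_new)
    return {"old": old, "new": new, "delta": new - old}


if __name__ == "__main__":
    # Toy stand-in: the pipeline solves a task only if the helper covers it.
    def toy_pipeline(task, helper):
        return task in helper

    tasks = ["a", "b", "c", "d"]
    report = helper_improvement(toy_pipeline, tasks, {"a"}, {"a", "b", "c"})
    print(report)  # {'old': 0.25, 'new': 0.75, 'delta': 0.5}
```

Because the host tool and the LLM behind it are usually nondeterministic, a real comparison would repeat each task several times per helper version before trusting the delta.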

I'll go even further and submit that this statement will be true of most agents people are developing: measuring improvements in them standalone is impossible, but in the context of a larger tool it is easy. Thus I submit it is imperative that AgentBeats, or any similar benchmarking system, be able from day one to handle "tools" such as gemini-cli easily. I am a beginner in all this and could certainly try to find the time to wrap gemini-cli as an A2A agent, but I am surprised that AgentBeats (for example) doesn't just do this once and make it available, which would save me (and, I submit, most agent developers) a heck of a coding project in something I am not familiar with. Is anyone familiar with any working code (not proposals or thoughts) that turns gemini-cli, warp, etc. into an A2A agent?
