agent-evaluation

Here are 18 public repositories matching this topic...

coze-dev / coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated Sep 25, 2025
Go

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated Sep 18, 2025
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Sep 24, 2025
Python

mozilla-ai / any-agent

A single interface to use and evaluate different agent frameworks

ai mcp agents a2a agent-evaluation

Updated Sep 22, 2025
Python

rungalileo / agent-leaderboard

Ranking LLMs on agentic tasks

ai evaluation ai-agents synthetic-data ai-evaluation llms ai-benchmark agent-evaluation

Updated Sep 10, 2025
Jupyter Notebook

Cre4T3Tiv3 / ai-agents-reality-check

Mathematical benchmark exposing the massive performance gap between real agents and LLM wrappers. Rigorous multi-dimensional evaluation with statistical validation (95% CI, Cohen's h) and reproducible methodology. Separates architectural theater from real systems through stress testing, network resilience, and failure analysis.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Aug 8, 2025
Python
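
The two statistics named in the description above (95% CI and Cohen's h) can be sketched with the standard library alone. This is a minimal illustration, not the repository's actual code: the function names are hypothetical, and the Wilson score interval is an assumed choice, since the description does not say which interval method the benchmark uses.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: effect size for the difference between two proportions."""
    # Arcsine-transform each proportion, then take the difference.
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical numbers: one agent succeeds on 80% of tasks, another on 50%.
effect = cohens_h(0.8, 0.5)
low, high = wilson_interval(successes=80, trials=100)
```

By Cohen's conventional thresholds, an absolute h of about 0.2 is a small effect, 0.5 medium, and 0.8 large.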

SparkBeyond / agentune

Tune your AI Agent to best meet its KPI with a cyclic process of analyze, improve and simulate

customer-support customer-service conversational-agents ai-agents chatbot-evaluation agent-simulator kpi-analysis agent-evaluation agent-optimization sales-agents customer-facing-agents kpi-optimization

Updated Jul 25, 2025
Python

shiragannavar / Testing-RAG

evaluation ground-truth llm generative-ai agent-evaluation

Updated May 12, 2025
Python

chaosync-org / awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

testing qa benchmark machine-learning evaluation chaos artificial-intelligence chaos-monkey testing-tools awesome-list quality-assurance ai-safety ai-agents chaos-engineering llm llm-evaluation agentic-ai ai-benchmark agent-evaluation

Updated May 28, 2025

lml2468 / ContextOptimizer

Intelligent Context Engineering Assistant for Multi-Agent Systems. Analyze, optimize, and enhance your AI agent configurations with AI-powered insights

multi-agent-systems prompt-engineering agent-evaluation context-engineering agent-optimizer

Updated Jul 5, 2025
Python

JetBrains / teamcity-ai-agent-testing-demo

End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score with the official harness, and aggregate success rates. As an example, we'll look at Junie and Google Gemini CLI.

ai evaluation eval evaluation-framework agentic-ai agent-evaluation evaluation-tools

Updated Aug 13, 2025
Kotlin
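
The final step in the description above, aggregating success rates, might look roughly like the following sketch. It is written in Python for illustration and is not taken from the repository (which is Kotlin): the record layout `(agent, task_id, resolved)`, the function name, and the sample data are all assumptions.

```python
from collections import defaultdict


def success_rates(results):
    """Return the fraction of resolved tasks per agent.

    `results` is an iterable of (agent, task_id, resolved) tuples; this
    layout is a hypothetical stand-in for whatever the scoring step emits.
    """
    totals = defaultdict(int)
    resolved_counts = defaultdict(int)
    for agent, _task_id, resolved in results:
        totals[agent] += 1
        if resolved:
            resolved_counts[agent] += 1
    return {agent: resolved_counts[agent] / totals[agent] for agent in totals}


# Made-up outcomes for two hypothetical agents.
sample = [
    ("agent_a", "task-1", True),
    ("agent_a", "task-2", False),
    ("agent_b", "task-1", True),
    ("agent_b", "task-2", True),
]
rates = success_rates(sample)
```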

PabloCabaleiro / pondera

Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.

python ai agents model-agnostic ai-evaluation llms llm-evaluation llm-evaluation-framework llm-judge agent-evaluation ai-evaluation-framework rubric-based-evaluation yaml-first

Updated Sep 19, 2025
Python

smuddana-7 / Cart-Pole-Gymnasium-Environment

Train a reinforcement learning agent using PPO to balance a pole on a cart in the CartPole-v0 environment using Gymnasium and Stable-Baselines3. Includes model training, evaluation, and rendering using Python and Jupyter Notebook.

reinforcement-learning python3 agent-evaluation openai-gymnasium cartpole-v0-environment

Updated Jun 2, 2025
Jupyter Notebook

Sai-Santhan-Dodda / ai-navigation-automation

Browser automation agent for Bunnings website using the browser-use library, orchestrated via the laminar framework, managed with uv for Python environments, and running in Brave Browser for stealth and CAPTCHA bypass.

python automation gemini openai brave browser-automation laminar uv llms ollama stealth-browsing browser-use agent-evaluation

Updated Aug 10, 2025
Python

ahsanblock / NVIDIA-AgentIQ-Agents-Evaluator

Visual dashboard to evaluate multi-agent & RAG-based AI apps. Compare models on accuracy, latency, token usage, and trust metrics - powered by NVIDIA AgentIQ