Open Source

Agent-Eval: CI Evaluation Harness for Multi-Agent Development

Impact Summary

Built a CI evaluation harness that enables teams to detect behavioral regressions in agent instruction files. Provides systematic A/B testing for configurations across Claude Code, Cursor, and other AI development tools.

Role

Creator & Maintainer

Timeline

2026-Present

Scale

  • Multi-agent environments
  • CI/CD integration
  • Matrix evaluation testing

Problem

Modern development workflows increasingly rely on multiple AI agents simultaneously—Claude for ideation, Cursor for precision editing, Copilot for completions. This creates an environment where tool coexistence is the norm, not tool choice. The challenge I observed was that agent instruction files (like CLAUDE.md or Cursor rules) become shared dependencies across these tools and repositories.

A single line change in these instruction files can silently degrade productivity, but there was no systematic way to detect this drift. Teams were relying on informal trust networks and anecdotal feedback to maintain these critical configuration files. This approach simply doesn’t scale, especially in enterprise environments where multiple teams contribute to shared agent configurations.

I needed a safety net—a way to run behavioral regression tests that could reliably detect when changes made things worse, even if they couldn’t perfectly measure “better.”

Approach

I designed agent-eval as a CI-native evaluation harness built around three core primitives: Tasks (what to test), Configs (how to configure the agent), and Graders (how to score results). This separation of concerns allows teams to mix and match evaluation scenarios without coupling test definitions to specific agent configurations.
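The separation can be sketched as three independent value types. This is a hypothetical Python rendering of the primitives; the actual names and fields in agent-eval may differ.

```python
# Hypothetical sketch of the three core primitives as dataclasses;
# real agent-eval field names may differ.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """What to test: a prompt plus the files the agent starts from."""
    name: str
    prompt: str
    fixture_files: dict[str, str] = field(default_factory=dict)

@dataclass
class Config:
    """How to configure the agent: instruction files injected per run."""
    name: str
    instruction_files: dict[str, str] = field(default_factory=dict)

@dataclass
class Grader:
    """How to score results: a callable from agent output to a 0..1 score."""
    name: str
    score: Callable[[str], float]

# Because the primitives are decoupled, any task can be paired with any
# config and any grader without rewriting test definitions:
task = Task("add-endpoint", "Add a /health endpoint to the API")
config = Config("baseline", instruction_files={})
grader = Grader("contains-route", lambda out: 1.0 if "/health" in out else 0.0)
```

Keeping graders as plain callables means a new scoring strategy never requires touching task or config definitions.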

Key Design Elements

  • Isolation-first execution: Every evaluation runs in a fresh temporary directory with injected configuration, ensuring reproducibility. Docker containerization provides an additional layer of isolation for sensitive environments.

  • Matrix evaluation: The harness supports running tasks across multiple configurations with multiple runs each, enabling statistical comparison between setups (baseline vs. skills-only vs. full configuration).

  • Hybrid grading: I implemented both code-based assertions (tests pass, file contains specific content) and LLM-based rubric evaluation. This allows teams to capture both objective correctness and subjective quality measures.

  • Pluggable executor architecture: The system supports multiple backends including Claude CLI and Docker containers, making it adaptable to different deployment contexts.
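The hybrid grading idea can be illustrated with a small sketch. I am assuming here that hard assertions gate the score while the rubric refines it; the combination rule and helper names below are illustrative, not agent-eval's actual implementation (the LLM judge is stubbed with a heuristic so the sketch runs offline).

```python
# Illustrative hybrid grader: objective assertions gate the score,
# a (stubbed) LLM rubric supplies the subjective component.
def code_assertions_pass(workdir_files: dict[str, str]) -> bool:
    # Objective check: the agent must have written the expected content.
    readme = workdir_files.get("README.md", "")
    return "## Usage" in readme

def llm_rubric_score(transcript: str) -> float:
    # Stand-in for an LLM judge call; a trivial heuristic keeps the
    # sketch self-contained.
    return 1.0 if "because" in transcript else 0.5

def hybrid_grade(workdir_files: dict[str, str], transcript: str) -> float:
    # Objective failure zeroes the score regardless of subjective quality;
    # otherwise the rubric score is used.
    if not code_assertions_pass(workdir_files):
        return 0.0
    return llm_rubric_score(transcript)
```

Gating on the hard assertions keeps the LLM judge from inflating scores for outputs that are fluent but objectively wrong.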

The configuration variant system was particularly important for A/B testing. I created four presets—baseline (no skills, no CLAUDE.md), skills-only, claude-md-only, and full—allowing teams to isolate the impact of different instruction types on agent behavior.
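The four presets can be sketched as a simple lookup of injected instruction files (file contents here are placeholders):

```python
# The four configuration variants as a lookup; contents are placeholders.
SKILLS = {"skills/refactor.md": "..."}
CLAUDE_MD = {"CLAUDE.md": "..."}

PRESETS = {
    "baseline":       {},                          # no skills, no CLAUDE.md
    "skills-only":    {**SKILLS},
    "claude-md-only": {**CLAUDE_MD},
    "full":           {**SKILLS, **CLAUDE_MD},
}

# A/B isolation: diffing results between "baseline" and "skills-only"
# attributes any behavioral change to the skills files alone.
```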

Outcomes

  • Systematic regression detection: Teams can now run harness regression to compare baseline and current results, catching behavioral drift before it impacts developers
  • Reproducible evaluation: The containerized execution model ensures consistent results across CI environments and local development
  • Scaffolding for experimentation: The harness scaffold command generates templates for new skill tests, lowering the barrier to expanding test coverage
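The regression check in the first bullet can be sketched as a comparison of per-task pass rates against a stored baseline. The tolerance value and metric shape are assumptions for illustration, not agent-eval's actual algorithm.

```python
# Minimal regression check between a stored baseline and the current run,
# assuming per-task pass rates in [0, 1]; the 0.05 tolerance is illustrative.
def detect_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.05) -> list[str]:
    """Return tasks whose pass rate dropped by more than `tolerance`."""
    return sorted(
        task for task, rate in current.items()
        if task in baseline and baseline[task] - rate > tolerance
    )

baseline = {"add-endpoint": 0.90, "fix-test": 0.80}
current  = {"add-endpoint": 0.60, "fix-test": 0.78}
# Only add-endpoint dropped beyond the tolerance; fix-test's 2-point
# dip is within noise.
```

A tolerance band like this is what lets the check reliably flag "worse" without needing to measure "better" precisely.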

Key Contributions

  • Designed the three-tier architecture separating task definition, configuration management, and scoring logic for maximum flexibility
  • Implemented the EvalRunner orchestration layer coordinating isolation, execution, and reporting across evaluation runs
  • Built the artifact archiving system preserving evaluation outputs for post-hoc analysis and debugging
  • Created comprehensive CLI tooling including dry-run validation, task discovery, and matrix execution commands
  • Developed the regression comparison algorithm with support for multiple metrics including pass rate, Pass@3, and token efficiency
  • Authored extensive documentation covering API reference, grading algorithms, and extension patterns for custom executors and assertions
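For the Pass@3 metric mentioned above, one standard formulation is the unbiased estimator 1 − C(n−c, k)/C(n, k) over n runs with c passes; whether agent-eval uses this exact estimator is an assumption on my part.

```python
# Pass@k via the standard unbiased estimator: the probability that at
# least one of k runs sampled (without replacement) from n runs passes,
# given c of the n runs passed.
from math import comb

def pass_at_k(n: int, c: int, k: int = 3) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill an all-failure k-sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 runs of which 1 passed, Pass@3 = 1 − C(3,3)/C(4,3) = 0.75.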

Key Takeaways

  • Enables systematic A/B testing of agent instruction configurations
  • Provides reproducible evaluation through containerized isolation
  • Supports multiple configuration variants for controlled experimentation

Related Projects