Agent-Eval: CI Evaluation Harness for Multi-Agent Development
Impact Summary
Built a CI evaluation harness that enables teams to detect behavioral regressions in agent instruction files. Provides systematic A/B testing for configurations across Claude Code, Cursor, and other AI development tools.
Role
Creator & Maintainer
Timeline
2026-Present
Scale
- Multi-agent environments
- CI/CD integration
- Matrix evaluation testing
Problem
Modern development workflows increasingly rely on multiple AI agents simultaneously—Claude for ideation, Cursor for precision editing, Copilot for completions. This creates an environment where tool coexistence is the norm, not tool choice. The challenge I observed was that agent instruction files (like CLAUDE.md or Cursor rules) become shared dependencies across these tools and repositories.
A single line change in these instruction files can silently degrade productivity, but there was no systematic way to detect this drift. Teams were relying on informal trust networks and anecdotal feedback to maintain these critical configuration files. This approach simply doesn’t scale, especially in enterprise environments where multiple teams contribute to shared agent configurations.
I needed a safety net—a way to run behavioral regression tests that could reliably detect when changes made things worse, even if they couldn’t perfectly measure “better.”
Approach
I designed agent-eval as a CI-native evaluation harness built around three core primitives: Tasks (what to test), Configs (how to configure the agent), and Graders (how to score results). This separation of concerns allows teams to mix and match evaluation scenarios without coupling test definitions to specific agent configurations.
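The three primitives can be pictured as independent data types that combine freely. This is an illustrative sketch only—the names `Task`, `Config`, and `Grader` mirror the concepts described above, not the actual agent-eval API:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """What to test: a prompt plus the fixture files the agent starts from."""
    name: str
    prompt: str
    fixture_files: dict[str, str] = field(default_factory=dict)

@dataclass
class Config:
    """How to configure the agent: instruction files injected into each run."""
    name: str
    instruction_files: dict[str, str] = field(default_factory=dict)  # e.g. {"CLAUDE.md": "..."}

@dataclass
class Grader:
    """How to score results: a callable from a workspace path to a score in [0, 1]."""
    name: str
    score: Callable[[str], float]
```

Because none of the three types references the others, any task can be scored by any grader under any configuration, which is what makes matrix evaluation possible without coupling test definitions to agent setups.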
Key Design Elements
- Isolation-first execution: Every evaluation runs in a fresh temporary directory with injected configuration, ensuring reproducibility. Docker containerization provides an additional layer of isolation for sensitive environments.
- Matrix evaluation: The harness supports running tasks across multiple configurations with multiple runs each, enabling statistical comparison between setups (baseline vs. skills-only vs. full configuration).
- Hybrid grading: I implemented both code-based assertions (tests pass, file contains specific content) and LLM-based rubric evaluation. This allows teams to capture both objective correctness and subjective quality measures.
- Pluggable executor architecture: The system supports multiple backends including the Claude CLI and Docker containers, making it adaptable to different deployment contexts.
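The isolation-first and matrix ideas above can be sketched in a few lines. This is a simplified illustration, not the actual EvalRunner implementation; `run_agent` stands in for whichever executor backend (Claude CLI, Docker) is plugged in:

```python
import shutil
import tempfile
from pathlib import Path

def run_isolated(task_files: dict[str, str],
                 config_files: dict[str, str],
                 run_agent) -> float:
    """Run one evaluation in a fresh temp dir with injected configuration."""
    workspace = Path(tempfile.mkdtemp(prefix="agent-eval-"))
    try:
        # Inject task fixtures first, then configuration files on top.
        for rel, content in {**task_files, **config_files}.items():
            path = workspace / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(content)
        return run_agent(workspace)  # executor backend returns a score
    finally:
        shutil.rmtree(workspace, ignore_errors=True)  # nothing leaks between runs

def run_matrix(task_files: dict[str, str],
               configs: dict[str, dict[str, str]],
               run_agent,
               runs: int = 3) -> dict[str, list[float]]:
    """Score every configuration `runs` times for statistical comparison."""
    return {
        name: [run_isolated(task_files, files, run_agent) for _ in range(runs)]
        for name, files in configs.items()
    }
```

Tearing down the workspace in a `finally` block is what makes the runs reproducible: a config that mutates files cannot contaminate the next evaluation.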
The configuration variant system was particularly important for A/B testing. I created four presets—baseline (no skills, no CLAUDE.md), skills-only, claude-md-only, and full—allowing teams to isolate the impact of different instruction types on agent behavior.
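The four presets reduce to maps of files to inject per run. A minimal sketch—the file contents are placeholders, only the preset names come from the design above:

```python
# Placeholder instruction content; real presets would load actual files.
SKILLS = {"skills/review.md": "# skill: code review checklist"}
CLAUDE_MD = {"CLAUDE.md": "# project instructions"}

PRESETS: dict[str, dict[str, str]] = {
    "baseline":       {},                        # no skills, no CLAUDE.md
    "skills-only":    {**SKILLS},
    "claude-md-only": {**CLAUDE_MD},
    "full":           {**SKILLS, **CLAUDE_MD},   # everything injected
}
```

Comparing scores between `baseline` and `skills-only`, or `claude-md-only` and `full`, isolates the marginal impact of each instruction type.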
Outcomes
- Systematic regression detection: Teams can now run `harness regression` to compare baseline and current results, catching behavioral drift before it impacts developers
- Reproducible evaluation: The containerized execution model ensures consistent results across CI environments and local development
- Scaffolding for experimentation: The `harness scaffold` command generates templates for new skill tests, lowering the barrier to expanding test coverage
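The hybrid grading described in the Approach section—code-based assertions blended with an LLM rubric score—can be sketched as follows. This is an illustrative composition under assumed names; `rubric_judge` stubs whatever LLM call performs the rubric evaluation:

```python
from pathlib import Path
from typing import Callable

def file_contains(rel_path: str, needle: str) -> Callable[[Path], bool]:
    """Code-based assertion: a workspace file exists and contains a string."""
    def check(workspace: Path) -> bool:
        target = workspace / rel_path
        return target.exists() and needle in target.read_text()
    return check

def hybrid_score(workspace: Path,
                 assertions: list[Callable[[Path], bool]],
                 rubric_judge: Callable[[Path], float],
                 weight: float = 0.5) -> float:
    """Blend objective assertion pass rate with a subjective rubric score."""
    objective = sum(a(workspace) for a in assertions) / max(len(assertions), 1)
    return weight * objective + (1 - weight) * rubric_judge(workspace)
```

The weighting lets a team decide how much subjective quality should count relative to hard correctness checks.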
Key Contributions
- Designed the three-tier architecture separating task definition, configuration management, and scoring logic for maximum flexibility
- Implemented the EvalRunner orchestration layer coordinating isolation, execution, and reporting across evaluation runs
- Built the artifact archiving system preserving evaluation outputs for post-hoc analysis and debugging
- Created comprehensive CLI tooling including dry-run validation, task discovery, and matrix execution commands
- Developed the regression comparison algorithm with support for multiple metrics including pass rate, Pass@3, and token efficiency
- Authored extensive documentation covering API reference, grading algorithms, and extension patterns for custom executors and assertions
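The Pass@3 metric mentioned above follows the standard unbiased Pass@k estimator; a regression check on top of it might look like this. The `tolerance` threshold is illustrative, not agent-eval's actual comparison algorithm:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled runs passes,
    given c passing runs out of n total (unbiased estimator)."""
    if n - c < k:
        return 1.0  # fewer failures than samples: a pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def is_regression(baseline: list[bool], current: list[bool],
                  k: int = 3, tolerance: float = 0.05) -> bool:
    """Flag a regression when current Pass@k drops below baseline Pass@k."""
    base = pass_at_k(len(baseline), sum(baseline), k)
    cur = pass_at_k(len(current), sum(current), k)
    return cur < base - tolerance
```

Computing the estimator analytically rather than by sampling keeps the comparison deterministic, which matters when the check gates a CI pipeline.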
Key Takeaways
- Enables systematic A/B testing of agent instruction configurations
- Provides reproducible evaluation through containerized isolation
- Supports multiple configuration variants for controlled experimentation
Related Projects
AWS Security Group Mapper: Visual Analysis Tool for Cloud Security
A Python tool for visualizing AWS security group relationships and generating interactive graphs to help understand complex security architectures.
Fighters Paradise: Modern Game Engine Reimplementation in Rust
A modern Rust reimplementation of the MUGEN 2D fighting game engine with full backward compatibility for existing community content.
AI Bridge MCP Server
A secure MCP server that enables Claude Code to call OpenAI and Gemini APIs through a hardened gateway with multi-layer security, rate limiting, and comprehensive logging.