Open Source

Agent-Eval: CI Evaluation Harness for Multi-Agent Development

Impact Summary

Built a CI evaluation harness that enables teams to detect behavioral regressions in agent instruction files. Provides systematic A/B testing for configurations across Claude Code, Cursor, and other AI development tools.

Role

Creator & Maintainer

Timeline

2026-Present

Scale

  • Multi-agent environments
  • CI/CD integration
  • Matrix evaluation testing

Problem

Modern development workflows increasingly rely on multiple AI agents simultaneously—Claude for ideation, Cursor for precision editing, Copilot for completions. This creates an environment where tool coexistence is the norm, not tool choice. The challenge I observed was that agent instruction files (like CLAUDE.md or Cursor rules) become shared dependencies across these tools and repositories.

A single line change in these instruction files can silently degrade productivity, but there was no systematic way to detect this drift. Teams were relying on informal trust networks and anecdotal feedback to maintain these critical configuration files. This approach simply doesn’t scale, especially in enterprise environments where multiple teams contribute to shared agent configurations.

I needed a safety net—a way to run behavioral regression tests that could reliably detect when changes made things worse, even if they couldn’t perfectly measure “better.”

Approach

I designed agent-eval as a CI-native evaluation harness built around three core primitives: Tasks (what to test), Configs (how to configure the agent), and Graders (how to score results). This separation of concerns allows teams to mix and match evaluation scenarios without coupling test definitions to specific agent configurations.
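The separation can be sketched as three independent value types. This is a hypothetical Python rendering of the primitives; the actual names and fields in agent-eval may differ.

```python
# Hypothetical sketch of the three core primitives as dataclasses;
# real agent-eval field names may differ.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    """What to test: a prompt plus the files the agent starts from."""
    name: str
    prompt: str
    fixture_files: dict[str, str] = field(default_factory=dict)

@dataclass
class Config:
    """How to configure the agent: instruction files injected per run."""
    name: str
    instruction_files: dict[str, str] = field(default_factory=dict)

@dataclass
class Grader:
    """How to score results: a callable from agent output to a 0..1 score."""
    name: str
    score: Callable[[str], float]

# Because the primitives are decoupled, any task can be paired with any
# config and any grader without rewriting test definitions:
task = Task("add-endpoint", "Add a /health endpoint to the API")
config = Config("baseline", instruction_files={})
grader = Grader("contains-route", lambda out: 1.0 if "/health" in out else 0.0)
```

Keeping graders as plain callables means a new scoring strategy never requires touching task or config definitions.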

Key Design Elements

  • Isolation-first execution: Every evaluation runs in a fresh temporary directory with injected configuration, ensuring reproducibility. Docker containerization provides an additional layer of isolation for sensitive environments.

  • Matrix evaluation: The harness supports running tasks across multiple configurations with multiple runs each, enabling statistical comparison between setups (baseline vs. skills-only vs. full configuration).

  • Hybrid grading: I implemented both code-based assertions (tests pass, file contains specific content) and LLM-based rubric evaluation. This allows teams to capture both objective correctness and subjective quality measures.

  • Pluggable executor architecture: The system supports multiple backends including Claude CLI and Docker containers, making it adaptable to different deployment contexts.
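The hybrid grading idea can be illustrated with a small sketch. I am assuming here that hard assertions gate the score while the rubric refines it; the combination rule and helper names below are illustrative, not agent-eval's actual implementation (the LLM judge is stubbed with a heuristic so the sketch runs offline).

```python
# Illustrative hybrid grader: objective assertions gate the score,
# a (stubbed) LLM rubric supplies the subjective component.
def code_assertions_pass(workdir_files: dict[str, str]) -> bool:
    # Objective check: the agent must have written the expected content.
    readme = workdir_files.get("README.md", "")
    return "## Usage" in readme

def llm_rubric_score(transcript: str) -> float:
    # Stand-in for an LLM judge call; a trivial heuristic keeps the
    # sketch self-contained.
    return 1.0 if "because" in transcript else 0.5

def hybrid_grade(workdir_files: dict[str, str], transcript: str) -> float:
    # Objective failure zeroes the score regardless of subjective quality;
    # otherwise the rubric score is used.
    if not code_assertions_pass(workdir_files):
        return 0.0
    return llm_rubric_score(transcript)
```

Gating on the hard assertions keeps the LLM judge from inflating scores for outputs that are fluent but objectively wrong.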

The configuration variant system was particularly important for A/B testing. I created four presets—baseline (no skills, no CLAUDE.md), skills-only, claude-md-only, and full—allowing teams to isolate the impact of different instruction types on agent behavior.
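The four presets can be sketched as a simple lookup of injected instruction files (file contents here are placeholders):

```python
# The four configuration variants as a lookup; contents are placeholders.
SKILLS = {"skills/refactor.md": "..."}
CLAUDE_MD = {"CLAUDE.md": "..."}

PRESETS = {
    "baseline":       {},                          # no skills, no CLAUDE.md
    "skills-only":    {**SKILLS},
    "claude-md-only": {**CLAUDE_MD},
    "full":           {**SKILLS, **CLAUDE_MD},
}

# A/B isolation: diffing results between "baseline" and "skills-only"
# attributes any behavioral change to the skills files alone.
```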

Outcomes

  • Systematic regression detection: Teams can now run harness regression to compare baseline and current results, catching behavioral drift before it impacts developers
  • Reproducible evaluation: The containerized execution model ensures consistent results across CI environments and local development
  • Scaffolding for experimentation: The harness scaffold command generates templates for new skill tests, lowering the barrier to expanding test coverage
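The regression check in the first bullet can be sketched as a comparison of per-task pass rates against a stored baseline. The tolerance value and metric shape are assumptions for illustration, not agent-eval's actual algorithm.

```python
# Minimal regression check between a stored baseline and the current run,
# assuming per-task pass rates in [0, 1]; the 0.05 tolerance is illustrative.
def detect_regressions(baseline: dict[str, float],
                       current: dict[str, float],
                       tolerance: float = 0.05) -> list[str]:
    """Return tasks whose pass rate dropped by more than `tolerance`."""
    return sorted(
        task for task, rate in current.items()
        if task in baseline and baseline[task] - rate > tolerance
    )

baseline = {"add-endpoint": 0.90, "fix-test": 0.80}
current  = {"add-endpoint": 0.60, "fix-test": 0.78}
# Only add-endpoint dropped beyond the tolerance; fix-test's 2-point
# dip is within noise.
```

A tolerance band like this is what lets the check reliably flag "worse" without needing to measure "better" precisely.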

Key Contributions

  • Designed the three-tier architecture separating task definition, configuration management, and scoring logic for maximum flexibility
  • Implemented the EvalRunner orchestration layer coordinating isolation, execution, and reporting across evaluation runs
  • Built the artifact archiving system preserving evaluation outputs for post-hoc analysis and debugging
  • Created comprehensive CLI tooling including dry-run validation, task discovery, and matrix execution commands
  • Developed the regression comparison algorithm with support for multiple metrics including pass rate, Pass@3, and token efficiency
  • Authored extensive documentation covering API reference, grading algorithms, and extension patterns for custom executors and assertions
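For the Pass@3 metric mentioned above, one standard formulation is the unbiased estimator 1 − C(n−c, k)/C(n, k) over n runs with c passes; whether agent-eval uses this exact estimator is an assumption on my part.

```python
# Pass@k via the standard unbiased estimator: the probability that at
# least one of k runs sampled (without replacement) from n runs passes,
# given c of the n runs passed.
from math import comb

def pass_at_k(n: int, c: int, k: int = 3) -> float:
    if n - c < k:
        return 1.0  # too few failures to fill an all-failure k-sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 runs of which 1 passed, Pass@3 = 1 − C(3,3)/C(4,3) = 0.75.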

Key Takeaways

  • Enables systematic A/B testing of agent instruction configurations
  • Provides reproducible evaluation through containerized isolation
  • Supports multiple configuration variants for controlled experimentation

Related Projects