Community Series: Evaluating AI Agents with Arize AI

Join our 3-part webinar series with Arize AI to explore agent architectures, evaluation techniques, and how agents can evaluate themselves.

By Data Science Dojo

Date and time

Wednesday, May 21 · 10 - 11am PDT

Location

Online

About this event

  • Event lasts 1 hour

Welcome to our community series with Arize AI! Across these three sessions, we will dive into the foundations of agent architectures, evaluation techniques, and advanced methods like agents evaluating themselves.

Part 1: What is an AI Agent?
What we will cover:
- Define an AI agent and its components: memory, planning, and tool use.
- Compare single-agent vs. multi-agent systems and their use cases.
- Explore AI agent architectures: Router-Tool, ReAct, hierarchical, and swarm-based.
- Examine real-world design patterns: task routing, tool chaining, and role specialization.
- Identify common failure modes like infinite loops and brittle planning (see the loop-guard sketch after this list).
- Discuss the need for improved evaluation methods, tracing, and observability.
- Watch a live Arize Phoenix demo on agent tracing, evaluation, and debugging.
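
To give a flavor of the architectures and failure modes covered in Part 1, here is a minimal sketch of a router-style agent loop with a hard step cap as an infinite-loop guard. All tool and function names are invented for this sketch; this is not the workshop code or the Arize Phoenix API.

```python
# Illustrative only: a toy router-style agent loop with a step cap as an
# infinite-loop guard. Tool and function names are invented for this sketch.

def calculator_tool(query: str) -> str:
    """Toy tool: evaluate a simple arithmetic expression (demo only)."""
    return str(eval(query, {"__builtins__": {}}))

def search_tool(query: str) -> str:
    """Toy tool: pretend to look something up."""
    return f"[stub search result for: {query}]"

TOOLS = {"calculator": calculator_tool, "search": search_tool}

def route(query: str, history: list) -> str:
    """Toy router: answer once a tool result exists, else pick a tool by surface features."""
    if history:
        return "final"
    return "calculator" if any(ch.isdigit() for ch in query) else "search"

def run_agent(query: str, max_steps: int = 3) -> str:
    """Router loop with a hard step cap, the simplest guard against runaway loops."""
    history = []
    for _ in range(max_steps):
        action = route(query, history)
        if action == "final":
            return f"answer: {history[-1]}"
        history.append(TOOLS[action](query))
    return "stopped: step budget exhausted (infinite-loop guard)"

print(run_agent("12 * 7"))                        # answer: 84
print(run_agent("who maintains OpenInference?"))  # answer: [stub search result ...]
```

The step budget is the simplest guard against the runaway-loop failure mode mentioned above; richer agents replace the toy router with model-driven planning.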

Part 2: How Do You Evaluate Agents?
What we will cover:
- Understand why traditional metrics like BLEU and ROUGE fall short for agent evaluation.
- Survey core agent evaluation methods: code-based, LLM-driven, human feedback, and ground truth comparisons.
- Write high-quality LLM evaluations aligned with real-world tasks.
- Build and benchmark LLM evaluations using ground truth data.
- Apply best practices for capturing telemetry and scaling evaluations.
- Learn how OpenInference standards enhance system interoperability and consistency.
- Hands-on exercise: evaluate a sample agent run with Arize Phoenix using code-based and LLM evaluations (a minimal code-based sketch follows this list).
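
As a taste of the hands-on exercise, here is a minimal code-based evaluation of a recorded agent run against ground truth. The run record and field names are invented for this sketch; the exercise itself uses Arize Phoenix.

```python
# Illustrative only: a deterministic, code-based check of a recorded agent run
# against ground truth. The record format and field names are hypothetical.

from dataclasses import dataclass

@dataclass
class AgentRun:
    question: str
    tool_calls: list      # tool names the agent invoked, in order
    final_answer: str

def evaluate_run(run: AgentRun, expected_tools: list, expected_answer: str) -> dict:
    """Code-based evaluation: exact tool-sequence match and normalized answer match."""
    return {
        "used_expected_tools": run.tool_calls == expected_tools,
        "answer_correct": run.final_answer.strip().lower() == expected_answer.strip().lower(),
    }

run = AgentRun(question="What is 12 * 7?", tool_calls=["calculator"], final_answer="84")
print(evaluate_run(run, expected_tools=["calculator"], expected_answer="84"))
# {'used_expected_tools': True, 'answer_correct': True}
```

Code-based checks like these cover the deterministic parts of a run; the session pairs them with LLM-driven evaluations for the open-ended parts.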

Part 3: Can Agents Evaluate Themselves?
What we will cover:
- Evaluate not just agent outputs, but the reasoning behind them.
- Measure convergence and reasoning paths for quality and efficiency.
- Assess collaboration and role effectiveness in multi-agent systems.
- Evaluate planning quality in hierarchical and crew-based agents.
- Explore agents-as-judges: self-evaluation, peer review, and feedback loops.
- Apply these methods to large-scale agentic AI systems.
- Watch a live demo of agent-as-judge or multi-agent evaluation with Arize Phoenix (a minimal agent-as-judge sketch follows this list).
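
For a rough idea of the agent-as-judge pattern, here is a minimal sketch in which a judge grades another agent's reasoning trace against a rubric. The judge is stubbed with a keyword heuristic so the example runs offline; in practice it would be an LLM call, and the rubric wording and trace format are illustrative, not the demo code.

```python
# Illustrative only: an "agent as judge" pattern where one model call grades
# another agent's reasoning trace against a rubric. The judge is a stub here.

RUBRIC = (
    "Score the trace 1 if every step follows from the previous one and the final "
    "answer is supported by the steps; otherwise score 0."
)

def stub_judge(prompt: str) -> str:
    """Stand-in for an LLM judge: passes traces that show a tool result before the answer."""
    return "1" if "tool result" in prompt.lower() else "0"

def judge_trace(trace: list, judge=stub_judge) -> int:
    """Format the rubric plus the trace into a prompt and parse the judge's score."""
    prompt = RUBRIC + "\n\nTrace:\n" + "\n".join(trace)
    return int(judge(prompt).strip())

good_trace = [
    "thought: the question needs arithmetic",
    "action: calculator('12 * 7')",
    "tool result: 84",
    "final answer: 84",
]
bad_trace = ["final answer: 84"]
print(judge_trace(good_trace))  # 1
print(judge_trace(bad_trace))   # 0
```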

Organized by

At Data Science Dojo, we're extremely passionate about data science. We've helped educate and train 10,000+ employees from more than 2,500 companies globally, including many tech leaders like Microsoft, Apple, and Facebook.