School of Computer Science – JLR Challenge # 1 Technical Workshop
An Introduction to AI Agent Benchmarks (1st Offering)
Presenter: Soroush Ziaeinejad
Date: Friday, October 24th, 2025
Time: 1:00 PM
Location: Workshop Space, 4th Floor - 300 Ouellette Ave., School of Computer Science Advanced Computing Hub
Abstract:
As AI systems move from simply answering questions to acting as agents that can plan, use tools, and interact with real environments, we need new ways to measure their abilities. Traditional benchmarks test only the final answer, whereas agent benchmarks focus on how well an AI can solve problems step by step, correct its mistakes, and adapt to different tasks. Each benchmark targets a different skill: HumanEval tests whether an agent can write correct code, MINT checks how an agent uses tools to solve problems, GAIA evaluates reasoning across text, images, and real-world data, and SWE-bench Lite measures how well an agent can understand and fix real software issues. This presentation will explain these benchmarks, show how they differ, and discuss what they reveal about the strengths and weaknesses of current AI agents.
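To make the idea of grading correctness concrete, the sketch below shows, in simplified form, how functional-correctness benchmarks in the HumanEval family score a model: the generated code is executed against unit tests, and a task counts as solved only if every assertion passes. This is an illustrative sketch, not the official HumanEval harness; the names run_functional_test, candidate_code, and test_code are placeholders chosen for this example.

def run_functional_test(candidate_code: str, test_code: str) -> bool:
    """Execute candidate code and its unit tests; return True only if all assertions pass."""
    namespace = {}
    try:
        exec(candidate_code, namespace)   # define the model-generated function
        exec(test_code, namespace)        # run assert-based tests against it
        return True
    except Exception:
        return False                      # any error or failed assertion means the task is not solved

# Example: a toy task in the spirit of HumanEval.
candidate_code = "def add(a, b):\n    return a + b\n"
test_code = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(run_functional_test(candidate_code, test_code))  # True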
Workshop Outline:
• Why Agent Benchmarks Are Needed
• Core Concepts in Agent Evaluation
• Overview of HumanEval, MINT, GAIA, and SWE-bench Lite
• Comparison of benchmark goals and methodologies
• Hands-on demonstration: running and interpreting results from one benchmark (a small scoring sketch follows this outline)
• Discussion and Q&A
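For the hands-on portion, results from code benchmarks such as HumanEval are usually reported as pass@k: the estimated probability that at least one of k sampled solutions passes the tests. The sketch below implements the standard unbiased estimator from the HumanEval paper; the function name pass_at_k and the example numbers are illustrative only.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n sampled solutions, c of them correct."""
    if n - c < k:
        return 1.0                        # fewer than k failing samples, so any k draws include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples per task, 3 pass the unit tests -> pass@1 is about 0.3.
print(round(pass_at_k(n=10, c=3, k=1), 3))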
Prerequisites:
• Basic understanding of AI or machine learning concepts
• Familiarity with large language models (LLMs) and their applications
Biography:
Soroush is a Ph.D. candidate and research assistant in Computer Science at the University of Windsor. He received his bachelor’s degree in Software Engineering and his master’s degree in AI, specializing in computer vision and video processing. His current research focuses on privacy and security in AI, with a particular emphasis on distributed and collaborative learning systems.