DATA SCIENCE / AI

Chaos Testing for Chatbots: Simulating Customers to Evaluate AI Agents

📅 Thursday, April 16 🕐 12:30 - 13:00 (Santiago, GMT-4) 📍 Stream A 🌐 English
Most conversational AI demos look great in single-turn prompts. But real customers don’t behave like prompts: they interrupt, change goals midway, provide incomplete information, and ask follow-ups that force the system to stay consistent across multiple turns.

In this session, I’ll share how we built an AI Simulator to evaluate multi-turn conversational systems in a realistic way. Instead of testing a chatbot with isolated prompts, we simulate complete customer journeys (troubleshooting flows, account issues, configuration tasks) and automatically measure task completion, correctness, and recovery behavior when the agent makes mistakes.
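To make the idea concrete, here is a minimal sketch of such a simulation loop. Everything in it is illustrative: `agent_respond` is a hypothetical stand-in for the system under test, and the failure heuristic and success predicate are placeholders, not the actual implementation behind the talk.

```python
# Minimal sketch of a simulated customer journey driving an agent.
# `agent_respond` is a hypothetical stand-in for the chatbot under test;
# the error heuristic and success predicate are illustrative only.

from dataclasses import dataclass

@dataclass
class Turn:
    role: str  # "customer" or "agent"
    text: str

@dataclass
class JourneyResult:
    transcript: list      # full list of Turn objects
    task_completed: bool  # did the journey reach its goal?
    agent_errors: int     # crude count of visible failures

def agent_respond(history):
    """Hypothetical stand-in for the real system under test."""
    return "Could you share the account email on file?"  # placeholder reply

def simulate_journey(customer_turns, success_predicate):
    """Replay scripted customer turns against the agent and score the run."""
    transcript, errors = [], 0
    for customer_text in customer_turns:
        transcript.append(Turn("customer", customer_text))
        reply = agent_respond(transcript)
        transcript.append(Turn("agent", reply))
        if "i can't help" in reply.lower():  # toy failure signal
            errors += 1
    return JourneyResult(transcript, success_predicate(transcript), errors)

# A journey where the customer changes goals midway, then returns to the
# original problem: exactly the behavior single-turn tests never exercise.
journey = [
    "My router keeps dropping the connection.",
    "Actually, first: why was I billed twice this month?",
    "OK, back to the router. I already rebooted it.",
]
result = simulate_journey(
    journey,
    success_predicate=lambda t: any("refund" in turn.text.lower() for turn in t),
)
print(result.task_completed, result.agent_errors)
```

In a fuller version, the scripted turns would typically come from a persona-driven generator rather than a fixed list, but the control loop stays the same.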

You’ll learn how multi-turn simulation exposes failure modes that traditional evaluation misses (wrong tool usage, premature answers, policy violations, drift across turns, and “confidently wrong” resolutions). We’ll cover how to design customer personas, scenario templates, and success criteria, and how to turn simulation results into a production-grade metric suite that supports regression testing and reliable iteration.
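As a rough illustration of how per-scenario results can roll up into a regression-friendly metric suite, here is a hedged sketch. The field names, the tracked rates, and the baseline values are assumptions made for this example, not the exact schema presented in the talk.

```python
# Sketch: rolling per-scenario simulation results into a regression gate.
# Field names, metrics, and baseline values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ScenarioResult:
    scenario_id: str
    persona: str            # e.g. "impatient", "vague", "expert"
    task_completed: bool
    policy_violations: int
    wrong_tool_calls: int

def metric_suite(results):
    """Reduce a batch of scenario runs to a few tracked rates."""
    n = len(results)
    return {
        "task_completion_rate": sum(r.task_completed for r in results) / n,
        "policy_violation_rate": sum(r.policy_violations > 0 for r in results) / n,
        "wrong_tool_rate": sum(r.wrong_tool_calls > 0 for r in results) / n,
    }

def regression_gate(metrics, baseline):
    """Fail a release candidate if completion drops or error rates rise."""
    return (
        metrics["task_completion_rate"] >= baseline["task_completion_rate"]
        and metrics["policy_violation_rate"] <= baseline["policy_violation_rate"]
        and metrics["wrong_tool_rate"] <= baseline["wrong_tool_rate"]
    )

runs = [
    ScenarioResult("billing-01", "impatient", True, 0, 0),
    ScenarioResult("router-03", "vague", False, 0, 1),
]
baseline = {
    "task_completion_rate": 0.5,
    "policy_violation_rate": 0.1,
    "wrong_tool_rate": 0.6,
}
print(regression_gate(metric_suite(runs), baseline))  # True: no regression
```

The point of the gate is that simulation output becomes a pass/fail signal you can run on every change, the same way a unit-test suite guards ordinary code.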

If you're building agents, RAG assistants, or support chatbots, this talk will show you how to evaluate them like real systems, not like demos.