The call for talks for Nerdearla Argentina and Mexico is now open. Submit your proposal 🇦🇷🇲🇽

Chaos Testing for Chatbots: Simulating Customers to Evaluate AI Agents

  • Speaker: Priyan Pattnayak
  • Year: 2026
  • Country: Chile
  • Event: Nerdearla Chile 2026
  • Track: Data Science / AI
  • Language: English

Most conversational AI demos look great in single-turn prompts. But real customers don't behave like prompts: they interrupt, change goals mid-way, provide incomplete information, and ask follow-ups that force the system to stay consistent across multiple steps.

In this session, I'll share how we built an AI Simulator to evaluate multi-turn conversational systems realistically. Instead of testing a chatbot with isolated prompts, we simulate complete customer journeys (troubleshooting flows, account issues, configuration tasks) and automatically measure task completion, correctness, and recovery behavior when the agent makes mistakes.

You'll learn how multi-turn simulation exposes failure modes that traditional evaluation misses: wrong tool usage, premature answers, policy violations, drift across turns, and "confidently wrong" resolution. We'll cover the design of customer personas, scenario templates, and success criteria, and how to turn simulation results into a production-grade metric suite that enables regression testing and reliable iteration. If you're building agents, RAG assistants, or support chatbots, this talk will show you how to evaluate them like real systems, not like demos.
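The persona-and-scenario approach described above can be sketched as a small harness. Everything here is hypothetical and simplified: `Persona`, `Scenario`, `run_simulation`, and the canned `toy_agent` are placeholder names standing in for the talk's actual framework, and a real setup would drive an LLM-backed agent and use richer success criteria than keyword matching.

```python
"""Minimal sketch of a customer-simulation harness for multi-turn
chatbot evaluation. All names are illustrative, not the speaker's API."""
from dataclasses import dataclass


@dataclass
class Persona:
    name: str
    style: str  # e.g. "terse", "rambling", "changes goals mid-way"


@dataclass
class Scenario:
    goal: str                 # what the simulated customer wants
    turns: list               # scripted customer utterances
    success_keywords: list    # naive success criterion for this sketch


def toy_agent(history):
    """Stand-in for the system under test: returns canned answers."""
    last = history[-1].lower()
    if "reset" in last:
        return "I've sent a password reset link to your email."
    if "thanks" in last:
        return "You're welcome!"
    return "Could you tell me more about the issue?"


def run_simulation(persona, scenario, agent):
    """Play the scenario turn by turn and score task completion."""
    history, transcript = [], []
    for utterance in scenario.turns:
        history.append(utterance)
        reply = agent(history)
        history.append(reply)
        transcript.append((utterance, reply))
    # Completion: did any agent reply satisfy the success criterion?
    completed = any(
        all(k in reply.lower() for k in scenario.success_keywords)
        for _, reply in transcript
    )
    return {
        "persona": persona.name,
        "completed": completed,
        "num_turns": len(transcript),
        "transcript": transcript,
    }


scenario = Scenario(
    goal="recover account access",
    turns=["Hi, I can't log in.", "Can you reset my password?", "Thanks!"],
    success_keywords=["reset link"],
)
result = run_simulation(Persona("impatient user", "terse"), scenario, toy_agent)
print(result["completed"], result["num_turns"])  # → True 3
```

Running many persona-scenario pairs like this and aggregating the per-run dictionaries is what turns the simulator into a regression-testable metric suite: a change that drops the completion rate on a fixed scenario set fails the build.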

Tags: data-science-ai, ai, chatbot, evaluation, customer-simulation, testing

About Priyan Pattnayak

Priyan Pattnayak
Senior Principal Scientist - Oracle Cloud AI

Priyaranjan (Priyan) Pattnayak is a Senior Principal Data Scientist at Oracle Cloud working on agentic and conversational AI systems for enterprise support. He builds evaluation infrastructure for multi-turn conversational experiences, including customer-simulation frameworks that measure completion and correctness at scale. His work focuses on making AI assistants reliable in production through structured evaluation, failure attribution, and system-level design. He has published or filed over 30 papers and patents in top-tier venues and is an active researcher in the NLP community.
