Testing Conversational AI. Measuring chatbot performance beyond… | by Srini Janarthanam | Jan, 2021
Measuring chatbot performance beyond traditional metrics and software testing methods
How do you test a Conversational AI solution? How do you evaluate if your chatbot is fit to be deployed to face your customers? Out of all the types of Natural Language Processing systems like Machine Translation, Question Answering Systems, Speech recognition, Speech synthesis, Information Retrieval, etc., Conversational AI is the most challenging one to measure. Conversations are not one-shot tasks. They are multi-turn and whether a conversation succeeded or failed is not easily apparent. How then can we as conversational AI designers measure the quality of the systems we build?
Traditional software testing approaches are useful for checking that the deployed solution is robust and resilient. However, when it comes to how well the chatbot understands customers’ input and how good its responses are, software testing approaches cannot adequately cover the breadth of possible scenarios.
Metrics like precision, recall, and accuracy are used to evaluate the statistical and ML models that chatbots rely on for tasks like sentiment analysis, intent classification, emotion detection, and entity recognition. Although these metrics are great for measuring individual parts of the system, they do not measure the system holistically. In other words, a highly accurate intent classification model does not guarantee the quality of the conversation as a whole.
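To make the model-level metrics concrete, here is a minimal sketch that computes precision and recall for one intent, plus overall accuracy, from gold and predicted labels. The intent names and the sample labels are made up for illustration.

```python
def intent_metrics(gold, predicted, intent):
    """Return (precision, recall) for one intent and overall accuracy."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == intent and p == intent)
    fp = sum(1 for g, p in pairs if g != intent and p == intent)
    fn = sum(1 for g, p in pairs if g == intent and p != intent)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = sum(1 for g, p in pairs if g == p) / len(pairs)
    return precision, recall, accuracy


# Hypothetical evaluation set for a three-intent classifier.
gold = ["book", "book", "cancel", "greet", "cancel", "book"]
pred = ["book", "cancel", "cancel", "greet", "book", "book"]

precision, recall, accuracy = intent_metrics(gold, pred, "book")
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

Even a perfect score here only says the classifier picked the right intent for each isolated utterance; it says nothing about whether the conversation built on those intents went well.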
In addition to model-based metrics, there are holistic ones like task completion rate and time taken to complete a task. However, these metrics fail to capture some key undesired behaviours that may lead to disengaging conversations in the wild. Ask yourself this question: do shorter conversations mean engaging or productive conversations? We can’t really say. Appropriate metrics need to be identified based on the purpose of the conversation. While task-based systems (e.g. booking a ticket) might aim for short conversations, open-domain companions might do the opposite.
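As a rough illustration of the holistic metrics, the sketch below computes a task completion rate and an average conversation length from conversation logs. The log format is hypothetical, and the number of turns is used here as a stand-in for time taken.

```python
from statistics import mean

# Hypothetical conversation logs: whether the user's task was completed
# and how many turns the conversation took.
logs = [
    {"completed": True, "turns": 6},
    {"completed": True, "turns": 9},
    {"completed": False, "turns": 14},
    {"completed": True, "turns": 5},
]


def task_completion_rate(logs):
    """Fraction of conversations in which the task was completed."""
    return sum(1 for c in logs if c["completed"]) / len(logs)


def mean_turns_to_completion(logs):
    """Average length of the completed conversations, or None if there are none."""
    done = [c["turns"] for c in logs if c["completed"]]
    return mean(done) if done else None


rate = task_completion_rate(logs)
avg_turns = mean_turns_to_completion(logs)
print(f"completion rate={rate:.2f}, mean turns={avg_turns:.1f}")
```

Whether a lower average is good news depends on the purpose of the bot: it is a sensible target for ticket booking, and probably the wrong one for a companion bot.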
Besides traditional metrics, the quality of conversational AI solutions can be measured based on a number of user experience (UX) factors, including ease of use, how well it understands the user, how accurate and appropriate its responses are, how consistent it is, how trustworthy and authentic the responses are, and so on. Recently, many new quantitative and qualitative metrics have been suggested by researchers and designers working in the domain.
ChatbotTest is an open source evaluation framework for testing chatbots. It identifies 7 categories of chatbot design as follows.
Personality — is there a clear tone of voice that fits the conversation?
Onboarding — how are the users getting started with the chatbot experience?
Understanding — how wide is the chatbot’s capability to understand the user’s input?
Answering — are the chatbot’s responses to the user accurate and appropriate?
Navigation — how easy is it to navigate through the conversation without feeling lost?
Error Management — how good is the chatbot in repairing and recovering from errors in conversation?
Intelligence — how well does it use contextual information to handle the conversation intelligently?
Spread across these categories, the ChatbotTest guide provides a number of test cases for examining and evaluating any chatbot qualitatively. The framework prods us to ask a number of questions concerning the design of the chatbot, and the list of questions is extensive. One example is the ellipsis test.
And I like this one under Error Management: awareness of channel issues.
While the guide lists a number of scenarios and questions to ask, it doesn’t give us the right answers. The answers we expect are for us as designers to decide. In a sense, gathering these questions gives us a list of requirements for building a great conversational experience.
The Chatbot Usability Questionnaire (CUQ) consists of 16 statements concerning the usability of the chatbot. Respondents grade their agreement with each statement using Likert scale responses. The statements cover the chatbot’s personality, purpose, ease of use and other qualitative features. They are evenly divided between two polarities, positive and negative, in order to reduce bias.
Although similar to the ChatbotTest framework described above, the questionnaire is not as exhaustive. However, it provides a way to grade each response and calculate a score (out of 100) that each respondent gives the chatbot.
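Here is a sketch of how such a score could be calculated. It assumes a five-point scale (1 = strongly disagree, 5 = strongly agree) and that positive and negative statements alternate, starting with a positive one; both are assumptions, so check the published CUQ scoring instructions before relying on it.

```python
def cuq_score(responses):
    """Convert one respondent's 16 answers (each 1-5) to a 0-100 score.

    Assumes statements alternate: positive, negative, positive, ...
    """
    if len(responses) != 16 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected 16 answers, each between 1 and 5")
    positive = sum(responses[0::2])  # statements 1, 3, 5, ...
    negative = sum(responses[1::2])  # statements 2, 4, 6, ...
    # Positive statements contribute 0-4 points each; negative ones are reversed.
    raw = (positive - 8) + (40 - negative)  # ranges from 0 to 64
    return raw / 64 * 100


best = cuq_score([5, 1] * 8)   # agrees with every positive, rejects every negative
neutral = cuq_score([3] * 16)  # neutral on everything
print(best, neutral)
```

Reversing the negative statements is what lets the two polarities be added into a single number.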
CheckList is a comprehensive testing framework for NLP models that tests them on specific tasks and behaviours. It consists of a matrix of general linguistic capabilities and test types for each of them. This will help you ideate and generate a comprehensive list of test cases. There are different kinds of tests: the Minimum Functionality Test (similar to a unit test in software testing), the Invariance Test (perturbations that should not change the output of the model), and the Directional Expectation Test (perturbations with known expected results). Combine these test types with capabilities such as vocabulary, named entities, negation and many more to identify test cases that can be run on the model to find out whether it is working as expected. The combination of capabilities and test types helps generate test cases that could easily have been overlooked.
These tests typically answer questions like these: What happens when named entities are changed? Can we replace nouns with synonyms and get the same outcome? What happens when there are typos? How is the model outcome affected by adding words that negate the sentence? The framework comes with a tool that helps enumerate possible test input sentences given the type of test and capability.
The above image shows test cases generated for a sentiment model. For each capability (e.g. Vocab+POS, Robustness, NER, etc.) and type of test (e.g. MFT, INV, and DIR), a list of test descriptions (e.g. short utterances with neutral adjectives and nouns, etc.) has been identified. Next, for each of the test descriptions, test case utterances and expected outputs are generated. The test utterances can then be input into the model and the output compared to expected outputs to measure failure rates.
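The three test types can be sketched in plain Python without the CheckList library itself. The keyword-based sentiment "model" below is a deliberately naive, hypothetical stand-in, and the test sentences are invented; the point is only to show how each test type turns into a failure rate.

```python
POSITIVE = {"good", "great", "love", "helpful"}
NEGATIVE = {"bad", "terrible", "hate", "useless"}
NEGATIONS = {"not", "never"}


def toy_score(text):
    """Hypothetical stand-in model: +1 per positive word, -1 per negative word.
    A negation word flips the sign of the word immediately after it."""
    score, flip = 0, 1
    for word in text.lower().replace(".", "").split():
        if word in NEGATIONS:
            flip = -1
            continue
        if word in POSITIVE:
            score += flip
        elif word in NEGATIVE:
            score -= flip
        flip = 1
    return score


def toy_label(text):
    score = toy_score(text)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def failure_rate(cases, passes):
    """Fraction of test cases for which the pass condition does not hold."""
    return sum(1 for case in cases if not passes(*case)) / len(cases)


# MFT: simple inputs with a known expected label.
mft_cases = [("The bot is great", "positive"),
             ("The bot is useless", "negative"),
             ("The bot is not good", "negative")]
# INV: a perturbation (entity swap, typo) that should not change the label.
inv_cases = [("I love flying with Delta", "I love flying with United"),
             ("The agent was helpful", "The agent was helpfull")]
# DIR: adding a negative remark should not make the score go up.
dir_cases = [("The app is good", "The app is good. I hate the login page"),
             ("Service was bad", "Service was bad and the food was terrible")]

mft_rate = failure_rate(mft_cases, lambda text, expected: toy_label(text) == expected)
inv_rate = failure_rate(inv_cases, lambda a, b: toy_label(a) == toy_label(b))
dir_rate = failure_rate(dir_cases, lambda a, b: toy_score(b) <= toy_score(a))
print(mft_rate, inv_rate, dir_rate)
```

In this toy run the typo case trips the invariance test, which is exactly the kind of gap a plain accuracy score on clean test data would hide.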
Sensibleness and Specificity Average (SSA) is a metric proposed by Google and used to measure the performance of Google’s Meena chatbot against other similar systems. The metric measures how good the chatbot’s responses are in terms of how sensible each response is given the user’s utterance and how specific it is. While it is very basic compared to the other metrics and frameworks discussed here, the SSA metric highlights the fact that when the user types in an utterance, there are many ways your chatbot can respond. And how do you measure the quality of such a response?
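A minimal sketch of the calculation, assuming human raters give each response two yes/no labels (sensible, specific) and that a response that is not sensible cannot count as specific. The labels below are invented.

```python
def ssa(labels):
    """labels: one (sensible, specific) pair of booleans per chatbot response."""
    n = len(labels)
    sensibleness = sum(1 for sensible, _ in labels if sensible) / n
    # Assumption: only sensible responses can be counted as specific.
    specificity = sum(1 for sensible, specific in labels if sensible and specific) / n
    return (sensibleness + specificity) / 2


# Hypothetical rater labels for four responses.
labels = [(True, True), (True, False), (False, False), (True, True)]
score = ssa(labels)
print(score)
```

A bot that always answers "I don't know" would do well on sensibleness and badly on specificity, which is why the two are averaged.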
ACUTE-Eval is a novel metric that measures the quality of a chatbot by comparing its conversations to those of another chatbot. It takes two multi-turn conversations and asks the evaluator to compare one of the speakers (say Speaker A) in one conversation to one of the speakers (say Speaker B) in the other conversation. The human evaluators are then asked specific questions that make them choose between Speaker A and Speaker B: which of the two was more engaging, knowledgeable, interesting, etc. This metric was used recently by Facebook to evaluate its open-domain chatbot, Blender. They asked evaluators the following questions:
- Who would you prefer to talk to for a long conversation?
- Which speaker sounds more human?
By comparing the two speakers in two conversations laid out side by side, the anchoring effects of seeing conversations one after another are avoided.
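Pairwise judgments like these can be summarised as a win rate per question. The sketch below uses invented votes, where each entry records which speaker one evaluator picked.

```python
from collections import Counter

# Hypothetical judgments: which speaker each evaluator chose, per question.
votes = {
    "Who would you prefer to talk to for a long conversation?":
        ["A", "A", "B", "A", "B", "A"],
    "Which speaker sounds more human?":
        ["B", "A", "B", "B", "A", "B"],
}


def win_rates(votes, speaker="A"):
    """Fraction of evaluators who picked the given speaker, for each question."""
    return {question: Counter(choices)[speaker] / len(choices)
            for question, choices in votes.items()}


rates = win_rates(votes)
for question, rate in rates.items():
    print(f"{rate:.0%} chose Speaker A: {question}")
```

With only a handful of evaluators a difference like this could easily be noise, so a real study would also need enough judgments to test whether the preference is statistically significant.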
Cohesion and Separation measure the quality of the training examples fed into an intent classification model. Using sentence embeddings, the semantic similarity between utterances can be measured. Cohesion is the similarity between utterance examples within an intent; the higher the cohesion value, the better. Separation, on the other hand, captures how dissimilar the utterance examples belonging to any two intents are; the higher the separation value, the better. Although this measure does not directly capture the chatbot’s overall performance or that of its intent classification model, it is useful for judging the quality of the training examples fed into the model.
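One way to compute the two values is sketched below. It assumes cosine similarity as the similarity measure, cohesion as the average pairwise similarity within an intent, and separation as one minus the average similarity across two intents, so that higher is better for both. The two-dimensional vectors are toy stand-ins for real sentence embeddings.

```python
from itertools import combinations, product
from math import sqrt


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def cohesion(vectors):
    """Average pairwise similarity between the examples of one intent."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def separation(vectors_a, vectors_b):
    """One minus the average similarity between examples of two intents
    (an assumed definition, chosen so that higher means better separated)."""
    pairs = list(product(vectors_a, vectors_b))
    return 1 - sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy "embeddings" for the training utterances of two hypothetical intents.
book_flight = [(1.0, 0.0), (0.8, 0.6)]
cancel_flight = [(0.0, 1.0), (0.6, 0.8)]

book_cohesion = cohesion(book_flight)
cancel_cohesion = cohesion(cancel_flight)
intent_separation = separation(book_flight, cancel_flight)
print(book_cohesion, cancel_cohesion, intent_separation)
```

Low cohesion suggests an intent whose examples are too scattered, and low separation suggests two intents whose examples overlap and may need to be merged or rewritten.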
And finally, another interesting approach I read about recently is BotBattle: getting chatbots to talk to each other and letting the audience decide which one is better. This is very similar to ACUTE-Eval in the sense that two speakers are compared on various parameters, but unlike ACUTE-Eval, the two speakers are engaged in conversation with each other. Given the nature of the task, this approach suits open-domain chat rather than task-based conversations, where the speakers have clearly defined roles that could bias the evaluation.
This approach was used to compare Kuki, the popular Loebner Prize-winning chatbot, to Facebook’s BlenderBot. The chat was presented in a virtual reality environment where both bots had their own avatars. The winner is decided by audience vote.
So, there you go: a list of recent metrics and frameworks for evaluating Conversational AI models and systems. I am sure this is not an exhaustive list. As the domain of conversational AI evolves and the expectations of conversational experience change, more metrics will be invented. As these systems get widely adopted to handle different kinds of conversations, metrics need to be developed based on purpose as well. I hope this article prods you to ask more questions about how to properly test your system, and to seek answers too. Please do share your experiences using these or other new metrics in the comments section below.