October 2019

Reading Between the Lines:

Checking the Accuracy of Chatbot Phrases
Just like humans, chatbots need to be introduced to language and trained over time, building on acquired knowledge to develop fluency. In both cases, repeated exposure is key to confirming that meaning has been understood and that the learner can draw on experience and intuition to process new vocabulary. But whereas a child’s progress is judged through daily conversations and encounters, a chatbot’s development must be evaluated through tests.

To demonstrate this concept, suppose we want to test whether the chatbot understands the concept of machine insurance. To confirm that the chatbot can recognize language about machine insurance without confusing it with other language in its training data, we need to write tests (in the form of phrases) that contain features typical of that language, and define reports with appropriate measures for assessing the chatbot’s precision. Quality measures for a chatbot can be defined in different ways, but ultimately you must answer one very important question: what do we mean when we say the chatbot learns to improve the classification of phrases?
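Such phrase tests can be kept as simple (phrase, expected category) pairs and run against the classifier in bulk. A minimal sketch, assuming a hypothetical `classify` function that returns the single best category name (the names and phrases below are illustrative, not from any particular framework):

```python
# Illustrative test phrases with the category each one should map to.
TEST_PHRASES = [
    ("I want to pay insurance for my new machine", "Machine insurance"),
    ("How does the hydraulic drive of this machine work?", "Machine technology"),
]

def run_tests(classify, tests):
    """Run each phrase through the classifier and return the accuracy.

    `classify` is assumed to take a phrase and return one category name.
    """
    passed = sum(1 for phrase, expected in tests if classify(phrase) == expected)
    return passed / len(tests)
```

A report built on top of `run_tests` could then track accuracy over time as new training phrases are added.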

The answer is not so simple. Let’s assume we have the following categories defined for the chatbot:

  • Machine insurance
  • Machine technology
  • Type of the machine
  • Cost of credit

If the user types the sentence ‘I want to pay insurance for my new machine’, the chatbot does not necessarily classify it into only one category. Ideally, the classifier assigns a very high score to a single category, but the same sentence can also be matched to the other categories with lower scores, e.g.:

  Classification Score   Category Name
  91%                    Machine insurance
  41%                    Machine technology
  30%                    Type of the machine
  17%                    Cost of credit

The expression ‘I want to pay insurance for my new machine’ has been classified as Machine insurance with a score of 91%, while the line underneath shows that the same sentence matches Machine technology with a score of 41%. The remaining two categories match the sentence with even smaller scores. Let us assume that the scores assigning a phrase to a category lie in the interval (0; 1), so 91% corresponds to 0.91.
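In code, such a result is naturally a mapping from category names to scores, from which the best match is picked. A minimal sketch using the scores from the table above (the dictionary shape is an assumption about the classifier’s output, not a specific API):

```python
# Hypothetical classifier output for the example sentence, with scores
# expressed in the (0; 1) interval described above.
scores = {
    "Machine insurance": 0.91,
    "Machine technology": 0.41,
    "Type of the machine": 0.30,
    "Cost of credit": 0.17,
}

# Pick the best-matching category and its score.
best_category, best_score = max(scores.items(), key=lambda kv: kv[1])
print(best_category, best_score)  # Machine insurance 0.91
```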

Therefore, looking at the results shown above, we can conclude that the chatbot is confident in classifying this phrase, because the difference between the first and second classification scores is 50 percentage points.

Other situations that can cause problems during the classification of a phrase include:

  1. too small a difference between the first two categories assigned
  2. the correct phrase’s value being too low
  3. uniform distribution of the category classification, indicating the chatbot is unsure how to classify a phrase
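The three problem cases above can be checked mechanically against a score distribution. A minimal sketch, where the thresholds are illustrative assumptions rather than fixed rules:

```python
def diagnose(scores, margin=0.3, floor=0.5, spread=0.1):
    """Flag the three problem cases for a category-score mapping.

    margin: minimum gap required between the top two scores (issue 1)
    floor:  minimum acceptable score for the best match (issue 2)
    spread: if the best and worst scores differ by less than this,
            the distribution is nearly uniform (issue 3)
    """
    ranked = sorted(scores.values(), reverse=True)
    issues = []
    if ranked[0] - ranked[1] < margin:
        issues.append("top two categories are too close")
    if ranked[0] < floor:
        issues.append("best match's score is too low")
    if ranked[0] - ranked[-1] < spread:
        issues.append("nearly uniform distribution")
    return issues
```

For the example table above, `diagnose` reports no issues, while a flat distribution such as scores of 0.30, 0.28, 0.27 and 0.26 triggers all three flags.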

By testing a chatbot, not only can one train it and increase its level of comprehension, but one can also establish a systematic approach to handling new language, resulting in a chatbot that performs at a more advanced level with improved comprehension and communication skills.

Checking the accuracy of the chatbot’s phrase classification is a crucial aspect of developing a chatbot’s proficiency, and just like in teaching children, enables it to learn on its own and build on its knowledge base.

Read other articles in the Technically Speaking series.


Michał Walerowski

Business & Product Development Manager +48 505 243 086
