How would you test voice recognition?

When I first got my phone, the “Okay Google” functionality was turned on. I thought I’d give it a try. After saying “Okay Google”, it didn’t seem to be able to recognise anything else I was saying. I gave it a few more tries on different occasions but eventually turned it off. I haven’t used anything like it since, because I can type faster than I can repeat what I’ve said numerous times.

As I see more and more advertisements for Alexa, Cortana, Siri, etc., I wonder: how would you test something like this? Where would you start?


Hi Heather, nice to emeet :wink:
Generally speaking, you want an automated test lab that can emulate the injection and recording of audio into/from the voice interface (Siri/Alexa/in app/etc.). On top of that, an ideal solution would allow you to transform strings of text to speech and vice versa. This way, you can create a complete script with a “dictionary” of strings as input and anticipated outcome for validation. This solution would allow you to accomplish a number of things:

  • test audio on any device, app,…
  • scale your testing reliably (even testers who can’t write code can create CSVs with “dictionaries”)
  • test ideal speech, but also emulate noisy environments such as those in real life
  • test both functionality (correctness of the response) and responsiveness of the chatbot
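
The “dictionary” idea above could be sketched roughly like this. It is only a minimal sketch under assumptions, not the solution from the article: the CSV columns (`utterance`, `expected`) are made up for illustration, and `fake_assistant` is a hypothetical stand-in for the real chain of text-to-speech, audio injection, recording and speech-to-text that a test lab would provide.

```python
import csv
import io

# Hypothetical "dictionary": what to say, and a phrase the reply should contain.
DICTIONARY_CSV = """utterance,expected
what's the weather like,forecast
set a timer for five minutes,timer set
"""

def load_cases(csv_text):
    """Read (utterance, expected) rows from CSV text that a non-coder could maintain."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def run_case(case, speak_and_listen):
    """Speak one utterance and check that the reply contains the expected phrase."""
    reply = speak_and_listen(case["utterance"])
    return case["expected"].lower() in reply.lower()

def fake_assistant(utterance):
    """Stub for TTS, audio injection, recording and STT (assumption: replace with real tooling)."""
    canned = {
        "what's the weather like": "Here is the forecast for today.",
        "set a timer for five minutes": "Sorry, I didn't catch that.",
    }
    return canned.get(utterance, "")

def run_all(csv_text, speak_and_listen):
    """Run every row of the dictionary and map each utterance to pass (True) or fail (False)."""
    return {c["utterance"]: run_case(c, speak_and_listen) for c in load_cases(csv_text)}

if __name__ == "__main__":
    for utterance, passed in run_all(DICTIONARY_CSV, fake_assistant).items():
        print("PASS" if passed else "FAIL", "-", utterance)
```

Keeping the cases in a CSV means the same script can be re-run against clean audio and noisy audio simply by swapping the function passed in as `speak_and_listen`.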

Anyway, this is probably a little more than you asked for…happy to chat more.
I described such a solution in this article:

Hope this helps…


Hi Amir, nice to emeet you too :grin:

This is great!

What sort of tools do you use for this? It’s something I’ve not delved too much into myself.

In your article you mention using solutions such as Selenium and Appium for automation. I’m thinking there’s a huge difference between “normal” automation and using those tools for voice recognition testing. Am I wrong? Or do you have any go to places where a newbie could learn about how to get started with this?

In the speech functions section of your article, you talk about testing for voice imperfections. Based on my experience above, I’m particularly interested in that. How does one go about getting a bank of such recordings available? Is there a service where you can access snippets of such voice imperfections or is it something you have to create yourself?

I can’t recall who mentioned the voice testing they did at Brighton last year, but they recommended using clips from Monty Python because of all the silly accents used, which would allow you to see how extreme a voice it can handle.

Accents and similar-sounding words (the Two Ronnies “fork handles” vs “four candles” comes to mind) would be my first place to go when testing voice. From there, I would want to know: if it doesn’t hear certain letters or syllables in the words you give it, does it get you to try again or does it just guess?


I’ve not been involved in super complicated voice skills, but being aware of how it all hangs together is also key.

In Alexa, for example, you trigger an intent such as “weather” with different utterances: “What’s the weather like”, “is it raining”.

Coming up with a good and distinctive range of utterances is key from a development team perspective. I say distinctive because if your skill has more than one intent (a feature it can do), you do not want those utterances to sound too similar.
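
One rough way to spot utterances that are too close across intents is a quick text-similarity check. The sketch below is only an illustration under assumptions: the intent names and utterances are invented, the 0.8 threshold is arbitrary, and character-level similarity from Python’s `difflib` is just a crude proxy for “sounds similar” (a phonetic comparison would be closer to what the device actually hears).

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical interaction model: intent name -> sample utterances.
INTENTS = {
    "WeatherIntent": ["what's the weather like", "is it raining"],
    "TrainIntent": ["is it running", "when is the next train"],
}

def similarity(a, b):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def confusable_pairs(intents, threshold=0.8):
    """Return utterance pairs from different intents whose similarity is at or above the threshold."""
    flagged = []
    for (intent_a, utts_a), (intent_b, utts_b) in combinations(intents.items(), 2):
        for a in utts_a:
            for b in utts_b:
                score = similarity(a, b)
                if score >= threshold:
                    flagged.append((intent_a, a, intent_b, b, round(score, 2)))
    return flagged

if __name__ == "__main__":
    for intent_a, a, intent_b, b, score in confusable_pairs(INTENTS):
        print(f"{intent_a} '{a}' is close to {intent_b} '{b}' (score {score})")
```

With the sample data, “is it raining” and “is it running” are flagged as a risky pair, which is the kind of clash worth saying out loud to the device during manual testing.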

There are some outsourcing companies funded by Amazon that promise you the ability to test with a variety of real users. You can upload your skill code to them and select what sort of user group you want: mostly male vs female, age, etc.
They then use the skill for you and record how they got on in the testing interface, and you have access to all of the results and responses.
We haven’t used their service yet but it sounds like a good user testing idea.

Accents are definitely an issue but so is environmental noise.
Unexpected pauses and saying “quit”, “cancel” or “help” are really useful for finding errors.

We test our code thoroughly using unit tests, but you need to be able to find the line between testing Alexa itself and testing the code of your skill.
Having someone from Amazon on hand to help identify whether you could be using better utterances or whether your code is wrong helps massively when starting out.

I am yet to try record-and-playback techniques, but this would definitely be possible.
From a manual tester POV, I make a list of everything I will say and how, and record the responses. Recording with a mic on your computer definitely helps with this too, as otherwise you will quickly get into a muddle and won’t realise what caused a bug.
Being systematic is key.

Sorry I am rambling. It is the end of the day. I may post more concrete examples.


Interesting topic, as it is both hot and trending, and it easily creates challenges as well as confusion.
I believe there are two aspects of it that need consideration, and they are not clearly delimited.
The first is testing the voice assistant itself, in terms of voice-to-text recognition. In GUI terms, think of it as testing that MFC (in a Windows scenario) draws buttons and windows as it should.
The second is testing the conversational interface, given that the system offers usable and reliable building blocks. In GUI terms, this means testing that our app uses the MFC controls as intended and without breaking the guest OS rules.

The first scenario is something that should be covered by the likes of Google, Apple and MSFT, those that provide the primary interface. I see this done with annotated audio files that are passed through the speech recognition engine, with or without added difficulty (e.g. background noise, variable speech rates, sudden drops). At this level one would test that “Ok Google” returns the correct intent value to the API client, regardless of the language, dialect or accent the user speaks with.
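
To make the “annotated audio files” idea concrete, here is a minimal sketch of scoring an engine’s transcripts against their annotations with word error rate (a standard speech recognition metric: word-level edit distance divided by the number of reference words). The file names, conditions and the `fake_recognise` stub are hypothetical; a real run would call the speech engine under test instead.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    previous = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1] / len(ref)

# Hypothetical annotated set: (audio file, condition, reference transcript).
ANNOTATED = [
    ("clip_01.wav", "clean", "four candles"),
    ("clip_01_noise.wav", "background noise", "four candles"),
]

def fake_recognise(audio_file):
    """Stub for the speech engine under test (assumption: swap in a real call)."""
    return {"clip_01.wav": "four candles", "clip_01_noise.wav": "fork handles"}[audio_file]

def score(annotated, recognise):
    """Map each audio file to the word error rate of its recognised transcript."""
    return {name: word_error_rate(ref, recognise(name)) for name, _condition, ref in annotated}

if __name__ == "__main__":
    for name, wer in score(ANNOTATED, fake_recognise).items():
        print(f"{name}: WER {wer:.2f}")
```

Grouping the scores by condition (clean, noisy, fast speech and so on) would show where recognition degrades, rather than giving a single pass or fail.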

The second level is testing that the skill works as intended, given clear instructions. Given that the speech engine returns valid intents, testing the skill becomes a much simpler job of exploring the decision tree to determine how it behaves. This can be done pragmatically, by injecting intent values or corresponding tokens (e.g. commands).
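
As an illustration of injecting intent values, the sketch below feeds intents straight into a toy skill handler and checks each branch of its decision tree, with no audio involved. Everything here is an assumption for the example: `handle_intent`, `WeatherIntent` and the response format are invented, and only the `AMAZON.*` names correspond to Alexa’s standard built-in intents.

```python
def handle_intent(intent, slots=None):
    """Toy skill logic: map an already-recognised intent to a response."""
    slots = slots or {}
    if intent == "WeatherIntent":
        city = slots.get("city")
        if not city:
            # Missing slot: ask a follow-up question and keep the session open.
            return {"speech": "Which city?", "end_session": False}
        return {"speech": f"Here is the weather for {city}.", "end_session": True}
    if intent == "AMAZON.HelpIntent":
        return {"speech": "You can ask me about the weather.", "end_session": False}
    if intent in ("AMAZON.StopIntent", "AMAZON.CancelIntent"):
        return {"speech": "Goodbye.", "end_session": True}
    return {"speech": "Sorry, I can't help with that.", "end_session": False}

# Each case injects an intent directly, bypassing speech recognition entirely:
# (intent, slots, text expected in the speech, expected end_session flag).
CASES = [
    ("WeatherIntent", {"city": "Brighton"}, "weather for Brighton", True),
    ("WeatherIntent", {}, "Which city?", False),
    ("AMAZON.HelpIntent", {}, "ask me about the weather", False),
    ("AMAZON.StopIntent", {}, "Goodbye", True),
    ("AMAZON.CancelIntent", {}, "Goodbye", True),
    ("UnknownIntent", {}, "can't help", False),
]

def run_cases(cases):
    """Return the cases whose response did not match the expectation."""
    failures = []
    for intent, slots, expected_text, expected_end in cases:
        result = handle_intent(intent, slots)
        if expected_text not in result["speech"] or result["end_session"] != expected_end:
            failures.append((intent, slots, result))
    return failures

if __name__ == "__main__":
    failures = run_cases(CASES)
    print("all branches OK" if not failures else failures)
```

Because the cases are plain data, the stop, cancel and help paths can be covered for every state of the skill without anyone having to speak to a device.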

Getting back to the original question, @Heather: what do you want to test? The speech-to-text engine flow (e.g. Siri/Google Assistant/Cortana) or the skill’s logic and flow?


Why not both? :grin:


I tested Kinect for Xbox One and 360 ages ago… It was indeed a simpler system, since it would only recognise specific inputs and not variations of those, but back then, if the voice command was recognised more than 90% of the times you tried it, that was enough. It was interesting to test that commands do not overlap (so that issuing one does not trigger another action), to try commands in a different language… we all know how to be creative :slight_smile:


Having spent 2 hours doing end-to-end testing with our Alexa skill yesterday, I found there are also some interesting things you can ask your dev to implement, or do yourself if you are more of a coder than I am.

For example, we had a debug mode toggle, as some of the responses from the skill were quite long. So, to test decision trees more quickly, we could toggle the debug mode on and get a short response such as “Response 1”.
And if a yes or no question needed to be answered, we would get “Response 2 - yes or no”.
You can do this for all types of responses and re-prompts, and it helped us find places where incorrect responses were given (we had a list of the responses and their ids, and they also had a name such as “story 1” or “story re-prompt 1”).

This helped me cut down the test time of the general decision tree. You obviously still need to test with the correct responses toggled on to make sure it sounds natural and engaging.
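
A debug toggle along these lines might look something like the sketch below. It is not our actual skill code: the response table, ids and the `render` function are hypothetical, and only the short “Response N” format follows what is described above.

```python
# Hypothetical response table: id -> name, full spoken text, and whether it asks a yes/no question.
RESPONSES = {
    1: {"name": "story 1",
        "text": "Once upon a time, in a house at the edge of the woods, there lived a very curious cat.",
        "yes_no": False},
    2: {"name": "story re-prompt 1",
        "text": "Would you like to hear what happened next? Please say yes or no.",
        "yes_no": True},
}

def render(response_id, debug=False):
    """Return the full spoken text, or a short id-based stand-in when debug mode is on."""
    response = RESPONSES[response_id]
    if not debug:
        return response["text"]
    short = f"Response {response_id}"
    if response["yes_no"]:
        # Keep the prompt type audible so the tester knows an answer is expected.
        short += " - yes or no"
    return short

if __name__ == "__main__":
    for response_id, response in RESPONSES.items():
        print(response["name"], "->", render(response_id, debug=True))
```

Hearing only the id makes it quick to tick each step of the decision tree off against the list of responses, before switching debug mode off again to check the real wording.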

Also, like all other work, when we found a bug the dev would create a unit test to re-create the bug and then fix it. I really liked this approach. As skills at the moment tend to be a lot smaller, we tend to pair on testing and fixing.
I also found that the environment you test in can make a massive difference to whether the skill understands you or not. Rooms with echoes, or having the device too close to a wall, make a difference. So there are environmental factors too.


I did that for testing Pay TV systems a while ago. In short, I wanted to make sure that when I was tuned to a channel it was the channel I was expecting. And since I could master the content of the channels using video and audio test feeds, it was easier.
So, as usual, I took an out-of-the-box approach:
a) I found a text-to-speech application that could run on a computer.
b) I found and configured a speech recognition application that ran on a computer (it was ‘mouse it’ at that time).
c) I used a tool similar to AutoIt (it was Silktest, but Ranorex or EggPlant would do) that could easily manipulate all the applications.
d) I had to manipulate some hardware automatically, so I used a Stamp microcontroller and actuators, with the whole thing under the control of Silktest.

To recognize speech coming out of the TV, Silktest opened Notepad and the speech-to-text application dumped the recognized words into Notepad. It was then easy to grab the Notepad content and process it.

In order to ‘say’ something you have several options. If you have recorded sentences, you can use VLC or any player to play those files into the microphone or audio input device. To ensure proper quality I had a PCI sound card and hooked its audio out up to its audio in. If it is generated text, you can have AutoIt type the text into Notepad and have the text-to-speech application read it. The only trick here is to properly handle muting of the audio in, to avoid Larsen (feedback) and the spoken text looping back into the recognizer. It works well for English but gets bad with some other languages, such as French.


Areas where Voice recognition can be tested:

  • Automated phone systems:- Many top software testing companies use phone systems that help direct the caller to the correct department. If you have ever been asked something like “Say or press number 2 for support” and you said “2”, you used voice recognition.
  • Google voice search:- Google voice search is a service that allows you to search and ask questions on your computer, tablet and phone.
  • Siri:- Apple’s Siri is another good example of voice recognition that helps answer questions on Apple devices.
  • Car Bluetooth:- For cars with Bluetooth or hands-free phone pairing, you can use voice recognition to give commands such as “call my boss” to make calls without taking your eyes off the road.
  • Car controllers:- Some cars provide a voice-controlled engine start service, so various voice combinations should be used to test it.

Testing Solutions:

  • On entering the correct option, the correct result should be audible to the user for automated voice systems.
  • All options should work correctly and redirect to the correct department.
  • The user should get the correct results on getting a voice response.
  • By using slang, we can check the responsiveness of the system.
  • All platforms should be covered, such as ‘Win7/Win8/Win10/Win11’, and browsers such as ‘IE9/IE11/FF/Chrome/Safari/Edge’.
  • User voices, such as male and female voices or any particular sound, should be accepted as input.
  • Various languages should be checked to test that voice recognition is working correctly.
  • Animal sounds should also be accepted as input.
  • Rate of speech should be checked. For example, slow and fast speech should be accepted by the system correctly.

Hope this information will be very helpful for you.


Given that there have been (anecdotal) instances of digital assistants being activated by phrases spoken in radio or tv programmes, I would want to check how well the device or software handles discrimination between the registered user and “accidental” activations. Relying completely on outsourced or other third party test suites here might not detect the issue.

(Not being a user of these things, are there recommendations for volume sensitivity or the user recognition learning curve?)


This could be interesting/useful for making Voice Recognition a bit better

An interesting blog post I read today about the usability and use cases of these devices


Someone should build up a sound bite library of different genders, dialects and so on… If they need a Black Country accent, after a few beers, I can help (my accent becomes thicker, and Google struggles).

I’ve resigned myself to the fact that it’s never going to understand my accent :laughing:


I just came across, has anyone used that for testing voice?

Or are there other tools around worth looking into?


My son’s Vector struggles to understand him…

My limited experience to date is as follows.

  • Only by building a skill did I gain an understanding of how I might test it. This is absolutely critical for all testers. We did this through mobbing, in a mob of 6: 2 testers, 2 devs and 2 creative people. But make up your own.

  • Mobbing allowed us all to learn how a skill is put together: the basics, what comes out of the box with the skill, what Amazon own and what we do.

  • The online videos and how-tos were usually out of date. They would be, wouldn’t they, because it’s one of the fastest-moving platforms (referring to Alexa).

  • Don’t duplicate what Amazon have already tested. As Simon Stewart said at TestBash about testing browsers: don’t test the browser functionality, you can bet Google and Firefox are all over that; focus on your functionality. It’s the same for Alexa and the others. When testing with the Alexa close to the wall, every single skill will act the same. Test where you will add value for your skill.

  • Understand the tool sets at your disposal; the Alexa developers kit comes with loads of freebies to help. Automation will be your friend, as usual. The moment that voice is changed to text, let automation help.

  • Being part of the journey from the beginning, as we were (shaping the intents, doing all the building, linking the pipeline together), we were able, as testers, to understand exactly what to test, and how as well. It helped us build our edge cases, and we learnt to code with our devs, as only one person coded at a time and we all took our turns.