Interesting topic, as it is both hot and trending, and it seems just as easy to create challenges and confusion around it.
I believe there are two aspects that need consideration, and they are not clearly delimited.
The first would be testing the voice assistant itself, in terms of voice-to-text recognition. In GUI terms, think of it as testing that MFC (in the Windows scenario) draws buttons and windows as it should.
The second would be testing the conversational interface, given that the underlying system offers usable and reliable building blocks. In GUI terms, this means testing that our app uses the MFC controls as intended and without breaking the OS's rules.
The first scenario is something that should be covered by the likes of Google, Apple and Microsoft, i.e. those who provide the primary interface. I see this done with annotated audio files that are passed through the speech recognition engine, with or without added difficulty (e.g. background noise, variable speech rates, sudden drops). At this level one would test that "Ok Google" returns the correct intent value to the API client, regardless of the language, dialect or accent the user speaks with.
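To make that concrete, a harness for it could look roughly like the sketch below. The audio file names, the intent labels and the `recognize_intent()` wrapper are all hypothetical placeholders; in practice you would wire in the vendor's actual speech SDK or HTTP API there.

```python
# Minimal sketch of the "annotated audio" idea: every fixture pairs a recorded
# utterance with the intent a human annotator expects, including the "hard"
# variants (noise, fast speech, dropped audio). All names here are made up.
import unittest


def recognize_intent(audio_path: str) -> str:
    """Hypothetical wrapper around the platform's speech-to-intent engine."""
    raise NotImplementedError("plug in the vendor SDK or API call here")


FIXTURES = [
    ("ok_google_turn_on_lights_clean.wav",      "LIGHTS_ON"),
    ("ok_google_turn_on_lights_cafe_noise.wav", "LIGHTS_ON"),
    ("ok_google_turn_on_lights_fast.wav",       "LIGHTS_ON"),
]


class SpeechRecognitionTest(unittest.TestCase):
    def test_annotated_audio_returns_expected_intent(self):
        # One sub-test per fixture, so a single noisy sample failing
        # does not hide the results of the other recordings.
        for audio_path, expected_intent in FIXTURES:
            with self.subTest(audio=audio_path):
                self.assertEqual(recognize_intent(audio_path), expected_intent)


if __name__ == "__main__":
    unittest.main()
```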
The second level is testing that the skill works as intended, given clear instructions. Once the speech engine returns valid intents, testing the skill becomes a much simpler job of exploring the decision tree to determine how it behaves. This can be done programmatically, by injecting intent values or their corresponding tokens (e.g. commands), as in the sketch below.
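Here is a rough idea of what I mean by injecting intents: the `handle_intent()` entry point, the intent name and the slots are invented for illustration, and your skill's real handler and dialogue model would take their place.

```python
# Sketch of exercising the skill's decision tree by injecting intents directly,
# bypassing speech recognition entirely. Everything below is a hypothetical
# stand-in for the skill's real entry point and intent schema.
import unittest


def handle_intent(intent: str, slots: dict, session: dict) -> str:
    """Hypothetical skill entry point: maps an intent + slots to a spoken reply."""
    if intent == "OrderPizzaIntent":
        if "size" not in slots:
            # Missing slot: the skill should ask a follow-up question.
            session["awaiting"] = "size"
            return "What size would you like?"
        return f"Ordering a {slots['size']} pizza."
    return "Sorry, I didn't get that."


class SkillLogicTest(unittest.TestCase):
    def test_missing_slot_triggers_follow_up_question(self):
        session = {}
        reply = handle_intent("OrderPizzaIntent", {}, session)
        self.assertEqual(reply, "What size would you like?")
        self.assertEqual(session.get("awaiting"), "size")

    def test_complete_request_is_confirmed(self):
        reply = handle_intent("OrderPizzaIntent", {"size": "large"}, {})
        self.assertEqual(reply, "Ordering a large pizza.")


if __name__ == "__main__":
    unittest.main()
```

The point is that at this level the tests never touch audio at all: they walk the branches of the conversation (happy path, missing slots, unknown intents) by feeding the skill the same structured input the speech engine would have produced.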
Getting back to the original question, @Heather: what do you want to test? The speech-to-text engine flow (e.g. Siri/Google Assistant/Cortana), or the skill's logic and flow?