How to test genAI

How do you test genAI as QA, when an application / product is handed over and QA is asked to test it and give feedback?
Is there a plug-and-play way to test these models?
There are manual approaches, like asking domain-related questions and judging the answers by hand, but I don’t think that’s the right way.
Is there an established guideline, tool, benchmark, metric, etc. to follow?
Let’s take an example:
Say genAI reports on the quality of speech during customer calls, such as the tone of speech. As QA, how do we validate such a use case, i.e. check whether the model is reporting the correct tone parameters?

2 Likes

There could be a whole host of ways to approach this.
First question: What does your genAI do?
Then: what is it used for? Who should use it? Who should not use it?

2 Likes

@hananurrehman thanks for the response.
It’s generic, but for the sake of discussion let’s choose two use cases:

Case 1:
Generative AI can generate automated responses based on historical customer interactions, ensuring consistency and accuracy in customer service. It can also assist in sentiment analysis, enabling businesses to gauge customer satisfaction levels and take appropriate actions to address any issues or concerns.

Case 2:
GenAI can generate personalized recommendations and offers. It can understand customer preferences and create relevant content and product suggestions. For example, an e-commerce platform could use GenAI to create images and text for personalized product recommendations based on a customer’s browsing history and previous purchases, enhancing their shopping experience.

Now as QA, we have to validate these models.

1 Like

I’m currently building a Python + YAML based framework for creating example scenarios like this. The general idea is that you write a prompt for one LLM to have a conversation with the agent, and then at the end you:

  • Ask yes/no or multiple-choice questions about the interaction and check that they were answered correctly - e.g. about the tone of the interaction, or whether it thinks the interaction was successful (yes/no). There’s a rough sketch of this after the list.
  • Check that JSON snippets were returned exactly as they should be in response to questions.
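
Roughly, the first kind of check looks like this - a rough sketch of the idea rather than the actual code, where ask_llm stands in for whatever callable you use to get a completion from the judge model:

# Sketch of the "ask the judge a closed question" check.
# ask_llm is whatever callable you use to query a judge model.
def judge_says(ask_llm, transcript: str, question: str, expected: str) -> bool:
    """Ask a second LLM a closed question about the conversation and
    compare its one-word answer against the expected one."""
    answer = ask_llm(
        "Here is a conversation between a customer and an agent:\n"
        f"{transcript}\n\n{question}\nAnswer with a single word."
    )
    return answer.strip().rstrip(".").lower() == expected.lower()

# Demo with a fake judge that always answers "No", just to show the shape:
print(judge_says(lambda prompt: "No.",
                 "CUSTOMER: Can I order a pizza?\nSERVER: Sorry, no pizza.",
                 "Did the barista let you order a pizza? Answer yes or no.",
                 "no"))   # -> True

Keeping the judge’s question closed (yes/no or multiple choice) means a plain string comparison is enough to decide pass/fail.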

For your cases - case 1 and 2 - I think you might benefit from a data scientist helping with the evaluation. It sounds like you want to curate an evaluation set, do some human labeling, and check the LLM’s performance against that. I’m also not sure GenAI is the right tool for this in the first place (it can do it… it just might not be the best tool).
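
Concretely, the core of that evaluation is agreement between the model’s outputs and a human-labeled sample. A minimal sketch with made-up labels, just to show the shape of the comparison:

# Made-up data: call_id -> label, one dict from human reviewers,
# one from the genAI model being evaluated.
human_labels = {
    "call-001": "positive",
    "call-002": "negative",
    "call-003": "neutral",
}
model_labels = {
    "call-001": "positive",
    "call-002": "neutral",
    "call-003": "neutral",
}

matches = sum(
    1 for call_id, label in human_labels.items()
    if model_labels.get(call_id) == label
)
print(f"Agreement with human labels: {matches / len(human_labels):.0%}")  # 67%

In practice you’d want a much larger sample and per-class metrics (precision/recall), but agreement against human labels is the core of it.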

2 Likes

@hitchdev: your framework looks good - I’d really like to see it once it’s complete.

1 Like

I’m done. The code is open source and available here: hitchstory/examples/llm at master · hitchdev/hitchstory · GitHub

There’s a root story which has the prompt:

Basic barista:
  given:
    agent instructions: |
      You are a barista selling only following items:
      
      * flat white
      * cappuccino coffee
      * black coffee
      * single espresso
      * double espresso
      * brownie
      
      If a customer asks questions, tries to order something not
      on the list produce JSON of the form:
      
      {"message": "{{ your answer }}"}
      
      If a customer orders one of these, produce JSON
      of the form:
      
      {"purchase": "{{ product chosen by customer }}"}

Then there are stories to test this prompt:

Espresso purchase:
  based on: basic barista
  steps:
  - speak:
      message: Can I order an espresso?
      expect json: |-
        {"purchase": "single espresso"}

Try to order a pizza:
  based on: basic barista
  steps:
  - speak:
      message: Can I order a pizza?
      expect answer:
        question: Did the barista let you order a pizza? Answer yes or no.
        response: no

Try to order a cookie:
  based on: basic barista
  steps:
  - speak:
      message: Can I order a cookie?
      expect answer:
        question: Did the barista let you order a cookie? Answer yes or no.
        response: no

The top story is deterministic - there can only be one JSON output given the question. Anything else fails.
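
An exact-match check like that can be as simple as parsing both sides and comparing the parsed data, so formatting differences don’t matter but the values must match exactly (a simplified illustration, not necessarily what the framework does internally):

# Simplified sketch of an exact-JSON expectation check.
import json

def json_matches(actual_reply: str, expected: str) -> bool:
    try:
        return json.loads(actual_reply) == json.loads(expected)
    except json.JSONDecodeError:
        return False

assert json_matches('{"purchase": "single espresso"}',
                    '{ "purchase" : "single espresso" }')
assert not json_matches('{"purchase": "double espresso"}',
                        '{"purchase": "single espresso"}')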

The bottom two can have a range of different acceptable responses, so I’m asking a different LLM if the response matches what I expect. For example:

$:~/hitch/story/examples/llm$ ./run.sh bdd cookie
RUNNING Try to order a cookie in /src/hitch/story/buy-coffee.story ... 
CUSTOMER : Can I order a cookie?
SERVER : {"message": "We only sell brownies"}
SUCCESS in 3.4 seconds.

$:~/hitch/story/examples/llm$ ./run.sh bdd cookie
RUNNING Try to order a cookie in /src/hitch/story/buy-coffee.story ... 
CUSTOMER : Can I order a cookie?
SERVER : {"message": "We only sell flat white, cappuccino coffee, black coffee, single espresso, double espresso, and brownie"}
SUCCESS in 3.6 seconds.

You can also run the whole lot together:

$:~/hitch/story/examples/llm$ ./run.sh regression
RUNNING Espresso purchase in /src/hitch/story/buy-coffee.story ... SUCCESS in 4.2 seconds.
RUNNING Try to order a cookie in /src/hitch/story/buy-coffee.story ... SUCCESS in 2.8 seconds.
RUNNING Try to order a pizza in /src/hitch/story/buy-coffee.story ... SUCCESS in 2.3 seconds.

I’m still tweaking around the edges, but it’s usable. Any and all feedback is welcome.