🤖 Day 28: Build your own AI Tools

Day 28 - We’re nearly there! Thanks for sticking with this challenge!

Earlier in the challenge you explored various uses of Large Language Models (LLMs) such as ChatGPT and Bing Copilot but concerns were raised about the privacy of the data we share with these tools. We also came up against scenarios where these models just lack the context to generate reasonable output.

These are very real concerns and are faced by many companies that are adopting AI and especially generative AI, such as LLMs. Luckily there are approaches we can adopt to address these.

Today’s Task

So, in today’s task, we will investigate some of these approaches. This is a huge field, so we will only be exploring them at a high level. We will do this through a set of 4 walkthroughs that focus on addressing:

  • Data Privacy through the use of locally hosted LLMs
  • Improved Contextual Behaviors through fine-tuning and context retrieval.

Don’t worry, you won’t need to write any code - the code for each walkthrough is provided for you but you will have the opportunity to make some small modifications to experiment with the approach.

The walkthroughs make use of Google’s Colaboratory (Colab for short), so you will need a Google Account to access and run the code. We are using this for educational purposes only.

Task Steps

  1. Learn about Colaboratory - Watch the short Introduction to Colab video to understand how to use Colab.

  2. Complete a walkthrough - Select one or more of the walkthroughs to complete
    a. Access the repository for today’s task at GitHub - BillMatthews/mot-30-days-ai-in-testing
    b. Read the ReadMe file to better understand what each walkthrough covers.
    c. Pick one (or more) of the walkthroughs that interest you and select the link on the ReadMe page. This will open a Colaboratory Notebook containing the walkthrough.
    d. Read the information and follow the instructions in the notebook to complete the walkthrough.
    e. You shouldn’t need to change any of the code provided but most notebooks have options for you to experiment with the inputs.

  3. Reflect - Review the reflection questions at the end of the walkthrough

  4. Share your insights - Consider sharing your insights with the community by responding to this post with:
    a. Which walkthrough you choose and why
    b. How well you think this approach addresses your concerns about data privacy and/or context awareness.
    c. What opportunities does the approach provide you and your team?

Why Take Part

  • Taking it to the next level: It’s easy to use tools such as ChatGPT, but teams will quickly reach the limits of its usefulness. By taking part in today’s tasks you will become aware of how we can push past these limitations and start to innovate within the field of AI in Testing.



Hello all,

I have picked up Walkthrough 1 to run my own local LLM for chat. I have downloaded the LLaVA 1.5 model (3.97 GB) and tried to run it with this command:

.\llava-v1.5-7b-q4.llamafile -ngl 9999

However, it didn’t work and I’m facing the error below:

Edit: On the second run of the same command the error:

Any help on how to resolve this error and go ahead?

(I tried to search for some solutions but all the resolutions sound so technical and I’m not able to understand them.)


Hi, sorry you’ve had an error on this walkthrough. The pre-built versions should work on “most machines” according to their ReadMe.

I’ll have a look to see if I can reproduce it - it might be a bad image.

Do you get the same error if you try a different model? I used the “Mistral-7B-Instruct” model recently and it ran ok.


Thanks much Bill!

No Bill, I haven’t tried other models as this download itself took so much time due to my poor bandwidth.


Hi my fellow testers, here is my response to today’s challenge. It was a particularly interesting one today, thanks @billmatthews.

Learn about Colaboratory

I watched the video and I’m glad that I did, as I had never heard of Google Colab before.

Complete a walkthrough

For this I chose walkthrough 3 as I am really interested in giving an AI tool some data to make it more aware of my specific context. I was able to follow through and run the code without issue, upload some example data to my Google Drive and then ask the AI chat tool some questions about it.

Reflect - Review the reflection questions at the end of the walkthrough

Did the RAG based LLM use context from your uploaded documents?

  • Yes, it was able to access the data that I asked it about in the chat interface

Do you think that the RAG based LLM included relevant content into the prompt?

  • Yes, I compared the values it was returning with the original input and they looked identical


Assuming your team wants to incorporate an LLM into the Testing toolbelt, what are the use cases for using this approach in your team? What documents might you load into the Vector Database so that prompts have access to them?

  • For my use case, I would want to upload real but non-confidential user data so that I can ask the AI tool to generate a certain set of test data to use in my tests. This would help me throughout my testing; for example, I could see how the graphs look with real data in them.
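The retrieval step that walkthrough 3 demonstrates can be sketched in a few lines of plain Python. This is an illustrative toy only - real RAG systems score relevance with embedding similarity against a vector database rather than word overlap, and the function names here are mine, not the walkthrough’s:

```python
# Toy sketch of the RAG pattern: retrieve the most relevant snippet from a
# small "document store" and stuff it into the prompt sent to the LLM.
# Word-overlap scoring stands in for real embedding similarity.

def retrieve(query: str, documents: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    query_words = set(query.lower().split())
    return max(documents, key=lambda d: len(query_words & set(d.lower().split())))

def build_prompt(query: str, documents: list[str]) -> str:
    """Prepend the retrieved context to the user's question."""
    context = retrieve(query, documents)
    return (
        "Use the following context to answer the question.\n"
        f"Context: {context}\n"
        f"Question: {query}"
    )

docs = [
    "Order 1042 shipped on 12 March and contains three items.",
    "The refund policy allows returns within 30 days of delivery.",
]
print(build_prompt("When did order 1042 ship?", docs))
```

The key idea is that the model never needs to be retrained on your data: the relevant snippet is simply injected into each prompt at query time.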

For anyone who encounters similar errors with the pre-built LlamaFile images, I think this is being caused by a memory issue. If your machine has a GPU, the model will attempt to use it, but if it doesn’t have enough memory it will fail to launch.
If you don’t have a GPU it should just use your CPU and standard memory; most machines have more standard memory but less GPU memory.

If you encounter this I would suggest trying either the “TinyLlama-1.1B”, “Rocket-3B” or “Phi-2” models - these are very small and so the performance might not be as good.

This walkthrough is to give you a taste of what is possible so if you wanted to deploy an LLM within your local infrastructure you’d probably want to pick a larger model (or fine-tune your own) and run it on a machine of an appropriate size.


Thanks for the suggestion Bill, I’ve successfully downloaded this model & had a li’l chat with it.

As already mentioned, the performance isn’t that good, as it’s very small :baby:t4:. I have tried a couple of things with it. One of the chats is below:

For a single prompt, it didn’t provide output for all the things asked, but when I broke the paragraph into separate sentences, each as a separate prompt, it gave better output.


In what ways did running an LLM locally differ from using a service such as Chat-GPT?

This way of running LLMs locally can offer more control & privacy over our tasks.


Happy Thursday all

I went with walkthrough-1-running-local-llms.ipynb

I have to say that using this I was thinking, what is this all about :smiley:
I was talking to a llama and a bear and wondering if someone slipped acid in my lunch.

I have to say I am not sure about this at all. I asked it some very short and basic questions and it really laboured to answer.
I know it is learning, but for a simple question like ‘What time is it in Brisbane, Australia?’ it needed prompting. It came back with 10:30, so I had to re-ask: am or pm? It took some time to answer pm.
Even Bing in Edge would have given me the answer quicker.

But keep an open mind :smiley:

I have now closed the app and tried to delete it, but it wouldn’t let me.
So I did some searching and it eventually closed.

This highlights the need for due diligence. I downloaded it as it is Mozilla.
But why the delayed response? What was it doing for those minutes after I closed it?

I would suggest that hosting our own tools will need to be controlled by our IT people.


Glad to hear you got a model running!

The Tiny-Llama has 1.1 Billion parameters, whereas the Llama2 model (that Tiny-Llama is based on) has 7 Billion parameters (for the smallest form) and 70 Billion (for the largest form).

You can think of the number of parameters as the capacity that the model has to learn. In general, but not always, larger models will learn more from the data they’re trained on and generally produce better output.

To put the compute in perspective - a 7 Billion parameter model generally needs about 16 GB of memory (preferably GPU) to run. When you get to models with 70 Billion parameters you need something on the order of 168 GB of GPU memory! That’s a lot and quite expensive!

See Calculating GPU memory for serving LLMs | Substratus.AI for details of one way to calculate this.
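That estimate can be sketched as a small calculation. The function name is mine; the 4-bytes-per-fp32-parameter and 20% overhead figures follow the formula described in the linked article:

```python
# Rough memory estimate for serving an LLM:
#   memory_GB ~= params_in_billions * 4 bytes / (32 / bits) * 1.2
# where `bits` is the precision the weights are stored at and 1.2 adds
# ~20% overhead for things beyond the raw weights.

def estimate_gpu_memory_gb(params_billions: float, bits: int = 16) -> float:
    """Approximate serving memory for a model at a given precision."""
    bytes_per_param = 4 / (32 / bits)  # 4 bytes at fp32, scaled by precision
    return params_billions * bytes_per_param * 1.2  # 1.2 = 20% overhead

print(estimate_gpu_memory_gb(7))          # ~16.8 GB for a 7B model at 16-bit
print(estimate_gpu_memory_gb(70))         # ~168 GB for a 70B model at 16-bit
print(estimate_gpu_memory_gb(7, bits=4))  # ~4.2 GB with 4-bit quantization
```

This also shows why quantized models (like the q4 llamafiles in the walkthrough) are so much easier to run locally: dropping from 16-bit to 4-bit weights cuts the memory requirement to a quarter.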


Addressing Data Privacy and Context Awareness

The approach of using Colaboratory for this walkthrough seems to be quite suitable for addressing concerns about data privacy and context awareness. Since Colaboratory is a cloud-based Jupyter notebook environment, it provides the advantage of running code in a controlled, secure, and isolated environment. Users can upload their datasets directly to Colab, which keeps the data processing and analysis contained within the platform.

Regarding context awareness, the walkthrough likely includes explanations and code examples on how to handle sensitive data appropriately. Python libraries often have built-in functions and methods for data anonymization, encryption, or simply removing identifiable information.
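As a rough illustration of that kind of pre-processing, here is a minimal redaction pass using only the standard library. A real project would likely use a dedicated anonymization library, and the patterns here are deliberately simplistic:

```python
import re

# Toy sketch: mask email addresses and long digit runs (account numbers,
# phone digits) before sharing text with an LLM. Illustrative only - real
# anonymization needs far more robust patterns and review.

def redact(text: str) -> str:
    """Replace obvious identifiers with placeholder tokens."""
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\b\d{6,}\b", "[NUMBER]", text)
    return text

print(redact("Contact jane.doe@example.com about account 12345678."))
# -> Contact [EMAIL] about account [NUMBER].
```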


Considering the limited time at the end of the Q1 development cycle, I just tried to get more information about Google Colab, and here is what I learned from ChatGPT about why Google Colab is commonly used for AI:
1. Free GPU/TPU: Google Colab provides free access to Graphical Processing Units and Tensor Processing Units, which are essential for training deep learning models efficiently.
2. Pre-installed Libraries: Colab comes with many popular Python libraries pre-installed, including TensorFlow, PyTorch, Keras, OpenCV, and more, making it easy to start working on AI projects without worrying about installation issues.
3. Collaboration: Colab allows for easy collaboration with others by sharing notebooks and enabling real-time editing.
4. Cloud-based: Since Colab runs in the cloud, you don’t need to worry about hardware limitations, and you can access your work from anywhere with an internet connection.
5. Integration with Google Drive: Colab seamlessly integrates with Google Drive, allowing you to store your data and notebooks in the cloud.
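If you want to check which kind of runtime Colab has given you, one simple (illustrative) approach is to look for the NVIDIA driver tool, which is present only on GPU runtimes - in a notebook cell you could also just run `!nvidia-smi` directly:

```python
import shutil
import subprocess

# Sketch: detect whether this runtime has an NVIDIA GPU by checking for
# the `nvidia-smi` tool on PATH. Illustrative helper, not part of Colab's API.

def has_gpu() -> bool:
    """True if the NVIDIA driver tool is available, i.e. a GPU runtime."""
    return shutil.which("nvidia-smi") is not None

if has_gpu():
    # Show the GPU model and memory as reported by the driver.
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
else:
    print("No GPU detected - change the runtime type in Colab if you need one.")
```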


Howdy my fellow QA Engineers

I embarked on a task on Day 28, choosing Walkthrough 1: Running Local Large Language Models (LLMs). I successfully integrated the local LLM into my app and prompted it to generate some test cases based on a UI snapshot.

Here’s a breakdown of my analysis:

  • Test Case Accuracy: I found that 50% of the test cases generated by the LLM were accurate. This indicated that the LLM was able to correctly interpret the UI snapshot and generate relevant test cases half of the time.
  • Over Prediction: I noticed that the LLM over-predicted some information. This meant that the LLM generated test cases that weren’t directly relevant to the UI snapshot or included extra information that wasn’t necessary.
  • Security: I felt secure with the results and prompts as they were generated locally. This was likely because running the LLM locally ensured data privacy and control over the model’s inputs and outputs.

In conclusion, integrating a Local LLM for generating test cases was a valuable exercise. Despite some inaccuracies and over predictions, the security of using the local model was a significant advantage. It’s important to remember that the performance of LLMs can vary based on the specific task and the model’s training data. Therefore, continual monitoring and adjustment of the model is necessary to improve its performance. This was a great learning experience for me! :blush:


Manoj: can you please summarise the screenshot

Llama: The image displays a cell phone screen with the word "Tild" written on it. Below this text, there is an option to sign in and create or manage passwords for secure access.

Manoj: can you please write some test cases for the above ui screen

Llama: 1. Test case: Verify that the "Sign In" button is clickable and leads to a successful login process when clicked by a user.
2. Test case: Check if there are any visual errors or inconsistencies in the UI, such as missing text, incorrect color schemes, or unresponsive buttons.
3. Test case: Ensure that users can successfully sign up for an account and create their password without facing technical issues.
4. Test case: Confirm that the "Sign In" button is functional when a user has already signed in to their account by clicking on it after logging in.
5. Test case: Check if there are any security measures, such as CAPTCHA or two-factor authentication, implemented during the login process for added protection against unauthorised access.
  • I tried to attach the screenshot, but it did not upload.

Thank you


I have a bit of experience using GPT4All. I happen to use Linux and they happen to provide a Flatpak, which made installation a breeze.

I only needed to download a model, put it in the right directory and tell GPT4All to use it. I don’t remember exactly which model I was using - it was one of the options recommended to me at work, and I just rolled with it. It was on the smaller side, probably around a 5-10 GB file.

There are a few visible differences when comparing local GPT4All with Bing Copilot. First and foremost, GPT4All was much, much slower - on my machine, it typed maybe a word a second, sometimes slower. Obviously the performance will depend on the hardware you have, and I do not have a dedicated graphics card.

Bing Copilot feels better at conversation, more human. This might be improved if I had used a larger or newer model.

GPT4All (and all local models, really) are resource hogs - running the model reserved multiple GB of RAM (I think 10-20 GB) and used the CPU while generating responses. I imagine similar resources are being used when you use Copilot; it’s just that they are used on some server managed by Microsoft instead of your computer, so you don’t really notice it.

GPT4All was unable to improve answers based on web search results. Depending on your query and requirements, this might be a good or bad thing.

Finally, GPT4All does not have a limit on the number of messages you can send. Bing Copilot consistently allows me to send only 5 messages, which is not the case for GPT4All. I think all LLM models have a running memory, so if you converse with one long enough, it will eventually “forget” the first messages. But for most practical purposes that is irrelevant.

Overall, my experience of running an LLM locally is that there is a higher barrier to entry (you need a powerful computer and you might have problems setting it up), while performance and quality seem lower than Copilot’s. On the other hand, you can run it securely, privately and offline; you can converse with it as long as you want; and you can do it for free, in contrast to something like ChatGPT 4. So right now it appears to be a solution for people who value these things more than user experience.

But it’s still amazing that in a relatively short time the community was able to get us to the point where LLMs can be run on new-ish consumer-grade hardware.


Hello Manoj, Can I know which specific Llamafile/model have you downloaded? :slightly_smiling_face:

Thanks for sharing this link Bill! A simple formula, but a lot of memory required to get a better output!


Hey poojitha,

I used the llava-v1.5-7b-q4.llamafile model from GitHub - Mozilla-Ocho/llamafile: Distribute and run LLMs with a single file.

Thank you


Hey, @billmatthews and fellow learners!

I want to first give a shoutout to @billmatthews for the amazing task he has created for us.

I was pinging him and @poojitha-chandra a lot for their help as I got stuck at multiple places (commands not working, Colab runtime issues, failure to understand some steps), and they both helped me with patience and compassion.

I was able to complete Walkthrough 3, where I created my own AI bot trained on my own data files. Loved the results and am still in awe of the possibilities it opens for me.

If anyone gets stuck with this task, I would highly recommend they check this video out. I have shared my complete demonstration along with mistakes to avoid. Check it out here:

Do share your thoughts & feedback!



Day 28

Which walkthrough you choose and why

I chose walkthrough 3, as Retrieval Augmented Generation addresses two main challenges beyond privacy: the total confidence with which an LLM can give an incorrect answer, and keeping your model up to date as your testing progresses.

This video was a great plain language explainer too I think, could be shared with stakeholders:

How well you think this approach addresses your concerns about data privacy and/or context awareness.

I fed the model with 4 testability PDFs I created about the smells of hard-to-test architecture. I changed the prompt to:

Summarise the testing smells that show your architecture has poor testability

From what I can see from the output (I didn’t use the interface):

RAG PROMPT - Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

There was an explicit instruction not to make up an answer!

It used two of the PDFs to come up with its answer, which was correct; the other two didn’t reference smells:

Helpful Answer: The testing smells that show your architecture has poor testability include: 
(1) Too many production issues, 
(2) Pre-release regression cycles, 
(3) Lack of automation and exploratory testing, (4) Hesitance to change code, 
(5) Testing not considered during architectural design, 
(6) Team seeking more testers, 
(7) Too many slow UI tests, 
(8) Important scenarios not tested, 
(9) Ineffective unit and integration tests, 
(10) Cluttered ineffective logging, 
(11) Flaky nondeterministic automation, 
(12) Tests with duplication and irrelevant detail, 
(13) Issues difficult to reproduce and isolate. 

These smells impact the team's ability to deliver value and cause team satisfaction to decrease.

I think if this model was deployed on a ring fenced server with no internet access and a means for the team to add more documents for context it would go a long way to addressing the privacy problem.
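Conceptually, the RAG step here is just filling that prompt template with the retrieved chunks. Here is a minimal sketch - the chunk data and function name are illustrative, and only the template wording is taken from the actual output above:

```python
# Sketch: fill a RAG prompt template with retrieved chunks, tagging each
# chunk with its source document so you can see which files the answer
# drew on. Chunk contents below are made up for illustration.

TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n{context}\n\nQuestion: {question}"
)

def fill_prompt(chunks: list[tuple[str, str]], question: str) -> str:
    """chunks is a list of (source, text) pairs pulled from the vector store."""
    context = "\n".join(f"[{source}] {text}" for source, text in chunks)
    return TEMPLATE.format(context=context, question=question)

chunks = [
    ("testability-smells.pdf", "Too many production issues signal poor testability."),
    ("architecture.pdf", "Flaky nondeterministic automation is a testing smell."),
]
print(fill_prompt(chunks, "Summarise the testing smells"))
```

Keeping the source alongside each chunk is also what lets you verify, as above, which documents the answer actually used.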

What opportunities does the approach provide you and your team?

Lots of testers share their exploratory testing notes on their team wiki; I think it would be an awesome resource for the whole team to access as a model for test idea and design inspiration. Or even link it to your test management tool. Jira, wikis and test management tools have a big flaw in that they are information-hiding tools, so anything that turns this around is a good thing.

@billmatthews this was a great exercise, I loved it! Thank you!

Also looks like I should take the Hugging Face course as well. :slight_smile:


That’s a nice idea and has some interesting possibilities! I think it is very doable and it’s the kind of innovation that I think will come from testers rather than vendors…although they might borrow this one now :slight_smile:


Agree, this would be of less interest to vendors because the sales pitch is much harder than, say, self-healing tests.

I think it would be great to get some innovations from the community, a RAG model with exploratory testing notes, test design techniques and testability information would be a great gift to your development team. Would reduce friction getting developers more involved in exploratory testing, giving them a good starting point to test from.
