đŸ€– Day 10: Critically Analyse AI-Generated Tests

Wow, that time saved is impressive!

I agree. I worry about this. There’s something special when we get to that powerful combo of “critical thinking meets creativity” in our careers as testers. It’s a real skill to work on. Might AI tools just take the fun and life experience out of such a career growth opportunity?

Hey All,
For this challenge, I used ChatGPT to create the feature description and Gemini to create the test scenarios.

What I did here

  1. I first tried with something like FB’s registration feature; here is the chat:
    https://g.co/gemini/share/e629a1c5d170

  2. I then tried something a little more complicated and chose the segmentation feature of an A/B testing tool, since I am currently interested in this. Here is the result of that chat: https://g.co/gemini/share/65bde2ccc016

Findings:

How do the AI-generated tests differ from tests you would have created?

  1. The writing format itself is not that different, but the coverage is missing some points that I would have added anyway, even if they were not explicitly written in the feature description, e.g. OAuth integration scenarios for FB’s alternative sign-up option.

  2. I did not like the scenarios for the segmentation feature at all, but maybe a different prompt or tool might have done a better job here; either way, I can use them as a guideline.

Your perspective on how AI-generated tests benefit the testing process.
It facilitates the boring bits, maybe. I will definitely use it for repetitive tasks, and maybe for new feature coverage if reviewing the scenarios takes less time than writing them.

What do you see as the limitations of the tool you evaluated?
The scenarios for a more complex feature were taken too far out of context; I need to evaluate this some more.

2 Likes

Original tests:
Feature: Sign-up functionality

  Scenario: Basic Sign-Up Process
    Given the user is on the sign-up page
    When they fill in valid information in all required fields
    And they click on the sign-up button
    Then they should be successfully registered
    And they should be redirected to the dashboard or confirmation page

  Scenario: Empty Fields Validation
    Given the user is on the sign-up page
    When they attempt to submit the form with one or more required fields empty
    Then they should see error messages next to the empty fields indicating that they are required

  Scenario: Invalid Email Format
    Given the user is on the sign-up page
    When they enter an invalid email format
    And they submit the form
    Then they should see an appropriate error message indicating the invalid email format

  Scenario: Password Strength Validation
    Given the user is on the sign-up page
    When they enter a password that does not meet the strength requirements
    And they submit the form
    Then they should see an error message indicating the password strength requirements

  Scenario: Unique Email Validation
    Given the user is on the sign-up page
    When they enter an email address that is already registered in the system
    And they submit the form
    Then they should see an error message indicating that the email is already in use

  Scenario: Confirmation Email
    Given the user signs up with a new email address
    When they check their email inbox
    Then they should receive a confirmation email containing the necessary instructions and verification link

  Scenario: Security Measures
    Given the user is on the sign-up page
    When they attempt SQL injection attacks or Cross-Site Scripting attacks
    Then the system should handle these security threats appropriately and not allow malicious input

  Scenario: Performance Testing
    Given the system is under varying loads
    When multiple concurrent sign-up requests are simulated
    Then the response time of the sign-up process should be measured
    And the system should remain responsive and stable even under peak loads

  Scenario: Accessibility Testing
    Given the user is on the sign-up page
    When they test using screen readers and keyboard-only inputs
    Then all elements should be properly labeled and navigable
    And the color contrast and visual elements should meet accessibility standards

  Scenario: Localization Testing
    Given the user changes the language/locale settings of the platform
    When they test the sign-up process in the selected language/locale
    Then all text, labels, and error messages should be displayed correctly and culturally appropriate

Cucumber data-table format: the same ten scenarios repeated with each step in Given | / When | / Then | pipe format.

Given the above, what interesting and edge case scenarios can you come up with?

ChatGPT

I then asked ChatGPT for some interesting and edge case scenarios for the sign-up functionality:

It gave me:

  1. Edge Case: Maximum Field Length
  • Given the user is on the sign-up page
  • When they attempt to fill in fields with the maximum allowed length
  • And they click on the sign-up button
  • Then they should be successfully registered if all fields are within the maximum length limit
  2. Edge Case: International Characters
  • Given the user is on the sign-up page
  • When they attempt to fill in fields with international characters (e.g., accented letters, non-Latin characters)
  • And they click on the sign-up button
  • Then the system should accept and process these characters correctly
  3. Edge Case: Uncommon Email Domains
  • Given the user is on the sign-up page
  • When they enter an email address with an uncommon domain (e.g., .info, .co)
  • And they submit the form
  • Then the system should validate and accept such email addresses if they are valid
  4. Edge Case: Account Recovery
  • Given the user is on the sign-up page
  • When they forget their password after signing up
  • And they initiate the password recovery process
  • Then they should receive an email with instructions on resetting their password
  5. Edge Case: Network Interruption
  • Given the user is on the sign-up page
  • When they attempt to submit the form
  • And there’s a temporary network interruption during submission
  • Then the system should retain the entered information and allow the user to retry without losing data
  6. Edge Case: Concurrent Sign-Ups with Same Email
  • Given multiple users attempt to sign up simultaneously with the same email address
  • When they submit the form concurrently
  • Then the system should handle and prevent duplicate sign-ups with the same email address
  7. Edge Case: Temporary Email Addresses
  • Given the user is on the sign-up page
  • When they attempt to sign up using a temporary or disposable email address
  • Then the system should reject such email addresses and prompt the user to provide a permanent one
  8. Edge Case: Sign-Up From Different Devices
  • Given the user starts the sign-up process on one device
  • When they switch to another device midway through the process
  • Then the system should allow them to resume the sign-up process seamlessly without losing entered data
  9. Edge Case: User Behavior Analysis
  • Given the user is on the sign-up page
  • When they exhibit unusual behavior patterns (e.g., filling out the form too quickly, using automation tools)
  • Then the system should flag and potentially investigate such behavior for signs of fraud or misuse.
  10. Edge Case: Sign-Up Through Third-Party Services
  • Given the user is on the sign-up page
  • When they choose to sign up using a third-party service (e.g., Google, Facebook)
  • Then the system should securely integrate with the chosen service and handle the sign-up process accordingly.
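
To make one of these concrete, here’s a minimal pytest sketch of how the concurrent sign-up edge case (item 6) might be automated. The sign_up() helper and the response codes are assumptions for illustration, not a real API:

from concurrent.futures import ThreadPoolExecutor

def test_concurrent_signups_with_same_email():
    email = "duplicate@example.com"

    def attempt():
        # sign_up() is a hypothetical client helper that submits the
        # registration form and returns an HTTP-style response object.
        return sign_up(email=email, password="S0me-strong-pass!")

    # Fire both registration attempts at (roughly) the same time.
    with ThreadPoolExecutor(max_workers=2) as pool:
        first, second = list(pool.map(lambda _: attempt(), range(2)))

    # Exactly one of the two concurrent attempts should create the account;
    # the status codes here are assumptions, not the real API contract.
    assert sorted([first.status_code, second.status_code]) == [201, 409]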

I then asked for insights on how to establish the sufficiency or correctness of a list of test scenarios.

The response was:

1. Coverage: Ensuring that the scenarios cover various aspects of the feature adequately. This includes positive and negative test cases, edge cases, and scenarios covering different user interactions and potential system behaviors.
2. Relevance: Each scenario should address a specific aspect or requirement of the feature. It’s essential to ensure that the scenarios are relevant to the functionality being tested and contribute to the overall quality of the feature.
3. Completeness: The list should cover all essential functionalities and potential failure points within the feature. This includes ensuring that all requirements and user stories are adequately addressed by the test scenarios.
4. Risk-based Testing: Prioritizing scenarios based on the potential impact and likelihood of occurrence can help ensure that testing efforts are focused on the most critical areas first.
5. Validation against Requirements: Ensuring that each scenario aligns with the specified requirements or acceptance criteria for the feature.

I then followed up with “given these two sets of tests, what else should be tested?” and it gave me:

  1. Session Management:
  • Scenario: User’s session expires during sign-up
    • Given the user is on the sign-up page
    • When they take a significant amount of time to fill in the form
    • And their session expires
    • Then the system should prompt them to log in again and retain the entered data.
  2. Browser Compatibility:
  • Scenario: Sign-up process across different web browsers
    • Given the user is on the sign-up page
    • When they attempt to sign up using different web browsers (e.g., Chrome, Firefox, Safari, Edge)
    • Then the sign-up process should function correctly and consistently across all supported browsers.
  3. Data Privacy and GDPR Compliance:
  • Scenario: User data handling and consent
    • Given the user is on the sign-up page
    • When they provide their personal information and agree to the terms of service
    • Then the system should handle and store their data securely, in compliance with relevant data protection regulations such as GDPR.
  4. Cross-device Testing:
  • Scenario: Sign-up process on various devices (e.g., desktop, mobile, tablet)
    • Given the user is on the sign-up page
    • When they attempt to sign up using different devices with varying screen sizes and resolutions
    • Then the sign-up process should be responsive and display correctly on all supported devices.
  5. Load Balancing and Scalability Testing:
  • Scenario: Sign-up process under high load and stress conditions
    • Given the system is under heavy load or stress
    • When multiple concurrent sign-up requests are made continuously
    • Then the system should maintain performance and stability, and new sign-ups should be processed without errors or delays.
  1. Review the scenarios: critically examine the scenarios generated and compare them to those that you, as a tester, might have created:
    a. What did the tool do well? - It was fast and offered a good spread of test scenarios
    b. What did the tool not do so well with? - It didn’t order them using a high-risk, high-value approach.

1. High Risk, High Value:

  • Security Measures: Given the potential for severe consequences of security breaches, ensuring the system can handle SQL injection and XSS attacks appropriately is critical.
  • Performance Testing: As performance issues can significantly impact user experience, measuring response time and system stability under load is crucial for maintaining user satisfaction.
  • Unique Email Validation: Preventing users from signing up with already registered email addresses helps maintain data integrity and prevents account duplication.
  • Session Management: Ensuring that the system handles session expirations during sign-up to prevent data loss is essential for user experience and data integrity.
  • Data Privacy and GDPR Compliance: Mishandling user data can have legal and reputational consequences, making GDPR compliance and secure data handling imperative.
  2. Medium Risk, Medium Value:
  • Empty Fields Validation: While empty fields can lead to user frustration, they are less likely to have severe consequences compared to security or data privacy issues.
  • Invalid Email Format: While important for data accuracy, invalid email format errors are less severe than security vulnerabilities or performance issues.
  • Browser Compatibility: While important for ensuring broad usability, issues with browser compatibility are typically less critical than security or performance concerns.
  • Cross-device Testing: Ensuring responsiveness across devices enhances user experience but may not pose significant risks if not thoroughly tested.
  • Load Balancing and Scalability Testing: While important for ensuring system reliability, load testing may have lower immediate impact compared to security or data privacy issues.
  3. Low Risk, Low Value:
  • International Characters: While supporting international characters is beneficial for inclusivity, issues related to them are less likely to have severe consequences.
  • Temporary Email Addresses: While rejecting temporary email addresses helps prevent misuse, it may not be a high priority compared to other security measures.
  • Sign-up From Different Devices: While seamless multi-device support enhances user experience, issues related to it may have relatively low impact.

c. How straightforward was it to express the inputs to generate the tests? You definitely have to check the response and tweak your prompts a few times to get optimal results, but overall it’s a great, quick way to get the process started.
d. Are the generated tests correct and ready to use? It depends on what you’re testing: for something very basic, they can be; for complex systems, definitely not.
e. How complete do you think the tests are? What does the tool miss? They are only as complete as the detail in your prompt. Every time you add, remove, or change something in your prompt, the response is updated accordingly; as you can see, I have four different responses listed here, and each response led to further tweaks to my prompt.
2. Share your findings: Let the community know about your findings by replying to this post. Consider sharing: - Listed above
a. How do the AI-generated tests differ from tests you would have created? - Way faster and simpler
b. Your perspective on how AI-generated tests benefit the testing process. - It’s a quick and easy way to get the process started, and the responses definitely help trigger further thinking and prompt tweaking. You cannot follow the responses blindly; you need to read them, consider gaps, add more context, tweak the prompt, and repeat until you reach a more usable test suite.
c. What do you see as the limitations of the tool you evaluated? Without context, the value is truly lost.

1 Like

Today with Copilot I tried to generate some test scenarios for our company product, FormX.ai, on its feature called “Fixed Layout Extractor”. The feature is quite complicated and hard to describe with my poor English, so I also attached an image of the UI to see how much it could return.

This is the prompt in my first trial:

The Fixed Layout Extractor extracts data from an image of a document into JSON format. The user needs to draw regions on the image to define characteristics of the image to locate the data to extract, and draw regions on the image to define the relative position of the data to extract. In the end it will display all the extracted items on the document to the user. The attached image is the UI of this feature. Please design the test scenarios for this feature.

The results it gave include:

  • Positive and Negative Scenarios - mainly focusing on the user input (document layout, type, image quality, etc.)
  • Boundary Scenarios - including extreme image size and system limit
  • Security Concerns for the product - which are relevant but not practically testable scenarios

The direction was not quite right, so, taking hints from the Day 10 questions, I asked it to regenerate the scenarios with those Test Design Techniques applied and evaluated again.

Second prompt:

Please design the functional scenarios with application of Test Design Techniques such as Boundary Value Analysis, Combinatorial Testing or Path Testing, and consider other attributes such as accessibility, performance or security.

This time, after the tuning, the results were much more diverse, and they really suit my case:

1. **Positive Scenarios:**
   - **High-Quality Scan Verification:**
     - **Test Objective**: Verify how the system handles the maximum acceptable file size for image uploads.
     - **Test Steps**:
       1. Upload an image at the maximum file size allowed.
       2. Verify that the system processes the image efficiently without errors.
       3. Confirm that the extracted data is accurate.
     - **Boundary Values**: Image size just below and just above the maximum limit.

   - **Layout Adaptability:**
     - **Test Objective**: Validate the system's behavior with documents that push the limits of quality.
     - **Test Steps**:
       1. Upload wrinkled, torn, or partially obscured documents.
       2. Observe how the system responds:
          - Does it prompt for a clearer image?
          - Does it handle challenging conditions gracefully?
       3. Verify that extracted data remains accurate.
     - **Boundary Values**: Documents with extreme quality issues.

2. **Combinatorial Testing (Pairwise Testing):**
   - **Extraction Settings Combinations:**
     - **Test Objective**: Cover various combinations of extraction settings (e.g., region selection, document type, auto extraction items).
     - **Test Steps**:
       1. Test with different region selections (e.g., entire document, specific sections).
       2. Combine document types (e.g., invoices, receipts) with auto extraction items (e.g., date, time).
       3. Verify that all combinations work as expected.
     - **Pairs to Test**:
       - Region selection × Document type
       - Region selection × Auto extraction items
       - Document type × Auto extraction items

3. **Path Testing:**
   - **Scenario 5: End-to-End Workflow:**
     - **Test Objective**: Validate the complete workflow from image upload to displaying extracted items.
     - **Test Steps**:
       1. Upload an image.
       2. Draw regions to define characteristics and relative positions.
       3. Verify that the system processes the image correctly.
       4. Check if all extracted items are displayed to the user.
     - **Paths to Test**: Image upload → Region drawing → Extraction → Display results.

4. **Performance Testing:**
   - **Scenario 6: Large Batch Processing:**
     - **Test Objective**: Assess system performance when processing multiple images simultaneously.
     - **Test Steps**:
       1. Upload a batch of images (e.g., 50 images).
       2. Measure processing time and resource utilization.
       3. Ensure that performance remains acceptable.
     - **Performance Metrics**: Processing time, memory usage.

5. **Security and Compliance Testing:**
   - **Scenario 7: Data Protection and Privacy:**
     - **Test Objective**: Validate compliance with privacy regulations.
     - **Test Steps**:
       1. Upload documents containing sensitive information (e.g., personal IDs, financial data).
       2. Verify that extracted data is not exposed.
       3. Ensure compliance with data protection laws.
     - **Security Aspects**: Data encryption, access controls.
6. **Accessibility and Usability Testing:**
   - **Scenario 8: Screen Reader Compatibility:**
     - **Test Objective**: Evaluate accessibility for users with disabilities.
     - **Test Steps**:
       1. Use screen readers to navigate the UI.
       2. Verify that regions can be drawn using keyboard shortcuts.
       3. Confirm adherence to accessibility guidelines (e.g., proper labels, focus management).

7. **Negative Scenarios:**
   - **Scenario 9: Low-Quality Images:**
     - **Test Objective**: Check system behavior with low-resolution images.
     - **Test Steps**:
       1. Upload an image with poor resolution.
       2. Verify that the system displays an appropriate error or warning message.
       3. Ensure the user is prompted for a clearer image.
     - **Boundary Value**: Extremely low-resolution image.

   - **Scenario 10: Unclear Labels and Text:**
     - **Test Objective**: Test system response to unclear labels or text.
     - **Test Steps**:
       1. Use an image with smudged or illegible text.
       2. Observe how the system handles unclear content.
       3. Ensure graceful handling or user guidance.

8. **Boundary Scenarios:**
   - **Scenario 11: Maximum File Size Handling:**
     - **Test Objective**: Verify system response to the maximum file size.
     - **Test Steps**:
       1. Upload an image just below and just above the maximum limit.
       2. Confirm proper handling without performance degradation.
     - **Boundary Values**: Maximum file size.
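
As a side note on the boundary values above, here is a minimal pytest sketch of how the file-size boundaries might be parametrised; the 10 MB limit and the make_image_of_size()/upload_document() helpers are assumptions, not the real FormX.ai API:

import pytest

MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed limit; the real one isn't stated

@pytest.mark.parametrize("size_bytes, should_accept", [
    (MAX_UPLOAD_BYTES - 1, True),   # just below the limit
    (MAX_UPLOAD_BYTES, True),       # exactly at the limit
    (MAX_UPLOAD_BYTES + 1, False),  # just above the limit
])
def test_upload_size_boundaries(size_bytes, should_accept):
    image = make_image_of_size(size_bytes)  # hypothetical test helper
    response = upload_document(image)       # hypothetical client call
    assert response.accepted is should_accept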

Through this practice I also learned how to make the bot generate results closer to our own style, by instructing it which Test Design Techniques to apply during generation.
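
To illustrate the pairwise idea from section 2 above, here is a minimal sketch of how the combinations could be fed to pytest. itertools.product enumerates the full cartesian product, which a dedicated pairwise tool could prune while still covering every pair of values at least once; the value lists and the extract_fixed_layout() call are hypothetical placeholders rather than FormX.ai’s real options or API:

import itertools
import pytest

REGIONS = ["entire_document", "header_section", "line_items"]  # assumed values
DOC_TYPES = ["invoice", "receipt"]                             # assumed values
AUTO_ITEMS = ["date", "total_amount"]                          # assumed values

# Every combination of the three parameters.
ALL_COMBINATIONS = list(itertools.product(REGIONS, DOC_TYPES, AUTO_ITEMS))

@pytest.mark.parametrize("region, doc_type, auto_item", ALL_COMBINATIONS)
def test_extraction_settings_combination(region, doc_type, auto_item):
    # extract_fixed_layout() stands in for whatever API or UI driver actually
    # exercises the Fixed Layout Extractor.
    result = extract_fixed_layout(region=region, doc_type=doc_type,
                                  auto_item=auto_item)
    assert result.status == "ok"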


Some more thoughts:

Unlike an experienced tester, who knows which techniques work best for a product in different situations, AI tools find it hard to produce both high-level and low-level test scenarios at the same time, or to decide on the best strategies for testing the product.

As different levels of test scenarios require different techniques, it is important to tell the AI tool which techniques to use at the start of the generation. Without sufficient training (which may be a high-cost effort for now), it cannot decide which techniques or test approaches suit the situation the way a human can.

4 Likes

Hi All,
This task seems much the same as the previous one, so I reused my previous prompt of ‘give test cases for forgot password screen’ with a little tweak: ‘the test cases for the forgot password page for both UI and FN’. The test cases and scenarios it provided were obvious and descriptive at the same time.
Tool Used: ChatGPT v3.5
Outcome: use detailed input to get better output for your problem.

1 Like

Please consider my previous day (Day 8) response for this day.

1 Like

The situation will be self-explanatory from the prompts. I used the “Concise” GPT version of ChatGPT 4. I prompted as follows (in the prompt, the code is what’s wrapped in code quote tags):

hi, pretend you are a QA engineer writing a pytest test suite for python code. your current task is writing a test suite for some group theory functions, to be used in a mathematical context. you have to test a function which computes the number of groups of size n (up to isomorphism), where n is an integer.

the following function signature is defined:

def TooComplicatedGroupSizeError(ValueError):
    pass


def num_groups_of_size(n: int) -> int:
    """
    compute the number of groups (up to isomorphism) of size `n`.
    raises a TooComplicatedSizeError for `n` too high.
    raises a standard ValueError for n <= 0.
    """
    ...  # function body here

the function is expected to return correct integer values, unless it cannot compute the result (for example because the value is too high or of an irregular class), in which case it returns a specific error as indicated.

firstly, write black box unit tests, that test this function behaves as expected, without knowing how the function is implemented.

This gave me tests which passed. It had multiple assertions per function, which I asked to split out with @pytest.mark.parametrize (it then gave the “test case of 0” a parametrize too, which I told it to undo).

It then gave me this:

@pytest.mark.parametrize("n, expected", [
    (1, 1),  # Trivial group
    (2, 1),  # Z2
    (3, 1),  # Z3
    (4, 2),  # Z4, Klein four-group
    (5, 1),  # Z5
    # Add more known cases as necessary
])
def test_num_groups_of_size_known_values(n, expected):
    assert num_groups_of_size(n) == expected

def test_num_groups_of_size_zero():
    # Testing for n=0, should raise ValueError
    with pytest.raises(ValueError):
        num_groups_of_size(0)

@pytest.mark.parametrize("n", [-1, -10])
def test_num_groups_of_size_negative(n):
    with pytest.raises(ValueError):
        num_groups_of_size(n)

@pytest.mark.parametrize("n", [1000])  # Assuming 1000 is beyond the function's capability
def test_num_groups_of_size_too_high(n):
    with pytest.raises(TooComplicatedGroupSizeError):
        num_groups_of_size(n)

Good stuff: the tests passed, except four failed because I’d typo’d “def TooComplicatedGroupSizeError” rather than class, so they gave errors about BaseException. Bug found! Fixing that, all the tests passed.
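The fix was just declaring the exception as an actual class:

class TooComplicatedGroupSizeError(ValueError):
    # A proper exception class, so pytest.raises can catch it.
    pass

Now I gave it the source code and asked for more tests: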

Okay, now given the following implementation, give me some further tests. DO NOT REMOVE any of your existing test cases, and include comments where applicable. also, note that the implementation may be incorrect. if any part of the implementation goes against known mathematics, please write a test that will _fail_ and highlight this.

def num_groups_of_size(n: int) -> int:
    """
    compute the number of groups (up to isomorphism) of size `n`.
    raises a TooComplicatedSizeError for `n` too high.
    raises a standard ValueError for n <= 0.
    """
    if n == 1:
        return 1
    prime_factors = sympy.primefactors(n)
    if len(prime_factors) == 1:  # p^n
        p = prime_factors[0]
        if n == p:
            return 1  # cyclic group
        if n == p ** 2:
            return 2  # cyclic and cartesian
        if n == p ** 3:
            return 5  # this is a classification theorem (3 abelian, 2 non-abelian)
        # powers above 3 are more complicated
    if (
        len(prime_factors) == 2
        and n == prime_factors[0] * prime_factors[1]
        and 2 in prime_factors
    ):
        return 2  # D_2n and C_2n
    raise TooComplicatedGroupSizeError(f"not easy to find num groups of size {n}")
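
For orientation, a quick sanity check of the branches in that implementation (assuming sympy is imported and the exception is defined as a proper class):

num_groups_of_size(9)    # 3**2 -> returns 2 (C9 and C3 x C3)
num_groups_of_size(10)   # 2 * 5 -> returns 2 (C10 and D5)
num_groups_of_size(12)   # 2**2 * 3 -> raises TooComplicatedGroupSizeError,
                         # even though there are in fact 5 groups of order 12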

So ChatGPT came up with some more tests, and they’re mostly good, but one is bad! The additional tests it suggests are:

@pytest.mark.parametrize("n, expected", [
    (7, 1),  # Prime number, only 1 group: Z7
    (9, 2),  # Prime square, p=3: Z9, and one non-abelian group
    (27, 5),  # Prime cube, p=3: 5 groups (3 abelian, 2 non-abelian)
    # More prime power cases can be added if necessary
])
def test_num_groups_of_size_prime_powers(n, expected):
    assert num_groups_of_size(n) == expected

@pytest.mark.parametrize("n, expected", [
    (6, 2),  # 2 * 3, should have 2 groups: Z6 (cyclic) and S3 (symmetric group, non-abelian)
    (10, 2),  # 2 * 5, two groups: Z10 and D5 (dihedral group)
    # Additional tests for products of two primes can be added
])
def test_num_groups_of_size_two_primes_including_two(n, expected):
    assert num_groups_of_size(n) == expected

# Test to expose potential mathematical inaccuracies
@pytest.mark.parametrize("n, expected_failure", [
    (8, 5),  # For 8, there are 5 groups, not 2 as the simple p^2 case might suggest
    # Add more cases that are mathematically complex and incorrectly handled by the function
])
def test_num_groups_of_size_mathematical_accuracy(n, expected_failure):
    # These tests are expected to fail given the current implementation
    with pytest.raises(AssertionError):
        assert num_groups_of_size(n) == expected_failure

# Additional test for too complicated scenario
def test_num_groups_of_size_very_high_prime_power():
    # High prime powers not directly covered by the function
    with pytest.raises(TooComplicatedGroupSizeError):
        num_groups_of_size(2**10)  # 1024, should raise TooComplicatedGroupSizeError

The particular test that goes like this is quite awful:

with pytest.raises(AssertionError):
    assert num_groups_of_size(n) == expected_failure

I tried asking it whether the test was wrong or the code was wrong, and it was quite stubborn in saying the code was wrong, even when I said it gave the right answer; it took a couple of prompts to convince it otherwise. Don’t always trust GPTs! It then gave me a fixed test; I pointed out it could consolidate into a parametrize above, and it did so (also consolidating the test_num_groups_of_size_prime_powers test; fair enough, it’s all down to labelling at this point). Now all my tests pass, yay!
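
The consolidated version ended up looking something like this (sketched from the description above rather than copied verbatim), with the corrected n=8 expectation folded into the prime-power parametrize:

@pytest.mark.parametrize("n, expected", [
    (7, 1),   # prime: only Z7
    (8, 5),   # 2**3: five groups, asserted directly instead of via AssertionError
    (9, 2),   # 3**2: Z9 and Z3 x Z3
    (27, 5),  # 3**3: classification theorem case
])
def test_num_groups_of_size_prime_powers(n, expected):
    assert num_groups_of_size(n) == expected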

I ran coverage and it had 100% line/branch coverage of my function. Yay!

I then asked it to write some xfail tests for future functionality (I also explicitly told it to make them strict=True), and it gave sensible things that all xfail:

@pytest.mark.xfail(
    reason="Handling for prime powers greater than 3 not implemented.",
    strict=True
)
@pytest.mark.parametrize("n", [
    2**4,  # 16, should be more than 2 groups
    3**4,  # 81, also more complex
])
def test_num_groups_of_size_high_prime_powers(n):
    # Acknowledges unimplemented functionality for high prime powers.
    assert num_groups_of_size(n) > 2

@pytest.mark.xfail(
    reason="Functionality for composite numbers with more than two prime factors \
not implemented.",
    strict=True
)
@pytest.mark.parametrize("n, expected", [
    (30, 4),  # 2*3*5, has a specific number of groups not handled
    (60, None),  # 2^2*3*5, more complex scenario, expected value to be filled
])
def test_num_groups_of_size_composite_numbers(n, expected):
    # Indicates missing handling for numbers with >2 prime factors.
    assert num_groups_of_size(n) == expected

@pytest.mark.xfail(
    reason="General solution for arbitrary n not implemented.",
    strict=True
)
def test_num_groups_of_size_arbitrary():
    n = 64  # A power of 2, but larger than handled cases
    # Placeholder assertion; exact number should be determined later.
    assert num_groups_of_size(n) > 5

Overall, a good set of basic tests written with very little thought and very simple prompting. They were correct modulo the one test that was just wrong, but that came up straight away as an odd failure, so it was easy to question. The coverage seems pretty decent, and there are good future-facing tests for added functionality, including the n=60 case where ChatGPT didn’t hazard a guess at the number of groups but at least had it ready, with f(n) ?= None set up as a future regression test! :stuck_out_tongue:

Since this function was a deliberately simple int->int function, the tests look pretty similar to what I’d write. I guess I’d personally try to include more asymptotic things (even if maybe xfailed), e.g. some tests that there are >= m groups of large size 2^k. But I guess ChatGPT had that for k=6; way ahead of me.
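
For example, something like this, in the style of the earlier xfail tests (the expected minimums are the known group counts for orders 128 and 256):

@pytest.mark.xfail(
    reason="Counting groups for prime powers above p**3 not implemented.",
    strict=True,
)
@pytest.mark.parametrize("k, minimum", [
    (7, 2328),   # there are 2328 groups of order 2**7 = 128
    (8, 56092),  # and 56092 groups of order 2**8 = 256
])
def test_num_groups_of_size_grows_quickly(k, minimum):
    # Currently raises TooComplicatedGroupSizeError, so the xfail is expected.
    assert num_groups_of_size(2**k) >= minimum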