situation will be self-explanatory from the prompts. I used the "Concise" GPT version of ChatGPT-4. I prompted as follows (where, in the prompt, it's the code that's wrapped in code-quote tags):
hi, pretend you are a QA engineer writing a pytest test suite for python code. your current task is writing a test suite for some group theory functions, to be used in a mathematical context. you have to test a function which computes the number of groups of size n (up to isomorphism), where n is an integer.
the following function signature is defined:
def TooComplicatedGroupSizeError(ValueError):
    pass

def num_groups_of_size(n: int) -> int:
    """
    compute the number of groups (up to isomorphism) of size `n`.
    raises a TooComplicatedSizeError for `n` too high.
    raises a standard ValueError for n <= 0.
    """
    ...  # function body here
the function is expected to return correct integer values, unless it cannot compute the result (for example because the value is too high or of an irregular class), in which case it returns a specific error as indicated.
firstly, write black box unit tests, that test this function behaves as expected, without knowing how the function is implemented.
This gave me tests which passed. It had multiple assertions per function, which I asked it to split out with @pytest.mark.parametrize (it then gave the "test case of 0" a parametrize too, which I told it to undo…).
It then gave me this:
@pytest.mark.parametrize("n, expected", [
    (1, 1),  # Trivial group
    (2, 1),  # Z2
    (3, 1),  # Z3
    (4, 2),  # Z4, Klein four-group
    (5, 1),  # Z5
    # Add more known cases as necessary
])
def test_num_groups_of_size_known_values(n, expected):
    assert num_groups_of_size(n) == expected

def test_num_groups_of_size_zero():
    # Testing for n=0, should raise ValueError
    with pytest.raises(ValueError):
        num_groups_of_size(0)

@pytest.mark.parametrize("n", [-1, -10])
def test_num_groups_of_size_negative(n):
    with pytest.raises(ValueError):
        num_groups_of_size(n)

@pytest.mark.parametrize("n", [1000])  # Assuming 1000 is beyond the function's capability
def test_num_groups_of_size_too_high(n):
    with pytest.raises(TooComplicatedGroupSizeError):
        num_groups_of_size(n)
Good stuff: the tests passed, except 4 failed because I'd typo'd "def TooComplicatedGroupSizeError" rather than class, so pytest gave errors about it not being a BaseException subclass… bug found! Fixing that, all the tests passed. Now I gave it the source code and asked for more tests:
Okay, now given the following implementation, give me some further tests. DO NOT REMOVE any of your existing test cases, and include comments where applicable. also, note that the implementation may be incorrect. if any part of the implementation goes against known mathematics, please write a test that will _fail_ and highlight this.
def num_groups_of_size(n: int) -> int:
    """
    compute the number of groups (up to isomorphism) of size `n`.
    raises a TooComplicatedSizeError for `n` too high.
    raises a standard ValueError for n <= 0.
    """
    if n == 1:
        return 1
    prime_factors = sympy.primefactors(n)
    if len(prime_factors) == 1:  # p^n
        p = prime_factors[0]
        if n == p:
            return 1  # cyclic group
        if n == p ** 2:
            return 2  # cyclic and cartesian
        if n == p ** 3:
            return 5  # this is a classification theorem (3 abelian, 2 non-abelian)
        # powers above 3 are more complicated
    if (
        len(prime_factors) == 2
        and n == prime_factors[0] * prime_factors[1]
        and 2 in prime_factors
    ):
        return 2  # D_2n and C_2n
    raise TooComplicatedGroupSizeError(f"not easy to find num groups of size {n}")
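(For readers unfamiliar with the sympy call doing the heavy lifting here: primefactors returns the distinct prime factors of n in ascending order, and it's the length of that list the branches key off. A quick illustration:)

```python
import sympy

# primefactors gives the *distinct* prime factors, sorted ascending;
# multiplicity is ignored, which is why the code checks n == p**k separately.
print(sympy.primefactors(8))   # [2]        -> the prime-power branch
print(sympy.primefactors(10))  # [2, 5]     -> the 2*p branch
print(sympy.primefactors(30))  # [2, 3, 5]  -> falls through to the error
```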
So ChatGPT came up with some more tests, and they're mostly good, but one is bad! The additional tests it suggested are:
@pytest.mark.parametrize("n, expected", [
    (7, 1),   # Prime number, only 1 group: Z7
    (9, 2),   # Prime square, p=3: Z9, and one non-abelian group
    (27, 5),  # Prime cube, p=3: 5 groups (3 abelian, 2 non-abelian)
    # More prime power cases can be added if necessary
])
def test_num_groups_of_size_prime_powers(n, expected):
    assert num_groups_of_size(n) == expected

@pytest.mark.parametrize("n, expected", [
    (6, 2),   # 2 * 3, should have 2 groups: Z6 (cyclic) and S3 (symmetric group, non-abelian)
    (10, 2),  # 2 * 5, two groups: Z10 and D5 (dihedral group)
    # Additional tests for products of two primes can be added
])
def test_num_groups_of_size_two_primes_including_two(n, expected):
    assert num_groups_of_size(n) == expected

# Test to expose potential mathematical inaccuracies
@pytest.mark.parametrize("n, expected_failure", [
    (8, 5),  # For 8, there are 5 groups, not 2 as the simple p^2 case might suggest
    # Add more cases that are mathematically complex and incorrectly handled by the function
])
def test_num_groups_of_size_mathematical_accuracy(n, expected_failure):
    # These tests are expected to fail given the current implementation
    with pytest.raises(AssertionError):
        assert num_groups_of_size(n) == expected_failure

# Additional test for too complicated scenario
def test_num_groups_of_size_very_high_prime_power():
    # High prime powers not directly covered by the function
    with pytest.raises(TooComplicatedGroupSizeError):
        num_groups_of_size(2**10)  # 1024, should raise TooComplicatedGroupSizeError
In particular, the test that goes like this is quite awful:
with pytest.raises(AssertionError):
    assert num_groups_of_size(n) == expected_failure
I asked it whether the test was wrong or the code was wrong, and it was quite stubborn in saying the code was wrong, even when I told it the code gave the right answer; it took a couple of prompts to convince it otherwise. Don't always trust GPTs! It then gave me a fixed test; I pointed out it could be consolidated into the parametrize above, and it did so (also consolidating the test_num_groups_of_size_prime_powers test; fair enough, it's all down to labelling at this point). Now all my tests pass, yay!
I ran coverage and it had 100% line/branch coverage of my function. Yay!
I then asked it to write some xfail tests for future functionality (I also explicitly told it to make them strict=True), and it gave sensible things that all xfail:
@pytest.mark.xfail(
    reason="Handling for prime powers greater than 3 not implemented.",
    strict=True
)
@pytest.mark.parametrize("n", [
    2**4,  # 16, should be more than 2 groups
    3**4,  # 81, also more complex
])
def test_num_groups_of_size_high_prime_powers(n):
    # Acknowledges unimplemented functionality for high prime powers.
    assert num_groups_of_size(n) > 2

@pytest.mark.xfail(
    reason="Functionality for composite numbers with more than two prime factors \
not implemented.",
    strict=True
)
@pytest.mark.parametrize("n, expected", [
    (30, 4),     # 2*3*5, has a specific number of groups not handled
    (60, None),  # 2^2*3*5, more complex scenario, expected value to be filled
])
def test_num_groups_of_size_composite_numbers(n, expected):
    # Indicates missing handling for numbers with >2 prime factors.
    assert num_groups_of_size(n) == expected

@pytest.mark.xfail(
    reason="General solution for arbitrary n not implemented.",
    strict=True
)
def test_num_groups_of_size_arbitrary():
    n = 64  # A power of 2, but larger than handled cases
    # Placeholder assertion; exact number should be determined later.
    assert num_groups_of_size(n) > 5
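(An aside on why I insisted on strict=True: it makes an unexpected pass (XPASS) count as a failure, so these placeholders turn into real regressions the moment the functionality lands, rather than silently passing forever. A minimal illustration, with a hypothetical test name:)

```python
import pytest

@pytest.mark.xfail(reason="not implemented yet", strict=True)
def test_placeholder():
    # While unimplemented this raises, and pytest reports XFAIL.
    # Once it starts passing, strict=True turns the XPASS into a hard
    # failure, forcing us to promote this into a real test.
    raise NotImplementedError
```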
Overall, a good set of basic tests written with very little thought and very simple prompting. They were correct modulo the one test that was just wrong, but that showed up straight away as an odd failure, so it was easy to question. The coverage seems pretty decent, and there are good future-facing tests for added functionality, including the n=60 case, where ChatGPT didn't hazard a guess at the number of groups but at least had it ready to go with f(n) ?= None set up as a future regression test!
Since this function was a deliberately simple int -> int function, the tests look pretty similar to what I'd write myself. I guess I'd personally try to include more asymptotic things (even if maybe xfailed), e.g. some tests that there are >= m groups of large size 2^k. But… I guess ChatGPT had that for k=6; way ahead of me.