When considering the duration, there are two factors I like to take into account.
One is sample size. If you are running 10 parallel users with an average response time of 243 ms, each user makes roughly 4 requests per second, so 4 * 10 = 40 requests/s.
40 rps * 120 s = 4800 samples, which should be sufficient to be fairly confident in the result.
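The back-of-the-envelope estimate above can be sketched as a small helper. This is my own sketch, assuming each virtual user fires requests back-to-back with no think time; the function name is made up for illustration.

```python
def estimated_samples(users: int, avg_response_ms: float, duration_s: float) -> int:
    """Rough sample count for a closed-loop load test with no think time."""
    per_user_rps = 1000.0 / avg_response_ms  # one request finishes, the next starts
    total_rps = per_user_rps * users
    return int(total_rps * duration_s)

# 10 users, 243 ms average, 120 s test: roughly 4900 samples at ~41 rps
print(estimated_samples(10, 243, 120))
```

If your users have think time between requests, the real rate (and sample count) will be lower, so treat this as an upper bound.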
The second factor is trickier: internal job frequency. This requires a lot more information about your system and its dimensions, but the things I normally like to cover are any standing jobs, such as the database being replicated every 10 minutes, or all orders being batch archived every hour. Another one is cleanup, like garbage collection. That is a resource-based event rather than a timing-based one, but I would still make sure the average response time is measured with at least one cleanup in it, since some systems take much longer to respond during these.
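For the timer-based jobs, the rule of thumb above reduces to a one-liner: to be guaranteed to capture at least one occurrence of every periodic job regardless of when the test starts, the test must run at least as long as the longest job interval. A minimal sketch, with the 10-minute replication and hourly archive from the text as example inputs:

```python
def min_duration_minutes(job_intervals_min: list[float]) -> float:
    """Shortest test duration that is guaranteed to overlap at least one
    run of every periodic job, whatever the test's start time.

    Any window of length T contains at least one multiple of T, so the
    longest interval dominates. Resource-based events like GC are not
    covered by this; those you have to verify from the measured data.
    """
    return max(job_intervals_min)

# replication every 10 min, archiving every 60 min -> run at least 60 min
print(min_duration_minutes([10, 60]))
```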
As a bonus: if your application / service also has a lot of different use cases, such as fetching data and adding new records, and if it is a critical metric that the business makes decisions on, I would suggest adding a mixed traffic model and having an agent measure response times during "normal load" of the system, since that may be more in line with the actual user experience.
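A mixed traffic model usually boils down to picking each virtual user's next action by weight. Here is a minimal sketch; the action names and the 80/20 split are made-up assumptions, to be replaced with your own measured traffic breakdown.

```python
import random

# Hypothetical traffic mix: replace with the ratios from your access logs.
TRAFFIC_MIX = {
    "fetch_data": 0.8,  # reads dominate in this assumed mix
    "add_new": 0.2,     # writes are the minority
}

def pick_action(rng: random.Random) -> str:
    """Choose the next simulated user action according to the weights."""
    actions, weights = zip(*TRAFFIC_MIX.items())
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random(42)
sample = [pick_action(rng) for _ in range(1000)]
print(sample.count("fetch_data"))  # around 800 of 1000 picks are reads
```

Most load-testing tools (Locust, k6, Gatling) have a built-in way to weight tasks like this, so in practice you would express the mix in the tool's own configuration rather than hand-rolling it.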