It depends on the context, so not all of these will be relevant.
Has the model been overfitted to its training data? For instance, a model to tell photos of cats from photos of dogs might accidentally be trained on photos of dogs outdoors and cats sitting on furniture indoors, so the model actually learns to detect grass vs. chairs rather than dogs vs. cats. It's worth testing with valid data that's unlike the training data, e.g. indoor dogs and outdoor cats.
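To make the dogs-and-cats example concrete, here's a toy sketch (the "model" and its fields are entirely made up) showing how a shortcut that looks perfect on typical data collapses on counter-examples:

```python
# A toy "classifier" that has latched onto the background
# (grass vs. furniture) instead of the animal itself.

def shortcut_model(photo):
    # Predicts "dog" whenever the background is grass.
    return "dog" if photo["background"] == "grass" else "cat"

def accuracy(model, photos):
    correct = sum(1 for p in photos if model(p) == p["label"])
    return correct / len(photos)

# Typical data: dogs outdoors, cats indoors -- the shortcut looks perfect.
easy = [
    {"label": "dog", "background": "grass"},
    {"label": "cat", "background": "furniture"},
]

# Counter-examples: indoor dogs and outdoor cats expose the shortcut.
hard = [
    {"label": "dog", "background": "furniture"},
    {"label": "cat", "background": "grass"},
]

print(accuracy(shortcut_model, easy))  # 1.0
print(accuracy(shortcut_model, hard))  # 0.0
```

The point isn't the toy logic; it's that a test set drawn the same way as the training set would never reveal the problem.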
If the model gives a yes/no answer, what are its true positive, false positive, true negative, and false negative rates? Do all four combinations matter equally in your context?
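Counting the four outcomes from paired actual/predicted labels is straightforward; a minimal sketch (with illustrative data) is:

```python
# Count the four outcome types for a yes/no classifier.

def confusion_counts(actual, predicted):
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a and p:
            counts["tp"] += 1
        elif p:                # predicted yes, actually no
            counts["fp"] += 1
        elif not a:            # predicted no, actually no
            counts["tn"] += 1
        else:                  # predicted no, actually yes
            counts["fn"] += 1
    return counts

actual    = [True, True, False, False, True]
predicted = [True, False, False, True, True]
c = confusion_counts(actual, predicted)

# Two derived rates; which ones matter most depends on your context.
tpr = c["tp"] / (c["tp"] + c["fn"])  # true positive rate (sensitivity)
fpr = c["fp"] / (c["fp"] + c["tn"])  # false positive rate
```

Whether you then optimise for sensitivity, precision, or something else is exactly the "do all four matter equally?" question.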
Do errors occur equally for all kinds of input, or are they more common for certain subsets of the input space? Does this matter in your context?
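One way to check whether errors cluster in part of the input space is to slice the error rate by some grouping attribute. A sketch, with invented field names:

```python
# Break error rate down by a subgroup key, to see whether mistakes
# are spread evenly or concentrated in one subset.

from collections import defaultdict

def error_rate_by_group(records, group_key):
    totals, errors = defaultdict(int), defaultdict(int)
    for r in records:
        g = r[group_key]
        totals[g] += 1
        if r["predicted"] != r["actual"]:
            errors[g] += 1
    return {g: errors[g] / totals[g] for g in totals}

records = [
    {"lighting": "day",   "actual": "cat", "predicted": "cat"},
    {"lighting": "day",   "actual": "dog", "predicted": "dog"},
    {"lighting": "night", "actual": "cat", "predicted": "dog"},
    {"lighting": "night", "actual": "dog", "predicted": "dog"},
]
rates = error_rate_by_group(records, "lighting")
# Here all the errors fall in the "night" subset.
```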
The improvement over time probably won’t come for free. In production, the model will be frozen and so will give the same output if given the same input more than once. The behaviour will change only when the model is retrained.
This retraining could happen after X more time has passed, after X more data has been processed, or after the model's performance has dropped by (or below) X. (The last one matters if the data is gradually changing over time, so that a fixed model becomes less and less accurate.)
Which kind of threshold to trigger retraining makes most sense for your context?
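The three trigger styles above can be sketched as a single check; the function name and threshold values here are arbitrary placeholders, not a recommendation:

```python
# Decide whether to retrain, based on the three kinds of threshold:
# elapsed time, volume of new data, and a performance floor.

def should_retrain(days_since_training, new_records, current_accuracy,
                   max_days=30, max_new_records=10_000, min_accuracy=0.9):
    if days_since_training >= max_days:
        return True, "time threshold reached"
    if new_records >= max_new_records:
        return True, "enough new data accumulated"
    if current_accuracy < min_accuracy:
        return True, "performance dropped below floor"
    return False, "no trigger fired"
```

In practice you might use only one of the three, or combine them differently; the useful part is making the trigger explicit rather than retraining ad hoc.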
It’s probably something you have at least partly sorted because of the data engineering that’s currently happening, but traceability, provenance, governance etc. are all important. If the model starts behaving differently (maybe worse) today than yesterday, can you work out why? Was a new model released? Why and by whom? Was this a manual process and someone forgot a step? What data was version X of the model trained on? Version X-1? Can you easily go back to version X-1? Can you produce a model that’s trained on the same data as version X-1, but then train it further on new data?
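Answering those questions later usually means recording a little provenance at release time. A minimal sketch, assuming made-up field names (your real record would likely also include a git commit, hyperparameters, etc.):

```python
# Record minimal provenance alongside each released model, so that
# "why did behaviour change?" can be answered later.

from datetime import datetime, timezone

def build_model_record(version, training_data_version, released_by,
                       parent_version=None):
    return {
        "model_version": version,
        "trained_on": training_data_version,  # which data snapshot
        "parent": parent_version,             # model it was trained from
        "released_by": released_by,
        "released_at": datetime.now(timezone.utc).isoformat(),
    }

record = build_model_record("X", "data-snapshot-2024-01",
                            released_by="alice", parent_version="X-1")
```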
There are some architecture or design questions to consider (which I realise is much easier said than done). There’s more than one kind of model that you could use - would you get better performance with a different one? For a given kind of model there can be high-level parameters (hyperparameters) to tune, e.g. how many layers a neural network has.
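Comparing candidates usually boils down to scoring each on held-out data and picking the best. A sketch with a stand-in `evaluate` function and invented scores (in reality each candidate would be trained and scored properly):

```python
# Compare candidate model types / hyperparameter settings.
# evaluate() is a placeholder for real training and validation.

def evaluate(candidate, validation_data):
    # Stand-in: in reality, train the candidate and score it on
    # the validation set.
    return candidate["assumed_score"]

candidates = [
    {"kind": "logistic_regression", "assumed_score": 0.86},
    {"kind": "neural_net", "layers": 2, "assumed_score": 0.84},
    {"kind": "neural_net", "layers": 4, "assumed_score": 0.91},
]

best = max(candidates, key=lambda c: evaluate(c, validation_data=None))
```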
Even with the type and basic structure fixed, there’s the question of features. Instead of plugging all available data into the model, a pre-processed subset (the features) is used. First, a human chooses the subset of variables most strongly linked to the desired output. Once this filtering has happened, some simple maths may be applied before the features are ready, e.g. taking the log to base 10 of a value, or the minimum of two fields.
Are the features the right ones? Are there inputs with the same or similar features that should nevertheless lead to different outputs?
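The two-step process above — select a subset of variables, then apply simple maths — can be sketched like this, with entirely invented field names:

```python
# Turn raw input fields into features: keep only the chosen
# variables, then apply simple transforms.

import math

def make_features(raw):
    return {
        # A variable believed to matter, log-transformed...
        "log_income": math.log10(raw["income"]),
        # ...and two others combined with simple maths.
        "min_balance": min(raw["balance_a"], raw["balance_b"]),
    }

raw = {"income": 1000, "balance_a": 50, "balance_b": 30, "unused": 7}
features = make_features(raw)
# {'log_income': 3.0, 'min_balance': 30}
```

Note that `unused` is dropped entirely — which is exactly where the "are the features the right ones?" question bites: if `unused` actually carried signal, no amount of tuning downstream will recover it.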