AI is clearly smart and developing at a mind-boggling pace. Sometimes you might wonder, is AI like traditional software? Do you still need to test it?
The answers are no and then yes. It’s not like traditional software, and AI apps are indeed harder to test, but they still need to be tested.
LLMs are probabilistic and their output is free-form, so traditional assert-based tests are hard to write. As a result, more and more software shops are giving up and relying on user reports. This is a precarious position, because users will often lose trust in the app rather than report errors.
Let’s start simple. What can you test?
First of all, isolate the deterministic parts and test those the traditional way. Second, AI components need to be approached at the level of the entire system.
An Example
Here’s a RAG app architecture that’s fairly common. In fact, most RAG apps are some variant of this:
Download/Ingest → Chunk → Calculate Embedding → Store in DB → Query → LLM
This gives you an LLM app (e.g. a chatbot) that is able to answer questions based on a corpus of data. Maybe you have a pile of PDFs and you want to offer your customers a chatbot that answers questions from that library.
How do we tackle testing this thing?
First, isolate the units. Most of these steps are easily unit tested — downloading, chunking, storing & querying. Write tests for these just like any other component.
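As a minimal sketch of what such a unit test might look like, assume a hypothetical `chunk_text` function that splits text into fixed-size word windows with some overlap. The function, its parameters, and the test names are illustrative assumptions and do not come from any particular library; a real chunker would likely split on sentences or tokens instead.

```python
def chunk_text(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Hypothetical chunker: fixed-size word windows with overlap."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def test_chunks_cover_every_word():
    words = [f"w{i}" for i in range(120)]
    chunks = chunk_text(" ".join(words), size=50, overlap=10)
    seen = {w for c in chunks for w in c.split()}
    assert seen == set(words)


def test_adjacent_chunks_overlap():
    words = [f"w{i}" for i in range(120)]
    chunks = chunk_text(" ".join(words), size=50, overlap=10)
    for left, right in zip(chunks, chunks[1:]):
        # The tail of one chunk should reappear at the head of the next.
        assert left.split()[-10:] == right.split()[:10]


def test_empty_text_yields_no_chunks():
    assert chunk_text("") == []
```

Tests like these pin down the mechanics of chunking (coverage, overlap, edge cases), but they say nothing about whether the chosen chunk size actually helps retrieval.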
Second, the unit tests each cover too small an area and don’t verify that the system works as a whole. We need to address that differently.
Specific to this example, chunking is the most error-prone part. If you don’t chunk, then all your documents will look the same when queried, so queries won’t surface the right content. Then again, if you make your chunks too small, they won’t have any context, and will rarely align to the actual intent of the document. The art of chunking is all about finding a happy middle ground.
While it’s easy to write unit tests for chunking, you can’t confirm that it works correctly without running it through the full pipeline. Some of the most effective fixes for a chunking problem are re-ranking models and late interaction approaches. But those are applied in the query step, so a lot of the testing needs to happen at the system level.
Testing AI Components
Unit testing is a three-part process:
- Set up a specific scenario
- Execute the code
- Verify the result
Remove the first step and it looks more like monitoring. It’s tempting to view monitoring as something for an operations or DevOps team, but with complex systems like AI apps, monitoring is a critical tool in the quality assurance toolbox.
For our example RAG pipeline, we may want to assert that:
- Every doc returned in query results is at least tangentially relevant to the question
- The LLM’s response correctly references the query results
- The LLM’s full response is consistent with the query results
All of these have a ground truth that can be checked without setting up a specific test scenario. The assertions look tricky to implement, but with an LLM doing the checking they are less tricky than they seem.
The Python library DSPy is very useful for these monitoring-style tests. It’s an LLM framework that lets you state a problem plainly as a signature, and it can automatically generate an LLM prompt optimized for that problem.
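import dspy

# Assumes a language model has already been configured, e.g. with dspy.configure(lm=...)

class IsChunkRelevantSignature(dspy.Signature):
    """Decide whether the document chunk is relevant to the question."""
    document_chunk = dspy.InputField()
    question = dspy.InputField()
    is_relevant = dspy.OutputField(desc="yes or no")

_is_chunk_relevant = dspy.Predict(IsChunkRelevantSignature)

def is_chunk_relevant(chunk: str, question: str) -> bool:
    result = _is_chunk_relevant(document_chunk=chunk, question=question)
    # Normalize first, since the model may answer "Yes" or "yes." instead of "yes"
    return result.is_relevant.strip().lower().startswith("yes")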
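To turn the check above into a monitoring-style assertion on live traffic, one option is to apply it to every chunk the query step returns and record which ones fail. The sketch below is only an illustration: `RetrievalCheck`, `check_retrieval`, and `keyword_judge` are hypothetical names, and the `judge` parameter is assumed to accept anything with the same signature as `is_chunk_relevant`. A crude keyword stub stands in for the LLM so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable

Judge = Callable[[str, str], bool]  # (chunk, question) -> is the chunk relevant?


@dataclass
class RetrievalCheck:
    question: str
    total_chunks: int
    irrelevant_chunks: list[str]

    @property
    def passed(self) -> bool:
        return not self.irrelevant_chunks


def check_retrieval(question: str, chunks: list[str], judge: Judge) -> RetrievalCheck:
    """Apply the judge to every retrieved chunk and collect the failures."""
    irrelevant = [c for c in chunks if not judge(c, question)]
    return RetrievalCheck(question, len(chunks), irrelevant)


def keyword_judge(chunk: str, question: str) -> bool:
    """Stand-in for an LLM judge: relevant if the chunk contains a long question word."""
    key_words = {w for w in question.lower().split() if len(w) > 4}
    return any(w in chunk.lower() for w in key_words)


result = check_retrieval(
    "How do I reset my password?",
    ["Open Settings to reset a forgotten login.", "Our office is closed on Sundays."],
    judge=keyword_judge,  # in production this would be is_chunk_relevant
)
print(result.passed, result.irrelevant_chunks)
```

Keeping the judge injectable means the bookkeeping can be unit tested with a stub, while the LLM-backed judge is swapped in for real traffic.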
Testing The Tests
This is amazing, right? Never before have we been able to test such opaque functionality. On the other hand, these tests are implemented with an LLM, and LLMs are known to be wrong from time to time. How do we make sure these new tests actually work?
First off, write regular old unit tests for them. Check that they work across a variety of specific test cases, and do this in the dev workflow, both in isolation and end-to-end. If a check doesn’t work well, DSPy offers several ways to improve it, from switching to a more complex module like ChainOfThought to applying an optimizer.
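One way to sketch those unit tests is to score the LLM-based check against a small set of hand-labeled cases. Everything here is an assumption made for illustration: the labeled cases are invented, `judge_accuracy` and `assert_judge_meets_floor` are hypothetical helpers, and the accuracy floor is a design choice meant to tolerate an occasional wrong answer from the model. A stub judge is used so the example runs without an LLM; in practice `is_chunk_relevant` would be passed in.

```python
from typing import Callable

Judge = Callable[[str, str], bool]  # (chunk, question) -> is the chunk relevant?

# Hypothetical hand-labeled cases: (chunk, question, expected_relevance)
LABELED_CASES = [
    ("Invoices are emailed on the first of each month.", "When are invoices sent?", True),
    ("Invoices are emailed on the first of each month.", "How do I change my avatar?", False),
    ("Upload a new avatar from the profile page.", "How do I change my avatar?", True),
    ("Upload a new avatar from the profile page.", "When are invoices sent?", False),
]


def judge_accuracy(judge: Judge, cases=LABELED_CASES) -> float:
    """Fraction of labeled cases where the judge agrees with the label."""
    correct = sum(judge(chunk, q) == expected for chunk, q, expected in cases)
    return correct / len(cases)


def assert_judge_meets_floor(judge: Judge, floor: float = 0.9) -> None:
    """Fail if the judge agrees with the labels less often than the floor."""
    accuracy = judge_accuracy(judge)
    assert accuracy >= floor, f"judge accuracy {accuracy:.2f} is below {floor}"


def stub_judge(chunk: str, question: str) -> bool:
    """Stand-in for an LLM judge so the sketch runs offline."""
    return ("invoice" in chunk.lower()) == ("invoice" in question.lower())


assert_judge_meets_floor(stub_judge)
```

Including both relevant and irrelevant pairs matters: a judge that always answers "yes" would score only 50% on this set, so it cannot slip past the floor.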
Next, run it in production. Test every request. Use a fast model like gpt-4o-mini or gemini-1.5-flash (the latter is one of the fastest and cheapest models available) for the tests. These checks should pose easy questions for an LLM, so they don’t need the large bank of knowledge baked into a much larger model.
Finally, no matter how good these tests get, they’ll still be flaky to some extent, which is why it’s critical to review their output in aggregate over all traffic going through the system. You might view it as:
- time series — e.g. “percentage of failures per minute”
- post-hoc cohort analysis — e.g. “percentage of failures for questions about the Gamma particle”
It’s wise to monitor and analyze from both perspectives.
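Both views can be computed from the same log of check results. The following is a minimal sketch under stated assumptions: `CheckRecord` and its fields are hypothetical, the `topic` label is assumed to come from some upstream classification of the question, and a real system would more likely push these numbers into a metrics or analytics tool than compute them in-process.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CheckRecord:
    timestamp: datetime
    topic: str    # hypothetical cohort label for the question
    passed: bool  # outcome of the LLM-based check for this request


def failure_rate_by(records, key):
    """Group records by key(record) and return {group: fraction that failed}."""
    totals, failures = defaultdict(int), defaultdict(int)
    for r in records:
        group = key(r)
        totals[group] += 1
        failures[group] += not r.passed  # a bool adds as 0 or 1
    return {g: failures[g] / totals[g] for g in totals}


def per_minute(record):
    # Truncate the timestamp to the minute it falls in.
    return record.timestamp.replace(second=0, microsecond=0)


def per_topic(record):
    return record.topic


records = [
    CheckRecord(datetime(2024, 1, 1, 12, 0, 5), "billing", True),
    CheckRecord(datetime(2024, 1, 1, 12, 0, 40), "gamma particle", False),
    CheckRecord(datetime(2024, 1, 1, 12, 1, 10), "gamma particle", False),
    CheckRecord(datetime(2024, 1, 1, 12, 1, 30), "billing", True),
]

print(failure_rate_by(records, per_minute))  # time series view
print(failure_rate_by(records, per_topic))   # cohort view
```

In this toy data the per-minute rate sits at a steady 50%, which hides the fact that every failure comes from a single topic. That is the kind of pattern the cohort view is meant to surface.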
Conclusion
Regardless of how you view it, testing AI apps is more important than ever. They feel difficult to test, but with a solid toolbox, it is absolutely attainable. Good luck building!