
How To Know If You Need To Test AI Apps?

Author: Tim Kellogg

Last updated: October 1, 2024


AI is clearly smart and developing at a mind-boggling pace. Sometimes you might wonder, is AI like traditional software? Do you still need to test it?

The answers are no and then yes. It’s not like traditional software, and AI apps are indeed harder to test, but they still need to be tested. 

LLMs are probabilistic and their output is complex, so traditional assert-based tests are hard to write. As a result, many software shops give up and rely on user reports. This is a precarious position, because users will often lose trust in the app rather than report errors.

Let’s start simple. What can you test?

First of all, isolate the deterministic parts and test those the traditional way. Second, AI components need to be approached at the level of the entire system. 

An Example

Here’s a RAG app architecture that’s fairly common. In fact, most RAG apps are some variant of this:

Download/Ingest → Chunk → Calculate Embedding → Store in DB → Query → LLM

This gives you an LLM app (e.g. a chatbot) that is able to answer questions based on a corpus of data. Maybe you have a pile of PDFs and you want to offer your customers a chatbot that answers questions from that library.

How do we tackle testing this thing?

First, isolate the units. Most of these steps are easily unit tested — downloading, chunking, storing & querying. Write tests for these just like any other component.
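As a rough illustration of testing a deterministic step the traditional way, here is a minimal sketch of a chunker and two unit tests. The function `chunk_text`, its fixed-size character windows, and its parameters are assumptions made for this example, not the chunking strategy of any particular pipeline.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def test_chunks_cover_the_whole_document():
    text = "abcdefghij" * 10  # 100 characters
    chunks = chunk_text(text, chunk_size=30, overlap=10)
    # Dropping each later chunk's overlapping prefix must rebuild the input.
    rebuilt = chunks[0] + "".join(c[10:] for c in chunks[1:])
    assert rebuilt == text
    assert all(len(c) <= 30 for c in chunks)


def test_empty_document_yields_no_chunks():
    assert chunk_text("") == []
```

Tests like these run in milliseconds and need no model, which is why it pays to keep the deterministic steps separate from the AI components.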

Second, the unit tests cover too small an area and don’t verify that the system works as a whole. We need to address that differently.

Specific to this example, chunking is the most error-prone part. If you don’t chunk, then all your documents will look the same when queried, so queries won’t surface the right content. Then again, if you make your chunks too small, they won’t have any context, and will rarely align to the actual intent of the document. The art of chunking is all about finding a happy middle ground.

While it’s easy to write unit tests for chunking, you can’t test that it works correctly without running it through the full pipeline. Some of the most effective fixes for a chunking problem are things like re-ranking models and late interaction approaches. But those are applied in the query step, so a lot of the testing needs to happen at the system level.
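A system-level check might look something like the sketch below: run a small document through chunking, embedding, and querying, then assert that a question surfaces the chunk you expect. Everything here is a stand-in chosen so the example runs offline. The word-count `embed` function, the paragraph chunker, and the sample document are assumptions, and a real pipeline would call an embedding model and a database instead.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real pipeline would call a model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_by_paragraph(doc: str) -> list[str]:
    return [p.strip() for p in doc.split("\n\n") if p.strip()]


def top_chunk(question: str, chunks: list[str]) -> str:
    """Return the chunk most similar to the question."""
    q = embed(question)
    return max(chunks, key=lambda c: cosine(q, embed(c)))


DOC = """Refunds are issued within 14 days of a return request.

Shipping to Canada takes 5 to 7 business days.

Support is available by email on weekdays."""


def test_question_surfaces_the_right_chunk():
    chunks = chunk_by_paragraph(DOC)
    assert "Refunds" in top_chunk("How long do refunds take?", chunks)
    assert "Canada" in top_chunk("How long does shipping to Canada take?", chunks)
```

If the document were stored as a single chunk, both questions would return the same text, which is the "everything looks the same" failure described above. A test at this level catches that, while a unit test of the chunker alone would not.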

Testing AI Components

Unit testing is a three-part process:

  1. Set up a specific scenario
  2. Execute the code
  3. Verify the result

When you remove step 1, it looks more like monitoring. It’s tempting to view monitoring as something for an operations or DevOps team, but with complex systems like AI, monitoring is a critical tool in the quality assurance toolbox.

For our example RAG pipeline, we may want to assert that:

  • Every doc returned in query results is at least tangentially relevant to the question
  • The LLM’s response correctly references the query results
  • The LLM’s full response is consistent with the query results

All of these have a ground truth that can be checked on any request, without setting up a specific test scenario. The assertions look tricky to implement, but with an LLM doing the judging they are less tricky than they seem.

The Python library DSPy is very useful for these types of monitoring-style tests. It’s an LLM framework that lets you state a problem plainly, and it can automatically generate an LLM prompt optimized for answering that question.

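import dspy

class IsChunkRelevantSignature(dspy.Signature):
    """Decide whether a document chunk is relevant to a question."""
    document_chunk = dspy.InputField()
    question = dspy.InputField()
    is_relevant = dspy.OutputField(desc="yes or no")

# Predict builds the prompt from the signature.
# A language model must be configured in DSPy before this is called.
_is_chunk_relevant = dspy.Predict(IsChunkRelevantSignature)

def is_chunk_relevant(chunk: str, question: str) -> bool:
    result = _is_chunk_relevant(document_chunk=chunk, question=question)
    # Normalize the answer: models often reply "Yes" or "yes." instead of "yes".
    return result.is_relevant.strip().lower().startswith("yes")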
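To turn a check like this into monitoring, one option is to wrap it in a small recorder that judges every retrieved chunk and keeps the verdicts for later analysis. The sketch below is one possible shape, not part of DSPy: `RelevanceMonitor` and `keyword_judge` are hypothetical names, and `keyword_judge` is only an offline stand-in for an LLM-backed judge such as `is_chunk_relevant`.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RelevanceMonitor:
    """Runs a relevance judge over every retrieved chunk and records the verdicts."""

    # Any callable taking (chunk, question), e.g. an LLM-backed judge.
    judge: Callable[[str, str], bool]
    records: list[dict] = field(default_factory=list)

    def check(self, question: str, chunks: list[str]) -> bool:
        verdicts = [self.judge(chunk, question) for chunk in chunks]
        passed = all(verdicts)
        self.records.append(
            {"question": question, "verdicts": verdicts, "passed": passed}
        )
        return passed


def keyword_judge(chunk: str, question: str) -> bool:
    """Offline stand-in: true if the chunk contains any longer word from the question."""
    words = {w.strip("?.,!").lower() for w in question.split()}
    return any(len(w) > 3 and w in chunk.lower() for w in words)


monitor = RelevanceMonitor(judge=keyword_judge)
monitor.check(
    "How long do refunds take?",
    ["Refunds are issued within 14 days.", "Support is available by email."],
)
```

Taking the judge as a parameter keeps the recorder testable without a model, and lets the production code pass in the real LLM-backed function.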



Testing The Tests

This is amazing, right? Never before have we been able to test such opaque functionality. On the other hand, these tests are implemented with an LLM, and LLMs are known to be wrong from time to time. How do we make sure these new tests actually work?

First off, write regular old unit tests for these. Check that they work in a variety of specific test cases. Do this in the dev workflow, both isolated and end-to-end. If a check doesn’t work well enough, DSPy offers several ways to improve it, from more complex modules like ChainOfThought to its optimizers.
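One way to unit test an LLM-backed check is to score it against a handful of hand-labeled cases and require a minimum accuracy. The sketch below assumes a judge with the signature `(chunk, question) -> bool`. The labeled cases, the `judge_accuracy` helper, and `stub_judge` (a word-overlap placeholder so the example runs without a model) are all made up for illustration.

```python
from typing import Callable

# (chunk, question, expected verdict)
LabeledCase = tuple[str, str, bool]

CASES: list[LabeledCase] = [
    ("Refunds are issued within 14 days.", "How long do refunds take?", True),
    ("Support is available by email.", "How long do refunds take?", False),
    ("Shipping to Canada takes 5 to 7 business days.",
     "When will my order reach Canada?", True),
]


def judge_accuracy(judge: Callable[[str, str], bool], cases: list[LabeledCase]) -> float:
    """Fraction of labeled cases where the judge agrees with the expected verdict."""
    if not cases:
        raise ValueError("need at least one labeled case")
    correct = sum(judge(chunk, question) == expected
                  for chunk, question, expected in cases)
    return correct / len(cases)


def stub_judge(chunk: str, question: str) -> bool:
    """Placeholder for the LLM-backed judge: true if the texts share a content word."""
    chunk_words = set(chunk.lower().rstrip(".").split())
    question_words = set(question.lower().rstrip("?").split())
    stop_words = {"to", "do", "my", "is", "how"}
    return bool((chunk_words & question_words) - stop_words)


def test_judge_meets_accuracy_bar():
    # In a real suite, pass the LLM-backed judge here instead of the stub.
    assert judge_accuracy(stub_judge, CASES) >= 0.9
```

A threshold rather than an exact match leaves room for the occasional wrong answer from the model, while still failing the build if the check degrades badly.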

Next, run it in production. Test every request. Use a fast model like gpt-4o-mini or gemini-1.5-flash for the tests (at the time of writing, the latter is among the fastest and cheapest models available). These should be easy questions for an LLM to answer, so they don’t need the large bank of knowledge baked into a much larger model.

Finally, no matter how good these tests get, they’ll still be flaky to some extent. That is why it’s critical to review their output in aggregate over all traffic going through the system. You might view it as:

  • a time series, e.g. “percentage of failures per minute”
  • a post-hoc cohort analysis, e.g. “percentage of failures for questions about the Gamma particle”

It’s wise to monitor and analyze from both perspectives.
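Both views can come from the same recorded results, grouped by a different key. The sketch below assumes each check was logged as a dict with a timestamp, a topic label, and a pass/fail flag. The record layout, the `failure_rate_by` helper, and the sample data are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime


def failure_rate_by(records, key):
    """Group records by key(record) and return the failure share for each group."""
    totals, failures = defaultdict(int), defaultdict(int)
    for record in records:
        group = key(record)
        totals[group] += 1
        failures[group] += not record["passed"]
    return {group: failures[group] / totals[group] for group in totals}


RECORDS = [
    {"ts": datetime(2024, 10, 1, 12, 0, 5), "topic": "gamma particle", "passed": False},
    {"ts": datetime(2024, 10, 1, 12, 0, 40), "topic": "refunds", "passed": True},
    {"ts": datetime(2024, 10, 1, 12, 1, 10), "topic": "gamma particle", "passed": False},
    {"ts": datetime(2024, 10, 1, 12, 1, 30), "topic": "refunds", "passed": True},
]

# Time series: failure share per minute.
per_minute = failure_rate_by(
    RECORDS, lambda r: r["ts"].replace(second=0, microsecond=0)
)

# Cohort analysis: failure share per topic.
per_topic = failure_rate_by(RECORDS, lambda r: r["topic"])
```

In this sample the per-minute rate sits at a steady 50%, which looks like ordinary flakiness, while the per-topic view shows that every failure belongs to one cohort. That difference is the reason to look from both perspectives.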


Conclusion

However you view it, testing AI apps is more important than ever. They feel difficult to test, but with a solid toolbox it is absolutely attainable. Good luck building!

Tim Kellogg

Tim Kellogg is an AI architect, software engineer, and overall tech enthusiast. He is the founder of dentropy and Fossil, and was ML Director at Advata and cofounder and CTO of Fancy Robot. He shares his passion for creating innovative solutions and exploring the frontiers of technology on this blog and on LinkedIn.