AI doesn’t belong in test runtime

Not all AI in e2e testing is created equal

Adopting generative AI in end-to-end testing promises better test coverage and less time spent on testing. At last, automating all those manual test cases seems within reach, right?

We have to talk about the stability and reliability of AI in this context, too. The concerns are real and I’d like to address a few here. 

Testing tools don’t use AI in the same way 

AI testing tools differ in how they use AI to write, execute, and maintain automated tests. The LLMs under the hood are not deployed in the same place, or in the same way, by every testing technology. The AI can be deployed as:

  • codegen: You give the AI access to your code so it can generate test code in the desired automation framework. These are the Copilots and Cursors, or ChatGPT, when you prompt the LLM to create test code for you.
  • agents: AI is used to interact with an application the way a human would. This usually happens in one of two ways (see the sketch after this list):
    1. AI used in runtime: the LLM goes into the app and interacts with it every time a test or a test step is executed.
    2. deterministic code used in runtime: LLMs are used to create interaction representations that translate into deterministic code used during test execution.
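
To make the distinction concrete, here is a rough sketch of what actually runs at execution time in each style. The `aiAgent` calls are a hypothetical stand-in for any agentic tool, and the URL and locators are made up; the deterministic variant is plain Playwright.

```typescript
import { test, expect } from '@playwright/test';

// Style 1 - AI in runtime: every executed step is an LLM round trip.
// `aiAgent` is a hypothetical stand-in for an agentic tool, not a real API:
//
//   await aiAgent.perform(page, 'add the first product to the cart');
//   await aiAgent.check(page, 'the cart badge shows 1');

// Style 2 - deterministic code in runtime: an LLM may have authored these
// lines at creation time, but execution is plain Playwright with no model calls.
test('adding a product updates the cart badge', async ({ page }) => {
  await page.goto('https://shop.example.com'); // hypothetical URL
  await page.getByRole('button', { name: 'Add to cart' }).first().click();
  await expect(page.getByTestId('cart-badge')).toHaveText('1');
});
```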

What can go wrong?

  1. One significant problem is the brittleness of AI models. Small changes to input data - be it a prompt or an update to the web application - can have a disproportionate impact, leading to false positives or false negatives in test results. Without a thorough review of an AI-generated test case, that brittleness can slip unnoticed into your test results.

    There are strategies to reinforce good outcomes, such as deploying check loops, but eventually you'll need a human in the loop. And requiring too much “human” in the equation eats up the main benefit of AI: saving manual work. You know, the reason you used AI in the first place.
  2. Another issue is the interpretability of AI model output. Understanding why a particular test failed and how to resolve the issue can be challenging, especially when the AI-generated code is complex or unfamiliar. It requires testers to dig deep into the AI’s output, which can be a tedious and frustrating task.

    Adding insult to injury, not all AI testing tools even disclose the generated code, which makes it difficult to interpret anything at all.
  3. A third problem is embedding AI directly in test case execution, for example for smart assertions. This can be both slow and costly, creating barriers to running test suites as frequently as desired, and it introduces an extra layer of instability that further complicates the process. The more parts of a test run call the LLM, the more complicated it gets.
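
To see where the slowness and cost come from, compare a deterministic assertion with an LLM-backed “smart assertion”. The `aiAssert` helper below is hypothetical and shown only for contrast; the URL and locators are made up too.

```typescript
import { test, expect } from '@playwright/test';

test('order confirmation is shown', async ({ page }) => {
  await page.goto('https://shop.example.com/checkout/done'); // hypothetical URL

  // Deterministic assertion: evaluated locally against the DOM,
  // same verdict every run, no extra latency or cost.
  await expect(
    page.getByRole('heading', { name: 'Thank you for your order' })
  ).toBeVisible();

  // "Smart assertion" in runtime (hypothetical helper, shown for contrast):
  // each run would call an LLM, adding a network round trip, token cost,
  // and the chance of a different verdict on identical pages.
  // await aiAssert(page, 'the page confirms the order was placed');
});
```
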
Screenshot of a Reddit comment
Source: Reddit

The good news is that all of these concerns can be mitigated by adopting the right architecture in the right place of the testing cycle.

Use AI for test creation

AI is most valuable during the creation and maintenance phases of test cases. Let’s take a scripting example. You could begin with a prompt describing your desired test case and allow the AI to generate an initial version.

screenshot of a prompt and generated code in Cursor
Cursor screenshot - prompt + output
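
The generated draft typically looks something like this - a hypothetical sketch rather than the exact output in the screenshot, with made-up URL, labels, and headings:

```typescript
import { test, expect } from '@playwright/test';

// Prompt (roughly): "Write a Playwright test that logs in as a demo user
// and checks that the dashboard greets them by name."
test('dashboard greets the logged-in user', async ({ page }) => {
  await page.goto('https://app.example.com/login'); // hypothetical URL
  await page.getByLabel('Email').fill('demo@example.com');
  await page.getByLabel('Password').fill('demo-password');
  await page.getByRole('button', { name: 'Sign in' }).click();
  await expect(
    page.getByRole('heading', { name: /welcome, demo/i })
  ).toBeVisible();
});
```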

If you’re lucky, the AI may produce a valid and ready-to-use test case right away. How convenient!

If the AI struggles to interpret your application, you - the domain expert - can step in and guide it, ensuring that the resulting test case is accurate and robust. It’s good practice to keep the AI’s fallibility top of mind when you’re assessing its output. It’s an even better practice for tool developers to build that reminder into the process.

screenshot of an AI agent notification asking for help
AI agent asking for help 

Do not use AI in test runtime 

Ideally, AI should not be used during runtime. It’s slow. It’s brittle. It’s costly. A test case represents an expectation of how a system should work in a particular area. The agentic AI must try to fulfill this expectation. No workarounds. Only if the expectation is formulated precisely enough - as code or explicit steps - can it be validated against.

I suggest relying on established automation frameworks such as Playwright, Cypress, or Selenium for test execution. By using standard automation framework code, your test cases remain deterministic, allowing you to execute them quickly and reliably. Some providers even offer execution platforms to scale your standard framework test suites efficiently.
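
As a minimal sketch of what that looks like in practice - assuming Playwright and purely illustrative values - a deterministic suite scales with nothing more exotic than the framework’s own configuration:

```typescript
// playwright.config.ts - a minimal sketch; values are illustrative.
// Because nothing calls an LLM at runtime, the suite parallelizes cheaply
// and reruns produce the same result.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  fullyParallel: true, // run test files in parallel
  workers: 8,          // scale out on CI; pick a value that fits your runners
  retries: 1,          // retry genuine infrastructure flakes once
  use: { baseURL: 'https://app.example.com' }, // hypothetical URL
});
```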

conceptual diagram of the testing phases and where AI is used

Use AI for test maintenance

The case for using AI in test auto-healing is quite strong. When given boundaries and the ‘good example’ of the original test case, the AI’s worst instincts can be reined in. The idea is that AI generation works best when the problem space is limited.

A robust solution to auto-maintenance would address a huge pain point in end-to-end testing. Maintenance is more time-consuming (and frustrating) than scripting and running tests combined. Many tools are building AI-powered auto-maintenance features right now. If they get good enough, they could considerably simplify the process of keeping your tests up to date and relevant.
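
To illustrate the limited problem space, here is a hypothetical healing scenario: the original test is the ‘good example’, and the only thing the AI is asked to regenerate is a locator that broke after a UI change. Everything else stays deterministic, and the fix is reviewed before it lands. URL and locators are made up.

```typescript
import { test, expect } from '@playwright/test';

test('user can submit the contact form', async ({ page }) => {
  await page.goto('https://app.example.com/contact'); // hypothetical URL
  await page.getByLabel('Message').fill('Hello!');

  // Original locator, broken after a redesign renamed the button:
  //   await page.getByRole('button', { name: 'Send' }).click();
  // Replacement proposed by the auto-healer at maintenance time and reviewed
  // by a human before being committed - no LLM call happens during the run:
  await page.getByRole('button', { name: 'Send message' }).click();

  await expect(page.getByText('Thanks, we got your message')).toBeVisible();
});
```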

Daniel Roedler
Co-founder & CPO