Here I am to spill some beans on what happened at Octomind in the last 2 weeks. Let's start with what went well.
Our AI agent often fails, and the AI part is not always the culprit. The agent doesn't have all the information it needs because it doesn't see all web elements. Shadow DOMs and deeply nested iframes are really tricky.
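To make that concrete, here's a minimal TypeScript sketch (illustrative, not our actual crawler) of why a naive DOM scan comes up short: `document.querySelectorAll` stops at shadow and iframe boundaries, so you have to recurse into them explicitly - and even then, closed shadow roots and cross-origin iframes stay out of reach.

```typescript
// Collect elements a plain querySelectorAll would miss, by recursing
// into open shadow roots and same-origin iframes.
function collectAllElements(root: Document | ShadowRoot): Element[] {
  const found: Element[] = [];
  for (const el of Array.from(root.querySelectorAll('*'))) {
    found.push(el);
    // Open shadow DOM trees are reachable via el.shadowRoot;
    // closed ones return null and stay invisible to script.
    if (el.shadowRoot) {
      found.push(...collectAllElements(el.shadowRoot));
    }
    // Same-origin iframes expose their document; cross-origin
    // iframes yield null, so their contents are simply not there.
    if (el instanceof HTMLIFrameElement) {
      try {
        const innerDoc = el.contentDocument;
        if (innerDoc) found.push(...collectAllElements(innerDoc));
      } catch {
        // Cross-origin iframe: nothing we can do from here.
      }
    }
  }
  return found;
}
```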
It dawned on us that many edge cases are so complex that they can only be handled through precise multimodality - combining screenshot and public DOM analysis. And that requires much faster iteration on the edge cases we encounter. We hope to get better at that after dropping LangChain.
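What combining the two signals might look like, as a sketch: send the rendered screenshot alongside a pruned DOM snapshot in one multimodal request. This assumes the official OpenAI Node SDK; the helper name and prompt framing are made up for illustration.

```typescript
import OpenAI from 'openai';
import { readFileSync } from 'node:fs';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper: resolve an instruction against both signals.
// The screenshot shows what the user actually sees; the DOM snapshot
// carries the selectors a test step ultimately needs.
async function locateElement(
  instruction: string,
  domSnapshot: string,
  screenshotPath: string,
): Promise<string | null> {
  const imageBase64 = readFileSync(screenshotPath).toString('base64');
  const response = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: `${instruction}\n\nDOM snapshot:\n${domSnapshot}` },
          { type: 'image_url', image_url: { url: `data:image/png;base64,${imageBase64}` } },
        ],
      },
    ],
  });
  return response.choices[0].message.content;
}
```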
Not all LLM progress is good news for us. We had high hopes for the new GPT-4o, but our experiments showed mixed results over the last 2 weeks. It is fast, but it tends to loop and makes more mistakes when reasoning over larger contexts (e.g. DOM code). It performs worse on complex tasks like our test discovery.
We'll stick with the slower GPT-4 Turbo for all the complex stuff and use GPT-4o where it shines: in precise settings with well-defined guardrails, to speed up the process. Like proposing the next test steps.
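In practice this amounts to routing each task to a model by its profile rather than using one model everywhere. A minimal sketch of the idea - the task names are made up for illustration, only the split itself comes from our experience above:

```typescript
// Route agent tasks to models by profile: fast model for narrow,
// well-guarded prompts; slower model for long-context reasoning.
type AgentTask = 'propose-next-step' | 'test-discovery';

function modelFor(task: AgentTask): string {
  switch (task) {
    case 'propose-next-step':
      // Narrow, well-guarded prompt: GPT-4o's speed wins here.
      return 'gpt-4o';
    case 'test-discovery':
      // Complex reasoning over large DOM context: the slower
      // GPT-4 Turbo looped less and made fewer mistakes for us.
      return 'gpt-4-turbo';
  }
}
```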
We've amped up the agent's output:
- We gave the AI assertions some love, making them faster and more consistent.
- AI discovery serves more meaningful test cases. The discovered tests are now:
All these improvements are based on our own benchmarking data. Try it yourself and throw a new edge case in the AI agent's way 😊
Thanks for all the feedback on my last update. Happy you liked it. As for the improvement suggestions, please keep them coming.