Why
E2E browser test suites are notoriously painful to maintain and flaky for many reasons; I'm not going to list them all here. The situation hasn't changed much since I first encountered them in late 2015.
Nowadays they are usually written by QA automation developers and frontend engineers, mostly using Cypress, Playwright, or Selenium.
I believe recent advancements in LLMs could help solve two problems with these tests.
First, we can make writing the tests easier by expressing them in a natural-language format.
Second, we can make maintaining them easier by using the natural-language instructions and the website itself as the source of truth. This should eliminate the cases where changes to the HTML tree, or slight changes to the UI, require a human to update the test script.
The second point is about eliminating the toil engineers face as the test suite grows.
Proposed Solution
For the purposes of this blog post I used Python Playwright and Anthropic's Claude 2 LLM API (because it has a decent level of reasoning and, at 100k tokens, the largest context window currently available).
We can take raw instructions like "click on this button" or "type my username in the username box", take the HTML tree of the current page, and feed both into an LLM to convert them into a runnable Playwright instruction. We then execute that instruction and save it if it works. Repeating this process generates a script we can add to our test suite.
The problem we run into with this approach is that we can't fit an entire page into any of the currently available LLMs unless the page is really small. To reduce the token count, I came up with creating a reduced version of the HTML page as a CSV and feeding that to the LLM.
Let's assume we have a list of instructions like this, with a start URL of google.com:
[
    "click reject all",
    "type penguins",
    "click google search",
    "click on the first result"
]
The pseudocode for our program goes like this:

go_to_the_page()
for instruction in list_of_instructions:
    html = capture_page_html()
    page_content = process_html_to_csv(html)  # a reduced CSV view of the page
    relevant_locators = call_anthropic(f"""
        {page_content}
        I want to {instruction}
        Which rows are related to that?
        Please return only the relevant rows without anything else""")
    generated_code = call_anthropic(f"""
        I have this script:
        from playwright.sync_api import sync_playwright, expect
        from time import sleep
        from os import environ
        def test_website():
            with sync_playwright() as p:
                browser = p.chromium.launch(headless=True, timeout=120000)
                page = browser.new_page()
                page.goto("{start_url}", timeout=120000)
        I have this HTML:
        {relevant_locators}
        I want to add a one liner to {instruction}.
        """)
    clean_code = cleanup_code(generated_code)
    exec(clean_code)  # executes the generated step in the live browser
    if success:
        save_the_code()
    else:
        repeat_or_exit()
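The cleanup_code step above is doing real work: model replies often wrap the code in markdown fences or surround it with prose. A minimal sketch of it might look like this (the regex heuristic is my own assumption, not a fixed spec):

```python
import re

def cleanup_code(generated: str) -> str:
    """Extract the code portion of an LLM reply.

    If the reply contains a fenced code block, return its contents;
    otherwise assume the whole reply is code. A simple heuristic --
    real replies may need more cleanup than this.
    """
    match = re.search(r"```(?:\w+)?\n(.*?)```", generated, re.DOTALL)
    if match:
        return match.group(1).strip()
    return generated.strip()
```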
The main relevant parts here are: reducing the HTML passed into the model, executing the code as we go by having our generation program hold a browser window open, and guarding against transient hallucination; the last part just requires retries.
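The retry guard can be a small wrapper; here's a sketch, where generate_and_exec is any callable (a hypothetical interface) that raises on failure and returns the working code on success:

```python
def run_with_retries(generate_and_exec, max_attempts=3):
    """Retry a generate-then-execute step a few times.

    A hallucinated selector usually disappears on a fresh generation,
    so a small retry budget handles most transient failures.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return generate_and_exec()
        except Exception as exc:  # broad on purpose: we only care pass/fail
            last_error = exc
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```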
For reducing the HTML, the pseudocode is:

def process_html_to_csv(html):
    csv = initial_csv()
    for element in html:
        csv.append_row(element.id, element.text, element.any_special_test_id, ...)
    return csv
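As a concrete sketch, here's one way to implement this using only the standard library's html.parser; the set of "interesting" tags and the CSV columns are my assumptions, not a fixed spec:

```python
import csv
import io
from html.parser import HTMLParser

class ElementCollector(HTMLParser):
    """Collect tag, id, text, and data-testid for interactive-looking elements."""

    INTERESTING = {"a", "button", "input", "textarea", "select", "h3"}
    VOID = {"input"}  # no closing tag, so record these immediately

    def __init__(self):
        super().__init__()
        self.rows = []
        self._open = None  # row for the element whose text we're collecting

    def handle_starttag(self, tag, attrs):
        if tag in self.INTERESTING:
            d = dict(attrs)
            row = [tag, d.get("id", ""), "", d.get("data-testid", "")]
            if tag in self.VOID:
                self.rows.append(row)
            else:
                self._open = row

    def handle_data(self, data):
        if self._open is not None and data.strip():
            self._open[2] += data.strip()

    def handle_endtag(self, tag):
        if self._open is not None and tag in self.INTERESTING:
            self.rows.append(self._open)
            self._open = None

def process_html_to_csv(html: str) -> str:
    parser = ElementCollector()
    parser.feed(html)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["tag", "id", "text", "testid"])
    writer.writerows(parser.rows)
    return out.getvalue()
```

Even for a large page, this collapses to a few dozen rows of the elements a test could plausibly interact with.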
One more thing to keep in mind: replace anything that requires a password or other sensitive value with a call that reads an environment variable, and never send sensitive data in the LLM API call.
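One way to do this is a placeholder convention (entirely my assumption here): instructions reference secrets as {{ENV:NAME}}, the placeholder is what reaches the LLM, and a post-processing pass rewrites it into an environ read before execution:

```python
import re

# Matches a quoted placeholder like "{{ENV:MY_PASSWORD}}" in generated code.
PLACEHOLDER = re.compile(r'"\{\{ENV:([A-Z0-9_]+)\}\}"')

def expand_env_placeholders(code: str) -> str:
    """Rewrite quoted {{ENV:NAME}} placeholders into environ reads.

    The secret is resolved at execution time from the environment,
    so the real value is never sent to the LLM API.
    """
    return PLACEHOLDER.sub(lambda m: f'environ["{m.group(1)}"]', code)
```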
An example output is
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.google.com")
    page.fill("//*[@id='APjFqb']", "penguins")
    page.click("//input[@name='btnK']")
    page.click("//h3[contains(text(), 'First Result')]")
    browser.close()
Because LLMs are very good at translating single-file scripts from one programming language to another, you can feed this in and turn it into JavaScript, for example:
const { chromium } = require('playwright');

(async () => {
    const browser = await chromium.launch({ headless: true });
    const context = await browser.newContext();
    const page = await context.newPage();
    await page.goto('https://www.google.com');
    await page.fill('//*[@id="APjFqb"]', 'penguins');
    await page.click('//html/body/div[1]/div[3]/form/div[1]/div[1]/div[2]/div[2]/div[7]/center/input[1]');
    await page.click("//html/body/div[5]/div[1]/div[12]/div[1]/div[2]/div[2]/div/div/div[6]/div/div/div[1]/div/div/span/a");
    await browser.close();
})();
Another thing clearly needed in the script above, and doable in a project you control, is to hook into test-id HTML attributes instead of XPaths; you can do this by mapping and replacing them in the HTML CSV as it's passed into the LLM API.
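A small selector-priority helper sketches the idea; the row shape here is hypothetical (a dict from the HTML-to-CSV step), and the preference order is simply stability: data-testid attributes survive refactors better than ids, which survive better than raw XPaths:

```python
def prefer_test_id(row):
    """Pick the most stable selector available for a CSV row.

    row is a dict like {"testid": ..., "id": ..., "xpath": ...}
    (hypothetical shape). Prefer data-testid, then id, then XPath.
    """
    if row.get("testid"):
        return f'[data-testid="{row["testid"]}"]'
    if row.get("id"):
        return f'#{row["id"]}'
    return row.get("xpath", "")
```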
Cons
Pricing: at $1.63 per million tokens (https://www-files.anthropic.com/production/images/model_pricing_july2023.pdf), costs add up very quickly.
It does not integrate well with already-written test modules or the usual development workflow.
Hallucination and other flakiness coming from the LLM itself: we traded one set of problems for another here, and only future improvements to LLM APIs will change that.
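To put the pricing point in perspective, a back-of-envelope estimate (the per-step token count and suite size are hypothetical numbers, not measurements):

```python
# Rough cost of regenerating a suite, assuming ~4,000 prompt tokens per
# instruction (the reduced-HTML CSV plus the prompt template) and the
# two API calls per step from the pseudocode above.
PRICE_PER_MILLION = 1.63        # USD, the Claude 2 prompt price cited above
tokens_per_step = 4_000 * 2     # two calls per instruction
steps = 50                      # hypothetical suite size

cost = steps * tokens_per_step * PRICE_PER_MILLION / 1_000_000
print(f"${cost:.2f} per full regeneration")  # → $0.65 per full regeneration
```

Cheap for one run, but regenerating on every CI pass across many suites multiplies this fast.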
Final Thoughts
This is still a work in progress; there are more advancements coming to LLMs (such as multimodal inputs) that I would like to incorporate here in a future post.
Feel free to reach out to me if you're interested in replicating this or have any questions about it.