Alibaba's page-agent skips the screenshots
Most web agents work like a person squinting at a screen. They take a screenshot, ask a model where to click, move the mouse, screenshot again. Slow, expensive, fragile. Alibaba's page-agent throws that out. It lives inside the webpage as JavaScript, reads the DOM as text, and acts on it directly. No browser extension, no headless Chrome, no vision model burning tokens on pixels.
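To make "reads the DOM as text, and acts on it directly" concrete, here is a minimal sketch of the general technique. It is not page-agent's actual API: `FakeElement`, `collectInteractive`, `serialize`, and `applyAction` are hypothetical names, and a plain object tree stands in for the real DOM so the example runs outside a browser. The idea is to flatten the interactive elements into indexed text lines a language model can read, then map the model's chosen index back to an element.

```typescript
// Hypothetical sketch of DOM-as-text agent control. NOT page-agent's real API.
// A tiny stand-in for DOM nodes keeps the example runnable outside a browser.
interface FakeElement {
  tag: string;
  text?: string;
  value?: string;
  clicked?: boolean;
  children?: FakeElement[];
}

// Tags an agent can usually act on (an assumption for this sketch).
const INTERACTIVE = new Set(["a", "button", "input", "select", "textarea"]);

// Walk the tree depth-first and collect the elements an agent could act on.
function collectInteractive(root: FakeElement, out: FakeElement[] = []): FakeElement[] {
  if (INTERACTIVE.has(root.tag)) out.push(root);
  for (const child of root.children ?? []) collectInteractive(child, out);
  return out;
}

// Render the interactive elements as indexed text lines for a language model.
function serialize(elements: FakeElement[]): string {
  return elements
    .map((el, i) => {
      const label = el.text ?? el.value ?? "";
      return label ? `[${i}] <${el.tag}> ${label}` : `[${i}] <${el.tag}>`;
    })
    .join("\n");
}

// The model answers with an index and an action instead of pixel coordinates.
type Action =
  | { kind: "click"; index: number }
  | { kind: "type"; index: number; text: string };

// Apply the chosen action to the element the index refers to.
function applyAction(elements: FakeElement[], action: Action): void {
  const target = elements[action.index];
  if (!target) throw new Error(`no element at index ${action.index}`);
  if (action.kind === "click") target.clicked = true;
  else target.value = action.text;
}

const page: FakeElement = {
  tag: "form",
  children: [
    { tag: "p", text: "Sign in" },
    { tag: "input", value: "" },
    { tag: "button", text: "Submit" },
  ],
};

const elements = collectInteractive(page);
const prompt = serialize(elements); // text that would be sent to the model
applyAction(elements, { kind: "type", index: 0, text: "ada@example.com" });
applyAction(elements, { kind: "click", index: 1 });
```

Here `prompt` is two short lines of text, one for the input and one for the button, rather than an image. That is where the cost and speed advantage over screenshots would come from in a scheme like this.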
Say what you want in natural language and it manipulates the page through the DOM. You bring your own model. It's been quietly maturing: 33 releases so far, the latest being 1.10.0 in mid-June. It's trending on GitHub again, with a fresh Hacker News thread and nearly 20k stars.
Why it matters: the screenshot-and-click approach, the one most computer-use demos show off, is the brute-force path. Reading structured DOM is cheaper, faster, and more reliable for the huge slice of agent work that happens in a browser. It's less general than pixel-level control, but for in-app automation it wins on every practical axis. The bet here is that for the web specifically, you don't need eyes. You need the source.
Link: https://github.com/alibaba/page-agent