June 25, 2026 | Agents | Open Source | Tool

Alibaba's page-agent skips the screenshots

Most web agents work like a person squinting at a screen. They take a screenshot, ask a model where to click, move the mouse, screenshot again. Slow, expensive, fragile. Alibaba's page-agent throws that out. It lives inside the webpage as JavaScript, reads the DOM as text, and acts on it directly. No browser extension, no headless Chrome, no vision model burning tokens on pixels.
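To make "reads the DOM as text" concrete, here is a minimal sketch of the general idea, not page-agent's actual implementation. It assumes a simplified, hypothetical node shape (`SketchNode`) in place of real DOM elements so it runs outside a browser, and the helper name `serializeInteractive` is invented for illustration. It walks the tree and emits one numbered line per interactive element, which is the kind of compact text a model could read instead of a screenshot.

```typescript
// Hypothetical stand-in for a DOM element, so the sketch runs without a browser.
interface SketchNode {
  tag: string;
  text?: string;
  attrs?: Record<string, string>;
  children?: SketchNode[];
}

// Assumption: only these tags count as "interactive" for the sketch.
const INTERACTIVE_TAGS = new Set(["a", "button", "input", "select", "textarea"]);

// Walk the tree depth-first and emit one numbered line per interactive element.
function serializeInteractive(root: SketchNode): string[] {
  const lines: string[] = [];
  const visit = (node: SketchNode): void => {
    if (INTERACTIVE_TAGS.has(node.tag)) {
      // Prefer visible text, then fall back to placeholder or name attributes.
      const label = node.text ?? node.attrs?.placeholder ?? node.attrs?.name ?? "";
      lines.push(`[${lines.length}] <${node.tag}> ${label}`.trimEnd());
    }
    for (const child of node.children ?? []) visit(child);
  };
  visit(root);
  return lines;
}

// A tiny example page: a form with a label, an input, and a button.
const page: SketchNode = {
  tag: "form",
  children: [
    { tag: "label", text: "Email" },
    { tag: "input", attrs: { name: "email", placeholder: "you@example.com" } },
    { tag: "button", text: "Subscribe" },
  ],
};

const pageAsText = serializeInteractive(page);
console.log(pageAsText.join("\n"));
// [0] <input> you@example.com
// [1] <button> Subscribe
```

Two short lines of text stand in for what would otherwise be a full screenshot, and the bracketed index gives the model a stable handle to refer back to.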

Say what you want in natural language and it manipulates the page through the DOM. You bring your own model. It's been quietly maturing: 33 releases, with 1.10.0 landing in mid-June. It's trending on GitHub again, with a fresh Hacker News thread and nearly 20k stars.
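The instruction-to-action step can be sketched roughly as follows. Everything here is an assumption for illustration: the `Action` format, the `Model` function type, the `Field` stand-in for page elements, and `runStep` are hypothetical names, not page-agent's API. The point is only the shape of the loop: list the elements as text, hand the task and the listing to whatever model you plug in, parse its reply, and apply it to the indexed element.

```typescript
// Hypothetical action format; the real project's protocol may differ.
type Action =
  | { kind: "type"; index: number; value: string }
  | { kind: "click"; index: number };

// "Bring your own model": any function that maps a prompt to a JSON action.
// A real model call would be asynchronous; kept synchronous here for brevity.
type Model = (prompt: string) => string;

// Hypothetical stand-in for an interactive page element.
interface Field {
  label: string;
  value: string;
  clicked: boolean;
}

// One agent step: describe the page as text, ask the model, apply its action.
function runStep(task: string, fields: Field[], model: Model): Action {
  const listing = fields.map((f, i) => `[${i}] ${f.label}`).join("\n");
  const reply = model(`Task: ${task}\nElements:\n${listing}\nReply with one JSON action.`);
  const action = JSON.parse(reply) as Action;
  const target = fields[action.index];
  if (!target) throw new Error(`No element at index ${action.index}`);
  if (action.kind === "type") target.value = action.value;
  else target.clicked = true;
  return action;
}

// Stub model standing in for a real LLM, so the sketch runs offline.
const stubModel: Model = (prompt) =>
  prompt.includes("[0] email input")
    ? JSON.stringify({ kind: "type", index: 0, value: "ada@example.com" })
    : JSON.stringify({ kind: "click", index: 1 });

const fields: Field[] = [
  { label: "email input", value: "", clicked: false },
  { label: "Subscribe button", value: "", clicked: false },
];

const firstAction = runStep("Subscribe ada@example.com", fields, stubModel);
console.log(firstAction.kind, fields[0]?.value);
```

Because the model is just a function from prompt to reply, swapping providers means swapping that one function; the page-side code does not change.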

Why it matters: the screenshot-and-click approach, the one most computer-use demos show off, is the brute-force path. Reading structured DOM is cheaper, faster, and more reliable for the huge slice of agent work that happens in a browser. It's less general than pixel-level control, but for in-app automation it wins on every practical axis. The bet here is that for the web specifically, you don't need eyes. You need the source.

Link: https://github.com/alibaba/page-agent