Browser
Control a built-in Chromium browser to automate web tasks, extract data, and interact with any website
xiantong includes a built-in Chromium browser that your agent can control directly. Navigate pages, fill forms, click buttons, extract data, run JavaScript, and inspect network traffic — all without leaving the conversation.
When to Use the Browser#
Good fit for the browser#
- One-off tasks that don’t need a reusable integration
- UI-only workflows where no API exists
- When source setup is blocked and you need results now
- Scraping or extracting data from web pages
- Filling forms or completing multi-step web workflows
Better with a source#
- Repeatable tasks you’ll run regularly
- Team-wide automation and reporting
- Workflows that need stable, programmatic access
- When an API or MCP server already exists for the service
Core Workflow#
Every browser interaction follows the same pattern:
1. Open the browser: The agent opens a browser window in the background, or reuses an existing one.
2. Navigate to a page: The agent loads a URL. It can navigate to any website, including ones where you’re already logged in.
3. Inspect the page: The agent takes a snapshot of the page. The snapshot is a structured accessibility tree that labels every interactive element (buttons, links, inputs) with a reference ID such as @e1 or @e2.
4. Interact: Using those references, the agent can click buttons, fill text inputs, select dropdown options, scroll, and send keyboard shortcuts.
5. Extract or verify: The agent reads the results. It can extract data with JavaScript, take screenshots for visual verification, or inspect network traffic to understand what happened.
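To make the snapshot step more concrete, here is a minimal sketch of how reference IDs could be attached to interactive elements in an accessibility tree. The `AxNode` and `RefEntry` types, the list of interactive roles, and the depth-first numbering order are all assumptions made for illustration. They are not xiantong's actual snapshot format or API.

```typescript
// Hypothetical model of a page snapshot; not xiantong's real data format.
interface AxNode {
  role: string; // e.g. "button", "link", "textbox", "heading"
  name: string; // accessible name of the element
  children?: AxNode[];
}

interface RefEntry {
  ref: string; // "@e1", "@e2", ...
  role: string;
  name: string;
}

// Roles treated as interactive in this sketch (an assumption).
const INTERACTIVE_ROLES = new Set(["button", "link", "textbox", "combobox", "checkbox"]);

// Walk the tree depth-first and give each interactive node the next reference ID.
function assignRefs(root: AxNode): RefEntry[] {
  const entries: RefEntry[] = [];
  const visit = (node: AxNode): void => {
    if (INTERACTIVE_ROLES.has(node.role)) {
      entries.push({ ref: `@e${entries.length + 1}`, role: node.role, name: node.name });
    }
    for (const child of node.children ?? []) {
      visit(child);
    }
  };
  visit(root);
  return entries;
}

// Resolve a reference back to its element, as an interaction step would need to.
function resolveRef(entries: RefEntry[], ref: string): RefEntry {
  const found = entries.find((entry) => entry.ref === ref);
  if (!found) {
    throw new Error(`Unknown reference ${ref}`);
  }
  return found;
}

// A small made-up sign-in page: one heading, two text inputs, one button.
const page: AxNode = {
  role: "document",
  name: "Sign in",
  children: [
    { role: "heading", name: "Sign in" },
    { role: "textbox", name: "Email" },
    { role: "textbox", name: "Password" },
    { role: "button", name: "Continue" },
  ],
};

const refs = assignRefs(page);
```

In this sketch the heading gets no reference because it is not interactive, so an instruction such as "click @e3" would resolve to the Continue button.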
What You Can Do#
Navigate & Click#
Open URLs, click buttons and links, go back/forward in history
Fill Forms#
Type into text fields, select dropdowns, submit forms
Extract Data#
Run JavaScript to query the DOM and pull structured data from any page
Screenshots#
Capture full-page or targeted screenshots of specific elements or regions
Inspect Network#
See what API calls a page makes — debug failures or discover internal endpoints
Keyboard Input#
Send key presses and shortcuts (Enter, Escape, Cmd+K, etc.)
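As an illustration of the Extract Data capability, the following sketch shows the kind of DOM query the agent might run to pull structured data from a page. The function name `extractLinks` and the `ElementLike` and `RootLike` types are hypothetical. They are declared structurally so the example also runs outside a browser, and inside a page `document` could plausibly be passed as the root.

```typescript
// Minimal structural types so the sketch also runs outside a browser.
interface ElementLike {
  textContent: string | null;
  getAttribute(name: string): string | null;
}

interface RootLike {
  querySelectorAll(selector: string): ArrayLike<ElementLike>;
}

interface LinkRecord {
  text: string;
  href: string;
}

// Collect every anchor that has an href into plain, serializable records.
function extractLinks(root: RootLike): LinkRecord[] {
  const records: LinkRecord[] = [];
  for (const el of Array.from(root.querySelectorAll("a[href]"))) {
    const href = el.getAttribute("href");
    if (href === null) continue; // skip anything without a usable href
    records.push({ text: (el.textContent ?? "").trim(), href });
  }
  return records;
}

// Stand-in for a real page, used here only to exercise the function.
const fakeRoot: RootLike = {
  querySelectorAll: (_selector: string) => [
    { textContent: "  Pricing ", getAttribute: (n: string) => (n === "href" ? "/pricing" : null) },
    { textContent: null, getAttribute: (n: string) => (n === "href" ? "/docs" : null) },
  ],
};

const links = extractLinks(fakeRoot);
```

Returning plain objects rather than DOM nodes keeps the result easy to serialize and pass back into the conversation.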
Permissions#
Browser tools work in all permission modes, including Explore. The agent can browse, read, and extract data without switching to a higher permission level.
The agent reads a browser tools guide before its first browser interaction in each session. This ensures it uses the tools correctly and follows best practices. If you see a brief pause on the first browser action, that’s why.
Window Lifecycle#
The browser window persists across interactions within a session. When the agent is done with it, the agent can end its use of the window in one of three ways:
| Action | What happens | When to use |
|---|---|---|
| Close | Window is destroyed, all state lost | Task fully complete, browser not needed |
| Release | Agent overlay dismissed, window stays visible | Agent done, you may want to keep browsing |
| Hide | Window hidden but preserved in memory | Temporarily done, may need browser again later |
Closing the browser window via the OS close button hides it rather than destroying it — the agent can re-open it instantly.
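The lifecycle rules above can be summarized as a small state model. This is only a sketch: the `WindowState` fields and the `applyAction` function are hypothetical names chosen for illustration, and the model assumes that Hide leaves the overlay state unchanged, which the table does not specify.

```typescript
// Hypothetical action names; "osCloseButton" stands for the user clicking the OS close button.
type WindowAction = "close" | "release" | "hide" | "osCloseButton";

interface WindowState {
  exists: boolean; // false once the window has been destroyed
  visible: boolean;
  agentOverlay: boolean; // whether the agent overlay is showing
}

// Return the window state after an action, following the table above.
function applyAction(state: WindowState, action: WindowAction): WindowState {
  switch (action) {
    case "close":
      // Window is destroyed and all state is lost.
      return { exists: false, visible: false, agentOverlay: false };
    case "release":
      // Overlay is dismissed; the window stays as visible as it was.
      return { ...state, agentOverlay: false };
    case "hide":
    case "osCloseButton":
      // Hidden but preserved, so the agent can re-open it later.
      return { ...state, visible: false };
  }
}

// A window the agent is actively driving.
const active: WindowState = { exists: true, visible: true, agentOverlay: true };

const afterOsClose = applyAction(active, "osCloseButton");
const afterRelease = applyAction(active, "release");
const afterClose = applyAction(active, "close");
```

The point the model makes explicit is that only Close destroys the window. The OS close button behaves like Hide, so page state survives it.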