Screen control

9 tools for desktop automation: capture the screen, read it with OCR, click, type, and control windows directly. Requires a live display (not headless).

Capture & read

Tool | Parameters | Description
take_screenshot | output_name? | Capture the full screen and save it as a PNG artifact. Optionally specify a filename; defaults to a timestamped name. Returns the file path and base64 preview.
screen_read | region? | Run OCR on the current screen and return the extracted text. Optionally limit to a screen region (x, y, width, height). Useful for reading UI that can't be accessed via DOM.
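
A region that extends past the screen edge may give OCR nothing useful to read, so it can help to clip it first. The following is a minimal sketch, not part of the tool set: Region and clamp_region are hypothetical names, and the screen size is assumed to be known to the caller. Only the field order (x, y, width, height) comes from the description above; how the region is actually passed to screen_read is not specified here.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical container mirroring the (x, y, width, height) order."""
    x: int
    y: int
    width: int
    height: int


def clamp_region(region: Region, screen_width: int, screen_height: int) -> Region:
    """Clip a region so that it lies entirely within the screen bounds."""
    left = max(0, min(region.x, screen_width))
    top = max(0, min(region.y, screen_height))
    # The right/bottom edges can never end up before the left/top edges.
    right = max(left, min(region.x + region.width, screen_width))
    bottom = max(top, min(region.y + region.height, screen_height))
    return Region(left, top, right - left, bottom - top)


if __name__ == "__main__":
    # A region hanging off the bottom-right corner of a 1920x1080 screen.
    print(clamp_region(Region(1800, 1000, 400, 200), 1920, 1080))
```

A region that falls completely outside the screen comes back with zero width or height, which a caller can treat as "nothing to read".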

Mouse & keyboard

Tool | Parameters | Description
screen_click | x, y, button?, double? | Move the mouse to screen coordinates and click. Supports left/right/middle button and double-click.
screen_type | text | Type text at the current cursor position using simulated keystrokes. Works in any focused input field.
screen_key | key | Press a single key or modifier combination (e.g. Return, Escape, ctrl+c, cmd+shift+4).
key_sequence | keys | Execute a sequence of key presses in order. Useful for keyboard shortcuts that require multiple steps.
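
To illustrate the key notation used by screen_key, here is a small sketch that splits a string such as cmd+shift+4 into its modifiers and final key. It is only an illustration of the notation, not how the tool parses its input: parse_key_combo is a hypothetical helper, and the modifier set is an assumption (ctrl, cmd, and shift appear in the examples above; alt is a guess).

```python
from typing import Tuple

# Assumed modifier names; only ctrl, cmd and shift appear in the examples above.
MODIFIERS = {"ctrl", "cmd", "shift", "alt"}


def parse_key_combo(combo: str) -> Tuple[Tuple[str, ...], str]:
    """Split 'ctrl+c' style notation into (modifiers, key)."""
    parts = [part.strip() for part in combo.split("+")]
    if not all(parts):
        # Covers empty input and stray separators such as 'ctrl+'.
        raise ValueError(f"malformed key combination: {combo!r}")
    *modifiers, key = parts
    unknown = [m for m in modifiers if m.lower() not in MODIFIERS]
    if unknown:
        raise ValueError(f"unknown modifier(s): {unknown}")
    return tuple(m.lower() for m in modifiers), key


if __name__ == "__main__":
    for combo in ("Return", "ctrl+c", "cmd+shift+4"):
        print(combo, "->", parse_key_combo(combo))
```

Because "+" is the separator, this sketch cannot express the plus key itself; a real parser would need an escape or a named key for it.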

Window management

Tool | Parameters | Description
window_list | (none) | List all open windows with their title, app name, and window ID.
window_focus | window_id | Bring a window to the foreground by its window ID. Combine with window_list to focus any open app.
screen_find_window | title | Find a window by partial title match and return its ID, position, and size.
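
The lookup that precedes a window_focus call can be sketched as a partial title match over the entries returned by window_list. The Window record and find_window function below are hypothetical: the field names, the integer ID type, and the case-insensitive comparison are all assumptions, since the actual result format is not shown here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Window:
    """Hypothetical record holding the three fields window_list reports."""
    window_id: int
    title: str
    app_name: str


def find_window(windows: Iterable[Window], title_fragment: str) -> Optional[Window]:
    """Return the first window whose title contains the fragment, else None."""
    needle = title_fragment.casefold()  # case-insensitive match is an assumption
    for window in windows:
        if needle in window.title.casefold():
            return window
    return None


if __name__ == "__main__":
    open_windows = [
        Window(101, "notes.txt - Editor", "Editor"),
        Window(102, "Inbox - Mail", "Mail"),
    ]
    match = find_window(open_windows, "inbox")
    # The matched ID is what would then be passed to window_focus.
    print(match.window_id if match else "no match")
```

Returning the first match keeps the sketch short; when several windows share a title fragment, a caller may prefer to collect all matches and choose by app name.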

Screen tools require macOS 10.15+ (with Screen Recording permission), Linux with X11 or Wayland, or Windows 10+. They do not work in headless or SSH-only environments.
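
Before calling any screen tool, a script can make a rough check that a display is likely to be available. The sketch below is a heuristic under stated assumptions, not an official check: has_live_display is a hypothetical helper, it relies on the conventional DISPLAY and WAYLAND_DISPLAY environment variables on Linux, and on macOS and Windows it simply assumes a desktop session exists. It does not verify OS versions or the Screen Recording permission.

```python
import os
import platform
from typing import Mapping, Optional


def has_live_display(
    system: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> bool:
    """Heuristically report whether a graphical display is likely available."""
    system = system or platform.system()
    env = os.environ if env is None else env
    if system == "Linux":
        # X11 sessions set DISPLAY; Wayland sessions set WAYLAND_DISPLAY.
        return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    # Assumption: macOS ("Darwin") and Windows have a desktop session running.
    return system in ("Darwin", "Windows")


if __name__ == "__main__":
    if not has_live_display():
        print("No display detected; screen tools will likely fail here.")
```

The system and env parameters exist only so the check can be exercised without changing the real environment.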