# Scrape Any Website

Your employees can read the full content of any public webpage. Just give them a URL and they'll extract clean, readable text.

## TL;DR

Every employee can read the full content of any public webpage. Paste a URL and they extract clean, readable text with no ads or clutter. No setup required.

## How It Works

The employee fetches the page, strips boilerplate (headers, footers, ads, navigation), and returns clean text you can work with.
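
To make the "strip boilerplate" step concrete, here is a minimal sketch using only Python's standard library. It is not the product's actual extractor (which is built on Trafilatura and is far more sophisticated). The `SKIP_TAGS` set, the `TextExtractor` class, and the `extract_text` helper are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser

# Tags treated as boilerplate in this sketch (an assumption, not the real rule set).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the non-boilerplate text of an HTML document, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


sample = """
<html><body>
  <nav><a href="/">Home</a></nav>
  <article><h1>Pricing</h1><p>Pro plan: $20 per month.</p></article>
  <footer>Copyright Example Inc.</footer>
</body></html>
"""
print(extract_text(sample))
```

The navigation link and footer are dropped, and only the article heading and paragraph remain. Real pages rarely mark boilerplate this cleanly, which is why a dedicated extraction library is used in practice.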

## What It Can Do

| Capability | Example |
|-----------|---------|
| **Read articles** | "Read this blog post and summarize the key points" |
| **Extract data** | "Pull the pricing from this page" |
| **Research competitors** | "Read their features page and compare to ours" |
| **Process documentation** | "Read this API doc and write me a quick-start guide" |
| **Digest long content** | "Read this 5,000-word report and give me the top 3 takeaways" |

## How to Set It Up

**Nothing to do.** Web Scraper is enabled for every employee at hire.

To change how the tool is configured, ask the employee to manage their own tools, or do it manually:

1. Select the employee
2. Click the **Tools** tab
3. Find "Web Scraper" in the Actions section

| Action | What it does | How |
|--------|-------------|-----|
| **Enable / Disable** | Controls whether the employee can use this tool | Toggle the switch |
| **Tool Rules** | Custom instructions that guide how the employee uses this specific tool, e.g. "always extract tables as markdown" or "skip navigation content" | Expand the tool, then write your rules in the text field |
| **Delete** | Permanently removes the tool from the employee | Click the delete button |

## Tips & Tricks

- **Give the direct URL.** Avoid search results pages and redirect links
- **Ask for specific extraction.** "Read this page and pull out only the pricing table" works better than "read this"
- **Combine with search.** "Search for X, then read the top 3 results" lets the employee use both tools together
- **Long pages get truncated.** If you need content past the 30k character limit, ask the employee to focus on a specific section

## Behind the Scenes

| | |
|---|---|
| **Powered by** | Trafilatura |
| **How it works** | A 3-step extraction pipeline. Each step runs only if the previous one fails |

| Step | Method | Handles |
|------|--------|---------|
| **1. Trafilatura** | Lightweight Python fetch | ~90% of pages: articles, docs, blogs, news |
| **2. Browser User-Agent** | Fetch with browser headers | Sites that block bots but don't require JavaScript |
| **3. Browser Controller** | Real browser rendering | JavaScript-heavy SPAs (requires Desktop companion app) |

Most pages resolve on step 1 in under a second.
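
The "each step runs only if the previous one fails" behavior can be sketched as a simple fallback chain. This is an illustration of the pattern, not the product's code: `run_pipeline` and the three stand-in functions are hypothetical, and the stand-ins return canned values instead of touching the network.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Each step takes a URL and returns extracted text, or None if it failed.
Step = Callable[[str], Optional[str]]


def run_pipeline(
    url: str, steps: Sequence[Tuple[str, Step]]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each step in order; return (step name, text) from the first success."""
    for name, step in steps:
        try:
            text = step(url)
        except Exception:
            text = None  # treat an error the same as a failed extraction
        if text:
            return name, text
    return None, None


calls: List[str] = []  # records which stand-in steps actually ran


def plain_fetch(url: str) -> Optional[str]:
    calls.append("plain fetch")
    return None  # pretend the site rejected the default client


def browser_headers_fetch(url: str) -> Optional[str]:
    calls.append("browser headers")
    return "Article body"


def browser_render(url: str) -> Optional[str]:
    calls.append("browser render")
    return "Rendered body"


steps = [
    ("Trafilatura", plain_fetch),
    ("Browser User-Agent", browser_headers_fetch),
    ("Browser Controller", browser_render),
]
result = run_pipeline("https://example.com/post", steps)
print(result)  # ('Browser User-Agent', 'Article body')
print(calls)   # ['plain fetch', 'browser headers']
```

Because step 2 succeeds in this example, the expensive browser-rendering step never runs, which is the point of ordering the steps from cheapest to most capable.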

### Web Scraper vs Web Search vs Browser Controller

| | **Web Scraper** | **Web Search** | **Browser Controller** |
|---|---|---|---|
| **Purpose** | Read full content from a specific URL | Find information via search engine | Control a real browser. Navigate, click, fill forms |
| **Input** | A URL you already have | A question or topic | Instructions like "go to this site and..." |
| **Output** | Full page text, clean, no ads or nav | Titles, snippets, and source links | Screenshots, extracted data, completed actions |
| **Best for** | "Read this article for me" | "What's happening with X?" | "Log into this site and download the report" |
| **When to use** | You know the exact page | You don't know where to look | You need to interact with a page |
| **Requires** | Nothing, built in | Nothing, built in | Desktop companion app |

The employee often combines these automatically, searching first to find URLs, then reading the best results in full.

## What It Costs

| | |
|---|---|
| **Cost** | Runtime credits based on processing time, which is typically short |
| **Rate limits** | None from our side, but the target website may block rapid consecutive requests |
| **Truncation** | Pages longer than ~30,000 characters (~8,000 tokens) are truncated |
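
As a rough illustration of the truncation rule, the sketch below cuts text at a fixed character count and reports whether anything was dropped. The exact cut point and whether the real tool respects word or paragraph boundaries are not specified here, so `MAX_CHARS` and `truncate` are assumptions for illustration only.

```python
from typing import Tuple

MAX_CHARS = 30_000  # approximate cap quoted in the table above


def truncate(text: str, limit: int = MAX_CHARS) -> Tuple[str, bool]:
    """Return (text cut to `limit` characters, whether anything was dropped)."""
    if len(text) <= limit:
        return text, False
    return text[:limit], True


page = "x" * 45_000  # stand-in for a long extracted page
kept, was_truncated = truncate(page)
print(len(kept), was_truncated)  # 30000 True

# The two quoted figures imply roughly this many characters per token.
print(MAX_CHARS / 8_000)  # 3.75
```

In this example a 45,000-character page loses its final third, which is why asking for a specific section works better than asking for the whole page.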

## Is It Safe

- **Public pages only.** The scraper can only access publicly available content. No login-protected or paywalled content is accessible
- **Results in your chat.** Scraped content is summarized in the employee's response, which is saved in your conversation history like any other message
- **Respects robots.txt.** The scraper follows standard web crawling conventions
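
For readers curious what a robots.txt check looks like, here is a minimal sketch using Python's standard `urllib.robotparser` module. It shows the general convention, not the scraper's internal logic. The rules, the `ExampleScraper` user agent, and the `allowed` helper are made up for the example, and the rules are parsed from a string so nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; a real check would download it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url: str, user_agent: str = "ExampleScraper") -> bool:
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(allowed("https://example.com/blog/post"))       # True
print(allowed("https://example.com/private/report"))  # False
```

A page disallowed by the site's rules is simply not fetched, even though it may be publicly reachable in a browser.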

## Good to Know

- **Clean text only.** The scraper strips all HTML, scripts, styles, and navigation. You get readable text, not raw markup
- **No login-protected content.** The scraper can only access publicly available pages. For authenticated content, use Browser Controller with the Desktop companion app
- **JavaScript-heavy sites.** For single-page apps that need JavaScript to render, the scraper falls back to Browser Controller if the Desktop companion app is connected. Without it, the employee will let you know the page couldn't be read

## Frequently Asked Questions

**Q: Can the employee read PDFs from a URL?**
A: The scraper is optimized for HTML web pages. For PDFs, upload the file directly to the employee's chat. They can read uploaded documents natively.

**Q: What happens with pages behind a login?**
A: The scraper can only access public pages. For authenticated content, use Browser Controller. It uses your actual browser session, so any site you're logged into is accessible.

**Q: Does the employee cache scraped pages?**
A: No. Every scrape fetches the live page, so you always get current content.

**Q: Can I scrape multiple pages at once?**
A: Yes. Give the employee a list of URLs and they'll read each one. They may do this automatically when combining Web Search and Web Scraper.

**Q: Why is the content truncated?**
A: Pages are capped at ~30,000 characters (~8,000 tokens) to keep responses fast and costs predictable. If you need content beyond the cap, ask the employee to focus on a specific section, or split the request into several smaller ones.