Crawl API

Crawl Entire Websites

Multi-page extraction with real-time streaming. Results arrive as pages complete.

$0.001 per page
SSE Stream
data: {"type":"page","title":"Getting Started"}
data: {"type":"page","title":"Installation"}
data: {"type":"page","title":"API Reference"}
data: {"type":"usage","cost":0.003}
data: {"type":"done","time":"4.2s"}
Max Pages: 100 · Per Page: $0.001 · Failed: $0
How It Works

Depth-Based Crawling

Follow links from your seed URL to discover and extract content.

SEED URL: https://docs.example.com
Depth 1: /getting-started, /installation, /api-reference
Depth 2: /api/auth, /api/search, ...
depth = 1

Only pages directly linked from seed. Fast and focused.

depth = 2

Pages linked from the seed, plus the pages those link to. Good for documentation sites.

depth = 3+

Deeper crawling. Use with caution on large sites.
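Depth-based crawling amounts to a breadth-first traversal that stops following links once a page sits max_depth hops from the seed. A minimal sketch, assuming a hypothetical get_links(url) helper that returns the links found on a page:

```python
from collections import deque

def crawl_urls(seed, max_depth, get_links):
    """Return every URL reachable from `seed` within `max_depth` link hops."""
    seen = {seed}
    queue = deque([(seed, 0)])          # (url, depth from seed)
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue                    # don't follow links past the limit
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return seen
```

With depth = 1 this visits only pages the seed links to directly; depth = 2 adds the pages those link to, and so on.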

Real-Time SSE: results stream as pages complete
Depth Control: configure crawl depth 1-3+
Advanced Proxy: bypass bot detection
Real-Time

SSE Stream Frames

Results arrive via Server-Sent Events as pages complete. No waiting for the entire crawl.

page: content for each crawled page
usage: cost summary after crawling
done: completion with response time
Process As You Go

Don't wait for the crawl to complete. Save pages to your database or process them as each frame arrives.
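As a sketch, a per-frame dispatcher could look like this. Frame shapes follow the SDK stream frames; save_page is a hypothetical stand-in for your own storage call (database insert, file write, queue push):

```python
def handle_frame(frame, save_page):
    """Process one crawl stream frame as it arrives."""
    kind = frame.get("type")
    if kind == "page":
        page = frame["page"]
        if page.get("success"):
            save_page(page["title"], page["markdown"])   # persist right away
    elif kind == "usage":
        print(f"Crawl cost: ${frame['cost']}")
    elif kind == "done":
        print(f"Crawl finished in {frame['time']}")
```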

Quick Start

Start Crawling in Minutes

crawl.py
PYTHON
import asyncio

from llmlayer import LLMLayerClient

client = LLMLayerClient(api_key="...")

async def main():
    async for frame in client.crawl_stream(
        url="https://docs.example.com",
        max_pages=20,
        max_depth=2,
        main_content_only=True,
    ):
        if frame["type"] == "page":
            page = frame["page"]
            if page["success"]:
                print(f"✅ {page['title']}")
                # Save page["markdown"]
        elif frame["type"] == "usage":
            print(f"Cost: ${frame['cost']}")

asyncio.run(main())
crawl.ts
TYPESCRIPT
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

for await (const frame of client.crawlStream({
  url: 'https://docs.example.com',
  maxPages: 20,
  maxDepth: 2,
  mainContentOnly: true
})) {
  if (frame.type === 'page' && frame.page.success) {
    console.log(`✅ ${frame.page.title}`);
    // Save frame.page.markdown
  }
}
Terminal: cURL (note the -N flag, which disables buffering so frames stream as they arrive)
curl -N -X POST https://api.llmlayer.dev/api/v2/crawl_stream \
  -H "Authorization: Bearer $LLMLAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "max_pages": 20, "max_depth": 2}'
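Without an SDK you can consume the stream yourself. A minimal stdlib sketch, assuming the endpoint emits standard SSE `data: <json>` lines as shown above:

```python
import json
import urllib.request

def parse_sse_line(line):
    """Decode one SSE `data:` line into a dict; return None for other lines."""
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None

def stream_crawl(api_key, payload):
    """Yield crawl frames as they arrive from the streaming endpoint."""
    req = urllib.request.Request(
        "https://api.llmlayer.dev/api/v2/crawl_stream",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:   # connection stays open;
        for raw in resp:                        # lines arrive as pages complete
            frame = parse_sse_line(raw.decode())
            if frame is not None:
                yield frame
```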
Reference

API Parameters

POST /api/v2/crawl_stream

url* (string): starting URL to crawl from
max_pages (integer, default 25): maximum pages to crawl (1-100)
max_depth (integer, default 2): link depth from the seed URL
main_content_only (boolean, default false): remove nav, headers, footers, sidebars
advanced_proxy (boolean, default false): bypass bot detection (+$0.004/page)
include_subdomains (boolean, default false): crawl across subdomains (blog.*, docs.*)
timeout (number, default 60): total crawl timeout in seconds

* Required parameter
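Put together, a request body exercising every parameter might look like this (values other than url are illustrative, not recommendations):

```python
payload = {
    "url": "https://docs.example.com",   # required
    "max_pages": 50,                     # 1-100, default 25
    "max_depth": 2,                      # default 2
    "main_content_only": True,           # strip nav, headers, footers, sidebars
    "advanced_proxy": False,             # +$0.004/page when enabled
    "include_subdomains": False,         # also crawl blog.*, docs.*, etc.
    "timeout": 120,                      # total crawl timeout in seconds
}
```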

Pricing

Pay Only for Success

Standard Crawling

$0.001
per successfully crawled page
10 pages: $0.01
50 pages: $0.05
100 pages: $0.10

With Advanced Proxy

$0.005
per page ($0.001 + $0.004 proxy)
10 pages: $0.05
50 pages: $0.25
100 pages: $0.50
Failed Pages Are FREE

If a page fails to load (404, blocked, timeout), you're not charged. Only pay for pages that successfully return content.
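The billing rule above is simple enough to sketch: only successful pages are billed, at $0.001 each, plus $0.004 each when advanced_proxy is enabled.

```python
def crawl_cost(successful_pages, failed_pages=0, advanced_proxy=False):
    """Estimated crawl cost in dollars; failed pages are never billed."""
    per_page = 0.001 + (0.004 if advanced_proxy else 0.0)
    return round(successful_pages * per_page, 4)
```

For example, a 100-page crawl with 7 failures bills 93 pages: $0.093 standard, or $0.465 with the advanced proxy.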

Use Cases

Built For

Documentation Backup

Download entire documentation sites for offline access or archival.

AI Training Data

Build high-quality datasets from curated websites with main_content_only.

RAG Pipelines

Feed crawled markdown directly into vector databases for retrieval.

Site Migration

Extract all content before migrating platforms. Preserve everything.

$0.001 per page

Crawl Websites In Real-Time

Multi-page extraction with streaming results. Only pay for successful pages. Free credits to start.