Multi-page extraction with real-time streaming. Results arrive as pages complete.
Follow links from your seed URL to discover and extract content. The `max_depth` setting controls how far the crawl reaches:

- **Depth 1:** Only pages directly linked from the seed. Fast and focused.
- **Depth 2:** The seed plus pages those link to. Good for documentation sites.
- **Depth 3+:** Deeper crawling. Use with caution on large sites.
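Depth here counts link hops from the seed, so the crawl is conceptually a breadth-first walk bounded by `max_depth`. A toy sketch of that idea over an in-memory link graph (not the service's actual crawler):

```python
from collections import deque

def crawl_order(links, seed, max_depth):
    """Toy breadth-first walk over a link graph, bounded by max_depth.

    `links` maps a URL to the URLs it links to. Depth 0 is the seed,
    depth 1 its direct links, and so on.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth == max_depth:
            continue  # don't follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

links = {"seed": ["a", "b"], "a": ["c"], "c": ["d"]}
print(crawl_order(links, "seed", 2))
# [('seed', 0), ('a', 1), ('b', 1), ('c', 2)] — "d" is beyond depth 2
```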
Results arrive via Server-Sent Events as pages complete. No waiting for the entire crawl.
There's no need to wait for the crawl to complete: save pages to your database or process them as each frame arrives.
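If you want to consume the stream without an SDK, you can read the Server-Sent Events directly. A minimal stdlib-only sketch, assuming each SSE `data:` line carries one JSON-encoded frame in the shape the SDK examples below expose (check the API reference for the exact wire format):

```python
import json
import urllib.request

def parse_sse_line(line):
    """Return the decoded frame from an SSE `data:` line, else None."""
    if line.startswith("data:"):
        return json.loads(line[len("data:"):].strip())
    return None

def stream_crawl(url, api_key):
    """Yield frames from the crawl stream endpoint as raw SSE."""
    req = urllib.request.Request(
        "https://api.llmlayer.dev/api/v2/crawl_stream",
        data=json.dumps({"url": url, "max_pages": 20, "max_depth": 2}).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            frame = parse_sse_line(raw.decode("utf-8").rstrip("\n"))
            if frame is not None:
                yield frame
```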
```python
import asyncio

from llmlayer import LLMLayerClient

client = LLMLayerClient(api_key="...")

async def main():
    async for frame in client.crawl_stream(
        url="https://docs.example.com",
        max_pages=20,
        max_depth=2,
        main_content_only=True,
    ):
        if frame["type"] == "page":
            page = frame["page"]
            if page["success"]:
                print(f"✅ {page['title']}")
                # Save page["markdown"]
        elif frame["type"] == "usage":
            print(f"Cost: ${frame['cost']}")

asyncio.run(main())
```

```typescript
import { LLMLayerClient } from 'llmlayer';

const client = new LLMLayerClient({
  apiKey: process.env.LLMLAYER_API_KEY
});

for await (const frame of client.crawlStream({
  url: 'https://docs.example.com',
  maxPages: 20,
  maxDepth: 2,
  mainContentOnly: true
})) {
  if (frame.type === 'page' && frame.page.success) {
    console.log(`✅ ${frame.page.title}`);
    // Save frame.page.markdown
  }
}
```

```shell
curl -N -X POST https://api.llmlayer.dev/api/v2/crawl_stream \
  -H "Authorization: Bearer $LLMLAYER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com", "max_pages": 20, "max_depth": 2}'
```

`POST /api/v2/crawl_stream`

| Parameter | Type | Default | Description |
|---|---|---|---|
| `url`* | string | — | Starting URL to crawl from |
| `max_pages` | integer | 25 | Maximum pages to crawl (1–100) |
| `max_depth` | integer | 2 | Link depth from the seed URL |
| `main_content_only` | boolean | false | Remove nav, headers, footers, and sidebars |
| `advanced_proxy` | boolean | false | Bypass bot detection (+$0.004/page) |
| `include_subdomains` | boolean | false | Crawl across subdomains (blog.*, docs.*) |
| `timeout` | number | 60 | Total crawl timeout in seconds |

\* Required parameter
If a page fails to load (404, blocked, or timed out), you're not charged; you only pay for pages that successfully return content.
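Since every frame reports whether its page succeeded, you can tally what a crawl actually billed as you consume the stream. A small sketch over frames shaped like the SDK examples above:

```python
def summarize(frames):
    """Tally a crawl from its stream of frames.

    Only pages with success == True count as billable; the final
    usage frame carries the total cost.
    """
    ok, failed, cost = 0, 0, 0.0
    for frame in frames:
        if frame["type"] == "page":
            if frame["page"]["success"]:
                ok += 1
            else:
                failed += 1
        elif frame["type"] == "usage":
            cost = frame["cost"]
    return {"succeeded": ok, "failed": failed, "cost": cost}

frames = [
    {"type": "page", "page": {"success": True, "title": "Intro"}},
    {"type": "page", "page": {"success": False, "title": "404"}},
    {"type": "usage", "cost": 0.01},
]
print(summarize(frames))  # {'succeeded': 1, 'failed': 1, 'cost': 0.01}
```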
- **Offline documentation:** Download entire documentation sites for offline access or archival.
- **Training data:** Build high-quality datasets from curated websites with `main_content_only`.
- **RAG ingestion:** Feed crawled markdown directly into vector databases for retrieval.
- **Site migration:** Extract all content before migrating platforms. Preserve everything.
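For the RAG use case, crawled markdown usually needs to be split into chunks before embedding. A minimal paragraph-based sketch; real pipelines typically add overlap, token-based sizing, and metadata such as the source URL:

```python
def chunk_markdown(markdown, max_chars=800):
    """Split markdown into chunks of at most max_chars, breaking only
    at blank lines so paragraphs and headings stay intact."""
    chunks, current = [], ""
    for block in markdown.split("\n\n"):
        if current and len(current) + len(block) + 2 > max_chars:
            chunks.append(current)
            current = block
        else:
            current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be embedded and upserted into your vector store alongside the page's URL and title from the crawl frames.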
Multi-page extraction with streaming results. Only pay for successful pages. Free credits to start.