API Reference
Crawl API
Recursively crawl websites and collect content from multiple pages.
The Crawl API lets you recursively crawl websites, following links and collecting content from multiple pages. Control the scope with path filters, depth limits, and per-page scrape options.
Endpoint
POST https://api.octivas.com/api/v1/crawl
Request Parameters
Core Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string (URL) | Yes | — | The starting URL to crawl from |
| limit | integer | No | 10 | Maximum number of pages to crawl (1–100) |
| formats | string[] | No | ["markdown"] | Output formats for each page (see Output Formats) |
Crawl Control
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| include_paths | string[] | No | — | URL pathname regex patterns to include. Only URLs matching at least one pattern will be crawled. |
| exclude_paths | string[] | No | — | URL pathname regex patterns to exclude. URLs matching any pattern will be skipped. |
| max_depth | integer | No | — | Maximum crawl depth from the starting URL. 0 = only the starting page, 1 = starting page + directly linked pages, etc. |
| allow_external_links | boolean | No | false | Follow links to external domains |
| allow_subdomains | boolean | No | false | Follow links to subdomains of the starting URL |
| ignore_sitemap | boolean | No | false | Skip sitemap.xml when discovering URLs to crawl |
| ignore_query_parameters | boolean | No | false | Treat URLs that differ only in query parameters as the same page (deduplication) |
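The scope controls above can be combined in a single request body. As an illustrative sketch (the parameter names come from the table; the example URL and values are placeholders), a request that stays inside one docs section while following subdomains and deduplicating query-string variants might look like this:

```python
import json

# Build a crawl request body that restricts the crawl to one site section,
# follows subdomains (e.g. api.docs.example.com), and treats URLs that
# differ only in query parameters as a single page.
payload = {
    "url": "https://docs.example.com",
    "limit": 25,
    "include_paths": ["/docs/*"],          # only crawl URLs under /docs/
    "exclude_paths": ["/docs/legacy/*"],   # ...except the legacy section
    "allow_subdomains": True,
    "ignore_query_parameters": True,       # ?page=1 and ?page=2 count once
}
body = json.dumps(payload)
```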
Per-Page Scrape Options
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| only_main_content | boolean | No | — | Extract only the primary content of each page, excluding navigation, sidebars, footers, etc. |
| timeout | integer | No | 30000 | Per-page request timeout in milliseconds (1,000–120,000) |
| wait_for | integer | No | 0 | Wait time in milliseconds for JavaScript rendering before scraping each page (0–60,000) |
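Because timeout and wait_for have documented ranges, it can be useful to validate them client-side before sending a request. A minimal sketch, using the bounds from the table above (the helper itself is illustrative, not part of any SDK):

```python
def validate_scrape_options(timeout=30000, wait_for=0):
    """Check per-page scrape options against the documented ranges."""
    if not 1000 <= timeout <= 120000:
        raise ValueError("timeout must be between 1,000 and 120,000 ms")
    if not 0 <= wait_for <= 60000:
        raise ValueError("wait_for must be between 0 and 60,000 ms")
    return {"timeout": timeout, "wait_for": wait_for}
```

A wait_for of a few seconds is typically only needed for JavaScript-heavy pages; leaving it at 0 keeps crawls fast.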
Output Formats
The formats array accepts any combination of:
| Format | Description |
|---|---|
| markdown | Page content converted to Markdown |
| html | Cleaned HTML content |
| rawHtml | Original unprocessed HTML |
| screenshot | Screenshot of the page (returned as a URL) |
| links | List of URLs found on the page |
| json | Structured data extraction |
| images | List of image URLs found on the page |
| summary | AI-generated summary of the page content |
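Since an unrecognized format name would apply to every crawled page, a small client-side guard can catch typos early. A sketch, with the allowed set taken from the table above (the helper is illustrative, not an official SDK function):

```python
# Allowed values for the formats array, per the Output Formats table.
ALLOWED_FORMATS = {
    "markdown", "html", "rawHtml", "screenshot",
    "links", "json", "images", "summary",
}

def check_formats(formats):
    """Reject any format name not in the documented set."""
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unknown formats: {sorted(unknown)}")
    return list(formats)
```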
Example Request
```bash
curl -X POST https://api.octivas.com/api/v1/crawl \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 10,
    "formats": ["markdown"]
  }'
```

```javascript
import Octivas from 'octivas';

const client = new Octivas('your_api_key');

const result = await client.crawl({
  startUrl: 'https://docs.example.com',
  maxPages: 10
});

console.log(`Crawled ${result.pages_crawled} pages`);
result.pages.forEach(page => {
  console.log(page.url, page.markdown);
});
```

```python
import octivas

client = octivas.Client("your_api_key")

result = client.crawl(
    start_url="https://docs.example.com",
    max_pages=10
)

print(f"Crawled {result.pages_crawled} pages")
for page in result.pages:
    print(page.url, page.markdown)
```

Advanced Example
Use path filters and depth control to crawl only specific sections of a website:
```bash
curl -X POST https://api.octivas.com/api/v1/crawl \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 50,
    "formats": ["markdown", "links"],
    "include_paths": ["/docs/*", "/api/*"],
    "exclude_paths": ["/blog/*", "/changelog/*"],
    "max_depth": 3,
    "only_main_content": true,
    "ignore_sitemap": false
  }'
```

```javascript
const response = await fetch("https://api.octivas.com/api/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": "Bearer your_api_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    limit: 50,
    formats: ["markdown", "links"],
    include_paths: ["/docs/*", "/api/*"],
    exclude_paths: ["/blog/*", "/changelog/*"],
    max_depth: 3,
    only_main_content: true,
  }),
});

const result = await response.json();
console.log(`Crawled ${result.pages_crawled} pages`);
```

```python
import requests

response = requests.post(
    "https://api.octivas.com/api/v1/crawl",
    headers={
        "Authorization": "Bearer your_api_key",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://docs.example.com",
        "limit": 50,
        "formats": ["markdown", "links"],
        "include_paths": ["/docs/*", "/api/*"],
        "exclude_paths": ["/blog/*", "/changelog/*"],
        "max_depth": 3,
        "only_main_content": True,
    },
)

result = response.json()
print(f"Crawled {result['pages_crawled']} pages")
```

Response
```json
{
  "success": true,
  "url": "https://docs.example.com",
  "pages_crawled": 3,
  "credits_used": 3,
  "pages": [
    {
      "url": "https://docs.example.com/",
      "markdown": "# Documentation\n\nWelcome to our docs...",
      "html": "<h1>Documentation</h1><p>Welcome to our docs...</p>",
      "links": [
        "https://docs.example.com/getting-started",
        "https://docs.example.com/api-reference"
      ],
      "metadata": {
        "title": "Docs Home",
        "description": "Official documentation",
        "url": "https://docs.example.com/",
        "status_code": 200,
        "credits_used": 1
      }
    },
    {
      "url": "https://docs.example.com/getting-started",
      "markdown": "# Getting Started\n\nFollow these steps...",
      "metadata": {
        "title": "Getting Started",
        "url": "https://docs.example.com/getting-started"
      }
    }
  ]
}
```

Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the request succeeded |
| url | string | The starting URL |
| pages_crawled | integer | Number of pages successfully crawled |
| credits_used | integer | Total credits consumed |
| pages | array | Array of page content objects |
| pages[].url | string | URL of the crawled page |
| pages[].markdown | string \| null | Page content as Markdown (if requested) |
| pages[].html | string \| null | Page content as HTML (if requested) |
| pages[].raw_html | string \| null | Original unprocessed HTML (if requested) |
| pages[].screenshot | string \| null | Screenshot URL (if requested) |
| pages[].links | string[] \| null | URLs found on the page (if requested) |
| pages[].images | string[] \| null | Image URLs found on the page (if requested) |
| pages[].summary | string \| null | AI-generated page summary (if requested) |
| pages[].metadata | object | Page metadata |
| pages[].metadata.title | string | Page title |
| pages[].metadata.description | string | Page meta description |
| pages[].metadata.url | string | Final URL after redirects |
| pages[].metadata.language | string | Page language |
| pages[].metadata.status_code | integer | HTTP status code |
| pages[].metadata.credits_used | integer | Credits used for this page |
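Because per-page fields such as markdown, html, and links are only present when the corresponding format was requested, client code should treat them as optional. A minimal sketch of defensive handling, using sample data that mirrors the Response example above:

```python
# Sample data shaped like the Response example; the second page has no
# "links" field, as happens when a format was not requested for it.
result = {
    "success": True,
    "pages_crawled": 2,
    "pages": [
        {
            "url": "https://docs.example.com/",
            "markdown": "# Documentation",
            "links": ["https://docs.example.com/getting-started"],
        },
        {
            "url": "https://docs.example.com/getting-started",
            "markdown": "# Getting Started",
        },
    ],
}

# Use .get() so pages missing an optional field are skipped, not crashed on.
pages_with_links = [p["url"] for p in result["pages"] if p.get("links")]
```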