API Reference
Crawl API
Recursively crawl websites and collect content from multiple pages.
The Crawl API lets you recursively crawl websites, following links and collecting content from multiple pages. Control the scope with path filters, depth limits, and per-page scrape options.
Endpoint
POST https://api.octivas.com/api/v1/crawl
Request Parameters
Core Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string (URL) | Yes | — | The starting URL to crawl from |
| limit | integer | No | 10 | Maximum number of pages to crawl (1–100) |
| formats | string[] | No | ["markdown"] | Output formats for each page (see Output Formats) |
Crawl Control
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| include_paths | string[] | No | — | URL pathname regex patterns to include. Only URLs matching at least one pattern will be crawled. |
| exclude_paths | string[] | No | — | URL pathname regex patterns to exclude. URLs matching any pattern will be skipped. |
| max_depth | integer | No | — | Maximum crawl depth from the starting URL. 0 = only the starting page, 1 = starting page + directly linked pages, etc. |
| allow_external_links | boolean | No | false | Follow links to external domains |
| allow_subdomains | boolean | No | false | Follow links to subdomains of the starting URL |
| ignore_sitemap | boolean | No | false | Skip sitemap.xml when discovering URLs to crawl |
| ignore_query_parameters | boolean | No | false | Treat URLs that differ only in query parameters as the same page (deduplication) |
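The scope controls above can be combined in a single request body. As an illustrative sketch (the parameter names come from the table; the example URL and values are placeholders), a request that stays inside one docs section while following subdomains and deduplicating query-string variants might look like this:

```python
import json

# Build a crawl request body that restricts the crawl to one site section,
# follows subdomains (e.g. api.docs.example.com), and treats URLs that
# differ only in query parameters as a single page.
payload = {
    "url": "https://docs.example.com",
    "limit": 25,
    "include_paths": ["/docs/*"],          # only crawl URLs under /docs/
    "exclude_paths": ["/docs/legacy/*"],   # ...except the legacy section
    "allow_subdomains": True,
    "ignore_query_parameters": True,       # ?page=1 and ?page=2 count once
}
body = json.dumps(payload)
```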
Per-Page Scrape Options
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| only_main_content | boolean | No | — | Extract only the primary content of each page, excluding navigation, sidebars, footers, etc. |
| timeout | integer | No | 30000 | Per-page request timeout in milliseconds (1,000–120,000) |
| wait_for | integer | No | 0 | Wait time in milliseconds for JavaScript rendering before scraping each page (0–60,000) |
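Because timeout and wait_for have documented ranges, it can be useful to validate them client-side before sending a request. A minimal sketch, using the bounds from the table above (the helper itself is illustrative, not part of any SDK):

```python
def validate_scrape_options(timeout=30000, wait_for=0):
    """Check per-page scrape options against the documented ranges."""
    if not 1000 <= timeout <= 120000:
        raise ValueError("timeout must be between 1,000 and 120,000 ms")
    if not 0 <= wait_for <= 60000:
        raise ValueError("wait_for must be between 0 and 60,000 ms")
    return {"timeout": timeout, "wait_for": wait_for}
```

A wait_for of a few seconds is typically only needed for JavaScript-heavy pages; leaving it at 0 keeps crawls fast.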
Output Formats
The formats array accepts any combination of:
| Format | Description |
|---|---|
| markdown | Page content converted to Markdown |
| html | Cleaned HTML content |
| rawHtml | Original unprocessed HTML |
| screenshot | Screenshot of the page (returned as a URL) |
| links | List of URLs found on the page |
| json | Structured data extraction |
| images | List of image URLs found on the page |
| summary | AI-generated summary of the page content |
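Since an unrecognized format name would apply to every crawled page, a small client-side guard can catch typos early. A sketch, with the allowed set taken from the table above (the helper is illustrative, not an official SDK function):

```python
# Allowed values for the formats array, per the Output Formats table.
ALLOWED_FORMATS = {
    "markdown", "html", "rawHtml", "screenshot",
    "links", "json", "images", "summary",
}

def check_formats(formats):
    """Reject any format name not in the documented set."""
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unknown formats: {sorted(unknown)}")
    return list(formats)
```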
Example Request
```bash
curl -X POST https://api.octivas.com/api/v1/crawl \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 10,
    "formats": ["markdown"]
  }'
```

```javascript
import Octivas from 'octivas';

const client = new Octivas('your_api_key');

const result = await client.crawl({
  startUrl: 'https://docs.example.com',
  maxPages: 10
});

console.log(`Crawled ${result.pages_crawled} pages`);
result.pages.forEach(page => {
  console.log(page.url, page.markdown);
});
```

```python
import octivas

client = octivas.Client("your_api_key")

result = client.crawl(
    start_url="https://docs.example.com",
    max_pages=10
)

print(f"Crawled {result.pages_crawled} pages")
for page in result.pages:
    print(page.url, page.markdown)
```

Advanced Example
Use path filters and depth control to crawl only specific sections of a website:
```bash
curl -X POST https://api.octivas.com/api/v1/crawl \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 50,
    "formats": ["markdown", "links"],
    "include_paths": ["/docs/*", "/api/*"],
    "exclude_paths": ["/blog/*", "/changelog/*"],
    "max_depth": 3,
    "only_main_content": true,
    "ignore_sitemap": false
  }'
```

```javascript
const response = await fetch("https://api.octivas.com/api/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": "Bearer your_api_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    limit: 50,
    formats: ["markdown", "links"],
    include_paths: ["/docs/*", "/api/*"],
    exclude_paths: ["/blog/*", "/changelog/*"],
    max_depth: 3,
    only_main_content: true,
  }),
});

const result = await response.json();
console.log(`Crawled ${result.pages_crawled} pages`);
```

```python
import requests

response = requests.post(
    "https://api.octivas.com/api/v1/crawl",
    headers={
        "Authorization": "Bearer your_api_key",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://docs.example.com",
        "limit": 50,
        "formats": ["markdown", "links"],
        "include_paths": ["/docs/*", "/api/*"],
        "exclude_paths": ["/blog/*", "/changelog/*"],
        "max_depth": 3,
        "only_main_content": True,
    },
)

result = response.json()
print(f"Crawled {result['pages_crawled']} pages")
```

Response
```json
{
  "success": true,
  "url": "https://docs.example.com",
  "pages_crawled": 3,
  "credits_used": 3,
  "pages": [
    {
      "url": "https://docs.example.com/",
      "markdown": "# Documentation\n\nWelcome to our docs...",
      "html": "<h1>Documentation</h1><p>Welcome to our docs...</p>",
      "links": [
        "https://docs.example.com/getting-started",
        "https://docs.example.com/api-reference"
      ],
      "metadata": {
        "title": "Docs Home",
        "description": "Official documentation",
        "url": "https://docs.example.com/",
        "status_code": 200,
        "credits_used": 1
      }
    },
    {
      "url": "https://docs.example.com/getting-started",
      "markdown": "# Getting Started\n\nFollow these steps...",
      "metadata": {
        "title": "Getting Started",
        "url": "https://docs.example.com/getting-started"
      }
    }
  ]
}
```

Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the request succeeded |
| url | string | The starting URL |
| pages_crawled | integer | Number of pages successfully crawled |
| credits_used | integer | Total credits consumed |
| pages | array | Array of page content objects |
| pages[].url | string | URL of the crawled page |
| pages[].markdown | string \| null | Page content as Markdown (if requested) |
| pages[].html | string \| null | Page content as HTML (if requested) |
| pages[].raw_html | string \| null | Original unprocessed HTML (if requested) |
| pages[].screenshot | string \| null | Screenshot URL (if requested) |
| pages[].links | string[] \| null | URLs found on the page (if requested) |
| pages[].images | string[] \| null | Image URLs found on the page (if requested) |
| pages[].summary | string \| null | AI-generated page summary (if requested) |
| pages[].metadata | object | Page metadata |
| pages[].metadata.title | string | Page title |
| pages[].metadata.description | string | Page meta description |
| pages[].metadata.url | string | Final URL after redirects |
| pages[].metadata.language | string | Page language |
| pages[].metadata.status_code | integer | HTTP status code |
| pages[].metadata.credits_used | integer | Credits used for this page |
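Because per-page fields such as markdown, html, and links are only present when the corresponding format was requested, client code should treat them as optional. A minimal sketch of defensive handling, using sample data that mirrors the Response example above:

```python
# Sample data shaped like the Response example; the second page has no
# "links" field, as happens when a format was not requested for it.
result = {
    "success": True,
    "pages_crawled": 2,
    "pages": [
        {
            "url": "https://docs.example.com/",
            "markdown": "# Documentation",
            "links": ["https://docs.example.com/getting-started"],
        },
        {
            "url": "https://docs.example.com/getting-started",
            "markdown": "# Getting Started",
        },
    ],
}

# Use .get() so pages missing an optional field are skipped, not crashed on.
pages_with_links = [p["url"] for p in result["pages"] if p.get("links")]
```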