
Crawl API

Recursively crawl websites and collect content from multiple pages.

The Crawl API lets you recursively crawl websites, following links and collecting content from multiple pages. Control the scope with path filters, depth limits, and per-page scrape options.

Endpoint

POST https://api.octivas.com/api/v1/crawl

Request Parameters

Core Parameters

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | string (URL) | Yes | — | The starting URL to crawl from |
| `limit` | integer | No | `10` | Maximum number of pages to crawl (1–100) |
| `formats` | string[] | No | `["markdown"]` | Output formats for each page (see Output Formats) |

Crawl Control

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `include_paths` | string[] | No | — | URL pathname regex patterns to include. Only URLs matching at least one pattern will be crawled. |
| `exclude_paths` | string[] | No | — | URL pathname regex patterns to exclude. URLs matching any pattern will be skipped. |
| `max_depth` | integer | No | — | Maximum crawl depth from the starting URL. `0` = only the starting page, `1` = starting page + directly linked pages, etc. |
| `allow_external_links` | boolean | No | `false` | Follow links to external domains |
| `allow_subdomains` | boolean | No | `false` | Follow links to subdomains of the starting URL |
| `ignore_sitemap` | boolean | No | `false` | Skip sitemap.xml when discovering URLs to crawl |
| `ignore_query_parameters` | boolean | No | `false` | Treat URLs with different query parameters as the same page (deduplication) |
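
To make the path-filter behavior concrete, here is a minimal local sketch of include/exclude matching. The exact server-side semantics (whether patterns are anchored, and that exclusions take precedence over inclusions) are assumptions here, not documented guarantees; `should_crawl` is a hypothetical helper, not part of the API.

```python
import re

def should_crawl(pathname, include_paths=None, exclude_paths=None):
    """Return True if a URL pathname passes include/exclude regex filters.

    Assumed semantics: exclusions win, and with include patterns set,
    at least one must match.
    """
    # Skip the URL if any exclude pattern matches.
    if exclude_paths and any(re.search(p, pathname) for p in exclude_paths):
        return False
    # With include patterns set, at least one must match.
    if include_paths:
        return any(re.search(p, pathname) for p in include_paths)
    # No filters: crawl everything.
    return True

print(should_crawl("/docs/intro", include_paths=[r"^/docs/"]))   # True
print(should_crawl("/blog/post", include_paths=[r"^/docs/"]))    # False
print(should_crawl("/docs/old", include_paths=[r"^/docs/"],
                   exclude_paths=[r"/old"]))                     # False
```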

Per-Page Scrape Options

| Parameter | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `only_main_content` | boolean | No | — | Extract only the primary content of each page, excluding navigation, sidebars, footers, etc. |
| `timeout` | integer | No | `30000` | Per-page request timeout in milliseconds (1,000–120,000) |
| `wait_for` | integer | No | `0` | Wait time in milliseconds for JavaScript rendering before scraping each page (0–60,000) |
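
One way to stay inside the documented ranges is to clamp option values client-side before building a request. `clamp_scrape_options` is a hypothetical helper; the API itself may reject out-of-range values rather than clamp them.

```python
def clamp_scrape_options(timeout=30_000, wait_for=0):
    """Clamp per-page scrape options to the documented ranges."""
    return {
        "timeout": min(max(timeout, 1_000), 120_000),  # 1,000-120,000 ms
        "wait_for": min(max(wait_for, 0), 60_000),     # 0-60,000 ms
    }

print(clamp_scrape_options(timeout=500))       # {'timeout': 1000, 'wait_for': 0}
print(clamp_scrape_options(wait_for=90_000))   # {'timeout': 30000, 'wait_for': 60000}
```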

Output Formats

The formats array accepts any combination of:

| Format | Description |
| --- | --- |
| `markdown` | Page content converted to Markdown |
| `html` | Cleaned HTML content |
| `rawHtml` | Original unprocessed HTML |
| `screenshot` | Screenshot of the page (returned as a URL) |
| `links` | List of URLs found on the page |
| `json` | Structured data extraction |
| `images` | List of image URLs found on the page |
| `summary` | AI-generated summary of the page content |
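
Because an unsupported format name would only surface as an API error, it can be useful to validate the `formats` array locally before sending a request. This is a sketch using the format names from the table above; `validate_formats` is a hypothetical helper, not part of any Octivas SDK.

```python
# Format names accepted by the Crawl API, per the table above.
ALLOWED_FORMATS = {
    "markdown", "html", "rawHtml", "screenshot",
    "links", "json", "images", "summary",
}

def validate_formats(formats):
    """Raise ValueError if any requested format isn't a known format name."""
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    return list(formats)

print(validate_formats(["markdown", "links"]))  # ['markdown', 'links']
```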

Example Request

```bash
curl -X POST https://api.octivas.com/api/v1/crawl \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 10,
    "formats": ["markdown"]
  }'
```

```javascript
import Octivas from 'octivas';

const client = new Octivas('your_api_key');

const result = await client.crawl({
  startUrl: 'https://docs.example.com',
  maxPages: 10
});

console.log(`Crawled ${result.pages_crawled} pages`);
result.pages.forEach(page => {
  console.log(page.url, page.markdown);
});
```

```python
import octivas

client = octivas.Client("your_api_key")

result = client.crawl(
    start_url="https://docs.example.com",
    max_pages=10
)

print(f"Crawled {result.pages_crawled} pages")
for page in result.pages:
    print(page.url, page.markdown)
```

Advanced Example

Use path filters and depth control to crawl only specific sections of a website:

```bash
curl -X POST https://api.octivas.com/api/v1/crawl \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "limit": 50,
    "formats": ["markdown", "links"],
    "include_paths": ["/docs/*", "/api/*"],
    "exclude_paths": ["/blog/*", "/changelog/*"],
    "max_depth": 3,
    "only_main_content": true,
    "ignore_sitemap": false
  }'
```

```javascript
const response = await fetch("https://api.octivas.com/api/v1/crawl", {
  method: "POST",
  headers: {
    "Authorization": "Bearer your_api_key",
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    url: "https://docs.example.com",
    limit: 50,
    formats: ["markdown", "links"],
    include_paths: ["/docs/*", "/api/*"],
    exclude_paths: ["/blog/*", "/changelog/*"],
    max_depth: 3,
    only_main_content: true,
  }),
});

const result = await response.json();
console.log(`Crawled ${result.pages_crawled} pages`);
```

```python
import requests

response = requests.post(
    "https://api.octivas.com/api/v1/crawl",
    headers={
        "Authorization": "Bearer your_api_key",
        "Content-Type": "application/json",
    },
    json={
        "url": "https://docs.example.com",
        "limit": 50,
        "formats": ["markdown", "links"],
        "include_paths": ["/docs/*", "/api/*"],
        "exclude_paths": ["/blog/*", "/changelog/*"],
        "max_depth": 3,
        "only_main_content": True,
    },
)

result = response.json()
print(f"Crawled {result['pages_crawled']} pages")
```
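
The `max_depth: 3` setting above bounds how far the crawler follows links from the start URL: depth 0 is the starting page itself, depth 1 its direct links, and so on. A local breadth-first sketch of that counting (an illustration of the documented semantics, not the crawler's actual implementation):

```python
from collections import deque

def crawl_depth_order(start, links, max_depth):
    """Breadth-first traversal mirroring max_depth semantics:
    depth 0 is the starting page, depth 1 its direct links, etc."""
    seen = {start}
    queue = deque([(start, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # don't follow links past the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return order

site = {"/": ["/a", "/b"], "/a": ["/a/deep"]}
print(crawl_depth_order("/", site, max_depth=0))  # ['/']
print(crawl_depth_order("/", site, max_depth=1))  # ['/', '/a', '/b']
```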

Response

```json
{
  "success": true,
  "url": "https://docs.example.com",
  "pages_crawled": 3,
  "credits_used": 3,
  "pages": [
    {
      "url": "https://docs.example.com/",
      "markdown": "# Documentation\n\nWelcome to our docs...",
      "html": "<h1>Documentation</h1><p>Welcome to our docs...</p>",
      "links": [
        "https://docs.example.com/getting-started",
        "https://docs.example.com/api-reference"
      ],
      "metadata": {
        "title": "Docs Home",
        "description": "Official documentation",
        "url": "https://docs.example.com/",
        "status_code": 200,
        "credits_used": 1
      }
    },
    {
      "url": "https://docs.example.com/getting-started",
      "markdown": "# Getting Started\n\nFollow these steps...",
      "metadata": {
        "title": "Getting Started",
        "url": "https://docs.example.com/getting-started"
      }
    }
  ]
}
```

Response Fields

| Field | Type | Description |
| --- | --- | --- |
| `success` | boolean | Whether the request succeeded |
| `url` | string | The starting URL |
| `pages_crawled` | integer | Number of pages successfully crawled |
| `credits_used` | integer | Total credits consumed |
| `pages` | array | Array of page content objects |
| `pages[].url` | string | URL of the crawled page |
| `pages[].markdown` | string \| null | Page content as Markdown (if requested) |
| `pages[].html` | string \| null | Page content as HTML (if requested) |
| `pages[].raw_html` | string \| null | Original unprocessed HTML (if requested) |
| `pages[].screenshot` | string \| null | Screenshot URL (if requested) |
| `pages[].links` | string[] \| null | URLs found on the page (if requested) |
| `pages[].images` | string[] \| null | Image URLs found on the page (if requested) |
| `pages[].summary` | string \| null | AI-generated page summary (if requested) |
| `pages[].metadata` | object | Page metadata |
| `pages[].metadata.title` | string | Page title |
| `pages[].metadata.description` | string | Page meta description |
| `pages[].metadata.url` | string | Final URL after redirects |
| `pages[].metadata.language` | string | Page language |
| `pages[].metadata.status_code` | integer | HTTP status code |
| `pages[].metadata.credits_used` | integer | Credits used for this page |
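
When processing a response, note that not every page's `metadata` necessarily carries every field (the second page in the example response above omits `status_code` and `credits_used`), so defensive defaults are worthwhile. `summarize_crawl` is a hypothetical helper showing one way to aggregate per-page metadata; the fallback values it assumes for missing fields are a client-side choice, not API behavior.

```python
def summarize_crawl(result):
    """Aggregate per-page metadata from a Crawl API response dict."""
    pages = result.get("pages", [])
    # Sum per-page credits where reported; missing values count as 0.
    per_page_credits = sum(
        p.get("metadata", {}).get("credits_used", 0) for p in pages
    )
    # Treat a missing status_code as success (a client-side assumption).
    ok_pages = [
        p["url"] for p in pages
        if p.get("metadata", {}).get("status_code", 200) == 200
    ]
    return {
        "pages": len(pages),
        "per_page_credits": per_page_credits,
        "ok_pages": ok_pages,
    }
```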
