Scrape API
Extract content from a single webpage in multiple formats including Markdown, HTML, screenshots, and structured JSON.
The Scrape API allows you to extract content from any single webpage. Provide a URL and receive clean, structured content in your preferred format — including Markdown, HTML, raw HTML, screenshots, extracted links, images, summaries, and structured JSON via schema-based extraction.
Endpoint
```
POST https://api.octivas.com/api/v1/scrape
```

Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string (URL) | Yes | — | The URL to scrape content from |
| formats | string[] | No | ["markdown"] | Output formats: "markdown", "html", "rawHtml", "screenshot", "links", "json", "images", "summary" |
| schema | object | No | — | JSON Schema defining the structure for extraction (requires the "json" format) |
| prompt | string | No | — | Guidance prompt for structured extraction (requires the "json" format) |
| max_age | integer | No | 172800000 | Cache freshness window in milliseconds. Set to 0 to bypass the cache and fetch fresh content |
| store_in_cache | boolean | No | true | Whether to cache the scrape result for future requests |
| location | object | No | — | Geographic settings for the request. See Location Object |
| only_main_content | boolean | No | true | When true, extracts only the primary content area. Set to false to include navbars, footers, sidebars, etc. |
| timeout | integer | No | 30000 | Request timeout in milliseconds |
Output Formats
| Format | Returns | Description |
|---|---|---|
| markdown | string | Page content converted to clean Markdown |
| html | string | Cleaned HTML content |
| rawHtml | string | Original unprocessed HTML from the page |
| screenshot | string (URL) | URL to a full-page screenshot image stored on content.octivas.com |
| links | string[] | All hyperlinks found on the page |
| json | object | Structured data extracted using schema and/or prompt |
| images | string[] | All image URLs found on the page |
| summary | string | AI-generated summary of the page content |
Location Object
| Field | Type | Required | Description |
|---|---|---|---|
| country | string | Yes | ISO 3166-1 alpha-2 country code (e.g. "US", "DE", "JP") |
| languages | string[] | No | Preferred languages (e.g. ["en", "de"]). Defaults to ["en"] |
Example Request
```bash
curl -X POST https://api.octivas.com/api/v1/scrape \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown", "html", "links", "screenshot"]
  }'
```

```javascript
import Octivas from 'octivas';

const client = new Octivas('your_api_key');

const result = await client.scrape({
  url: 'https://example.com',
  formats: ['markdown', 'html', 'links', 'screenshot']
});

console.log(result.markdown);
console.log(result.links);
console.log(result.screenshot); // URL to screenshot image
```

```python
import octivas

client = octivas.Client("your_api_key")

result = client.scrape(
    url="https://example.com",
    formats=["markdown", "html", "links", "screenshot"]
)

print(result.markdown)
print(result.links)
print(result.screenshot)  # URL to screenshot image
```

Response

```json
{
  "success": true,
  "url": "https://example.com/",
  "markdown": "# Example\n\nThis is example content.",
  "html": "<h1>Example</h1><p>This is example content.</p>",
  "raw_html": null,
  "screenshot": "https://content.octivas.com/screenshots/abc123.png",
  "links": [
    "https://example.com/about",
    "https://example.com/contact"
  ],
  "json": null,
  "images": null,
  "summary": null,
  "metadata": {
    "title": "Example Domain",
    "description": "Example website",
    "url": "https://example.com/",
    "status_code": 200,
    "credits_used": 1
  }
}
```

Response Fields
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the request succeeded |
| url | string | The resolved URL that was scraped |
| markdown | string \| null | Page content as Markdown (if requested) |
| html | string \| null | Page content as cleaned HTML (if requested) |
| raw_html | string \| null | Original unprocessed HTML (if requested) |
| screenshot | string \| null | URL to the screenshot image (if requested) |
| links | string[] \| null | Hyperlinks found on the page (if requested) |
| json | object \| null | Structured extraction result (if requested with schema/prompt) |
| images | string[] \| null | Image URLs found on the page (if requested) |
| summary | string \| null | AI-generated summary (if requested) |
| metadata.title | string | Page title |
| metadata.description | string | Page meta description |
| metadata.url | string | Final URL after redirects |
| metadata.status_code | number | HTTP status code |
| metadata.credits_used | number | Credits consumed by this request |
Structured Extraction
Use the json format with a schema and optional prompt to extract structured data from any page.
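Because the result in json is shaped by the schema you supply, it can be useful to sanity-check it client-side before trusting it downstream. The sketch below is a minimal, stdlib-only check of "required" keys and "type" declarations; it is illustrative only and not part of the octivas client or the API's own validation:

```python
# Map JSON Schema type names to the Python types they correspond to.
TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}

def matches_schema(data, schema):
    """Return True if `data` has every required key and declared types match."""
    if not isinstance(data, dict):
        return False
    props = schema.get("properties", {})
    # Every key listed in "required" must be present.
    for key in schema.get("required", []):
        if key not in data:
            return False
    # Any key with a declared type must hold a value of that type.
    for key, value in data.items():
        declared = props.get(key, {}).get("type")
        if declared and not isinstance(value, TYPE_MAP[declared]):
            return False
    return True

product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price"],
}

print(matches_schema({"name": "Widget Pro", "price": 29.99}, product_schema))  # True
print(matches_schema({"name": "Widget Pro"}, product_schema))  # False: missing "price"
```

A full JSON Schema validator (such as the third-party jsonschema package) covers far more of the spec; this helper only catches missing fields and obvious type mismatches.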
```bash
curl -X POST https://api.octivas.com/api/v1/scrape \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/123",
    "formats": ["json"],
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "currency": { "type": "string" },
        "in_stock": { "type": "boolean" }
      },
      "required": ["name", "price"]
    },
    "prompt": "Extract the product details from this page."
  }'
```

```javascript
const result = await client.scrape({
  url: 'https://example.com/product/123',
  formats: ['json'],
  schema: {
    type: 'object',
    properties: {
      name: { type: 'string' },
      price: { type: 'number' },
      currency: { type: 'string' },
      in_stock: { type: 'boolean' }
    },
    required: ['name', 'price']
  },
  prompt: 'Extract the product details from this page.'
});

console.log(result.json);
// { name: "Widget Pro", price: 29.99, currency: "USD", in_stock: true }
```

```python
result = client.scrape(
    url="https://example.com/product/123",
    formats=["json"],
    schema={
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string"},
            "in_stock": {"type": "boolean"},
        },
        "required": ["name", "price"],
    },
    prompt="Extract the product details from this page.",
)

print(result.json)
# {"name": "Widget Pro", "price": 29.99, "currency": "USD", "in_stock": True}
```

Caching
By default, scrape results are cached for 2 days (172,800,000 ms). You can control caching behavior:
- Set max_age to 0 to always fetch fresh content, bypassing the cache.
- Set store_in_cache to false to fetch normally but skip storing the result in the cache.
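Since max_age is expressed in milliseconds, it is easy to mistype the value; a small conversion helper (a hypothetical name, not part of the client) makes the unit explicit:

```python
def days_to_max_age(days: float) -> int:
    """Convert a cache freshness window in days to the millisecond value max_age expects."""
    return int(days * 24 * 60 * 60 * 1000)

print(days_to_max_age(2))  # 172800000, the documented default (2 days)
```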
```json
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "max_age": 0,
  "store_in_cache": false
}
```

Geo / Locale
Use the location parameter to scrape pages as if from a specific country and language:
```json
{
  "url": "https://example.com",
  "formats": ["markdown"],
  "location": {
    "country": "DE",
    "languages": ["de", "en"]
  }
}
```

Batch Scrape
Scrape multiple URLs (up to 10) in a single request. Batch jobs run asynchronously — submit the job, then poll for results.
Submit a Batch Job
```
POST https://api.octivas.com/api/v1/batch/scrape
```

Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | string[] | Yes | — | URLs to scrape (1–10 max) |
| formats | string[] | No | ["markdown"] | Output formats (same options as single scrape) |
| schema | object | No | — | JSON Schema for structured extraction |
| prompt | string | No | — | Guidance prompt for extraction |
| max_age | integer | No | 172800000 | Cache freshness in ms |
| store_in_cache | boolean | No | true | Whether to cache results |
| location | object | No | — | Geographic settings |
| only_main_content | boolean | No | true | Extract primary content only |
| timeout | integer | No | 30000 | Per-URL timeout in ms |
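Because a batch accepts at most 10 URLs, longer lists have to be split into several jobs client-side. A minimal sketch of that split (chunk_urls is a hypothetical helper, not part of the client):

```python
def chunk_urls(urls, batch_size=10):
    """Split a list of URLs into batches no larger than the API's 10-URL limit."""
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]

urls = [f"https://example.com/page{i}" for i in range(25)]
batches = chunk_urls(urls)
print([len(b) for b in batches])  # [10, 10, 5]
```

Each batch would then be submitted as its own job, yielding one job_id per chunk to poll.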
Example
```bash
curl -X POST https://api.octivas.com/api/v1/batch/scrape \
  -H "Authorization: Bearer your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page1",
      "https://example.com/page2",
      "https://example.com/page3"
    ],
    "formats": ["markdown"]
  }'
```

```javascript
const job = await client.batchScrape({
  urls: [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
  ],
  formats: ['markdown']
});

console.log(job.job_id); // "507f1f77bcf86cd799439011"
```

```python
job = client.batch_scrape(
    urls=[
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3",
    ],
    formats=["markdown"],
)

print(job.job_id)  # "507f1f77bcf86cd799439011"
```

Response

```json
{
  "success": true,
  "job_id": "507f1f77bcf86cd799439011",
  "status": "processing",
  "total_urls": 3
}
```

Poll for Results
```
GET https://api.octivas.com/api/v1/batch/scrape/{job_id}
```

```bash
curl https://api.octivas.com/api/v1/batch/scrape/507f1f77bcf86cd799439011 \
  -H "Authorization: Bearer your_api_key"
```

```javascript
const status = await client.getBatchScrapeStatus('507f1f77bcf86cd799439011');

console.log(status.status);    // "completed"
console.log(status.completed); // 3

status.results.forEach(r => console.log(r.url, r.markdown));
```

```python
status = client.get_batch_scrape_status("507f1f77bcf86cd799439011")

print(status.status)     # "completed"
print(status.completed)  # 3

for r in status.results:
    print(r.url, r.markdown)
```

Response

```json
{
  "success": true,
  "job_id": "507f1f77bcf86cd799439011",
  "status": "completed",
  "completed": 3,
  "total": 3,
  "credits_used": 3,
  "results": [
    {
      "success": true,
      "url": "https://example.com/page1",
      "markdown": "# Page 1\n\nContent of page 1.",
      "metadata": {
        "title": "Page 1",
        "url": "https://example.com/page1",
        "status_code": 200,
        "credits_used": 1
      }
    },
    {
      "success": true,
      "url": "https://example.com/page2",
      "markdown": "# Page 2\n\nContent of page 2.",
      "metadata": {
        "title": "Page 2",
        "url": "https://example.com/page2",
        "status_code": 200,
        "credits_used": 1
      }
    }
  ]
}
```

Status Values
| Status | Description |
|---|---|
| processing | Job is still running. Poll again to check progress. |
| completed | All URLs have been scraped. Results are ready. |
| failed | The job encountered a fatal error. |
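A typical consumer polls until the job leaves the "processing" state. The sketch below is one way to structure that loop; it takes any status-fetching callable (e.g. get_batch_scrape_status from the examples above), so here it is demonstrated with a stub rather than a live API call. The poll_until_done name, interval, and timeout are illustrative choices, not part of the client:

```python
import time

def poll_until_done(get_status, job_id, interval=2.0, timeout=120.0):
    """Poll a batch job until its status is "completed" or "failed".

    `get_status` is any callable taking a job_id and returning an object
    with a `status` attribute.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(job_id)
        if status.status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError(f"batch job {job_id} still processing after {timeout}s")

# Demo with a stub that reports "processing" twice, then "completed".
class _StubClient:
    def __init__(self):
        self.calls = 0

    def get_batch_scrape_status(self, job_id):
        self.calls += 1
        class _Status:
            pass
        s = _Status()
        s.status = "processing" if self.calls < 3 else "completed"
        return s

stub = _StubClient()
result = poll_until_done(stub.get_batch_scrape_status,
                         "507f1f77bcf86cd799439011", interval=0.01)
print(result.status)  # "completed"
```

In production you would likely widen the interval (or back off exponentially) to avoid burning requests on long-running jobs.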