01 Create a job
POST /v1/jobs
{
"name": "nightly-catalog-2026-05-12",
"input": { "upload_id": "upl_8K2nB7" }, // or "urls": [...] inline
"defaults": {
"render": true,
"proxy_type": "residential",
"extract": { "name": "string", "price": "number" }
},
"concurrency": 32,
"output": { "format": "ndjson", "gzip": true },
"webhook": "https://your-app.com/hooks/jobs"
}

Response:

{
"id": "job_3F2D1A77B0E1",
"status": "queued",
"urls_total": 128400,
"eta_s": 2700,
"created_at": "2026-05-12T22:00:14Z"
02 Input formats
Three ways to deliver URLs:
- Inline. "urls": ["https://…", …], up to 1,000 URLs. Great for ad-hoc runs.
- Upload. POST /v1/uploads with an NDJSON or CSV file (1 GB max), then pass the returned upload_id.
- S3 / GCS. "input": { "s3_uri": "s3://bucket/path.ndjson", "role_arn": "…" }. We assume your role and stream the file.
Per-URL overrides are supported — supply each line as a JSON object with at minimum "url", and any subset of Scrape parameters to override defaults.
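For example, an NDJSON upload where the second line overrides the job defaults' proxy_type and extract schema:

{"url": "https://example.com/products/1"}
{"url": "https://example.com/products/2", "proxy_type": "datacenter", "extract": {"sku": "string"}}

And a sketch of wiring the upload into a job, assuming the multipart field name "file" and that the upload response carries the returned upload_id under "id" (both assumptions):

import requests

API_BASE = "https://api.example.com"             # placeholder, as above
HEADERS = {"Authorization": "Bearer sk_live_…"}  # placeholder credential; auth scheme assumed

with open("urls.ndjson", "rb") as f:
    up = requests.post(
        f"{API_BASE}/v1/uploads",
        headers=HEADERS,
        files={"file": ("urls.ndjson", f)},  # multipart field name assumed
        timeout=300,
    )
up.raise_for_status()
upload_id = up.json()["id"]  # response shape assumed

job = requests.post(
    f"{API_BASE}/v1/jobs",
    headers=HEADERS,
    json={"name": "catalog-from-upload", "input": {"upload_id": upload_id}},
    timeout=30,
).json()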
03 Poll status
{
"id": "job_3F2D1A77B0E1",
"status": "running", // queued | running | completed | cancelled | failed
"urls_total": 128400,
"urls_done": 48217,
"urls_errored": 312,
"credits_used": 293482,
"eta_s": 1820,
"download_url": null, // present once status=completed
"download_url_expires_at": null
}

Prefer the job.completed webhook over polling — it's more accurate, lower-latency, and free.
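If you do poll anyway, keep a fixed interval and stop on a terminal status. A sketch; the GET /v1/jobs/{id} route is inferred from the POST route above and is an assumption:

import time
import requests

API_BASE = "https://api.example.com"             # placeholder, as above
HEADERS = {"Authorization": "Bearer sk_live_…"}  # placeholder credential; auth scheme assumed

def wait_for_job(job_id: str, interval_s: float = 30.0) -> dict:
    """Poll until the job reaches a terminal state (GET route assumed)."""
    while True:
        r = requests.get(f"{API_BASE}/v1/jobs/{job_id}", headers=HEADERS, timeout=30)
        r.raise_for_status()
        job = r.json()
        if job["status"] in ("completed", "cancelled", "failed"):
            return job
        time.sleep(interval_s)

job = wait_for_job("job_3F2D1A77B0E1")
print(job["status"], job["download_url"])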
04 Output formats
- ndjson (default) — one JSON line per URL. Streaming-friendly.
- csv — flat CSV with the extracted fields as columns. Requires an extract schema in defaults.
- parquet — Apache Parquet, compressed (zstd by default). Same column rules as CSV.
The download URL is a signed S3 link valid for 24 hours by default (configurable up to 30 days). Failed URLs are included in the output with "status": "error".
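A sketch of consuming the default gzipped NDJSON output as a stream, splitting rows on the "status": "error" marker described above (the shape of successful rows beyond that is not specified here):

import gzip
import json
import requests

download_url = "https://…"  # the signed S3 link from the completed job object

with requests.get(download_url, stream=True, timeout=300) as r:
    r.raise_for_status()
    ok = errored = 0
    # the file itself is gzip-compressed, so gunzip the raw byte stream as we read
    with gzip.open(r.raw, "rt", encoding="utf-8") as lines:
        for line in lines:
            row = json.loads(line)
            if row.get("status") == "error":
                errored += 1  # failed URLs are included inline
            else:
                ok += 1
print(f"{ok} succeeded, {errored} errored")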
05 List · cancel · retry
Cancellation is graceful — in-flight URLs finish, queued ones are skipped, the partial output becomes available. Retry produces a new job containing only the URLs that errored in the original.
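This section doesn't spell out the routes, so the following sketch assumes conventional REST shapes; every path in it is an assumption:

import requests

API_BASE = "https://api.example.com"             # placeholder, as above
HEADERS = {"Authorization": "Bearer sk_live_…"}  # placeholder credential; auth scheme assumed

# List jobs (route assumed)
jobs = requests.get(f"{API_BASE}/v1/jobs", headers=HEADERS, timeout=30).json()

# Graceful cancel: in-flight URLs finish, queued ones are skipped (route assumed)
requests.post(f"{API_BASE}/v1/jobs/job_3F2D1A77B0E1/cancel", headers=HEADERS, timeout=30)

# Retry: yields a NEW job containing only the URLs that errored (route assumed)
retry = requests.post(f"{API_BASE}/v1/jobs/job_3F2D1A77B0E1/retry", headers=HEADERS, timeout=30).json()
print(retry["id"])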
06 Limits
- 1,000,000 URLs per job. Need more? Chain jobs from a webhook (see the sketch after this list).
- Maximum job lifetime: 24 hours. Jobs that run longer are auto-cancelled, and everything completed up to that point is retained.
- Concurrency per job: 256 (subject to plan concurrency cap).
- Maximum result file size: 50 GB (gzipped). Larger jobs are split into multi-part files.
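A sketch of chaining jobs past the per-job cap from the job.completed webhook, using Flask. The event payload shape ("type" plus a job object) and the pre-uploaded batch IDs are assumptions:

from flask import Flask, request
import requests

app = Flask(__name__)
API_BASE = "https://api.example.com"             # placeholder, as above
HEADERS = {"Authorization": "Bearer sk_live_…"}  # placeholder credential; auth scheme assumed

# hypothetical pre-uploaded chunks, each within the 1,000,000-URL cap
BATCHES = ["upl_part1", "upl_part2", "upl_part3"]
next_batch = 1  # batch 0 was submitted to start the chain

@app.post("/hooks/jobs")
def on_job_event():
    global next_batch
    event = request.get_json()
    # event shape ("type" plus a "data" job object) is an assumption
    if event.get("type") == "job.completed" and next_batch < len(BATCHES):
        requests.post(
            f"{API_BASE}/v1/jobs",
            headers=HEADERS,
            json={
                "name": f"catalog-part-{next_batch + 1}",
                "input": {"upload_id": BATCHES[next_batch]},
                "webhook": "https://your-app.com/hooks/jobs",
            },
            timeout=30,
        )
        next_batch += 1
    return "", 204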