01 Why an AI parser
Hand-written selectors are brittle. They break when the target adds a CSS class, swaps a div for a section, or A/B-tests a new layout. They require a custom parser per site and constant maintenance.
The Hypedata AI Parser is a structured-output model trained on hundreds of millions of web pages and their canonical extractions. You describe what you want, it figures out where to find it. When the target redesigns, the parser keeps working — no code change on your end.
If you're scraping a single highly-structured endpoint (e.g. a public JSON-LD product page) where the schema is already explicit, parse it yourself — it's cheaper and instantaneous. The AI Parser shines on messy, inconsistent, or evolving pages.
02 Endpoint
You can use the parser two ways: inline, as part of a `/v1/scrape` call via the `extract` parameter, or standalone, by POSTing HTML you already have to `/v1/parse`.
POST /v1/scrape

```json
{
  "url": "https://shop.example.com/p/alpha",
  "render": true,
  "extract": { "name": "string", "price": "number" }
}
```

POST /v1/parse

```json
{
  "html": "<html>…</html>",
  "url": "https://shop.example.com/p/alpha",  // optional, helps the model
  "schema": { "name": "string", "price": "number" }
}
```
03 Sketch schemas
Sketch schemas are a compact JSON dialect where each value is a one-line type-and-hint. The parser infers the rest. They're ideal for prototyping and 90% of production cases.
```json
{
  "title": "string · article headline",
  "author": "string · byline name only — no titles",
  "published_at": "string · ISO 8601 date, in UTC",
  "reading_time": "integer · minutes",
  "tags": "array of strings · all tag/category labels",
  "paywall": "boolean · true if any part of the body is gated"
}
```

Recognized type prefixes:

- `string`, `integer`, `number`, `boolean`
- `array of <type>`
- `object` — followed by nested keys (see Nested & arrays)
- `enum` — followed by a list of allowed values: `"enum · in_stock | out_of_stock | preorder"`
- `nullable <type>` — when missing fields should be `null` rather than failing
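Because a sketch value is just a string, a typo in the type prefix only surfaces at request time. A small client-side check, written against the prefix list above, can catch that earlier; the helper name and regex are illustrative, not part of the API.

```python
import re

# Type prefixes recognized by sketch schemas (from the list above)
_PREFIX = re.compile(
    r"^(nullable )?(string|integer|number|boolean|object|enum|array of \w+)"
)

def check_sketch(sketch: dict) -> list[str]:
    """Return the keys whose one-line type-and-hint has an unknown prefix.

    Nested dict values (objects, or the _type/_items array form) are
    skipped; only leaf strings are checked.
    """
    bad = []
    for key, value in sketch.items():
        if isinstance(value, dict):
            continue
        if not _PREFIX.match(value):
            bad.append(key)
    return bad

# A typo like "str" instead of "string" is caught before the request is sent:
check_sketch({"title": "string · headline", "pages": "str · count"})  # → ["pages"]
```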
04 JSON Schema
For strict pipelines, supply a full JSON Schema 2020-12 document. The parser validates the result against it and refuses to return non-conforming JSON.
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["name", "price", "currency"],
  "properties": {
    "name": { "type": "string", "minLength": 1 },
    "price": { "type": "number", "minimum": 0 },
    "currency": { "type": "string", "pattern": "^[A-Z]{3}$" },
    "in_stock": { "type": "boolean" },
    "images": { "type": "array", "items": { "type": "string", "format": "uri" } }
  }
}
```
We support `type`, `required`, `properties`, `items`, `enum`, `format` (`uri`/`email`/`date-time`/`uuid`), `pattern`, `minimum`/`maximum`, `minLength`/`maxLength`, `anyOf`, and `oneOf`. Unsupported keywords are silently ignored.
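Since unsupported keywords are dropped silently, a constraint you rely on (say, `exclusiveMinimum`) can vanish without an error. A sketch of a pre-flight check, using the supported-keyword list above; the function name and walk strategy are assumptions.

```python
# Keywords the parser honors, per the list above; everything else is ignored
SUPPORTED = {
    "$schema", "type", "required", "properties", "items", "enum", "format",
    "pattern", "minimum", "maximum", "minLength", "maxLength", "anyOf", "oneOf",
}

def unsupported_keywords(schema: dict) -> set[str]:
    """Walk a JSON Schema and collect keywords the parser would ignore.

    Recurses into `properties` values, `items`, and `anyOf`/`oneOf`
    branches; property *names* are not treated as keywords.
    """
    found = {k for k in schema if k not in SUPPORTED}
    for sub in schema.get("properties", {}).values():
        found |= unsupported_keywords(sub)
    if isinstance(schema.get("items"), dict):
        found |= unsupported_keywords(schema["items"])
    for branch in schema.get("anyOf", []) + schema.get("oneOf", []):
        found |= unsupported_keywords(branch)
    return found

# The product schema above uses only supported keywords:
unsupported_keywords({"type": "number", "exclusiveMinimum": 0})  # → {"exclusiveMinimum"}
```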
05 Nested & arrays
Nested objects and arrays of objects work the same way in both formats. Example: scraping a list of reviews from a product page.
```json
{
  "product_name": "string",
  "average_rating": "number · 0–5, one decimal",
  "reviews": {
    "_type": "array",
    "_items": {
      "author": "string",
      "rating": "integer · 1–5",
      "date": "string · ISO 8601",
      "verified": "boolean",
      "body": "string"
    }
  }
}
```

For pagination-aware extractions (collecting all reviews across paginated pages), combine the parser with the crawl loop pattern — the parser itself works on a single page at a time.
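The crawl loop can be sketched as follows. It assumes you add a `next_page` field to the schema so each parse returns the link to follow; `fetch_and_parse` stands in for whatever function issues the `/v1/scrape` call, and all names here are illustrative.

```python
def collect_reviews(fetch_and_parse, start_url: str, max_pages: int = 50) -> list:
    """Crawl-loop sketch: parse one page at a time and follow next_page links.

    `fetch_and_parse(url)` stands in for a /v1/scrape call whose schema
    includes a pagination field (an assumption), e.g.
      {"reviews": {...}, "next_page": "nullable string · URL of the next page"}
    Stops on a missing/None next_page, a repeated URL, or the page cap.
    """
    reviews, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        data = fetch_and_parse(url)
        reviews.extend(data.get("reviews", []))
        url = data.get("next_page")
    return reviews
```

The `seen` set and `max_pages` cap guard against the two common pagination failure modes: a "next" link that points back to an earlier page, and an effectively unbounded listing.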
06 Confidence & citations
Pass `"return_confidence": true` on the request and each leaf value in the response gains a sibling `{field}_confidence` in the 0..1 range, plus a `{field}_citation` pointing to the source span in the HTML.
```json
{
  "data": {
    "name": "Alpha Coat — Tobacco",
    "name_confidence": 0.98,
    "name_citation": { "selector": "h1.pdp-title", "offset": 0, "length": 20 },
    "price": 219,
    "price_confidence": 0.94
  }
}
```

Low-confidence values (< 0.6) are usually a sign the page doesn't actually contain the field, or that you need a more specific hint in the sketch. Confidence reporting adds 1 credit per request.
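Given the `{field}_confidence` sibling convention, flagging weak extractions is a one-liner; this helper is a client-side convenience, not part of the API.

```python
def low_confidence_fields(data: dict, threshold: float = 0.6) -> list[str]:
    """List field names whose *_confidence sibling falls below the threshold."""
    return [
        key[: -len("_confidence")]
        for key, value in data.items()
        if key.endswith("_confidence") and value < threshold
    ]

# With a response shaped like the example above:
low_confidence_fields({"name": "Alpha", "name_confidence": 0.98,
                       "price": 219, "price_confidence": 0.41})  # → ["price"]
```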
07 Plan caching
For a given (hostname, schema) pair, the parser compiles an extraction plan on the first request. Subsequent requests within 24 hours reuse the plan: same accuracy, ~10× lower latency, and roughly half the cost (3 credits instead of 5).
Plan caching is automatic and per-workspace. It's the main reason high-volume catalog scrapes get cheap fast: the first 100 product pages train the plan, the next 100,000 ride on it.
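The credit arithmetic above can be made concrete. A budgeting sketch, under the stated assumptions of one schema per hostname and all requests landing inside one 24-hour cache window; the +1 confidence surcharge comes from the previous section.

```python
def parse_credits(requests_per_host: dict[str, int],
                  with_confidence: bool = False) -> int:
    """Estimate credits for one 24h window: 5 for the first request per
    hostname (plan compilation), 3 for each cached follow-up, plus 1 per
    request if confidence reporting is enabled.
    """
    total = 0
    for count in requests_per_host.values():
        if count <= 0:
            continue
        total += 5 + 3 * (count - 1)  # first request full price, rest cached
        if with_confidence:
            total += count
    return total

parse_credits({"shop.example.com": 100})  # → 5 + 3*99 = 302
```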
08 Failure modes
| Code | Cause | Recovery |
|---|---|---|
| `schema_invalid` | Sketch can't be parsed or JSON Schema is malformed. | Fix the schema and re-request. |
| `validation_failed` | The model produced output but it doesn't pass your strict JSON Schema. | Loosen `required`, or read `partial_data` in the error body. |
| `page_unparseable` | HTML was empty, binary, or so heavily obfuscated the parser refused. | Verify `render: true` is set; check the `screenshot_url` to see what loaded. |
| `model_timeout` | The page was too large to process within budget. | Trim with an `extract_root` selector, or split into multiple passes. |
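The recovery column maps naturally onto a dispatch in client code. A sketch assuming the error body carries a `code` field and, for `validation_failed`, a `partial_data` field as described in the table; the action names are arbitrary labels for your own retry logic.

```python
def handle_parse_error(error: dict) -> tuple:
    """Map an error body to (action, partial_data), per the table above.

    Assumes the error body has a "code" field; "partial_data" is only
    present on validation_failed.
    """
    code = error.get("code")
    if code == "schema_invalid":
        return ("fix_schema", None)          # no retry will help until fixed
    if code == "validation_failed":
        # Output existed but failed strict validation: salvage what parsed
        return ("use_partial", error.get("partial_data"))
    if code == "page_unparseable":
        return ("retry_with_render", None)   # re-request with render: true
    if code == "model_timeout":
        return ("narrow_extract_root", None) # trim the page, or split passes
    return ("raise", None)                   # unknown code: surface it
```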