01 Why an AI parser
Hand-written selectors are brittle. They break when the target adds a CSS class, swaps a div for a section, or A/B-tests a new layout. They require a custom parser per site and constant maintenance.
The Hypedata AI Parser is a structured-output model trained on hundreds of millions of web pages and their canonical extractions. You describe what you want, it figures out where to find it. When the target redesigns, the parser keeps working — no code change on your end.
If you're scraping a single highly-structured endpoint (e.g. a public JSON-LD product page) where the schema is already explicit, parse it yourself — it's cheaper and instantaneous. The AI Parser shines on messy, inconsistent, or evolving pages.
02 Endpoint
You can use the parser two ways: inline, as part of a `/v1/scrape` call via the `extract` parameter, or standalone, by POSTing HTML you already have to `/v1/parse`.
POST /v1/scrape

```json
{
  "url": "https://shop.example.com/p/alpha",
  "render": true,
  "extract": { "name": "string", "price": "number" }
}
```

POST /v1/parse

```json
{
  "html": "<html>…</html>",
  "url": "https://shop.example.com/p/alpha",  // optional, helps the model
  "schema": { "name": "string", "price": "number" }
}
```
03 Sketch schemas
Sketch schemas are a compact JSON dialect where each value is a one-line type-and-hint. The parser infers the rest. They're ideal for prototyping and 90% of production cases.
```json
{
  "title": "string · article headline",
  "author": "string · byline name only — no titles",
  "published_at": "string · ISO 8601 date, in UTC",
  "reading_time": "integer · minutes",
  "tags": "array of strings · all tag/category labels",
  "paywall": "boolean · true if any part of the body is gated"
}
```

Recognized type prefixes:

- `string`, `integer`, `number`, `boolean`
- `array of <type>`
- `object` — followed by nested keys (see Nested & arrays)
- `enum` — followed by a list of allowed values: `"enum · in_stock | out_of_stock | preorder"`
- `nullable <type>` — when missing fields should be `null` rather than failing
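Because a sketch value is just a string, a typo in the type prefix only surfaces at request time. A small client-side check, written against the prefix list above, can catch that earlier; the helper name and regex are illustrative, not part of the API.

```python
import re

# Type prefixes recognized by sketch schemas (from the list above)
_PREFIX = re.compile(
    r"^(nullable )?(string|integer|number|boolean|object|enum|array of \w+)"
)

def check_sketch(sketch: dict) -> list[str]:
    """Return the keys whose one-line type-and-hint has an unknown prefix.

    Nested dict values (objects, or the _type/_items array form) are
    skipped; only leaf strings are checked.
    """
    bad = []
    for key, value in sketch.items():
        if isinstance(value, dict):
            continue
        if not _PREFIX.match(value):
            bad.append(key)
    return bad

# A typo like "str" instead of "string" is caught before the request is sent:
check_sketch({"title": "string · headline", "pages": "str · count"})  # → ["pages"]
```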
04 JSON Schema
For strict pipelines, supply a full JSON Schema 2020-12 document. The parser validates the result against it and refuses to return non-conforming JSON.
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "required": ["name", "price", "currency"],
  "properties": {
    "name": { "type": "string", "minLength": 1 },
    "price": { "type": "number", "minimum": 0 },
    "currency": { "type": "string", "pattern": "^[A-Z]{3}$" },
    "in_stock": { "type": "boolean" },
    "images": { "type": "array", "items": { "type": "string", "format": "uri" } }
  }
}
```
We support `type`, `required`, `properties`, `items`, `enum`, `format` (`uri`/`email`/`date-time`/`uuid`), `pattern`, `minimum`/`maximum`, `minLength`/`maxLength`, `anyOf`, and `oneOf`. Unsupported keywords are silently ignored.
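Since unsupported keywords are dropped silently, a constraint you rely on (say, `exclusiveMinimum`) can vanish without an error. A sketch of a pre-flight check, using the supported-keyword list above; the function name and walk strategy are assumptions.

```python
# Keywords the parser honors, per the list above; everything else is ignored
SUPPORTED = {
    "$schema", "type", "required", "properties", "items", "enum", "format",
    "pattern", "minimum", "maximum", "minLength", "maxLength", "anyOf", "oneOf",
}

def unsupported_keywords(schema: dict) -> set[str]:
    """Walk a JSON Schema and collect keywords the parser would ignore.

    Recurses into `properties` values, `items`, and `anyOf`/`oneOf`
    branches; property *names* are not treated as keywords.
    """
    found = {k for k in schema if k not in SUPPORTED}
    for sub in schema.get("properties", {}).values():
        found |= unsupported_keywords(sub)
    if isinstance(schema.get("items"), dict):
        found |= unsupported_keywords(schema["items"])
    for branch in schema.get("anyOf", []) + schema.get("oneOf", []):
        found |= unsupported_keywords(branch)
    return found

# The product schema above uses only supported keywords:
unsupported_keywords({"type": "number", "exclusiveMinimum": 0})  # → {"exclusiveMinimum"}
```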
05 Nested & arrays
Nested objects and arrays of objects work the same way in both formats. Example: scraping a list of reviews from a product page.
```json
{
  "product_name": "string",
  "average_rating": "number · 0–5, one decimal",
  "reviews": {
    "_type": "array",
    "_items": {
      "author": "string",
      "rating": "integer · 1–5",
      "date": "string · ISO 8601",
      "verified": "boolean",
      "body": "string"
    }
  }
}
```

For pagination-aware extractions (collecting all reviews across paginated pages), combine the parser with the crawl loop pattern — the parser itself works on a single page at a time.
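The crawl loop can be sketched as follows. It assumes you add a `next_page` field to the schema so each parse returns the link to follow; `fetch_and_parse` stands in for whatever function issues the `/v1/scrape` call, and all names here are illustrative.

```python
def collect_reviews(fetch_and_parse, start_url: str, max_pages: int = 50) -> list:
    """Crawl-loop sketch: parse one page at a time and follow next_page links.

    `fetch_and_parse(url)` stands in for a /v1/scrape call whose schema
    includes a pagination field (an assumption), e.g.
      {"reviews": {...}, "next_page": "nullable string · URL of the next page"}
    Stops on a missing/None next_page, a repeated URL, or the page cap.
    """
    reviews, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        data = fetch_and_parse(url)
        reviews.extend(data.get("reviews", []))
        url = data.get("next_page")
    return reviews
```

The `seen` set and `max_pages` cap guard against the two common pagination failure modes: a "next" link that points back to an earlier page, and an effectively unbounded listing.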
06 Confidence & citations
Pass `"return_confidence": true` on the request and each leaf value in the response gains a sibling `{field}_confidence` in the 0..1 range, plus a `{field}_citation` pointing to the source span in the HTML.
```json
{
  "data": {
    "name": "Alpha Coat — Tobacco",
    "name_confidence": 0.98,
    "name_citation": { "selector": "h1.pdp-title", "offset": 0, "length": 20 },
    "price": 219,
    "price_confidence": 0.94
  }
}
```

Low-confidence values (< 0.6) are usually a sign the page doesn't actually contain the field, or that you need a more specific hint in the sketch. Confidence reporting adds 1 credit per request.
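Given the `{field}_confidence` sibling convention, flagging weak extractions is a one-liner; this helper is a client-side convenience, not part of the API.

```python
def low_confidence_fields(data: dict, threshold: float = 0.6) -> list[str]:
    """List field names whose *_confidence sibling falls below the threshold."""
    return [
        key[: -len("_confidence")]
        for key, value in data.items()
        if key.endswith("_confidence") and value < threshold
    ]

# With a response shaped like the example above:
low_confidence_fields({"name": "Alpha", "name_confidence": 0.98,
                       "price": 219, "price_confidence": 0.41})  # → ["price"]
```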
07 Plan caching
For a given (hostname, schema) pair, the parser compiles an extraction plan on the first request. Subsequent requests within 24 hours reuse the plan: same accuracy, ~10× lower latency, and roughly half the cost (3 credits instead of 5).
Plan caching is automatic and per-workspace. It's the main reason high-volume catalog scrapes get cheap fast: the first 100 product pages train the plan, the next 100,000 ride on it.
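The credit arithmetic above can be made concrete. A budgeting sketch, under the stated assumptions of one schema per hostname and all requests landing inside one 24-hour cache window; the +1 confidence surcharge comes from the previous section.

```python
def parse_credits(requests_per_host: dict[str, int],
                  with_confidence: bool = False) -> int:
    """Estimate credits for one 24h window: 5 for the first request per
    hostname (plan compilation), 3 for each cached follow-up, plus 1 per
    request if confidence reporting is enabled.
    """
    total = 0
    for count in requests_per_host.values():
        if count <= 0:
            continue
        total += 5 + 3 * (count - 1)  # first request full price, rest cached
        if with_confidence:
            total += count
    return total

parse_credits({"shop.example.com": 100})  # → 5 + 3*99 = 302
```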
08 Failure modes
| Code | Cause | Recovery |
|---|---|---|
| `schema_invalid` | Sketch can't be parsed or JSON Schema is malformed. | Fix the schema and re-request. |
| `validation_failed` | The model produced output but it doesn't pass your strict JSON Schema. | Loosen `required`, or read `partial_data` in the error body. |
| `page_unparseable` | HTML was empty, binary, or so heavily obfuscated the parser refused. | Verify `render: true` is set; check the `screenshot_url` to see what loaded. |
| `model_timeout` | The page was too large to process within budget. | Trim with an `extract_root` selector, or split into multiple passes. |
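The recovery column maps naturally onto a dispatch in client code. A sketch assuming the error body carries a `code` field and, for `validation_failed`, a `partial_data` field as described in the table; the action names are arbitrary labels for your own retry logic.

```python
def handle_parse_error(error: dict) -> tuple:
    """Map an error body to (action, partial_data), per the table above.

    Assumes the error body has a "code" field; "partial_data" is only
    present on validation_failed.
    """
    code = error.get("code")
    if code == "schema_invalid":
        return ("fix_schema", None)          # no retry will help until fixed
    if code == "validation_failed":
        # Output existed but failed strict validation: salvage what parsed
        return ("use_partial", error.get("partial_data"))
    if code == "page_unparseable":
        return ("retry_with_render", None)   # re-request with render: true
    if code == "model_timeout":
        return ("narrow_extract_root", None) # trim the page, or split passes
    return ("raise", None)                   # unknown code: surface it
```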