Extract
Extract data from pages using AI
The Extract API allows you to get data in a structured format for any provided URLs with a single call.
Hyperbrowser exposes endpoints for starting an extract request and for getting its status and results. By default, extraction is asynchronous: you first start the job and then check its status until it is completed. With our SDKs, however, we provide a simple function that handles the whole flow and returns the data once the job is completed.
You can configure the extract request with the following parameters:
urls
- A required list of URLs to extract data from. To allow crawling for any URL in the list, append /* to the end of it (for example, https://hyperbrowser.ai/*). This crawls other pages on the site with the same origin and finds relevant pages to use as extraction context.
schema
- A strict JSON schema describing how the returned data should be structured. Providing one gives the best results. If it is omitted, we will try to generate one automatically from the prompt.
prompt
- A prompt describing how you want the data structured and any other guiding instructions for the extraction.
maxLinks
- The maximum number of links to look at for any given URL when performing a crawl (URLs with /* at the end). We automatically try to pick the links most relevant to the extraction from those we look at.
waitFor
- A delay in milliseconds to wait after the page loads before scraping it for extraction data. This is useful for letting dynamic content render fully, and for giving CAPTCHAs time to be detected if you have solveCaptchas set to true in the sessionOptions.
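As an illustration of how the parameters above fit together, here is a minimal sketch that assembles a request body as a plain dictionary. The helper name build_extract_request and its snake_case arguments are hypothetical; only the keys urls, schema, prompt, maxLinks, waitFor, and sessionOptions come from the list above.

```python
from typing import Any, Optional


def build_extract_request(
    urls: list[str],
    schema: Optional[dict[str, Any]] = None,
    prompt: Optional[str] = None,
    max_links: Optional[int] = None,
    wait_for_ms: Optional[int] = None,
    session_options: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble an extract request body (hypothetical helper, not part of any SDK)."""
    if not urls:
        raise ValueError("urls is required and must not be empty")

    body: dict[str, Any] = {"urls": list(urls)}
    # Optional parameters are only included when the caller sets them.
    if schema is not None:
        body["schema"] = schema
    if prompt is not None:
        body["prompt"] = prompt
    if max_links is not None:
        body["maxLinks"] = max_links
    if wait_for_ms is not None:
        body["waitFor"] = wait_for_ms
    if session_options is not None:
        body["sessionOptions"] = session_options
    return body


# Example: crawl the site (note the trailing /*), look at up to 5 links,
# and wait 2 seconds after each page load so CAPTCHAs can be detected.
request_body = build_extract_request(
    urls=["https://hyperbrowser.ai/*"],
    prompt="Extract the product name and its key features",
    max_links=5,
    wait_for_ms=2000,
    session_options={"solveCaptchas": True},
)
```

Keeping the body a plain dictionary makes it easy to serialize to JSON for the POST /extract endpoint.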
You can provide a schema, a prompt, or both. For best results, provide both. The schema should define exactly how you want the extracted data formatted, and the prompt should contain any information that can help guide the extraction. If no schema is provided, we will try to generate one automatically based on the prompt.
For the Node SDK, you can pass in either a Zod schema, for ease of use, or a plain JSON schema. For the Python SDK, you can pass in either a Pydantic model or a plain JSON schema.
Ensure that the root level of the schema is type: "object".
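To make the root-level requirement concrete, here is a small sketch with a schema whose root is an object, plus a check you could run before sending it. The schema fields (name, features) and the helper check_root_is_object are made up for illustration and are not part of the API.

```python
from typing import Any


def check_root_is_object(schema: dict[str, Any]) -> None:
    """Raise ValueError unless the schema's root type is "object"."""
    if schema.get("type") != "object":
        raise ValueError('the root of the schema must have type "object"')


# Hypothetical schema: the field names are illustrative only.
product_schema: dict[str, Any] = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "features": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name"],
}

# A list result is wrapped in an object property so the root stays an object.
products_schema: dict[str, Any] = {
    "type": "object",
    "properties": {"products": {"type": "array", "items": product_schema}},
    "required": ["products"],
}

check_root_is_object(product_schema)
check_root_is_object(products_schema)
```

If the data you want is naturally a list, wrapping it in a single object property, as in products_schema, keeps the root valid.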
The Start Extract Job POST /extract endpoint returns a jobId in the response, which can be used to get information about the job in subsequent requests.
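The sketch below shows how the two endpoints relate when calling the API without an SDK. It only builds the requests and does not send them. The base URL and authentication headers are placeholders supplied by the caller and are assumptions, not documented values; only the paths /extract and /extract/{jobId} and the jobId field come from the text above.

```python
import json
import urllib.request
from typing import Any


def make_start_request(
    base_url: str, body: dict[str, Any], headers: dict[str, str]
) -> urllib.request.Request:
    """Build (but do not send) the POST /extract request."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", **headers},
        method="POST",
    )


def make_status_request(
    base_url: str, job_id: str, headers: dict[str, str]
) -> urllib.request.Request:
    """Build (but do not send) the GET /extract/{jobId} request."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/extract/{job_id}",
        headers=dict(headers),
        method="GET",
    )


def job_id_from_start_response(raw: bytes) -> str:
    """Read the jobId field from the start response body."""
    return json.loads(raw)["jobId"]


# Placeholder base URL; substitute the real one from the API reference.
start = make_start_request(
    "https://api.example.com",
    {"urls": ["https://hyperbrowser.ai"], "prompt": "Summarize the page"},
    headers={},
)
```

Passing either request to urllib.request.urlopen would send it; the headers argument is where your API credentials would go.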
The Get Extract Job GET /extract/{jobId} endpoint returns the job's status and results. The status of an extract job can be one of pending, running, completed, or failed. The response may also include an optional error field with an error message if an error was encountered.
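The start-then-poll flow can be sketched as a loop over the statuses above. This is a minimal illustration, not the SDK's implementation: get_job is a hypothetical callable you supply that performs GET /extract/{jobId} and returns the parsed JSON, and the polling interval and timeout are arbitrary defaults.

```python
import time
from typing import Any, Callable


def wait_for_extract_job(
    job_id: str,
    get_job: Callable[[str], dict[str, Any]],
    poll_interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Poll until the job is completed or failed, then return or raise."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = get_job(job_id)
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            # The error field is optional, so fall back to a generic message.
            raise RuntimeError(job.get("error") or "extract job failed")
        if status not in ("pending", "running"):
            raise RuntimeError(f"unexpected job status: {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"extract job {job_id} did not finish in time")
        sleep(poll_interval_s)
```

Injecting get_job and sleep keeps the loop independent of any HTTP client and makes it easy to exercise with canned responses.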
sessionOptions
- Configuration for the session used to execute the extract job.
To see the full schema, check out the API reference.
You can also configure the session that will be used to execute the extract job, just as you would when creating a new session. This could include using a proxy or solving CAPTCHAs. To see all the available session parameters, check out the session documentation.
For a full reference on the extract endpoint, check out the API reference.