Scraping

Advanced Options for Hyperbrowser Scraping

Scraping a web page

By supplying just a URL, you can easily extract the contents of a page in markdown format with the /scrape endpoint.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Now, let's take an in-depth look at all the provided options for scraping.

Session Options

All Scraping APIs (scrape, crawl, extract) support session parameters. You can see the session parameters listed here.
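
For instance, a stealth session with CAPTCHA solving can be requested by passing sessionOptions alongside the URL. This is a minimal sketch, reusing the client from the example above and running inside an async function:

// Assumes `client` is the Hyperbrowser instance configured as in the example above
const result = await client.scrape.startAndWait({
  url: "https://example.com",
  sessionOptions: {
    useStealth: true,    // run the browser session in stealth mode
    solveCaptchas: true, // attempt to solve CAPTCHAs encountered on the page
  },
});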

Scrape Options

formats

  • Type: array

  • Items: string

  • Enum: ["html", "links", "markdown", "screenshot"]

  • Description: Choose the formats to include in the API response:

    • html - Returns the scraped content as HTML.

    • links - Includes a list of links found on the page.

    • markdown - Provides the content in Markdown format.

    • screenshot - Provides a screenshot of the page.

  • Default: ["markdown"]
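
For example, to request several formats in one response (a minimal sketch, reusing the client from the example above):

// Ask for markdown, the page's links, and a screenshot in a single scrape
const result = await client.scrape.startAndWait({
  url: "https://example.com",
  scrapeOptions: {
    formats: ["markdown", "links", "screenshot"],
  },
});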

includeTags

  • Type: array

  • Items: string

  • Description: Provide an array of HTML tags, classes, or IDs to include in the scraped content. Only elements matching these selectors will be returned.

  • Default: undefined

excludeTags

  • Type: array

  • Items: string

  • Description: Provide an array of HTML tags, classes, or IDs to exclude from the scraped content. Elements matching these selectors will be omitted from the response.

  • Default: undefined
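
Together, these two options act as an allowlist and a blocklist of selectors. A minimal sketch (the selector values are illustrative):

// Keep article content but drop navigation and ads from the response
const result = await client.scrape.startAndWait({
  url: "https://example.com",
  scrapeOptions: {
    includeTags: ["article", ".post-content"], // only return elements matching these selectors
    excludeTags: ["nav", ".ad-banner"],        // omit elements matching these selectors
  },
});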

onlyMainContent

  • Type: boolean

  • Description: When set to true (default), the API will attempt to return only the main content of the page, excluding common elements like headers, navigation menus, and footers. Set to false to return the full page content.

  • Default: true

waitFor

  • Type: number

  • Description: Specify a delay in milliseconds to wait after the page loads before initiating the scrape. This can be useful for allowing dynamic content to fully render. This is also useful for waiting to detect CAPTCHAs on the page if you have solveCaptchas set to true in the sessionOptions.

  • Default: 0

timeout

  • Type: number

  • Description: Specify the maximum time in milliseconds to wait for the page to load before timing out. This would be like doing:

await page.goto("https://example.com", { waitUntil: "load", timeout: 30000 });

  • Default: 30000 (30 seconds)

waitUntil

  • Type: string

  • Enum: ["load", "domcontentloaded", "networkidle"]

  • Description: Specify the condition to wait for the page to load:

    • domcontentloaded - Wait until the HTML is fully parsed and the DOM is ready

    • load - Wait until DOM and all resources are completely loaded

    • networkidle - Wait until no more network requests occur for a certain period of time

  • Default: load
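
These three options together control page-load timing. A minimal sketch combining them (the values are illustrative):

// Wait for network activity to settle, allow slow pages up to 60s to load,
// then pause 2 more seconds for dynamic content before scraping
const result = await client.scrape.startAndWait({
  url: "https://example.com",
  scrapeOptions: {
    waitUntil: "networkidle",
    timeout: 60000,
    waitFor: 2000,
  },
});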

screenshotOptions

  • Type: object

  • Properties:

    • fullPage - Take a screenshot of the full page, beyond the viewport

      • Type: boolean

      • Default: false

    • format - The image type of the screenshot

      • Type: string

      • Enum: ["webp", "jpeg", "png"]

      • Default: webp

  • Description: Configurations for the returned screenshot. Only applicable if screenshot is provided in the formats array.
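
For example, to capture a full-page PNG alongside the markdown content (a minimal sketch):

// screenshotOptions only takes effect because "screenshot" is in formats
const result = await client.scrape.startAndWait({
  url: "https://example.com",
  scrapeOptions: {
    formats: ["markdown", "screenshot"],
    screenshotOptions: {
      fullPage: true, // capture beyond the visible viewport
      format: "png",
    },
  },
});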

Example

By configuring these options when making a scrape request, you can control the format and content of the scraped data, as well as the behavior of the scraper itself.

For example, to scrape a page with the following:

  • In stealth mode

  • With CAPTCHA solving

  • Return only the main content as HTML

  • Exclude any <span> elements

  • Wait 2 seconds after the page loads and before scraping

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    sessionOptions: {
      useStealth: true,
      solveCaptchas: true,
    },
    scrapeOptions: {
      formats: ["html"],
      onlyMainContent: true,
      excludeTags: ["span"],
      waitFor: 2000,
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Crawl a Site

Instead of scraping just a single page, you might want to get all the content across multiple pages on a site. The /crawl endpoint is perfect for such a task. You can use the same sessionOptions and scrapeOptions as before with this endpoint as well. The crawl endpoint also has some extra parameters that let you tailor the crawl to your scraping needs.

Crawl Options

Limiting the Number of Pages to Crawl with maxPages

  • Type: integer

  • Minimum: 1

  • Description: The maximum number of pages to crawl before stopping.

Following Links with followLinks

  • Type: boolean

  • Default: true

  • Description: When set to true, the crawler will follow links found on the pages it visits, allowing it to discover new pages and expand the scope of the crawl. When set to false, the crawler will only visit the starting URL and any explicitly specified pages, without following any additional links.

Ignoring the Sitemap with ignoreSitemap

  • Type: boolean

  • Default: false

  • Description: When set to true, the crawler will not pre-generate a list of URLs from any sitemaps it finds. By default, the crawler tries to locate sitemaps starting at the base URL of the URL provided in the url param.
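
For example, to discover pages only by following links, without consulting the site's sitemaps (a minimal sketch, reusing the client from the earlier examples):

// Follow links found on visited pages, but skip sitemap-based URL discovery
const crawlResult = await client.crawl.startAndWait({
  url: "https://hyperbrowser.ai",
  maxPages: 10,
  followLinks: true,
  ignoreSitemap: true,
});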

Excluding Pages with excludePatterns

  • Type: array

  • Items: string

  • Description: An array of regular expressions or wildcard patterns specifying which URLs should be excluded from the crawl. Any page whose URL path matches one of these patterns will be skipped.

Including Pages with includePatterns

  • Type: array

  • Items: string

  • Description: An array of regular expressions or wildcard patterns specifying which URLs should be included in the crawl. Only pages whose URL path matches one of these patterns will be visited.
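
Both options are matched against the URL path. For example, to visit documentation pages while skipping their changelog section (a sketch with illustrative patterns):

// Only crawl pages under /docs, except anything under /docs/changelog
const crawlResult = await client.crawl.startAndWait({
  url: "https://example.com",
  maxPages: 20,
  includePatterns: ["/docs/*"],
  excludePatterns: ["/docs/changelog/*"],
});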

Example

By configuring these options when initiating a crawl, you can control the scope and behavior of the crawler to suit your specific needs.

For example, to crawl a site with the following:

  • Maximum of 5 pages

  • Only include /blog pages

  • Return only the main content as markdown

  • Exclude any <span> elements

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://hyperbrowser.ai",
    maxPages: 5,
    includePatterns: ["/blog/*"],
    scrapeOptions: {
      formats: ["markdown"],
      onlyMainContent: true,
      excludeTags: ["span"],
    },
  });
  console.log("Crawl result:", crawlResult);
};

main();

Structured Extraction

The Extract API allows you to fetch data in a well-defined structure from any webpage or website with just a few lines of code. You can provide a list of web pages, and Hyperbrowser will collate all the information together and extract the data that best fits the provided schema (or prompt). The same sessionOptions are available here as well.

Extract Options

Specifying all pages to collect data from with urls

  • Type: array

  • Items: string

  • Required: Yes

  • Description: List of URLs to extract data from. To crawl a site, add /* to a URL (e.g., https://example.com/*). This will crawl other pages on the site with the same origin and find relevant pages to use for the extraction context.

Specify the extraction schema

  • Type: object

  • Required: No

  • Description: JSON schema defining the structure of the data you want to extract. Gives the best results with clear data structure requirements.

  • Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.

  • Default: undefined

Specify the data to be extracted from a prompt

  • Type: string

  • Required: No

  • Description: A prompt describing how you want the data structured. Useful if you don't have a specific schema in mind.

  • Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.

  • Default: undefined

Further specify the extraction process with a systemPrompt

  • Type: string

  • Required: No

  • Description: Additional instructions for the extraction process to guide the AI's behavior.

  • Default: undefined

Specify the number of pages to collect information from with maxLinks

  • Type: number

  • Description: Maximum number of links to follow when crawling a site for any given URL with /* suffix.

  • Default: undefined

Time to wait on a page before extraction using waitFor

  • Type: number

  • Description: Time in milliseconds to wait after page load before extraction. This can be useful for allowing dynamic content to fully render or for waiting to detect CAPTCHAs if you have solveCaptchas set to true.

  • Default: 0

Set options for the session with sessionOptions

  • Type: object

  • Default: undefined

One of schema or prompt must be defined.

Example

By configuring these options when initiating a structured extraction, you can control the scope and behavior to suit your specific needs.

For example, to extract data from a set of pages with the following:

  • Maximum of 5 pages per URL

  • Include the /products page on example.com, and up to 5 relevant pages discovered on test.com

  • Return the extracted data in the specified schema

  • Wait 2 seconds after the page loads and before extracting

For more detail, check out the Extract page.

curl -X POST https://api.hyperbrowser.ai/api/extract \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
        "urls": ["https://example.com/products","https://www.test.com/*"],
        "prompt": "Extract the product information from this page",
        "schema": {
            "type": "object",
            "properties": {
                "productName": {
                    "type": "string"
                },
                "price": {
                    "type": "string"
                },
                "features": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    }
                }
            },
            "required": [
                "productName",
                "price",
                "features"
            ]
        },
        "maxLinks": 5,
        "waitFor": 2000,
        "sessionOptions": {
            "useStealth": true,
            "solveCaptchas": true,
            "adblock": true
        }
    }'
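
The same request can also be issued from the Node SDK. The sketch below assumes the SDK exposes an extract.startAndWait method analogous to the scrape.startAndWait and crawl.startAndWait calls used earlier:

// Assumes `client` is the Hyperbrowser instance from the earlier examples, and
// that the SDK exposes extract.startAndWait (an assumption; see the curl request above)
const extractResult = await client.extract.startAndWait({
  urls: ["https://example.com/products", "https://www.test.com/*"],
  prompt: "Extract the product information from this page",
  schema: {
    type: "object",
    properties: {
      productName: { type: "string" },
      price: { type: "string" },
      features: { type: "array", items: { type: "string" } },
    },
    required: ["productName", "price", "features"],
  },
  maxLinks: 5,
  waitFor: 2000,
  sessionOptions: {
    useStealth: true,
    solveCaptchas: true,
    adblock: true,
  },
});
console.log("Extract result:", extractResult);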
