Scraping

Advanced Options for Hyperbrowser Scraping

Scraping a web page

By supplying just a URL, you can easily extract the contents of a page in markdown format with the /scrape endpoint.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};

main();

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartScrapeJobParams

# Load environment variables from .env file
load_dotenv()

# Initialize Hyperbrowser client
client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))

# Start scraping and wait for completion
scrape_result = client.scrape.start_and_wait(
    StartScrapeJobParams(url="https://example.com")
)
print("Scrape result:", scrape_result)

Start Scrape Job

curl -X POST https://app.hyperbrowser.ai/api/scrape \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
        "url": "https://example.com"
    }'

Get Scrape Job Status

curl https://app.hyperbrowser.ai/api/scrape/{jobId}/status \
    -H 'x-api-key: <YOUR_API_KEY>'

Get Scrape Job Status and Data

curl https://app.hyperbrowser.ai/api/scrape/{jobId} \
    -H 'x-api-key: <YOUR_API_KEY>'

Now, let's take an in-depth look at all the options provided for scraping.

Session Options

All Scraping APIs (scrape, crawl, extract) support session parameters. You can see the session parameters listed here.

Scrape Options

formats

  • Type: array

  • Items: string

  • Enum: ["html", "links", "markdown", "screenshot"]

  • Description: Choose the formats to include in the API response:

    • html - Returns the scraped content as HTML.

    • links - Includes a list of links found on the page.

    • markdown - Provides the content in Markdown format.

    • screenshot - Provides a screenshot of the page.

  • Default: ["markdown"]
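
For instance, here is a minimal sketch, reusing the SDK setup from the example above, that requests HTML, the page's links, and a screenshot in a single call:

import { Hyperbrowser } from "@hyperbrowser/sdk";

// Assumes HYPERBROWSER_API_KEY is set in the environment
const client = new Hyperbrowser({ apiKey: process.env.HYPERBROWSER_API_KEY });

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      // Request multiple formats in one response
      formats: ["html", "links", "screenshot"],
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();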

includeTags

  • Type: array

  • Items: string

  • Description: Provide an array of HTML tags, classes, or IDs to include in the scraped content. Only elements matching these selectors will be returned.

  • Default: undefined

excludeTags

  • Type: array

  • Items: string

  • Description: Provide an array of HTML tags, classes, or IDs to exclude from the scraped content. Elements matching these selectors will be omitted from the response.

  • Default: undefined
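
For example, here is a minimal sketch combining both options; the selector values ("article", "#main", "nav", "footer") are purely illustrative:

import { Hyperbrowser } from "@hyperbrowser/sdk";

// Assumes HYPERBROWSER_API_KEY is set in the environment
const client = new Hyperbrowser({ apiKey: process.env.HYPERBROWSER_API_KEY });

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      // Keep only article bodies and the element with id "main"
      includeTags: ["article", "#main"],
      // Drop navigation and footer elements from the result
      excludeTags: ["nav", "footer"],
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();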

onlyMainContent

  • Type: boolean

  • Description: When set to true (default), the API will attempt to return only the main content of the page, excluding common elements like headers, navigation menus, and footers. Set to false to return the full page content.

  • Default: true

waitFor

  • Type: number

  • Description: Specify a delay in milliseconds to wait after the page loads before initiating the scrape. This can be useful for allowing dynamic content to fully render. This is also useful for waiting to detect CAPTCHAs on the page if you have solveCaptchas set to true in the sessionOptions.

  • Default: 0

timeout

  • Type: number

  • Description: Specify the maximum time in milliseconds to wait for the page to load before timing out. This would be like doing:

await page.goto("https://example.com", { waitUntil: "load", timeout: 30000 })
  • Default: 30000 (30 seconds)

waitUntil

  • Type: string

  • Enum: ["load", "domcontentloaded", "networkidle"]

  • Description: Specify the condition to wait for the page to load:

    • domcontentloaded - Wait until the HTML is fully parsed and the DOM is ready

    • load - Wait until DOM and all resources are completely loaded

    • networkidle - Wait until no more network requests occur for a certain period of time

  • Default: load
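
For pages that keep loading content over many network requests, you might pair waitUntil with a longer timeout; a minimal sketch using the documented options:

import { Hyperbrowser } from "@hyperbrowser/sdk";

// Assumes HYPERBROWSER_API_KEY is set in the environment
const client = new Hyperbrowser({ apiKey: process.env.HYPERBROWSER_API_KEY });

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      // Wait for network activity to settle before scraping
      waitUntil: "networkidle",
      // Allow up to 60 seconds for the page to load
      timeout: 60000,
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();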

screenshotOptions

  • Type: object

  • Properties:

    • fullPage - Take a screenshot of the full page, beyond the viewport

      • Type: boolean

      • Default: false

    • format - The image type of the screenshot

      • Type: string

      • Enum: ["webp", "jpeg", "png"]

      • Default: webp

  • Description: Configurations for the returned screenshot. Only applicable if screenshot is provided in the formats array.
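
For example, to capture a full-page PNG instead of the default viewport-sized WebP, a minimal sketch:

import { Hyperbrowser } from "@hyperbrowser/sdk";

// Assumes HYPERBROWSER_API_KEY is set in the environment
const client = new Hyperbrowser({ apiKey: process.env.HYPERBROWSER_API_KEY });

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      // screenshotOptions only applies when "screenshot" is in formats
      formats: ["screenshot"],
      screenshotOptions: {
        fullPage: true, // capture beyond the viewport
        format: "png",
      },
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();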

Example

By configuring these options when making a scrape request, you can control the format and content of the scraped data, as well as the behavior of the scraper itself.

For example, to scrape a page with the following:

  • In stealth mode

  • With CAPTCHA solving

  • Return only the main content as HTML

  • Exclude any <span> elements

  • Wait 2 seconds after the page loads and before scraping

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    sessionOptions: {
      useStealth: true,
      solveCaptchas: true,
    },
    scrapeOptions: {
      formats: ["html"],
      onlyMainContent: true,
      excludeTags: ["span"],
      waitFor: 2000,
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();
import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartScrapeJobParams, CreateSessionParams, ScrapeOptions


load_dotenv()


client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


scrape_result = client.scrape.start_and_wait(
    StartScrapeJobParams(
        url="https://example.com",
        session_options=CreateSessionParams(use_stealth=True, solve_captchas=True),
        scrape_options=ScrapeOptions(
            formats=["html"],
            only_main_content=True,
            exclude_tags=["span"],
            wait_for=2000,
        ),
    )
)

print("Scrape result:", scrape_result.model_dump_json(indent=2))
curl -X POST https://app.hyperbrowser.ai/api/scrape \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
            "url": "https://example.com",
            "sessionOptions": {
                    "useStealth": true,
                    "solveCaptchas": true
            },
            "scrapeOptions": {
                    "formats": ["html"],
                    "onlyMainContent": true, 
                    "excludeTags": ["span"],
                    "waitFor": 2000
            }
    }'

Crawl a Site

Instead of just scraping a single page, you might want to get all the content across multiple pages on a site. The /crawl endpoint is perfect for such a task. You can use the same sessionOptions and scrapeOptions as before for this endpoint as well. The crawl endpoint does have some extra parameters that are used to tailor the crawl to your scraping needs.

Crawl Options

Limiting the Number of Pages to Crawl with maxPages

  • Type: integer

  • Minimum: 1

  • Description: The maximum number of pages to crawl before stopping.

Following Links with followLinks

  • Type: boolean

  • Default: true

  • Description: When set to true, the crawler will follow links found on the pages it visits, allowing it to discover new pages and expand the scope of the crawl. When set to false, the crawler will only visit the starting URL and any explicitly specified pages, without following any additional links.

Ignoring the Sitemap with ignoreSitemap

  • Type: boolean

  • Default: false

  • Description: When set to true, the crawler will not pre-generate a list of URLs from any sitemaps it finds. By default, the crawler tries to locate sitemaps starting at the base URL of the URL provided in the url param.

Excluding Pages with excludePatterns

  • Type: array

  • Items: string

  • Description: An array of regular expressions or wildcard patterns specifying which URLs should be excluded from the crawl. Any page whose URL path matches one of these patterns will be skipped.
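
For instance, a minimal sketch that crawls a site while skipping documentation and archive pages; the URL and patterns here are illustrative:

import { Hyperbrowser } from "@hyperbrowser/sdk";

// Assumes HYPERBROWSER_API_KEY is set in the environment
const client = new Hyperbrowser({ apiKey: process.env.HYPERBROWSER_API_KEY });

const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://example.com",
    maxPages: 10,
    // Skip any page whose URL path matches these patterns
    excludePatterns: ["/docs/*", "/archive/*"],
  });
  console.log("Crawl result:", crawlResult);
};

main();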

Including Pages with includePatterns

  • Type: array

  • Items: string

  • Description: An array of regular expressions or wildcard patterns specifying which URLs should be included in the crawl. Only pages whose URL paths match one of these patterns will be visited.

Example

By configuring these options when initiating a crawl, you can control the scope and behavior of the crawler to suit your specific needs.

For example, to crawl a site with the following:

  • Maximum of 5 pages

  • Only include /blog pages

  • Return only the main content as markdown

  • Exclude any <span> elements

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const crawlResult = await client.crawl.startAndWait({
    url: "https://hyperbrowser.ai",
    maxPages: 5,
    includePatterns: ["/blog/*"],
    scrapeOptions: {
      formats: ["markdown"],
      onlyMainContent: true,
      excludeTags: ["span"],
    },
  });
  console.log("Crawl result:", crawlResult);
};

main();
import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartCrawlJobParams, ScrapeOptions


load_dotenv()


client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


crawl_result = client.crawl.start_and_wait(
    StartCrawlJobParams(
        url="https://hyperbrowser.ai",
        max_pages=5,
        include_patterns=["/blog/*"],
        scrape_options=ScrapeOptions(
            formats=["markdown"],
            only_main_content=True,
            exclude_tags=["span"],
        ),
    )
)

print("Crawl result:", crawl_result.model_dump_json(indent=2))
curl -X POST https://app.hyperbrowser.ai/api/crawl \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
            "url": "https://hyperbrowser.ai",
            "maxPages": 5,
            "includePatterns": ["/blog/*"],
            "scrapeOptions": {
                    "formats": ["markdown"],
                    "onlyMainContent": true, 
                    "excludeTags": ["span"]
            }
    }'

Structured Extraction

The Extract API allows you to fetch data in a well-defined structure from any webpage or website with just a few lines of code. You can provide a list of web pages, and Hyperbrowser will collate all the information together and extract the information that best fits the provided schema (or prompt). You have access to the same sessionOptions available here as well. For more detail, check out the Extract page.

Extract Options

Specifying all pages to collect data from with urls

  • Type: array

  • Items: string

  • Required: Yes

  • Description: List of URLs to extract data from. To crawl a site, add /* to a URL (e.g., https://example.com/*). This will crawl other pages on the site with the same origin and find relevant pages to use for the extraction context.

Specify the extraction schema

  • Type: object

  • Required: No

  • Description: JSON schema defining the structure of the data you want to extract. Gives the best results with clear data structure requirements.

  • Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.

  • Default: undefined

Specify the data to be extracted with a prompt

  • Type: string

  • Required: No

  • Description: A prompt describing how you want the data structured. Useful if you don't have a specific schema in mind.

  • Note: You must provide either a schema or a prompt. If both are provided, the schema takes precedence.

  • Default: undefined

Further specify the extraction process with a systemPrompt

  • Type: string

  • Required: No

  • Description: Additional instructions for the extraction process to guide the AI's behavior.

  • Default: undefined

Specify the number of pages to collect information from with maxLinks

  • Type: number

  • Description: Maximum number of links to follow when crawling a site for any given URL with the /* suffix.

  • Default: undefined

Time to wait on a page before extraction using waitFor

  • Type: number

  • Description: Time in milliseconds to wait after page load before extraction. This can be useful for allowing dynamic content to fully render or for waiting to detect CAPTCHAs if you have solveCaptchas set to true.

  • Default: 0

Set options for the session with sessionOptions

  • Type: object

  • Default: undefined

One of schema or prompt must be defined.

Example

By configuring these options when initiating a structured extraction, you can control the scope and behavior to suit your specific needs.

For example, to run an extraction with the following:

  • Maximum of 5 pages per URL

  • Include /products on example.com, and as many pages as possible on test.com (up to 5)

  • Return the extracted data in the specified schema

  • Wait 2 seconds after the page loads and before extracting

curl -X POST https://app.hyperbrowser.ai/api/extract \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
        "urls": ["https://example.com/products","https://www.test.com/*"],
        "prompt": "Extract the product information from this page",
        "schema": {
            "type": "object",
            "properties": {
                "productName": {
                    "type": "string"
                },
                "price": {
                    "type": "string"
                },
                "features": {
                    "type": "array",
                    "items": {
                        "type": "string"
                    }
                }
            },
            "required": [
                "productName",
                "price",
                "features"
            ]
        },
        "maxLinks": 5,
        "waitFor": 2000,
        "sessionOptions": {
            "useStealth": true,
            "solveCaptchas": true,
            "adblock": true
        }
    }'
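
The same request can also be made from the SDKs. The sketch below is a Node version that assumes client.extract.startAndWait accepts the same fields as the REST payload above, mirroring the scrape and crawl methods shown earlier:

import { Hyperbrowser } from "@hyperbrowser/sdk";

// Assumes HYPERBROWSER_API_KEY is set in the environment
const client = new Hyperbrowser({ apiKey: process.env.HYPERBROWSER_API_KEY });

const main = async () => {
  // Assumes extract.startAndWait mirrors the REST payload shown above
  const extractResult = await client.extract.startAndWait({
    urls: ["https://example.com/products", "https://www.test.com/*"],
    prompt: "Extract the product information from this page",
    schema: {
      type: "object",
      properties: {
        productName: { type: "string" },
        price: { type: "string" },
        features: { type: "array", items: { type: "string" } },
      },
      required: ["productName", "price", "features"],
    },
    maxLinks: 5,
    waitFor: 2000,
    sessionOptions: { useStealth: true, solveCaptchas: true, adblock: true },
  });
  console.log("Extract result:", extractResult);
};

main();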