Scrape

Scrape any page and get formatted data



The Scrape API allows you to get the data you want from web pages with a single call. You can scrape page content and capture its data in various formats.

For detailed usage, check out the API Reference.

Hyperbrowser exposes endpoints for starting a scrape request and for getting its status and results. By default, scraping is handled asynchronously: you first start the job and then check its status until it is completed. However, with our SDKs, we provide a simple function that handles the whole flow and returns the data once the job is completed.

Installation

Node.js

npm install @hyperbrowser/sdk

or

yarn add @hyperbrowser/sdk

Python

pip install hyperbrowser
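
The SDK examples below load the API key from the environment using dotenv. Assuming you follow that pattern, a minimal .env file in your project root would contain:

HYPERBROWSER_API_KEY=<your-api-key>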

Usage

Node.js

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Python

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartScrapeJobParams

# Load environment variables from .env file
load_dotenv()

# Initialize Hyperbrowser client
client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


def main():
    # Start scraping and wait for completion
    scrape_result = client.scrape.start_and_wait(
        StartScrapeJobParams(url="https://example.com")
    )
    print("Scrape result:\n", scrape_result.model_dump_json(indent=2))


main()

Start Scrape Job

curl -X POST https://app.hyperbrowser.ai/api/scrape \
    -H 'Content-Type: application/json' \
    -H 'x-api-key: <YOUR_API_KEY>' \
    -d '{
        "url": "https://example.com"
    }'

Get Scrape Job Status and Data

curl https://app.hyperbrowser.ai/api/scrape/{jobId} \
    -H 'x-api-key: <YOUR_API_KEY>'

Response

The Start Scrape Job POST /scrape endpoint will return a jobId in the response which can be used to get information about the job in subsequent requests.

{
    "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}

The Get Scrape Job GET /scrape/{jobId} endpoint will return the following data:

{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": {
    "metadata": {
      "title": "Example Page",
      "description": "A sample webpage"
    },
    "markdown": "# Example Page\nThis is content...",
  }
}

The status of a scrape job can be one of pending, running, completed, or failed. There can also be other optional fields, such as error with an error message if an error was encountered, and html and links in the data object, depending on which formats were requested.
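
If you are not using the SDK helpers, you can implement the start-then-poll flow yourself against the endpoints above. Below is a minimal Node.js sketch using fetch; the endpoint URLs, jobId field, and status values are taken from this page, while the polling interval and error handling are illustrative choices.

import { config } from "dotenv";

config();

const BASE_URL = "https://app.hyperbrowser.ai/api";
const API_KEY = process.env.HYPERBROWSER_API_KEY;

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

const main = async () => {
  // Start the scrape job
  const startRes = await fetch(`${BASE_URL}/scrape`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": API_KEY,
    },
    body: JSON.stringify({ url: "https://example.com" }),
  });
  const { jobId } = await startRes.json();

  // Poll the job until it is completed or failed
  while (true) {
    const statusRes = await fetch(`${BASE_URL}/scrape/${jobId}`, {
      headers: { "x-api-key": API_KEY },
    });
    const job = await statusRes.json();

    if (job.status === "completed") {
      console.log("Markdown:", job.data?.markdown);
      break;
    }
    if (job.status === "failed") {
      console.error("Scrape failed:", job.error);
      break;
    }

    // Still pending or running; wait before checking again
    await sleep(2000);
  }
};

main();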

Session Configurations

Node.js

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    sessionOptions: {
      useProxy: true,
      solveCaptchas: true,
      proxyCountry: "US",
      locales: ["en"],
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Python

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import StartScrapeJobParams, CreateSessionParams

load_dotenv()

client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


def main():
    scrape_result = client.scrape.start_and_wait(
        StartScrapeJobParams(
            url="https://example.com",
            session_options=CreateSessionParams(use_proxy=True, solve_captchas=True),
        )
    )
    print("Scrape result:\n", scrape_result.model_dump_json(indent=2))


main()

Proxy usage and CAPTCHA solving are only available on paid plans.

Using a proxy and solving CAPTCHAs will slow down the scrape, so use them only if necessary.
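
If you are calling the REST endpoint directly instead of going through an SDK, the session configuration goes in the request body. This is only a sketch: it assumes the POST /scrape body accepts the same camelCase sessionOptions fields shown in the Node SDK example above, so check the API Reference for the authoritative schema.

import { config } from "dotenv";

config();

const main = async () => {
  const res = await fetch("https://app.hyperbrowser.ai/api/scrape", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-api-key": process.env.HYPERBROWSER_API_KEY,
    },
    body: JSON.stringify({
      url: "https://example.com",
      // Assumed to mirror the Node SDK's sessionOptions fields
      sessionOptions: {
        useProxy: true,
        solveCaptchas: true,
        proxyCountry: "US",
      },
    }),
  });
  console.log(await res.json()); // e.g. { "jobId": "..." }
};

main();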

Scrape Configurations

You can also provide optional parameters for the scrape job itself, such as the formats to return, returning only the main content of the page, setting the maximum timeout for navigating to a page, and more.

Node.js

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      formats: ["markdown", "html", "links"],
      onlyMainContent: false,
      timeout: 15000,
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Python

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models import ScrapeOptions, StartScrapeJobParams

# Load environment variables from .env file
load_dotenv()

# Initialize Hyperbrowser client
client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


# Start scraping and wait for completion
scrape_result = client.scrape.start_and_wait(
    StartScrapeJobParams(
        url="https://example.com",
        scrape_options=ScrapeOptions(
            formats=["html", "links", "markdown"], only_main_content=False, timeout=5000
        ),
    )
)
print("Scrape result:", scrape_result)

Batch Scrape

Batch Scrape works the same as regular scrape, except instead of a single URL, you can provide a list of up to 1,000 URLs to scrape at once.

Batch Scrape is currently only available on the Ultra plan.

Node.js

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const scrapeResult = await client.scrape.batch.startAndWait({
    urls: ["https://example.com", "https://hyperbrowser.ai"],
    scrapeOptions: {
      formats: ["markdown", "html", "links"],
    },
  });
  console.log("Scrape result:", scrapeResult);
};

main();

Python

import os
from dotenv import load_dotenv
from hyperbrowser import Hyperbrowser
from hyperbrowser.models.scrape import ScrapeOptions, StartBatchScrapeJobParams

load_dotenv()

client = Hyperbrowser(api_key=os.getenv("HYPERBROWSER_API_KEY"))


scrape_result = client.scrape.batch.start_and_wait(
    StartBatchScrapeJobParams(
        urls=["https://example.com", "https://hyperbrowser.ai"],
        scrape_options=ScrapeOptions(
            formats=["html", "links", "markdown"]
        ),
    )
)
print("Scrape result:", scrape_result)

Response

The Start Batch Scrape Job POST /scrape/batch endpoint will return a jobId in the response which can be used to get information about the job in subsequent requests.

{
    "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}

The Get Batch Scrape Job GET /scrape/batch/{jobId} endpoint will return the following data:

{
    "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
    "status": "completed",
    "totalScrapedPages": 2,
    "totalPageBatches": 1,
    "currentPageBatch": 1,
    "batchSize": 20,
    "data": [
        {
            "markdown": "Hyperbrowser\n\n[Home](https://hyperbrowser.ai/)...",
            "metadata": {
                "url": "https://www.hyperbrowser.ai/",
                "title": "Hyperbrowser",
                "viewport": "width=device-width, initial-scale=1",
                "link:icon": "https://www.hyperbrowser.ai/favicon.ico",
                "sourceURL": "https://hyperbrowser.ai",
                "description": "Infinite Browsers"
            },
            "url": "hyperbrowser.ai",
            "status": "completed",
            "error": null
        },
        {
            "markdown": "Example Domain\n\n# Example Domain...",
            "metadata": {
                "url": "https://www.example.com/",
                "title": "Example Domain",
                "viewport": "width=device-width, initial-scale=1",
                "sourceURL": "https://example.com"
            },
            "url": "example.com",
            "status": "completed",
            "error": null
        }
    ]
}

The status of a batch scrape job can be one of pending, running, completed, or failed. The results of all the scrapes will be an array in the data field of the response. Each scraped page is returned in the order of the initially provided URLs, and each one has its own status and information.

As with a single scrape, batch scraping is handled asynchronously by default: you first start the job and then check its status until it is completed. However, with our SDKs, we provide a simple function (client.scrape.batch.startAndWait) that handles the whole flow and returns the data once the job is completed.
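
Since every page in a batch carries its own status, it is worth checking per-page results before using the data. The sketch below processes a batch result of the shape shown above; it assumes the SDK's client.scrape.batch.startAndWait response mirrors the GET /scrape/batch/{jobId} payload.

import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";

config();

const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});

const main = async () => {
  const batchResult = await client.scrape.batch.startAndWait({
    urls: ["https://example.com", "https://hyperbrowser.ai"],
    scrapeOptions: { formats: ["markdown"] },
  });

  // Pages come back in the same order as the provided URLs,
  // each with its own status and optional error message
  for (const page of batchResult.data ?? []) {
    if (page.status === "completed") {
      console.log(`${page.url}:`, page.markdown?.slice(0, 80));
    } else {
      console.error(`${page.url} failed:`, page.error);
    }
  }
};

main();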

To see the full schema, check out the API Reference.

You can also provide configurations for the session that will be used to execute the scrape job, just as you would when creating a new session itself. These could include using a proxy or solving CAPTCHAs. To see all the available session parameters, check out the Session Parameters page or the API Reference.

For a full reference on the scrape endpoint, check out the Scrape API Reference, or read the Advanced Scraping Guide to see more advanced options for scraping.
