Scrape
Scrape any page and get formatted data
The Scrape API lets you get the data you want from web pages with a single call. You can scrape page content and capture its data in various formats.
Hyperbrowser exposes endpoints for starting a scrape request and for getting its status and results. By default, scraping is asynchronous: you first start the job and then check its status until it is completed. With our SDKs, however, a single function handles the whole flow and returns the data once the job is completed.
Installation
npm install @hyperbrowser/sdk
or
yarn add @hyperbrowser/sdk
Usage
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  // Handles both starting and waiting for scrape job response
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
  });
  console.log("Scrape result:", scrapeResult);
};
main();
Response
The Start Scrape Job endpoint (POST /scrape) will return a jobId in the response, which can be used to get information about the job in subsequent requests.
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
The Get Scrape Job endpoint (GET /scrape/{jobId}) will return the following data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "data": {
    "metadata": {
      "title": "Example Page",
      "description": "A sample webpage"
    },
    "markdown": "# Example Page\nThis is content..."
  }
}
The status of a scrape job can be one of pending, running, completed, or failed. The response can also include optional fields such as error, which carries an error message if an error was encountered, and html and links in the data object, depending on which formats were requested.
To see the full schema, check out the API Reference.
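The start-then-check flow described above can be sketched as a small polling loop. This is a minimal illustration, not the SDK's implementation: `waitForScrape`, its parameters, and the `ScrapeJob` interface are hypothetical names, and `getJob` stands in for whatever performs the GET /scrape/{jobId} request.

```typescript
// Status values as listed above; the job shape mirrors the sample response.
type ScrapeStatus = "pending" | "running" | "completed" | "failed";

interface ScrapeJob {
  jobId: string;
  status: ScrapeStatus;
  data?: Record<string, unknown>;
  error?: string;
}

// Hypothetical helper: `getJob` is any function that fetches the current
// state of the job (for example, a wrapper around GET /scrape/{jobId}).
async function waitForScrape(
  getJob: () => Promise<ScrapeJob>,
  intervalMs = 2000,
  maxAttempts = 60,
): Promise<ScrapeJob> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const job = await getJob();
    // "completed" and "failed" are terminal; anything else means keep waiting.
    if (job.status === "completed" || job.status === "failed") {
      return job;
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Scrape job did not finish after ${maxAttempts} checks`);
}
```

The interval and attempt limit are arbitrary defaults chosen for the sketch. When using the SDK, `client.scrape.startAndWait` already covers this flow, so a loop like this is only needed when calling the endpoints directly.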
Session Configurations
You can also provide configurations for the session that will be used to execute the scrape job, just as you would when creating a new session. These could include using a proxy or solving CAPTCHAs. To see all the available session parameters, check out the API Reference or Session Parameters.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    sessionOptions: {
      useProxy: true,
      solveCaptchas: true,
      proxyCountry: "US",
      locales: ["en"],
    },
  });
  console.log("Scrape result:", scrapeResult);
};
main();
Scrape Configurations
You can also provide optional parameters for the scrape job itself, such as the formats to return, whether to return only the main content of the page, and the maximum timeout for navigating to a page.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const scrapeResult = await client.scrape.startAndWait({
    url: "https://example.com",
    scrapeOptions: {
      formats: ["markdown", "html", "links"],
      onlyMainContent: false,
      timeout: 15000,
    },
  });
  console.log("Scrape result:", scrapeResult);
};
main();
For a full reference on the scrape endpoint, check out the API Reference, or read the Advanced Scraping Guide to see more advanced options for scraping.
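Because the fields present in the data object depend on the requested formats, it can be useful to check that every requested format actually came back. The sketch below is illustrative only: `ScrapeData`, `ScrapeFormat`, and `missingFormats` are hypothetical names, and the types given for html and links are assumptions rather than the documented schema.

```typescript
// Assumed shape of the `data` object. Only `metadata` and `markdown` appear
// in the sample response; the types of `html` and `links` are guesses.
interface ScrapeData {
  metadata?: Record<string, unknown>;
  markdown?: string;
  html?: string;
  links?: string[];
}

type ScrapeFormat = "markdown" | "html" | "links";

// Returns the requested formats that are absent from the result.
// An empty string or empty array counts as present.
function missingFormats(
  data: ScrapeData,
  requested: ScrapeFormat[],
): ScrapeFormat[] {
  return requested.filter((format) => data[format] == null);
}
```

For example, calling `missingFormats(data, ["markdown", "html", "links"])` on a result that only contains markdown would report html and links as missing.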
Batch Scrape
Batch Scrape works the same as regular scrape, except instead of a single URL, you can provide a list of up to 1,000 URLs to scrape at once.
import { Hyperbrowser } from "@hyperbrowser/sdk";
import { config } from "dotenv";
config();
const client = new Hyperbrowser({
  apiKey: process.env.HYPERBROWSER_API_KEY,
});
const main = async () => {
  const scrapeResult = await client.scrape.batch.startAndWait({
    urls: ["https://example.com", "https://hyperbrowser.ai"],
    scrapeOptions: {
      formats: ["markdown", "html", "links"],
    },
  });
  console.log("Scrape result:", scrapeResult);
};
main();
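Since a batch accepts at most 1,000 URLs, a longer list has to be split before it is submitted. The following is a minimal sketch of one way to do that; `chunkUrls` and `MAX_URLS_PER_BATCH` are hypothetical names introduced here, not part of the SDK.

```typescript
// Limit taken from the batch scrape description above.
const MAX_URLS_PER_BATCH = 1000;

// Splits a list of URLs into consecutive chunks of at most `size` entries,
// preserving the original order.
function chunkUrls(
  urls: string[],
  size: number = MAX_URLS_PER_BATCH,
): string[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const chunks: string[][] = [];
  for (let i = 0; i < urls.length; i += size) {
    chunks.push(urls.slice(i, i + size));
  }
  return chunks;
}
```

Each chunk could then be passed as the `urls` value of a separate `client.scrape.batch.startAndWait` call.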
Response
The Start Batch Scrape Job endpoint (POST /scrape/batch) will return a jobId in the response, which can be used to get information about the job in subsequent requests.
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c"
}
The Get Batch Scrape Job endpoint (GET /scrape/batch/{jobId}) will return the following data:
{
  "jobId": "962372c4-a140-400b-8c26-4ffe21d9fb9c",
  "status": "completed",
  "totalScrapedPages": 2,
  "totalPageBatches": 1,
  "currentPageBatch": 1,
  "batchSize": 20,
  "data": [
    {
      "markdown": "Hyperbrowser\n\n[Home](https://hyperbrowser.ai/)...",
      "metadata": {
        "url": "https://www.hyperbrowser.ai/",
        "title": "Hyperbrowser",
        "viewport": "width=device-width, initial-scale=1",
        "link:icon": "https://www.hyperbrowser.ai/favicon.ico",
        "sourceURL": "https://hyperbrowser.ai",
        "description": "Infinite Browsers"
      },
      "url": "hyperbrowser.ai",
      "status": "completed",
      "error": null
    },
    {
      "markdown": "Example Domain\n\n# Example Domain...",
      "metadata": {
        "url": "https://www.example.com/",
        "title": "Example Domain",
        "viewport": "width=device-width, initial-scale=1",
        "sourceURL": "https://example.com"
      },
      "url": "example.com",
      "status": "completed",
      "error": null
    }
  ]
}
The status of a batch scrape job can be one of pending, running, completed, or failed. The results of all the scrapes are returned as an array in the data field of the response. Each scraped page is returned in the order of the URLs initially provided, and each one has its own status and information.
To see the full schema, check out the API Reference.
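Because each page in a batch carries its own status and error, a finished batch can contain a mix of successful and unsuccessful pages. A minimal sketch of separating them follows; `BatchPageResult` and `partitionBatchResults` are hypothetical names, and the interface only models the fields visible in the sample response.

```typescript
// Per-page entry, modelled on the sample batch response above.
interface BatchPageResult {
  url: string;
  status: string;
  error: string | null;
  markdown?: string;
}

// Separates pages that completed without an error from all the others.
// Treating every non-"completed" page as not completed is an assumption.
function partitionBatchResults(pages: BatchPageResult[]): {
  completed: BatchPageResult[];
  notCompleted: BatchPageResult[];
} {
  const completed: BatchPageResult[] = [];
  const notCompleted: BatchPageResult[] = [];
  for (const page of pages) {
    if (page.status === "completed" && page.error == null) {
      completed.push(page);
    } else {
      notCompleted.push(page);
    }
  }
  return { completed, notCompleted };
}
```

The `notCompleted` list could then be used to log errors or to build the URL list for a follow-up batch.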
As with the single scrape, batch scraping is asynchronous by default: you first start the job and then check its status until it is completed. With our SDKs, however, a single function (client.scrape.batch.startAndWait) handles the whole flow and returns the data once the job is completed.