'Bypass Cloudflare with puppeteer
I am trying to scrape some startups data of a site with puppeteer and when I try to navigate to the next page the cloudflare waiting screen comes in and disrupts the scraper. I tried changing the IP but its still the same. Is there a way to bypass it with puppeteer.
(async () => {
const browser = await puppeteer.launch({
headless: false,
defaultViewport: null,
});
const page = await browser.newPage();
page.setDefaultNavigationTimeout(0);
let links = [];
// initial page
await page.goto(`https://www.startupranking.com/top/india`, {
waitUntil: "networkidle0",
});
// looping through the url to different pages
for (let i = 2; i <= 7; i++) {
if (i === 3) {
console.log("waiting");
await page.waitFor(20000);
console.log("waited");
}
const onPageLinks = await page.$$eval("tr .name a", (arr) =>
arr.map((cur) => cur.href)
);
links = links.concat(onPageLinks);
console.log(onPageLinks, "inside loop");
await page.goto(`https://www.startupranking.com/top/india/${i}`, {
waitUntil: "networkidle0",
});
}
console.log(links, links.length, "outside loop");
})();
As it is only checking for the first loop i put in a waitFor to bypass the time it takes to check, it works fine on some IP's but on others it gives challenges to solve, I have to run this on a server so I am thinking of bypassing it completely.
Solution 1:[1]
You can use the cloudflare-scraper
package, which is also based on puppeteer.
taken from the documentation
installation
npm install cloudflare-scraper puppeteer
usage
const cloudflareScraper = require('cloudflare-scraper');
(async () => {
try {
const response = await cloudflareScraper.get('https://cloudflare-url.com');
console.log(response);
} catch (error) {
console.log(error);
}
})();
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | dcts |