Web Scrape pagination in a single URL (cheerio and axios)
Newbie here. I'm working on a web scraping project and would like some guidance on scraping paginated content. I'm scraping this site: https://www.imoney.my/unit-trust-investments. As you can see, I want to retrieve the different "Total return" percentages depending on the number of years selected. Right now I'm using cheerio and axios.
const http = require("http");
const axios = require("axios");
const cheerio = require("cheerio");

http
  .createServer(async function (_, res) {
    try {
      const response = await axios.get(
        "https://www.imoney.my/unit-trust-investments"
      );
      const $ = cheerio.load(response.data);
      const funds = [];
      $("[class='list-item']").each((_i, row) => {
        const $row = $(row);
        const fund = $row.find("[class*='product-title']").find("a").text();
        const price = $row.find("[class*='is-narrow product-profit']").find("b").text();
        const risk = $row.find("[class*='product-title']").find("[class*='font-xsm extra-info']").text().replace('/10','');
        const totalreturn = $row.find("[class*='product-return']").find("[class='font-lg']").find("b").text().replace('%','');
        funds.push({ fund, price, risk, totalreturn});
      });
      res.statusCode = 200;
      res.write(JSON.stringify(funds, null, 4));
    } catch (err) {
      res.statusCode = 400;
      res.write("Unable to process request.");
    }
    res.end();
  })
  .listen(8080);
Note that the URL does not change when a different year is selected; only the "Total return" value changes.
Solution 1:[1]
This happens because the page uses JavaScript to generate its content, while axios only downloads the initial HTML and never runs those scripts. In this case, you need a browser automation tool such as Puppeteer:
const puppeteer = require("puppeteer");

const availableFunds = "10000";
const years = 2; // 3 for 0.5 years; 2 for 1 year; 1 for 2 years, 0 for 3 years.

async function start() {
  const browser = await puppeteer.launch({
    headless: false,
  });
  const page = await browser.newPage();
  await page.goto("https://www.imoney.my/unit-trust-investments");
  await page.waitForSelector(".product-item");

  // Delete the existing amount (up to 5 characters), then type the new one.
  await page.focus("#amount");
  for (let i = 0; i < 5; i++) {
    await page.keyboard.press("Backspace");
  }
  await page.type("#amount", availableFunds);

  // Open the tenure dropdown and press ArrowUp `years` times (see mapping above).
  await page.click("#tenure");
  for (let i = 0; i < years; i++) {
    await page.keyboard.press("ArrowUp");
  }
  await page.keyboard.press("Enter");

  // Read the fund rows inside the page context; skip rows with missing fields.
  const funds = await page.evaluate(() => {
    const funds = [];
    Array.from(document.querySelectorAll(".product-item")).forEach((el) => {
      const fund = el.querySelector(".title")?.textContent.trim();
      const price = el.querySelector(".investmentReturnValue")?.textContent.trim();
      const risk = el.querySelector(".col-title .info-desc dd")?.textContent.trim();
      const totalreturn = el.querySelector(".col-rate.text-left .info-desc .ir-value")?.textContent.trim();
      if (fund && price && risk && totalreturn) funds.push({ fund, price, risk, totalreturn });
    });
    return funds;
  });

  console.log(funds);
  await browser.close();
}

start();
Output:
[
  {
    fund: 'Aberdeen Standard Islamic World Equity Fund - Class A',
    price: 'RM 12,651.20',
    risk: 'Medium\n 7/10',
    totalreturn: '26.51'
  },
  {
    fund: 'Affin Hwang Select Balanced Fund',
    price: 'RM 10,355.52',
    risk: 'Medium\n 5/10',
    totalreturn: '3.56'
  },
  ... and others
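The scraped values above are all strings, and the risk field still contains a line break. If you want numbers you can sort or compare, a small post-processing step may help. The following is only a sketch: `normalizeFund` and the field names `riskLabel`, `riskScore` and `totalReturnPct` are hypothetical and not part of the answer's code, and it assumes every record looks like the two samples shown in the output.

```javascript
// Hypothetical helper (not from the original answer): converts one scraped
// record of strings into typed fields. Assumes the formats shown in the output.
function normalizeFund(raw) {
  // Split a risk string such as 'Medium\n 7/10' into a label and a score.
  const riskMatch = raw.risk.match(/^\s*([^\d]*?)\s*(\d+)\s*\/\s*\d+\s*$/);
  return {
    fund: raw.fund.trim(),
    // 'RM 12,651.20' -> 12651.2 (keep only digits and the decimal point)
    price: Number(raw.price.replace(/[^\d.]/g, "")),
    // Fall back to the trimmed raw text if the risk string has another shape.
    riskLabel: riskMatch ? riskMatch[1] : raw.risk.trim(),
    riskScore: riskMatch ? Number(riskMatch[2]) : null,
    // '26.51' or '26.51%' -> 26.51
    totalReturnPct: Number(raw.totalreturn.replace("%", "").trim()),
  };
}

// Two records copied from the output above.
const sampleFunds = [
  {
    fund: "Aberdeen Standard Islamic World Equity Fund - Class A",
    price: "RM 12,651.20",
    risk: "Medium\n 7/10",
    totalreturn: "26.51",
  },
  {
    fund: "Affin Hwang Select Balanced Fund",
    price: "RM 10,355.52",
    risk: "Medium\n 5/10",
    totalreturn: "3.56",
  },
];

const normalized = sampleFunds.map(normalizeFund);
console.log(normalized);
```

In the Puppeteer script you could call `funds.map(normalizeFund)` just before `console.log(funds)`. Records whose risk text does not match the assumed "label, then score/10" shape keep the raw text as the label and get a `null` score.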
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Mikhail Zub |