Puppeteer not giving accurate HTML code for page with shadow roots

I am trying to download the HTML for the website intersight.com/help/, but Puppeteer does not return the links that are visible on the page (for example, https://intersight.com/help/getting_started is missing from the downloaded HTML). On inspecting the page in the browser, I found that all the missing HTML sits inside the <an-hulk></an-hulk> tags. I don't know what these tags mean.

const puppeteer = require('puppeteer');
const fs = require('fs');
(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  const data = await page.goto('https://intersight.com/help/', { waitUntil: 'domcontentloaded' });
  // Tried each of the lines below; none of them worked
  // await page.waitForSelector('.helplet-links')
  // document.querySelector("#app > an-hulk").shadowRoot.querySelector("#content").shadowRoot.querySelector("#main > div > div > div > an-hulk-home").shadowRoot.querySelector("div > div > div:nth-child(1) > div:nth-child(1) > div.helplet-links > ul > li:nth-child(1) > a > span")
  // await page.evaluateHandle(`document.querySelector("#app > an-hulk").shadowRoot.querySelector("#content").shadowRoot.querySelector("#main > div > div > div > an-hulk-home")`);
  await page.evaluateHandle(`document.querySelector("an-hulk").shadowRoot.querySelector("#aside").shadowRoot.querySelectorAll(".item")`)
  const result = await page.content()
  fs.writeFile('./intersight.html', result, (err) => {
    if (err) console.log(err)
    else console.log('done!!')
  })
  // console.log(result)
  await browser.close();
})();


Solution 1:[1]

As mentioned in the comments, you're dealing with a page that uses shadow roots. Ordinary CSS selectors don't pierce shadow roots, so a plain querySelector call won't reach that content from the console or from Puppeteer without help. Short of using a library, the idea is to identify any shadow hosts by their .shadowRoot property, then dive into each shadow root recursively and repeat the process until you reach the data you're after.

This code should grab all of the hrefs on the page (I didn't do a manual count) following this strategy:

const puppeteer = require("puppeteer");

let browser;
(async () => {
  browser = await puppeteer.launch({headless: true});
  const [page] = await browser.pages();
  const url = "https://intersight.com/help/";
  await page.goto(url, {
    waitUntil: "networkidle0"
  });
  await page.waitForSelector("an-hulk", {visible: true});
  const hrefs = await page.evaluate(() => {
    const walk = root => [
      ...[...root.querySelectorAll("a[href]")]
        .map(e => e.getAttribute("href")),
      ...[...root.querySelectorAll("*")]
        .filter(e => e.shadowRoot)
        .flatMap(e => walk(e.shadowRoot))
    ];
    return walk(document);
  });
  console.log(hrefs);
  console.log(hrefs.length); // => 44 at the time I ran this

  // Bonus example of diving manually into shadow roots...
  //const html = await page.evaluate(() =>
  //  document
  //    .querySelector("#app > an-hulk")
  //    .shadowRoot
  //    .querySelector("#content")
  //    .shadowRoot
  //    .querySelector("#main an-hulk-home")
  //    .shadowRoot
  //    .querySelector(".content")
  //    .innerHTML
  //);
  //console.log(html);
})()
  .catch(err => console.error(err))
  .finally(() => browser?.close());
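
The walk helper in the script above is hard-wired to a[href]. As a minimal sketch, the same recursion can take the selector as a parameter; the name collectDeep is hypothetical and not part of Puppeteer or the DOM API. It only reaches open shadow roots, since a closed one exposes shadowRoot as null.

```javascript
// Hypothetical helper: collect every match for `selector` in `root` and in
// all open shadow roots nested anywhere below it.
const collectDeep = (root, selector) => [
  // Matches in the current tree (document or shadow root)
  ...root.querySelectorAll(selector),
  // Recurse into each element that hosts a shadow root
  ...[...root.querySelectorAll("*")]
    .filter(el => el.shadowRoot)
    .flatMap(el => collectDeep(el.shadowRoot, selector)),
];
```

To use it with Puppeteer, define it inside the page.evaluate callback, because that function is serialized and run in the browser and cannot see Node-side variables. DOM nodes are not serializable, so map the results to strings (for example with getAttribute("href")) before returning them.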

Note that the sidebar and other parts of the page use event listeners on spans and divs to implement links, so these don't count as hrefs as far as the above code is concerned. If you want to access these URLs, there are a variety of strategies you can try, including clicking them and extracting the URL after navigation. This is speculative since it's not clear that you want to do this.
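
The click-and-read strategy mentioned above could look roughly like the sketch below. The helper name urlAfterClick and the 5000 ms timeout are assumptions for illustration, and I haven't run this against the site. It takes a Puppeteer page and an element handle, clicks the handle, waits for the location to change, and returns the new URL.

```javascript
// Hypothetical helper: click an element and report the URL the page ends up on.
const urlAfterClick = async (page, handle) => {
  const before = page.url();
  await handle.click();
  // A single-page app may change the URL without a full navigation, so wait
  // for location.href to differ instead of waiting for a navigation event.
  await page.waitForFunction(
    prev => location.href !== prev,
    {timeout: 5000}, // assumed value; tune for the site
    before
  );
  return page.url();
};
```

A handle to an element inside a shadow root can be obtained with page.evaluateHandle, chaining .shadowRoot lookups as in the question. After each click you would need to navigate back or reload before trying the next element.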


A few remarks about your code:

  • The thread "Puppeteer wait until page is completely loaded" is an important resource. { waitUntil: 'domcontentloaded' } is a weaker condition than { waitUntil: 'networkidle0' }. Use page.waitForSelector(selector, {visible: true}) and page.waitForFunction(predicate) to ensure the elements have been rendered before you begin manipulating them. Even without the shadow root, it's not clear to me that the top-level "an-hulk" is going to be available by the time you run evaluate.
  • Add console listeners to your page to help debug. Try your queries one step at a time and break them into multiple stages to see where they go wrong.
  • fs.writeFile should be await fs.promises.writeFile since you're in an async function.
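
The last two remarks can be sketched as a pair of small helpers. The names attachDebugListeners and saveHtml are hypothetical; the "console" and "pageerror" events and fs.promises.writeFile are the real Puppeteer and Node APIs they wrap.

```javascript
const fs = require("fs");

// Forward browser-side console output and uncaught page errors to the
// Node terminal so that failures inside evaluate() calls are visible.
const attachDebugListeners = page => {
  page.on("console", msg => console.log("[page]", msg.text()));
  page.on("pageerror", err => console.error("[page error]", err));
};

// Promise-based replacement for the callback-style fs.writeFile call.
// Awaiting it means a failed write rejects inside the async function.
const saveHtml = async (filePath, html) => {
  await fs.promises.writeFile(filePath, html, "utf8");
};
```

In the question's script this would mean calling attachDebugListeners(page) right after creating the page, and replacing the fs.writeFile callback with await saveHtml('./intersight.html', result) before browser.close().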


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1