Introduction to Web Crawling

Web crawling has existed since the beginning of the internet. Almost twenty years of gathering publicly available data have made the internet the biggest existing data resource. In this blog, we will focus on gathering information from the internet by crawling dynamic pages and scraping data from the crawled websites.

Data extraction is used to collect different types of data from various sources and is usually a combination of web scraping and web crawling. Web crawling discovers the links on the web that contain relevant data, while web scraping extracts datasets from those pages, which are later used for verification, analysis, or comparison based on a business's needs and goals.

Using HTTP client requests is more than enough to crawl data from static websites. However, if the elements on a page are created and modified dynamically, static crawlers become unsuitable. A dynamic page's content changes in response to user actions, while a static page continually renders the same HTML elements with the same content. Static crawlers cannot collect content rendered dynamically by JavaScript.
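To illustrate the limitation, consider a small helper (our own illustration, not from any library) that extracts anchor hrefs from raw HTML the way a naive static crawler would. Links that a page builds later with JavaScript simply do not exist in the HTML string the server returns:

```javascript
// A static crawler only sees the HTML the server returns. This helper
// extracts anchor hrefs from a raw HTML string; links rendered later by
// JavaScript are simply not in that string.
function extractLinks(html) {
  const links = [];
  const anchorPattern = /<a\b[^>]*\bhref="([^"]*)"/g;
  let match;
  while ((match = anchorPattern.exec(html)) !== null) {
    links.push(match[1]);
  }
  return links;
}

// Server-rendered HTML: the link is visible to a static crawler.
const staticHtml = '<a href="/blog/post-1">Post 1</a>';
// A dynamic page often ships an empty container instead; the links
// only exist after a browser executes the page's scripts.
const dynamicHtml = '<div id="posts"></div><script>/* renders links */</script>';

console.log(extractLinks(staticHtml));  // [ '/blog/post-1' ]
console.log(extractLinks(dynamicHtml)); // []
```

A headless browser closes exactly this gap: it executes the scripts first, so the crawler queries the rendered DOM rather than the raw response body.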

In the past, few approaches supported crawling dynamic sites; one was the combination of PhantomJS and Selenium. PhantomJS is a headless web browser that was abandoned due to a lack of active contributions. Selenium is an automation tool that supports many languages and different browsers, but crawlers do not need cross-browser support. The PhantomJS maintainer himself recommended switching to its superior alternative, headless Chrome. Apart from Chrome, there are many other headless options, such as Firefox, Splash, and HtmlUnit. All of the mentioned browsers can be started in headless mode and, depending on the software that controls them, the graphical user interface can be hidden or shown.

Not all headless browsers are suited for every task. Many developers use Selenium as an API for headless Firefox testing and automation. Splash can be used with the Scrapy framework for web scraping or for testing speed performance in Python. HtmlUnit uses Java for testing forms, links, or HTTP authentication. Chrome can be used for crawling or scraping data, taking screenshots or PDFs, as well as testing multiple levels of navigation. Currently the most popular headless browser, Chrome is often instrumented via the Puppeteer library, which provides an API to manipulate the browser. In this blog, we will crawl and scrape dynamic websites using the Puppeteer library with headless Chrome. The Puppeteer library offers a variety of actions that can be performed in the headless mode of Chrome or Firefox using JavaScript.

Ethical crawling

To crawl data ethically, two concepts need to be respected. The first is to read the terms and conditions of the page that is to be crawled. This document is not available on every website, but it usually contains helpful information for web crawlers. Implementing a clickwrap or browsewrap agreement is the best way for a website to prevent unwanted crawling and protect itself through its terms and conditions. A clickwrap agreement requires the visitor to actively indicate agreement before accessing any website information. A browsewrap agreement automatically assumes users have accepted the agreement simply by using the website.

The second concept is to obey the robots.txt file. Each website should contain a robots.txt file to prevent unwanted data crawling or to keep specific subpages or folders private. Disobeying robots.txt rules is neither illegal nor impossible, but it would be unethical. Rules for writing such a file can be found here. The robots.txt file can usually be found at the root of a website's domain, for instance, https://www.atlantbh.com/robots.txt, which contains the following:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
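These two rules interact: /wp-admin/ is disallowed as a whole, yet admin-ajax.php inside it is allowed. Robots.txt parsers resolve this by letting the most specific (longest) matching rule win. The sketch below (our own simplified illustration; real parsers also handle wildcards and multiple user-agent groups) shows that resolution for the rules above:

```javascript
// The rules from the robots.txt above, as data.
const rules = [
  { type: 'disallow', path: '/wp-admin/' },
  { type: 'allow', path: '/wp-admin/admin-ajax.php' },
];

// Among all rules whose path prefix matches the URL path, the longest
// (most specific) one wins; everything is allowed by default.
function canCrawl(path) {
  let best = { type: 'allow', path: '' };
  for (const rule of rules) {
    if (path.startsWith(rule.path) && rule.path.length > best.path.length) {
      best = rule;
    }
  }
  return best.type === 'allow';
}

console.log(canCrawl('/blog/'));                   // true  (no rule matches)
console.log(canCrawl('/wp-admin/options.php'));    // false (Disallow wins)
console.log(canCrawl('/wp-admin/admin-ajax.php')); // true  (longer Allow wins)
```

In practice, we delegate this logic to a dedicated library, as shown later in the blog.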


Using Puppeteer library with Headless Chrome for crawling

In this part of the blog, we’re going to cover creating a crawler using Puppeteer with the help of Headless Chrome. We will start by creating a simple Node app. Make sure you have Node.js 8+ installed before initializing your project:

mkdir crawler-project
cd crawler-project
npm init

The first step to getting started with the Puppeteer library is running the installation command below:

npm install puppeteer

Next, create a new file named crawler.js and open it in your favorite code editor. To work with the Puppeteer library, we need to import it using the require method:

const puppeteer = require('puppeteer');

For crawling purposes, we will use a page that shows all blogs from the Atlantbh page: https://www.atlantbh.com/blog/.

As mentioned, we need to check whether this page is crawlable using the robots.txt file. Puppeteer itself does not offer a robots.txt check, so we need another approach. One option is a standalone robots.txt tester tool; another is a package or plugin that checks URL availability on the fly. Packages or plugins are particularly convenient when more than one website needs to be crawled.

We decided to use the npm package called robots-txt-parser, which can be installed using the command below:

npm install robots-txt-parser

After installation is done, it is necessary to import it into the file, just as we did with Puppeteer:

const robotsParser = require('robots-txt-parser');

The next step is creating a method that checks whether a URL allows data extraction. In the code section below, along with the https://www.atlantbh.com/blog/ URL, two more URLs are used to show how the robots-txt-parser library works. The DEFAULT_USER_AGENT variable is platform-specific and may differ for each reader. More details about the userAgent variable can be found later in the blog.

const DEFAULT_USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) " + "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36";
const DEFAULT_HOST = "https://www.atlantbh.com";

async function checkIfAllowed(url) {
   const robots = robotsParser({ userAgent: DEFAULT_USER_AGENT});
   await robots.useRobotsFor(DEFAULT_HOST);
   return robots.canCrawl(url);
}
console.assert(await checkIfAllowed('https://www.atlantbh.com/blog/') === true);
console.assert(await checkIfAllowed('https://www.atlantbh.com/wp-json/') === true);
console.assert(await checkIfAllowed('https://www.atlantbh.com/wp-admin/css') === true);

From the robots.txt file, we can see that only the /wp-admin/ folder is (partially) disallowed. Therefore, after executing the code section above, the first two assertions pass, but the third one fails because crawling the CSS folder inside /wp-admin/ is not allowed.

Once we have confirmed that our initial website URL can be crawled, and with the Puppeteer library imported, we can continue building our crawler.

With the function crawl(), displayed in the code section below, we crawl all available blog links from the https://www.atlantbh.com/blog/ page. The function fetchUrls() loops through all HTML elements that represent articles, using specific selectors to get all the required URL values. After crawling all the links, we will check whether those links are allowed to be crawled and then scrape data from them to get the blog details.

async function crawl() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent(DEFAULT_USER_AGENT);
  // waitUntil is a navigation option, so it is passed to goto(), not launch()
  await page.goto('https://www.atlantbh.com/blog/', {waitUntil: 'domcontentloaded'});
  const urls = await fetchUrls(page);
  await browser.close();
}

async function fetchUrls(page) {
   return await page.evaluate(() => {
       return Array.from(
           document.querySelectorAll(
               "div.post-area > div.posts-container > article"
           )
       ).map(
           (element) =>
               element.querySelector(
                   "div.content-inner > " +
                   "div.article-content-wrap > div.post-header > h3 > a"
               ).href
       );
   });
}

After launching a browser with Puppeteer, the next step is creating a new page inside that browser. When the page is created, it is necessary to set a user agent that will be used for crawling. The user agent defines which browser, browser version, and operating system are reported while crawling. The output of the urls variable looks something like this:

[
  'https://www.atlantbh.com/selenium-4-relative-locators/',
  'https://www.atlantbh.com/bridge-pattern-in-java/',
  'https://www.atlantbh.com/xpath-in-selenium/',
  'https://www.atlantbh.com/using-data-to-control-test-flow/',
  'https://www.atlantbh.com/design-review-in-product-development-2/',
  'https://www.atlantbh.com/sealed-classes-and-interfaces-in-java/',
...]

After we have crawled all the blog URLs, we check whether crawling each of them is allowed and then scrape data from them. The following code section shows what should be added to the code above to create an array of objects with blog details:

const blogs = [];
for (let url of urls) {
 if (await checkIfAllowed(url)) {
    const blogPage = await browser.newPage();
    await blogPage.setUserAgent(DEFAULT_USER_AGENT);
    await blogPage.goto(url);
    await populateBlogDetails(blogPage, blogs, url);
  }
}

The function populateBlogDetails() fetches all the wanted properties for each blog using HTML selectors. The implementation of this function can be found in the code section below:

async function populateBlogDetails(page, blogs, url) {
   blogs.push({
       url,
       ...(await page.evaluate(() => {
           return {
               title: document.querySelector( "h1.entry-title").innerText,
               category: Array.from(document.querySelectorAll(
                     "div.inner-wrap > a"))
                   .map((x) => x.innerText)
                   .join(", "),
               author: document.querySelector(
                   "div#single-below-header > span.meta-author > span > a"
               ).innerText,
               publishedDate: document.querySelectorAll(
                   "div#single-below-header > span.meta-date"
               )[0].innerText,
           };
       })),
   });
}

The output of the blogs variable looks something like this:

[{
url: 'https://www.atlantbh.com/bridge-pattern-in-java/',
title: 'Bridge pattern in Java',
category: 'SOFTWARE DEVELOPMENT, TECH BITES',
author: 'Dragan Jankovic',
publishedDate: 'June 21, 2022'
},
{
url: 'https://www.atlantbh.com/sealed-classes-and-interfaces-in-java/',
title: 'Sealed Classes and Interfaces in Java',
category: 'SOFTWARE DEVELOPMENT, TECH BITES',
author: 'Lamija Vrnjak',
publishedDate: 'June 15, 2022'
}...]
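The selectors in populateBlogDetails() assume every blog page shares the same structure. That is not guaranteed: querySelector returns null when a selector does not match, and reading .innerText on null throws. A small null-safe helper (our own addition, not part of the original code) can guard against such pages:

```javascript
// Null-safe text extraction: returns the trimmed innerText of the first
// element matching `selector` under `root`, or null when nothing matches,
// instead of throwing a TypeError on a missing element.
function textOf(root, selector) {
  const element = root.querySelector(selector);
  return element ? element.innerText.trim() : null;
}
```

Inside page.evaluate, `title: textOf(document, "h1.entry-title")` would then yield null for an unusual page rather than aborting the whole scrape.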


Parallelization

The code section above fetches all Atlantbh blogs with details in around 14 minutes. That is really slow for scraping only 309 blogs (the total at the time of scraping). The code was run on a MacBook Pro with a 6-core i7 (2.6 GHz), over a wireless connection of around 70 megabits per second.

Execution time can be optimized by parallelizing the crawler. Parallelization can be achieved by opening more headless Chrome instances, each with one or more pages, and coordinating them with promises. The number of browser instances should be chosen carefully based on the machine's characteristics; otherwise, parallelization will make execution time even longer.

Using a larger number of browser instances with fewer pages open per instance is better than using fewer browser instances with more pages open. The main reason is that a single crashing page takes down the whole browser instance it belongs to.
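For the same reason, it is worth isolating failures per URL as well. The sketch below (our own addition; `scrapeOne` is a hypothetical callback standing in for the per-page work) logs a failure and moves on, so one broken page does not abort the rest of the batch:

```javascript
// Run `scrapeOne` for each URL, collecting failures instead of letting
// a single thrown error reject the whole batch.
async function scrapeSafely(urls, scrapeOne) {
  const failures = [];
  for (const url of urls) {
    try {
      await scrapeOne(url); // e.g. open a page and populate blog details
    } catch (error) {
      // record and continue; the remaining URLs are still processed
      failures.push({ url, message: error.message });
    }
  }
  return failures;
}
```

The failures array can then be retried later or simply reported at the end of the run.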

The following code shows what should be done instead of simply looping through the list and creating pages. The totalBrowserInstancesCount variable is the maximum number of browser instances that can be created, and totalPagesPerBrowserCount is the maximum number of pages that can be created per browser. Promises are used to parallelize the scraping across both browser instances and pages.

const blogs = [];
const browserPromises = [];
const totalPagesPerBrowserCount = 1;
const totalBrowserInstancesCount = 4;
for (let i = 0; i < totalBrowserInstancesCount; i++) {
   browserPromises.push(
      (async () => {
         const browser = await puppeteer.launch();
         const pagePromises = [];
         for (let j = 0; j < totalPagesPerBrowserCount; j++) {
            pagePromises.push(
               (async () => {
                  // every page worker drains the shared urls list
                  while (urls.length > 0) {
                     const url = urls.pop();
                     if (await checkIfAllowed(url)) {
                        const page = await browser.newPage();
                        await page.setUserAgent(DEFAULT_USER_AGENT);
                        await page.goto(url);
                        await populateBlogDetails(page, blogs, url);
                        await page.close();
                     }
                  }
               })()
            );
         }
         await Promise.all(pagePromises);
         await browser.close();
      })()
   );
}
await Promise.all(browserPromises);

The blogs variable has the same value as in the code section above, with all details included. To gather more test results, we executed this code section again with one change: we set totalPagesPerBrowserCount to 3, allowing the creation of 3 pages per browser. The execution times for the three scraping runs are shown below; parallelization gives much better results.

(Figure: crawling execution times for the three runs)


Common errors when scraping data from pages

When crawling different dynamic pages, many errors can occur. Most of them can be solved using various Puppeteer options, which differ between Firefox and Chrome. An error that often occurs is a timeout when navigating to a specific page or subpage. A timeout can be caused by trying to crawl elements before the page is fully loaded. The option that waits for the page to load or for requests to finish is waitUntil. It can take a few different values, telling the crawler to wait until:

  1. load – the load event is finished.
  2. domcontentloaded – the DOMContentLoaded event is fired.
  3. networkidle0 – no more than 0 network connections for at least 500 ms.
  4. networkidle2 – no more than 2 network connections for at least 500 ms.

Sometimes these errors cannot be fixed with the waitUntil option. It does not help when pages lazy-load images or content that only appears when scrolled into view, or when the content is animated. In those situations, it is necessary to scroll to the bottom of the page in order to render all the HTML elements needed for extracting data.
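One common approach, sketched below under the assumption of a Puppeteer page object, is to scroll down in steps until the page height stops growing, which signals that no more lazy-loaded content is arriving:

```javascript
// Scroll a Puppeteer page in viewport-sized steps until the document
// stops growing, so lazy-loaded content below the fold gets rendered
// before selectors are queried.
async function scrollToBottom(page) {
  await page.evaluate(async () => {
    await new Promise((resolve) => {
      let lastHeight = 0;
      const timer = setInterval(() => {
        window.scrollBy(0, window.innerHeight);
        if (document.body.scrollHeight === lastHeight) {
          // the page stopped growing: lazy content has loaded
          clearInterval(timer);
          resolve();
        }
        lastHeight = document.body.scrollHeight;
      }, 250); // pause between steps to give lazy loaders time to fetch
    });
  });
}
```

Calling `await scrollToBottom(page);` before extracting data then makes the below-the-fold elements available. The 250 ms step interval is our assumption; slower pages may need a longer pause.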

Scrolling to the bottom will not help if the page unloads content once it leaves the screen. In that case, we should try to extract data from specific HTML elements; Puppeteer offers the option of waiting for a specific element by its selector. If this wait times out, we should keep scrolling and waiting for the required selector to become available, or until we reach the page's end.

await page.waitForSelector("#elementId");

If the page takes too long to load, we can abort the requests for files that slow the loading down. This can be done using the request interception that Puppeteer offers. Intercepting requests also lets us study which requests and responses are exchanged while a page is loading and our code is executing. Every intercepted request must be aborted, continued, or responded to exactly once; otherwise, the page will hang waiting on it. The code section below is an example of blocking image resources on the page:

await page.setRequestInterception(true);
page.on('request', (request) => {
 if (request.resourceType() === "image") {
     request.abort();
 } else {
     request.continue();
 }
});


Summary

Developers commonly use the Puppeteer library, but unfortunately, not always ethically. In this blog, we shared our experience with the headless browser Chrome and the Puppeteer library, along with the performance results of parallelizing browsers. We hope you find headless Chrome with Puppeteer useful as well.

All the code used in this blog can be found at this link.
