Using Node.js for Web Scraping

Web scraping is the process of extracting data from websites, and it is a useful technique for data gathering, research, and automation. With Node.js, we can build scrapers that pull data out of a variety of sites and automate collection tasks. In this article, we will explore how to use Node.js for web scraping.

Getting Started

Before we start scraping, we need to set up our Node.js environment and install two modules: request, which is used for making HTTP requests, and cheerio, which parses HTML and XML documents and lets us query them with a jQuery-like API. Note that the request module has been deprecated; it still works, and we use it here to keep the examples simple, but it no longer receives updates.

To install these modules, run the following command in your terminal:

npm install request cheerio
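
If you would rather use a maintained HTTP client, axios is a common promise-based alternative; cheerio works the same way with either. To follow along with axios instead, install it alongside cheerio:

npm install axios cheerio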

Scraping Web Pages

Now that we have installed the necessary modules, we can start scraping web pages. Let’s start by scraping the title and description of a web page.

javascript
const request = require('request');
const cheerio = require('cheerio');

const url = 'https://www.example.com';

request(url, (error, response, html) => {
  // Only proceed if the request succeeded
  if (!error && response.statusCode === 200) {
    // Load the HTML into cheerio so we can query it with CSS selectors
    const $ = cheerio.load(html);
    const title = $('title').text();
    const description = $('meta[name="description"]').attr('content');
    console.log(title);
    console.log(description);
  }
});

In this code, we make an HTTP request to the specified URL using the request module. We then load the HTML into cheerio and use CSS selectors to extract the page title and the content of its meta description tag.
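
If you prefer to avoid the deprecated request module, the same example can be written with axios. This is a minimal sketch of the equivalent code; note that axios rejects its promise on non-2xx status codes, so the explicit status check becomes a catch handler:

javascript
const axios = require('axios');
const cheerio = require('cheerio');

const url = 'https://www.example.com';

axios.get(url)
  .then((response) => {
    // response.data contains the raw HTML string
    const $ = cheerio.load(response.data);
    console.log($('title').text());
    console.log($('meta[name="description"]').attr('content'));
  })
  .catch((error) => console.error(error));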

Scraping Tables

We can also use Node.js to extract data from HTML tables. Let’s say we want to extract a table of cryptocurrencies and their prices from a web page.

javascript
const request = require('request');
const cheerio = require('cheerio');

const url = 'https://coinmarketcap.com/';

request(url, (error, response, html) => {
  if (!error && response.statusCode === 200) {
    const $ = cheerio.load(html);
    // Select the first table on the page
    const table = $('table').eq(0);
    const rows = table.find('tbody > tr');
    rows.each((i, el) => {
      const columns = $(el).find('td');
      // Column positions (1 = name, 3 = price) depend on the site's markup
      const name = columns.eq(1).text();
      const price = columns.eq(3).text();
      console.log(name, price);
    });
  }
});

In this code, we make an HTTP request to the specified URL, load the HTML into cheerio, and select the first table on the page. We then iterate over its rows and extract the name and price of each cryptocurrency. Keep in mind that the column indexes are tied to the site's current markup, and that sites which render their tables with client-side JavaScript may not include the data in the initial HTML at all.
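
Rather than just logging the values, we would usually collect them into structured objects and persist them. The sketch below gathers each row into a { name, price } object and writes the result to a JSON file using Node's built-in fs module; as above, the column indexes are assumptions about the page's markup:

javascript
const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs');

const url = 'https://coinmarketcap.com/';

request(url, (error, response, html) => {
  if (!error && response.statusCode === 200) {
    const $ = cheerio.load(html);
    const coins = [];
    $('table').eq(0).find('tbody > tr').each((i, el) => {
      const columns = $(el).find('td');
      coins.push({
        name: columns.eq(1).text().trim(),   // assumed name column
        price: columns.eq(3).text().trim(),  // assumed price column
      });
    });
    // Persist the scraped rows as pretty-printed JSON
    fs.writeFileSync('coins.json', JSON.stringify(coins, null, 2));
    console.log(`Saved ${coins.length} rows to coins.json`);
  }
});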

Handling Pagination

Sometimes, the data we want to scrape is spread across multiple pages. We can use Node.js to automate the process of scraping multiple pages. Let’s say we want to scrape a list of all available programming languages from the TIOBE index.

javascript
const request = require('request');
const cheerio = require('cheerio');

const baseUrl = 'https://www.tiobe.com/tiobe-index/';
const languages = [];

function scrapePage(url) {
  return new Promise((resolve, reject) => {
    request(url, (error, response, html) => {
      if (error) {
        reject(error);
        return;
      }

      const $ = cheerio.load(html);
      const rows = $('.table > tbody > tr');

      rows.each((i, el) => {
        const columns = $(el).find('td');
        const name = columns.eq(3).text().trim();
        if (name) {
          languages.push(name);
        }
      });

      // If the page links to a next page, scrape it recursively;
      // otherwise resolve with everything collected so far
      const nextLink = $('.next').attr('href');
      if (nextLink) {
        const nextUrl = baseUrl + nextLink;
        scrapePage(nextUrl).then(resolve).catch(reject);
      } else {
        resolve(languages);
      }
    });
  });
}

scrapePage(baseUrl).then((languages) => {
  console.log(languages);
}).catch((error) => {
  console.error(error);
});

In this code, we define a function `scrapePage` that scrapes a single page and returns a promise. Whenever the page links to a next page, `scrapePage` calls itself, so all available pages are scraped until none remain; the promise then resolves with the accumulated array of programming languages, which we log to the console.
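
The same pagination logic can be written with async/await, which flattens the recursion into a loop. The sketch below uses Node's util.promisify to wrap request's callback API and assumes the same .table and .next selectors as the example above:

javascript
const request = require('request');
const cheerio = require('cheerio');
const { promisify } = require('util');

// promisify turns request's callback API into a promise-based one;
// the promise resolves with the response object (html is response.body)
const get = promisify(request);

async function scrapeAll(startUrl) {
  const languages = [];
  let url = startUrl;
  while (url) {
    const response = await get(url);
    const $ = cheerio.load(response.body);
    $('.table > tbody > tr').each((i, el) => {
      const name = $(el).find('td').eq(3).text().trim();
      if (name) languages.push(name);
    });
    // Resolve the next link relative to the current page, if there is one
    const nextLink = $('.next').attr('href');
    url = nextLink ? new URL(nextLink, url).href : null;
  }
  return languages;
}

scrapeAll('https://www.tiobe.com/tiobe-index/')
  .then((languages) => console.log(languages))
  .catch((error) => console.error(error));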

Conclusion

Node.js provides a powerful platform for web scraping. With the `request` and `cheerio` modules, we can make HTTP requests, parse HTML and XML documents, and extract data from web pages, and we can automate scraping across multiple pages using recursion and promises. Bear in mind, however, that scrapers are fragile by nature, since they depend on the structure of the pages they target, and that it is important to respect the terms of service and robots.txt of the websites we scrape.
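
One simple courtesy when scraping several pages is to rate-limit requests. Here is a minimal sketch of a delay helper; the one-second interval is an arbitrary choice, not a requirement of any particular site:

javascript
const request = require('request');
const cheerio = require('cheerio');
const { promisify } = require('util');

const get = promisify(request);

// Pause for the given number of milliseconds
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function scrapePolitely(urls) {
  for (const url of urls) {
    const response = await get(url);
    const $ = cheerio.load(response.body);
    console.log($('title').text());
    await delay(1000); // one-second pause between requests
  }
}

scrapePolitely(['https://www.example.com']).catch(console.error);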
         
