Web scraping is the process of gathering information from the Internet. Even copy-pasting the lyrics of your favorite song is a form of web scraping! However, the words "web scraping" usually refer to a process that involves automation. Some websites don't like it when automatic scrapers gather their data, while others don't mind.
Web scraping is a way to grab data from websites without needing access to APIs or the website's database. You only need access to the site's data: as long as your browser can access the data, you will be able to scrape it.
Realistically, most of the time you could just go through a website manually and grab the data "by hand" using copy and paste, but in a lot of cases that would take you many hours of manual work, which could end up costing you a lot more than the data is worth, especially if you've hired someone to do the task for you. Why hire someone to work at 1 to 2 minutes per query when you can get a program to perform a query automatically every few seconds?
For example, let's say that you wish to compile a list of the Oscar winners for best picture, along with their director, starring actors, release date, and run time. Using Google, you can see there are several sites that will list these movies by name, and maybe some additional information, but generally you'll have to follow through with links to capture all the information you want.
Obviously, it would be impractical and time-consuming to go through every link from 1927 through to today and manually try to find the information through each page. With web scraping, we just need to find a website with pages that have all this information, and then point our program in the right direction with the right instructions.
In this tutorial, we will use Wikipedia as our source, as it contains all the information we need, and Scrapy with Python as our scraping tool.
A few caveats before we begin:
Data scraping involves increasing the server load for the site that you're scraping, which means a higher cost for the companies hosting the site and a lower quality experience for other users of that site. The quality of the server that is running the website, the amount of data you're trying to obtain, and the rate at which you're sending requests to the server will determine the effect you have on the server. Keeping this in mind, we need to make sure that we stick to a few rules.
Most sites also have a file called robots.txt in their main directory. This file sets out rules for which directories the site does not want scrapers to access. A website's Terms & Conditions page will usually let you know what its policy on data scraping is. For example, IMDB's conditions page has the following clause:
Robots and Screen Scraping: You may not use data mining, robots, screen scraping, or similar data gathering and extraction tools on this site, except with our express written consent as noted below.
Before we try to obtain a website's data, we should always check the website's terms and robots.txt to make sure we are obtaining the data legally. When building our scrapers, we also need to make sure that we do not overwhelm a server with requests that it can't handle.
Luckily, many websites recognize the need for users to obtain data, and they make the data available through APIs. If these are available, it's usually a much easier experience to obtain data through the API than through scraping.
Wikipedia allows data scraping, as long as the bots aren't going "way too fast", as specified in their robots.txt. They also provide downloadable datasets so people can process the data on their own machines. If we go too fast, the servers will automatically block our IP, so we'll implement timers in order to keep within their rules.
To start off, let's install Scrapy.
Install the latest version of Python from https://www.python.org/downloads/windows/
Note: Windows users will also need Microsoft Visual C++ 14.0, which you can grab from "Microsoft Visual C++ Build Tools" over here.
You'll also want to make sure you have the latest version of pip.
In cmd.exe, type in:
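```
pip install scrapy
```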
This will install Scrapy and all the dependencies automatically.
First you'll want to install all the dependencies:
In Terminal, enter:
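The package list below follows the Scrapy installation guide; exact package names may differ slightly between Ubuntu releases:

```
sudo apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
```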
Once that's all installed, just type in:
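```
pip install --upgrade pip
```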
To make sure pip is updated, and then:
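```
pip install scrapy
```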
And it's all done.
First you'll need to make sure you have a C compiler on your system. In Terminal, enter:
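```
xcode-select --install
```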
After that, install homebrew from https://brew.sh/.
Update your PATH variable so that homebrew packages are used before system packages:
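For example, assuming your shell reads ~/.bashrc (use ~/.zshrc instead if you're on zsh):

```
echo 'export PATH="/usr/local/bin:/usr/local/sbin:$PATH"' >> ~/.bashrc
source ~/.bashrc
```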
Install Python:
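```
brew install python
```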
And then make sure everything is updated:
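```
brew update; brew upgrade python
```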
After that's done, just install Scrapy using pip:
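```
pip install scrapy
```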
You will be writing a script called a "Spider" for Scrapy to run, but don't worry, Scrapy spiders aren't scary at all despite their name. The only similarity Scrapy spiders and real spiders have is that they like to crawl the web.
Inside the spider is a class that you define, which tells Scrapy what to do: for example, where to start crawling, the types of requests it makes, how to follow links on pages, and how to parse data. You can even add custom functions to process data before outputting it back into a file.
To start our first spider, we need to first create a Scrapy project. To do this, enter this into your command line:
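```
scrapy startproject oscars
```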
This will create a folder with your project.
We'll start with a basic spider. The following code is to be entered into a Python script. Open a new Python script in /oscars/spiders and name it oscars_spider.py.
We'll import Scrapy.
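```python
import scrapy
```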
We then start defining our Spider class. First, we set the name and then the domains that the spider is allowed to scrape. Finally, we tell the spider where to start scraping from.
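A minimal sketch (the class name is illustrative; the start URL points at Wikipedia's Best Picture page):

```python
class OscarsSpider(scrapy.Spider):
    # The name we'll use when running "scrapy crawl"
    name = "oscars"
    # Only follow links on this domain
    allowed_domains = ["en.wikipedia.org"]
    # Where the spider starts crawling
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]
```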
Next, we need a function which will capture the information that we want. For now, we'll just grab the page title. We use CSS to find the tag which carries the title text, and then we extract it. Finally, we return the information back to Scrapy to be logged or written to a file.
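Something like the following, added inside the OscarsSpider class defined above:

```python
    def parse(self, response):
        data = {}
        # Use a CSS selector to grab the text inside the page's <title> tag
        data['title'] = response.css('title::text').extract()
        # Hand the result back to Scrapy to be logged or written to a file
        yield data
```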
Now save the code in /oscars/spiders/oscars_spider.py
To run this spider, simply go to your command line and type:
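```
scrapy crawl oscars
```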
You should see an output like this:
Congratulations, you've built your first basic Scrapy scraper!
Full code:
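Putting the pieces above together (same assumed class name and start URL):

```python
import scrapy


class OscarsSpider(scrapy.Spider):
    name = "oscars"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

    def parse(self, response):
        data = {}
        data['title'] = response.css('title::text').extract()
        yield data
```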
Obviously, we want it to do a little bit more, so let's look into how to use Scrapy to parse data.
First, let's get familiar with the Scrapy shell. The Scrapy shell can help you test your code to make sure that Scrapy is grabbing the data you want.
To access the shell, enter this into your command line:
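```
scrapy shell "https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"
```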
This will basically open the page that you've directed it to and it will let you run single lines of code. For example, you can view the raw HTML of the page by typing in:
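```python
print(response.text)
```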
Or open the page in your default browser by typing in:
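```python
view(response)
```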
Our goal here is to find the code that contains the information that we want. For now, let's try to grab the movie title names only.
The easiest way to find the code we need is by opening the page in our browser and inspecting the code. In this example, I am using Chrome DevTools. Just right-click on any movie title and select "Inspect":
As you can see, the Oscar winners have a yellow background while the nominees have a plain background. There's also a link to the article about the movie title, and the links for movies end in film). Now that we know this, we can use a CSS selector to grab the data. In the Scrapy shell, type in:
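The exact selector depends on Wikipedia's markup at the time you scrape; a sketch that targets the yellow-background winner rows looks like this:

```python
response.css("tr[style*='background:#FAEB86'] a[href*='film)']::text").extract()
```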
As you can see, you now have a list of all the Oscar Best Picture Winners!
Going back to our main goal, we want a list of the Oscar winners for best picture, along with their director, starring actors, release date, and run time. To do this, we need Scrapy to grab data from each of those movie pages.
We'll have to rewrite a few things and add a new function, but don't worry, it's pretty straightforward.
We'll start by initiating the scraper the same way as before. But this time, two things will change. First, we'll import time along with scrapy, because we want to create a timer to restrict how fast the bot scrapes. Also, when we parse the pages the first time, we want to only get a list of the links to each title, so we can grab information off those pages instead.
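A sketch of the new setup (same assumed class name and start URL as before):

```python
import scrapy
# time lets us pause between requests so we don't hammer Wikipedia's servers
import time


class OscarsSpider(scrapy.Spider):
    name = "oscars"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]
```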
Here we make a loop to look for every link on the page that ends in film) and has the yellow background, and then we join those links together into a list of URLs, which we will send to the function parse_titles for further processing. We also slip in a timer so it only requests pages every 5 seconds. Remember, we can use the Scrapy shell to test our response.css fields to make sure we're getting the correct data!
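A sketch of that parse method; the selector for the winner rows is an assumption that should be verified in the Scrapy shell:

```python
    def parse(self, response):
        # Winner rows have a yellow background; their film links end in "film)"
        for href in response.css("tr[style*='background:#FAEB86'] a[href*='film)']::attr(href)").extract():
            url = response.urljoin(href)
            req = scrapy.Request(url, callback=self.parse_titles)
            # Crude throttle: wait 5 seconds before yielding the next request
            time.sleep(5)
            yield req
```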
The real work gets done in our parse_titles function, where we create a dictionary called data and then fill each key with the information we want. Again, all these selectors were found using Chrome DevTools as demonstrated before and then tested with the Scrapy shell.
The final line returns the data dictionary back to Scrapy to store.
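A sketch of parse_titles; every selector below is an assumption about Wikipedia's infobox markup and should be checked in the Scrapy shell first:

```python
    def parse_titles(self, response):
        data = {}
        # Assumed selectors: verify each one against the live page markup
        data['title'] = response.css("h1[id='firstHeading'] i::text").extract()
        data['director'] = response.css("tr:contains('Directed by') a[href*='/wiki/']::text").extract()
        data['starring'] = response.css("tr:contains('Starring') a[href*='/wiki/']::text").extract()
        data['releasedate'] = response.css("tr:contains('Release date') li::text").extract()
        data['runtime'] = response.css("tr:contains('Running time') td::text").extract()
        # Return the data dictionary back to Scrapy to store
        yield data
```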
Complete code:
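Putting it all together (same assumptions as the sketches above):

```python
import scrapy
import time


class OscarsSpider(scrapy.Spider):
    name = "oscars"
    allowed_domains = ["en.wikipedia.org"]
    start_urls = ["https://en.wikipedia.org/wiki/Academy_Award_for_Best_Picture"]

    def parse(self, response):
        for href in response.css("tr[style*='background:#FAEB86'] a[href*='film)']::attr(href)").extract():
            url = response.urljoin(href)
            req = scrapy.Request(url, callback=self.parse_titles)
            time.sleep(5)
            yield req

    def parse_titles(self, response):
        data = {}
        data['title'] = response.css("h1[id='firstHeading'] i::text").extract()
        data['director'] = response.css("tr:contains('Directed by') a[href*='/wiki/']::text").extract()
        data['starring'] = response.css("tr:contains('Starring') a[href*='/wiki/']::text").extract()
        data['releasedate'] = response.css("tr:contains('Release date') li::text").extract()
        data['runtime'] = response.css("tr:contains('Running time') td::text").extract()
        yield data
```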
Sometimes we will want to use proxies as websites will try to block our attempts at scraping.
To do this, we only need to change a few things. Using our example, in our def parse(), we need to change it to the following:
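One way to do this is to set the proxy in each request's meta, which Scrapy's built-in HttpProxyMiddleware honors. The address below is a placeholder:

```python
    def parse(self, response):
        for href in response.css("tr[style*='background:#FAEB86'] a[href*='film)']::attr(href)").extract():
            url = response.urljoin(href)
            req = scrapy.Request(url, callback=self.parse_titles)
            # Placeholder proxy address: replace with your own proxy server
            req.meta['proxy'] = "http://yourproxy.com:80"
            time.sleep(5)
            yield req
```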
This will route the requests through your proxy server.
Now it is time to run our spider. To make Scrapy start scraping and then output to a CSV file, enter the following into your command prompt:
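```
scrapy crawl oscars -o oscars.csv
```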
You will see a large output, and after a couple of minutes, it will complete and you will have a CSV file sitting in your project folder.
When you open the CSV file, you will see all the information we wanted, organized into columns with headings. It's really that simple.
With data scraping, we can obtain almost any custom dataset that we want, as long as the information is publicly available. What you want to do with this data is up to you. This skill is extremely useful for doing market research, keeping information on a website updated, and many other things.
It's fairly easy to set up your own web scraper to obtain custom datasets on your own; however, always remember that there might be other ways to obtain the data that you need. Businesses invest a lot into providing the data that you want, so it's only fair that we respect their terms and conditions.