Headless Scraping

Web Scraping with a Headless Browser: A Puppeteer Tutorial

In this article, we’ll see how easy it is to perform web scraping (web automation) with the somewhat non-traditional method of using a headless browser.
What Is a Headless Browser and Why Is It Needed?
The last few years have seen the web evolve from simplistic websites built with bare HTML and CSS. Now there are much more interactive web apps with beautiful UIs, which are often built with frameworks such as Angular or React. In other words, nowadays JavaScript rules the web, including almost everything you interact with on websites.
For our purposes, JavaScript is a client-side language. The server returns JavaScript files or scripts injected into an HTML response, and the browser processes it. Now, this is a problem if we are doing some kind of web scraping or web automation because more times than not, the content that we’d like to see or scrape is actually rendered by JavaScript code and is not accessible from the raw HTML response that the server delivers.
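To make the problem concrete, here is a minimal sketch using Node's built-in https module and a hypothetical JavaScript-heavy site (the URL is made up for illustration). The body it prints is only what the server sends back, so for a single-page app it is usually just an empty application shell rather than the content you see in a browser:

const https = require('https');

// Fetch the raw HTML the server returns, without executing any JavaScript.
// 'https://example-spa.com' is a hypothetical JavaScript-rendered site.
https.get('https://example-spa.com', (res) => {
  let body = '';
  res.on('data', (chunk) => (body += chunk));
  res.on('end', () => {
    // For a JS-heavy site this typically prints something like
    // <div id="root"></div> with none of the visible content in it.
    console.log(body);
  });
});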
As we mentioned above, browsers do know how to process the JavaScript and render beautiful web pages. Now, what if we could leverage this functionality for our scraping needs and had a way to control browsers programmatically? That’s exactly where headless browser automation steps in!
Headless? Excuse me? Yes, this just means there’s no graphical user interface (GUI). Instead of interacting with visual elements the way you normally would—for example with a mouse or touch device—you automate use cases with a command-line interface (CLI).
Headless Chrome and Puppeteer
There are many web scraping tools that can be used for headless browsing, like PhantomJS, or headless Firefox using Selenium. But today we’ll be exploring headless Chrome via Puppeteer, as it’s a relatively new player, released at the start of 2018. Editor’s note: It’s worth mentioning Intoli’s Remote Browser, another new player, but that will have to be a subject for another article.
What exactly is Puppeteer? It’s a library which provides a high-level API to control headless Chrome or Chromium or to interact with the DevTools protocol. It’s maintained by the Chrome DevTools team and an awesome open-source community.
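As a quick taste of that API, here is a minimal sketch (assuming Puppeteer is already installed) that launches the bundled Chromium once headlessly and once with a visible window, which is handy while debugging:

const puppeteer = require('puppeteer');

(async () => {
  // Default: no visible browser window.
  const headlessBrowser = await puppeteer.launch();

  // Same API, but with a GUI so you can watch what the script does.
  const headedBrowser = await puppeteer.launch({ headless: false });

  await headlessBrowser.close();
  await headedBrowser.close();
})();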
Enough talking—let’s jump into the code and explore the world of how to automate web scraping using Puppeteer’s headless browsing!
Preparing the Environment
First of all, you’ll need to have Node.js 8+ installed on your machine. You can download it from the official Node.js site, or, if you are a CLI lover like me and like to work on Ubuntu, follow these commands:
curl -sL https://deb.nodesource.com/setup_8.x | sudo -E bash -
sudo apt-get install -y nodejs
You’ll also need some packages that may or may not be available on your system. Just to be safe, try to install those:
sudo apt-get install -yq --no-install-recommends libasound2 libatk1.0-0 libc6 libcairo2 libcups2 libdbus-1-3 libexpat1 libfontconfig1 libgcc1 libgconf-2-4 libgdk-pixbuf2.0-0 libglib2.0-0 libgtk-3-0 libnspr4 libpango-1.0-0 libpangocairo-1.0-0 libstdc++6 libx11-6 libx11-xcb1 libxcb1 libxcursor1 libxdamage1 libxext6 libxfixes3 libxi6 libxrandr2 libxrender1 libxss1 libxtst6 libnss3
Setup Headless Chrome and Puppeteer
I’d recommend installing Puppeteer with npm, as it’ll also include the stable up-to-date Chromium version that is guaranteed to work with the library.
Run this command in your project root directory:
npm i puppeteer --save
Note: This might take a while as Puppeteer will need to download and install Chromium in the background.
Okay, now that we are all set and configured, let the fun begin!
Using Puppeteer API for Automated Web Scraping
Let’s start our Puppeteer tutorial with a basic example. We’ll write a script that will cause our headless browser to take a screenshot of a website of our choice.
Create a new file in your project directory (for example, screenshot.js) and open it in your favorite code editor.
First, let’s import the Puppeteer library in your script:
const puppeteer = require('puppeteer');
Next up, let’s take the URL from command-line arguments:
const url = process.argv[2];
if (!url) {
    throw "Please provide a URL as the first argument";
}
Now, we need to keep in mind that Puppeteer is a promise-based library: It performs asynchronous calls to the headless Chrome instance under the hood. Let’s keep the code clean by using async/await. For that, we need to define an async function first and put all the Puppeteer code in there:
async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: 'screenshot.png' });
    browser.close();
}

run();
Altogether, the final code looks like this:
const puppeteer = require('puppeteer');

const url = process.argv[2];
if (!url) {
    throw "Please provide a URL as the first argument";
}

async function run () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path: 'screenshot.png' });
    browser.close();
}

run();
You can run it by executing the following command in the root directory of your project:
node screenshot.js https://github.com
Wait a second, and boom! Our headless browser just created a file named screenshot.png, and you can see the GitHub homepage rendered in it. Great, we have a working Chrome web scraper!
Let’s stop for a minute and explore what happens in our run() function above.
First, we launch a new headless browser instance, then we open a new page (tab) and navigate to the URL provided in the command-line argument. Lastly, we use Puppeteer’s built-in method for taking a screenshot, and we only need to provide the path where it should be saved. We also need to make sure to close the headless browser after we are done with our automation.
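If you want to tweak the screenshot itself, Puppeteer exposes a few options on the page API. The sketch below is a minimal variation of the script above (it assumes the same url variable; the output filename is just an example) that sets an explicit viewport and captures the full scrollable page instead of only the visible area:

async function runFullPage () {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(url);
    // fullPage captures the whole scrollable page, not just the viewport.
    await page.screenshot({ path: 'screenshot_full.png', fullPage: true });
    browser.close();
}

runFullPage();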
Now that we’ve covered the basics, let’s move on to something a bit more complex.
A Second Puppeteer Scraping Example
For the next part of our Puppeteer tutorial, let’s say we want to scrape the latest articles from Hacker News.
Create a new file (for example, ycombinator-scraper.js) and paste in the following code snippet:
const puppeteer = require('puppeteer');

function run () {
    return new Promise(async (resolve, reject) => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            await page.goto("https://news.ycombinator.com/");
            let urls = await page.evaluate(() => {
                let results = [];
                let items = document.querySelectorAll('a.storylink');
                items.forEach((item) => {
                    results.push({
                        url: item.getAttribute('href'),
                        text: item.innerText,
                    });
                });
                return results;
            });
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run().then(console.log).catch(console.error);
Okay, there’s a bit more going on here compared with the previous example.
The first thing you might notice is that the run() function now returns a promise so the async prefix has moved to the promise function’s definition.
We’ve also wrapped all of our code in a try-catch block so that we can handle any errors that cause our promise to be rejected.
And finally, we’re using Puppeteer’s built-in method called evaluate(). This method lets us run custom JavaScript code as if we were executing it in the DevTools console. Anything returned from that function gets resolved by the promise. This method is very handy when it comes to scraping information or performing custom actions.
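In isolation, the pattern looks like this (a minimal sketch, assuming an open page as in the script above; anything serializable that the callback returns becomes the resolved value on the Node side):

// Runs inside the page, just as if typed into the DevTools console.
const title = await page.evaluate(() => document.title);
console.log(title); // e.g. "Hacker News"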
The code passed to the evaluate() method is pretty basic JavaScript that builds an array of objects, each having url and text fields that represent the story URLs we see on the Hacker News homepage.
The output of the script looks something like this (but with 30 entries, originally):

[ { url: '...',
    text: 'Bias detectives: the researchers striving to make algorithms fair' },
  { url: '...',
    text: 'Mino Games Is Hiring Programmers in Montreal' },
  { url: '...',
    text: 'A Beginner\'s Guide to Firewalling with pf' },
  // ...
  { url: '...',
    text: 'ChaCha20 and Poly1305 for IETF Protocols' } ]

Pretty neat, I’d say!
Okay, let’s move forward. We only had 30 items returned, while there are many more available—they are just on other pages. We need to click on the “More” button to load the next page of results.
Let’s modify our script a bit to add support for pagination:
function run (pagesToScrape) {
    // ...same Promise wrapper, try/catch, and browser setup as before...
    if (!pagesToScrape) {
        pagesToScrape = 1;
    }
    let currentPage = 1;
    let urls = [];
    while (currentPage <= pagesToScrape) {
        let newUrls = await page.evaluate(() => {
            let results = [];
            // ...same querySelectorAll logic as in the previous example...
            return results;
        });
        urls = urls.concat(newUrls);
        if (currentPage < pagesToScrape) {
            await Promise.all([
                await page.click('a.morelink'),
                await page.waitForSelector('a.storylink')
            ]);
        }
        currentPage++;
    }
    // ...close the browser and resolve(urls) as before...
}

run(5).then(console.log).catch(console.error);

Let’s review what we did here:

We added a single argument called pagesToScrape to our main run() function. We’ll use this to limit how many pages our script will scrape. There is one more new variable named currentPage, which represents the number of the page of results we are currently looking at. It’s set to 1 initially. We also wrapped our evaluate() function in a while loop, so that it keeps running as long as currentPage is less than or equal to pagesToScrape. Finally, we added the block for moving to a new page and waiting for the page to load before restarting the while loop.

You’ll notice that we used the click() method to have the headless browser click on the “More” button. We also used the waitForSelector() method to make sure our logic is paused until the page contents are loaded. Both of those are high-level Puppeteer API methods ready to use out of the box.

One of the problems you’ll probably encounter during scraping with Puppeteer is waiting for a page to load. Hacker News has a relatively simple structure, and it was fairly easy to wait for its page load to complete. For more complex use cases, Puppeteer offers a wide range of built-in functionality, which you can explore in the API documentation on GitHub.

This is all pretty cool, but our Puppeteer tutorial hasn’t covered optimization yet. Let’s see how we can make Puppeteer run faster.

Optimizing Our Puppeteer Script

The general idea is to not let the headless browser do any extra work. This might include loading images, applying CSS rules, firing XHR requests, etc. As with other tools, optimization of Puppeteer depends on the exact use case, so keep in mind that some of these ideas might not be suitable for your project. For instance, if we had avoided loading images in our first example, our screenshot might not have looked how we wanted.

Anyway, these optimizations can be accomplished either by caching the assets on the first request, or by canceling the HTTP requests outright as they are initiated by the website.

Let’s see how caching works first. You should be aware that when you launch a new headless browser instance, Puppeteer creates a temporary directory for its profile. It is removed when the browser is closed and is not available for use when you fire up a new instance—thus all the images, CSS, cookies, and other objects stored will not be accessible anymore.

We can force Puppeteer to use a custom path for storing data like cookies and cache, which will be reused every time we run it again—until they expire or are manually deleted.

const browser = await puppeteer.launch({
    userDataDir: './data',
});

This should give us a nice bump in performance, as lots of CSS and images will be cached in the data directory upon the first request, and Chrome won’t need to download them again and again. However, those assets will still be used when rendering the page.

In our scraping needs of Y Combinator news articles, we don’t really need to worry about any visuals, including the images. We only care about bare HTML output, so let’s try to block every request.

Luckily, Puppeteer is pretty cool to work with in this case, because it comes with support for custom hooks. We can provide an interceptor on every request and cancel the ones we don’t really need.

The interceptor can be defined in the following way:

await page.setRequestInterception(true);
page.on('request', (request) => {
    if (request.resourceType() === 'document') {
        request.continue();
    } else {
        request.abort();
    }
});
As you can see, we have full control over the requests that get initiated. We can write custom logic to allow or abort specific requests based on their resourceType. We also have access to lots of other data, such as the URL of each request, so we can block only specific URLs if we want.
In the above example, we only allow requests with the resource type of “document” to get through our filter, meaning that we will block all images, CSS, and everything else besides the original HTML response.
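If blocking everything except the document is too aggressive for your use case, the same hook can be made more selective. The sketch below is an illustrative variation (not part of the final script); the analytics domain is made up for the example:

await page.setRequestInterception(true);
page.on('request', (request) => {
    const blockedTypes = ['image', 'media', 'font'];
    // 'analytics.example.com' is a hypothetical domain used for illustration.
    if (blockedTypes.includes(request.resourceType()) ||
        request.url().includes('analytics.example.com')) {
        request.abort();
    } else {
        request.continue();
    }
});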
Here’s our final code in outline: it is the pagination script from above, with the request interceptor registered right after the page is created, and with the explicit waits for the story links and the “More” link kept in place:

await page.waitForSelector('a.storylink');
// ...
await page.waitForSelector('a.morelink');
Stay Safe with Rate Limits
Headless browsers are very powerful tools. They’re able to perform almost any kind of web automation task, and Puppeteer makes this even easier. Despite all the possibilities, we must comply with a website’s terms of service to make sure we don’t abuse the system.
Since this aspect is more architecture-related, I won’t cover this in depth in this Puppeteer tutorial. That said, the most basic way to slow down a Puppeteer script is to add a sleep command to it:
await page.waitFor(5000);

This statement will force your script to sleep for five seconds (5000 ms). You can put this anywhere before browser.close().
Just like limiting your use of third-party services, there are lots of other more robust ways to control your usage of Puppeteer. One example would be building a queue system with a limited number of workers. Every time you want to use Puppeteer, you’d push a new task into the queue, but there would only be a limited number of workers able to work on the tasks in it. This is a fairly common practice when dealing with third-party API rate limits and can be applied to Puppeteer web data scraping as well.
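Here is a minimal sketch of that idea using nothing but plain promises (urlsToScrape, scrapePage, and the concurrency value are placeholders; in a real project you would more likely reach for an established queue library):

async function runWithConcurrencyLimit(tasks, maxWorkers) {
  const results = [];
  let index = 0;

  // Each "worker" keeps pulling the next task until the queue is empty.
  async function worker() {
    while (index < tasks.length) {
      const task = tasks[index++];
      results.push(await task());
    }
  }

  await Promise.all(Array.from({ length: maxWorkers }, worker));
  return results;
}

// Usage: scrape many URLs, but never more than two pages at once.
// scrapePage is a hypothetical function that runs Puppeteer for one URL.
const tasks = urlsToScrape.map((u) => () => scrapePage(u));
runWithConcurrencyLimit(tasks, 2).then(console.log);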
Puppeteer’s Place in the Fast-moving Web
In this Puppeteer tutorial, I’ve demonstrated its basic functionality as a web-scraping tool. However, it has much wider use cases, including headless browser testing, PDF generation, and performance monitoring, among many others.
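For example, generating a PDF of a page takes only one extra call on top of what we already have. This is a minimal sketch; the output path and format are just examples:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://news.ycombinator.com/');
  // Note: page.pdf() only works in headless mode.
  await page.pdf({ path: 'hacker-news.pdf', format: 'A4' });
  await browser.close();
})();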
Web technologies are moving forward fast. Some websites are so dependent on JavaScript rendering that it’s become nearly impossible to execute simple HTTP requests to scrape them or perform some sort of automation. Luckily, headless browsers are becoming more and more accessible to handle all of our automation needs, thanks to projects like Puppeteer and the awesome teams behind them!
Web Scraping with a Headless Browser: A Puppeteer Tutorial

Web development has moved at a tremendous pace in the last decade, with a lot of frameworks coming in for both backend and frontend development. Websites have become smarter, and so have the underlying frameworks used to develop them. All these advancements in web development have led to the development of the browsers themselves too. Most browsers are now available in a “headless” version, where a user can interact with a website without any UI. You can scrape websites on these headless browsers too, using packages like Puppeteer and Node.js.
Web development heavily relies on testing mechanisms for quality checks before we push anything into the production environment. A complex website will require a complex structure of test suites before we deploy it anywhere. Headless browsers considerably reduce the testing time involved in web development, as there is no overhead of any UI. These browsers allow us to crunch through more web pages in less time.
In this blog, we will learn to scrape websites on these headless browsers using Node.js and asynchronous programming. Before we start scraping websites, let us learn more about headless browsers in a bit more detail. Furthermore, if you are concerned about the legalities of scraping, you can clear up common myths about web scraping.
What is a headless browser
A headless browser is simply a browser just without any user interface. A headless browser, like a normal browser, consists of all the capabilities of rendering a website. Since no GUI is available, one needs to use the command-line utility to interact with the browser. Headless browsers are designed for tasks like automation testing.
Headless browsers are more flexible, fast, and optimised for tasks like web-based automation testing. Since there is no overhead of any UI, headless browsers are suitable for automated stress testing and web scraping, as these tasks can be run more quickly. Although vendors like PhantomJS and HtmlUnit have long offered headless browser capabilities, mainstream browsers like Chrome and Firefox now ship a “headless” version as well, so you no longer need to install an extra browser for headless capabilities.
The need for a headless browser
With the advancement of web development frameworks, browsers have become smarter as well, loading all the JavaScript libraries a site depends on. As web development technologies have evolved, website testing has evolved with them and has become one of the essentials of the web development industry. Headless browsers allow us to perform the following kinds of tasks.
Test automation for web applications
End-to-end testing is a methodology used to test whether the flow of an application is performing as designed from start to finish. The purpose of carrying out end-to-end tests is to identify system dependencies and to ensure that the right information is passed between various system components and systems. Headless browsers were designed to cater to this use case as they enable faster web testing using CLI.
Scraping websites
Headless browsers enable faster scraping of websites as they do not have to deal with the overhead of opening any UI. With headless browsers, one can simply automate the scraping mechanism and extract data in a much more optimised manner.
Taking web screenshots
Headless browsers may not offer any GUI experience, but they do allow users to take snapshots of the websites they are rendering. It certainly helps in cases where one is performing automation testing and wants to visualise the effect of code changes on the website and store the results in the form of screenshots. Taking a large number of screenshots without any actual UI is a cakewalk using headless browsers.
Mapping user journey across the websites
Companies who successfully deliver outstanding customer experiences consistently do better than their competitors. Headless browsers allow us to run programs mapping customer journey test cases to optimise the user experience throughout their decision-making process on the website.
What is Puppeteer
Puppeteer is a library that provides a high-level API to control Chrome or Chromium over the DevTools protocol. It usually runs headless, but it can be set to operate Chrome or Chromium in full (non-headless) mode. In other words, Puppeteer is a Node library that we can use to control a headless Chrome instance.
Under the hood it uses Chrome, but we control it programmatically with JavaScript. Puppeteer is the Google Chrome team’s official headless browser library. It may not be the most efficient option, as it spins up a fresh Chrome instance when it is initialized, but it is the most accurate way to automate Chrome testing, because it uses the actual browser.
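One practical way to soften that startup cost, if you have many pages to visit, is to launch Chrome once and reuse the same browser instance for several pages rather than re-initializing it for every URL. Here is a minimal sketch of that idea (the URLs are placeholders for illustration):

const puppeteer = require('puppeteer');

(async () => {
  const urls = ['https://example.com/a', 'https://example.com/b']; // placeholder URLs
  const browser = await puppeteer.launch(); // pay the startup cost only once

  for (const url of urls) {
    const page = await browser.newPage();
    await page.goto(url);
    console.log(await page.title());
    await page.close();
  }

  await browser.close();
})();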
Web scraping using Puppeteer
In this article, we will be using Puppeteer to scrape the product listings from a website. Puppeteer will use the headless Chrome browser to open the web page and query back all the results. Before we actually start implementing Puppeteer for web scraping, we will look into its setup and installation.
After that, we will implement a simple use case where we go to an e-commerce website, search for a product, and scrape all the results. All the above tasks will be handled programmatically using the Puppeteer library. Furthermore, we will use Node.js to accomplish the above-defined task.
Installing puppeteer
Let us begin with the installation. Puppeteer is a Node.js library, so we will need Node.js installed on our machine. Node.js comes with npm (the Node package manager), which will help us install the Puppeteer package.
The following commands will install Node.js:
## Updating the system libraries ##
sudo apt-get update
## Installing node js in the system ##
sudo apt-get install nodejs
You can use the below command to install the puppeteer package
npm install --save puppeteer
Since we have all the dependencies installed now, we can start implementing our scraping use case using puppeteer. We will be controlling actions on the website using our node JS program powered by the puppeteer package.
Scraping products list using puppeteer
Step1: Visiting the page and searching for a product
In this section, we will initialise a Puppeteer object first. This object has access to all the utility functions available in the puppeteer package. Our program then visits the website and searches for the product search bar; upon finding the search element, it types the product name into the search bar and loads the results. We pass the product name to the program using command-line arguments.
const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();

// The product name comes from the command-line arguments
var args = process.argv[2];

await page.goto("https://www.croma.com");
await page.click('.mobile__nav__row--btn-search');
await page.type('input#js-site-search-input', args);
await page.keyboard.press('Enter');
await page.screenshot({ path: 'search_results.png' }); // example output filename
Step 2: Scraping the list of items
In this section, we are scraping the product listings that we got after searching for our given product. HTML selectors are used for capturing the web content, and all the scraped results are collated together to make the dataset. The querySelector function allows us to extract content from the web page using an HTML selector: the querySelectorAll function gets all the elements matching the particular selector, whereas the querySelector function returns only the first matching element.
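As a quick illustration of that difference before the actual scraping snippet (a generic sketch, assuming a page is already open; the 'a' selector is arbitrary):

const counts = await page.evaluate(() => {
    const first = document.querySelector('a');   // first matching element (or null)
    const all = document.querySelectorAll('a');  // NodeList of every match
    return { firstText: first ? first.innerText : null, total: all.length };
});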
let urls = await page.evaluate(() => {
    let results = [];
    let items = document.querySelectorAll('.product__list--item');
    items.forEach((item) => {
        let name = item.querySelector('.product__list--name').innerText;
        let price = item.querySelector('span.pdpPrice').innerText;
        let discount = item.querySelector('.listingDiscnt').innerText;
        results.push({
            prod_name: name,
            prod_price: price,
            prod_discount: discount
        });
    });
    return results;
});
Full Code
Here is the full working sample of the implementation. We have wrapped the entire logic in a run function and are logging the scraped results to the console.
function run () {
    return new Promise(async (resolve, reject) => {
        try {
            // ...browser setup, the product search from Step 1,
            // and the listing scraping from Step 2 go here...
            browser.close();
            return resolve(urls);
        } catch (e) {
            return reject(e);
        }
    });
}

run().then(console.log).catch(console.error);
Running the script
You can use the command below to run the above Puppeteer script with a headless browser. We will use Node.js to run our code: you just type the keyword node and the filename, followed by the product name whose data you want to search for on the given website and scrape.
In this example, we are searching for iPhones on the Croma website and then scraping the product listings (assuming the script was saved as scraper.js):
node scraper.js iphones
Output
The output of the above code is an array of product objects from the search results, each containing the prod_name, prod_price, and prod_discount fields scraped above.
Headless Browser Testing Awesomeness Pros and Cons

WHAT IS HEADLESS TESTING?
So what is this headless testing “browsing”? It actually is what it sounds like. Headless testing is when you run a UI-based browser test without showing the browser UI: you run a test or a script against a browser, but without the browser UI starting up.
Why would you want to use headless browsers? There are a lot of pros and cons to this approach. Using a headless browser might not be very helpful for browsing the web, but for automating tasks and tests it’s awesome.

Why Should You Care About Testing with a Headless Browser?
“Follow the money” is such a cliché, but it’s a key indicator of what I think is a real trend and something I should pay attention to. For example, Sauce Labs just came out with a new service called Sauce Headless, a cloud-based headless testing solution. I know the folks at Sauce are smart folks. They don’t develop anything unless they’ve gotten feedback from their users that it’s needed functionality. I’m sure they will not be alone in their focus on headless testing.
Headless browser testing fits the shift-left thinking that is important for software QA. This means the tests are automated and run in the browser without the need for any user interaction. As we shift more and more left in our software development lifecycle, we need to give feedback to our developers faster and faster. One way to do this is to run some quick checks leveraging a headless browser.

Automation in Software Production
If you know me at all, you also know that I’m very automation inclusive. To me, it’s not just about automation testing. It’s anything that you can automate to save someone time or effort in any part of the software delivery lifecycle, whether it’s development, quality, testing, DevOps, or installation; I would refer to any of these as automation. And headless browsers are something you can actually utilize for a lot of these efforts.
Headless browser testing is the process of testing an application or website without a human user watching. This technique has pros and cons that will depend on your particular project.

PRO: Headless Browsers are Faster than Real Browsers
One definite “pro” of headless browsers is that they are typically faster than real browsers; since you aren’t starting up a browser GUI, you can bypass all the time a real browser takes to load CSS and JavaScript and to open and render HTML. I have to admit, though, that it’s not exactly night and day. But you will typically see 2x to 15x faster performance when using a headless browser. So if performance is critical for you, headless browsers may be the way to go.

Headless Browser Scraping
Another advantage of headless browsers is that they can be used to scrape websites. To do this, you don’t necessarily want to have to manually start up a website; you can go to it headlessly and just scrape the HTML. You don’t need to render a full browser to do that.
For example, say your job needs some data, such as sports statistics, or needs to compare prices between different sites. Since it’s just data you’re looking for, it doesn’t make sense to start up a full instance of a browser; it’s just extra overhead, and sometimes the less overhead you have, the quicker you’ll get results back. It may not necessarily be a test, and that’s okay. Again, you want to leverage the right tools to do the right things. I also think that headless browser scraping is not leveraged by many testers, and that’s a shame.
So if you want to do some website scraping to help you with a test later on, you won’t necessarily need the overhead of starting a full-blown browser; you can utilize headless browsers to obtain that functionality for you.

Save Your Developers Time
I’m aware that a lot of developers use a headless browser for unit testing code changes for their websites and mobile apps. Being able to do all this from a command line without having to manually refresh or start a browser saves them lots of time and effort. For example, Rob Friesel, author of the PhantomJS CookBook, explained in a TestTalks interview how his developers use the headless browser PhantomJS: “Although PhantomJS in itself is not a test framework, it’s a really good canary in a coal mine to give you some confidence; if your tests are passing, you can have a high degree of confidence that your code is OK.”

Monitor Performance With Headless Browser Scripts
Another common use is a headless browser script that monitors network application performance. Some even use it to automate the rendering and screen capturing of their website images to perform layout checks in an automated fashion. I think these are some of the reasons why Google has also developed a new headless Chrome API called Puppeteer, designed to handle many of these developer use cases.

Headless Browser Testing Ideas
Besides the ones we’ve already covered, here are some other uses for headless browsers that I’ve run across:
  • Running tests on a machine without a display
  • Data setup
  • Testing SSL
  • Simulating multiple browsers on a single machine
  • Running tests on a headless system, like a Linux OS without a GUI
  • Retrieving and rendering PDFs
  • Layout testing: since headless browsers render and interpret CSS and HTML like a real browser, you can use them to test style elements

Examples of When You Might NOT Want to Use a Headless Browser
Of course, there are a number of reasons why you may wish to use a real browser as opposed to a headless browser. For instance:
  • You need to mimic real users
  • You need to visually see the test run
  • You need to do lots of debugging; headless debugging can be difficult

Popular Headless Browsers
  • Google Puppeteer: Puppeteer is a Node library. It gives you a high-level API to control headless Chrome or Chromium over the DevTools Protocol. It can also be tweaked to use full (non-headless) Chrome or Chromium.
  • Google Chrome, since version 59
  • Firefox, versions 55 & 56
  • PhantomJS: a headless WebKit scriptable with a JavaScript API. It has fast, native support for various web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG. This is no longer being maintained, so you might want to avoid it.
  • HtmlUnit: a “GUI-less browser for Java programs”. It models HTML documents and provides an API that allows you to invoke pages, fill out forms, click links, etc., just like you do in your “normal” browser.
  • Splinter: a Python-centric headless browser option. It’s open source and is used for testing web applications using Python. For example, you can use it to automate browser actions, such as visiting URLs and interacting with their items.
  • jBrowserDriver: a programmable, embeddable web browser driver compatible with the Selenium WebDriver spec; headless, WebKit-based, pure Java.

When to Use a Headless Browser for Your Testing?
So when should you use headless browsing for your testing? As you can see, it depends on your testing goals. People on the left will often say, “Never use a headless browser. A real user would never use it, so why should you?” Meanwhile, folks on the right will say, “You should always use headless browsers because they’re always faster, and faster is always better.” As we well know, however, it’s not always one versus the other, but rather a matter of selecting the right tool for the right task depending on the situation. Remember: use the right tool for the job, and always ask yourself how it will affect end users and what the goal(s) of your test are when deciding between the two approaches.
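As a small example of the performance-monitoring use case mentioned above, the sketch below (assuming Puppeteer, which the article lists among the popular options; the URL is a placeholder) times how long a page takes to reach the load event and dumps Chrome's page metrics:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  const start = Date.now();
  await page.goto('https://example.com', { waitUntil: 'load' });
  console.log(`Load took ${Date.now() - start} ms`);

  // Chrome-reported metrics such as script duration and JS heap size.
  console.log(await page.metrics());

  await browser.close();
})();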

Frequently Asked Questions about headless scraping

What is headless scraping?

Headless browsers enable faster scraping of websites, as they do not have to deal with the overhead of opening any UI. With headless browsers, one can simply automate the scraping mechanism and extract data in a much more optimised manner. They can also be used to take web screenshots.

Is headless scraping faster?

Yes. Headless browsers are faster than real browsers; you will typically see 2x to 15x faster performance when using a headless browser.

What is headless testing?

Headless testing is simply running your Selenium tests using a headless browser. It operates as your typical browser would, but without a user interface, making it excellent for automated testing.
