Use Chrome Headless and Dedicated Proxies to Scrape Any Website - GhostProxies
Summer 2018 Discount - Up to 56% off Dedicated Proxies

Use Chrome Headless and Dedicated Proxies to Scrape Any Website

Headless Chrome anonymous proxy

Everybody knows Google Chrome is the market leader when it comes to web browsing.

At least, that’s what the statistics say: an overwhelming 61.20 percent of Internet users are browsing with Chrome as of 2018.

If Chrome is the leading web browser, then it makes sense that Chrome Headless will be the leading browser for automated application testing, web scraping, and more. Google’s release of Puppeteer, the Node.js API that makes automating web actions simple for Chrome users, sets the stage for easy, robust web scraping.

Pyramid of Puppeteer illustration

The only problem is there is no built-in privacy or identity protection. You need to be able to use all the great features Chrome offers while being able to automate nearly anything you want and be able to do so anonymously through proxies.

This can prove to be a challenge.

It turns out there are ways to use Chrome Headless with proxies to safely scrape website data without exposing your identity. To find out, we’ll have to start with what makes Chrome what it is, and what makes a headless browser what it is.

What is Chrome Headless?

Google describes Chrome Headless as, “…a way to run the Chrome browser in a headless environment. Essentially, running Chrome without chrome!”

Perfect. Clear as mud.

A better way to think of Chrome Headless is running Chrome without the graphic interface. That’s actually the definition of a headless browser – instead of clicking on things, you execute code through a command-line interface or network communication.

There are plenty of reasons why you may want to do that. Automating Google Chrome is otherwise very difficult to do, which is why web application developers have historically used other browsers like the now-defunct PhantomJS for the purpose. But if so much of the Internet uses Chrome, it just makes sense to have Headless Chrome in your toolkit.

PhantomJS graphic

If you need to scrape data from a variety of websites and need a browser that can handle all the HTML, CSS, and JavaScript without generating error after error, Headless Chrome is the solution for you. Once you learn how to use proxies with it, you’ll be able to scrape just about any website on the Internet with ease and style.

How Chrome Headless Works

The basic environment offered by Chrome Headless should be familiar to anyone who has spent any time using another headless browser. You have a command-line interface perfectly suited to quickly scanning and scraping website data.

What you have in addition to this is Google’s fast, modern JavaScript engine and DevTools API. You have Google engineering ensuring you get support for every website on the Internet.

But controlling Chrome Headless requires using specific libraries. There are lots of options here, and Puppeteer is one of the most popular, since it, too, is a Google product. Although there are libraries that use a variety of languages, NodeJS APIs are generally recommended because that’s the same language as the data interpreted on the webpages you plan on scraping.

If you want to use a simple, no-nonsense API that is designed for web scraping and not much else, you can use NickJS. If you want all-purpose mapping through a complete API that is somewhat similar to Google’s DevTools, go with Puppeteer. You can even use a C++ API if you want.

NickJS script

But in order to fully leverage Chrome Headless for scraping, you need to be able to use it with proxies. As mentioned in the introduction to this article, that can be tricky.

How to Use Proxies with Chrome Headless

If you use a custom proxy that requires authentication, you may have found the Puppeteer library lacking full support. You need to use the page.authenticate() function to pass your log-in credentials into the Chrome login dialog box.

Example code would look like this:

const puppeteer = require('puppeteer');

(async() => {

	const proxyUrl = 'http://proxy.example.com:8000';

	const username = 'bob';

	const password = 'password123';
	const browser = await puppeteer.launch({

		args: ['--proxy-server=${proxyUrl}'],

		headless: false,

	});
	const page = await browser.newPage();
	await page.authenticate({ username, password });

	await page.goto('https://www.example.com');

	await browser.close();

})();

This is a simple way to use a proxy on Headless Chrome for web scraping. However, it can’t do everything you may need your authenticated proxy browser to do. For instance, there is a chance it will hang up on a page that requires authentication since it’s not clear from the code how the headless browser will handle multiple authentication requests.

In that case, you can use Apify’s proxy-chain package. Essentially, this package ensures that you can anonymize an authenticated proxy through Puppeteer by pushing it through a local proxy server first. It supports HTTP proxy forwarding and tunneling through HTTP CONNECT – so you can also use it when accessing HTTPS and FTP.

Apify web page Chrome Headless

Here is what using Apify’s proxy-chain package with Puppeteer looks like in Node.JS:

const puppeteer = require('puppeteer');
const
proxyChain = require('proxy-chain');

(async() => {

	const oldProxyUrl = 'http://bob:[email protected]:8000';

	const newProxyUrl = await proxyChain.anonymizeProxy(oldProxyUrl);
		// Prints an IP address
    console.log(newProxyUrl);
		const browser = await puppeteer.launch({

		args: ['--proxy-server=${newProxyUrl}'],

	});

	const page = await browser.newPage();

	await page.goto('https://www.example.com');

	await page.screenshot({
		path: 'example.png'
	});

	await browser.close();

})();

This will cover most of the web scraping requests you’re likely to make. There are additional, more advanced options for custom responses and connecting to external APIs, but in many cases, this code will do the job.

Tips for Web Scraping with Chrome Headless

Now that you have the ability to use proxies with Puppeteer, it’s time to look at ways to make your headless scraping experience run smoother and more successfully. Use these headless scraping tips to web scraping more efficient.

Extract Data with jQuery

Why go through the difficulty of setting customized data scraping paths when you can use jQuery for the purpose? Any website that gives you its Document Object Model (DOM), can be scraped this way, since the DOM is just a structured tree of elements containing all the data on a given page. Use jQuery to scrape that and you have immediate access to the data you’re looking for.

DOM headless scraping

When dealing with websites that already use jQuery, you only need to read through a couple lines of page data to find what you’re looking for. Otherwise, you can load it yourself.

Solve CAPTCHAs on the Fly

What happens when you’re busy extracting data from various websites and suddenly you get your proxies stuck behind CAPTCHA pages? In most cases, you have to call it quits for a while and wait until the website no longer asks for human authentication. This is the most reliable and recommended solution.

But if you learn how to make an HTTP request between your scraper code and a CAPTCHA solving service, you can solve CAPTCHAs automatically, both using optical character recognition and actual human beings for the purpose (although this method is not recommended).

While the majority of CAPTCHAs can be solved through artificial intelligence, most of these services employ a small number of humans to solve the tougher captchas on a full-time salary. In any case, the toughest ones generally don’t take longer than half a minute to solve.
There are lots of APIs for CAPTCHA solving services available through any Google search, and they generally cost a few dollars for every thousand CAPTCHAs solved. Again, this is not recommended and you’ll end up wasting money in the long run on CAPTCHA solving instead of simply using a larger quantity of proxies.

If you’re using PhantomBusters’ Buster Library, there is a simple way to make automatic calls to multiple CAPTCHA solving services directly from your Chrome Headless.

That kind of automation turns a time-consuming task into a simple one. These few lines of code can potentially turn your proxy script into a real human for thirty seconds or so.

Now You’re Ready to Scrape

As you gain experience in the world of scraping with Chrome Headless, you’ll begin to know where and when you need to set human-like delays to keep your scraper in good graces.

The views, information and opinions expressed in this guest article are for educational purposes only and do not necessarily reflect the views and opinions of GhostProxies.
We do not promote illegal activities or distribute tools for such activities. All trademarks and images used in this article are property of their respective owners. Please contact us if you believe any content within this article is in any violation of law or copyright.