With over 450 million users, LinkedIn is the digital rolodex of the modern age. If you don’t have an account you should probably get one. You can rub shoulders with major players in your industry, creep on old high school acquaintances, and strategize your next business move.
That’s all for the normal user of LinkedIn, which I am, and which you should be.
However, for the scraper, LinkedIn has an entirely different meaning. Instead of connecting manually with people in an industry, scrapers see LinkedIn as a gold-filled mine of personal information. A mine with 450 million (and growing) nuggets, all of which can be harvested in a variety of ways.
Then there are company profiles on LinkedIn, which are separate from individual user profiles and add another element entirely for a scraper.
Why Scrape LinkedIn?
The answer should be clear: to get all that information. User profiles have names, email addresses, industries, skill competencies, etc. Companies have number of employees, job postings, current employees, and a host of other important data.
LinkedIn is a literal representation of people and companies in the workforce, and they keep their info up to date. This data is incredibly valuable.
Of course you can’t scrape all the data I listed above. But you can scrape some of it.
Does LinkedIn Allow Scraping?
Let’s all yell “NO!” together so the point gets across. LinkedIn is very, very against scraping of any kind. It recognizes the worth of its customers in terms of analytics and privacy and will continue to fight tooth and nail to keep scrapers off the site. You can read its clear statement titled “Prohibition of Scraping Software” to get the gist.
While that language is solid, this subject is best illustrated by the lawsuit LinkedIn filed against 100 anonymous data scrapers who did what you’re trying to do, but did it poorly. The case has not been decided at the time of writing, and it brings up many issues around scraping that are beyond the purview of this article.
The point I’m trying to make is that if you do plan to scrape LinkedIn, be very cautious. They really don’t want you to do it, so if you go ahead, you have to do it right.
How to Scrape LinkedIn
Doing it right consists of many factors. You need to think about:
- The applications required to do the scraping
- The parameters you need to set in the applications
- The type of pages you will scrape on LinkedIn (public or private)
- The types of proxies to use, and how many proxies to use
LinkedIn Crawling Applications
Choosing an application is important, as many of them cost money. You’ll want to have a full understanding of the software itself, and then what you’re trying to get out of LinkedIn in order to make a solid return on your investment.
Due to LinkedIn’s policy against scraping I find it’s best to go with a dedicated software application for the service. However, if you already have Scrapebox or GSA (who am I kidding, you probably do) it’s worth testing these out before making another purchase.
Parameters within the Application
Once you’ve settled on an application, you’ll need to adjust two key settings inside it: threads and timeouts. This is generally true for all scraping procedures, but especially for LinkedIn, as it is more sensitive than other websites.
The threads setting in scraping software determines the number of open connections you use to scrape. The more threads, the faster the scrape, but also the faster you will get flagged and banned.
The very cautious use one thread per proxy. That’s what a true human does, so anything more than that will, at some point, become suspicious. However, plenty of scrapers use up to 10 threads per proxy.
Due to LinkedIn’s extreme policy against scraping, I recommend sticking to a single thread per proxy. Yes, it will slow results and cost more in the long run. In my view those are costs built into scraping LinkedIn and avoiding a lawsuit.
The second major factor in adjusting your application’s scrape settings is timeouts. A timeout is the amount of time your proxy waits for a server to respond before it sends a new request.
If your timeouts are set to 10 seconds, your proxy will send another request for information from the server after 10 seconds of it not responding.
Many scrapers set the timeout very low: 1 or 2 seconds. This produces a huge number of results because it creates new requests for information often, meaning you get results more often.
Don’t do this. Set your timeouts high, between 30-60 seconds. This gives the server a solid pause before that particular proxy sends another request.
Think of it like a human: does a human reload a website’s home page every second if there is lag? Maybe, but they don’t do it a thousand times in a thousand seconds on repeat.
By setting your timeouts high you avoid a lot of the detection by LinkedIn and don’t overwhelm them with repeated requests.
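As a rough sketch of the two settings above, the Python snippet below models a conservative configuration and flags anything more aggressive. The `ScrapeSettings` class and its field names are hypothetical and don’t correspond to any particular scraping application. The only values taken from this article are the one-thread-per-proxy recommendation and the 30 second lower bound on timeouts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScrapeSettings:
    """Hypothetical settings object; not tied to any real scraping tool."""

    threads_per_proxy: int = 1  # one open connection per proxy
    timeout_seconds: int = 30   # how long a proxy waits before a new request

    def warnings(self) -> list[str]:
        """List every setting that is more aggressive than advised above."""
        notes = []
        if self.threads_per_proxy > 1:
            notes.append("more than one thread per proxy")
        if self.timeout_seconds < 30:
            notes.append("timeout below 30 seconds")
        return notes


cautious = ScrapeSettings()
aggressive = ScrapeSettings(threads_per_proxy=10, timeout_seconds=2)

print(cautious.warnings())    # no warnings for the default values
print(aggressive.warnings())  # both settings are flagged
```

Most applications expose these two values in a settings panel rather than in code, so treat this as a checklist for what to look for, not as something to paste into a tool.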
Scraping Public Files on LinkedIn Through Search Engines
Moving away from the applications let’s get into LinkedIn itself. LinkedIn is primarily used as a private network. To see most of its information you have to create an account, log in, and start connecting with people.
However, it has plenty of public pages. These can be viewed without an account, and can therefore be scraped without logging in.
You are free to scrape public pages on LinkedIn like any normal scrape that starts with a search engine. You have to enter the correct search terms, such as a query that includes “LinkedIn.com”, which will generate results in Google that point to specific LinkedIn pages.
Your scraper can then access the information available on these public pages and return it to you. You’ll be scraping both Google and LinkedIn in this context, so you’ll want to be careful not to set off the alarm bells for either of them.
You can get very specific with this, using a search engine to find the LinkedIn pages of particular companies, like Microsoft, Google, or Apple. You would do this by searching for “Apple LinkedIn” and then scraping the results.
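To make the “Apple LinkedIn” example concrete, here is a minimal Python sketch that builds the search query and a search URL for a list of company names. The function names are hypothetical, and the Google search address with a `q` parameter is an assumption about where your scraper sends its queries. The snippet only builds strings; it does not send any requests.

```python
from urllib.parse import urlencode


def build_query(term: str) -> str:
    """Pair a search term with 'LinkedIn', as in the "Apple LinkedIn" example."""
    return f"{term} LinkedIn"


def build_search_url(term: str, base: str = "https://www.google.com/search") -> str:
    """Build a search URL for the query; `base` is an assumed endpoint."""
    return f"{base}?{urlencode({'q': build_query(term)})}"


for company in ["Microsoft", "Google", "Apple"]:
    print(build_search_url(company))
```

The resulting URLs are what you would feed to your scraping application as the starting points for the search-engine half of the scrape.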
This will only give you public pages though, and you may not want to be limited.
Scraping Private Accounts
The scraping of private accounts is the specific line in the sand that LinkedIn doesn’t want you to cross. It’s not happy that you scrape public pages, but they’re public, and there’s not much they can do about it from a legal standpoint.
Private pages are another matter. When a person signs up with LinkedIn they are told their information will be kept private, not sold to other companies, and used for internal use only. When a scraper comes along to grab that information LinkedIn has a major problem on its hands.
I don’t condone this activity if you’re using your scrape to sell an individual’s information. This basically means you’d be bypassing LinkedIn’s privacy clause, harvesting personal information from people, then selling it to companies for a profit. Not the coolest thing to do.
There are other reasons to scrape this information though. Maybe you’re on a job hunt and want to find programmers in a specific city, or available jobs in a new state. You can scrape for research, too. Either of these seems fine to me, but the for-profit model doesn’t.
The way to scrape private pages on LinkedIn is to create an account. Once you do this and actually log into LinkedIn you’ll be able to search as much as you want. Remember, this account isn’t for connecting with people, but as an access point to LinkedIn for a scrape.
To do this I recommend Octoparse. Their software allows you to log into LinkedIn with an account and apply specific searches and scrapes with a drag and drop interface, all while showing you the LinkedIn page you’re on. It’s very nice visually, if a little clunky to use.
You could figure out a way to do it with other applications but it won’t be as easy.
Search and Harvest
After creating the account, just figure out what you want to search for. If you try to find Microsoft employees, a ton of people will come up. You can have the scraper harvest any information that is available to you as a non-connection: basically name, position, and sometimes the email address.
Much of the information is still private unless you connect with people, and if you do that you’re basically just running a normal LinkedIn account.
Use a Proxy Per Account
By doing the above you are using a direct automation tool within LinkedIn. The potential for getting caught here is huge, so make sure to follow the threads and timeouts rules above.
Also, make sure you’re using one proxy IP address to create the account, and then scrape on that account. This is all about appearing like a human. Most humans don’t access LinkedIn from a different IP address every few hours. They access it from one IP address: their home address.
If you create the account with a proxy IP, use the same proxy IP to scrape on the account, and set all your parameters correctly you will greatly reduce the chances of getting blocked or banned.
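The one-proxy-per-account rule can be sketched as a fixed pairing that is set up once and never shuffled. In the Python snippet below, the function name, the account labels, and the proxy addresses are all placeholders (the addresses come from a range reserved for documentation). It is a bookkeeping sketch under those assumptions, not part of any scraping tool.

```python
def pair_accounts_with_proxies(accounts: list[str], proxies: list[str]) -> dict[str, str]:
    """Give each account its own dedicated proxy; no proxy is shared."""
    unique_proxies = list(dict.fromkeys(proxies))  # drop duplicates, keep order
    if len(unique_proxies) < len(accounts):
        raise ValueError("need at least one distinct proxy per account")
    return dict(zip(accounts, unique_proxies))


# Placeholder labels and documentation-range addresses, for illustration only.
accounts = ["account_a", "account_b"]
proxies = ["203.0.113.10:8080", "203.0.113.11:8080", "203.0.113.12:8080"]

pairing = pair_accounts_with_proxies(accounts, proxies)
for account, proxy in pairing.items():
    print(account, "->", proxy)
```

Store the pairing somewhere persistent so the same account always goes out through the same proxy, both when the account is created and on every later session.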
Types and Number of Proxies
The final element in all this is the types of proxies you use, and how many of them you use. This coincides pretty heavily with your budget because more proxies (and better ones) equals more cash. Keep that in mind for this whole process.
Type of Proxy
You want elite private proxies for scraping LinkedIn. With a lawsuit under way, LinkedIn is not kidding around about punishing scrapers. This means you’ll want elite private proxies and only elite private proxies.
These proxies offer the most anonymous and secure header settings out of all the proxy types, and give you unfettered access and speeds. Shared proxies or free proxies (even lesser private proxies) are simply not secure or fast enough to do the job.
You’ll also want to test your proxies to make sure they work with LinkedIn. Due to LinkedIn’s anti-scrape stance it has a large list of blacklisted IPs. If your proxies are in this list they won’t work at all. Contact your provider to get these details, or test it out for yourself and then chat with them.
Our dedicated proxies work great on LinkedIn.
Number of Proxies
Depending on the size of your scrape, you’re going to need a number of proxies. The general rule of thumb is the more proxies the better, especially when scraping a difficult website.
If you stick to a single proxy per account and want to harvest a lot of data quickly, consider 50 accounts and 50 proxies as a place to get started.
If you want to do more proxies per account (which I don’t recommend), grab somewhere in the 100-200 range and rotate them often so they don’t get noticed, then blocked, banned, and blacklisted.
The fewer proxies you have the more often they’ll be detected. This is always an experiment, so make sure you test everything.
Scraping LinkedIn requires proxies and moxie. You have to really want to do it because it’s not going to be easy, and could result in blacklisted IPs or a lawsuit. As such, take precautionary measures. Understand why you’re scraping LinkedIn, and then reach those specific goals carefully.