Today I would like to do some web scraping of LinkedIn job postings. I have two ways to go:
- Source code extraction
- Using the LinkedIn API

I chose the first option, mainly because the API is poorly documented and I wanted to experiment with BeautifulSoup. BeautifulSoup, in a few words, is a library that parses HTML pages and makes it easy to extract data from them.

Official page: BeautifulSoup web page
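To give a quick feel for how BeautifulSoup works before getting to the real pages, here is a tiny self-contained example; the HTML snippet is made up purely for illustration.

from bs4 import BeautifulSoup

# a made-up HTML snippet, just to show the parsing API
html = '<ul><li class="job"><a href="/jobs/1">Data Scientist</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

job = soup.find('li', {'class': 'job'})  # locate an element by tag name and class
print job.a.get_text()                   # -> Data Scientist
print job.a.get('href')                  # -> /jobs/1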

## Main packages needed are urllib2 to make url queries and BeautifulSoup to structure the results
## the imports needed for this experiment
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
# get source code of the page
def get_url(url):
    return urllib2.urlopen(url).read()

# parse the page source into a navigable BeautifulSoup tree
def beautify(url):
    source = get_url(url)
    return BeautifulSoup(source,"html.parser")

Now that the functions are defined and the libraries are imported, I'll get the job postings from LinkedIn.
Inspecting the source code of the page shows where to access the elements we are interested in.
I basically achieved that by 'inspecting elements' in the browser.
I will look for "Data Scientist" postings. Note that I'll keep the quotes in my search, because otherwise I'd get irrelevant postings that merely contain the words "Data" and "Scientist".
Below, we are only interested in the div element with class 'results-context', which contains a summary of the search, in particular the number of items found.
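As a side note on how those quotes end up in the address bar: urllib's urlencode can build the query string and take care of the %22 encoding. This is just a small sketch; the parameter names simply mirror the URL used below, and for the rest of the post I'll stick with the hand-written URL.

import urllib

# the quotes around the keyword are percent-encoded as %22 automatically
params = {'keywords': '"Data Scientist"', 'location': 'France', 'locationId': 'fr:0'}
search_url = 'https://www.linkedin.com/jobs/search?' + urllib.urlencode(params)
print search_url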

jobs = beautify('https://www.linkedin.com/jobs/search?keywords=%22Data+Scientist%22&'
                'location=France&trk=jobs_jserp_search_button_execute&orig=JSERP&locationId=fr%3A0')

results_context = jobs.find('div', {'class' : 'results-context'}).find('strong')
n_jobs = int(results_context.text.replace(',','')) 
print "###### Number of job postings #######"
print n_jobs
print "#####################################"
    ###### Number of job postings #######
    93
    #####################################

Now let’s check the number of postings we got on one page

results = jobs.find_all('li', {'class': 'job-listing'})
n_postings = len(results)
print "#### Number of job postings per page ####"
print n_postings
print "#########################################"
    #### Number of job postings per page ####
    25
    #########################################
print "#### Number of pages ####"
n_pages = (n_jobs + n_postings - 1) // n_postings  # ceiling division, so a last partial page isn't missed
print n_pages
print "#########################"
    #### Number of pages ####
    4
    #########################

To be able to extract all postings, I need to iterate over the pages, therefore I will proceed with examining the urls of the different pages to work out the logic.

  • first page: https://www.linkedin.com/jobs/search?keywords=Data+Scientist&locationId=fr:0&start=0&count=25&trk=jobs_jserp_pagination_1

  • second page: https://www.linkedin.com/jobs/search?keywords=Data+Scientist&locationId=fr:0&start=25&count=25&trk=jobs_jserp_pagination_2

  • third page: https://www.linkedin.com/jobs/search?keywords=Data+Scientist&locationId=fr:0&start=50&count=25&trk=jobs_jserp_pagination_3

There are two elements that change:
- start, which is the page index (starting at 0) multiplied by 25
- the pagination number in trk=jobs_jserp_pagination_3

I also noticed that the pagination number doesn't actually have to change to go to the next page, which means I can change only the start value to get the next postings (maybe the LinkedIn developers should do something about that…).
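To make that logic concrete, here is a small sketch that builds the page URLs from the start offset alone, keeping trk fixed as noted above:

# generate one search URL per page, varying only the start offset
base = ("https://www.linkedin.com/jobs/search?keywords=%22Data+Scientist%22"
        "&locationId=fr:0&start={start}&count=25&trk=jobs_jserp_pagination_1")
page_urls = [base.format(start=25 * i) for i in range(n_pages)]
for u in page_urls:
    print u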

titles = []
companies = []
locations = []
links = []
#loop over all pages to get the posting details
for i in range(n_pages):    
    # define the base url for generic searching 
    url = ("http://www.linkedin.com/jobs/search?keywords=%22Data+Scientist%22&locationId="
    "fr:0&start=nPostings&count=25&trk=jobs_jserp_pagination_1")
    url = url.replace('nPostings',str(25*i))
    soup = beautify(url)
    # Build lists for each type of information
    results = soup.find_all('li', {'class': 'job-listing'})
    # print "there are ", len(results) , " results"
    for res in results:
        # only keep the value if the element exists, otherwise store 'None'
        titles.append(res.h2.a.span.get_text() if res.h2.a.span else 'None')
        companies.append( res.find('span',{'class' : 'company-name-text'}).get_text() if
                         res.find('span',{'class' : 'company-name-text'}) else 'None')
        locations.append( res.find('span', {'class' : 'job-location'}).get_text() if
                        res.find('span', {'class' : 'job-location'}) else 'None' )
        links.append(res.find('a',{'class' : 'job-title-link'}).get('href') if
                     res.find('a',{'class' : 'job-title-link'}) else 'None')

As I mentioned above, finding where each piece of job information lives is made easy by viewing the page source in any browser.

Next, it’s time to create the data frame

jobs_linkedin = pd.DataFrame({'title' : titles, 'company': companies, 'location': locations, 'link' : links})

Now the table is filled with the above columns.
Just to verify, I can check the size of the table to make sure I got all the postings.

jobs_linkedin.count()
    company     93
    link        93
    location    93
    title       93
    dtype: int64

In the end, I got an actual dataset just by scraping web pages. Gathering data has never been this easy. I can even go further by parsing each posting's page and extracting information like:
- Level
- Description
- Technologies
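As a rough sketch of that follow-up step: the class name 'description-section' below is purely hypothetical, and I'm assuming the scraped links are absolute URLs; the actual posting pages would have to be inspected first, just like the search page was.

descriptions = []
for link in links:
    posting = beautify(link)  # assumes the scraped href is an absolute URL
    # 'description-section' is a hypothetical class name used for illustration only;
    # inspect a real posting page to find the right selector
    desc = posting.find('div', {'class': 'description-section'})
    descriptions.append(desc.get_text() if desc else 'None')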

There is almost no limit to how far you can exploit the information in HTML pages thanks to BeautifulSoup. You just have to read the documentation, which is very good by the way, and practice on real pages.

Ciao!