So, you’re looking for a job and would like to work smarter instead of harder to find something new and interesting? Why not create a web scraper that will gather and analyze job posting data for you?
Analyzing the URL and Page Structure
First, we need to look at a sample search results page from Indeed.
There are a few things to note regarding the URL structure. It will be helpful later, when we develop a scraper that walks through a succession of pages and collects information from each one, so keep it in mind for future reference.
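To make the pattern concrete, here is a minimal sketch of how the pieces of the search URL fit together. The parameter names come from the example URL used later in this article (q for the search query, l for the location, and start for the results offset); the particular values shown are just illustrations.

# a minimal sketch of the search-URL pattern, based on the example URL used below;
# q is the search query, l is the location, and start is the results offset (10 per page)
base = "https://www.indeed.com/jobs"
query = "data+scientist+%2420%2C000"    # "data scientist $20,000", URL-encoded
city = "New+York"
start = 10                              # second page of organic results

url = base + "?q=" + query + "&l=" + city + "&start=" + str(start)
print(url)
# https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10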
Each page displays 15 job postings, five of which are “sponsored” jobs that Indeed highlights outside the normal order of results. The other 10 are the organic results specific to the page being viewed.
All the information on this page is coded in HTML tags. HTML (HyperText Markup Language) is the markup that tells your browser how to display a page’s content when you visit it, and it defines the general structure and organization of the document. HTML elements also carry attributes, which help keep track of what information is located where in a page’s structure.
Chrome users can inspect a page’s HTML structure by right-clicking on it and selecting “Inspect” from the menu that appears. A panel will open on the right-hand side of the page, showing a long stack of nested HTML tags that contain the data currently visible in your browser.
In the upper-left corner of this panel is a small box with an arrow icon; it turns blue when you click it. With it active, you can move your mouse over elements on the page to see both the tags associated with each item and where that item sits in the page’s HTML.
In the screenshot shared above, we pointed at one of the job postings, which shows that the entire job sits inside a <div> tag with attributes that include class="row result", id="pj_7a21e2c11afb0428", and so on. Fortunately, we won’t need to know every attribute of every element to retrieve our data, but knowing how to read a page’s HTML structure will come in handy.
Now we’ll use Python to retrieve the HTML from the page and start working on our scraper.
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
Let’s start by extracting a single page and figuring out the code to get each piece of data we need:
URL = "https://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

# conducting a request of the stated URL above:
page = requests.get(URL)

# specifying a desired format of "page" using the html parser - this allows python to read
# the various components of the page, rather than treating it as one long string
soup = BeautifulSoup(page.text, "html.parser")

# printing soup in a more structured tree format that makes for easier reading
print(soup.prettify())
It is much easier to read a page’s HTML with prettify(), and you’ll get something like this:
Turning to the task at hand, we will scrape five key pieces of information from every job posting: Job Title, Company Name, Location, Salary, and Job Summary.
As noted above, each job posting sits inside a <div> tag with the attribute class="row result".
From there, the job titles are listed under <a> tags with a title attribute. We can access the value of a tag’s attribute with tag["attribute"], so we can use that to retrieve the job title for every posting.
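To make that concrete, here is a tiny illustration of attribute access on a parsed tag, using the BeautifulSoup import from above. The HTML string is invented for this example, but the a["title"] access is exactly what the extraction function below relies on.

# a tiny illustration of tag["attribute"] access; the HTML string is made up for this example
example_html = '<a data-tn-element="jobTitle" title="Data Scientist" href="/rc/clk">Data Scientist</a>'
example_soup = BeautifulSoup(example_html, "html.parser")
link = example_soup.find("a")
print(link["title"])              # -> Data Scientist
print(link["data-tn-element"])    # -> jobTitle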
The script for extracting job title data takes three steps:
def extract_job_title_from_result(soup):
    jobs = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
            jobs.append(a["title"])
    return jobs

extract_job_title_from_result(soup)
This will display an output like:
Company names are a bit trickier: most appear in <span> tags with class="company", but some instead appear in <span> tags with class="result-link-source".
def extract_company_from_result(soup):
    companies = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        company = div.find_all(name="span", attrs={"class": "company"})
        if len(company) > 0:
            for b in company:
                companies.append(b.text.strip())
        else:
            sec_try = div.find_all(name="span", attrs={"class": "result-link-source"})
            for span in sec_try:
                companies.append(span.text.strip())
    return companies

extract_company_from_result(soup)
Company names come back surrounded by a lot of white space, so adding .strip() at the end removes it as the data is fetched.
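For instance, a hypothetical raw value shows what .strip() removes:

# .strip() removes leading and trailing whitespace; the string here is a made-up example
raw_company = "\n\n    Example Analytics Co.\n  "
print(repr(raw_company.strip()))   # -> 'Example Analytics Co.'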
Locations also sit under <span> tags. Span tags are often nested in such a way that the location text is sometimes inside a class="location" attribute and sometimes nested under itemprop="addressLocality".
However, a simple for loop can check all the span tags for that text, wherever it happens to be, and fetch it.
def extract_location_from_result(soup):
    locations = []
    spans = soup.findAll("span", attrs={"class": "location"})
    for span in spans:
        locations.append(span.text)
    return locations

extract_location_from_result(soup)
Salary is the most complicated piece of information to scrape from job postings. Many postings don’t include any salary information at all, and among those that do, it can appear in one of two different places. We therefore need a script that checks multiple locations for the information, plus a placeholder "Nothing_found" value for postings that contain no salary information.
Some salaries appear under <nobr> tags, while others appear under <div> tags with class="sjcl" and have to be pulled out of a nested <div> with no attributes.
def extract_salary_from_result(soup):
    salaries = []
    for div in soup.find_all(name="div", attrs={"class": "row"}):
        try:
            salaries.append(div.find("nobr").text)
        except:
            try:
                div_two = div.find(name="div", attrs={"class": "sjcl"})
                div_three = div_two.find("div")
                salaries.append(div_three.text.strip())
            except:
                salaries.append("Nothing_found")
    return salaries

extract_salary_from_result(soup)
Last but not least, the job summaries. Unfortunately, the full job descriptions are not contained in the HTML of an Indeed search results page; nevertheless, we can gather some information about each job from the summary snippet that is provided. Selenium is a set of tools a web scraper could use to follow the individual links on the results page and extract data from the full job advertisements.
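As a rough sketch of that idea (not part of this article’s scraper), Selenium could open an individual posting and pull its full description. The job URL format and the jobDescriptionText element id below are assumptions you would need to confirm by inspecting a real posting.

# a hypothetical Selenium sketch for grabbing a full job description;
# the URL and the element id are assumptions -- inspect a real posting to confirm them
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()   # assumes Chrome and a matching driver are set up
driver.get("https://www.indeed.com/viewjob?jk=7a21e2c11afb0428")   # hypothetical posting URL
try:
    description = driver.find_element(By.ID, "jobDescriptionText").text   # assumed element id
    print(description[:500])
finally:
    driver.quit()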
The summary snippets are found under <span> tags with class="summary". A simple for loop can go through all the span tags, pick out that text, and extract the information needed.
def extract_summary_from_result(soup):
    summaries = []
    spans = soup.findAll("span", attrs={"class": "summary"})
    for span in spans:
        summaries.append(span.text.strip())
    return summaries

extract_summary_from_result(soup)
We now have each of the pieces we need for the scraper. All that’s left is to combine them into a final script that extracts the necessary information for every job post, keeps each post’s data separate from the others, and merges everything into a single data frame.
We can set the initial conditions by specifying a few pieces of information:
max_results_per_city = 100

city_set = ["New+York", "Chicago", "San+Francisco", "Austin", "Seattle",
            "Los+Angeles", "Philadelphia", "Atlanta", "Dallas", "Pittsburgh",
            "Portland", "Phoenix", "Denver", "Houston", "Miami",
            "Washington+DC", "Boulder"]

columns = ["city", "job_title", "company_name", "location", "summary", "salary"]

sample_df = pd.DataFrame(columns=columns)
It goes without saying that the more results you seek and the more cities you look at, the longer the scraping process will take. This isn’t a big deal if you start your scraper before going out or going to bed, but it’s something to keep in mind.
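As a rough lower bound under the settings above (17 cities, up to 100 results each, 10 organic results per page, and a one-second pause between requests), the run takes at least a few minutes before any network time is counted:

# a back-of-envelope lower bound on run time; actual request latency will add to this
cities = 17                                 # len(city_set) above
pages_per_city = 100 // 10                  # max_results_per_city, 10 organic results per page
min_seconds = cities * pages_per_city * 1   # at least one second of sleep per page grab
print(min_seconds)                          # 170 -> just under three minutes before network time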
The actual scraper is put together using the patterns we noticed in the URL structure above. Because we know how the URLs are patterned for each page, we can design a loop that visits each page in order and retrieves the data.
# scraping code:
for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get("http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l="
                            + str(city) + "&start=" + str(start))
        # ensuring at least 1 second between page grabs
        time.sleep(1)
        soup = BeautifulSoup(page.text, "lxml", from_encoding="utf-8")
        for div in soup.find_all(name="div", attrs={"class": "row"}):
            # specifying row num for index of job posting in dataframe
            num = len(sample_df) + 1
            # creating an empty list to hold the data for each posting
            job_post = []
            # append city name
            job_post.append(city)
            # grabbing job title
            for a in div.find_all(name="a", attrs={"data-tn-element": "jobTitle"}):
                job_post.append(a["title"])
            # grabbing company name
            company = div.find_all(name="span", attrs={"class": "company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class": "result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            # grabbing location name
            c = div.findAll("span", attrs={"class": "location"})
            for span in c:
                job_post.append(span.text)
            # grabbing summary text
            d = div.findAll("span", attrs={"class": "summary"})
            for span in d:
                job_post.append(span.text.strip())
            # grabbing salary
            try:
                job_post.append(div.find("nobr").text)
            except:
                try:
                    div_two = div.find(name="div", attrs={"class": "sjcl"})
                    div_three = div_two.find("div")
                    job_post.append(div_three.text.strip())
                except:
                    job_post.append("Nothing_found")
            # appending list of job post info to dataframe at index num
            sample_df.loc[num] = job_post

# saving sample_df as a local csv file - define your own local path to save contents
sample_df.to_csv("[filepath].csv", encoding="utf-8")
After a little while, you’ll have your own data frame of scraped job postings.
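From there, a few pandas calls give a quick first look at the scraped data. This sketch assumes you read the file back from whatever path you passed to to_csv above.

# a quick sanity check on the saved results; use the same path you passed to to_csv above
df = pd.read_csv("[filepath].csv", encoding="utf-8")
print(df.shape)                                   # number of rows collected and columns
print(df["city"].value_counts())                  # postings gathered per city
print((df["salary"] != "Nothing_found").mean())   # share of postings with salary info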
Here’s what the output looks like: