[Building a Job Scraper] How to make a Web Scraper?
* Web Scraper ?
- Something you use to extract data (or information) from a web site
[e.g.] paste a URL into Facebook and it shows a preview : photo and title
- Apply it to compare reviews, prices .. etc.
- Bring or collect information such as posts, news .. etc.
- An applied "Web Scraper" > how to show lots of information efficiently
* Guide Line !
1. IMPORT THE MODULES (or functions) YOU NEED : requests and BeautifulSoup
- requests : gets the HTML (the whole page)
- BeautifulSoup : extracts the HTML data (information) you want
: you make a SOUP with BeautifulSoup > BeautifulSoup(html.text, "html.parser")
2. FIND OUT HOW MANY PAGES EXIST : needed for iteration > reading each page
3. SAVE THE EXTRACTED DATA as a CSV file
requests.get('url')                 | go to the URL > bring back the whole HTML (options: auth, params, data ..) |
response.status_code                | get the status code (response = requests.get('url')) |
response.headers['attr']            | get header information |
response.text                       | get the HTML as text data |
response.json()                     | get the response body as JSON data |
soup.title / soup.p                 | tag-name shortcut : first title / p tag |
soup.find('a')                      | find the first a tag |
soup.find('div', {"class": "name"}) | find the div (class="name") tag |
soup.find('a')["href"]              | get the href attribute of the first a tag |
soup.find_all('a')                  | find every a tag > returns a LIST |
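A tiny, hedged demo of the calls in the table above (httpbin.org is only a convenient public test page, not part of the job scraper):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://httpbin.org/html")   # go to the URL > bring back the HTML
print(response.status_code)                           # 200 when the request succeeded
print(response.headers["Content-Type"])               # header information

soup = BeautifulSoup(response.text, "html.parser")    # make the soup from the text data
print(soup.h1)                                        # tag-name shortcut : first <h1> tag
print(soup.find_all("a"))                             # find_all always returns a LIST (possibly empty)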
References:
- Python's Requests Library (Guide) – Real Python : https://realpython.com/python-requests/#the-response
- Beautiful Soup Documentation : https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
* CSV (Comma Separated Values) File
- Like an Excel file, but understood not only by MS Office but also by Mac, Windows, browsers, Google Drive .. etc.
- Each ROW is separated by a new line
- Each COLUMN is separated by a comma
- The csv module is already built into Python
- open(), csv.writer(), writerow() ..
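A minimal sketch of the built-in csv module (the file name and rows here are just placeholders):

import csv

with open("example.csv", mode="w", newline="") as file:
    writer = csv.writer(file)                       # the writer needs a file object to write into
    writer.writerow(["title", "company"])           # header row : columns separated by commas
    writer.writerow(["Python Developer", "ACME"])   # one data row per line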
* NOW, How to make a Web Scraper ?
1) get the HTML : USING requests.get(URL)
2) make a SOUP to extract specific data : USING BeautifulSoup(result.text, "html.parser")
3) extract information from the soup : USING find(), find_all() .. etc
> to simplify the results : .string, strip(), get_text() .. etc
> to check the status : print(something) OR print(something.status_code)
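A minimal sketch of these three steps before the full scrapers below (example.com is only a placeholder page, not real job-site markup):

import requests
from bs4 import BeautifulSoup

result = requests.get("https://example.com")        # 1) get the HTML
print(result.status_code)                           # check the status before parsing

soup = BeautifulSoup(result.text, "html.parser")    # 2) make the soup
heading = soup.find("h1")                           # 3) extract information with find()
if heading is not None:
    print(heading.get_text(strip=True))             # simplify with get_text() / strip()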
1. indeed.py
import requests
from bs4 import BeautifulSoup
LIMIT = 50
URL = f"https://www.indeed.com/jobs?q=python&limit={LIMIT}"
# 6. Wrap everything in one function > extract the pages and return the last page number
def get_last_page():
    # 1. Get the HTML with requests.get()
    result = requests.get(URL)
    # 2. Make the soup
    # the soup is the extracted data, made easy to process and navigate
    # BeautifulSoup(extracted HTML document, "parser")
    soup = BeautifulSoup(result.text, "html.parser")
    # 3. Extract the block we want : find("tag name", {attributes})
    # => check the class name of the target in the page source : the pagination box
    # => the result is also a soup
    pagination = soup.find("div", {"class": "pagination"})
    # 4. Extract the tags we want as a list : find_all("tag name")
    # => from the block above, grab the <a> tags that hold the page numbers
    links = pagination.find_all('a')
    # (info) use a for-in loop over the list; a negative index counts from the back (-1 ..)
    pages = []
    for link in links[:-1]:
        # find the span tag and append only its string (without the tag) to the list
        # pages.append(link.find("span").string)
        pages.append(int(link.string))
    # 5. Save the largest page number
    max_page = pages[-1]
    return max_page
# 8. Split the detail extraction into its own function > return a dictionary
def extract_job(html):
    title = html.find("h2", {"class": "title"}).find("a")["title"]
    company = html.find("span", {"class": "company"})
    if company:
        company_anchor = company.find("a")
        if company_anchor is not None:
            company = str(company_anchor.string)
        else:
            company = str(company.string)
        # strip() removes the surrounding whitespace / newline characters
        company = company.strip()
    else:
        company = None
    location = html.find("div", {"class": "recJobLoc"})["data-rc-loc"]
    job_id = html["data-jk"]
    return {
        'title': title,
        'company': company,
        'location': location,
        'link': f"https://www.indeed.com/viewjob?jk={job_id}"
    }
# 7. Make a request for every page
# => compute the starting result number for each page : needed to visit every page
# => collect the jobs in a list and return it
def extract_jobs(last_page):
    jobs = []
    for page in range(last_page):
        print(f">>>>>>> INDEED Scraping page {page}")
        result = requests.get(f"{URL}&start={page*LIMIT}")
        soup = BeautifulSoup(result.text, "html.parser")
        results = soup.find_all('div', {"class": "jobsearch-SerpJobCard"})
        for result_div in results:
            # 8. Call the function we split out above > append the result to the jobs list
            job = extract_job(result_div)
            jobs.append(job)
    return jobs

# 9. Entry-point function
def get_jobs():
    last_page = get_last_page()
    jobs = extract_jobs(last_page)
    return jobs
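A quick way to smoke-test this module on its own could look like this (assuming the file above is saved as indeed.py and Indeed still serves this markup):

from indeed import get_jobs

jobs = get_jobs()      # scrape every result page
print(len(jobs))       # number of postings collected
print(jobs[0])         # first job dictionary : title, company, location, link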
2. so.py (stackoverflow)
import requests
from bs4 import BeautifulSoup
URL = f"https://stackoverflow.com/jobs?q=python"
# 1. get the pages
# 2. make each requests (HTML)
# 3. extract the jobs
def get_last_page():
    result = requests.get(URL)
    soup = BeautifulSoup(result.text, "html.parser")
    pages = soup.find("div", {"class": "s-pagination"}).find_all('a')
    last_page = pages[-2].get_text(strip=True)
    return int(last_page)
def extract_job(html):
    title = html.find("h2").find("a")["title"]
    # unpacking : assign each of the values on the right to its own variable on the left ..!!! handy
    # recursive=False >> don't go deeper than the direct children
    company, location = html.find("h3").find_all("span", recursive=False)
    company = company.get_text(strip=True).strip("\n")
    location = location.get_text(strip=True).strip("\n")
    # link = html.find("h2").find("a")["href"]
    link = html["data-jobid"]
    return {
        'title': title,
        'company': company,
        'location': location,
        'link': f"https://stackoverflow.com/jobs/{link}"
    }
def extract_jobs(last_page):
    jobs = []
    for page in range(last_page):
        print(f">>>>>>> STACKOVERFLOW Scraping page {page+1}")
        result = requests.get(f"{URL}&pg={page+1}")
        # to check the response : print(result.status_code)
        soup = BeautifulSoup(result.text, "html.parser")
        job_cards = soup.find_all("div", {"class": "-job"})
        for job_card in job_cards:
            job = extract_job(job_card)
            jobs.append(job)
    return jobs

def get_jobs():
    last_page = get_last_page()
    jobs = extract_jobs(last_page)
    return jobs
3. save.py
import csv
def save_to_file(jobs):
    # open(name) : open the file with that name
    # => if the file does not exist, it is created automatically first
    # => you have to set the 'mode' > "w" : write only, and the file is reset every run
    file = open("jobs.csv", mode="w")
    # the writer needs to know where to write
    writer = csv.writer(file)
    writer.writerow(["title", "company", "location", "link"])
    for job in jobs:
        writer.writerow(list(job.values()))
    return
4. main.py
from indeed import get_jobs as get_indeed_jobs
from so import get_jobs as get_so_jobs
from save import save_to_file
indeed_jobs = get_indeed_jobs()
so_jobs = get_so_jobs()
# combine the two lists with '+'
jobs = indeed_jobs + so_jobs
save_to_file(jobs)