* Web Scraper?
- Something you use to extract data (or information) from a website
[e.g.] paste a URL into Facebook and it shows a preview: photo and title
- Used to compare reviews, prices, etc.
- Used to collect information such as posts, news, etc.
- The point of a "Web Scraper" > how to collect and show lots of information efficiently
* Guide Line!
1. IMPORT THE MODULES (or functions) YOU NEED : requests and BeautifulSoup
- requests : fetch the whole HTML
- BeautifulSoup : extract the HTML data (information) you want
: you make a SOUP with BeautifulSoup > BeautifulSoup(html.text, "html.parser")
2. FIND OUT HOW MANY PAGES EXIST : needed for the iteration > reading each page
3. SAVE THE EXTRACTED DATA as a CSV file
(requests.get() returns a Response object; the notes below store it in a variable called result)
requests.get('url') | go to the URL > fetch the whole HTML (options: auth, params, data, ...)
result.status_code | get the status code
result.headers['name'] | get header information
result.text | get the HTML as text
result.json() | get the response body as JSON

soup.title | find the title tag (add ['attr'] to read one of its attributes)
soup.p | find the first p tag
soup.find('a') | find the first a tag
soup.find('div', {"class": "name"}) | find the div with class="name"
soup.find_all('a') | returns a LIST of matching tags (read ["href"] on each item)
https://realpython.com/python-requests/#the-response
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
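A minimal sketch of the calls in the table above, using http://example.com as a stand-in page (any static page works):

import requests
from bs4 import BeautifulSoup

# fetch the page and inspect the response object
result = requests.get("http://example.com")
print(result.status_code)               # 200 if the request succeeded
print(result.headers["Content-Type"])   # e.g. "text/html; charset=UTF-8"

# build a soup and navigate it
soup = BeautifulSoup(result.text, "html.parser")
print(soup.title.string)                # text inside the <title> tag
first_link = soup.find("a")             # first <a> tag, or None if there is none
for link in soup.find_all("a"):         # find_all() returns a list of tags
    print(link.get("href"))             # .get() avoids a KeyError when href is missing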
* CSV (Comma Separated Values) File
- Like an Excel file, but understood not only by MS Office but also by Mac, Windows, browsers, Google Drive, etc.
- A ROW is separated by a new line
- A COLUMN is separated by a comma
- The csv module is built into Python
- open(), csv.writer(), writer.writerow(), ...
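A minimal sketch of writing a CSV with the built-in csv module (the file name and the rows are just illustrative):

import csv

# newline="" avoids blank rows between records on Windows
file = open("jobs.csv", mode="w", newline="")
writer = csv.writer(file)
writer.writerow(["title", "company", "location", "link"])   # header row
writer.writerow(["Python Developer", "ACME", "Seoul", "https://example.com/job/1"])
file.close()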
* NOW, how do you make a Web Scraper?
1) get the HTML : USING requests.get(URL)
2) make a SOUP to extract specific data : USING BeautifulSoup(html.text, "html.parser")
3) extract information from the soup : USING find(), find_all(), etc.
> to clean up the results : .string, strip(), get_text(), etc.
> to check the status : print(something) OR print(something.status_code)
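Putting the three steps together, a minimal sketch (again using http://example.com as a placeholder page):

import requests
from bs4 import BeautifulSoup

# 1) get the HTML
result = requests.get("http://example.com")
print(result.status_code)               # quick sanity check

# 2) make the soup
soup = BeautifulSoup(result.text, "html.parser")

# 3) extract and clean up the information
heading = soup.find("h1")
print(heading.string)                   # the raw string inside the tag
print(heading.get_text(strip=True))     # same text with surrounding whitespace stripped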
1. indeed.py
import requests
from bs4 import BeautifulSoup
LIMIT = 50
URL = f"https://www.indeed.com/jobs?q=python&limit={LIMIT}"
# 6. Wrap the extraction in one function > return the last page number
def get_last_page():
    # 1. Fetch the HTML with requests.get()
    result = requests.get(URL)
    # 2. Make the soup
    #    The soup makes the extracted data easy to search and process.
    #    BeautifulSoup(HTML document, "parser")
    soup = BeautifulSoup(result.text, "html.parser")
    # 3. Extract the block you want : find("tag name", {attributes})
    #    => check the class name of the target element in the page source : the pagination box
    #    => the result is also a soup!
    pagination = soup.find("div", {"class": "pagination"})
    # 4. Extract the tags you want as a list : find_all("tag name")
    #    => from the block above, take the <a> tags that hold the page numbers
    links = pagination.find_all('a')
    # (info) iterate a list with for ... in; a negative index counts from the end (-1, -2, ...)
    pages = []
    for link in links[:-1]:
        # take only the string inside the tag and append it to the list
        # pages.append(link.find("span").string)
        pages.append(int(link.string))
    # 5. Keep the largest page number
    max_page = pages[-1]
    return max_page
# 8. Define a separate function that extracts the job details > return a dictionary
def extract_job(html):
    title = html.find("h2", {"class": "title"}).find("a")["title"]
    company = html.find("span", {"class": "company"})
    if company:
        company_anchor = company.find("a")
        if company_anchor is not None:
            company = str(company_anchor.string)
        else:
            company = str(company.string)
        # strip() removes the surrounding whitespace and newlines
        company = company.strip()
    else:
        company = None
    location = html.find("div", {"class": "recJobLoc"})["data-rc-loc"]
    job_id = html["data-jk"]
    return {
        'title': title,
        'company': company,
        'location': location,
        'link': f"https://www.indeed.com/viewjob?jk={job_id}"
    }
# 7. Request every page
#    => compute the start offset for each page so every page gets visited
#    => collect the results in a jobs list and return it
def extract_jobs(last_page):
    jobs = []
    for page in range(last_page):
        print(f">>>>>>> INDEED Scraping page {page}")
        result = requests.get(f"{URL}&start={page*LIMIT}")
        soup = BeautifulSoup(result.text, "html.parser")
        results = soup.find_all('div', {"class": "jobsearch-SerpJobCard"})
        for result_div in results:
            # 8. call the separate extract_job() function > append to the jobs list
            job = extract_job(result_div)
            jobs.append(job)
    return jobs

# 9. Entry point : get the last page, then extract all the jobs
def get_jobs():
    last_page = get_last_page()
    jobs = extract_jobs(last_page)
    return jobs
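A quick way to sanity-check indeed.py on its own before wiring it into main.py (the dictionary shape matches what extract_job() returns; the actual values depend on the live site):

from indeed import get_jobs

jobs = get_jobs()
print(len(jobs))    # number of job cards collected across all pages
print(jobs[0])      # {'title': ..., 'company': ..., 'location': ..., 'link': ...}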
2. so.py (stackoverflow)
import requests
from bs4 import BeautifulSoup
URL = f"https://stackoverflow.com/jobs?q=python"
# 1. get the pages
# 2. make each requests (HTML)
# 3. extract the jobs
def get_last_page():
    result = requests.get(URL)
    soup = BeautifulSoup(result.text, "html.parser")
    pages = soup.find("div", {"class": "s-pagination"}).find_all('a')
    last_page = pages[-2].get_text(strip=True)
    return int(last_page)
def extract_job(html):
    title = html.find("h2").find("a")["title"]
    # unpacking : when you know how many elements there are, each one lands in its own variable
    # recursive=False >> don't go deeper than the direct children
    company, location = html.find("h3").find_all("span", recursive=False)
    company = company.get_text(strip=True).strip("\n")
    location = location.get_text(strip=True).strip("\n")
    # link = html.find("h2").find("a")["href"]
    link = html["data-jobid"]
    return {
        'title': title,
        'company': company,
        'location': location,
        'link': f"https://stackoverflow.com/jobs/{link}"
    }
def extract_jobs(last_page):
    jobs = []
    for page in range(last_page):
        print(f">>>>>>> STACKOVERFLOW Scraping page {page+1}")
        result = requests.get(f"{URL}&pg={page+1}")
        # to check the request : print(result.status_code)
        soup = BeautifulSoup(result.text, "html.parser")
        job_cards = soup.find_all("div", {"class": "-job"})
        for job_card in job_cards:
            job = extract_job(job_card)
            jobs.append(job)
    return jobs
def get_jobs():
    last_page = get_last_page()
    jobs = extract_jobs(last_page)
    return jobs
3. save.py
import csv
def save_to_file(jobs):
    # open(name) : open the file with that name
    # => if the file does not exist, it is created first
    # => set the mode > "w" : write only, and the file is reset every time the script runs
    file = open("jobs.csv", mode="w")
    # csv.writer() takes the file object it should write to
    writer = csv.writer(file)
    writer.writerow(["title", "company", "location", "link"])
    for job in jobs:
        writer.writerow(list(job.values()))
    return
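The course code above works as-is, but two optional open() arguments are worth knowing: newline="" prevents blank rows between records on Windows, and an explicit encoding keeps non-ASCII company names intact. A variant of save_to_file() with those settings (same behaviour otherwise):

import csv

def save_to_file(jobs):
    # newline="" and encoding="utf-8" are optional hardening, not part of the original course code
    with open("jobs.csv", mode="w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["title", "company", "location", "link"])
        for job in jobs:
            writer.writerow(list(job.values()))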
4. main.py
from indeed import get_jobs as get_indeed_jobs
from so import get_jobs as get_so_jobs
from save import save_to_file
indeed_jobs = get_indeed_jobs()
so_jobs = get_so_jobs()
# combine the two lists with '+'
jobs = indeed_jobs + so_jobs
save_to_file(jobs)
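Running python main.py scrapes both sites in turn and writes the combined results to jobs.csv in the working directory.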