[Building a Job Scraper] How to make a Web Scraper?
* Web Scraper ?
- Something you use to extract data (or information) from a web site
[e.g.] paste a URL into Facebook and it shows a preview : photo and title
- Apply it to compare reviews, prices .. etc.
- Bring or collect information such as posts, news .. etc.
- An applied "Web Scraper" > how to show lots of information efficiently
* Guide Line !
1. IMPORT THE MODULES (or functions) YOU NEED : requests and BeautifulSoup
- requests : gets the HTML (the whole page)
- BeautifulSoup : extracts the HTML data (information) you want
: you make a SOUP with BeautifulSoup > BeautifulSoup(html.text, "html.parser")
2. FIND OUT HOW MANY PAGES EXIST : needed for iteration > reading each page
3. SAVE THE EXTRACTED DATA as a CSV file
requests.get('url')                 | go to the URL > bring back the whole HTML (options: auth, params, data ..) |
response.status_code                | get the status code (response = requests.get('url')) |
response.headers['attr']            | get header information |
response.text                       | get the HTML as text data |
response.json()                     | get the response body as JSON data |
soup.title / soup.p                 | tag-name shortcut : first title / p tag |
soup.find('a')                      | find the first a tag |
soup.find('div', {"class": "name"}) | find the div (class="name") tag |
soup.find('a')["href"]              | get the href attribute of the first a tag |
soup.find_all('a')                  | find every a tag > returns a LIST |
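A tiny, hedged demo of the calls in the table above (httpbin.org is only a convenient public test page, not part of the job scraper):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://httpbin.org/html")   # go to the URL > bring back the HTML
print(response.status_code)                           # 200 when the request succeeded
print(response.headers["Content-Type"])               # header information

soup = BeautifulSoup(response.text, "html.parser")    # make the soup from the text data
print(soup.h1)                                        # tag-name shortcut : first <h1> tag
print(soup.find_all("a"))                             # find_all always returns a LIST (possibly empty)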
References:
- Python's Requests Library (Guide) – Real Python : https://realpython.com/python-requests/#the-response
- Beautiful Soup Documentation : https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
* CSV (Comma Separated Values) File
- Like an Excel file, but understood not only by MS Office but also by Mac, Windows, browsers, Google Drive .. etc.
- Each ROW is separated by a new line
- Each COLUMN is separated by a comma
- The csv module is already built into Python
- open(), csv.writer(), writerow() ..
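A minimal sketch of the built-in csv module (the file name and rows here are just placeholders):

import csv

with open("example.csv", mode="w", newline="") as file:
    writer = csv.writer(file)                       # the writer needs a file object to write into
    writer.writerow(["title", "company"])           # header row : columns separated by commas
    writer.writerow(["Python Developer", "ACME"])   # one data row per line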
* NOW, How to make a Web Scraper ?
1) get the HTML : USING requests.get(URL)
2) make a SOUP to extract specific data : USING BeautifulSoup(result.text, "html.parser")
3) extract information from the soup : USING find(), find_all() .. etc
> to simplify the results : .string, strip(), get_text() .. etc
> to check the status : print(something) OR print(something.status_code)
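A minimal sketch of these three steps before the full scrapers below (example.com is only a placeholder page, not real job-site markup):

import requests
from bs4 import BeautifulSoup

result = requests.get("https://example.com")        # 1) get the HTML
print(result.status_code)                           # check the status before parsing

soup = BeautifulSoup(result.text, "html.parser")    # 2) make the soup
heading = soup.find("h1")                           # 3) extract information with find()
if heading is not None:
    print(heading.get_text(strip=True))             # simplify with get_text() / strip()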
1. indeed.py
import requests
from bs4 import BeautifulSoup
LIMIT = 50
URL = f"https://www.indeed.com/jobs?q=python&limit={LIMIT}"
# 6. Wrap everything in one function > extract the pages and return the last page number
def get_last_page():
    # 1. Get the HTML with requests.get()
    result = requests.get(URL)
    # 2. Make the soup
    # the soup is the extracted data, made easy to process and navigate
    # BeautifulSoup(extracted HTML document, "parser")
    soup = BeautifulSoup(result.text, "html.parser")
    # 3. Extract the block we want : find("tag name", {attributes})
    # => check the class name of the target in the page source : the pagination box
    # => the result is also a soup
    pagination = soup.find("div", {"class": "pagination"})
    # 4. Extract the tags we want as a list : find_all("tag name")
    # => from the block above, grab the <a> tags that hold the page numbers
    links = pagination.find_all('a')
    # (info) use a for-in loop over the list; a negative index counts from the back (-1 ..)
    pages = []
    for link in links[:-1]:
        # find the span tag and append only its string (without the tag) to the list
        # pages.append(link.find("span").string)
        pages.append(int(link.string))
    # 5. Save the largest page number
    max_page = pages[-1]
    return max_page
# 8. Split the detail extraction into its own function > return a dictionary
def extract_job(html):
    title = html.find("h2", {"class": "title"}).find("a")["title"]
    company = html.find("span", {"class": "company"})
    if company:
        company_anchor = company.find("a")
        if company_anchor is not None:
            company = str(company_anchor.string)
        else:
            company = str(company.string)
        # strip() removes the surrounding whitespace / newline characters
        company = company.strip()
    else:
        company = None
    location = html.find("div", {"class": "recJobLoc"})["data-rc-loc"]
    job_id = html["data-jk"]
    return {
        'title': title,
        'company': company,
        'location': location,
        'link': f"https://www.indeed.com/viewjob?jk={job_id}"
    }
# 7. Make a request for every page
# => compute the starting result number for each page : needed to visit every page
# => collect the jobs in a list and return it
def extract_jobs(last_page):
    jobs = []
    for page in range(last_page):
        print(f">>>>>>> INDEED Scraping page {page}")
        result = requests.get(f"{URL}&start={page*LIMIT}")
        soup = BeautifulSoup(result.text, "html.parser")
        results = soup.find_all('div', {"class": "jobsearch-SerpJobCard"})
        for result_div in results:
            # 8. Call the function we split out above > append the result to the jobs list
            job = extract_job(result_div)
            jobs.append(job)
    return jobs

# 9. Entry-point function
def get_jobs():
    last_page = get_last_page()
    jobs = extract_jobs(last_page)
    return jobs
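A quick way to smoke-test this module on its own could look like this (assuming the file above is saved as indeed.py and Indeed still serves this markup):

from indeed import get_jobs

jobs = get_jobs()      # scrape every result page
print(len(jobs))       # number of postings collected
print(jobs[0])         # first job dictionary : title, company, location, link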
2. so.py (stackoverflow)
import requests
from bs4 import BeautifulSoup
URL = f"https://stackoverflow.com/jobs?q=python"
# 1. get the pages
# 2. make each requests (HTML)
# 3. extract the jobs
def get_last_page():
    result = requests.get(URL)
    soup = BeautifulSoup(result.text, "html.parser")
    pages = soup.find("div", {"class": "s-pagination"}).find_all('a')
    last_page = pages[-2].get_text(strip=True)
    return int(last_page)
def extract_job(html):
    title = html.find("h2").find("a")["title"]
    # unpacking : assign each of the values on the right to its own variable on the left ..!!! handy
    # recursive=False >> don't go deeper than the direct children
    company, location = html.find("h3").find_all("span", recursive=False)
    company = company.get_text(strip=True).strip("\n")
    location = location.get_text(strip=True).strip("\n")
    # link = html.find("h2").find("a")["href"]
    link = html["data-jobid"]
    return {
        'title': title,
        'company': company,
        'location': location,
        'link': f"https://stackoverflow.com/jobs/{link}"
    }
def extract_jobs(last_page):
    jobs = []
    for page in range(last_page):
        print(f">>>>>>> STACKOVERFLOW Scraping page {page+1}")
        result = requests.get(f"{URL}&pg={page+1}")
        # to check the response : print(result.status_code)
        soup = BeautifulSoup(result.text, "html.parser")
        job_cards = soup.find_all("div", {"class": "-job"})
        for job_card in job_cards:
            job = extract_job(job_card)
            jobs.append(job)
    return jobs

def get_jobs():
    last_page = get_last_page()
    jobs = extract_jobs(last_page)
    return jobs
3. save.py
import csv
def save_to_file(jobs):
    # open(name) : open the file with that name
    # => if the file does not exist, it is created automatically first
    # => you have to set the 'mode' > "w" : write only, and the file is reset every run
    file = open("jobs.csv", mode="w")
    # the writer needs to know where to write
    writer = csv.writer(file)
    writer.writerow(["title", "company", "location", "link"])
    for job in jobs:
        writer.writerow(list(job.values()))
    return
4. main.py
from indeed import get_jobs as get_indeed_jobs
from so import get_jobs as get_so_jobs
from save import save_to_file
indeed_jobs = get_indeed_jobs()
so_jobs = get_so_jobs()
# combine the two lists with '+'
jobs = indeed_jobs + so_jobs
save_to_file(jobs)