
[Building a Job Scrapper] How to Make a Web Scraper?

by ๐Ÿ’œautumn 2020. 8. 4.

* Web Scraper ?

 - Something you use to extract data (or information) from a web site

    [e.g.] paste a URL into Facebook and it will show a preview : photo and title

 - Used to compare reviews, prices, etc.

 - Brings or collects information like posts, news, etc.

 - An applied "Web Scraper" > a way to show lots of information efficiently


* Guide Line !

1. IMPORT THE MODULES (or functions) YOU NEED : requests and BeautifulSoup

    - requests : gets the HTML (the whole page)

    - BeautifulSoup : extracts the HTML data (information) you want

                       : you make a SOUP with BeautifulSoup > BeautifulSoup(html.text, "html.parser")

2. FIND OUT HOW MANY PAGES EXIST : needed for iteration > reading each page

3. SAVE THE EXTRACTED DATA as a CSV file

 - requests side (response = requests.get('url')):

     requests.get('url')                  : go to the URL > bring back the HTML
                                            (options: auth, params, data, ..)
     response.status_code                 : get the status code
     response.headers['attr']             : get header information
     response.text                        : get the HTML as text data
     response.json()                      : get the response body as JSON data

 - BeautifulSoup side (soup = BeautifulSoup(response.text, "html.parser")):

     soup.title['attr']                   : find the title tag (and read an attribute)
     soup.p['attr']                       : find the p tag (and read an attribute)
     soup.find('a')                       : find the first a tag
     soup.find('div', {"class": "name"})  : find the div (class=name) tag
     soup.find_all('a')                   : find every a tag > returns a LIST
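
A quick sketch of the cheat sheet above, against a placeholder URL (the attributes live on the response object that requests.get() returns):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")   # placeholder URL, any page works
print(response.status_code)                      # e.g. 200
print(response.headers["Content-Type"])          # header information

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)                         # text inside the <title> tag
first_link = soup.find("a")                      # first <a> tag (or None if there isn't one)
all_links = soup.find_all("a")                   # LIST of every <a> tag
if first_link:
    print(first_link["href"])                    # read an attribute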

https://realpython.com/python-requests/#the-response
  Python's Requests Library (Guide) – Real Python

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
  Beautiful Soup Documentation — Beautiful Soup 4.9.0 documentation


* CSV (Comma Separated Values) File 

 - Like an Excel file, but understood not only by MS Office but also by Mac, Windows, browsers, Google Drive, etc.

 - A ROW is separated by a new line

 - A COLUMN is separated by a comma

 - The csv module is already built into Python

 - open(), csv.writer(), writerow(), .. 
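
A minimal sketch of those built-ins (the file name and rows below are made up for illustration; newline="" is the setting the csv docs recommend when writing):

import csv

file = open("sample.csv", mode="w", newline="")   # example file name
writer = csv.writer(file)
writer.writerow(["title", "company"])             # one ROW; COLUMNs separated by commas
writer.writerow(["Backend Developer", "ACME"])    # -> Backend Developer,ACME
file.close()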


* NOW, How to make a Web Scraper ?

  1) get the HTML : USING requests.get(URL)

  2) make a SOUP to extract specific data : USING BeautifulSoup(result.text, "html.parser")

  3) extract information from the soup : USING find(), find_all(), etc.

         > to simplify the results : .string, strip(), get_text(), etc.

         > to check the status : print(something) OR print(something.status_code)
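
Those three steps in one tiny, hedged sketch (the URL and tag are placeholders, not tied to any real job site):

import requests
from bs4 import BeautifulSoup

result = requests.get("https://example.com")        # 1) get the HTML
print(result.status_code)                           # check the status first (200 = OK)

soup = BeautifulSoup(result.text, "html.parser")    # 2) make the soup
heading = soup.find("h1")                           # 3) extract information
if heading:
    print(heading.get_text(strip=True))             # simplify : tag -> clean text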

     

 

  1. indeed.py

import requests
from bs4 import BeautifulSoup

LIMIT = 50
URL = f"https://www.indeed.com/jobs?q=python&limit={LIMIT}"


# 6. wrap it all in one function > extract the page numbers and return the max page
def get_last_page():
  # 1. get the HTML with requests.get()
  result=requests.get(URL)

  # 2. make the soup
  #   the soup is the parsed data, easy to search and work with
  #   BeautifulSoup(HTML document to parse, "parser")
  soup=BeautifulSoup(result.text,"html.parser")

  # 3. extract the block you want : find("tag name", {attributes})
  #   => check the page source for the class name of the info you want : the pagination box
  #   => the result is also a soup!
  pagination = soup.find("div", {"class":"pagination"})

  # 4. extract the tags you want as a list : find_all("tag name")
  #   => from the block above, take the <a> tags that hold the page numbers
  links = pagination.find_all('a')

  # (info) use a for..in loop to walk the list; a negative index counts from the end (-1, ..)
  pages = []
  for link in links[:-1]:
    # take only the string (no tag) from each link and add it to the list
    #pages.append(link.find("span").string)
    pages.append(int(link.string))

  # 5. keep the largest page number
  max_page=pages[-1]
  return max_page

# 8. split the detail extraction into its own function > it returns a dictionary
def extract_job(html):
  title = html.find("h2",{"class": "title"}).find("a")["title"]
  company = html.find("span",{"class": "company"})
  if company :
    company_anchor = company.find("a")
    if company_anchor is not None :
      company = str(company_anchor.string)
    else :
      company = str(company.string)
    # strip() removes the characters passed to it (whitespace by default) from both ends
    company=company.strip()
  else:
    company = None
  location = html.find("div",{"class":"recJobLoc"})["data-rc-loc"]
  job_id = html["data-jk"]
  return {
    'title': title, 
    'company':company, 
    'location':location, 
    'link' : f"https://www.indeed.com/viewjob?jk={job_id}"
  }

# 7. request every page
#  => compute each page's starting job number so we can visit it
#  => collect everything in a jobs list and return it
def extract_jobs(last_page):
  jobs = []
  for page in range(last_page):
    print(f">>>>>>> INDEED Scrapping page {page}")
    result = requests.get(f"{URL}&start={page*LIMIT}")
    soup = BeautifulSoup(result.text,"html.parser")
    results = soup.find_all('div',{"class" : "jobsearch-SerpJobCard"})
    for result_div in results : 
      # 8. call the helper function defined above > append the job to the jobs list
      job = extract_job(result_div)
      jobs.append(job)
  return jobs

# 9. entry function (called from main.py)
def get_jobs():
  last_page = get_last_page()
  jobs = extract_jobs(last_page)
  return jobs
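
Side note: instead of building the query string by hand with an f-string, requests can also take the query parameters via the params option mentioned in the cheat sheet (a hedged sketch; Indeed's parameters and markup may have changed since this was written):

import requests

# equivalent to requests.get(f"{URL}&start={page*LIMIT}") for page 1, letting requests build the query string
result = requests.get(
    "https://www.indeed.com/jobs",
    params={"q": "python", "limit": 50, "start": 50},
)
print(result.url)   # https://www.indeed.com/jobs?q=python&limit=50&start=50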
 

 

  2. so.py (stackoverflow)

import requests
from bs4 import BeautifulSoup

URL = "https://stackoverflow.com/jobs?q=python"

# 1. get the pages
# 2. make each requests (HTML)
# 3. extract the jobs

def get_last_page():
  result = requests.get(URL)
  soup = BeautifulSoup(result.text,"html.parser")
  pages = soup.find("div", {"class":"s-pagination"}).find_all('a')
  last_page = pages[-2].get_text(strip=True)
  return int(last_page)


def extract_job(html):
  title = html.find("h2").find("a")["title"]
  # unpacking : when you know how many elements there are, each value is assigned automatically..!!! wow
  # recursive=False >> don't go deep (direct children only)
  company, location = html.find("h3").find_all("span",recursive=False)
  company = company.get_text(strip=True).strip("\n")
  location = location.get_text(strip=True).strip("\n")
  #link = html.find("h2").find("a")["href"]
  link = html["data-jobid"]
  return {
    'title':title, 
    'company':company, 
    'location':location, 
    'link':f"https://stackoverflow.com/jobs/{link}"
  }


def extract_jobs(last_page):
  jobs = []
  for page in range (last_page):
    print(f">>>>>>> OVERFLOWSTACK Scrapping page {page+1}")
    result = requests.get(f"{URL}&pg={page+1}")
    # to check : print(result.status_code)
    soup = BeautifulSoup(result.text,"html.parser")
    job_cards = soup.find_all("div",{"class": "-job"})
    for job_card in job_cards:
      job = extract_job(job_card)
      jobs.append(job)
  return jobs


def get_jobs():
  last_page = get_last_page()
  jobs = extract_jobs(last_page)
  return jobs
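
The unpacking plus recursive=False trick above is easier to see on a tiny, made-up snippet of HTML:

from bs4 import BeautifulSoup

html = '<h3><span>ACME <span>Corp</span></span><span>Seoul</span></h3>'
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find("h3").find_all("span")))                   # 3 -> searches every descendant
print(len(soup.find("h3").find_all("span", recursive=False)))  # 2 -> direct children only

# exactly two direct children, so they unpack straight into two variables
company, location = soup.find("h3").find_all("span", recursive=False)
print(company.get_text(), "/", location.get_text())            # ACME Corp / Seoul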

 

  3. save.py

import csv

def save_to_file(jobs):
  # open("jobs.csv") : open the file with that name
  #  => if the file does not exist, it is created automatically
  #  => the 'mode' matters > "w" : write only, and the file is reset on every run
  file = open("jobs.csv",mode="w")
  # tell the csv writer where to write
  writer = csv.writer(file)
  writer.writerow(["title", "company", "location", "link"])
  for job in jobs:
    writer.writerow(list(job.values()))
  return
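
One hedged improvement worth considering for save.py: a with block closes the file automatically, and newline="" plus an explicit encoding are the settings the csv docs recommend when writing:

import csv

def save_to_file(jobs):
  with open("jobs.csv", mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["title", "company", "location", "link"])
    for job in jobs:
      writer.writerow(list(job.values()))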

 

  4. main.py

from indeed import get_jobs as get_indeed_jobs
from so import get_jobs as get_so_jobs
from save import save_to_file

indeed_jobs = get_indeed_jobs()
so_jobs = get_so_jobs()

# combine the two lists with '+'
jobs = indeed_jobs + so_jobs

save_to_file(jobs)
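
To double-check the result after running main.py, the generated file can be read back with the same csv module (a small optional snippet):

import csv

with open("jobs.csv") as file:
  for row in csv.reader(file):
    print(row)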
