Learning Python3 Crawler

Posted by Riino

Requests

Requests is the easiest way to send an HTTP request to a target URL. A request contains two parts: the request header and the request URL. This is what your browser sends when visiting a page in a real case.

Generally a request header is displayed in a JSON-like format, and in Chrome's DevTools you can inspect the 14 items inside, for example:

  1. :authority: ogs.google.com
  2. :method: GET (the type of HTTP request, usually GET)
  3. :path: /u/0/widget/app?origin=chrome-search%3A%2F%2Flocal-ntp&pid=1&spid=243&hl=zh-TW&gm=
  4. :scheme: https
  5. accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
  6. accept-encoding: gzip, deflate, br
  7. accept-language: zh-CN,zh;q=0.9,zh-TW;q=0.8
  8. cookie:
     - HSID=AUbkUAanFFXXXXXX;
     - SSID=ADbqhQOYTmfjyXXXX;
     - APISID=sgGGoOYyddd5pP0c/A8nx48yUAu9GnXXXX;
     - SAPISID=UvOceTV27z4-pRJt/AwIDrQ_uc-LZkXXXX;
     - __Secure-HSID=AUbkUAanFFRUSXXXX;
     - __Secure-SSID=ADbqhQOYTmfjyXXXX;
     - __Secure-APISID=sgGGoOYyddd5pP0c/A8nx48yUAu9GnXXXX;
     - __Secure-3PAPISID=UvOceTV27z4-pRJt/AwIDrQ_uc-LZkXXXX;
     - OTZ=5489941_24_24__24_;
     - SID=yAdNJBetgwljReCOL1RX9kBlJAM8MjVJesOHLcVdF2-mMLwQtQMUMSpi2IBwjKISqnxxxx.;
     - __Secure-3PSID=yAdNJBetgwljReCOL1RX9kBlJAM8MjVJesOHLcVdF2-mMLwQv4q9ptDR0zSxBvQ9PF6kgQ.;
     - NID=204=g_d3k7sRDyZ1HNJ-ceym0tpmgr-U8X79E0_L_l2_ET_ryjLi9pXB59XrrfjmHpFXkwLMc640fp3hMzSxNus6W3uB1ALcKTtJA_lf36SGlgT3XhCzW562_lahSvakuExNlJ6SrILK7Wy-9EuwvnOE44oajmeHqy4eI9rr3W1xMCNQXa6cEQlrRykbF8T89VB_GnvownKrENIBMVebo30c4_ZeyZORmivcHEXyFvMggu6yvIOnbVQeYnY2J98Na07V4ZSCw;
     - 1P_JAR=2020-06-16-14;
     - SIDCC=AJi4QfH_yuW_DTMEc0UncAJUZgGzkuxqvrUyxkn2n403X0GiWlcj9Uplj9Xu54GN_zgLvB3zWMc
  9. sec-fetch-dest: iframe
  10. sec-fetch-mode: navigate
  11. sec-fetch-site: cross-site
  12. upgrade-insecure-requests: 1
  13. user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
  14. x-client-data: CIi2yQEIpLbJAQjBtskBCKmdygEI8KDKARibvsoBGL2+ygE=

Most of these fields are set by default when using Requests. For the details of the attributes above, please check RFC 2616: https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html.
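For example, a minimal sketch with the Requests library (the URL is a placeholder) that sends a GET request and prints the header fields Requests filled in by default:

import requests

resp = requests.get('https://example.com')  # placeholder URL
print(resp.status_code)
print(resp.request.headers)  # the headers that were actually sent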

In general cases, you have to pay attention to:

  1. User-Agent: the identification info of the request sender, such as the browser name, the application type, OS type, vendor, and version. Generally the format is:

    User-Agent: <product> / <product-version> <comment>

    or (in a browser):

    User-Agent: Mozilla/<version> (<system-information>) <platform> (<platform-details>) <extensions>

    e.g. a typical Chrome user agent:

    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
  2. Cookie

    The cookie is the second most important attribute here. The server uses this data to identify the user. Cookies deserve a longer explanation of their own; a sketch of sending them with Requests follows this list.

    //TODO
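A minimal sketch using the Requests library, assuming a placeholder URL and made-up cookie values; it overrides the User-Agent and attaches cookies to a GET request:

import requests

url = 'https://example.com'  # placeholder URL

headers = {
    # pretend to be a desktop Chrome browser
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/83.0.4103.97 Safari/537.36'),
}
cookies = {'SID': 'xxxx', 'HSID': 'xxxx'}  # made-up values

resp = requests.get(url, headers=headers, cookies=cookies)
print(resp.status_code)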

BeautifulSoup

To begin with BeautifulSoup, let's use Selenium or Requests to fetch a full HTML page and write a simple example showing how BeautifulSoup works.

Basically, BeautifulSoup is a tool that lets you avoid using re to locate specific tags in HTML and read their content or sibling tags. You only need to know a handful of APIs, so there is no need to master re, although we will still have to fall back on re when the case is complex.

from bs4 import BeautifulSoup
from selenium import webdriver
#import requests

Our sample will use Selenium. Using Selenium directly helps us handle values generated by JavaScript, so we don't have to patch things up after trying Requests first. But keep in mind that Selenium uses far more resources, because it boots a real browser.

Now we need to confirm the configuration:

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=chrome_options)

Here we set 'headless' mode to disable the browser window; this is necessary when we need to visit many pages. Now we can use this driver to visit a URL and get back the downloaded HTML:

url='https://news.cqu.edu.cn/newsv2/show-14-10280-1.html'
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")

The code above performs this procedure. If you need to visit several pages, just repeat driver.get(url) and soup = BeautifulSoup(driver.page_source, "html.parser") to get a fresh variable that BeautifulSoup can process.
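For example, a sketch that reuses the driver from above to visit a small list of pages (the second URL is made up for illustration):

urls = [
    'https://news.cqu.edu.cn/newsv2/show-14-10280-1.html',
    'https://news.cqu.edu.cn/newsv2/show-14-10281-1.html',  # made-up example
]
for url in urls:
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.title.string if soup.title else None)  # process each page here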

Note: when using Selenium, we can't get the HTTP status code directly (https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141), because the driver just visits the target URL and returns the corresponding HTML. You would have to use Requests to get a status code. My workaround is extra code that checks whether key values in the HTML look right; e.g. if the title in the meta tags is weird, we can conclude that we got a wrong page.
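A sketch of that check, assuming we know a keyword the correct page's title should contain (the function name and keyword are hypothetical):

def looks_like_wrong_page(soup, expected_keyword):
    # If the <title> is missing or does not mention the expected keyword,
    # assume we received an error or redirect page instead of the real one.
    title = soup.title.string if soup.title else ''
    return expected_keyword not in (title or '')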

If you have no idea what the page will look like and you have enabled headless mode, you can dump what the driver rendered; one simple sketch (the file names are arbitrary):
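# Save a screenshot of what the headless browser rendered,
# and dump the HTML source to a file for inspection.
driver.save_screenshot('page.png')
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)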

To extract a specific piece of data from an HTML page, we need the BeautifulSoup API. You can find full coverage in the official documentation, but I want to show some of the most useful methods.

Direct Tag Name

You can directly access a tag via its name, e.g.:

<h1 class="dtitle">努力“乒”出梦想,健康“羽”你同行</h1>
soup.h1.string
#'努力“乒”出梦想,健康“羽”你同行'

Tag Name + Attribute

<a data-deptid="" data-id="0" href="javascript:;"> </a>
<a href="javascript:;"> 杨柳 刘思羽 茅羽瑶 </a>
soup.find_all(name='a',attrs={"href":"javascript:;"})

find_all returns a list of tags; to iterate over them, you can write code like:

for i in soup.find_all(name='a',attrs={"href":"javascript:;"})[3:]:
    print(i.string)

If you are sure that only one tag matches the requirement, or you just want the first one, you can use find rather than find_all; the former returns a single tag:

soup.find("div", {"class":"ibox"}).contents[-2].string

Notice that here we used not only string but also contents.

string gets the text inside the tag, for example:

<tag>text</tag>

If you want to get 'text' inside, you can use string here.

And contents gets the child tags inside, for example:

<tag>
  <h1>Title</h1>
  <h2>subtitle</h2>
</tag>

If you take this tag and access its contents, you will get a list containing h1 and h2 (plus any whitespace text nodes between them).
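A quick sketch of the difference between string and contents (using the snippet above, minus the whitespace):

from bs4 import BeautifulSoup

html = "<tag><h1>Title</h1><h2>subtitle</h2></tag>"
tag = BeautifulSoup(html, "html.parser").tag

print(tag.contents)   # [<h1>Title</h1>, <h2>subtitle</h2>]
print(tag.h1.string)  # 'Title'
print(tag.string)     # None: the tag has more than one child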

The last useful method is has_attr('style'). It lets you find target tags whose attribute value is not known in advance.

e.g.

<p style="text-align:center">
  <img alt="4.jpg" src="/newsv2/uploadfile/20171107/1510022124881514.jpg" />
</p>
<p>
  <span style="text-indent: 2em;">
    羽毛球场上,不同学院之间也展开了激烈的个人和团体赛。随着午间温度的上升,选手们为了更好地发挥实力,纷纷脱下外套,轻装上阵。雪白的球在球场上空随着选手们不断挥舞的球拍转换着位置。裁判们也始终跟紧了目光,在计分板上记录下比赛双方的实时分数。
  </span>
</p>

In this case, we only want the p tags that have an img inside. Notice that such a p has a style attribute, so we can:

for tag_p in soup.find_all('p'):
    if len(str(tag_p.string)) > 50:
        # long text: this <p> holds a paragraph of the article
        print(tag_p.string)
    if tag_p.has_attr('style'):
        # a styled <p> wraps the image; print the image URL instead of
        # tag_p.string, which would be None here
        if tag_p.img is not None:
            print(tag_p.img['src'])