Learning Python3 Crawler

Collect Data from Web

Posted by Riino on June 16, 2020 Python Crawler Web


The `requests` library is the easiest way to send an HTTP request to a target URL. A request contains two parts: the request header and the request URL. This is also what your browser sends when visiting a page in a real case.

Generally a request header is a set of key-value pairs, much like JSON, and in Chrome's DevTools you can inspect 14 items, for example:

  1. :authority: ogs.google.com
  2. :method: GET (the type of HTTP request, usually GET)
  3. :path: /u/0/widget/app?origin=chrome-search%3A%2F%2Flocal-ntp&pid=1&spid=243&hl=zh-TW&gm=
  4. :scheme: https
  5. accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
  6. accept-encoding: gzip, deflate, br
  7. accept-language: zh-CN,zh;q=0.9,zh-TW;q=0.8
  8. cookie:
   - SSID=ADbqhQOYTmfjyXXXX;
   - APISID=sgGGoOYyddd5pP0c/A8nx48yUAu9GnXXXX;
   - SAPISID=UvOceTV27z4-pRJt/AwIDrQ_uc-LZkXXXX;
   - __Secure-HSID=AUbkUAanFFRUSXXXX;
   - __Secure-SSID=ADbqhQOYTmfjyXXXX;
   - __Secure-APISID=sgGGoOYyddd5pP0c/A8nx48yUAu9GnXXXX;
   - __Secure-3PAPISID=UvOceTV27z4-pRJt/AwIDrQ_uc-LZkXXXX;
   - OTZ=5489941_24_24__24_;
   - __SID=yAdNJBetgwljReCOL1RX9kBlJAM8MjVJesOHLcVdF2-mMLwQtQMUMSpi2IBwjKISqnxxxx.;
   - __Secure-3PSID=yAdNJBetgwljReCOL1RX9kBlJAM8MjVJesOHLcVdF2-mMLwQv4q9ptDR0zSxBvQ9PF6kgQ.;
   - NID=204=g_d3k7sRDyZ1HNJ-ceym0tpmgr-U8X79E0_L_l2_ET_ryjLi9pXB59XrrfjmHpFXkwLMc640fp3hMzSxNus6W3uB1ALcKTtJA_lf36SGlgT3XhCzW562_lahSvakuExNlJ6SrILK7Wy-9EuwvnOE44oajmeHqy4eI9rr3W1xMCNQXa6cEQlrRykbF8T89VB_GnvownKrENIBMVebo30c4_ZeyZORmivcHEXyFvMggu6yvIOnbVQeYnY2J98Na07V4ZSCw;
   - 1P_JAR=2020-06-16-14;
   - SIDCC=AJi4QfH_yuW_DTMEc0UncAJUZgGzkuxqvrUyxkn2n403X0GiWlcj9Uplj9Xu54GN_zgLvB3zWMc
  9. sec-fetch-dest: iframe
  10. sec-fetch-mode: navigate
  11. sec-fetch-site: cross-site
  12. upgrade-insecure-requests: 1
  13. user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
  14. x-client-data: CIi2yQEIpLbJAQjBtskBCKmdygEI8KDKARibvsoBGL2+ygE=

Most of these fields can be left at their defaults when using `requests`. For the details of the headers above, please check RFC 2616: https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html.
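As a minimal sketch of supplying such headers yourself (the URL and header values below are illustrative placeholders, not taken from the capture above), `requests` accepts a plain dict of headers; preparing the request lets us inspect the final header set without sending anything:

```python
import requests

# Build (but do not send) a GET request carrying custom headers.
# The URL and header values are illustrative placeholders.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/83.0.4103.97 Safari/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9,zh-TW;q=0.8",
    "Accept-Encoding": "gzip, deflate, br",
}
req = requests.Request("GET", "https://example.com/", headers=headers)
prepared = req.prepare()  # a PreparedRequest exposes the final header set
print(prepared.method)
print(prepared.headers["User-Agent"])
```

Passing the same dict to `requests.get(url, headers=headers)` sends the request for real.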

In general cases, you have to pay attention to:

  1. User-Agent: the identification info of the request sender, such as the browser name, application type, OS type, provider, and version. The general format is:
    User-Agent: <product> / <product-version> <comment>

    or (in browser)

    User-Agent: Mozilla/<version> (<system-information>) <platform> (<platform-details>) <extensions>

    e.g. a typical Chrome user agent:

    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
  2. cookie

    Cookie is the second most important header here. The server uses this data to identify the user; cookies deserve a longer explanation of their own.
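As a quick sketch of how both headers fit together in code (the cookie names and values below are made up, not real session cookies), `requests` takes cookies as a separate dict and folds them into a single Cookie header:

```python
import requests

# Attach a User-Agent header and cookies to a request; all values are
# made-up placeholders.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) "
                         "AppleWebKit/537.36 (KHTML, like Gecko) "
                         "Chrome/51.0.2704.103 Safari/537.36"}
cookies = {"SSID": "xxxx", "APISID": "yyyy"}
req = requests.Request("GET", "https://example.com/",
                       headers=headers, cookies=cookies)
prepared = req.prepare()
print(prepared.headers["Cookie"])  # the dict is serialized into one header
```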



To begin with BeautifulSoup, let's use selenium or requests to get the full HTML content and write a simple example showing how BeautifulSoup works.

Basically, BeautifulSoup is a tool that lets you avoid using re to locate specific tags in HTML and extract their content or sibling tags. You only need to know a few APIs instead of mastering re, though we will still need re when the case is complex.

from bs4 import BeautifulSoup
from selenium import webdriver
#import requests

Our sample uses selenium. Driving a real browser lets us capture values generated by JavaScript, which plain requests cannot fetch. But keep in mind that selenium uses far more resources, because it boots a real browser.

Now we need to set up the driver configuration:

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)

Here we need to set 'headless' mode to disable the browser window; this is necessary when we have to visit many pages. Now we can use this driver to visit a URL and parse the HTML it downloads:

driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")

The code above performs the whole procedure. If you need to visit several pages, just repeat driver.get(url) and soup = BeautifulSoup(driver.page_source, "html.parser") to get a fresh object for BeautifulSoup to process.

Note: when using selenium, we can't get the status code directly (https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141), because the driver just visits the target URL and returns the corresponding HTML. You would have to use requests to get a status code. My solution is extra code that checks whether key values in the HTML have changed; e.g. if the title in the meta is weird, we can conclude we got a wrong page.
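A minimal sketch of that sanity check (the sample HTML and the keywords checked are illustrative assumptions, not the post's exact code):

```python
from bs4 import BeautifulSoup

# Guess whether a fetched page is an error page by inspecting its <title>,
# since selenium exposes no HTTP status code. The fragments and keyword
# list are illustrative assumptions.
def looks_like_error_page(html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string if soup.title else ""
    return any(word in title for word in ("404", "Not Found", "Error"))

bad = "<html><head><title>404 Not Found</title></head></html>"
good = "<html><head><title>Daily News</title></head></html>"
print(looks_like_error_page(bad))
print(looks_like_error_page(good))
```

In real use you would pass driver.page_source instead of the hard-coded fragments.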

If you have no idea what the page looks like and headless mode is enabled, you can use driver.save_screenshot('screenshot.png') to dump what the browser is actually rendering.

To get a specific part of the data in an HTML document, we need the BeautifulSoup API. You can find full coverage in the official docs, but I want to show the most useful parts.

Direct Tag Name

You can directly visit via tag’s name, e.g.:

<h1 class="dtitle">
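For instance (the fragment below is a made-up stand-in built around the tag above):

```python
from bs4 import BeautifulSoup

# Access a tag directly through its name; the HTML fragment is a made-up
# stand-in echoing the <h1 class="dtitle"> example.
soup = BeautifulSoup('<div><h1 class="dtitle">Some Title</h1></div>',
                     "html.parser")
print(soup.h1)            # the whole first <h1> tag
print(soup.h1.string)     # the text inside it
print(soup.h1["class"])   # attribute access works like a dict
```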

Tag Name + Attribute

<a data-deptid="" data-id="0" href="javascript:;">
        <a href="javascript:;">
         杨柳 刘思羽 茅羽瑶

find_all will return a list of tags; to visit them, you can write code like:

for i in soup.find_all(name='a', attrs={"href": "javascript:;"})[3:]:
    print(i.string)

If you are sure only one tag matches the requirement, or you just want the first one, you can use find rather than find_all; the former returns a single tag:

soup.find("div", {"class":"ibox"}).contents[-2].string

Notice that here we used not only string but also contents.

string returns the text directly inside the tag. For example:


If you want to get the 'text' inside, you can use string here.
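The original inline example was lost in extraction; a minimal stand-in:

```python
from bs4 import BeautifulSoup

# .string returns the text directly inside a tag. The fragment is a
# stand-in for the post's lost example.
soup = BeautifulSoup("<h1>text</h1>", "html.parser")
print(soup.h1.string)  # text
```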

And contents returns the child tags inside as a list. For example:


If you take such a tag and read its contents, you will get a list containing the h1 and the h2.
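Again the original markup was lost; a stand-in with an h1 and an h2 inside a div:

```python
from bs4 import BeautifulSoup

# .contents returns a list of a tag's direct children; this fragment is a
# stand-in for the post's lost example.
html = "<div><h1>first</h1><h2>second</h2></div>"
soup = BeautifulSoup(html, "html.parser")
print(soup.div.contents)
print([child.name for child in soup.div.contents])  # ['h1', 'h2']
```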

The last useful method is has_attr('style'). It allows you to match target tags when the attribute's exact value is not known in advance.


<p style="text-align:center">
       <img alt="4.jpg" src="/newsv2/uploadfile/20171107/1510022124881514.jpg"/>
       <span style="text-indent: 2em;">

In this case, we only want the p tags that have an img inside. Notice that such a p carries a style attribute, so we can write:

for tag_p in soup.find_all('p'):
    if len(str(tag_p.string)) > 50:
        continue  # a long pure-text paragraph, not the <p> wrapping an image
    if tag_p.has_attr('style'):
        print(tag_p)  # candidate <p> that may contain the <img>
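A runnable sketch of this filter, using a fragment modeled on the sample markup above (the second plain p tag is added for contrast):

```python
from bs4 import BeautifulSoup

# Keep only the <p> tags that carry a style attribute; the HTML is modeled
# on the sample above, with an extra plain <p> added for contrast.
html = '''
<p style="text-align:center">
  <img alt="4.jpg" src="/newsv2/uploadfile/20171107/1510022124881514.jpg"/>
</p>
<p>just a text paragraph</p>
'''
soup = BeautifulSoup(html, "html.parser")
styled = [p for p in soup.find_all('p') if p.has_attr('style')]
print(styled[0].img['alt'])  # 4.jpg
```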