Selenium
The requests library is the easiest way to send an HTTP
request to a target URL. A request consists of two parts: the request headers and the request URL, and this is exactly what your browser sends when visiting a page in a real case.
Generally the request headers are a set of key-value pairs; in Chrome's DevTools you can inspect them, for example:
- :authority: ogs.google.com
- :method: GET (the type of HTTP request, usually GET)
- :path: /u/0/widget/app?origin=chrome-search%3A%2F%2Flocal-ntp&pid=1&spid=243&hl=zh-TW&gm=
- :scheme: https
- accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
- accept-encoding: gzip, deflate, br
- accept-language: zh-CN,zh;q=0.9,zh-TW;q=0.8
- cookie:
- HSID=AUbkUAanFFXXXXXX;
- SSID=ADbqhQOYTmfjyXXXX;
- APISID=sgGGoOYyddd5pP0c/A8nx48yUAu9GnXXXX;
- SAPISID=UvOceTV27z4-pRJt/AwIDrQ_uc-LZkXXXX;
- __Secure-HSID=AUbkUAanFFRUSXXXX;
- __Secure-SSID=ADbqhQOYTmfjyXXXX;
- __Secure-APISID=sgGGoOYyddd5pP0c/A8nx48yUAu9GnXXXX;
- __Secure-3PAPISID=UvOceTV27z4-pRJt/AwIDrQ_uc-LZkXXXX;
- OTZ=5489941_24_24__24_;
- SID=yAdNJBetgwljReCOL1RX9kBlJAM8MjVJesOHLcVdF2-mMLwQtQMUMSpi2IBwjKISqnxxxx.;
- __Secure-3PSID=yAdNJBetgwljReCOL1RX9kBlJAM8MjVJesOHLcVdF2-mMLwQv4q9ptDR0zSxBvQ9PF6kgQ.;
- NID=204=g_d3k7sRDyZ1HNJ-ceym0tpmgr-U8X79E0_L_l2_ET_ryjLi9pXB59XrrfjmHpFXkwLMc640fp3hMzSxNus6W3uB1ALcKTtJA_lf36SGlgT3XhCzW562_lahSvakuExNlJ6SrILK7Wy-9EuwvnOE44oajmeHqy4eI9rr3W1xMCNQXa6cEQlrRykbF8T89VB_GnvownKrENIBMVebo30c4_ZeyZORmivcHEXyFvMggu6yvIOnbVQeYnY2J98Na07V4ZSCw;
- 1P_JAR=2020-06-16-14;
- SIDCC=AJi4QfH_yuW_DTMEc0UncAJUZgGzkuxqvrUyxkn2n403X0GiWlcj9Uplj9Xu54GN_zgLvB3zWMc
- sec-fetch-dest: iframe
- sec-fetch-mode: navigate
- sec-fetch-site: cross-site
- upgrade-insecure-requests: 1
- user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36
- x-client-data: CIi2yQEIpLbJAQjBtskBCKmdygEI8KDKARibvsoBGL2+ygE=
Most of these fields are filled in with sensible defaults when using requests. For the details of the attributes above, please check RFC 2616:
https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html.
In general, you have to pay attention to:
User-Agent: the identification info of the request sender, such as the browser name, the application type, OS type, vendor, and version. Generally the format looks like:
User-Agent: <product> / <product-version> <comment>
or (in a browser):
User-Agent: Mozilla/<version> (<system-information>) <platform> (<platform-details>) <extensions>
e.g. a typical Chrome user agent:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36
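As a rough sketch, such a User-Agent string can be attached to a request with Python's standard-library urllib (the requests library accepts a similar headers= dict; the URL here is a placeholder):

```python
import urllib.request

# Placeholder URL; the User-Agent string is the Chrome example above.
ua = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36")
req = urllib.request.Request("https://example.com/", headers={"User-Agent": ua})

# urllib.request.urlopen(req) would actually send it; here we only
# confirm that the header travels with the request object.
print(req.get_header("User-agent"))
```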
Cookie
The Cookie header is the second most important attribute here. The server uses this data to identify the user. Cookies deserve a longer explanation of their own.
//TODO
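As a quick sketch in the meantime: a Cookie header is just a semicolon-separated list of name=value pairs, which Python's standard library can already parse (the values below are shortened placeholders, not real session cookies):

```python
from http.cookies import SimpleCookie

# A shortened, made-up Cookie header in the same shape as the one above.
raw = "OTZ=5489941_24_24__24_; 1P_JAR=2020-06-16-14"
cookies = SimpleCookie()
cookies.load(raw)               # parses the name=value pairs
print(cookies["1P_JAR"].value)  # → 2020-06-16-14
```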
BeautifulSoup
To begin with BeautifulSoup, let's use selenium or requests to fetch a full HTML document and write a simple example showing how BeautifulSoup works.
Basically, BeautifulSoup is a tool that lets you avoid using re
to find specific tags in HTML and extract their content or sibling tags. You only need to know a few APIs, so there is no need to master re
, although we will still need re
when the case is complex.
from bs4 import BeautifulSoup
from selenium import webdriver
#import requests
Our sample will use selenium
. Using selenium
directly helps us handle values generated by JavaScript, so we don't need to patch things up after requests
falls short. But keep in mind that selenium
uses far more resources because it boots a real browser.
Now we need to confirm the configuration:
chrome_options= webdriver.ChromeOptions()
chrome_options.add_argument('headless')
driver = webdriver.Chrome(options=chrome_options)
Here we set 'headless' mode to suppress the browser window, which is necessary when we need to visit many pages. Now we can use this driver to visit a URL and get back the downloaded HTML:
url='https://news.cqu.edu.cn/newsv2/show-14-10280-1.html'
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
The code above performs that procedure. If you need to visit several pages, just repeat driver.get(url)
and soup = BeautifulSoup(driver.page_source, "html.parser")
to get a fresh object that BeautifulSoup can process.
Note: when using selenium
, we can't get the status code directly (https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141), because the driver just visits the target URL and returns the corresponding HTML. You would have to use requests
to get a status code. My workaround is extra code that checks whether key values in the HTML look right; e.g. if the title in the meta tags is weird, we can judge that we got a wrong page.
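Such a check might look like the following sketch (the regex and the error markers are assumptions; tune them to the site you scrape):

```python
import re

def page_looks_wrong(html: str) -> bool:
    # Pull the <title> out of the fetched HTML and look for error markers,
    # since selenium gives us no HTTP status code to inspect.
    match = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title = match.group(1) if match else ""
    return any(marker in title for marker in ("404", "Not Found", "Error"))

print(page_looks_wrong("<html><title>404 Not Found</title></html>"))  # → True
print(page_looks_wrong("<html><title>News detail</title></html>"))    # → False
```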
If you have no idea what the page will look like and you have enabled headless mode, you can use :
To extract a specific part of the data in an HTML document, we need the BeautifulSoup API. You can find full coverage in the official docs, but I want to show some of the most useful calls.
Direct Tag Name
You can access a tag directly via its name, e.g.:
<h1 class="dtitle">努力“乒”出梦想,健康“羽”你同行</h1>
soup.h1.string
#'努力“乒”出梦想,健康“羽”你同行'
Tag Name + Attribute
<a data-deptid="" data-id="0" href="javascript:;"> </a>
<a href="javascript:;"> 杨柳 刘思羽 茅羽瑶 </a>
soup.find_all(name='a',attrs={"href":"javascript:;"})
find_all
will return a list of tags; to visit them, you can write code like:
for i in soup.find_all(name='a', attrs={"href": "javascript:;"})[3:]:
    print(i.string)
If you are sure there is only one tag matching the requirement, or you just want the first one, you can use find
rather than find_all
; the former returns a single tag:
soup.find("div", {"class":"ibox"}).contents[-2].string
Notice that here we used not only string
but also contents
.
string
returns the text inside a tag, for example:
<tag>text</tag>
If you want to get 'text' inside, you can use string
here.
And contents
returns the child tags inside, for example:
<tag>
<h1>Title</h1>
<h2>subtitle</h2>
</tag>
If you take tag
and use its contents
, you will get a list containing h1
and h2
.
The last useful method is has_attr('style')
. It lets you select target tags by the presence of an attribute when its exact value is not known.
e.g.
<p style="text-align:center">
<img alt="4.jpg" src="/newsv2/uploadfile/20171107/1510022124881514.jpg" />
</p>
<p>
<span style="text-indent: 2em;">
羽毛球场上,不同学院之间也展开了激烈的个人和团体赛。随着午间温度的上升,选手们为了更好地发挥实力,纷纷脱下外套,轻装上阵。雪白的球在球场上空随着选手们不断挥舞的球拍转换着位置。裁判们也始终跟紧了目光,在计分板上记录下比赛双方的实时分数。
</span>
</p>
In this case, we only want to get the p
tags that have an img
inside. Notice that this p
has a style attribute, so we can:
for tag_p in soup.find_all('p'):
    if len(str(tag_p.string)) > 50:
        print(tag_p.string)
    if tag_p.has_attr('style'):
        print(tag_p.string)