Python 3 Web Scraping: Using the Basic Libraries

1 urllib

Python's built-in HTTP request library, consisting of four modules:

  1. request: simulates sending requests
  2. error: exception handling, so the program does not terminate abnormally
  3. parse: URL handling, including splitting, parsing, joining, etc.
  4. robotparser: parses a site's robots.txt file to determine which pages may be crawled

1.1 Basic usage of urllib.request.urlopen

The API of urlopen is: urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
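The cafile, capath and context parameters concern HTTPS certificate handling; context accepts an ssl.SSLContext object. A minimal sketch, assuming the default system CA certificates are sufficient:

import ssl
import urllib.request

# Build a default SSL context (loads the system CA certificates) and pass it
# via the context parameter to control how certificates are verified.
context = ssl.create_default_context()
response = urllib.request.urlopen('https://www.python.org/', context=context)
print(response.status)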

1.1.1 GET

import urllib.request

response = urllib.request.urlopen("https://www.python.org/")
print(response.read().decode('utf8'))
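The object returned by urlopen is an http.client.HTTPResponse; besides read(), it also exposes the status code and the response headers. A minimal sketch:

import urllib.request

response = urllib.request.urlopen('https://www.python.org/')
print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # all response headers as (name, value) tuples
print(response.getheader('Server'))  # the value of a single response header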

1.1.2 POST

  1. Add the data parameter. Its content must be a byte stream, i.e. of type bytes, so it needs to be converted with the bytes() method
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf8'))
  2. In the returned result we can see the parameters we passed:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-5fde1150-77771d313337109b23a16596"
  },
  "json": null,
  "origin": "153.3.60.156",
  "url": "http://httpbin.org/post"
}

1.1.3 Setting a timeout

  1. Set the timeout parameter. timeout=0.1 means the request times out if the server does not respond within 0.1 seconds. Test code:
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read().decode('utf8'))
  2. Running this code raises a urllib.error.URLError. Since this exception belongs to the urllib.error module, it should be handled:
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check whether the exception is of type socket.timeout
        print('TIME OUT')

1.2 The Request class: building more complete requests

urlopen() can only issue the most basic requests; with the Request class we can build a more complete request. Its API is Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  1. url is the only required parameter
  2. headers is the request headers. The most common use is to modify User-Agent to disguise the crawler as a browser; the default is Python-urllib. For example, to pretend to be Firefox it can be set to Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
  3. origin_req_host specifies the requester's host name or IP address
  4. unverifiable indicates whether the request is unverifiable. True means the user does not have sufficient permission to accept the result of the request, e.g. an image embedded in an HTML document that we have no permission to fetch automatically
  5. method is the HTTP method of the request, e.g. GET/POST/PUT
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11',
    'Host': 'httpbin.org'
}
data_dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(data_dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf8'))

The returned result is:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11",
    "X-Amzn-Trace-Id": "Root=1-5fde1845-6dd69e756aef8d7d196283c0"
  },
  "json": null,
  "origin": "153.3.60.156",
  "url": "http://httpbin.org/post"
}

data, headers and method were all set successfully.
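Headers can also be attached after the Request object has been constructed, via its add_header() method; a minimal sketch:

from urllib import request

req = request.Request('http://httpbin.org/get')
# add_header takes the header name and its value
req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11')
response = request.urlopen(req)
print(response.read().decode('utf8'))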

1.3 Obtaining Cookies

1.3.1 Printing cookies to the console

  1. First declare a CookieJar object
  2. Use HTTPCookieProcessor to build a handler
  3. Use build_opener to build an opener, then call its open method
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name + " = " + item.value)

The output contains the name and value of each cookie:

BAIDUID = 74596F1DEF3244E28A23ECA2168D97F4:FG=1
BIDUPSID = 74596F1DEF3244E2AC018C7F10BFE6E0
PSTM = 1608453123
BD_NOT_HTTPS = 1

1.3.2 Saving cookies to a file in Mozilla format

import http.cookiejar
import urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

The generated cookies.txt contains:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.

.baidu.com TRUE / FALSE 1639989697 BAIDUID 418B75A649A30AB1C79CED2D32734FBF:FG=1
.baidu.com TRUE / FALSE 3755937344 BIDUPSID 418B75A649A30AB11E06E56E65FE5E5B
.baidu.com TRUE / FALSE 3755937344 PSTM 1608453695
www.baidu.com FALSE / FALSE 1608453997 BD_NOT_HTTPS 1

1.3.3 Saving cookies to a file in libwww-perl (LWP) format

Only the line that creates the cookie jar needs to change:

cookie = http.cookiejar.LWPCookieJar(filename)


The generated file then contains:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="F3403A28BB17E897CF386749420DF132:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2021-12-20 08:46:27Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=F3403A28BB17E8971CC847C9DC1B41F7; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-01-07 12:00:34Z"; version=0
Set-Cookie3: PSTM=1608453985; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-01-07 12:00:34Z"; version=0
Set-Cookie3: BD_NOT_HTTPS=1; path="/"; domain="www.baidu.com"; path_spec; expires="2020-12-20 08:51:27Z"; version=0

1.4 Using the generated cookies file

  1. Using a cookies file in MozillaCookieJar format:
import http.cookiejar
import urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf8'))
  2. To use cookies in LWPCookieJar format and print the page source, only the cookie line needs to change:
cookie = http.cookiejar.LWPCookieJar()

1.5 Handling exceptions

1.5.1 URLError

The URLError class comes from the error module of urllib and inherits from OSError; it is the base class of the error module, and any exception raised by the request module can be handled by catching it. It has one attribute, reason, which returns the cause of the error.

from urllib import request, error

try:
    response = request.urlopen('https://www.baidu.com/xiatiande.htm')
except error.URLError as e:
    print(e.reason)

This prints Not Found.

1.5.2 HTTPError

A subclass of URLError, dedicated to handling HTTP request errors. It has three attributes: code returns the HTTP status code, reason is the same as in the parent class, and headers returns the response headers.

from urllib import request, error

try:
    response = request.urlopen('https://www.baidu.com/xiatiande.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

The output is:

Not Found
404
Content-Length: 211
Content-Type: text/html; charset=iso-8859-1
Date: Sun, 20 Dec 2020 09:30:22 GMT
Server: Apache
Connection: close

1.5.3 A better way to handle exceptions

  1. Catch HTTPError first, then URLError, and finally handle the normal flow in an else clause:
from urllib import request, error

try:
    response = request.urlopen('https://www.baidu.com/xiatiande.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

  2. Sometimes the reason attribute is not a string but an object. In that case we can use isinstance() to check its type and handle the exception more precisely:
from urllib import request, error
import socket

try:
    response = request.urlopen('https://www.baidu.com', timeout=0.01)
except error.URLError as e:
    print(e.reason)

The output is _ssl.c:1074: The handshake operation timed out. Modify the code:

from urllib import request, error
import socket

try:
    response = request.urlopen('https://www.baidu.com', timeout=0.01)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Output:

<class 'socket.timeout'>
TIME OUT

1.6 Parsing URLs

1.6.1 urllib.parse.urlparse

The API of urlparse:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
scheme is the protocol. If the URL itself carries no protocol, the scheme argument is used as the default; otherwise the protocol in the URL takes effect.
allow_fragments controls whether the fragment part is parsed out.

  1. Use the urlparse method to parse a URL:

    from urllib.parse import urlparse

    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
    print(type(result), result, sep='\n')
  2. The result is:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
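
The scheme and allow_fragments parameters described above can be illustrated with a minimal sketch (expected output shown as comments):

from urllib.parse import urlparse

# The URL carries no protocol, so the scheme argument supplies the default
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

# With allow_fragments=False the fragment is not split out; it stays part of the path (or the query, if one exists)
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')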

1.6.2 urllib.parse.urlencode

urlencode can serialize a dictionary into GET request parameters:

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}

base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output: http://www.baidu.com?name=germey&age=22
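
urlencode also percent-encodes values that are not URL-safe, such as spaces and non-ASCII characters; a minimal sketch:

from urllib.parse import urlencode

# Non-ASCII values are UTF-8 encoded and percent-escaped; spaces become '+'
params = {'wd': '爬虫', 'q': 'hello world'}
print(urlencode(params))
# wd=%E7%88%AC%E8%99%AB&q=hello+world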

1.7 The Robots protocol

The urllib.robotparser.RobotFileParser class can determine, based on a site's robots.txt file, whether a crawler is allowed to fetch a given page.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
# can_fetch checks whether crawling is allowed; the first argument is the user agent, the second is the URL to fetch
print(rp.can_fetch('*', 'https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct='
                   '201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=asdfmovie'))

2 requests

2.1 get

  1. Basic usage
import requests

r = requests.get('http://httpbin.org/get')  # params, headers and other arguments can be added (see the sketch after this list)
print(r.text)
  2. Using cookies to maintain a logged-in session
import requests

headers = {
    'Cookie': 'copy this value from the site after logging in',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66'
}

r = requests.get('https://www.zhihu.com/explore', headers=headers)
print(r.text)
  3. Skipping SSL certificate verification and suppressing the warning
import requests
import logging

logging.captureWarnings(True)
response = requests.get('https://www.testurl.blabla', verify=False)
print(response.status_code)
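
As mentioned in item 1, requests.get also accepts a params argument that is serialized into the query string for you; a minimal sketch using httpbin.org, which echoes the request back:

import requests

# The params dict is appended to the URL as the query string
r = requests.get('http://httpbin.org/get',
                 params={'name': 'germey', 'age': 22},
                 headers={'User-Agent': 'Mozilla/5.0'})
print(r.url)          # http://httpbin.org/get?name=germey&age=22
print(r.status_code)  # 200 on success
print(r.json())       # the JSON body parsed into a Python dict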

