Python 3 Web Scraping: Using the Basic Libraries

1 urllib

Python's built-in HTTP request library, consisting of four modules:

  1. request: simulates sending requests
  2. error: exception handling, so the program does not terminate abnormally
  3. parse: URL handling, including splitting, parsing, joining, etc.
  4. robotparser: parses a site's robots.txt file to determine which pages may be crawled

1.1 Basic usage of urllib.request.urlopen

The API of urlopen is: urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
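The cafile, capath and context parameters concern HTTPS certificate handling; context accepts an ssl.SSLContext object. A minimal sketch, assuming the default system CA certificates are sufficient:

import ssl
import urllib.request

# Build a default SSL context (loads the system CA certificates) and pass it
# via the context parameter to control how certificates are verified.
context = ssl.create_default_context()
response = urllib.request.urlopen('https://www.python.org/', context=context)
print(response.status)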

1.1.1 GET

import urllib.request

response = urllib.request.urlopen("https://www.python.org/")
print(response.read().decode('utf8'))
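The object returned by urlopen is an http.client.HTTPResponse; besides read(), it also exposes the status code and the response headers. A minimal sketch:

import urllib.request

response = urllib.request.urlopen('https://www.python.org/')
print(response.status)               # HTTP status code, e.g. 200
print(response.getheaders())         # all response headers as (name, value) tuples
print(response.getheader('Server'))  # the value of a single response header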

1.1.2 POST

  1. Add the data parameter. Its content must be a byte stream, i.e. of type bytes, so it needs to be converted with the bytes() method
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf8'))
  2. In the returned result we can see the parameters we passed:
{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "word": "hello"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7",
    "X-Amzn-Trace-Id": "Root=1-5fde1150-77771d313337109b23a16596"
  },
  "json": null,
  "origin": "153.3.60.156",
  "url": "http://httpbin.org/post"
}

1.1.3 Setting a timeout

  1. Set the timeout parameter. timeout=0.1 means the request times out if the server does not respond within 0.1 seconds. Test code:
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
print(response.read().decode('utf8'))
  2. Running this code raises a urllib.error.URLError. Since this exception belongs to the urllib.error module, it should be handled:
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check whether the exception is of type socket.timeout
        print('TIME OUT')

1.2 The Request class: building more complete requests

urlopen() can only issue the most basic requests; with the Request class we can build a more complete request. Its API is Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  1. url is the only required parameter
  2. headers is the request headers. The most common use is to modify User-Agent to disguise the crawler as a browser; the default is Python-urllib. For example, to pretend to be Firefox it can be set to Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
  3. origin_req_host specifies the requester's host name or IP address
  4. unverifiable indicates whether the request is unverifiable. True means the user does not have sufficient permission to accept the result of the request, e.g. an image embedded in an HTML document that we have no permission to fetch automatically
  5. method is the HTTP method of the request, e.g. GET/POST/PUT
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11',
    'Host': 'httpbin.org'
}
data_dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(data_dict), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf8'))

The returned result is:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "Germey"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Content-Length": "11",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11",
    "X-Amzn-Trace-Id": "Root=1-5fde1845-6dd69e756aef8d7d196283c0"
  },
  "json": null,
  "origin": "153.3.60.156",
  "url": "http://httpbin.org/post"
}

data, headers and method were all set successfully.
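Headers can also be attached after the Request object has been constructed, via its add_header() method; a minimal sketch:

from urllib import request

req = request.Request('http://httpbin.org/get')
# add_header takes the header name and its value
req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11')
response = request.urlopen(req)
print(response.read().decode('utf8'))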

1.3 Obtaining Cookies

1.3.1 Printing cookies to the console

  1. First declare a CookieJar object
  2. Use HTTPCookieProcessor to build a handler
  3. Use build_opener to build an opener, then call its open method
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
for item in cookie:
    print(item.name + " = " + item.value)

The output contains the name and value of each cookie:

BAIDUID = 74596F1DEF3244E28A23ECA2168D97F4:FG=1
BIDUPSID = 74596F1DEF3244E2AC018C7F10BFE6E0
PSTM = 1608453123
BD_NOT_HTTPS = 1

1.3.2 Saving cookies to a file in Mozilla format

import http.cookiejar
import urllib.request

filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

The generated cookies.txt contains:

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This is a generated file! Do not edit.

.baidu.com TRUE / FALSE 1639989697 BAIDUID 418B75A649A30AB1C79CED2D32734FBF:FG=1
.baidu.com TRUE / FALSE 3755937344 BIDUPSID 418B75A649A30AB11E06E56E65FE5E5B
.baidu.com TRUE / FALSE 3755937344 PSTM 1608453695
www.baidu.com FALSE / FALSE 1608453997 BD_NOT_HTTPS 1

1.3.3 Saving cookies to a file in libwww-perl (LWP) format

Only the line that creates the cookie jar needs to change:

cookie = http.cookiejar.LWPCookieJar(filename)


The generated file then contains:

#LWP-Cookies-2.0
Set-Cookie3: BAIDUID="F3403A28BB17E897CF386749420DF132:FG=1"; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2021-12-20 08:46:27Z"; comment=bd; version=0
Set-Cookie3: BIDUPSID=F3403A28BB17E8971CC847C9DC1B41F7; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-01-07 12:00:34Z"; version=0
Set-Cookie3: PSTM=1608453985; path="/"; domain=".baidu.com"; path_spec; domain_dot; expires="2089-01-07 12:00:34Z"; version=0
Set-Cookie3: BD_NOT_HTTPS=1; path="/"; domain="www.baidu.com"; path_spec; expires="2020-12-20 08:51:27Z"; version=0

1.4 Using the generated cookies file

  1. Using a cookies file in MozillaCookieJar format:
import http.cookiejar
import urllib.request

cookie = http.cookiejar.MozillaCookieJar()
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf8'))
  2. To use cookies in LWPCookieJar format and print the page source, only the cookie line needs to change:
cookie = http.cookiejar.LWPCookieJar()

1.5 Handling exceptions

1.5.1 URLError

The URLError class comes from the error module of urllib and inherits from OSError; it is the base class of the error module, and any exception raised by the request module can be handled by catching it. It has one attribute, reason, which returns the cause of the error.

from urllib import request, error

try:
    response = request.urlopen('https://www.baidu.com/xiatiande.htm')
except error.URLError as e:
    print(e.reason)

This prints Not Found.

1.5.2 HTTPError

A subclass of URLError, dedicated to handling HTTP request errors. It has three attributes: code returns the HTTP status code, reason is the same as in the parent class, and headers returns the response headers.

from urllib import request, error

try:
    response = request.urlopen('https://www.baidu.com/xiatiande.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')

The output is:

Not Found
404
Content-Length: 211
Content-Type: text/html; charset=iso-8859-1
Date: Sun, 20 Dec 2020 09:30:22 GMT
Server: Apache
Connection: close

1.5.3 A better way to handle exceptions

  1. Catch HTTPError first, then URLError, and finally handle the normal flow in an else clause:
from urllib import request, error

try:
    response = request.urlopen('https://www.baidu.com/xiatiande.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

  2. Sometimes the reason attribute is not a string but an object. In that case we can use isinstance() to check its type and handle the exception more precisely:
from urllib import request, error
import socket

try:
    response = request.urlopen('https://www.baidu.com', timeout=0.01)
except error.URLError as e:
    print(e.reason)

The output is _ssl.c:1074: The handshake operation timed out. Modify the code:

from urllib import request, error
import socket

try:
    response = request.urlopen('https://www.baidu.com', timeout=0.01)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

Output:

<class 'socket.timeout'>
TIME OUT

1.6 Parsing URLs

1.6.1 urllib.parse.urlparse

The API of urlparse:
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
scheme is the protocol. If the URL itself carries no protocol, the scheme argument is used as the default; otherwise the protocol in the URL takes effect.
allow_fragments controls whether the fragment part is parsed out.

  1. Use the urlparse method to parse a URL:

    from urllib.parse import urlparse

    result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
    print(type(result), result, sep='\n')
  2. The result is:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
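
The scheme and allow_fragments parameters described above can be illustrated with a minimal sketch (expected output shown as comments):

from urllib.parse import urlparse

# The URL carries no protocol, so the scheme argument supplies the default
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# ParseResult(scheme='https', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')

# With allow_fragments=False the fragment is not split out; it stays part of the path (or the query, if one exists)
result = urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html#comment', params='', query='', fragment='')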

1.6.2 urllib.parse.urlencode

urlencode can serialize a dictionary into GET request parameters:

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}

base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

Output: http://www.baidu.com?name=germey&age=22
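
urlencode also percent-encodes values that are not URL-safe, such as spaces and non-ASCII characters; a minimal sketch:

from urllib.parse import urlencode

# Non-ASCII values are UTF-8 encoded and percent-escaped; spaces become '+'
params = {'wd': '爬虫', 'q': 'hello world'}
print(urlencode(params))
# wd=%E7%88%AC%E8%99%AB&q=hello+world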

1.7 The Robots protocol

The urllib.robotparser.RobotFileParser class can determine, based on a site's robots.txt file, whether a crawler is allowed to fetch a given page.

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.baidu.com/robots.txt')
rp.read()
# can_fetch checks whether crawling is allowed; the first argument is the user agent, the second is the URL to fetch
print(rp.can_fetch('*', 'https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct='
                   '201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=asdfmovie'))

2 requests

2.1 get

  1. Basic usage
import requests

r = requests.get('http://httpbin.org/get')  # params, headers and other arguments can be added (see the sketch after this list)
print(r.text)
  2. Using cookies to maintain a logged-in session
import requests

headers = {
    'Cookie': 'copy this value from the site after logging in',
    'Host': 'www.zhihu.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36 Edg/87.0.664.66'
}

r = requests.get('https://www.zhihu.com/explore', headers=headers)
print(r.text)
  3. Skipping SSL certificate verification and suppressing the warning
import requests
import logging

logging.captureWarnings(True)
response = requests.get('https://www.testurl.blabla', verify=False)
print(response.status_code)
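
As mentioned in item 1, requests.get also accepts a params argument that is serialized into the query string for you; a minimal sketch using httpbin.org, which echoes the request back:

import requests

# The params dict is appended to the URL as the query string
r = requests.get('http://httpbin.org/get',
                 params={'name': 'germey', 'age': 22},
                 headers={'User-Agent': 'Mozilla/5.0'})
print(r.url)          # http://httpbin.org/get?name=germey&age=22
print(r.status_code)  # 200 on success
print(r.json())       # the JSON body parsed into a Python dict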

