爬取某网站时,发送请求后,却收到了【403 Forbidden】错误,并且返回title为Just a moment…
import requests
resp = requests.get('https://www.cosdna.com/chs/product.php?q=%E5%B0%8F&sort=featured')
print(resp.status_code)
print(resp.text)
模拟请求发送是没有问题的,如下图:
经过一番思考,发现了网站使用5s盾,也就是Cloudflare,很有可能是通过SSL指纹识别来认定我们的HTTP请求,不是从正常的浏览器发出的,而是Python之类的脚本程序,从而拦截。
SSL指纹是什么呢?简单来说,客户端发起https的请求的第一步就是向服务器发送tls握手请求,这其中就包含了客户端的一些特征。
TLS握手阶段的【Client Hello】中包含了一些我们客户端SSL Library的特征,这取决于我们客户端用的openssl版本,编译openssl的时候选择支持的Cipher suite等信息。Cloudflare有一个SSL指纹特征库,再结合【User-Agent】很容易判断出你的这个HTTP请求,根本不是来自于【Chome】或【Firefox】浏览器,于是把你拒绝。
修改特征
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context
import random
ORIGIN_CIPHERS = ('ECDH+AESGCM:DH+AESGCM:ECDH+AES256:DH+AES256:ECDH+AES128:DH+AES:ECDH+HIGH:'
'DH+HIGH:ECDH+3DES:DH+3DES:RSA+AESGCM:RSA+AES:RSA+HIGH:RSA+3DES')
class DESAdapter(HTTPAdapter):
def __init__(self, *args, **kwargs):
"""
A TransportAdapter that re-enables 3DES support in Requests.
"""
CIPHERS = ORIGIN_CIPHERS.split(':')
random.shuffle(CIPHERS)
CIPHERS = ':'.join(CIPHERS)
self.CIPHERS = CIPHERS + ':!aNULL:!eNULL:!MD5'
super().__init__(*args, **kwargs)
def init_poolmanager(self, *args, **kwargs):
context = create_urllib3_context(ciphers=self.CIPHERS)
kwargs['ssl_context'] = context
return super(DESAdapter, self).init_poolmanager(*args, **kwargs)
def proxy_manager_for(self, *args, **kwargs):
context = create_urllib3_context(ciphers=self.CIPHERS)
kwargs['ssl_context'] = context
return super(DESAdapter, self).proxy_manager_for(*args, **kwargs)
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 Edg/92.0.902.67'}
s = requests.Session()
s.headers.update(headers)
s.mount('https://www.cosdna.com/chs/product.php?q=%E5%B0%8F&sort=featured', DESAdapter())
resp = s.get('https://www.cosdna.com/chs/product.php?q=%E5%B0%8F&sort=featured')
print(resp.text)
修改指纹后,重新请求,就可以在响应结果看到页面内容了。
scrapy修改指纹:https://github.com/synodriver/awsome_scrapy_utils