Miscellaneous: A Simple Crawler

Crawler concept: write a program that fetches resources from the internet.
```python
from urllib.request import urlopen

url = "http://www.baidu.com"
resp = urlopen(url)

# Save the page source to a local file
with open("mybaidu.html", mode="w") as f:
    f.write(resp.read().decode("utf-8"))
```
This produces a local copy of the Baidu homepage.
Two request methods: GET and POST.
Two rendering modes: server-side rendering (the data is embedded directly in the page source) and client-side rendering (the data is fetched by a second request, e.g. an XHR/AJAX call, after the page loads).

How to tell them apart: check whether the data can be seen in the page source.
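A rough sketch of that check, assuming a placeholder URL and a keyword you expect to see in the rendered page (both are illustrative, not taken from any specific site):

```python
import requests

url = "https://example.com/some-page"   # placeholder URL
keyword = "目标数据"                      # text visible in the rendered page

resp = requests.get(url)
resp.encoding = "utf-8"

# If the keyword appears in the raw source, the page is server-side rendered
# and a plain GET plus an HTML parser is enough; otherwise the data probably
# arrives via a second (XHR/JSON) request that has to be captured separately.
if keyword in resp.text:
    print("server-side rendered: parse resp.text directly")
else:
    print("client-side rendered: look for the XHR request in the Network panel")
```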
The HTTP Protocol

A protocol is a gentleman's agreement set up between two computers so they can communicate smoothly. Common protocols: TCP/IP, HTTP.

HTTP: HyperText Transfer Protocol.
Both the request and the response are divided into three parts.

Request:

- Request line: request method, request URL, protocol version
- Request headers: additional information the server needs
- Request body: request parameters

Response:

- Status line: protocol version, status code
- Response headers: additional information for the client
- Response body: the content the server returns to the client (HTML, JSON, ...)
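A small sketch of how these parts surface when using requests (reusing the Baidu URL from above):

```python
import requests

resp = requests.get("http://www.baidu.com")

# Request side: method, URL and the headers that were actually sent
print(resp.request.method, resp.request.url)
print(resp.request.headers)

# Response side: status line information, headers and body
print(resp.status_code)   # status code from the status line
print(resp.headers)       # response headers
print(resp.text[:200])    # beginning of the response body
```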
Two ways to open a file

```python
# Option 1: context manager, the file is closed automatically
with open("baidutest.html", mode="w", encoding="utf-8") as f1:
    f1.write(resp.read().decode("utf-8"))

# Option 2: open manually and close explicitly
f2 = open("baidutest.html", mode="w", encoding="utf-8")
f2.close()
```
Three ways to format output

```python
i = 1
print('asdas{:d}dasd'.format(i))   # str.format
print(f'asdas{i}dasd')             # f-string
print('asdas%ddasd' % (i,))        # %-formatting
```
For data extraction there are three parsing approaches: re, bs4, and XPath.
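A minimal side-by-side sketch of the three approaches on the same tiny HTML snippet (the snippet and selectors are made up for illustration):

```python
import re
from bs4 import BeautifulSoup
from lxml import etree

html = '<div class="item"><a href="/book/1">Python</a></div>'

# re: match the raw text directly
m = re.search(r'<a href="(?P<href>.*?)">(?P<title>.*?)</a>', html)
print(m.group("href"), m.group("title"))

# bs4: navigate the parsed document with find()
a = BeautifulSoup(html, "html.parser").find("div", class_="item").find("a")
print(a.get("href"), a.text)

# xpath: describe the path to the node
tree = etree.HTML(html)
print(tree.xpath('//div[@class="item"]/a/@href')[0],
      tree.xpath('//div[@class="item"]/a/text()')[0])
```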
Applying the re module

Douban "want to read" list crawler:
```python
import re, requests, csv

head = {"User-agent": "OperaWin7:Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50"}

# One match per book entry: the title and the publication info
obj = re.compile(r'<li class="subject-item">.*?title="(?P<name>.*?)".*?<div class="pub">(?P<information>.*?)</div>', re.S)

f = open("pics.csv", mode="w", encoding="utf-8", newline="")
csvwrite = csv.writer(f)

i = 0
while i < 1141:
    # The wish list is paginated in steps of 15 via the start parameter
    url = "https://book.douban.com/people/211783344/wish?start=" + str(i)
    resp = requests.get(url=url, headers=head)
    res = obj.finditer(resp.text)
    for it in res:
        dic = it.groupdict()
        dic["name"] = '《' + dic["name"].strip() + '》 ---- '
        dic["information"] = dic["information"].strip()
        csvwrite.writerow(dic.values())
    i = i + 15

f.close()
resp.close()
```
dytt (电影天堂) download-link crawler:
```python
import requests, re

domain = "http://www.dytt89.com/"

resp1 = requests.get(url=domain, verify=False)
resp1.encoding = "gb2312"

# Three patterns: the "2022必看热片" block, the detail-page links inside it,
# and the film name plus magnet link on each detail page
obj1 = re.compile(r"2022必看热片.*?<ul>(?P<ul>.*?)</ul>", re.S)
obj2 = re.compile(r"<li><a href='(?P<href>.*?)'.*?>(?P<film>.*?)</a>", re.S)
obj3 = re.compile(r'◎片 名(?P<name>.*?)<.*?<td style="WORD-WRAP:.*?href="(?P<download>.*?)">magnet:', re.S)

res1 = obj1.search(resp1.text)
ul = res1.group('ul')

res2 = obj2.finditer(ul)
for it2 in res2:
    # Detail-page URL = domain + relative href
    url = domain.rstrip('/') + it2.group("href")
    resp2 = requests.get(url, verify=False)
    resp2.encoding = 'gb2312'
    res3 = obj3.finditer(resp2.text)
    for it3 in res3:
        print(it3.group("name").strip(), end=":")
        print(it3.group("download"))
```
The bs4 library:
```python
from bs4 import BeautifulSoup
```
Common methods:
```python
main_page = BeautifulSoup(resp.text, "html.parser")
taotus = main_page.find_all("div", class_="taotu-main")
```
When the attribute name clashes with a Python keyword (e.g. class), pass it through attrs:
```python
table = page.find("table", attrs={"class": "hq_table"})
```
Example: crawling wallpaper images from umei.cc
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.umei.cc/bizhitupian/"
resp = requests.get(url)
resp.encoding = 'utf-8'

main_page = BeautifulSoup(resp.text, "html.parser")
taotus = main_page.find_all("div", class_="taotu-main")

for taotu in taotus:
    pictures = taotu.find_all("p")
    for picture in pictures:
        # Follow the link on the list page to the detail page
        href = picture.find("a").get("href")
        child_page_resp = requests.get(f"https://www.umei.cc/{href}")
        child_page_resp.encoding = "utf-8"
        cptext = child_page_resp.text
        child_page = BeautifulSoup(cptext, "html.parser")

        # The real image URL sits inside div.big-pic > a > img
        page = child_page.find("div", class_="big-pic")
        a = page.find("a")
        img = a.find("img")
        src = img.get("src")

        # Download the binary image content
        img_resp = requests.get(src)
        img_name = src.split("/")[-1]
        with open("img" + img_name, mode="wb") as f:
            f.write(img_resp.content)
```
XPath

XPath is a language for searching content in XML documents, and HTML can be treated as a subset of XML.
```python
from lxml import etree

# Build a tree from an HTML string or an XML string
tree = etree.HTML("<html><body><book>web scraping</book></body></html>")
tree = etree.XML("<book>web scraping</book>")

# Then query it with an XPath expression
print(tree.xpath("/book"))
```
Syntax basics:
```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
</head>
<body>
    <ul>
        <li><a href="http://www.baidu.com">百度</a></li>
        <li><a href="http://www.google.com">谷歌</a></li>
        <li><a href="http://www.sogou.com">搜狗</a></li>
    </ul>
    <ol>
        <li><a href="feiji">飞机</a></li>
        <li><a href="dapao">大炮</a></li>
        <li><a href="huoche">火车</a></li>
    </ol>
    <div class="job">李嘉诚</div>
    <div class="common">胡辣汤</div>
</body>
</html>
```
```python
from lxml import etree

tree = etree.parse("b.html")

result1 = tree.xpath("/html/body/ol/li/a[@href='feiji']/text()")   # filter by attribute
result2 = tree.xpath("/html//li/a/text()")                          # // matches descendants at any depth
result3 = tree.xpath("/html/body/ul/li")

for it in result3:
    # Relative XPath starting from the current node
    temp1 = it.xpath("./a/text()")
    print(temp1)
    temp2 = it.xpath("./a/@href")
    print(temp2)
    temp3 = it.xpath("./a[position()>1]")   # position predicate
```
Example: crawling zbj.com (猪八戒网)
```python
from lxml import etree
import requests

resp = requests.get("https://cangzhou.zbj.com/search/service/?kw=saas&l=0&r=2")
tree = etree.HTML(resp.text)

divs = tree.xpath('/html/body//div[@class="search-result-list-service search-result-list-service-width"]/div')
for div in divs:
    name = div.xpath(".//div[@class = 'name-pic-box']/a/text()")
    price = div.xpath("./div//div[@class = 'price']/span/text()")
    print(price, end=' ')
    print(name)

'''
for div in divs:
    price = div.xpath("./div/div[3]/div[1]/span/text()")
    name = div.xpath("./div/div[2]/div[2]/a/text()")
    print(price, end=' ')
    print(name)
# Absolute / positional paths are unreliable: the page layout may not follow a regular
# pattern, and every site update forces you to re-locate the nodes.
# Selecting by attribute (like [@class=...] above) is the better approach.
'''
```
Get familiar with the browser's packet-capture tools (devtools) so you can locate the content you need on the page and parse it precisely.
Cookies

A cookie is a string of characters a website uses to identify a user.

Steps:

1. Log in -> obtain the cookie
2. Carry the cookie when requesting the bookshelf URL -> get the contents of the bookshelf

The two steps have to be chained together: the second request must immediately reuse the information obtained in the first.
Method 1: use a session so the cookie from the login is carried automatically.
```python
import requests

url = "https://passport.17k.com/ck/user/login"
data = {
    'loginName': '15532608912',
    'password': 'asd123'
}

# The session stores the cookie returned by the login request
sess = requests.session()
sess.post(url, data=data)

# The same session sends that cookie along with the bookshelf request
resp = sess.get("https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919")
print(resp.json())
```
Method 2: copy the cookie from the browser and put it in the request headers.
```python
import requests

head = {"cookie": "GUID=31f82c40-9141-4860-a059-8eb54cb91250; sajssdk_2015_cross_new_user=1; Hm_lvt_9793f42b498361373512340937deb2a0=1668070477; c_channel=0; c_csc=web; accessToken=avatarUrl%3Dhttps%253A%252F%252Fcdn.static.17k.com%252Fuser%252Favatar%252F00%252F00%252F03%252F99430300.jpg-88x88%253Fv%253D1668071104000%26id%3D99430300%26nickname%3D%25E4%25B9%25A6%25E5%258F%258B777e827gw%26e%3D1683623230%26s%3Db904c643179ed595; sensorsdata2015jssdkcross=%7B%22distinct_id%22%3A%2299430300%22%2C%22%24device_id%22%3A%2218460bf8d334a4-012d512c1dffe8-3e604809-1049088-18460bf8d3414f%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E5%BC%95%E8%8D%90%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22https%3A%2F%2Fgraph.qq.com%2Foauth2.0%2Fshow%3Fwhich%3DLogin%26display%3Dpc%26response_type%3Dcode%26client_id%3D215314%26scope%3Dget_user_info%26redirect_uri%3Dhttps%253A%252F%252Fpassport.17k.com%252Fsns%252FqqCallback.action%253FfromUrl%253Dhttp%22%2C%22%24latest_referrer_host%22%3A%22graph.qq.com%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC%22%7D%2C%22first_id%22%3A%2231f82c40-9141-4860-a059-8eb54cb91250%22%7D; Hm_lpvt_9793f42b498361373512340937deb2a0=1668071834"}

resp = requests.get("https://user.17k.com/ck/author/shelf?page=1&appKey=2406394919", headers=head)
print(resp.json())
```
Anti-anti-scraping: hotlink protection (防盗链)

Example site with hotlink protection: pearvideo (梨视频).

Analysis

The pearvideo site delivers the video in a second transfer (client-side rendering): only the second request carries the video information, so the video URL cannot be seen in the page source.

The playback address can still be found in the browser's Inspect panel, because the DOM shown there is updated in real time.
The video element:
```html
<video webkit-playsinline="" playsinline="" x-webkit-airplay="" autoplay="autoplay" src="https://video.pearvideo.com/mp4/short/20170818/cont-1135255-10772797-hd.mp4" style="width: 100%; height: 100%;"></video>
```
The value after src is the playback address.

Since the page loads in two passes, the file must be transferred while the page is opening.

In the Network tab of the devtools there is exactly one XHR request, named videoStatus...

Its Preview contains a field srcUrl: https://video.pearvideo.com/mp4/short/20170818/1668059979658-10772797-hd.mp4

This address differs from the real video address above only in the cont-<number> part; that cont-<number> is the key piece that has to be substituted back in when scraping.

The example video's page is https://www.pearvideo.com/video_1135255, so cont-1135255 is the replacement value.

From the analysis above, the steps are:
1. Find the page for the video
2. Find the srcUrl field in the captured videoStatus XHR response
3. ==Hotlink protection==
   This step requests the Request URL shown in the XHR's Headers tab. Doing so directly returns an "article has been taken offline" error even though the page is still open; that is the hotlink protection at work.
   If a normal visit follows the chain 1 -> 2 -> 3, the hotlink check looks back when 2 or 3 is requested to see whether the earlier steps actually happened. A request like NULL -> 2 looks suspicious and gets refused.
   Solution: just as with adding a User-Agent, add a Referer field to the request headers.
4. Substitute the replacement value into srcUrl
5. Download the video from the rewritten address

Abstracted to code-level steps:

1. Get the cont-id
2. Get srcUrl from the videoStatus response (JSON)
3. Replace part of srcUrl with cont-<id>
4. Download the video
Implementation:
```python
import requests

url = 'https://www.pearvideo.com/video_1135255'
cont_id = url.split("_")[-1]

head = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36",
    # Referer defeats the hotlink protection: it tells the server we came from the video page
    "Referer": url,
}

videoStatus = f"https://www.pearvideo.com/videoStatus.jsp?contId={cont_id}&mrd=0.768106800161237"
resp = requests.get(videoStatus, headers=head)

dic = resp.json()
srcUrl = dic["videoInfo"]["videos"]["srcUrl"]
systemTime = dic["systemTime"]

# Swap the timestamp in the fake address for cont-<id> to get the real address
srcUrl = srcUrl.replace(systemTime, f"cont-{cont_id}")

with open("asd.mp4", mode="wb") as f:
    f.write(requests.get(srcUrl).content)

resp.close()
```
Proxies

```python
import requests

proxies = {
    "https": "http://114.86.181.92"
}

# The proxies argument must be passed, otherwise the proxy is never used
resp = requests.get("https://www.baidu.com", proxies=proxies)
resp.encoding = "utf-8"
print(resp.text)
```
A proxy is like buying tickets through a scalper: you swap in a different IP address when accessing the target server.
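To verify that the proxy is actually in use, one option (illustrative only; the proxy address is the placeholder from above and may well be offline) is to hit an IP-echo service and compare the reported address:

```python
import requests

proxies = {"https": "http://114.86.181.92"}   # placeholder proxy, may be offline

# httpbin echoes back the IP the request came from; if the proxy works,
# this prints the proxy's address instead of your own.
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.json())
```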