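This article explains how to use Python to scrape danmaku (bullet comments) and comments from six widely used video and public-opinion platforms: Mango TV (芒果TV), Tencent Video (腾讯视频), Bilibili (B站), iQiyi (爱奇艺), Zhihu (知乎), and Weibo (微博). The output of crawlers like these is typically used for entertainment or public-opinion analysis. For example, when a new film becomes a hit, you can scrape its danmaku and comments to analyze why it is so popular; when a new scandal breaks on Weibo, you can scrape the comments underneath to see what people are saying.
The article covers six platforms and ten crawler examples. If you only care about particular cases, scroll to them in this order: Mango TV, Tencent Video, Bilibili, iQiyi, Zhihu, Weibo. The complete working source code is included in the text, so let's get started.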
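Mango TV (芒果TV)
This section uses the film 《悬崖之上》 (Cliff Walkers) as the example to show how to scrape danmaku and comments from Mango TV videos. Page URL: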
https://www.mgtv.com/b/335313/12281642.html?fpa=15800&fpos=8&lastp=ch_movie
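Danmaku
Analyzing the page
The file that holds the danmaku data is loaded dynamically, so you need to open the browser's developer tools and capture the network traffic to find the real URL. For every minute of video played, the page requests one more JSON packet that contains the danmaku we want.
The real URLs:
https://bullet-ali.hitv.com/bullet/2021/08/14/005323/12281642/0.json
https://bullet-ali.hitv.com/bullet/2021/08/14/005323/12281642/1.json
The URLs differ only in the trailing number: the first one is 0 and each following one increases by 1. The video runs 120:20 (120 minutes 20 seconds); rounding up to whole minutes gives 121 packets, numbered 0 to 120.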
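The packet arithmetic above can be sketched in a few lines. This is only an illustration: the helper names `packet_count` and `packet_urls` are made up here, and the base path is copied from the captured URLs, so its date/time segment is specific to that capture.

```python
import math

# Base path copied from the URLs captured above; the date/time segment
# belongs to that particular capture and may differ in yours.
BASE = "https://bullet-ali.hitv.com/bullet/2021/08/14/005323/12281642"


def packet_count(minutes, seconds):
    """Number of one-minute danmaku packets for a video of the given length."""
    return math.ceil((minutes * 60 + seconds) / 60)


def packet_urls(minutes, seconds):
    """Build one URL per packet, numbered from 0."""
    return [f"{BASE}/{n}.json" for n in range(packet_count(minutes, seconds))]


urls = packet_urls(120, 20)
print(len(urls))
print(urls[0])
print(urls[-1])
```

For the 120:20 video this yields 121 URLs, ending in 0.json through 120.json, which matches the range used in the crawler.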
Code
import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
for e in range(0, 121):
    print(f'正在爬取第{e}页')
    # Note: the date/time path segment here (2021/08/3/004902) differs from the
    # URLs captured above; copy the segment from your own capture.
    resposen = requests.get(f'https://bullet-ali.hitv.com/bullet/2021/08/3/004902/12281642/{e}.json', headers=headers)
    # Extract the data directly from the JSON response
    for i in resposen.json()['data']['items']:
        ids = i['ids']  # user id
        content = i['content']  # danmaku text
        time = i['time']  # time at which the danmaku appears
        # Some entries have no like count, so fall back to an empty string
        v2_up_count = i.get('v2_up_count', '')
        text = pd.DataFrame({'ids': [ids], '弹幕': [content], '发生时间': [time], 'v2_up_count': [v2_up_count]})
        df = pd.concat([df, text])
df.to_csv('悬崖之上.csv', encoding='utf-8', index=False)
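Results:
Comments
Analyzing the page
Mango TV's comments are found by scrolling to the bottom of the page. The file holding the comment data is also loaded dynamically. Open the developer tools and capture it as follows: Network → JS, then click "查看更多评论" (view more comments).
What loads is again a JS file that contains the comment data. The real URLs:
https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943290494
https://comment.mgtv.com/v4/comment/getCommentList?page=2&subjectType=hunantv2014&subjectId=12281642&callback=jQuery1820749973529821774_1628942431449&_support=10000000&_=1628943296653
The parameters that differ are page and _: page is the page number and _ is a timestamp. Removing the timestamp does not affect the returned data. The callback parameter wraps the response in a JSONP function call, which interferes with parsing, so remove it as well. The final URL:
https://comment.mgtv.com/v4/comment/getCommentList?page=1&subjectType=hunantv2014&subjectId=12281642&_support=10000000
Each packet holds 15 comments and there are 2527 comments in total, so the last page is 169 (2527 / 15, rounded up).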
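As a rough sketch of the two steps above (dropping the callback and _ parameters, then working out the last page), the following uses only the standard library. The helper names `strip_params` and `page_count` are illustrative, not part of any Mango TV API.

```python
import math
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def strip_params(url, drop=("callback", "_")):
    """Return url without the query parameters named in drop."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k not in drop]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def page_count(total, per_page):
    """Number of pages needed to hold total items."""
    return math.ceil(total / per_page)


captured = ("https://comment.mgtv.com/v4/comment/getCommentList?page=1"
            "&subjectType=hunantv2014&subjectId=12281642"
            "&callback=jQuery1820749973529821774_1628942431449"
            "&_support=10000000&_=1628943290494")
clean = strip_params(captured)
print(clean)
print(page_count(2527, 15))
```

Note that only the parameter named exactly `_` is dropped; `_support` is kept because the match is on the full parameter name.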
Code
import requests
import pandas as pd

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
for o in range(1, 170):
    url = f'https://comment.mgtv.com/v4/comment/getCommentList?page={o}&subjectType=hunantv2014&subjectId=12281642&_support=10000000'
    res = requests.get(url, headers=headers).json()
    for i in res['data']['list']:
        nickName = i['user']['nickName']  # user nickname
        praiseNum = i['praiseNum']  # number of likes received
        date = i['date']  # date posted
        content = i['content']  # comment text
        text = pd.DataFrame({'nickName': [nickName], 'praiseNum': [praiseNum], 'date': [date], 'content': [content]})
        df = pd.concat([df, text])
# Use a separate file name so the danmaku CSV written earlier is not overwritten
df.to_csv('悬崖之上_评论.csv', encoding='utf-8', index=False)
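Results:
Tencent Video (腾讯视频)
This section uses the film 《革命者》 (The Pioneer) as the example to show how to scrape danmaku and comments from Tencent Video. Page URL: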
https://v.qq.com/x/cover/mzc00200m72fcup.html
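Danmaku
Analyzing the page
Again open the browser's developer tools and capture the traffic. For every 30 seconds of video played, the page requests one more JSON packet that contains the danmaku we want.
The real URLs:
https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057&timestamp=15&_=1628947050569
https://mfm.video.qq.com/danmu?otype=json&callback=jQuery19109541041335587612_1628947050538&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C32%2C1628947057&timestamp=45&_=1628947050572
The parameters that differ are timestamp and _. The _ parameter is a request timestamp. The timestamp parameter acts as the page number: it is 15 in the first URL and grows by 30 each time, the step being the interval at which packets refresh, and its upper bound is the video length, 7245 seconds. Removing the unnecessary parameters again gives:
https://mfm.video.qq.com/danmu?otype=json&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C18%2C1628418094&timestamp=15&_=1628418086509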
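To make the stepping concrete, here is a small sketch of the timestamp values that cover the whole video. The constant names are illustrative; the numbers (15, 30, 7245) and the URL template come from the description and the cleaned URL above.

```python
VIDEO_SECONDS = 7245  # video length quoted above
FIRST = 15            # timestamp value of the first packet
STEP = 30             # packets refresh every 30 seconds

# Template taken from the cleaned URL above, with the timestamp left open.
TEMPLATE = ("https://mfm.video.qq.com/danmu?otype=json"
            "&target_id=7220956568%26vid%3Dt0040z3o3la"
            "&session_key=0%2C18%2C1628418094&timestamp={}&_=1628418086509")

timestamps = list(range(FIRST, VIDEO_SECONDS, STEP))
urls = [TEMPLATE.format(ts) for ts in timestamps]

print(len(timestamps), timestamps[:3], timestamps[-1])
print(urls[0])
```

This gives 241 requests, the last one at timestamp 7215.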
Code
import pandas as pd
import time
import requests

headers = {
    'User-Agent': 'Googlebot'
}
# Start at 15; 7245 is the video length in seconds; the URL advances in 30-second steps
df = pd.DataFrame()
for ts in range(15, 7245, 30):
    url = "https://mfm.video.qq.com/danmu?otype=json&target_id=7220956568%26vid%3Dt0040z3o3la&session_key=0%2C18%2C1628418094&timestamp={}&_=1628418086509".format(ts)
    html = requests.get(url, headers=headers).json()
    time.sleep(1)
    for item in html['comments']:
        content = item['content']
        print(content)
        text = pd.DataFrame({'弹幕': [content]})
        df = pd.concat([df, text])
df.to_csv('革命者_弹幕.csv', encoding='utf-8', index=False)
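Results:
Comments
Analyzing the page
Tencent Video's comments are at the bottom of the page and are also loaded dynamically. Open the developer tools and capture the traffic as follows:
After you click "查看更多评论" (view more comments), the packet that loads contains the comment data. The real URLs:
https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132&_=1628948867522
https://video.coral.qq.com/varticle/6655100451/comment/v2?callback=_varticle6655100451commentv2&orinum=10&oriorder=o&pageflag=1&cursor=6786869637356389636&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&
The callback and _ parameters can simply be removed. The important parameter is cursor: it is 0 in the first URL and only takes a real value from the second URL on, so we need to find out where that value comes from. On inspection, cursor is the last field returned in the previous response.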
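The cursor hand-off can be sketched without touching the network. In the example below, `fetch` stands in for the real HTTP request, and the `FAKE` responses are invented purely to show the flow; the field names `last` and `oriCommList` are the ones used in the crawler. Stopping on an empty comment list is an assumption, not something the source confirms.

```python
def iter_pages(fetch, max_pages):
    """Yield comment lists page by page, feeding each response's
    'last' value back in as the next cursor."""
    cursor = "0"  # the first request always uses cursor=0
    for _ in range(max_pages):
        data = fetch(cursor)["data"]
        comments = data.get("oriCommList") or []
        if not comments:  # assumption: an empty page means we are done
            break
        yield comments
        cursor = data["last"]


# Invented responses standing in for the real endpoint.
FAKE = {
    "0": {"data": {"last": "A1", "oriCommList": [{"id": 1}, {"id": 2}]}},
    "A1": {"data": {"last": "B2", "oriCommList": [{"id": 3}]}},
    "B2": {"data": {"last": "B2", "oriCommList": []}},
}


def fake_fetch(cursor):
    return FAKE[cursor]


ids = [c["id"] for page in iter_pages(fake_fetch, max_pages=10) for c in page]
print(ids)
```

Passing the fetch function in as an argument keeps the paging logic testable; the `max_pages` bound plays the same role as the fixed loop count in the crawler.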
Code
import requests
import pandas as pd
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
a = 1
# The loop count must be bounded here, otherwise the crawler keeps re-fetching forever.
# 281 is derived from oritotal in the packet: each packet holds 10 top-level comments,
# so 280 iterations give 2800 comments, not counting the replies underneath.
# commentnum in the packet is the total including replies; since every packet carries
# 10 comments plus their replies, 2800 divided by 10, plus 1, is enough.
while a < 281:
    if a == 1:
        url = 'https://video.coral.qq.com/varticle/6655100451/comment/v2?orinum=10&oriorder=o&pageflag=1&cursor=0&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132'
    else:
        url = f'https://video.coral.qq.com/varticle/6655100451/comment/v2?orinum=10&oriorder=o&pageflag=1&cursor={cursor}&scorecursor=0&orirepnum=2&reporder=o&reppageflag=1&source=132'
    res = requests.get(url, headers=headers).json()
    cursor = res['data']['last']  # becomes the cursor of the next request
    for i in res['data']['oriCommList']:
        ids = i['id']
        times = i['time']
        up = i['up']
        content = i['content'].replace('\n', '')
        text = pd.DataFrame({'ids': [ids], 'times': [times], 'up': [up], 'content': [content]})
        df = pd.concat([df, text])
    a += 1
    time.sleep(random.uniform(2, 3))
df.to_csv('革命者_评论.csv', encoding='utf-8', index=False)
Results:
Bilibili (B站)
This section uses the video 《“这是我见过最拽的一届中国队奥运冠军”》 as the example to show how to scrape danmaku and comments from Bilibili videos. Page URL:
https://www.bilibili.com/video/BV1wq4y1Q7dp
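Danmaku
Analyzing the page
Unlike Tencent Video, Bilibili does not fire danmaku packets just because the video is playing. You need to expand the danmaku list on the right side of the page and then click "查看历史弹幕" (view historical danmaku) to get the link that lists the dates, from first to last, on which the video has danmaku:
The link is built from the oid and the starting month, and serves as the danmaku date index:
https://api.bilibili.com/x/v2/dm/history/index?type=1&oid=384801460&month=2021-08
With that in place, click any valid date to load the danmaku packet for that date. Its contents are not readable as displayed, but we can be sure it is the danmaku packet because it only loads after a date is clicked and its URL is clearly related to the previous one:
The resulting URL:
https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid=384801460&date=2021-08-08
In this URL, oid is the id used by the video's danmaku links and date is the date just clicked. To get all of the video's danmaku you only need to vary date. The dates can be taken from the date-index URL above or constructed yourself. The date index is returned as JSON; the danmaku packet itself is not, which is why the code extracts the Chinese text from it with a regular expression.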
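If you choose to construct the dates yourself rather than read them from the date index, a sketch along these lines would do it. The helper names `month_dates` and `seg_url` are illustrative; the URL shape is the one captured above.

```python
import calendar
from datetime import date


def month_dates(month):
    """Return every calendar date of a 'YYYY-MM' month as 'YYYY-MM-DD' strings."""
    year, mon = map(int, month.split("-"))
    days_in_month = calendar.monthrange(year, mon)[1]
    return [date(year, mon, d).isoformat() for d in range(1, days_in_month + 1)]


def seg_url(oid, day):
    """Build the per-date danmaku URL in the shape captured above."""
    return ("https://api.bilibili.com/x/v2/dm/web/history/seg.so"
            f"?type=1&oid={oid}&date={day}")


days = month_dates("2021-08")
print(len(days), days[0], days[-1])
print(seg_url("384801460", "2021-08-08"))
```

One trade-off: self-built dates may include days on which the video has no danmaku, whereas the date index only lists valid days.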
Code
import requests
import pandas as pd
import re


def data_resposen(url):
    headers = {
        "cookie": "你的cookie",  # replace with your own cookie
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36"
    }
    resposen = requests.get(url, headers=headers)
    return resposen


def main(oid, month):
    df = pd.DataFrame()
    url = f'https://api.bilibili.com/x/v2/dm/history/index?type=1&oid={oid}&month={month}'
    list_data = data_resposen(url).json()['data']  # all dates that have danmaku
    print(list_data)
    for date in list_data:
        urls = f'https://api.bilibili.com/x/v2/dm/web/history/seg.so?type=1&oid={oid}&date={date}'
        # The source line is cut off after the pattern; passing the response text
        # as the second argument is a reconstruction from context.
        text = re.findall(".*?([\u4E00-\u9FA5]+).*?", data_resposen(urls).text)
        for e in text:
            print(e)
            row = pd.DataFrame({'弹幕': [e]})
            df = pd.concat([df, row])
    df.to_csv('弹幕.csv', encoding='utf-8', index=False, mode='a+')


if __name__ == '__main__':
    oid = '384801460'  # id used by the video's danmaku links
    month = '2021-08'  # starting month
    main(oid, month)
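Results:
Comments
Analyzing the page
Bilibili's comments are at the bottom of the page. After opening the browser's developer tools, simply scroll down to load the packet:
The real URLs:
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550479&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1&_=1629012090500
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550483&jsonp=jsonp&next=2&type=1&oid=589656273&mode=3&plat=1&_=1629012513080
https://api.bilibili.com/x/v2/reply/main?callback=jQuery1720034332372316460136_1629011550484&jsonp=jsonp&next=3&type=1&oid=589656273&mode=3&plat=1&_=1629012803039
The URLs differ in the next parameter and in _ and callback. The _ parameter is a timestamp and callback is an interfering parameter; both can be removed. The next parameter is 0 in the first URL, 2 in the second and 3 in the third, so the first request uses a fixed next=0 and from the second request on the value counts up from 2. The response is JSON.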
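The slightly odd sequence for next (0, then 2, 3, 4, ...) is easy to get wrong, so here is a minimal sketch of it. The generator `next_values` and the helper `reply_url` are illustrative names; the URL is the captured one with callback and _ removed.

```python
from itertools import count, islice


def next_values():
    """Yield 0 for the first request, then 2, 3, 4, ... as seen in the captures."""
    yield 0
    yield from count(2)


def reply_url(oid, nxt):
    """Build the comment URL without the callback and _ parameters."""
    return ("https://api.bilibili.com/x/v2/reply/main"
            f"?jsonp=jsonp&next={nxt}&type=1&oid={oid}&mode=3&plat=1")


first_five = list(islice(next_values(), 5))
print(first_five)
print(reply_url("589656273", first_five[1]))
```

The crawler reaches the same sequence with a counter that starts at 1 and special-cases the first request.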
Code
import requests
import pandas as pd

df = pd.DataFrame()
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'
}
try:
    a = 1
    while True:
        if a == 1:
            # First URL, after removing the unnecessary parameters
            url = 'https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next=0&type=1&oid=589656273&mode=3&plat=1'
        else:
            url = f'https://api.bilibili.com/x/v2/reply/main?&jsonp=jsonp&next={a}&type=1&oid=589656273&mode=3&plat=1'
        print(url)
        html = requests.get(url, headers=headers).json()
        replies = html['data']['replies']
        if not replies:  # nothing left to fetch, so stop explicitly
            break
        for i in replies:
            uname = i['member']['uname']  # user name
            sex = i['member']['sex']  # user gender
            mid = i['mid']  # user id
            current_level = i['member']['level_info']['current_level']  # account level
            message = i['content']['message'].replace('\n', '')  # comment text
            like = i['like']  # number of likes on the comment
            ctime = i['ctime']  # comment time
            data = pd.DataFrame({'用户名称': [uname], '用户性别': [sex], '用户id': [mid],
                                 'vip等级': [current_level], '用户评论': [message], '评论点赞次数': [like], '评论时间': [ctime]})
            df = pd.concat([df, data])
        a += 1
except Exception as e:
    print(e)
df.to_csv('奥运会.csv', encoding='utf-8')
print(df.shape)
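Results. The scraped content does not include second-level replies; if you need them you can scrape them yourself, and the steps are much the same:
iQiyi (爱奇艺)
This section uses the film 《哥斯拉大战金刚》 (Godzilla vs. Kong) as the example to show how to scrape danmaku and comments from iQiyi videos. Page URL: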
https://www.iqiyi.com/v_19rr0m845o.html
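Danmaku
Analyzing the page
For iQiyi you again capture the traffic in the developer tools. What you get is a .br compressed file that can be downloaded by clicking it, and its contents are binary. One packet is loaded for each minute of video played:
The URLs differ only in the incrementing number; the 60 means that the packet refreshes every 60 seconds of video:
https://cmts.iqiyi.com/bullet/64/00/1078946400_60_1_b2105043.br
https://cmts.iqiyi.com/bullet/64/00/1078946400_60_2_b2105043.br
A .br file can be decompressed with the brotli library, but in practice this proved difficult, mainly because of encoding problems. Decoding the result directly as utf-8 raises the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x91 in position 52: invalid start byte
Adding ignore to the decode call keeps the Chinese text intact, but the surrounding markup comes out garbled, so extracting the data is still hard:
decode("utf-8", "ignore")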
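The difference between strict decoding and the ignore handler can be reproduced on a tiny hand-made byte string. This is not iQiyi data: the sample simply places a stray 0x91 byte between two valid UTF-8 characters to trigger the same kind of error.

```python
# Two valid UTF-8 characters with one stray byte (0x91) between them.
raw = "弹".encode("utf-8") + b"\x91" + "幕".encode("utf-8")

try:
    raw.decode("utf-8")  # strict decoding is the default
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

ignored = raw.decode("utf-8", "ignore")    # silently drops undecodable bytes
replaced = raw.decode("utf-8", "replace")  # marks them with U+FFFD instead

print(ignored)
print(replaced)
```

The ignore handler hides the problem rather than fixing it: any structure carried by the dropped bytes is lost, which is consistent with the garbled markup described above.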
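The author (小刀) found the encoding problem a real headache. Interested readers are welcome to dig further into the approach above, but this article does not go deeper. Instead it uses another method: rewrite the URL as follows to obtain a .z compressed file:
https://cmts.iqiyi.com/bullet/64/00/1078946400_300_1.z
This works because it is iQiyi's older danmaku endpoint, which has not yet been removed or changed and can still be used. In this URL, 1078946400 is the video id. The 300 reflects that iQiyi used to load a new danmaku packet every 5 minutes, that is, every 300 seconds; 《哥斯拉大战金刚》 runs 112.59 minutes, which divided by 5 and rounded up gives 23 packets. The 1 is the page number, and 64 is the 7th and 8th digits of the id.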
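A short sketch of how the .z URLs could be assembled from the video id and running time. The function name `z_urls` is illustrative. Taking the first path segment from the 7th and 8th digits follows the description above; treating the second segment ("00") as the last two digits of the id is an assumption made here, since the text does not say where it comes from.

```python
import math


def z_urls(video_id, minutes, step_seconds=300):
    """Build the legacy .z danmaku URLs for one video."""
    vid = str(video_id)
    pages = math.ceil(minutes * 60 / step_seconds)
    seg_a = vid[6:8]   # 7th and 8th digits of the id, as described above
    seg_b = vid[8:10]  # assumption: the last two digits ("00" for this id)
    return [
        f"https://cmts.iqiyi.com/bullet/{seg_a}/{seg_b}/{vid}_{step_seconds}_{page}.z"
        for page in range(1, pages + 1)
    ]


urls = z_urls(1078946400, 112.59)
print(len(urls))
print(urls[0])
print(urls[-1])
```

Because the page numbers start at 1, a loop over 23 packets has to run from 1 to 23 inclusive.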
Code
import requests
import pandas as pd
from lxml import etree
from zlib import decompress  # decompression

df = pd.DataFrame()
for page in range(1, 24):  # pages 1 to 23, matching the 23 packets computed above
    url = f'https://cmts.iqiyi.com/bullet/64/00/1078946400_300_{page}.z'
    bulletold = requests.get(url).content  # raw binary data
    decode = decompress(bulletold).decode('utf-8')  # decompress, then decode
    with open(f'{page}.html', 'w', encoding='utf-8') as f:  # save as a static html file
        f.write(decode)
    with open(f'./{page}.html', 'rb') as f:  # read the html file back as bytes
        html = etree.HTML(f.read())  # parse it so that XPath can be used
    ul = html.xpath('/html/body/danmu/data/entry/list/bulletinfo')
    for item in ul:
        contentid = ''.join(item.xpath('./contentid/text()'))
        content = ''.join(item.xpath('./content/text()'))
        likeCount = ''.join(item.xpath('./likecount/text()'))
        print(contentid, content, likeCount)
        text = pd.DataFrame({'contentid': [contentid], 'content': [content], 'likeCount': [likeCount]})
        df = pd.concat([df, text])
df.to_csv('哥斯拉大战金刚.csv', encoding='utf-8', index=False)
Results:
Comments
Analyzing the page
iQiyi's comments are at the bottom of the page and are again loaded dynamically. Open the browser's developer tools and capture the traffic; scrolling down the page loads a packet that contains the comment data:
The real URLs:
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=10&last_id=&page=&page_size=10&types=hot,
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=7963601726142521&page=&page_size
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&authcookie=null&business_type=17&channel_id=1&content_id=1078946400&hot_size=0&last_id=4933019153543021&page=&page_size
The first URL loads the featured comments; from the second URL on, the full comment list is loaded. Removing the unnecessary parameters gives the following URLs:
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=&page_size=10
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=7963601726142521&page_size=20
https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id=4933019153543021&page_size=20
They differ in last_id and page_size. page_size is 10 in the first URL and fixed at 20 from the second URL on. last_id is empty in the first URL and changes with every request after that. On inspection, last_id is the id attached to the last comment in the previous response (probably the user id). The response is JSON.
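The rule for last_id and page_size can be captured in one small URL builder. This is a sketch using `urllib.parse.urlencode`; the function name `comments_url` is illustrative, and the parameter values are the ones visible in the trimmed URLs above.

```python
from urllib.parse import urlencode

BASE = "https://sns-comment.iqiyi.com/v3/comment/get_comments.action"


def comments_url(content_id, last_id=""):
    """Build the comment URL: empty last_id and page_size=10 for the first
    request, the previous page's last id and page_size=20 afterwards."""
    params = {
        "agent_type": 118,
        "agent_version": "9.11.5",
        "business_type": 17,
        "content_id": content_id,
        "last_id": last_id,
        "page_size": 10 if last_id == "" else 20,
    }
    return f"{BASE}?{urlencode(params)}"


first = comments_url(1078946400)
second = comments_url(1078946400, "7963601726142521")
print(first)
print(second)
```

Building the query from a dict keeps the two cases in one place, instead of maintaining two nearly identical URL strings.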
Code
import requests
import pandas as pd
import time
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
df = pd.DataFrame()
try:
    a = 0
    while True:
        if a == 0:
            url = 'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&page_size=10'
        else:
            # Take the last id of the previous page from id_list.
            # When the previous page was empty this raises IndexError, which ends the loop.
            url = f'https://sns-comment.iqiyi.com/v3/comment/get_comments.action?agent_type=118&agent_version=9.11.5&business_type=17&content_id=1078946400&last_id={id_list[-1]}&page_size=20'
        print(url)
        res = requests.get(url, headers=headers).json()
        id_list = []  # ids seen on this page
        for i in res['data']['comments']:
            ids = i['id']
            id_list.append(ids)
            uname = i['userInfo']['uname']
            addTime = i['addTime']
            # get() avoids a KeyError when the key is missing: the first argument
            # is the key, the second is the value returned when it is absent
            content = i.get('content', '不存在')
            text = pd.DataFrame({'ids': [ids], 'uname': [uname], 'addTime': [addTime], 'content': [content]})
            df = pd.concat([df, text])
        a += 1
        time.sleep(random.uniform(2, 3))
except Exception as e:
    print(e)
df.to_csv('哥斯拉大战金刚_评论.csv', mode='a+', encoding='utf-8', index=False)
Results:
Zhihu (知乎)
This section uses the trending Zhihu question 《如何看待网传腾讯实习生向腾讯高层提出建议颁布拒绝陪酒相关条令?》 as the example to show how to scrape Zhihu answers. Page URL:
https://www.zhihu.com/question/478781972
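Analyzing the page
Viewing the page source and similar checks confirm that the answers are loaded dynamically, so you need to open the browser's developer tools and capture the traffic. Go to Network → XHR and scroll down the page with the mouse to get the packets we need:
The real URLs:
https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky
https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky
The URL carries many unnecessary parameters, which you can trim yourself in the browser. The two URLs differ in the trailing offset parameter (not visible in the truncated URLs above): it is 0 in the first URL and 5 in the second, so offset increases in steps of 5. The response is JSON.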
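Since only offset changes between requests, one option is to take a captured URL and rewrite just that parameter. The sketch below does this with the standard library. The helper name `with_offset` is illustrative, and the short example URL (with `limit=5`) is a made-up stand-in for the full captured URL, which is truncated in this article.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_offset(url, offset):
    """Return url with its offset query parameter set to the given value."""
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["offset"] = str(offset)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical shortened URL; substitute the full one from your own capture.
captured = "https://www.zhihu.com/api/v4/questions/478781972/answers?limit=5&offset=0"

offsets = list(range(0, 1360, 5))  # same range as the crawler's loop
urls = [with_offset(captured, off) for off in offsets]

print(len(urls))
print(urls[1])
print(urls[-1])
```

With the range used in the crawler this produces 272 requests, the last at offset 1355.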
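Code
import requests
import pandas as pd
import re
import time
import random

df = pd.DataFrame()
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
for page in range(0, 1360, 5):
    # The URL below is cut off in the source after "is_s": the rest of the include list
    # and the offset parameter are missing. Paste the full URL from your own capture
    # and set its offset parameter to {page}.
    url = f'https://www.zhihu.com/api/v4/questions/478781972/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_s'
    response = requests.get(url=url, headers=headers).json()
    data = response['data']
    for list_ in data:
        name = list_['author']['name']  # answer author
        id_ = list_['author']['id']  # author id
        # Answer time. The source line is cut off after the format string; converting
        # the created_time field with time.localtime is a reconstruction from context.
        created_time = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(list_['created_time']))
        voteup_count = list_['voteup_count']  # number of upvotes
        comment_count = list_['comment_count']  # number of comments underneath
        content = list_['content']  # answer body
        # Keep only Chinese characters and common Chinese punctuation (regex extraction).
        # The source line is cut off after the pattern; passing content is reconstructed.
        content = ''.join(re.findall("[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b\u4e00-\u9fa5]", content))
        print(name, id_, created_time, comment_count, content, sep='|')
        dataFrame = pd.DataFrame(
            {'知乎作者': [name], '作者id': [id_], '回答时间': [created_time], '赞同数': [voteup_count], '底下评论数': [comment_count], '回答内容': [content]})
        df = pd.concat([df, dataFrame])
    time.sleep(random.uniform(2, 3))
df.to_csv('知乎回答.csv', encoding='utf-8', index=False)
print(df.shape)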
Results:
Weibo (微博)
This section uses the Weibo trending topic 《霍尊手写道歉信》 as the example to show how to scrape Weibo comments. Page URL:
https://m.weibo.cn/detail/4669040301182509
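Analyzing the page
Weibo comments are loaded dynamically. After opening the browser's developer tools, scroll down the page to get the packets we need:
The real URLs:
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id=3698934781006193&max_id_type=0
The difference between the two URLs is obvious: the first has no max_id parameter, and max_id only appears from the second URL on. Its value is the max_id returned in the previous packet.
One thing to watch is the max_id_type parameter, which can also change, so it has to be read from the packet as well.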
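The hand-off of both values can be sketched offline. Here `fetch` stands in for the real HTTP request and the `FAKE` responses are invented for illustration; the field names `max_id`, `max_id_type` and the nested `data` list are the ones used in the crawler. Treating a `max_id` of 0 as the end of the comments is an assumption.

```python
BASE = "https://m.weibo.cn/comments/hotflow?id={mid}&mid={mid}"


def next_url(mid, max_id=None, max_id_type=0):
    """Build the comment URL; max_id is left out on the first request."""
    url = BASE.format(mid=mid)
    if max_id:
        url += f"&max_id={max_id}"
    return url + f"&max_id_type={max_id_type}"


def crawl(fetch, mid, max_pages):
    """Collect comments, carrying max_id and max_id_type from each
    response into the next request."""
    out, max_id, max_id_type = [], None, 0
    for _ in range(max_pages):
        data = fetch(next_url(mid, max_id, max_id_type))["data"]
        out.extend(data["data"])
        max_id, max_id_type = data["max_id"], data["max_id_type"]
        if not max_id:  # assumption: max_id == 0 marks the last page
            break
    return out


MID = "4669040301182509"
# Invented responses keyed by URL, standing in for the real endpoint.
FAKE = {
    next_url(MID): {"data": {"data": [{"text": "a"}],
                             "max_id": 3698934781006193, "max_id_type": 0}},
    next_url(MID, 3698934781006193, 0): {"data": {"data": [{"text": "b"}],
                                                  "max_id": 0, "max_id_type": 1}},
}

texts = [c["text"] for c in crawl(FAKE.__getitem__, MID, max_pages=10)]
print(texts)
```

The two URLs produced for the fake responses are exactly the two captured above.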
Code
import re
import requests
import pandas as pd
import time
import random

df = pd.DataFrame()
try:
    a = 1
    while True:
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        }
        resposen = requests.get('https://m.weibo.cn/detail/4669040301182509', headers=header)
        # Weibo tends to block the account after a few dozen pages;
        # refreshing the cookies on every request keeps the crawler alive longer.
        cookie = [cookie.value for cookie in resposen.cookies]  # cookie values, collected with a list comprehension
        headers = {
            # Cookie after login; fill in SUB with the value from your logged-in session
            'cookie': f'WEIBOCN_FROM={cookie[3]}; SUB=; _T_WM={cookie[4]}; MLOGIN={cookie[1]}; M_WEIBOCN_PARAMS={cookie[2]}; XSRF-TOKEN={cookie[0]}',
            'referer': 'https://m.weibo.cn/detail/4669040301182509',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36'
        }
        if a == 1:
            url = 'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id_type=0'
        else:
            url = f'https://m.weibo.cn/comments/hotflow?id=4669040301182509&mid=4669040301182509&max_id={max_id}&max_id_type={max_id_type}'
        html = requests.get(url=url, headers=headers).json()
        data = html['data']
        max_id = data['max_id']  # max_id and max_id_type are passed on to the next URL
        max_id_type = data['max_id_type']
        for i in data['data']:
            screen_name = i['user']['screen_name']
            i_d = i['user']['id']
            like_count = i['like_count']  # number of likes
            created_at = i['created_at']  # time posted
            text = re.sub(r'<[^>]*>', '', i['text'])  # comment text with HTML tags stripped
            print(text)
            data_json = pd.DataFrame({'screen_name': [screen_name], 'i_d': [i_d], 'like_count': [like_count], 'created_at': [created_at], 'text': [text]})
            df = pd.concat([df, data_json])
        time.sleep(random.uniform(2, 7))
        a += 1
except Exception as e:
    print(e)
df.to_csv('微博.csv', encoding='utf-8', mode='a+', index=False)
print(df.shape)
Results:
That is everything for today. If you enjoyed it, a like would be appreciated. Thanks!