[Python] Bilibili Danmaku Data Analysis and Visualization (Web Scraping + Data Mining)

Updated: 2023-06-07 08:45:36


Results

Project repository

Scraping the danmaku

See my earlier post: "Download Bilibili danmaku in 10 lines of code".

Download code

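# download.py
'''Dependencies:
pip install requests
'''
import re

import requests

url = input('Enter a Bilibili video URL: ')
# Fetch the video page and take the last "cid" value embedded in its source
res = requests.get(url)
cid = re.findall(r'"cid":(.*?),', res.text)[-1]
# Request the danmaku XML for that cid (the host part of this URL is missing in the original post)
url = f'/{cid}.xml'
res = requests.get(url)
# Save the raw XML bytes as <cid>.xml
with open(f'{cid}.xml', 'wb') as f:
    f.write(res.content)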
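The cid lookup above depends on a non-greedy regex and on taking the last match. Here is a minimal sketch of that step on its own, assuming a made-up fragment of page source (`sample_html` and the helper name `extract_cid` are illustrative, not part of the original script; real pages are far larger and their layout may differ):

```python
import re

# Hypothetical fragment of a video page's embedded JSON, for illustration only.
sample_html = '{"aid":1,"cid":111,"pages":[{"cid":51816463,"page":1}]}'


def extract_cid(html: str) -> str:
    """Return the last "cid" value in the page source, mirroring download.py."""
    matches = re.findall(r'"cid":(.*?),', html)
    if not matches:
        # download.py would raise IndexError here; a clearer error is used instead.
        raise ValueError('no "cid" field found in page source')
    return matches[-1]


cid = extract_cid(sample_html)
print(cid)
```

Wrapping the lookup in a function makes the failure case explicit: if the page contains no "cid" field, the bare `[-1]` in the script would fail with an unhelpful IndexError.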

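Sample input

Sample output

Data processing

After downloading the danmaku XML file, let's open it and take a look:

51816463
3000
k-v
A giraffe? Or an elephant? (
I didn't want this either, it's just way too big
Truly unfathomable
(many lines omitted here)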

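As you can see, the text of each d tag in the XML file is the danmaku text, while the d tag's p attribute appears to hold the danmaku's parameters: 8 in total, separated by commas.

stime: time the danmaku appears (s)
mode: danmaku type (< 7 means a regular danmaku)
size: font size
color: text color
date: send timestamp
pool: danmaku pool ID
author: sender ID
dbid: database record ID (monotonically increasing)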

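Parameter details:

① stime (float): when the danmaku appears, in seconds; that is, how many seconds into the video it shows up.

② mode (int): danmaku type, 8 kinds in all; values below 8 are regular danmaku, and 8 is an advanced danmaku.
1~3: scrolling danmaku
4: bottom danmaku
6: top danmaku
7: reverse danmaku
8: advanced danmaku

③ size (int): font size.
12: very small
16: extra small
18: small
25: medium
36: large
45: very large
64: extra large

④ color (int): text color, stored as a decimal integer (for example, 16777215 is 0xFFFFFF, white).

⑤ date (int): timestamp when the danmaku was sent, i.e. the number of seconds from the reference time 1970-1-1 08:00:00 Beijing time (the Unix epoch, 1970-01-01 00:00:00 UTC) to the send time.

⑥ pool (int): danmaku pool ID.
0: regular pool
1: subtitle pool
2: special pool (reserved for advanced danmaku)

⑦ author (str): sender ID, used for the "block this sender's danmaku" feature.

⑧ dbid (str): row ID of the danmaku in the database, used for the "danmaku history" feature.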
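To make the eight fields concrete, here is a small sketch that splits one p attribute into typed values, turning the decimal color into a hex string and the timestamp into a UTC datetime. The `DanmakuMeta` class, the `parse_p` helper, and the sample value are all made up for illustration; only the field order comes from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DanmakuMeta:
    stime: float      # seconds into the video
    mode: int         # danmaku type
    size: int         # font size
    color: str        # "#rrggbb", converted from the decimal value
    date: datetime    # send time, converted from the Unix timestamp
    pool: int         # danmaku pool ID
    author: str       # sender ID
    dbid: str         # database row ID


def parse_p(p: str) -> DanmakuMeta:
    """Split a comma-separated p attribute into its eight typed fields."""
    stime, mode, size, color, date, pool, author, dbid = p.split(',')
    return DanmakuMeta(
        stime=float(stime),
        mode=int(mode),
        size=int(size),
        color=f'#{int(color):06x}',
        date=datetime.fromtimestamp(int(date), tz=timezone.utc),
        pool=int(pool),
        author=author,
        dbid=dbid,
    )


# Made-up p value in the documented field order.
meta = parse_p('12.5,1,25,16777215,1536000000,0,abcd1234,100001')
print(meta.color)             # #ffffff
print(meta.date.isoformat())
```

Using an explicit UTC timezone sidesteps the 08:00:00 offset mentioned above; converting to Beijing time afterwards is a matter of applying a UTC+8 offset.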

Now that we understand the parameters, let's save the danmaku information to danmus.csv:

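# processing.py
import re

# Assumed file name: the <cid>.xml saved by download.py, using the cid from the sample above
with open('51816463.xml', encoding='utf-8') as f:
    data = f.read()

# Each match is a (p attribute, danmaku text) pair
comments = re.findall('<d p="(.*?)">(.*?)</d>', data)
# print(len(comments))  # 3000
danmus = [','.join(item) for item in comments]

# Prepend the header row: the 8 parameters plus the text column
headers = ['stime', 'mode', 'size', 'color', 'date', 'pool', 'author', 'dbid', 'text']
headers = ','.join(headers)
danmus.insert(0, headers)

with open('danmus.csv', 'w', encoding='utf_8_sig') as f:
    f.writelines([line+'\n' for line in danmus])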
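One thing to watch with the plain comma join is that a danmaku whose text itself contains a comma ends up spread over extra columns. As an alternative sketch (not the approach used in this post), the standard library's `xml.etree.ElementTree` and `csv` modules can parse the tags and quote such fields. The sample XML below is made up and only mimics the d-tag shape described earlier.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up two-comment sample in the <d p="...">text</d> shape.
sample_xml = (
    '<i>'
    '<d p="1.0,1,25,16777215,1536000000,0,aaaa,1">hello, world</d>'
    '<d p="2.5,4,25,255,1536000001,0,bbbb,2">second</d>'
    '</i>'
)

HEADERS = ['stime', 'mode', 'size', 'color', 'date', 'pool', 'author', 'dbid', 'text']


def xml_to_rows(xml_text):
    """Return one row per d tag: the 8 p fields followed by the text."""
    root = ET.fromstring(xml_text)
    return [d.get('p').split(',') + [d.text or ''] for d in root.iter('d')]


def rows_to_csv(rows):
    """Serialize rows to CSV text; csv.writer quotes fields containing commas."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator='\n')
    writer.writerow(HEADERS)
    writer.writerows(rows)
    return buf.getvalue()


rows = xml_to_rows(sample_xml)
csv_text = rows_to_csv(rows)
print(csv_text)

# Reading it back keeps the comma inside the text column intact.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Note that a file written this way should also be read back with `csv.reader` rather than `line.split(',')`, since quoted fields would otherwise be split again.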

Data analysis

Word frequency analysis

# wordCloud.py
'''Dependencies:
pip install jieba pyecharts
'''
from pyecharts import options as opts
from pyecharts.charts import WordCloud
import jieba

# Join the text column (last field) of every row, skipping the header row
with open('danmus.csv', encoding='utf-8') as f:
    text = " ".join([line.split(',')[-1] for line in f.readlines()[1:]])

words = jieba.cut(text)

# Count only words of two or more characters
_dict = {}
for word in words:
    if len(word) >= 2:
        _dict[word] = _dict.get(word, 0)+1

# Sort (word, count) pairs by count, highest first
items = list(_dict.items())
items.sort(key=lambda x: x[1], reverse=True)

c = (
    WordCloud()
    .add(
        "",
        items,
        word_size_range=[20, 120],
        textstyle_opts=opts.TextStyleOpts(font_family="cursive"),
    )
    .render("wordcloud.html")
)
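The counting and sorting step can also be written with `collections.Counter`. The sketch below shows only that step, on a hand-made token list standing in for the output of `jieba.cut` (the tokens and the helper name `word_frequencies` are illustrative):

```python
from collections import Counter


def word_frequencies(tokens, min_len=2):
    """Count tokens of at least min_len characters, most frequent first."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    return counts.most_common()


# Hand-made tokens standing in for jieba.cut(text).
tokens = ['哈哈', '可爱', '哈哈', '的', '可爱', '哈哈', '啊']
items = word_frequencies(tokens)
print(items)  # [('哈哈', 3), ('可爱', 2)]
```

`most_common()` returns a list of (word, count) tuples, which is the same shape as the `items` list passed to `WordCloud().add(...)`.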

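Sentiment analysis

From the pie chart: of the 3000 danmaku, more than half are positive, and a little over 30% are neutral.

Of course, danmaku are mostly playful banter and full of memes, which gets in the way of sentiment analysis. For example:

>>> from snownlp import SnowNLP
>>> s = SnowNLP('阿伟死了')
>>> s.sentiments
0.1373666377744408

"阿伟死了" ("A-Wei is dead") contains the character "死" (dead), so it is classified as negative. In practice, though, it expresses a positive emotion: the excitement of seeing something adorable.

# emotionAnalysis.py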

'''Dependencies:
pip install snownlp pyecharts
'''
from snownlp import SnowNLP
from pyecharts import options as opts
from pyecharts.charts import Pie

# Read the text column (last field) of every row, skipping the header row
with open('danmus.csv', encoding='utf-8') as f:
    text = [line.split(',')[-1] for line in f.readlines()[1:]]

emotions = {
    'positive': 0,
    'negative': 0,
    'neutral': 0
}

for item in text:
    # Score each danmaku once: > 0.6 is positive, < 0.4 is negative, the rest neutral
    score = SnowNLP(item).sentiments
    if score > 0.6:
        emotions['positive'] += 1
    elif score < 0.4:
        emotions['negative'] += 1
    else:
        emotions['neutral'] += 1

print(emotions)

c = (
    Pie()
    .add("", list(emotions.items()))
    .set_colors(["blue", "purple", "orange"])
    .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {c} ({d}%)"))
    .render("emotionAnalysis.html")
)
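The 0.6 / 0.4 thresholds are the part of this script most worth experimenting with, so it can help to pull the bucketing into a small function that works on plain scores. The sketch below does that without calling SnowNLP; all scores except the 0.137... value from the example above are made up, and the function names are illustrative.

```python
def classify(score, low=0.4, high=0.6):
    """Map a sentiment score in [0, 1] to a label using the script's thresholds."""
    if score > high:
        return 'positive'
    if score < low:
        return 'negative'
    return 'neutral'


def tally(scores, low=0.4, high=0.6):
    """Count how many scores fall into each label."""
    emotions = {'positive': 0, 'negative': 0, 'neutral': 0}
    for score in scores:
        emotions[classify(score, low, high)] += 1
    return emotions


# Illustrative scores; 0.1373666377744408 is the value SnowNLP gave for '阿伟死了'.
scores = [0.9, 0.1373666377744408, 0.5, 0.6, 0.4]
print(tally(scores))  # {'positive': 1, 'negative': 1, 'neutral': 3}
```

Because both comparisons are strict, scores of exactly 0.6 or 0.4 count as neutral, which matches the if / elif / else chain in emotionAnalysis.py.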

Highlights

From the line chart: minute 3, minutes 8 and 9, and minute 13 are the highlights of this video.

# highlights.py
'''Dependencies:
pip install snownlp pyecharts
'''
from pyecharts.commons.utils import JsCode
from pyecharts.charts import Line, Grid
import pyecharts.options as opts

# Read the stime column (first field) of every row, skipping the header row
with open('danmus.csv', encoding='utf-8') as f:
    text = [float(line.split(',')[0]) for line in f.readlines()[1:]]

text = sorted([int(item) for item in text])

# Count danmaku per minute of video time
data = {}
for item in text:
    item = int(item/60)
    data[item] = data.get(item, 0)+1

x_data = list(data.keys())
y_data = list(data.values())

background_color_js = (
    "new echarts.graphic.LinearGradient(0, 0, 0, 1, "

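The per-minute bucketing behind the highlights chart can be tried without any plotting. The sketch below counts danmaku per minute and picks the busiest minutes; the stime values and helper names are made up for illustration. Unlike the dict-based loop in highlights.py, it fills minutes with no danmaku with 0, which is a deliberate change so that list indexes line up with minute numbers.

```python
from collections import Counter


def per_minute_counts(stimes):
    """Return a list where index m holds the number of danmaku in minute m."""
    counts = Counter(int(s // 60) for s in stimes)
    last = max(counts) if counts else -1
    return [counts.get(m, 0) for m in range(last + 1)]


def top_minutes(counts, k=3):
    """Return the k minutes with the most danmaku, in chronological order."""
    ranked = sorted(range(len(counts)), key=lambda m: counts[m], reverse=True)
    return sorted(ranked[:k])


# Made-up appearance times in seconds.
stimes = [5.0, 61.2, 62.0, 65.9, 190.0, 191.5]
counts = per_minute_counts(stimes)
print(counts)                     # [1, 3, 0, 2]
print(top_minutes(counts, k=2))   # [1, 3]
```

Minutes here are counted from 0, so index 1 covers seconds 60 to 119 of the video.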