Toxic Comment Identification: Data Visualization and Frequency Word Clouds

Updated: 2023-05-29 06:23:12

Machine Learning Training Camp: an open discussion space for machine learning enthusiasts (to join the group, contact QQ: 2279055353)
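Case Introduction
This case comes from a research effort initiated by Google that uses machine learning to identify toxic comments in online conversations. Here a "toxic comment" is anything rude, disrespectful, or otherwise likely to make someone leave a discussion. In this case we build a classification model that identifies toxic comments while reducing unintended bias. For example, if a particular name frequently appears in toxic comments, some models may wrongly classify non-toxic comments that mention the same name as toxic.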
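Data Description
In the dataset, the text of each comment is stored in the comment_text column. Each comment in the training set has a toxicity label (target), and the model being developed predicts the target for the comments in the test set. All other attributes are fractional values describing properties of the given comment. For evaluation, test-set samples with target >= 0.5 are treated as the positive class (toxic).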
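The thresholding rule can be illustrated with a minimal sketch. The toy DataFrame and the is_toxic column name below are assumptions made for illustration and are not part of the dataset; the comparison follows the competition's data page, which states the rule as target >= 0.5.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical rows; the real file has many more columns).
toy = pd.DataFrame({
    "comment_text": ["thanks for sharing", "you are an idiot", "interesting point"],
    "target": [0.0, 0.83, 0.5],
})

# Turn the fractional toxicity score into a binary class at the 0.5 threshold.
toy["is_toxic"] = toy["target"] >= 0.5

# Share of comments that fall into the positive (toxic) class.
toxic_share = toy["is_toxic"].mean()
print(toy[["target", "is_toxic"]])
print("share of toxic comments: {:.2f}".format(toxic_share))
```

On the real training set the same two lines would be applied to train['target'] directly.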
Loading Packages
import gc
import os
import warnings
import operator
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from wordcloud import WordCloud, STOPWORDS
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk
from gensim import corpora, models
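import pyLDAvis.gensim  # renamed to pyLDAvis.gensim_models in newer pyLDAvis releases
from keras.preprocessing.text import Tokenizer  # module path is missing in the source; Keras is assumed
np.random.seed(2018)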
warnings.filterwarnings('ignore')
Loading the Data
JIGSAW_PATH = "../input/jigsaw-unintended-bias-in-toxicity-classification/"
train = pd.read_csv(os.path.join(JIGSAW_PATH,'train.csv'), index_col='id')
test = pd.read_csv(os.path.join(JIGSAW_PATH,'test.csv'), index_col='id')
Display the first 5 rows of train and test.
train.head(), test.head()
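To show what index_col='id' does when the files are read, here is a small sketch that parses an in-memory CSV instead of the Kaggle files. The two rows and the toy_train name are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with the same leading columns as train.csv.
csv_text = "id,target,comment_text\n7,0.0,hello there\n9,0.6,go away\n"

# index_col='id' moves the id column into the row index instead of keeping it as data.
toy_train = pd.read_csv(io.StringIO(csv_text), index_col="id")

print(toy_train.head())   # head() returns at most the first 5 rows
print(toy_train.shape)    # (rows, columns), with id no longer counted as a column
```

Because id becomes the index, lookups such as toy_train.loc[7] address a comment by its identifier.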
Data Exploration
The comment text is stored in the comment_text column. In addition, train contains columns that flag whether specific sensitive topics are mentioned in a comment. The topics fall into 5 categories:
race or ethnicity: asian, black, jewish, latino, other_race_or_ethnicity, white
gender: female, male, transgender, other_gender
sexual orientation: bisexual, heterosexual, homosexual_gay_or_lesbian, other_sexual_orientation
religion: atheist, buddhist, christian, hindu, muslim, other_religion
disability: intellectual_or_learning_disability, other_disability, physical_disability, psychiatric_or_mental_illness
There are also several fields that identify a comment:
created_date
publication_id
parent_id
article_id
Several user-feedback fields related to the comments:
rating
funny
wow
sad
likes
disagree
sexual_explicit
The dataset also has two annotation variables:
identity_annotator_count
toxicity_annotator_count
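The topic columns listed above can be collected into a dictionary so that later steps can loop over the groups. The sketch below also measures how many rows carry at least one non-null value per group, since the plotting code filters out nulls. The IDENTITY_GROUPS and group_coverage names and the toy rows are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Column groups as listed in the text.
IDENTITY_GROUPS = {
    "race or ethnicity": ["asian", "black", "jewish", "latino",
                          "other_race_or_ethnicity", "white"],
    "gender": ["female", "male", "transgender", "other_gender"],
    "sexual orientation": ["bisexual", "heterosexual",
                           "homosexual_gay_or_lesbian",
                           "other_sexual_orientation"],
    "religion": ["atheist", "buddhist", "christian", "hindu", "muslim",
                 "other_religion"],
    "disability": ["intellectual_or_learning_disability", "other_disability",
                   "physical_disability", "psychiatric_or_mental_illness"],
}

def group_coverage(df, groups):
    """Per group, the share of rows with at least one non-null column."""
    coverage = {}
    for name, cols in groups.items():
        present = [c for c in cols if c in df.columns]
        if not present:
            continue  # skip groups that have no columns in this frame
        coverage[name] = float(df[present].notnull().any(axis=1).mean())
    return coverage

# Hypothetical rows; NaN means the comment was not annotated for that identity.
toy = pd.DataFrame({
    "female": [0.0, np.nan, 1.0, np.nan],
    "male": [0.2, np.nan, 0.0, np.nan],
    "muslim": [np.nan, np.nan, 0.5, np.nan],
})
coverage = group_coverage(toy, IDENTITY_GROUPS)
print(coverage)
```

Calling group_coverage(train, IDENTITY_GROUPS) on the real data would give the same kind of summary for all five groups.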
Target Feature
Let's examine the distribution of target values in the training set.
plt.figure(figsize=(12,6))
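plt.title("Distribution of target in the train set")
# distplot is deprecated since seaborn 0.11; sns.kdeplot(train['target'], label='target') draws an equivalent density curve
sns.distplot(train['target'],kde=True,hist=False, bins=120, label='target')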
plt.legend(); plt.show()
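A numeric summary can complement the density plot. The following sketch bins the scores with np.histogram and reports the share of exact zeros and the share at or above 0.5. The summarize_target helper and the toy scores are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

def summarize_target(target, bins=10):
    """Bin scores in [0, 1] and report two headline shares."""
    target = pd.Series(target, dtype=float)
    counts, edges = np.histogram(target, bins=bins, range=(0.0, 1.0))
    return {
        "share_zero": float((target == 0).mean()),       # comments no annotator flagged
        "share_positive": float((target >= 0.5).mean()), # comments in the toxic class
        "counts": counts.tolist(),
        "edges": edges.tolist(),
    }

# Hypothetical scores; on the real data pass train['target'] instead.
toy_target = [0.0, 0.0, 0.0, 0.1, 0.25, 0.5, 0.9]
summary = summarize_target(toy_target, bins=5)
print(summary)
```

The counts make the skew toward zero explicit, which a smoothed density curve can hide.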
Let's plot the distributions of the additional toxicity features in the same way.
def plot_features_distribution(features, title):
    plt.figure(figsize=(12,6))
    plt.title(title)
    for feature in features:
        # keep only the rows where this feature is non-null
        sns.distplot(train.loc[~train[feature].isnull(),feature],kde=True,hist=False, bins=120, label=feature)
    plt.xlabel('')
    plt.legend()
    plt.show()
features = ['severe_toxicity', 'obscene','identity_attack','insult','threat']
plot_features_distribution(features, "Distribution of additional toxicity features in the train set")
Sensitive Topics
Now let's examine the value distributions of the sensitive-topic features.
features = ['asian', 'black', 'jewish', 'latino', 'other_race_or_ethnicity', 'white']
plot_features_distribution(features, "Distribution of race and ethnicity feature values in the train set")
features = ['female', 'male', 'transgender', 'other_gender']
plot_features_distribution(features, "Distribution of gender feature values in the train set")
features = ['atheist','buddhist',  'christian', 'hindu', 'muslim', 'other_religion']
plot_features_distribution(features, "Distribution of religion feature values in the train set")
features = ['intellectual_or_learning_disability', 'other_disability', 'physical_disability', 'psychiatric_or_mental_illness']
plot_features_distribution(features, "Distribution of disability feature values in the train set")
Feedback Information
Let's look at the distributions of the feedback values.
def plot_count(feature, title,size=1):
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(train))
    # show at most the 20 most frequent values, ordered by count
    g = sns.countplot(x=train[feature], order = train[feature].value_counts().index[:20], palette='Set3', ax=ax)
    g.set_title("Number and percentage of {}".format(title))
    for p in ax.patches:
        # annotate each bar with its share of all training rows
        height = p.get_height()
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center")
    plt.show()
plot_count('rating','rating')
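The percentages that plot_count writes above the bars can also be computed as a table. This is a minimal sketch; the count_table helper and the toy rating values are assumptions for illustration and do not come from the source.

```python
import pandas as pd

def count_table(series, top=20):
    """Counts and percentages of the most frequent values, like the bar labels."""
    counts = series.value_counts().head(top)   # sorted by frequency, descending
    percent = 100.0 * counts / len(series)     # share of all rows, as in plot_count
    return pd.DataFrame({"count": counts, "percent": percent.round(2)})

# Hypothetical values; on the real data pass train['rating'] instead.
toy_rating = pd.Series(["approved", "approved", "rejected", "approved"])
table = count_table(toy_rating)
print(table)
```

As in plot_count, the denominator is the total number of rows, so missing values lower every percentage.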
