Toxic Comment Identification: Data Visualization and Frequency Word Clouds

Updated: 2023-05-29 06:23:12

Machine Learning Training Camp: an open discussion space for machine learning enthusiasts (to join the group, contact QQ: 2279055353)

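Case Introduction

A study initiated by Google uses machine learning to identify toxic comments in online conversations. Here, a "toxic comment" is any comment that is rude, disrespectful, or otherwise likely to make someone leave a discussion. This case study builds a classification model that identifies toxic comments while reducing unintended bias. For example, if a particular name frequently appears in toxic comments, some models may wrongly classify non-toxic comments that mention the same name as toxic.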

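Data Description

In the case dataset, the text of each comment is in the comment_text column. Each comment in the training set has a toxicity label (target), and the model being developed predicts the target for the test set. All other attributes are fractional values describing properties of the given comment. For model evaluation, test set examples with target >= 0.5 are labeled as the positive class (toxic).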
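The labeling rule described above can be sketched in a few lines. This is a minimal illustration on toy scores, not the competition's scoring code; the helper name `to_binary_label` is hypothetical, and treating a score of exactly 0.5 as toxic is an assumption of this sketch.

```python
import pandas as pd


def to_binary_label(target: pd.Series, threshold: float = 0.5) -> pd.Series:
    """Map fractional toxicity scores to 0/1 class labels."""
    # Assumption: scores at or above the threshold count as toxic.
    return (target >= threshold).astype(int)


# Toy scores standing in for the real `target` column.
toy = pd.DataFrame({"target": [0.0, 0.2, 0.5, 0.83]})
toy["is_toxic"] = to_binary_label(toy["target"])
print(toy)
```

Keeping the threshold as a parameter makes it easy to check how sensitive the class balance is to the cutoff.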

Loading Packages

import gc
import os
import warnings
import operator
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
from wordcloud import WordCloud, STOPWORDS
import gensim
from gensim.utils import simple_preprocess
# Note: this import shadows the STOPWORDS imported from wordcloud above.
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk
from gensim import corpora, models
import pyLDAvis
import pyLDAvis.gensim  # garbled as "sim" in the source; submodule name assumed
from keras.preprocessing.text import Tokenizer  # module path missing in the source; Keras assumed

np.random.seed(2018)  # fix the random seed for reproducibility
warnings.filterwarnings('ignore')

Loading the Data

JIGSAW_PATH = "../input/jigsaw-unintended-bias-in-toxicity-classification/"
train = pd.read_csv(os.path.join(JIGSAW_PATH,'train.csv'), index_col='id')
test = pd.read_csv(os.path.join(JIGSAW_PATH,'test.csv'), index_col='id')

Display the first 5 rows of train and test.

train.head(), test.head()

Data Exploration

The comment text is stored in the comment_text column. In addition, train contains features that flag whether particular sensitive topics are mentioned in a comment. The topics fall into 5 categories:

race or ethnicity: asian, black, jewish, latino, other_race_or_ethnicity, white

gender: female, male, transgender, other_gender

sexual orientation: bisexual, heterosexual, homosexual_gay_or_lesbian, other_sexual_orientation

religion: atheist, buddhist, christian, hindu, muslim, other_religion

disability: intellectual_or_learning_disability, other_disability, physical_disability, psychiatric_or_mental_illness

We also have several fields that identify a comment:

created_date

publication_id

parent_id

article_id

Several user feedback fields related to the comment:

rating

funny

wow

sad

likes

disagree

sexual_explicit

The dataset also has two annotation variables:

identity_annotator_count

toxicity_annotator_count
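One way to work with the topic columns listed above is to keep them grouped in a dictionary and compute, per category, the share of comments that mention it. The sketch below runs on a toy frame; the names `IDENTITY_GROUPS` and `topic_mention_share` are hypothetical, and counting "any value above 0" as a mention is an assumption, not something the dataset description states.

```python
import numpy as np
import pandas as pd

# Column groups copied from the category lists; the dictionary itself is illustrative.
IDENTITY_GROUPS = {
    "race or ethnicity": ["asian", "black", "jewish", "latino",
                          "other_race_or_ethnicity", "white"],
    "gender": ["female", "male", "transgender", "other_gender"],
    "sexual orientation": ["bisexual", "heterosexual",
                           "homosexual_gay_or_lesbian",
                           "other_sexual_orientation"],
    "religion": ["atheist", "buddhist", "christian", "hindu", "muslim",
                 "other_religion"],
    "disability": ["intellectual_or_learning_disability", "other_disability",
                   "physical_disability", "psychiatric_or_mental_illness"],
}


def topic_mention_share(df: pd.DataFrame, groups: dict) -> pd.Series:
    """Share of rows in which at least one column of each group is above 0."""
    shares = {}
    for name, cols in groups.items():
        present = [c for c in cols if c in df.columns]
        if not present:
            shares[name] = float("nan")  # no column of this group in the frame
            continue
        # NaN > 0 evaluates to False, so unannotated rows count as "not mentioned".
        shares[name] = float((df[present] > 0).any(axis=1).mean())
    return pd.Series(shares, name="mention_share")


# Toy frame with only a few of the real columns.
toy = pd.DataFrame({
    "female": [0.0, 1.0, np.nan, 0.0],
    "male":   [0.0, 0.0, np.nan, 0.5],
    "muslim": [0.0, 0.0, np.nan, 0.0],
})
shares = topic_mention_share(toy, IDENTITY_GROUPS)
print(shares)
```

On the real data the call would be `topic_mention_share(train, IDENTITY_GROUPS)`.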

Target Feature

Let's check the distribution of target values in the training set.

plt.figure(figsize=(12,6))
plt.title("Distribution of target in the train set")
# Note: distplot is deprecated since seaborn 0.11; sns.kdeplot draws the same KDE curve.
sns.distplot(train['target'], kde=True, hist=False, bins=120, label='target')
plt.legend(); plt.show()
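A density curve is easier to read next to a few numbers. The following sketch bins scores in [0, 1] into equal-width intervals and reports the share of comments in each; the helper `binned_share` and the toy values are hypothetical, and the choice of four bins is arbitrary.

```python
import numpy as np
import pandas as pd


def binned_share(values: pd.Series, n_bins: int = 4) -> pd.Series:
    """Share of non-null values in each equal-width bin on [0, 1]."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # np.histogram uses half-open bins, except the last one, which includes 1.0.
    counts, _ = np.histogram(values.dropna(), bins=edges)
    # Index each bin by its left edge.
    return pd.Series(counts / counts.sum(), index=edges[:-1], name="share")


# Toy scores standing in for train['target'].
toy_target = pd.Series([0.0, 0.0, 0.1, 0.3, 0.6, 1.0])
share = binned_share(toy_target)
print(share)
```

On the real data this would be called as `binned_share(train['target'], n_bins=10)`.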

Let's show, in the same way, the distributions of the additional toxicity features.

def plot_features_distribution(features, title):
    # Overlay the KDE curves of several train columns in one figure.
    plt.figure(figsize=(12,6))
    plt.title(title)
    for feature in features:
        # Drop missing values before estimating the density.
        sns.distplot(train.loc[~train[feature].isnull(),feature], kde=True, hist=False, bins=120, label=feature)
    plt.xlabel('')
    plt.legend()
    plt.show()

features = ['severe_toxicity', 'obscene','identity_attack','insult','threat']
plot_features_distribution(features, "Distribution of additional toxicity features in the train set")
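The overlaid curves can be hard to compare by eye, so a small summary table may help. This sketch applies the same null filter as the plotting helper and reports, per feature, the number of non-null values, their mean, and the share above zero; the function name `summarize_features` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_features(df: pd.DataFrame, features: list) -> pd.DataFrame:
    """Per-feature count, mean, and share of non-zero values (nulls dropped)."""
    rows = []
    for feature in features:
        values = df[feature].dropna()  # same null filter as the plotting helper
        rows.append({
            "feature": feature,
            "n": len(values),
            "mean": values.mean(),
            "share_nonzero": (values > 0).mean(),
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy frame standing in for train.
toy = pd.DataFrame({
    "insult": [0.0, 0.5, np.nan, 1.0],
    "threat": [0.0, 0.0, 0.0, np.nan],
})
summary = summarize_features(toy, ["insult", "threat"])
print(summary)
```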


Sensitive Topics

Now let's check the value distributions of the sensitive topic features.

features = ['asian', 'black', 'jewish', 'latino', 'other_race_or_ethnicity', 'white']
plot_features_distribution(features, "Distribution of race and ethnicity features values in the train set")

features = ['female', 'male', 'transgender', 'other_gender']
plot_features_distribution(features, "Distribution of gender features values in the train set")

features = ['atheist','buddhist', 'christian', 'hindu', 'muslim', 'other_religion']
plot_features_distribution(features, "Distribution of religion features values in the train set")

features = ['intellectual_or_learning_disability', 'other_disability', 'physical_disability', 'psychiatric_or_mental_illness']
plot_features_distribution(features, "Distribution of disability features values in the train set")

Feedback Information

Let's look at the distribution of the feedback values.

def plot_count(feature, title, size=1):
    # Bar chart of the 20 most frequent values, annotated with percentages.
    f, ax = plt.subplots(1,1, figsize=(4*size,4))
    total = float(len(train))
    g = sns.countplot(x=train[feature], order=train[feature].value_counts().index[:20], palette='Set3', ax=ax)
    g.set_title("Number and percentage of {}".format(title))
    for p in ax.patches:
        height = p.get_height()
        # Write the percentage of all rows just above each bar.
        ax.text(p.get_x()+p.get_width()/2.,
                height + 3,
                '{:1.2f}%'.format(100*height/total),
                ha="center")
    plt.show()

plot_count('rating','rating')
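The numbers that plot_count writes above each bar can also be produced as a table, which is handy when the figure is not available. This is a minimal sketch with a hypothetical helper `count_table`; the toy values only stand in for the real rating column.

```python
import pandas as pd


def count_table(values: pd.Series, top_n: int = 20) -> pd.DataFrame:
    """Counts and percentages of the top_n most frequent values."""
    counts = values.value_counts().head(top_n)
    # Percentages are relative to all rows, matching total = len(train) in plot_count.
    percent = (100 * counts / len(values)).round(2)
    return pd.DataFrame({"count": counts, "percent": percent})


# Toy values standing in for train['rating'].
toy_rating = pd.Series(["approved", "approved", "rejected", "approved"])
table = count_table(toy_rating)
print(table)
```

Because the denominator is the full row count, missing values lower every percentage rather than being silently dropped.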
