Building a Sentiment Analyzer with Naive Bayes

Sentiment Analysis is contextual mining of text which identifies and extracts subjective information in the source material, helping a business understand the social sentiment of their brand, product, or service while monitoring online conversations. So basically, here I have used a dataset to predict whether a given movie review strikes a positive sentiment or a negative sentiment. I have used Naive Bayes because it outperforms most other Machine Learning algorithms when the data is textual. Though I will be using some NLP libraries, the basic focus will be on using Naive Bayes. The accuracy of my predictions comes out to be more or less 89%, so I would say not bad. You can use other techniques like BERT or various Deep Learning techniques to increase the accuracy further.
Let us take our baby steps towards Natural Language Processing using Naive Bayes.
To learn more about Naive Bayes, you can refer to the following:
Let's Get Started
So our target is to convert the entire set of textual reviews into a Bag of Words, i.e., to convert each unique word in our dataset into a column name and simply store the frequency count of each word in each row of a review. The steps involved in the process are:
1. Text preprocessing
2. Vectorize (Bag of Words)
3. Creating a Machine Learning Model
4. Deployment
Text Preprocessing
So at first, we have to analyze and clean the data before fitting it into the ML models, otherwise we will get…
The steps involved in data cleaning are:
Remove HTML tags
Remove special characters
Convert everything to lowercase
Remove stopwords
Stemming
We will first import the necessary libraries that we are going to need for our sentiment analyzer.
First, we are going to need NumPy and pandas, our essential data science tools. “re” stands for regular expression, which is used to extract a certain portion of a string. nltk is an NLP library, and we are going to import it in certain parts of our code to process the textual data. Then we are going to import sklearn for model creation. We are also importing some metrics from sklearn to analyze model performance.
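A sketch of the import block these steps need; the exact list in the original post may differ slightly:

```python
import re

import numpy as np
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, precision_score, recall_score
```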
Then we will import our dataset and casually go through it just to get a rough idea about the data provided to us.
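A minimal loading step; the file name `IMDB_Dataset.csv` is an assumption, so point it at your own copy of the 50,000-review dataset:

```python
# Load the dataset (file name is assumed; adjust the path to your copy).
df = pd.read_csv("IMDB_Dataset.csv")
print(df.shape)  # expect (50000, 2): a "review" column and a "sentiment" column
df.head()
```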
So we have 50,000 rows with only one feature column, which is the “review”. You can already see the HTML tags that need processing.
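The original post showed the missing-value check as an image; a quick equivalent looks like this:

```python
# Count missing values per column; both counts should come out to zero.
print(df.isnull().sum())
```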
There are no missing values, as we can see from above. Phew!!
Now we will replace the positive sentiment with 1 and the negative sentiment with -1. We get:
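A sketch of the mapping, assuming the label column is named `sentiment` and holds the strings `positive` and `negative`:

```python
# Encode the labels numerically: positive -> 1, negative -> -1.
df["sentiment"] = df["sentiment"].map({"positive": 1, "negative": -1})
print(df["sentiment"].value_counts())
```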
Removing HTML Tags
洪福子
We will now remove the HTML tags with the help of the “regular expression” library from Python. This is used to extract the part of a string that follows a certain pattern. For example, suppose the phone number and the email id are somehow merged into one column, and we want to create two separate columns, one for the phone numbers and the other for the email ids. It would be impossible for us to manually process each row. In that case, we use a regular expression (also known as regex). To know more about regex, see the Python `re` documentation.
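A minimal sketch of the tag-stripping step; the helper name and the exact regex are my own, not from the original post:

```python
def remove_html(text):
    # Drop anything that looks like an HTML tag, e.g. "<br />".
    return re.sub(r"<[^>]+>", "", text)

df["review"] = df["review"].apply(remove_html)
```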
As you can see, all the HTML tags have been removed.
Removing Special Characters
We don’t want punctuation signs or any other non-alphanumeric characters in our Bag of Words, so we will remove those.
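A sketch of this step, again with an assumed helper name:

```python
def remove_special_characters(text):
    # Keep letters, digits, and whitespace; drop everything else.
    return re.sub(r"[^a-zA-Z0-9\s]", "", text)

df["review"] = df["review"].apply(remove_special_characters)
```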
All the non-alphanumeric characters have been removed.
Converting Everything to Lowercase
For better analysis, we will convert everything to lowercase.
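This step is a one-liner with pandas:

```python
# Lowercase every review so that "Good" and "good" become the same token.
df["review"] = df["review"].str.lower()
```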
Removing Stopwords
Stopwords are those words that might not add much value to the meaning of the document, so converting them into Bag of Words columns would be a waste of time and space. They would add unnecessary features to our dataset and might affect the correctness of our predictions. These are articles, prepositions, or conjunctions like “the”, “is”, “in”, “for”, “where”, “when”, “to”, “at”, etc. The “nltk” library of Python for Natural Language Processing comes with a class where all the probable stopwords are stored. For this purpose, we import “stopwords” from the “nltk.corpus” library to process the stopwords.
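A sketch of this step; note that the helper returns a list of words, matching the output described below:

```python
# Download the stopword list once (a no-op if it is already cached).
nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    # Split the review into words and keep only the non-stopwords.
    return [word for word in text.split() if word not in stop_words]

df["review"] = df["review"].apply(remove_stopwords)
```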
It returns a list of all the words without stopwords.
Stemming
This means words that are different forms of the same common word have to be reduced. The basic agenda of stemming is reducing a word to its word stem, stripping suffixes and prefixes to get at the root of the word, known as a lemma. Stemming is important in natural language understanding (NLU) and natural language processing (NLP). Suppose we are given words like playing, played, play: all these words have the same stem, known as “play”. The only word that would be useful as a vector in our Bag of Words is “play”. Other words would not contribute significant meaning to our dataset or predictions, and these are unnecessary. The “nltk” library again comes with a class for stemming words. Here we import “SnowballStemmer” from the “nltk.stem” library for this purpose.
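A sketch of the stemming step; joining the stems back into a single string prepares the reviews for the vectorizers:

```python
stemmer = SnowballStemmer("english")

def stem_words(words):
    # Reduce each word to its stem, then join back into one string,
    # which is the form the vectorizers below expect.
    return " ".join(stemmer.stem(word) for word in words)

df["review"] = df["review"].apply(stem_words)
```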
With this, we are done with our text preprocessing.
Forming the Bag of Words and Implementing It in Our Models
As I have mentioned earlier, like all other Natural Language Processing methods, we have to vectorize all the unique words and store the frequency of each word as its data point. In this article, we will be vectorizing the unique words with:
1. CountVectorizer
2. TfidfVectorizer
We will construct a separate model for each of the vectorizers and check their accuracy.
Building a Model with CountVectorizer
CountVectorizer simply converts all the unique words into columns and stores their frequency counts. It is the simplest vectorizer used in Machine Learning.
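A sketch of the vectorization; `max_features=5000` is an assumed vocabulary cap, not a value from the original post:

```python
cv = CountVectorizer(max_features=5000)       # vocabulary cap is an assumption
X = cv.fit_transform(df["review"]).toarray()  # dense array, needed by GaussianNB
y = df["sentiment"].values
```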
Now we will split the data.
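A sketch of the split; the 80/20 ratio and the random seed are assumptions:

```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```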
Then we will create our models and fit the data to them. Here we will be using GaussianNB, MultinomialNB, and BernoulliNB.
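A sketch of the model creation and fitting:

```python
gnb = GaussianNB()
mnb = MultinomialNB()
bnb = BernoulliNB()

# Fit each Naive Bayes variant on the same training data.
gnb.fit(X_train, y_train)
mnb.fit(X_train, y_train)
bnb.fit(X_train, y_train)

# Predict on the held-out test set.
y_pred_gnb = gnb.predict(X_test)
y_pred_mnb = mnb.predict(X_test)
y_pred_bnb = bnb.predict(X_test)
```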
We will understand the performance of the models by calculating the accuracy score, precision score, and recall score. You might say we could judge the performance of a model just by calculating the accuracy score, but it is not that simple. So let us understand it. For our classification model, we can assess the performance considering the following factors:
True Positives (TP): these are the correctly predicted positive values, meaning that the value of the actual class is positive and the value of the predicted class is also positive.
True Negatives (TN): these are the correctly predicted negative values, meaning that the value of the actual class is negative and the value of the predicted class is also negative.

False Positives (FP): the value of the actual class is negative but the value of the predicted class is positive.

False Negatives (FN): the value of the actual class is positive but the value of the predicted class is negative.
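With these factors, accuracy is (TP + TN) / all predictions, precision is TP / (TP + FP), and recall is TP / (TP + FN). A sketch of the evaluation:

```python
# Compare the three Naive Bayes variants on the same test set.
for name, y_pred in [("GaussianNB", y_pred_gnb),
                     ("MultinomialNB", y_pred_mnb),
                     ("BernoulliNB", y_pred_bnb)]:
    print(f"{name}: "
          f"accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"precision={precision_score(y_test, y_pred):.3f}, "
          f"recall={recall_score(y_test, y_pred):.3f}")
```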
