
Machine Learning: Spam Email Detection with an SVM in MATLAB

Updated: 2023-05-19 07:21:33


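Preprocessing

Before starting on a machine learning task, it usually helps to look closely at examples from the data set. In a sample email, for instance, we can see a URL, an email address (at the end), numbers, and dollar amounts.

Many emails contain similar kinds of entities (e.g., numbers, other URLs, or other email addresses), but the specific values differ in almost every email.

A common way to handle this is to "normalize" these values, so that all URLs are treated the same, all numbers are treated the same, and so on. For example, we can replace every URL in an email with the unique string "httpaddr" to indicate that a URL was present. The spam classifier then decides based on whether any URL is present rather than whether a specific URL is present. This usually improves the classifier's performance, because spammers often randomize their URLs, so the odds of seeing a particular URL again in a new piece of spam are very small.

processEmail.m implements the following email preprocessing and normalization steps: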

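Lower-casing: The entire email is converted to lower case, so capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).

Stripping HTML: All HTML tags are removed from the email. Many emails come with HTML formatting; removing the tags leaves only the content.

Normalizing URLs: All URLs are replaced with the text "httpaddr".

Normalizing email addresses: All email addresses are replaced with the text "emailaddr".

Normalizing numbers: All numbers are replaced with the text "number".

Normalizing dollars: All dollar signs ($) are replaced with the text "dollar".

Word stemming: Words are reduced to their stemmed form. For example, "discount", "discounts", "discounted", and "discounting" are all replaced with "discount". Sometimes the stemmer strips additional characters from the end, so "include", "includes", "included", and "including" are all replaced with "includ".

Removing non-words: Non-words and punctuation are removed. All whitespace (tabs, newlines, spaces) is trimmed to a single space character.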
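The normalization steps above can be tried on a single string. The following is a minimal sketch, not the article's processEmail.m: the sample text is made up for illustration, and stemming and punctuation removal are left out because they depend on helpers that are not shown at this point.

```matlab
% Hypothetical sample text, not taken from the data set
s = 'Visit <b>http://Example.com/deal</b> NOW for $100 or mail Bob@Example.com';

s = lower(s);                                           % lower-casing
s = regexprep(s, '<[^<>]+>', ' ');                      % strip HTML tags
s = regexprep(s, '[0-9]+', 'number');                   % normalize numbers
s = regexprep(s, '(http|https)://[^\s]*', 'httpaddr');  % normalize URLs
s = regexprep(s, '[^\s]+@[^\s]+', 'emailaddr');         % normalize email addresses
s = regexprep(s, '[$]+', 'dollar');                     % normalize dollar signs
s = strtrim(regexprep(s, '\s+', ' '));                  % collapse whitespace

disp(s)   % visit httpaddr now for dollarnumber or mail emailaddr
```

Note that "$100" becomes "dollarnumber": the digits are replaced first and the dollar sign afterwards, with no separator between them.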

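Vocabulary mapping

We also need to replace the words in the data with numbers, that is, to represent each string by the index of that word in our vocabulary. The vocabulary contains every word that occurs more than 100 times across all the emails (including too many words, such as those that occur only a few times, is likely to cause overfitting), for a total of 1899 words.

In MATLAB, you can compare two strings with the strcmp function. For example, strcmp(str1, str2) returns 1 only when the two strings are equal. In the provided starter code, vocabList is a cell array containing the words of the vocabulary. A cell array in MATLAB is just like a normal array (i.e., a vector), except that its elements can also be strings (which they cannot be in a normal MATLAB matrix/vector), and you index into it with curly braces.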
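As a small illustration of strcmp on a cell array, the sketch below uses a four-word toy vocabulary (hypothetical; the real vocabList has 1899 entries and comes from getVocabList). When one argument is a cell array, strcmp returns a logical array with one entry per cell, and find turns that into an index.

```matlab
% Toy vocabulary for illustration only (hypothetical)
vocabList = {'about', 'action', 'email', 'free'};

mask = strcmp('action', vocabList);   % logical row vector, true where equal
idx  = find(mask);                    % position of 'action' in the list
word = vocabList{idx};                % curly braces return the string itself

% A word that is not in the vocabulary yields an empty index
notFound = find(strcmp('zebra', vocabList));

fprintf('index = %d, word = %s, notFound is empty: %d\n', ...
        idx, word, isempty(notFound));
```

Because an empty result concatenates to nothing, appending it to word_indices simply leaves the vector unchanged, so no explicit "if found" check is needed.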

The function that implements these first two parts is as follows:

function word_indices = processEmail(email_contents)
%PROCESSEMAIL preprocesses the body of an email and
%returns a list of word_indices
%   word_indices = PROCESSEMAIL(email_contents) preprocesses
%   the body of an email and returns a list of indices of the
%   words contained in the email.

% Load Vocabulary
vocabList = getVocabList();

% Init return value
word_indices = [];

% ========================== Preprocess Email ===========================

% Find the Headers ( \n\n and remove )
% Uncomment the following lines if you are working with raw emails with the
% full headers

% hdrstart = strfind(email_contents, ([char(10) char(10)]));
% email_contents = email_contents(hdrstart(1):end);

% Lower case
email_contents = lower(email_contents);

% Strip all HTML
% Looks for any expression that starts with < and ends with > and
% does not have any < or > in the tag, and replaces it with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Handle Numbers
% Look for one or more characters between 0-9
email_contents = regexprep(email_contents, '[0-9]+', 'number');

% Handle URLs
% Look for strings starting with http:// or https://
email_contents = regexprep(email_contents, ...
                           '(http|https)://[^\s]*', 'httpaddr');

% Handle Email Addresses
% Look for strings with @ in the middle
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');

% Handle $ sign
email_contents = regexprep(email_contents, '[$]+', 'dollar');

% ========================== Tokenize Email ===========================

% Output the email to screen as well
fprintf('\n==== Processed Email ====\n\n');

% Process file
l = 0;

while ~isempty(email_contents)

    % Tokenize and also get rid of any punctuation
    [str, email_contents] = ...
        strtok(email_contents, ...
               [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

    % Remove any non alphanumeric characters
    str = regexprep(str, '[^a-zA-Z0-9]', '');

    % Stem the word
    % (the porterStemmer sometimes has issues, so we use a try catch block)
    try
        str = porterStemmer(strtrim(str));
    catch
        str = '';
        continue;
    end

    % Skip the word if it is too short
    if length(str) < 1
        continue;
    end

    % Look up the word in the dictionary and add to word_indices if
    % found
    % ====================== YOUR CODE HERE ======================
    % Instructions: Fill in this function to add the index of str to
    %               word_indices if it is in the vocabulary. At this point
    %               of the code, you have a stemmed word from the email in
    %               the variable str. You should look up str in the
    %               vocabulary list (vocabList). If a match exists, you
    %               should add the index of the word to the word_indices
    %               vector. Concretely, if str = 'action', then you should
    %               look up the vocabulary list to find where in vocabList
    %               'action' appears. For example, if vocabList{18} =
    %               'action', then, you should add 18 to the word_indices
    %               vector (e.g., word_indices = [word_indices ; 18]; ).
    %
    % Note: vocabList{idx} returns the word with index idx in the
    %       vocabulary list.
    %
    % Note: You can use strcmp(str1, str2) to compare two strings (str1 and
    %       str2). It will return 1 only if the two strings are equivalent.

    % find returns an empty result when str is not in the vocabulary,
    % in which case the concatenation leaves word_indices unchanged
    idx = find(strcmp(str, vocabList));
    word_indices = [word_indices; idx];

    % =============================================================

    % Print to screen, ensuring that the output lines are not too long
    if (l + length(str) + 1) > 78
        fprintf('\n');
        l = 0;
    end
    fprintf('%s ', str);
    l = l + length(str) + 1;

end

% Print footer
fprintf('\n\n=========================\n');

end
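The loop above looks up one token at a time with strcmp. If the stemmed tokens were first collected into a cell array, the same mapping could be done in a single call to ismember. This is an alternative sketch, not what processEmail.m does: the tokens variable and the four-word vocabulary are hypothetical.

```matlab
% Toy vocabulary and already-stemmed tokens (both hypothetical)
vocabList = {'about', 'action', 'email', 'free'};
tokens    = {'free', 'email', 'zebra', 'free'};

% found(k) is true when tokens{k} is in vocabList;
% loc(k) is its index in vocabList, or 0 when it is absent
[found, loc] = ismember(tokens, vocabList);

% Keep only the tokens that are in the vocabulary, as a column vector
word_indices = loc(found)';

disp(word_indices')   % 4 3 4
```

Repeated words produce repeated indices, just as in the loop version, and out-of-vocabulary words such as 'zebra' are dropped.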


After these first two steps, an email has been turned from a piece of text into a vector of numbers.

%% Initialization
clear;

% Extract Features
file_contents = readFile('');
word_indices = processEmail(file_contents);

% Print Stats
disp(word_indices)

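Building the feature vector

For a single email, the feature vector records whether each word of the vocabulary appears in that email: entry i is 1 if word i appears and 0 if it does not. The implementation is simple: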

function x = emailFeatures(word_indices)
%EMAILFEATURES takes in a word_indices vector and produces a feature vector
%from the word indices
%   x = EMAILFEATURES(word_indices) takes in a word_indices vector and
%   produces a feature vector from the word indices.

% Total number of words in the dictionary
n = 1899;

% You need to return the following variables correctly.
x = zeros(n, 1);

% ====================== YOUR CODE HERE ======================
% Instructions: Fill in this function to return a feature vector for the
%               given email (word_indices). To help make it easier to
%               process the emails, we have already pre-processed each
%               email and converted each word in the email into an index in
%               a fixed dictionary (of 1899 words). The variable
%               word_indices contains the list of indices of the words
%               which occur in one email.
%
%               Concretely, if an email has the text:
%
%                  The quick brown fox jumped over the lazy dog.
%
%               Then, the word_indices vector for this text might look
%               like:
%
%                   60  100   33   44   10     53  60  58   5
%
%               where, we have mapped each word onto a number, for example:
%
%                   the   -- 60
%                   quick -- 100
%                   ...
%
%              (note: the above numbers are just an example and are not the
%               actual mappings).
%
%              Your task is to take one such word_indices vector and construct
%              a binary feature vector that indicates whether a particular
%              word occurs in the email. That is, x(i) = 1 when word i
%              is present in the email. Concretely, if the word 'the' (say,
%              index 60) appears in the email, then x(60) = 1. The feature
%              vector should look like:
%
%              x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..];

% Mark every vocabulary index that occurs in the email; repeated
% indices simply set the same entry to 1 again
for i = 1:length(word_indices)
    x(word_indices(i)) = 1;
end

% =========================================================================

end
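The loop in emailFeatures can also be written as a single indexed assignment. The sketch below uses a small dictionary size of 10 and made-up indices so the result is easy to inspect; in the article the size is fixed at 1899.

```matlab
% Hypothetical small example: dictionary of 10 words, one repeated index
n = 10;
word_indices = [3; 7; 3; 10];

x = zeros(n, 1);
x(word_indices) = 1;    % sets entries 3, 7 and 10; the repeat has no extra effect

fprintf('non-zero entries: %d\n', sum(x));   % non-zero entries: 3
```

Since the features are binary, a word that appears several times in an email contributes the same 1 as a word that appears once.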

Training

Once every email in the data set has gone through the processing above, we can carry out ordinary SVM training:
