Chongqing University
Master's Degree Thesis
Research on Feature Selection and Feature Weighting Methods in Text Classification
Name: ***
Degree Applied For: Master
Major: Computer System Architecture
Supervisor: ***
2010-04
Abstract
With the rapid development of Internet and information technology in recent years, the amount of data available to people has grown explosively. How to quickly, accurately and comprehensively locate relevant content in a narrow field of interest within this vast sea of information has become a highly meaningful research topic. Text classification, as one of the key technologies for solving this problem, has become a research hotspot.
Text classification is a complex, systematic undertaking that generally involves several stages: text preprocessing, feature dimension reduction, feature weighting, classifier training and classifier performance evaluation. Building on a detailed study of these stages, this thesis focuses on feature dimension reduction and feature weighting.
Reducing the dimensionality of the high-dimensional feature set is an important step in text classification. It not only speeds up the classifier and saves storage space, but also filters out irrelevant attributes and reduces the interference they cause in the classification process, thereby improving classification accuracy and preventing overfitting. Dimension reduction methods fall into two categories: feature extraction and feature selection. Feature selection is simple and fast to compute, which makes it suitable for large-scale text data, and it has therefore been widely used in text classification. This thesis studies the commonly used feature selection methods in detail, including document frequency, mutual information, information gain, expected cross entropy, the χ² statistic and weight of evidence for text. The characteristics of each method are analyzed, and to address their shortcomings a new feature selection method is proposed that combines inter-class concentration, intra-class dispersion and intra-class average frequency. The new method emphasizes the positive correlation between a feature and a text category, avoids the interference introduced by considering negative correlation, and jointly takes into account the relationship between features and categories as well as the average frequency with which a feature occurs within a class. It is a simple and effective feature selection method.
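To make this kind of per-term scoring concrete, the following minimal Python sketch computes the standard χ² statistic for a (term, category) pair from document counts, together with a hypothetical combined score built from the three factors named above. The exact formula of the proposed method is not given here, so the simple product used in combined_score, and all parameter names, are assumptions made only for illustration.

def chi_square(n_docs, a, b, c, d):
    # Standard chi-square statistic for one (term, category) pair.
    # n_docs: total number of documents in the training set
    # a: documents of the category that contain the term
    # b: documents of other categories that contain the term
    # c: documents of the category that do not contain the term
    # d: documents of other categories that do not contain the term
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n_docs * (a * d - c * b) ** 2 / denom

def combined_score(df_in_class, df_total, class_size, tf_in_class):
    # Hypothetical combination of the three factors named above; the
    # formula actually proposed in the thesis may differ.
    if df_in_class == 0 or df_total == 0 or class_size == 0:
        return 0.0
    concentration = df_in_class / df_total     # inter-class concentration
    dispersion = df_in_class / class_size      # intra-class dispersion
    avg_frequency = tf_in_class / df_in_class  # intra-class average frequency
    return concentration * dispersion * avg_frequency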
Feature weighting improves the distribution of the text collection in the vector space: it makes the spatial structure of texts from the same category more compact and that of texts from different categories more dispersed, which simplifies the mapping from texts to categories and helps improve the performance of the text classifier. This thesis studies the classic feature weighting method, TF-IDF, and analyzes its shortcoming: because it ignores how a feature is distributed across and within classes, it assigns large weights to rare features and small weights to features that contribute strongly to distinguishing categories. An improved form of the TF-IDF formula, incorporating inter-class concentration and intra-class dispersion, is then proposed to remedy this defect.
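As a reference point, the sketch below computes the classic TF-IDF weight and a class-aware variant in which the weight is scaled by the term's inter-class concentration and intra-class dispersion. The improved formula actually used in the thesis is not stated here, so the scaling shown is only an assumption that illustrates the idea, and the parameter names are invented for this sketch.

import math

def tf_idf(tf, df, n_docs):
    # Classic TF-IDF: term frequency times inverse document frequency.
    # tf: frequency of the term in the document
    # df: number of documents in the corpus that contain the term
    # n_docs: total number of documents in the corpus
    return tf * math.log(n_docs / df)

def class_aware_tf_idf(tf, df, n_docs, df_in_class, class_size):
    # Illustrative class-aware variant: the classic weight is scaled by
    # how concentrated the term is in the target class and how evenly it
    # is spread over that class's documents.
    concentration = df_in_class / df       # inter-class concentration
    dispersion = df_in_class / class_size  # intra-class dispersion
    return tf * math.log(n_docs / df) * concentration * dispersion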
On a Chinese text classification experimental platform, several groups of comparative experiments were carried out to examine the effectiveness of the proposed feature selection method and the improved TF-IDF method. The results are measured with recall, precision, the F1 value and other evaluation indicators. They show that the new feature selection method achieves better dimension reduction than the other common feature selection methods, and that the improved TF-IDF weighting method also outperforms the traditional TF-IDF method.
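For reference, the evaluation indicators mentioned above can be computed per class from the counts of true positives, false positives and false negatives, as in this small sketch of the standard definitions (not code from the thesis itself).

def precision_recall_f1(tp, fp, fn):
    # tp: documents correctly assigned to the class
    # fp: documents wrongly assigned to the class
    # fn: documents of the class that the classifier missed
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)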
Keywords: text classification, vector space model, feature selection, feature weighting
ABSTRACT
With the rapid development of Internet and information technology, people can access more and more information. How to accurately, comprehensively and quickly find the desired information within a narrow field from such a huge amount of information has become a very significant issue. Text classification technology, one of the key technologies for solving this problem, has become a research hotspot.
Text classification is a complex and systematic project, which includes text preprocessing, feature dimension reduction, feature weighting, classifier training and classifier performance evaluation. Based on a detailed study of these processes, this thesis focuses on feature dimension reduction and feature weighting.
Reducing the dimensionality of the high-dimensional feature set is an essential part of text categorization. It not only improves the classifier's speed and saves storage space, but also filters out irrelevant attributes and reduces the interference caused by irrelevant information in the text categorization process. Therefore, feature dimension reduction can enhance the accuracy of text classification and prevent over-fitting. Feature dimension reduction can be divided into two categories: feature extraction and feature selection. Feature selection has been effectively applied in text classification because of its simplicity and fast computation, which make it suitable for handling large-scale text data. The commonly used feature selection methods, such as document frequency, mutual information, information gain, expected cross entropy, the Chi-square statistic and weight of evidence for text, are studied in this thesis, and the characteristics of each method are analyzed. To overcome their deficiencies, a new feature selection approach is proposed that comprehensively takes concentration among categories, dispersion within a category and average frequency within a category into account. The new approach is a simple and effective feature selection method: it highlights the positive correlation between features and categories and avoids the interference caused by negative correlation, while the relevance between features and categories and the average frequency with which features occur within a class are considered together.
Feature weighting can improve the distribution of the text set in the vector space. It makes the spatial structure of texts belonging to the same category more compact and that of texts belonging to different categories more loose, which simplifies the mapping from texts to categories and improves the performance of the text classifier. The classical feature weighting method, TF-IDF, is also studied in this thesis. Its shortcoming is that it does not take the inter-class and intra-class distribution of features into consideration, so rare features are given large weights while features with a strong ability to distinguish categories are given small weights. To make up for these defects, an improved TF-IDF formula that combines the concentration of a feature among categories and the dispersion of a feature within a category is proposed.
To verify the effectiveness of the new feature selection approach and the improved TF-IDF formula, several sets of experiments were carried out on a Chinese text categorization test platform. Recall, precision and F1 are used as the evaluation indicators of the experimental results. The results show that the new feature selection approach reduces dimensionality more effectively than other common feature selection methods, while the improved TF-IDF feature weighting method performs better than the traditional TF-IDF method.
Keywords: Text Classification, Vector Space Model, Feature Selection, Feature Weighting
1 Introduction
1.1 Research Background and Significance
The Internet, an open, distributed space where information from around the world converges, is widely regarded as a milestone in the history of human science and technology at the end of the 20th century, and it continues to develop at an astonishing pace. Through the Internet, people can conveniently and quickly access information resources from all over the world, and they can also publish their own information to the world. The volume of information on the Internet is now growing in an avalanche-like fashion: more than 10^9 new texts appear on the network every day, covering almost every aspect of human society, from politics, economics and military affairs to daily life, entertainment and sports, and all of it is completely open. This has given rise to the problem of "information explosion": information is extremely abundant while knowledge remains relatively scarce. On the one hand, people want to obtain more and more information; on the other hand, quickly and effectively finding the content they are interested in within these massive information resources has become increasingly difficult. These massive, heterogeneous Web information resources contain knowledge of enormous potential value, and people urgently need tools that can effectively search, filter and manage these resources. How to quickly, accurately and comprehensively locate the relevant content in a narrow field of interest within this vast sea of information has become a highly meaningful research topic [1].
A large portion of this massive amount of online information is unstructured or semi-structured text. To quickly and effectively obtain the desired information from these texts, they must first be sorted into categories, which is the origin of text classification technology. The task of text classification is to automatically process documents of unknown category and, based on the content of a given text, assign it to one or several of a set of predefined text categories. Traditional manual classification consumes a great deal of labor yet is very inefficient, so rich resources cannot be fully exploited, and it cannot keep up with the current explosion of information on the Internet. Text classification is an effective way to solve this problem: it is the foundation of large-scale text processing and an effective means of improving the functionality and quality of other text processing tasks. With text classification technology, people can store, retrieve and further process texts by category. The research goal of text classification technology is to automate the classification of texts, so as to reduce classification costs, improve classification performance, and increase classification accuracy and consistency.
Text classification is regarded as the foundation of all content-based text information management technologies, and the related techniques can be applied in many fields. As an effective means of organizing and managing text data, text classification has broad application prospects and commercial value in information retrieval, information filtering, text databases, digital library search engines, metadata extraction, index construction, word sense disambiguation, e-mail sorting, learning user interests, and other areas. Research on text classification therefore has great theoretical and practical value, and text classification has become one of the research hotspots of recent years [2].