News: the machine learning way
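a denotes the number of instances correctly assigned to the class, b denotes the number of instances wrongly assigned to the class, and c denotes the number of instances that belong to the class but were misclassified
r = a / (a + c), if a + c > 0; otherwise r = 1
p = a / (a + b), if a + b > 0; otherwise p = 1
where the parameter β assigns different weights to precision (p) and recall (r); when β is 1, precision and recall are
the micro-averaged F1 measures are the same.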
Coping with the News: the machine learning way
Dr. Rafael A. Calvo [HREF1], Web Engineering Group, The University of
Sydney [HREF2], NSW, 2006.
Prof. Jae-Moon Lee [HREF1], Hansung University [HREF3], Web Engineering
Group, The University of Sydney [HREF2], NSW, 2006.
Abstract
News articles represent some of the most popular and commonly accessed
content on the web. This paper describes how machine learning and
automatic document classification techniques can be used for managing
large numbers of news articles. In this paper, we work with more than
800,000 Reuters news stories and classify them using Naive Bayes and
k-Nearest Neighbours approaches. The articles are stored in NewsML format,
commonly used for content syndication. The methodology developed would
enable a web-based routing system to automatically filter and deliver news
to users based on an interest profile.
Information overload, in which individuals are faced with an oversupply of
content, could become the road rage of the new millennium. In order for
this content to become useful information and empower users, rather than
frustrate or confuse them, we need novel ways of delivering only what is
needed, at the right time and in the right format. News stories are the most
important source of up-to-date information about the world around us, and
represent some of the most frequently updated and highest quality content on
the web. It is therefore important to develop ways to process news
efficiently. In this paper we study machine learning models and apply them
to the automatic classification of large numbers of online news articles.
Language technologies have provided excellent tools for information
retrieval by exploiting the content being indexed. In the early days of the
Internet, search engines and classification systems on the web used only
information from unstructured text to index and find content (e.g.
Altavista); later they used the topology of the net to determine which of the
indexed information was more important (e.g. Google). Meanwhile,
the progress achieved in document classification has not been as noticeable,
and most web page classification is done manually by thousands of human
indexers, in private catalogues or in large-scale ones such as Yahoo! and the
Open Directory Project. Instead of using human labour, automatic
document classification tools [3,11,12] use statistical models, similar to
those in information retrieval. These systems can be used to create
hierarchical taxonomies of content, as are commonly found on the web, in
e-learning systems and even in traditional library information systems.
News stories are written by reporters from all over the world. The
reporters often work for news agencies such as the Associated Press and
Reuters. The agencies collect the news, edit it and sell bundles of articles
to the periodicals accessed by web users (e.g. the Sydney Morning Herald).
It is important for both the agencies and the periodicals to have an
organized, well-managed stream of news. News is
normally classified according to taxonomies that are relevant to readers
(e.g. “politics”, “Iraq” or “Oil”). This classification can be very difficult
because it requires human expertise to spot relationships between the
taxonomy and the documents. Even the experts do not agree on what
should go where, and inter-indexer consistency in classification shows
considerable variation [9].
Automatic classification techniques use algorithms that learn from the
human classifications, so they can only do as well as the human training
data provided. In addition, different algorithms can learn different types of
patterns in the data. In order to compare the classification performance of
different algorithms, researchers have a set of standardised benchmarks,
each with a particular dataset and a well-defined task. The most popular
classification benchmark during the late nineties was a Reuters collection
called Reuters-21578 (based on Reuters-22173) with 21,578 documents
that had to be classified into about 100 different categories [4,13]. This
benchmark is still used to compare the performance of different algorithms,
but the challenges now lie in moving towards larger-scale document
classification [8]. In 2002, Reuters released a new research dataset with
over 800,000 documents, which we discuss in this paper.
There are several Machine Learning (ML) algorithms that have been
successfully used in the past [4,7,11,13]. They include Neural Networks,
Naive Bayes, Support Vector Machines (SVM) and k-Nearest Neighbours
(kNN). Each of these methods has its own advantages and limitations in
classification performance and scalability. The choice of algorithm will
depend on the application and the amount of data to be used. In web
applications, efficiency is of particular importance, since the large number of
users and the volume of data can make some algorithms unfeasible.
Section 2 of this paper describes the Reuters RCV1 collection used in this
project, focusing on the challenges posed by its size and the
richness of its XML structure. Section 3 describes the ML algorithms we
have used, namely Naive Bayes and k-Nearest Neighbours. Section 4 describes
how we have improved a classification framework [12] in order to make it
scalable for web applications. Section 5 describes the performance results
for the classifier and the improvements in memory and CPU requirements.
Section 6 concludes and summarizes our future work in the area.
The Reuters collection, challenges and solutions
The Reuters RCV1 Corpus [9] consists of all English news stories (806,791)
published by Reuters in the period between 20/8/1996 and 19/8/1997. The
news is stored as files in XML format, using a NewsML Document Type
Definition (DTD). NewsML is an open standard being developed by the
International Press and Telecommunications Council (IPTC). The news is
written by approximately 2000 reporters and then classified by Reuters
specialists in a number of ways. The classified news articles are then
syndicated by websites such as or
periodicals like the Sydney Morning Herald that may or may not have a
Due to seasonal variations, the number of stories per day is not constant.
In addition, on weekdays there is an average of 2,880 stories per day,
compared to 480 on weekends. Approximately 3.7 GB is required for the
storage of the uncompressed XML files.
The NewsML schema contains metadata produced by human indexers about
themes and categories. When two humans index a document differently
they create inter-indexer variations. These can be measured using a
correction rate C = (NC/NE)*100, where NE is the number of stories indexed
by an editor and NC is the number of times that editor has been corrected by
a second editor. Normally untrained editors will have a higher C than more
expert ones, but even when they are all experienced, correction rates of 10%
are common. In the RCV1 collection there are correction rates of up to 77%.
Since ML algorithms learn from examples (classifications done by humans),
correction rates are an important limiting factor on their performance.
Performance measures in classification systems are really a measure of how
much they correlate with the human classifiers: it is not possible for a student
to be better than the teacher, or at least it is not possible for the teacher to
know so.
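As a small worked illustration of the correction rate defined above, the following Python sketch computes C from the two counts. The function name and the example counts are hypothetical and are not taken from the RCV1 statistics.

```python
def correction_rate(num_corrected: int, num_indexed: int) -> float:
    """Return C = (NC / NE) * 100, the correction rate as a percentage.

    num_corrected is NC (times the editor was corrected by a second editor),
    num_indexed is NE (stories indexed by the editor).
    """
    if num_indexed <= 0:
        raise ValueError("num_indexed (NE) must be positive")
    return 100.0 * num_corrected / num_indexed


# Hypothetical editor: 500 stories indexed, 50 of them corrected.
print(correction_rate(50, 500))  # 10.0
```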
RCV1 data is stored in XML documents providing the metadata
normally required by news agencies and periodicals that need to deliver the
stories to end users. NewsML defines a rich schema, shown simplified in
Figure 1. We can see that it has entities for title, headline, text, copyright
and several types of classification. For our experiments we have used all
three available classifications (topic, country and industry) as a single task.
Figure 1: An example of a NewsML document
Web applications will increasingly exploit this type of metadata. As has
been discussed by researchers studying the concept of the semantic web [2],
the next revolution in the Internet will come when web applications have
access to structured collections of information and sets of inference rules
that can be used to perform automated reasoning. This project aims at
producing such applications.
Language Technologies and document classification
Language technologies is an emerging field with researchers from different
backgrounds. Mathematicians, linguists and computer scientists are
producing theories and software tools that manage content more efficiently.
This research includes speech processing, information retrieval and
document classification. The models in document classification are similar to
those in the other areas, and extensive literature describes them in detail
[4,7,11,14]. For the sake of brevity, in this section we only summarize the
basic concepts.
The essence of the classification techniques is that documents and
categories can be represented in a high-dimensional vector space. Since
classification consists of assigning documents to predefined categories,
machine learning techniques have to learn the mapping between the two
vector spaces, from documents to categories.
The simplest model to represent a document as a vector is the binary
representation. In this model, a vector has the dimension of the dictionary,
and the occurrence of a word puts a 1 in the corresponding element, all
other elements being 0.
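The binary representation can be sketched in a few lines of Python. This is a minimal illustration only: the whitespace tokenisation, the four-word dictionary and the function name are assumptions made for the example, not part of the framework described in this paper.

```python
def binary_vector(document: str, dictionary: list[str]) -> list[int]:
    """Return a vector with one element per dictionary term:
    1 if the term occurs in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1 if term in words else 0 for term in dictionary]


# Hypothetical four-term dictionary.
dictionary = ["bank", "oil", "price", "vote"]
print(binary_vector("Oil price rises as oil supply falls", dictionary))
# [0, 1, 1, 0]
```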
The next level of complexity, and the one we have used, is called Term
Frequency (TF), since the value of each element is equal to the number of
occurrences of the term in the document. If a term appears more often, it
is often because it is more important. This is not always the case, since words like articles and prepositions often do not add much to the information value. These terms are called stop-words and are normally eliminated from the document. Another technique to reduce the number of terms is stemming, where words are reduced to their common stem (e.g. “runner” and “runners” are reduced to “run”).
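A term frequency vector with stop-word removal might look like the following sketch. The stop-word list, the dictionary and the function name are illustrative assumptions; stemming is left out because a real stemmer is beyond the scope of a short example.

```python
from collections import Counter

# Illustrative stop-word list only; real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "and"}


def tf_vector(document: str, dictionary: list[str]) -> list[int]:
    """Count the occurrences of each dictionary term in the document,
    ignoring stop-words."""
    counts = Counter(
        word for word in document.lower().split() if word not in STOP_WORDS
    )
    return [counts[term] for term in dictionary]


dictionary = ["bank", "oil", "price", "vote"]
print(tf_vector("The price of oil and the oil supply", dictionary))
# [0, 2, 1, 0]
```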
Finally, more elaborate models take into account the idea that terms that appear in many documents do not tell us much about an individual one, so an Inverse Document Frequency (IDF) weight is used. These models require computing statistics across the whole corpus, making them much more computationally expensive. Since our goal in this project was to study scalable techniques that could be used in web applications, we have used the simpler TF vector representations.
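To show why IDF needs a pass over the whole corpus, here is a minimal sketch using one common variant of the weight, idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The variant, the function names and the toy corpus are assumptions for illustration; this weighting was not used in our experiments.

```python
import math
from collections import Counter


def idf_weights(corpus: list[list[str]]) -> dict[str, float]:
    """Compute idf(t) = log(N / df(t)) for every term in a tokenised corpus.
    Requires a full pass over all documents."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for tokens in corpus:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Weight each term frequency of one document by its IDF."""
    tf = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in tf.items()}


# Toy corpus: "oil" appears in every document, so its weight is zero.
corpus = [["oil", "price"], ["oil", "vote"], ["oil", "bank", "bank"]]
idf = idf_weights(corpus)
print(tfidf(corpus[2], idf))
```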
Once the documents have been represented as vectors, they can be further reduced using statistical reduction techniques such as χ² (chi-squared) or term frequency. These techniques select those terms that have
higher impact on, or correlation with, the classification. We do not discuss them here, as several other sources describe them in detail [4,7,11,13].
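For readers unfamiliar with χ² term selection, the following sketch scores one term against one category from a 2x2 contingency table of document counts, using the standard χ² statistic. The argument names and the counts are hypothetical; the cited sources give the full treatment.

```python
def chi_squared(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-squared statistic for one term and one category.

    n11: documents in the category that contain the term
    n10: documents outside the category that contain the term
    n01: documents in the category that lack the term
    n00: documents outside the category that lack the term
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n10 + n00) * (n11 + n10) * (n01 + n00)
    if denom == 0:
        return 0.0  # degenerate table: no evidence either way
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


# A term spread evenly across the category boundary scores 0;
# a term that perfectly separates the category scores highest.
print(chi_squared(10, 10, 10, 10))  # 0.0
print(chi_squared(20, 0, 0, 20))    # 40.0
```

Terms would then be ranked by this score and only the highest-scoring ones kept as vector dimensions.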
Machine learning algorithms are trained on a subset of the corpus, and later tested on a different subset to measure their performance. Several ML algorithms have been used successfully in a number of applications [3,11], including Naive Bayes and kNN, both used here.
The success of a classification is often stated with two standard efficacy measures: precision and recall. Precision is the probability that if a random document is classified as part of C, this classification is correct. Recall is the probability that if a random document ought to be classified under C, this classification is actually made. The two quantities are highly and inversely correlated. We can have absolute recall by assigning all documents to all categories, but at the cost of having the worst precision. We can also have high precision by never assigning a document to a category, but this would make the classifier useless.
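The trade-off between the two measures can be made concrete with a small sketch for a single category C. The function names and the ten-document example are hypothetical, and returning 1 when a denominator is zero is one possible convention, adopted here as an assumption.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of documents assigned to C that really belong to C."""
    assigned = true_pos + false_pos
    return true_pos / assigned if assigned > 0 else 1.0  # convention for empty case


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of documents belonging to C that were assigned to C."""
    relevant = true_pos + false_neg
    return true_pos / relevant if relevant > 0 else 1.0  # convention for empty case


def evaluate(assigned: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one category."""
    tp = len(assigned & relevant)
    fp = len(assigned - relevant)
    fn = len(relevant - assigned)
    return precision(tp, fp), recall(tp, fn)


all_docs = set(range(10))
relevant = {0, 1}  # only two documents truly belong to C

print(evaluate(all_docs, relevant))  # (0.2, 1.0): full recall, poor precision
print(evaluate(set(), relevant))     # (1.0, 0.0): nothing assigned, useless
```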
The trade-off between recall and precision is controlled by setting the classifiers’ parameters. Both values should be provided to describe the performance. Another common performance measure is the F1-measure, the harmonic mean of precision (p) and recall (r): F1 = 2pr / (p + r).
When dealing with multiple classes there are two possible ways of averaging these measures, namely macro-average and micro-average. The macro-average weights all the classes equally, regardless of how many documents belong to each. The micro-average weights all the documents equally, thus favouring the performance on common classes. Different classifiers will perform differently on common and rare categories. Learning algorithms are trained more often on more populated classes, thus risking local over-fitting.
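The difference between the two averages is easiest to see on a small example. The sketch below computes F1 from per-class counts, using the identity F1 = 2·TP / (2·TP + FP + FN), and then averages it both ways. The class names and counts are invented for illustration, and returning 0 for an empty class is an assumed convention.

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0  # assumed convention when empty


def macro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Average the per-class F1 values: every class weighs the same."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Pool the counts over all classes first: every document weighs the same."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts: one common class handled well,
# one rare class handled badly.
counts = {"common": (90, 10, 10), "rare": (1, 4, 4)}
print(macro_f1(counts))  # about 0.55, pulled down by the rare class
print(micro_f1(counts))  # about 0.87, dominated by the common class
```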
Object Buffering and performance
Object Oriented Application Frameworks (OOAF) are software engineering artefacts that improve the reusability of design and implementation [5,6,10].
The classification framework used in this project [HREF4] [12] was designed to provide a consistent way to build document categorization systems. It allows the developer to focus on the quality of the more critical and complex modules by allowing reuse of common, simple base modules. In this project we study how the framework can be used in a news story classification system, and we have extended the framework with an object buffering module to make it more scalable, as required in web applications.
One of the most serious limitations when training ML algorithms is the amount of data required to achieve satisfactory performance, together with the scalability issues that arise when training such algorithms.
Memory requirements are often the first obstacle to managing large corpora, especially for a text categorization system, where the ML algorithms are trained on very large vector spaces and large amounts of data. Most operating systems have dealt with this problem by adding support for virtual memory, but the general purpose techniques are not always sufficient or appropriate for special applications like ours. In those cases, applications often have an adaptive buffer management tool that allows the
application to allocate specific amounts of memory.
Figure 2: Run time object diagram
We have extended our text categorization framework [12] so it can handle
much larger corpora. Figure 2 shows the object diagram of the important
classes at run time in the framework. Each circle in Figure 2 represents an
object and each arrow represents a relationship meaning that a
category/document ‘HAS-A’ feature vector (FV). The document object
corresponds exactly to a document in the training or testing sets. The
category object corresponds exactly to a category. For RCV1, there are
over 800K documents, so the framework needs to manage that number of
document objects and the same number of feature vectors. The object
buffer can be configured to store as many objects as possible in RAM
and to request the others from the hard disk as they are required.
Thus, there is only a constant number of objects in memory, and the
system can be optimised for the available resources independently of
the amount of data.
The buffered object module we implemented is described in Figure 3. The
module supports a unique persistent Object Identifier (OID) for all objects
stored in it. The framework then uses this OID instead of the object or its
reference. If the framework needs an object stored in the buffered
object module, it makes a request to the buffered object manager using the
OID. The buffered object manager first searches for the requested object in the
object pool. If the object manager finds it, it returns the object’s reference;
otherwise it requests the object from the buffered file manager, which tries to
minimize I/O to the disk. The buffered file manager reads the requested object
from the disk and returns it to the buffered object manager. When the buffered
object manager receives the object from the buffered file manager, it first
stores the received object in the buffer pool and then returns it to the requester
in the framework.
Figure 3: Architecture of the buffered object module
In order to optimise memory usage, the buffered object manager uses a
constant amount of memory. Our current implementation follows the Least
Recently Used (LRU) strategy [1]. In this strategy, common for paging in
operating systems, the least recently used (read or written) object (or page)
is selected to be taken out of the buffer (paged out). The same rule is also
used in other cache systems when they select which cache entry to flush. In
future work we plan to add new buffering strategies such as Most Recently
Used (MRU) and others. We expect that the selection of the strategy will
depend on each Machine Learning algorithm. For example, Naive Bayes
might work better with an LRU and kNN with an MRU strategy.
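The lookup path and the LRU eviction rule described above can be sketched as follows. This is not the framework's implementation: the class name, the `capacity` parameter and the plain dictionary standing in for the buffered file manager and the disk are all assumptions made for illustration, and writing modified objects back to disk is omitted.

```python
from collections import OrderedDict


class BufferedObjectManager:
    """Keeps at most `capacity` objects in memory, keyed by OID,
    and evicts the least recently used one when the pool is full."""

    def __init__(self, capacity: int, store: dict):
        self.capacity = capacity
        self.store = store         # stand-in for the buffered file manager / disk
        self.pool = OrderedDict()  # OID -> object, least recently used first
        self.disk_reads = 0

    def get(self, oid):
        if oid in self.pool:
            self.pool.move_to_end(oid)  # mark as most recently used
            return self.pool[oid]
        obj = self.store[oid]           # pool miss: read from "disk"
        self.disk_reads += 1
        self.pool[oid] = obj
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)  # page out the least recently used
        return obj


# Hypothetical store of three feature vectors and a two-object pool.
store = {1: [0, 2, 1], 2: [1, 0, 0], 3: [0, 0, 4]}
manager = BufferedObjectManager(capacity=2, store=store)
for oid in (1, 2, 1, 3):
    manager.get(oid)
print(list(manager.pool), manager.disk_reads)  # [1, 3] 3
```

Switching to an MRU strategy in this sketch would only require evicting from the other end of the pool, which is one reason to keep the eviction rule separate from the lookup logic.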
With a scalable framework we are able to show the feasibility of
automatically classifying large amounts of news stories. Since the news stories
in the RCV1 dataset are in NewsML format, the extensions to the framework
need to include more efficient XML parsing. We first tried the Perl and C
implementations of the SAX package. Adapting one of the libraries
implemented in C and integrating it into the framework produced the best
We have tested the extended framework on the Reuters RCV1 data in order to:
· Assess the feasibility of automatic document classification for large-scale management of news stories.
· Measure the classification performance of Naive Bayes and kNN
classifiers on this new corpus, and find ways to improve it.
· Measure the improvements over the non-buffered framework in
computing time and memory management.
We performed three series of experiments: one to determine how the Naive
Bayes and kNN algorithms performed for different amounts of training data,
a second to see whether using the NewsML structure could improve
performance by giving different weights to the title of the story, and finally
one to measure how they scaled (in computation time) for different amounts of data.
Although the precision of the classification itself should not be modified by