Sarah Zelikovitz ZELIKOVI@CS.RUTGERS.EDU
Haym Hirsh HIRSH@CS.RUTGERS.EDU
Computer Science Department, Rutgers University, 110 Frelinghuysen Road, Piscataway, NJ 08854-8019 USA
Abstract
We describe a method for improving the classification of short text strings using a combination of labeled training data plus a secondary corpus of unlabeled but related longer documents. We show that such unlabeled background knowledge can greatly decrease error rates, particularly if the number of examples or the size of the strings in the training set is small. This is particularly useful when labeling text is a labor-intensive job and when there is a large amount of information available about a particular problem on the World Wide Web. Our approach views the task as one of information integration using WHIRL, a tool that combines database functionalities with techniques from the information-retrieval literature.
1. Introduction
The task of classifying textual data that has been culled from sites on the World Wide Web is both difficult and intensively studied (Cohen & Hirsh, 1998; Joachims, 1998; Nigam et al., 1999). Applications of various machine learning techniques that attempt to solve this problem include categorization of Web pages into sub-categories for search engines, and classification of news articles by subject. Machine learning programs, such as C4.5 (Quinlan, 1993) and RIPPER (Cohen, 1995), have the limitation that they learn based solely upon previously classified data. It is often both impractical and extremely tedious and expensive to hand-label a sufficient number of training examples to achieve the high accuracy that is needed for a given task. Given the huge proliferation of data on the Web, only a tiny percentage of which can realistically be classified and labeled, these programs are unable to exploit this information to achieve higher accuracy when faced with new unlabeled examples.

Various researchers in text learning and mining have recognized that although there might be very few labeled examples, there can be a tremendous number of unlabeled examples. Nigam et al. (in press) have done work on this using Expectation Maximization (EM) and a naive Bayes classifier. The parameters of the naive Bayes classifier are set using labeled examples. The learned model is then used by EM to probabilistically classify unlabeled documents, with the resulting collection of classified documents used to estimate a new set of parameters for naive Bayes. The EM algorithm iterates until there is no change in the naive Bayes parameters. Nigam et al. present a number of experimental results that show that error rates can be reduced significantly using unlabeled examples in this way. Other related algorithms are described by McCallum and Nigam (1999) and Jones et al. (1999).
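To make the flavor of such algorithms concrete, a minimal semi-supervised naive Bayes loop in the spirit of Nigam et al. might look like the sketch below. It is our own illustration rather than their implementation: the corpus variables are assumed inputs, scikit-learn's CountVectorizer and MultinomialNB stand in for their text model, and a fixed number of iterations replaces their convergence test on the parameters.

    import numpy as np
    import scipy.sparse as sp
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def em_naive_bayes(labeled_texts, labels, unlabeled_texts, iterations=10):
        # One shared bag-of-words vocabulary over both corpora.
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(list(labeled_texts) + list(unlabeled_texts))
        X_lab, X_unl = X[:len(labeled_texts)], X[len(labeled_texts):]

        # Initial parameters are estimated from the labeled examples only.
        nb = MultinomialNB().fit(X_lab, labels)

        for _ in range(iterations):
            # E-step: probabilistically classify the unlabeled documents.
            post = nb.predict_proba(X_unl)          # (n_unlabeled, n_classes)
            # M-step: re-estimate parameters from the labeled documents plus
            # every unlabeled document counted fractionally under each class.
            X_rep = sp.vstack([X_lab] + [X_unl] * len(nb.classes_))
            y_rep = np.concatenate([labels] +
                                   [[c] * X_unl.shape[0] for c in nb.classes_])
            w_rep = np.concatenate([np.ones(X_lab.shape[0]), post.T.ravel()])
            nb = MultinomialNB().fit(X_rep, y_rep, sample_weight=w_rep)
        return nb, vectorizer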
Blum and Mitchell's (1998) co-training algorithm also uses unlabeled data to improve learning. Their algorithm applies to problems where the target concept can be described in two redundantly sufficient ways (such as through two different subsets of attributes describing each example). Each view of the data is used to create a predictor, and each predictor is used to classify unlabeled data, which are then used to further train the other learner. Blum and Mitchell prove that under certain conditions, the use of unlabeled examples in this way is sufficient to PAC-learn a concept given only an initial weak learner.
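A toy version of this loop (our own sketch, simplified to move a single confidently labeled example per view per round rather than following Blum and Mitchell's published growth schedule) might look as follows, assuming two pre-computed feature views of the labeled and unlabeled pools:

    from sklearn.naive_bayes import MultinomialNB

    def co_train(X1, X2, y, U1, U2, rounds=10):
        # X1/X2: two feature "views" of the labeled pool (lists of dense count
        # vectors); y: their labels; U1/U2: the same two views of the unlabeled
        # pool. Each pass, each view's classifier labels the unlabeled example
        # it is most confident about, and that example joins the labeled pool
        # used to retrain both classifiers.
        X1, X2, y = list(X1), list(X2), list(y)
        U1, U2 = list(U1), list(U2)
        for _ in range(rounds):
            for view in (0, 1):
                if not U1:
                    return X1, X2, y
                h = MultinomialNB().fit(X1 if view == 0 else X2, y)
                probs = h.predict_proba(U1 if view == 0 else U2)
                i = int(probs.max(axis=1).argmax())   # most confident example
                label = h.classes_[probs[i].argmax()]
                X1.append(U1.pop(i))
                X2.append(U2.pop(i))
                y.append(label)
        return X1, X2, y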
Lewis and his colleagues (Lewis & Gale, 1994; Lewis & Catlett, 1994) also make use of unlabeled data in learning, but focus on asking a human labeler for labels on only a modest subset of the data, those whose class membership is most undecided given the results of learning on the data that has been labeled thus far.
This paper also describes a method that uses a corpus of unlabeled data to assist in the classification task. However, unlike the preceding approaches, we do not attempt to classify the unlabeled data, and, indeed, do not even require that it be of a form comparable to that of the training data. In many cases it is not even possible to classify the data using the classification schema of the labeled instances. Instead, we use the unlabeled corpus as "background knowledge" for the learner, to aid it in its decision task. Rather than comparing a new unlabeled example directly to elements of the labeled training corpus, we use the unlabeled background knowledge as a "bridge" to connect the new example with labeled examples. A labeled training example is useful in classifying an unknown test instance if there exists some set of unlabeled background knowledge that is similar to both the test example and the training example. We call this a "second-order" approach to classification, in that data are no longer directly compared but, rather, are compared one step removed, through an intermediary.
In more detail, in this paper we look at improving the classification of short text strings by using unlabeled but related longer documents. A concrete example of the usefulness of our approach can be seen in the task of assigning topic labels to technical papers. In labeling the title of a physics article with its sub-specialty, any title containing a word such as galaxy should easily be classified correctly as an astrophysics paper, even if there are few training articles in that domain. However, an article on a less common topic, as for example old white dwarfs, would only be classified correctly if a title with these words appears in the labeled training examples. Although the training set does not contain the words old white dwarf in our experimental data, our system is able to correctly classify a title with these words as astrophysics by utilizing a corpus of unlabeled paper abstracts from the same field, which is naturally available on the Web. In our "second-order" approach, our system finds those unlabeled paper abstracts that are most similar to both old white dwarfs and to various training titles. These training titles are then used to classify old white dwarfs correctly, although each of these titles is quite dissimilar to it when compared directly.
In order to achieve our goal, we use WHIRL (Cohen, 1998a; Cohen, 1998b), which is a conventional database system augmented with special operators for text comparison. Its use as a text classification program is a nearest neighbor approach (Cohen & Hirsh, 1998), with text documents specified as TFIDF vectors, and similarity between text documents measured as cosine similarity (Salton, 1989). WHIRL makes it possible to pose SQL-like queries on databases with text-valued fields. If we consider the training examples as a table, the background knowledge as a table, and a test example as a table as well, WHIRL provides a framework in which we can easily specify and explore "second-order" similarity classification. It allows for succinct queries that specify the combination of training similarity and background similarity to a new test example.
In the next section we give a brief review of WHIRL and a discussion of how we use it for classification with unlabeled data. We then describe four distinctly different domains on which we tested our system. The domain descriptions are followed by a set of results for each of the domains for varied data sets. We conclude with a discussion of the various possible dimensions that our choices along the way can take and directions for current and future research.

2. Our Approach
2.1 WHIRL for Text Classification
WHIRL (Cohen, 1998a; Cohen, 1998b) is an information integration tool that is specifically designed to query and integrate varied textual sources from the Web. WHIRL's SQL-type queries can search and retrieve textual sources based upon specified conditions. Assume that we have a corpus of training examples with labels, and a test example that must be assigned a label. The training examples can be viewed as a table with the field instance, to hold the textual data, and the field label, to hold the class label. The test example is a one-line table, with simply the textual field instance. An example of a WHIRL query (Cohen & Hirsh, 1998) is:

SELECT Test.instance, Train.label
FROM Train AND Test
WHERE Train.instance SIM Test.instance
Given a user-specified parameter K, this query will first generate an intermediate table containing the K tuples

Test.instance, Train.instance, Train.label

that maximize the similarity score between Test.instance and Train.instance. Unlike traditional SQL queries, the result of this is a set of tuples ordered by score, with the highest score representing the closest (Train.instance, Test.instance) pair, using WHIRL's SIM operator to compute the similarity of the textual documents. To compute similarity, WHIRL builds a model of each text document by representing it as a vector in a vector space. This representation is computed by passing each document through a stemmer (Porter, 1980) and by then computing weights for each term using the TFIDF (Salton, 1989) weighting method. Distances between vectors are computed using the cosine metric, which represents the statistical similarity between documents.
In WHIRL's final step, it takes this table of K tuples and projects it onto the fields specified in the SELECT statement. Note that this can mean that there may be many training examples among the K with the same label (i.e., multiple nearby examples in the training set), and these are combined into a single tuple in the final result table. The combination of scores is performed by treating scores as probabilities and the combination as a "noisy or." If the individual scores of the tuples with a given label are {s_1, ..., s_n}, the final score for that label is $1 - \prod_{i=1}^{n}(1 - s_i)$. Whichever label has the highest score in the resulting projected table is returned as the label for the test instance. This method bears many similarities to the nearest-neighbor method of Yang and Chute (1994), which has been shown to perform quite well on text classification tasks (Cohen & Hirsh, 1998). Indeed, based on these two papers we also use a value of K = 30 in our experiments.
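This classification scheme can be approximated outside of WHIRL. The sketch below is our own illustration under simplifying assumptions: scikit-learn's TfidfVectorizer stands in for WHIRL's stemmed TFIDF representation (no Porter stemming), cosine similarity is computed as the dot product of the L2-normalized vectors, and the K best training matches are combined per label with the noisy-or rule just described.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def whirl_nn_predict(train_texts, train_labels, test_text, K=30):
        # Represent all documents as TFIDF vectors in one shared space.
        vec = TfidfVectorizer()
        X_train = vec.fit_transform(train_texts)
        x_test = vec.transform([test_text])

        # Rows are L2-normalized, so the dot product is the cosine similarity.
        sims = (X_train @ x_test.T).toarray().ravel()

        # Keep the K highest-scoring training tuples ...
        top = np.argsort(sims)[::-1][:K]

        # ... and combine the scores of tuples sharing a label with "noisy or":
        # the score for a label with tuple scores s_1..s_n is 1 - prod(1 - s_i).
        label_scores = {}
        for i in top:
            s, lab = sims[i], train_labels[i]
            label_scores[lab] = 1.0 - (1.0 - label_scores.get(lab, 0.0)) * (1.0 - s)
        return max(label_scores, key=label_scores.get)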
2.2 WHIRL with Background Knowledge
The question we now ask is: How can we use a large body of unlabeled data, or "background knowledge", to aid classification? Although these pieces of information need not be labeled, and indeed may have no relationship whatsoever to the classification task, we would hope to learn something from the word combinations in these examples. Such background knowledge may provide us with a corpus of text that contains information both about the importance of words (in terms of their TFIDF values in this large corpus) and about the joint probability of words (what percentage of the time do two words coexist in a document?). This gives us a large context in which to test the similarity of a training example with a new test example. We can use this context in conjunction with the training examples to label a new example.

Because of WHIRL's expressive language, and the ability to create conjunctive queries simply by adding conditions to a query, WHIRL's queries for text classification can be expanded to allow for the use of "background knowledge" on a subject. In the example of the classification of physics paper titles discussed earlier, suppose that we had a fairly small set of labeled paper titles, and also a very large set of unlabeled titles, papers, or abstracts (or Web pages resulting from a search), in a relation called Background with a single field, value. We can create the following query for classification:
SELECT Test.instance, Train.label
FROM Train AND Test AND Background
WHERE Train.instance SIM Background.value
AND Test.instance SIM Background.value
Here each of the two similarity comparisons in the query computes a score, and WHIRL multiplies them together to obtain a final score for each tuple in the intermediate-results table. This table is then projected onto the Test.instance and Train.label fields as discussed before. Whichever label gives the highest score is returned as the label for the test example.
One way of thinking about this is that rather than trying to connect a test example directly with each training example, it instead tries to bridge them through the use of an element of the background table. Note that WHIRL combines the scores of tuples generated from different matches to the background table. Our use of WHIRL in this fashion thus essentially conducts a search for a set of items in the background knowledge that are close neighbors of the test example, provided that there exists a training example that is a neighbor of the background knowledge as well. Training neighbors of a test example are defined differently when background knowledge is incorporated. If words in a test example are found in some background knowledge, then other words that are in that background knowledge can connect this test example to dissimilar (in terms of word overlap and direct cosine difference) training examples. The final classification thus integrates information from multiple training examples and the multiple "bridge" examples that lie between them in the background text.
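Continuing the illustrative sketch from Section 2.1 (again our own approximation rather than WHIRL itself), the second-order query can be emulated by scoring every pair of a training example and a background document with the product of the two cosine similarities from the conjunctive WHERE clause, keeping the K highest-scoring bridged tuples, and applying the same noisy-or combination per label:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def whirl_bg_predict(train_texts, train_labels, background_texts,
                         test_text, K=30):
        # One TFIDF space shared by training, background, and test documents.
        vec = TfidfVectorizer()
        vec.fit(list(train_texts) + list(background_texts) + [test_text])
        X_tr = vec.transform(train_texts)
        X_bg = vec.transform(background_texts)
        x_te = vec.transform([test_text])

        # Cosine similarity of each background document to the test instance
        # and to every training instance (rows are L2-normalized by default).
        bg_test = (X_bg @ x_te.T).toarray().ravel()       # |BG|
        bg_train = (X_bg @ X_tr.T).toarray()              # |BG| x |Train|

        # The score of a (training, background) bridge is the product of the
        # two similarity scores, mirroring the conjunctive WHERE clause.
        scores = bg_train * bg_test[:, None]               # |BG| x |Train|

        # Keep the K best bridged tuples and noisy-or them per training label.
        flat = np.argsort(scores.ravel())[::-1][:K]
        label_scores = {}
        for idx in flat:
            b, t = np.unravel_index(idx, scores.shape)
            s, lab = scores[b, t], train_labels[t]
            label_scores[lab] = 1.0 - (1.0 - label_scores.get(lab, 0.0)) * (1.0 - s)
        return max(label_scores, key=label_scores.get)

As in the text above, a single training example can contribute through several different bridges, and a single background document can support training examples with different labels; the noisy-or step is what merges these contributions.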
Note that this approach does not concern itself with which class (if any!) a background item belongs to. We instead simply use the text directly as part of the decision-making process. This is in contrast to an approach that would explicitly use the training set to classify the background items as if they were true examples, and then add them to the labeled set. Our method allows for more sophisticated use and combination of the training instances and background knowledge. A background instance that is close to numerous training instances can be included more than once in the table returned by the WHIRL query, even if the training examples that it is close to have different classes. Similarly, a training example can also be included in the table multiple times, if it is close to numerous background instances. Suppose that our classification task consists of labeling the first few words of a news article with a topic. If a test example belongs to the category sports, for instance, the cosine distance between the few words in the test example and each of the small number of training examples might be large. However, we would hope that, given a large corpus of unlabeled news articles, it is likely that there will be one or more articles that contain both the few words of the test example and the words of one of the training examples.
Finally, note that our use of "background knowledge" in a WHIRL query is in a sense a form of query expansion (Buckley et al., 1995). Instead of directly searching for the training examples that are closest to a test example, we search for the training examples that are closest to the "background knowledge" expansion of the test example. However, unlike standard query expansion, and because of our conjunctive conditions, the background knowledge expansion itself is chosen with respect to the training example that it is close to. This means that each query has multiple expansions, and all those that maximize the score of the conjunctive condition are combined.
3. Experiments and Results
We have tested our system on four distinct text-categorization tasks that we have taken from the World Wide Web. In each case, the training and test examples are short text strings, a problem that is prevalent in real-world applications and for which WHIRL was especially designed. For each of our four problems, the source of our background knowledge varies, sometimes originating at the same site from which we obtained the labeled data, and sometimes from unrelated sites also found on the Web.

3.1 Data Sets
Technical papers One common text categorization task is assigning discipline or sub-discipline names to technical papers. We created a data set from the physics papers archive, where we downloaded the titles of all technical papers in the first three areas in physics (astrophysics, condensed matter, and general relativity and quantum cosmology) for the month of March 1999. As background knowledge we downloaded the abstracts of all papers in the same areas from the two previous months, January and February 1999. In total there were 1701 pieces of knowledge in the background set, and 1066 in the training-test set combined. The distribution of classes was skewed, however, as there were 493 titles in astrophysics, 460 in condensed matter, and only 113 in quantum cosmology. The background knowledge abstracts were downloaded without their labels (i.e., without knowledge of what sub-discipline they were from) so that our learning program had no access to them.
News Another data set that we created was obtained from ClariNet news. We downloaded all articles under the sports and banking headings on November 17th, 1999, using the most recent ones for training and test sets and the older ones for background knowledge. In total, our background knowledge consisted of a corpus of 1165 articles. The background knowledge in this problem consisted of the first 100 words of each of these articles. Informal studies showed us that including the entire articles did not improve accuracy substantially, and degraded the efficiency of WHIRL. Our training-test data had 1033 data points, of which 637 belonged to the sports category and 406 belonged to the banking category. We present four sets of results in connection with these 1033 data points, called 3-words, 5-words, 7-words, and 9-words, corresponding to the test and training sets consisting of the first 3, 5, 7, or 9 words of each article, respectively.

Web page titles To determine the usefulness of WHIRL as a nearest neighbor classification tool, Cohen and Hirsh (1998) used two data sets taken from the World Wide Web. The first, NetVet (netvet.wustl.edu), included the Web page headings for its pages concerning cows, horses, cats, dogs, rodents, birds, and primates. For example, a training example in the class birds might have been: "Wild Bird Center of Walnut Creek". Each of these titles had a URL that linked the title to its associated Web page. For the labeled corpus, we chose half of the titles with their labels, in total 1789 examples. We discarded the other half of the titles, along with their labels, and simply kept the URLs to the associated Web pages. We used these URLs to download the first 100 words from each of these pages, to be placed into a corpus for background knowledge. Those URLs that were not reachable were ignored by the program that created the background knowledge. In total there were 1158 entries in the background knowledge database.
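Both the News and NetVet background corpora keep only the first 100 words of each document. A rough sketch of that preprocessing step (our own illustration; the function name, the crude tag stripping, and the timeout are assumptions) is shown below; unreachable URLs are simply skipped, as described above.

    import re
    import urllib.request

    def first_n_words(url, n=100):
        # Fetch a page and keep only its first n whitespace-separated words.
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="ignore")
        except Exception:
            return None                        # unreachable URLs are skipped
        text = re.sub(r"<[^>]+>", " ", html)   # crude HTML tag stripping
        return " ".join(text.split()[:n])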
Companies The second of Cohen and Hirsh's data sets consisted of a training set of company names, 2472 in all, taken from the Hoover Web site, labeled with one of 124 industry names. We created background knowledge from an entirely different Web site: we downloaded the Web pages under each business category in the Yahoo! business hierarchy to create 101 pieces of background knowledge. The Yahoo! hierarchy had a different number of classes and a different way of dividing the companies, but this was irrelevant to our purposes since we treated it solely as a source of unlabeled background text. Each piece of background knowledge consisted of the combination of Web pages that were stored under a sub-topic in the Yahoo! hierarchy. Each instance in the table of background knowledge was thus a much longer text string than the training or test examples.
3.2 Results
We present a series of results and graphs that show that WHIRL incorporating background knowledge most often performs better than WHIRL without the supplementary knowledge. This improvement is most dramatic when there are fewer labeled examples, and when the labeled examples are shorter strings.
However, since a number of these data sets are being used for the first time in this paper, we start this section by comparing the core WHIRL method without background knowledge (which we label WHIRL-nn) to a more traditional method, RIPPER (Cohen, 1995), to demonstrate that our improvements are on top of already strong classification performance. Error rates in Table 1 represent the average error of five cross-validated runs on the full training set. The "best" value for each problem is shown in bold. Since WHIRL uses Porter's stemming algorithm when it creates the TFIDF representation, we used it as well before giving the data to RIPPER. As can be seen from this table, on all data sets WHIRL-nn outperforms RIPPER, and thus any improvements above and beyond WHIRL-nn that we now report represent even stronger classification performance than this credible state-of-the-art method.
The remainder of our results, presented in a series of figures, compares error rates for the baseline WHIRL approach, WHIRL-nn, with those of our new approach, which we label WHIRL-bg. In each case we report error rates as we vary the number of training examples given to the learner.
Table 1. Error rates: RIPPER vs. WHIRL-nn

Problem          RIPPER   WHIRL-nn
3-words          20.23    12.36
5-words          14.98     7.11
7-words          13.07     5.55
9-words          11.71     4.77
netvet           44.26    39.20
hoovers          77.95    70.90
2class physics   30.50     7.89
3class physics   39.96    16.90
Figure 4. WHIRL-bg and WHIRL-nn for 5-word News (error rate vs. percentage of data used).
Figure 5. WHIRL-bg and WHIRL-nn for 7-word News (error rate vs. percentage of data used).
Figure 6. WHIRL-bg and WHIRL-nn for 9-word News (error rate vs. percentage of data used).
Figure 7. WHIRL-bg and WHIRL-nn for NetVet (error rate vs. percentage of data used).
error rates only when less than 60 percent of the data was used. This gives empirical evidence that the less informative the training data is, the greater the advantage in having a corpus of background knowledge available for use during classification. The size of the reduction in error rate obtained by running WHIRL-bg was also greater when there were fewer words in each example.
Results for the NetVet domain are graphed in Figure 7. Reductions in error rate increased as the number of training examples decreased. The NetVet domain is unlike the two previously discussed in that there was overlap in topics in the background knowledge. A Web page that could be useful in classifying a test example as belonging to the category of Dogs was quite likely to discuss Cats, and vice versa. Some of the training and test examples, too, could have caused confusion. There were titles of Web pages on pet stores or animal care that were placed under one topic, but could just as easily have been placed in many other categories. We therefore were not surprised to see that the error rate did not decrease by a large percentage.
The results for the Companies data set are graphed in Figure 8. Once again, WHIRL-bg outperformed WHIRL-nn. Using 100 percent of the data, the decrease in error rate is substantial. However, when the percentage of training examples that was used is lower, the difference in error rate between the two systems is reduced. This is unlike the results of the previous three domains. It might be due to the fact that the training and test examples were company names, which often consisted of words that occurred only once (for example, Xerox), so that reducing the number of training examples actually reduced the dictionary of words in the training corpus substantially. There were therefore fewer words that could be used to find bridges in the background knowledge.