BNS Feature Scaling: An Improved Representation over TF·IDF for SVM Text Classification
George Forman HP Laboratories HPL-2007-32R1
Keyword(s):
Text classification, topic identification, machine learning, feature selection, Support Vector Machine, TF*IDF text representation.
Abstract:
In the realm of machine learning for text classification, TF·IDF is the most widely used representation for real-valued feature vectors. However, IDF is oblivious to the training class labels and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which has been previously found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) by using BNS scaling. A wide variety of other feature representations were later tested and found inferior, as were binary features with no scaling. Moreover, BNS scaling yielded better performance without feature selection, obviating the need for feature selection.
External Posting Date: August 6, 2008 [Fulltext] Approved for External Publication
Internal Posting Date: August 6, 2008 [Fulltext]
To be presented and published in the ACM 17th Conference on Information and Knowledge Management, Napa Valley, CA,
October 26-30, 2008
BNS Feature Scaling: An Improved Representation over TF·IDF for SVM Text Classification
George Forman
Hewlett-Packard Labs
Palo Alto, CA, USA
ghforman@
ABSTRACT
In the realm of machine learning for text classification, TF·IDF is the most widely used representation for real-valued feature vectors. However, IDF is oblivious to the training class labels and naturally scales some features inappropriately. We replace IDF with Bi-Normal Separation (BNS), which has been previously found to be excellent at ranking words for feature selection filtering. Empirical evaluation on a benchmark of 237 binary text classification tasks shows substantially better accuracy and F-measure for a Support Vector Machine (SVM) by using BNS scaling. A wide variety of other feature representations were later tested and found inferior, as were binary features with no scaling. Moreover, BNS scaling yielded better performance without feature selection, obviating the need for feature selection.
Categories and Subject Descriptors
H.3.3 [Information Search & Retrieval]: Information filtering;
I.5 [Pattern Recognition]: Design methodology, feature evaluation and selection.
General Terms
Algorithms, Performance, Experimentation.
Keywords
Text classification, topic identification, machine learning, feature selection, Support Vector Machine, TF*IDF text representation.
1. INTRODUCTION
Text classification via machine learning is at the heart of effective document categorization, personalization, news filtering, and information routing. State-of-the-art classification accuracy can be achieved by applying a linear Support Vector Machine (SVM) to a ‘bag-of-words’ representation of the text, where each unique word in the training corpus becomes a separate feature [1][9][10]. The numerical feature value for a given word/term is often represented by its term frequency TF in the given text multiplied by its inverse document frequency (IDF) in the entire corpus—the ubiquitous ‘TF·IDF’ representation. IDF is commonly taken to be log( # documents ÷ # documents containing the term ). By multiplying by IDF, common functional words such as ‘of’ and ‘can’ are devalued relative to the uncommon words that are more likely topic-specific indicators.
Although TF·IDF is widely used in text classification, it is oblivious to the class labels in the training set, which can lead to inappropriate scaling for some features. Consider a toy example: a word X occurs in 80% of the positive training cases and another word Y occurs in only 3% of the positive cases—suppose neither occurs among the negative training cases. IDF gives a super-linear boost to words with lower frequency of occurrence. But in this case the more common word is a much stronger predictor. A specific, real example is illustrated later in Section 5.1.
Filter methods for feature selection have developed a variety of metrics that do a good job of correctly ranking the predictive value of different features, e.g. Bi-Normal Separation (BNS) [4].
In this paper we improve on the state of the art by using the BNS feature scoring metric in a new way: to scale the magnitude of the feature values. That is, we compute the BNS score for each feature and use TF·BNS for each feature value, replacing IDF. This increases the effect of important words on the kernel distance computations for the SVM. We show that this simple idea substantially improves SVM accuracy and F-measure on a benchmark of 237 binary text classification tasks, especially when TF is restricted to be binary. For comparison, we also tested a dozen other metrics and found that BNS scaling performed best. BNS feature selection was previously shown to substantially improve text classification [4]. One of the difficulties of feature selection, however, is in deciding the optimal number of features to use. The new method of BNS scaling offers to simplify the process, because it consistently performed best by using all features. This has the intuitive appeal of not ‘throwing away’ information. For those situations where the volume of data must be reduced for computational scalability at the cost of classification accuracy, we recommend a hybrid that uses Information Gain for feature selection and BNS for feature scaling, based on our empirical study.
1.1 Related Work and Scope
In this space of classification research, some work addresses binary classification [4][9][13], as in information filtering, e.g. separating spam from good email. Other work addresses multi-class classification [17], e.g. routing or classifying a document into one of many categories. Our focus here is on binary classification, but we expect the results to generalize. Binary tasks are an important sub-problem in most multi-class classification methods, which decompose the 1-of-n problem by pitting each class against the others. Finally, we note that the problem of n-of-m multi-class classification, e.g. topic recognition, is addressed by m independent binary classifiers.
There is a large research literature on feature selection metrics to filter words for text classification. The goal is often to improve accuracy, but in some papers it is to preserve accuracy as much as possible as the number of features is reduced in order to decrease computational workload. In this work our goal is simply to maximize classification performance, as measured by accuracy and F-measure. However, in the process, we developed a method that satisfies both goals.
As a side note, in non-text domains there has been a lot of work that takes a collection of features with widely disparate ranges and normalizes or discretizes them to make them palatable for induction algorithms. These concerns led to the common practice of normalizing the feature space for SVMs. Instead, we modify the normalization phase as an additional opportunity to condition the data for learning. Finally, there are many references to ‘feature scaling’ or ‘feature weighting’ in the literature that simply refer to variations on normalization. For example, word counts are sometimes scaled so that long documents appear to have the same number of words as short documents. In contrast, this work scales the feature ranges based on the supervised labels. The closest references to this sort of feature weighting in the literature are in lazy learning (case-based learning), where the goal is to learn an appropriate distance function so that the nearest neighbors of a novel case suggest appropriate class labels [15]. Such methods constitute iterative wrapper searches to adjust the feature weights of a (linear) distance metric. By contrast, in our work, we let the SVM perform the inductive learning, and we simply condition the input feature vectors with a single, fast pass over the training set. Any of the feature scoring metrics, as well as IDF, can be computed in a single linear scan of the training corpus.
2. METHODS
Here we briefly define the feature scoring metrics and how they are used for feature selection and/or for feature scaling.
2.1 Feature Selection via Filtering
Filtering methods for feature selection evaluate each feature independently via a chosen scoring metric. Then, some number of top-scoring features is selected for input to the induction algorithm. One can either specify the number of top-ranked features to select, or equivalently, specify a particular score threshold, which is particular to the feature scoring metric being used. In order to compare different scoring metrics on a common x-axis scale, the former is often preferable in research.
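For concreteness, the following is a minimal sketch of the filtering step in Java, assuming the per-feature scores (e.g. BNS or IG) have already been computed; the class and method names are illustrative only, not the paper's implementation.

    import java.util.Comparator;
    import java.util.stream.IntStream;

    public class FilterSelection {
        /**
         * Returns the indices of the k highest-scoring features.
         * scores[i] holds the chosen metric's value (e.g. BNS or IG) for feature i.
         */
        public static int[] selectTopK(double[] scores, int k) {
            return IntStream.range(0, scores.length)
                    .boxed()
                    .sorted(Comparator.comparingDouble((Integer i) -> scores[i]).reversed())
                    .limit(Math.min(k, scores.length))
                    .mapToInt(Integer::intValue)
                    .toArray();
        }
    }

Equivalently, one could keep every feature whose score exceeds a metric-specific threshold; the two views differ only in how the cut-off is specified.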
2.2 Feature Scaling
The key idea of this paper is to use a feature scoring metric to make the numeric range greater for more predictive features, just as IDF attempts to do in TF·IDF. This affects the dot-product distance between cases as evaluated by the linear SVM kernel [9]. For example, if the BNS score is 2.1 for the word feature ‘free’ in a spam classification task, then its Boolean presence or absence in a document would be represented as either 0 or 2.1, rather than 0 or 1. A less predictive word ‘cat’ with a BNS score of 0.3 would have the smaller range 0 or 0.3, and therefore have less effect on the kernel distance computations. This basic idea can be applied to any scoring metric. Furthermore, it can also be applied to non-binary features, e.g. to scale term frequency TF counts: TF·BNS. Additionally, feature scaling may be used in conjunction with feature selection. The scoring metric used for feature selection may be different from the metric used for feature scaling. The best method found by our experiments is such a hybrid: IG used for feature selection and BNS for feature scaling.
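As an illustration (not code from the paper), the scaling step itself is just an element-wise product of a document's feature vector with the per-feature scores:

    public class FeatureScaling {
        /**
         * Scales a document's feature values by a per-feature score (e.g. BNS).
         * With binary features tf[i] is 0 or 1, so the result is 0 or score[i];
         * with term-frequency counts this produces the TF·BNS representation.
         */
        public static double[] scale(double[] tf, double[] score) {
            double[] scaled = new double[tf.length];
            for (int i = 0; i < tf.length; i++) {
                scaled[i] = tf[i] * score[i];
            }
            return scaled;
        }
    }

In the hybrid just described, the surviving features are chosen with one metric (IG) and their values are then filled in with another (BNS).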
2.3 Feature Scoring Metrics
The primary feature scoring metrics we use in this paper are defined as follows.
Bi-Normal Separation (BNS): | F⁻¹(tpr) – F⁻¹(fpr) |
Inverse Document Freq (IDF): log( (pos+neg) ÷ (tp+fp) )
Log Odds Ratio [13]: log( (tp·tn) ÷ (fp·fn) )
Information Gain (IG): H(data) – H(data | word)
  = H(pos,neg) – ( P(word) H(tp,fp) + (1–P(word)) H(fn,tn) ), where
pos = number of positive training cases, typically the minority,
neg = number of negative training cases,
tp = number of positive training cases containing the word,
fp = number of negative training cases containing the word,
fn = pos – tp,
tn = neg – fp,
true positive rate tpr = P(word | positive class) = tp/pos,
false positive rate fpr = P(word | negative class) = fp/neg,
P(word) = (tp+fp) / (pos+neg),
entropy H(x,y) = –nln( x/(x+y) ) – nln( y/(x+y) ),
nln(x) = x log₂ x, and
F⁻¹ is the inverse Normal cumulative distribution function, as commonly available from statistical tables.
Note that these are computed using binary word features, i.e. many occurrences of a word in a single document only count toward one tp or fp count. Information Gain and BNS are naturally symmetric with respect to positively and negatively correlated features. Log Odds Ratio, however, assigns a very low score to a strongly predictive feature that occurs in almost all negative cases but in none of the positive cases. To rectify this, for any negatively correlated feature we reverse the meaning of a word occurrence to be a word non-occurrence, i.e. tp ↔ fn and fp ↔ tn. This solution improves a number of feature selection metrics that otherwise ignore strong negative features [4].
As usual, there are some nuances to converting the straightforward mathematical definitions into robust code. For example, Log Odds Ratio is undefined if tp, tn, fp, or fn is zero. To avoid this, we substitute 0.5 for any zero count, which has the desirable property that even if some of the variables are zero, the function remains responsive to the magnitude of the other variables. Likewise, for IG we define nln(x) = 0 whenever x = 0. Finally, in the BNS function, the inverse Normal goes to infinity at zero or one; hence, we limit tpr and fpr to the range [0.0005, 1−0.0005]. Laplace smoothing is a more common method to avoid the extreme probabilities, but it damages the maximum likelihood estimate, and it loses the good performance of BNS by devaluing many valuable negative features in favor of very rare positive features [4]. Alternately, and perhaps preferably, one could substitute a fractional count if tp or fp is exactly zero; this may work better for extremely large training sets. We used a fixed limit because we used a finite-size lookup table for the inverse Normal function, generated by Gnuplot’s invnorm() function and transferred to our Java code, since this standard statistical function is not available in the Java math libraries.
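The Java sketch below restates these definitions together with the robustness fixes (clamping of tpr and fpr, the 0.5 substitution for Log Odds Ratio, and nln(0) = 0). It assumes the Apache Commons Math library for the inverse Normal rather than the lookup table used in the paper, so it is illustrative rather than the paper's implementation.

    import org.apache.commons.math3.distribution.NormalDistribution;

    public class FeatureMetrics {
        private static final NormalDistribution NORMAL = new NormalDistribution(); // standard Normal
        private static final double LIMIT = 0.0005;  // clamp range for tpr/fpr, as in the text

        public static double bns(int tp, int fp, int pos, int neg) {
            double tpr = clamp((double) tp / pos);
            double fpr = clamp((double) fp / neg);
            return Math.abs(NORMAL.inverseCumulativeProbability(tpr)
                          - NORMAL.inverseCumulativeProbability(fpr));
        }

        public static double idf(int tp, int fp, int pos, int neg) {
            return Math.log((double) (pos + neg) / (tp + fp));
        }

        public static double logOddsRatio(int tp, int fp, int pos, int neg) {
            // substitute 0.5 for any zero count so the ratio stays defined
            // (the paper additionally swaps tp<->fn and fp<->tn for negatively
            //  correlated features before scoring; not shown here)
            double tpd = nz(tp), fpd = nz(fp), fnd = nz(pos - tp), tnd = nz(neg - fp);
            return Math.log((tpd * tnd) / (fpd * fnd));
        }

        public static double infoGain(int tp, int fp, int pos, int neg) {
            int fn = pos - tp, tn = neg - fp;
            double pWord = (double) (tp + fp) / (pos + neg);
            return entropy(pos, neg) - (pWord * entropy(tp, fp) + (1 - pWord) * entropy(fn, tn));
        }

        private static double entropy(double x, double y) {
            if (x + y == 0) return 0.0;               // empty partition contributes no entropy
            return -nln(x / (x + y)) - nln(y / (x + y));
        }

        private static double nln(double x) {         // x log2 x, with nln(0) defined as 0
            return x == 0 ? 0 : x * (Math.log(x) / Math.log(2));
        }

        private static double clamp(double p) {
            return Math.min(1 - LIMIT, Math.max(LIMIT, p));
        }

        private static double nz(double count) {
            return count == 0 ? 0.5 : count;
        }
    }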
3. EXPERIMENT DESIGN
Our experiments consider a wide variety of text feature representations. For each potential feature scoring metric, we consider using it as a scale factor on TF feature counts and separately as a scale factor on binary features. We also consider plain binary features, as well as raw, unscaled TF features. In Section 4.6, we also combine scaling with feature selection.
3.1 Benchmark Datasets
The benchmark text classification tasks we use are drawn from Reuters-21578 (re), OHSUMED abstracts (oh), the Los Angeles Times (la), the Foreign Broadcast Information Service (fbis), the Web ACE project (wap), and various TREC competitions (tr, new3), originally compiled by Han and Karypis [7]. See Table 1, which includes the total vocabulary size and the average number of words present per document. There are 19 multi-class datasets and their class sizes vary widely. For the experiments we consider each class vs. all others within that dataset, yielding 237 binary text classification tasks, which are representative of the size and difficulty of many of the industrial text classification tasks we face at HP Labs.
Although the tasks are not entirely independent of each other because of some overlap in their negative class, they are much more independent than studies that consider only a single source of documents with a common negative class, e.g. using just Reuters. We have made the processed TF feature vectors available for other researchers at the WEKA web site [16].
Table 1. Benchmark text classification datasets
Dataset   Classes   Docs    Words   Words/Doc
fbis      17        2463    2000    160
la1       6         3204    31472   151
la2       6         3075    31472   148
new3      44        9558    26833   235
oh0       10        1003    3182    53
oh10      10        918     3012    56
oh15      10        1050    3238    59
oh5       10        913     3100    54
ohscal    10        11162   11465   60
re0       13        1504    2886    52
re1       25        1657    3758    53
tr11      9         414     6429    282
tr12      8         313     5804    274
tr21      6         336     7902    470
tr23      6         204     5832    385
tr31      7         927     10128   269
tr41      10        878     7454    195
tr45      10        690     8261    281
wap       20        1560    8460    141
3.2 Evaluation Measures
We evaluate performance based on two standard measures: accuracy, as the historical standard in machine learning research, and F-measure, for its improved sensitivity in the common information retrieval situation where positives are rare. F-measure is the harmonic average of precision and recall, where precision = true positives ÷ predicted positives, and recall = true positives ÷ all positives in ground truth. Achieving good F-measure requires that the classifier have good precision and good recall on the positive class, whereas good accuracy can be achieved simply by predicting all test cases as negative, if the positive class is rare.
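To make the definition concrete, here is a small illustrative helper (not from the paper) that computes F-measure from test-set confusion counts:

    public class FMeasure {
        /** Harmonic mean of precision and recall, from test-set confusion counts. */
        public static double fMeasure(int truePos, int falsePos, int falseNeg) {
            if (truePos == 0) return 0.0;                                // precision or recall would be 0
            double precision = truePos / (double) (truePos + falsePos);  // TP ÷ predicted positives
            double recall    = truePos / (double) (truePos + falseNeg);  // TP ÷ positives in ground truth
            return 2 * precision * recall / (precision + recall);
        }
    }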
For each of the 237 binary text classification tasks, the performance of the learning algorithm is measured via standard 4-fold stratified cross-validation, repeated with eight different random split seeds (kept consistent for each representation tested). Note that all feature selection and scaling is determined from only the training folds within cross-validation, to avoid leaking information from the test set.
For any given feature representation under study, we obtain 237 × 8 performance measurements. To distill these into one summary number, we use macro-averaging, weighting each of the measurements equally (as opposed to micro-averaging, which would weight them according to the number of documents that happen to be available in each task).
3.3 Induction Algorithms
We use the linear SVM implementation provided by the WEKA library v3.4 [16]. In this experiment, we use its default behavior, rather than attempting to optimize each of its many parameters. In particular, its default complexity constant is one. We had to disable its feature-space normalization option; otherwise it re-scales all feature values to the range [0,1]. Thus, BNS feature scaling causes the more important features to have a greater effect on the SVM kernel dot product than less predictive features.
This kernel-centric explanation of the effect is specific to kernel-based classifiers such as the SVM, but we found that BNS scaling also helps some other classification models that are not based on kernels. We repeated the experiment protocol with the multinomial Naïve Bayes classifier, which has been shown superior to its simple binomial form for text classification [11]. (Note that feature scaling would have no effect on the classic binomial formulation, which converts each feature to binary nominal values. The multinomial formulation uses word counts, which BNS scaling effectively increases for the more important features.) We found that while BNS scaling does benefit the multinomial Naïve Bayes classifier, its best performance came from using plain binary features with BNS feature selection, and even then it had substantially worse accuracy than the SVM—consistent with much of the literature. Thus, we report the Naïve Bayes results only in a tech report [2]. Henceforth we consider only SVM models.
4. EMPIRICAL RESULTS
Initially we focus on the BNS feature scoring metric (the initial impetus for this work) and contrast it with IDF or no scaling, since they are the most common. Later, in Section 4.5, we compare against many other feature scoring metrics.
4.1 Accuracy & F-measure
Figure 1 shows the average accuracy and F-measure for six different text representations, including error bars that extend above and below each point by one standard error, stdev ÷ sqrt(N). Binary features and term frequency (TF) counts are considered for each of three feature scaling methods: IDF, BNS, and no scaling. The two most common text representations, TF·IDF and binary features, both performed significantly worse than BNS scaling of binary features (labeled just ‘BNS’). We expected this from the intuitive motivation in the introduction: BNS scaling leverages the class labels to improve the distance evaluations of the SVM kernel. Using binary features, as opposed to term frequencies, consistently improves accuracy; binary features do not appear to have a consistent effect on F-measure, but we elucidate this next. Overall, the BNS-scaled binary features provided a 1% improvement over TF·IDF for accuracy and 7% for F-measure, both with no question of statistical significance given how small the error bars are in proportion to the differences.
4.2 Precision vs. Recall Analysis
Good F-measure requires a balance between precision and recall. It can be informative to see the tradeoffs the different methods make between precision and recall at the classifier’s natural classification threshold. (Alternately, one could consider full precision-recall curves by sweeping through all decision thresholds, but this considers many decision points for which the classification algorithm was not attempting to optimize, and it makes it difficult to compare six methods.) Figure 2 shows Precision vs. Recall for the six methods considered in the previous figure, including standard error bars in both dimensions. The x- and y-axis ranges are identical to make it clear that the SVM precision typically exceeded its recall.
Here we can see consistent effects. Using binary features yields a large improvement in precision compared with TF features. Likewise, BNS scaling results in a large improvement in recall compared with IDF or no scaling. (We will explain the underlying cause of this in the discussion section, as well as the additional data point labeled ‘hBNS.’) Together, the two effects make for a substantial increase in F-measure. The dotted curved lines in the background show points of equal F-measure, indicating the gradient to obtain better F-measure performance. Since the positive class is typically small for text classification, the best improvement in accuracy comes from improving precision, not recall—hence the consistent benefit of binary features on accuracy that we observed in Figure 1. (Note: for applications where the best precision is sought regardless of recall, one may prefer instead to optimize the precision in the top N predicted positives.)
4.3 The Effect of Class Distribution
The importance of BNS scaling increases as the class distribution becomes more skewed. To illustrate this, we have binned together those benchmark tasks that have between 0% and 5% positive cases, 5%–10% positives, and so forth. See Figure 3, which shows the macro-average F-measure for each bin and again includes standard error bars. When the minority class is relatively balanced (20%–25%), the benefit of BNS scaling vs. no scaling or IDF scaling of binary features is small. As the percentage of positives decreases to a small minority, the classification performance naturally drops for all methods, but the decline is substantially less if BNS scaling is used. In other words, the benefit of BNS scaling is greater for highly imbalanced classification problems. Each successive bin contains roughly half as many benchmark tasks, indicating that the performance under high skew is the more common operating range for text classification problems.
Figure 1. SVM accuracy (left) and F-measure (right) for six different text feature representations, including standard error bars.
Figure 2. Precision & Recall for six text representations. The thin dotted lines show isoclines of equal F-measure.
Figure 3. The benefit of BNS feature scaling grows as the percentage of positive cases becomes small.
4.4 Per-Dataset Views
Next we show the macro-average F-measure for each of the 19 datasets separately. Although the graph is cluttered, Figure 4 clearly shows the near-complete dominance of BNS scaling of binary features (labeled ‘BNS’). To aid the visualization, we have sorted the classes by their BNS performance. BNS did not dominate for three of the most difficult datasets, where instead TF·BNS performed best. Note that the performance of some of the methods can be quite erratic, e.g. for datasets tr21 and tr23 the methods IDF and binary performed extraordinarily poorly.
4.5 Comparing Many Other Scoring Metrics
At first, we tested the feature scaling idea with only the BNS metric and the Information Gain metric, because these two were the leaders for feature selection in our prior study [4]. But this begs the question of whether some other feature scoring metric might perform yet better. To resolve this question, we tested all the feature selection metrics of our prior study, where their many formulae can be found. Note that asymmetric functions, such as the Probability Ratio [13], have been adjusted to symmetrically value strong features that happen to be correlated with word absence rather than word presence. To these we added Pearson Correlation and Pointwise Mutual Information [10]. Altogether, fifteen choices for feature scaling (all supervised except for IDF and ‘no scaling’) are paired both with TF counts and separately with binary features in Figure 5. Although some future metric may be devised that exceeds BNS, the present evidence indicates that BNS with binary features is statistically significantly better—it passes a paired two-tailed t-test against the runner-up with p < 1% (even at N = 75 trials, for readers who may not consider the benchmark problems as independent). Many of the methods, including IDF, performed worse than using no scaling.
We note that the runner-up was Log Odds Ratio. (We included its formula in Section 2.3, partly because it is extremely easy for researchers and practitioners to adopt in their code, whereas BNS requires statistical tables not available in common software libraries.) Our prior study showed that Odds Ratio has a preference surface that is similar to BNS. For feature selection, the dynamic range of the scoring metric is immaterial—only the relative ranking of features is important. But here, we find that the logarithm of Odds Ratio did a better job of conditioning the feature vectors for the linear SVM kernel than using raw Odds Ratio, which has a very large range. Similarly for Chi-Squared and Probability Ratio.¹
4.6 Feature Selection & Scaling Combined
Finally, we consider the question of whether performance might be further improved by its use in conjunction with feature selection, since feature selection has been shown beneficial, especially with BNS and IG [4]. In Figure 6 we vary the number of features selected, with and without BNS scaling on binary features (all performance figures were worse for TF features, so they are not shown, in order to reduce visual clutter). The rightmost point on the logarithmic x-axis is not to scale, but indicates that all features were used within each dataset, i.e. no feature selection. The overall best performance for both accuracy and F-measure is obtained using BNS scaling with all features. Without BNS scaling, the best available performance is obtained by selecting a subset of features via BNS. (The lack of improvement with IG feature selection and certain other methods has led some researchers in the past to conclude that SVM does not benefit from feature selection.)
¹ Whereas most experiment points were evaluated in seconds or minutes, an occasional classification task paired with Odds Ratio, Probability Ratio or Chi-Squared (each without the logarithm) took many hours to train. We eventually terminated these, and exclude them from the average. We suppose their huge dynamic range made a highly discontinuous search space for the SVM optimization.
Figure 4. Macro-average F-measure for each of the 19 datasets.
Figure 5. F-measure for 30 different text representations.