Manifold Adaptive Experimental Design
for Text Categorization
Deng Cai, Member, IEEE, and Xiaofei He, Senior Member, IEEE

Abstract—In many information processing tasks, labels are usually expensive and unlabeled data points are abundant. To reduce the cost of collecting labels, it is crucial to predict which unlabeled examples are the most informative, i.e., would improve the classifier the most if they were labeled. Many active learning techniques have been proposed for text categorization, such as SVM Active and Transductive Experimental Design. However, most previous approaches try to discover the discriminant structure of the data space, whereas the geometrical structure is not well respected. In this paper, we propose a novel active learning algorithm which is performed in the data manifold adaptive kernel space. The manifold structure is incorporated into the kernel space by using the graph Laplacian. This way, the manifold adaptive kernel space reflects the underlying geometry of the data. By minimizing the expected error with respect to the optimal classifier, we can select the most representative and discriminative data points for labeling. Experimental results on text categorization have demonstrated the effectiveness of our proposed approach.
Index Terms—Text categorization, active learning, experimental design, manifold learning, kernel method.
1 INTRODUCTION
TEXT classification has been a fundamental problem in many information processing tasks [1], [14], [17], [22], [32], [44]. In order to train a classifier that can automatically distribute documents into different semantic categories, one usually needs to collect a large set of labeled examples. To reduce the effort of collecting labels, many researchers have studied using active learning [37] for text categorization. The key problem in active learning is determining which unlabeled examples would be the most informative, i.e., improve the classifier the most if they were labeled and used as training examples.
There has been a long tradition of research on active learning in the machine learning community [10], [12], [16]. One popular group of algorithms selects the most uncertain data given previously trained models. One representative algorithm in this group is SVM Active. Based on the observation that the closer a data point is to the SVM boundary, the less reliable its classification is, Tong and Koller proposed SVM Active, which selects those unlabeled data points closest to the boundary to solicit the user's labeling, so as to achieve maximal refinement of the hyperplane between the two classes [43]. Another group of algorithms chooses the most informative points that optimize some expected measures [12]. Many algorithms in statistics belong to this category. In statistics, the problem of selecting samples to label is typically referred to as experimental design [2]. Classical optimal experimental design approaches include A-optimal design, D-optimal design, and E-optimal design. Recently, Yu et al. have proposed Transductive Experimental Design (TED) with either sequential [45] or convex [46] optimization, which has yielded impressive results on text categorization. TED is fundamentally based on optimal experimental design but evaluates the expected prediction error on both labeled and unlabeled examples.
Standard learning systems operate on input data after they have been transformed into feature vectors living in an m-dimensional space. In such a space, standard learning tasks like classification, clustering, and data selection (active learning) can be performed. The resulting hypothesis will then be applied to test points in the same vector space in order to make predictions. Recently, various researchers (see [3], [33], [41]) have considered the case when the data are drawn by sampling a probability distribution that has support on or near a submanifold of the ambient space. Here, a d-dimensional submanifold of a Euclidean space $\mathbb{R}^m$ is a subset $\mathcal{M}^d \subset \mathbb{R}^m$ which locally looks like a flat d-dimensional Euclidean space [26]. In order to detect the underlying manifold structure, many manifold learning algorithms have been proposed, such as Locally Linear Embedding (LLE) [33], ISOMAP [41], and Laplacian Eigenmap [3]. One of the key ideas in manifold learning approaches is the so-called local invariance idea [18], i.e., nearby points are likely to have similar embeddings/labels.

All the early manifold learning techniques mainly focus on dimensionality reduction. Recently, the manifold idea (or local invariance idea) has been successfully applied to clustering [30], semi-supervised learning [4], [25], [40], [47], topic modeling [9], and matrix factorization [6]. In particular, the manifold idea has achieved great success on various text analysis tasks [7], [8], [20], [24]. For example, both Transductive SVM [24] and spectral graph transducers [25] (two of the popular semi-supervised learning algorithms for text analysis) use the local invariance idea. All these approaches demonstrated that learning performance can be significantly enhanced if the geometrical structure is exploited and the local invariance is considered. It is very natural that this idea should also be considered in active learning. However, most
The authors are with the State Key Lab of CAD&CG, College of Computer Science, Zhejiang University, 388 Yu Hang Tang Rd., Hangzhou, Zhejiang 310058, China. E-mail: {dengcai,xiaofeihe}@cad.zju.edu.
Manuscript received 26 Aug. 2009; revised 9 Nov. 2009; accepted 17 Sept. 2010; published online 28 Apr. 2011.
Recommended for acceptance by J. Zobel.
For information on obtaining reprints of this article, please send e-mail to: tkde@computer, and reference IEEECS Log Number TKDE-2009-08-0632.
Digital Object Identifier no. 10.1109/TKDE.2011.104.
1041-4347/12/$31.00 © 2012 IEEE. Published by the IEEE Computer Society
of the existing active learning algorithms fail to take into account the intrinsic manifold structure.
In this paper, we propose a novel manifold adaptive active learning algorithm for text categorization. By using a data-dependent norm on a reproducing kernel Hilbert space (RKHS), proposed by Sindhwani et al. [40], we can warp the structure of the RKHS to reflect the underlying geometry of the data. Conventional optimal experimental design can then be performed in the manifold adaptive kernel space. We discuss how to kernelize convex transductive experimental design, which gives rise to nonlinear manifold adaptive data selection for text categorization.
The rest of the paper is organized as follows: in Section 2, we provide a brief review of related work. Our manifold adaptive active learning algorithm for text categorization is introduced in Section 3. The experimental results are presented in Section 4. Finally, we provide concluding remarks and suggestions for future work in Section 5.
2 BACKGROUND
The generic problem of active learning is the following. Given a set of points $X = \{x_1, x_2, \ldots, x_n\}$ in $\mathbb{R}^m$, find a subset $Z = \{z_1, z_2, \ldots, z_k\} \subset X$ which contains the most informative points. In other words, the points $z_i\ (i = 1, \ldots, k)$ can improve the classifier the most if they are labeled and used as training points.
There has been extensive research in machine learning on this subject. Some popular directions include selecting the most uncertain data given previously trained models [42] and selecting the most representative points by exploiting the cluster structure of the data [13]. One representative algorithm that selects the most uncertain data is SVM Active [42], [43]. This method selects the points that can reduce the size of the version space as much as possible. Since it is difficult to measure the version space, the authors provide three approximations. The one that selects the points closest to the current decision boundary is called Simple Margin. This method was also proposed by Schohn and Cohn [35] and has been very popular. Some other methods include query-by-committee [39], density-weighted methods [29], [38], and explicit error-reduction techniques [34], [48]. Please refer to [37] for a comprehensive treatment of active learning approaches.
In statistics, the problem of selecting samples to label is typically referred to as experimental design. The sample $x$ is referred to as the experiment, and its label $y$ is referred to as the measurement. The study of optimal experimental design (OED) [2] is concerned with the design of experiments that are expected to minimize the variances of a parameterized model. Since the approach described in this paper is based on OED, we give a detailed description of optimal experimental design as follows.
2.1 Optimal Experimental Design
We consider learning a linear function $f(x) = w^T x$ from the observation $y = w^T x + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$ is the observation error. Suppose we have a set of labeled example points $(z_1, y_1), \ldots, (z_k, y_k)$, where $y_i$ is the label of $z_i$. The maximum likelihood estimate of $w$ is obtained by
$$\hat{w} = \arg\min_{w} \left\{ J(w) = \sum_{i=1}^{k} \left( w^T z_i - y_i \right)^2 \right\}. \quad (1)$$
By the Gauss-Markov theorem, we know that $e = \hat{w} - w$ has zero mean and a covariance matrix given by $\sigma^2 C_w$, where $C_w$ is the inverted Hessian of $J(w)$:

$$C_w = \left( \frac{\partial^2 J}{\partial w\, \partial w^T} \right)^{-1} = \left( \sum_{i=1}^{k} z_i z_i^T \right)^{-1} = (ZZ^T)^{-1},$$

where $Z = (z_1, z_2, \ldots, z_k)$. OED then formulates the optimization problem as the minimization of some measure of the estimation error derived from $C_w$. The three most common measures are the trace of $C_w$ (which leads to A-optimal design), the determinant of $C_w$ (which leads to D-optimal design), and the maximum eigenvalue of $C_w$ (which leads to E-optimal design). Some recent work on optimal experimental design can be found in [15], [21], [45].
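As an illustration, the three criteria can be computed directly from a candidate design matrix. The following is a minimal NumPy sketch (the small `reg` ridge is our addition for numerical stability, not part of the classical formulation):

```python
import numpy as np

def oed_scores(Z, reg=1e-8):
    """Score a candidate design by the three classical optimality
    criteria derived from C_w = (Z Z^T)^{-1}.  Z is an (m, k) matrix
    whose columns are the selected points z_i; smaller scores are better."""
    C_w = np.linalg.inv(Z @ Z.T + reg * np.eye(Z.shape[0]))
    return {
        "A": np.trace(C_w),                # A-optimal: minimize the trace
        "D": np.linalg.det(C_w),           # D-optimal: minimize the determinant
        "E": np.linalg.eigvalsh(C_w)[-1],  # E-optimal: minimize the largest eigenvalue
    }

# A spread-out design should beat a nearly collinear one on all three criteria.
Z_good = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]).T   # columns are z_i
Z_bad = np.array([[1.0, 0.0], [1.01, 0.0], [0.99, 0.01]]).T
scores_good = oed_scores(Z_good)
scores_bad = oed_scores(Z_bad)
```

On this toy pair, the well-spread design yields lower A-, D-, and E-scores than the nearly collinear one, matching the intuition that informative designs keep $C_w$ small.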
3 MANIFOLD ADAPTIVE EXPERIMENTAL DESIGN

In order to incorporate the manifold structure into the learning process, a natural way is to perform the learning tasks in a manifold adaptive kernel space. In this section, we describe our manifold adaptive experimental design approach, which is fundamentally based on transductive experimental design and the manifold adaptive kernel. We begin with a description of transductive experimental design.
3.1 Transductive Experimental Design

Let $X = \{x_1, \ldots, x_n\}$ be the set of all the data points and $Z = \{z_1, \ldots, z_k\} \subset X$ be the set of selected points for labeling.
The key idea of TED is to minimize the average expected squared predictive error of the learned function $f$. For any $x$, let $\hat{y} = \hat{w}^T x$ be its predicted observation. The expected squared prediction error can be written as follows:

$$E(y - \hat{y})^2 = E(\epsilon + w^T x - \hat{w}^T x)^2 = \sigma^2 + x^T \big[ E(w - \hat{w})(w - \hat{w})^T \big] x = \sigma^2 + \sigma^2 x^T (ZZ^T)^{-1} x.$$

Interestingly, the expected squared prediction error of $x$ does not depend on the labels, but only on the training points $Z$. The average expected squared predictive error over the complete data set $X$ is

$$\frac{1}{n} \sum_{i=1}^{n} E(y_i - \hat{w}^T x_i)^2 = \sigma^2 + \frac{\sigma^2}{n} \operatorname{Tr}\big( X^T (ZZ^T)^{-1} X \big). \quad (2)$$
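To make the selection criterion concrete, the trace term in (2) can be used directly to compare candidate subsets. The sketch below is our illustration, not the paper's implementation: the small `reg` ridge keeps $ZZ^T$ invertible when the subset is too small to span the space, and the toy data are invented.

```python
import numpy as np

def ted_risk(X, idx, reg=1e-6):
    """Average expected squared prediction error of (2), up to the noise
    terms: Tr(X^T (Z Z^T)^{-1} X) / n, where Z collects the columns of X
    indexed by idx.  X is (m, n) with data points as columns."""
    Z = X[:, idx]                                       # selected design points
    C_w = np.linalg.inv(Z @ Z.T + reg * np.eye(X.shape[0]))
    return np.trace(X.T @ C_w @ X) / X.shape[1]

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 40))            # 40 points in R^3
risk_two = ted_risk(X, [0, 1])              # rank-deficient design: huge risk
risk_six = ted_risk(X, [0, 1, 2, 3, 4, 5])  # spanning design: moderate risk
```

A two-point design cannot span $\mathbb{R}^3$, so its predictive risk blows up, while any spanning subset yields a finite, much smaller value; TED searches for the subset minimizing this quantity.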
In order to minimize the average expected squared predictive error, one should find a subset $Z$ which minimizes (2). However, it can be verified that this optimization problem is NP-hard [45], and it is therefore infeasible to find a global optimum. After some mathematical derivation, the minimization of the average expected squared predictive error can be formulated as the following equivalent optimization problem:

$$\min_{\alpha_i \in \mathbb{R}^k,\; Z = (z_1, \ldots, z_k)} \; \sum_{i=1}^{n} \| x_i - Z \alpha_i \|^2 + \mu \|\alpha_i\|^2, \quad (3)$$

where $\mu > 0$ is a regularization parameter. Yu et al. [45] propose a sequential greedy algorithm that selects the $z_i$'s one at a time. However, the obtained result is suboptimal.
Recently, a convex relaxation of (2) was introduced in [46]. By introducing auxiliary variables $\beta = (\beta_1, \ldots, \beta_n)$ to control the inclusion of examples in the training set, the optimization problem can be rewritten as follows:

$$\min_{\beta,\; \alpha_i \in \mathbb{R}^n} \; \sum_{i=1}^{n} \left( \| x_i - X \alpha_i \|^2 + \sum_{j=1}^{n} \frac{\alpha_{i,j}^2}{\beta_j} \right) + \gamma \|\beta\|_1 \quad \text{s.t. } \beta_i \geq 0,\; i = 1, \ldots, n, \quad (4)$$

where $X = (x_1, \ldots, x_n)$, $\alpha_i = (\alpha_{i,1}, \ldots, \alpha_{i,n})^T$, and $\|\cdot\|_1$ denotes the $\ell_1$ norm. As suggested by Lasso regression [19], minimizing the $\ell_1$ norm $\|\beta\|_1$ leads to a sparse $\beta$; that is, some entries of $\beta$ will be zero. It is easy to check that if $\beta_j = 0$, then $\alpha_{1,j}, \ldots, \alpha_{n,j}$ all have to be zero, since otherwise the objective function goes to infinity. Thus, the $j$th example will not be selected. It can be shown that the optimization problem (4) is convex, and therefore the global optimum can be obtained. For details, please see [46].
Convex TED has shown promising results on text categorization. However, it fails to take into account the intrinsic manifold structure, which has been shown to be very useful for improving learning performance by many previous studies [4], [28].
3.2 Manifold Adaptive Kernel

In order to incorporate the manifold structure into the active learning process, a natural way is to perform the active learning task in a manifold adaptive kernel space. In the following, we discuss how to incorporate the manifold structure into the reproducing kernel Hilbert space, which leads to the manifold adaptive kernel space.
The kernel trick is usually applied in the hope of discovering nonlinear structure in the data by mapping the original nonlinear observations into a higher dimensional linear space [36]. The most commonly used kernels include the Gaussian kernel and the polynomial kernel. However, the nonlinear structure captured by these data-independent kernels may not be consistent with the intrinsic manifold structure, such as geodesic distance, curvature, and homology [4], [31].
In this work, we adopt the manifold adaptive kernel proposed by Sindhwani et al. [40]. Let $V$ be a linear space with a positive semidefinite inner product (quadratic form) and let $S: H \to V$ be a bounded linear operator. We define $\tilde{H}$ to be the space of functions from $H$ with the modified inner product [40]

$$\langle f, g \rangle_{\tilde{H}} = \langle f, g \rangle_H + \langle Sf, Sg \rangle_V.$$

Sindhwani et al. have shown that $\tilde{H}$ is still an RKHS [40].

Given the examples $x_1, \ldots, x_m$, let $S: H \to \mathbb{R}^m$ be the evaluation map

$$S(f) = \big( f(x_1), \ldots, f(x_m) \big)^T.$$

Denote $\mathbf{f} = (f(x_1), \ldots, f(x_m))^T$ and $\mathbf{g} = (g(x_1), \ldots, g(x_m))^T$. Noticing that $\mathbf{f}, \mathbf{g} \in V$, we have

$$\langle Sf, Sg \rangle_V = \langle \mathbf{f}, \mathbf{g} \rangle = \mathbf{f}^T M \mathbf{g},$$

where $M$ is a positive semidefinite matrix. We define

$$\mathbf{k}_x = \big( K(x, x_1), \ldots, K(x, x_m) \big)^T.$$

It can be shown that the reproducing kernel in $\tilde{H}$ is [40]

$$\tilde{K}(x, z) = K(x, z) - \mathbf{k}_x^T (I + \lambda M K)^{-1} \lambda M \mathbf{k}_z, \quad (5)$$

where $I$ is the identity matrix, $K$ is the kernel matrix in $H$, and $\lambda \geq 0$ is a constant controlling the smoothness of the functions. The key issue now is the choice of $M$, so that the deformation of the kernel induced by the data-dependent norm is motivated with respect to the intrinsic geometry of the data.
In order to model the manifold structure, we construct a nearest neighbor graph $G$. For each data point $x_i$, we find its $p$ nearest neighbors, denoted by $N(x_i)$, and put an edge between $x_i$ and its neighbors. There are many choices for the weight matrix on the graph. A simple one is as follows:

$$W_{ij} = \begin{cases} 1, & \text{if } x_i \in N(x_j) \text{ or } x_j \in N(x_i), \\ 0, & \text{otherwise.} \end{cases} \quad (6)$$

The graph Laplacian [11] is defined as $L = D - W$, where $D$ is the diagonal degree matrix given by $D_{ii} = \sum_j W_{ij}$. The graph Laplacian provides the following smoothness penalty on the graph:

$$\mathbf{f}^T L \mathbf{f} = \frac{1}{2} \sum_{i,j=1}^{n} \big( f(x_i) - f(x_j) \big)^2 W_{ij}.$$

By setting $M = L$, we eventually get the following manifold adaptive kernel:

$$K_M(x, z) = K(x, z) - \mathbf{k}_x^T (I + \lambda L K)^{-1} \lambda L \mathbf{k}_z. \quad (7)$$
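As a concrete illustration, the graph construction in (6) and the kernel deformation in (7) can be sketched in a few lines of NumPy. This is a minimal dense sketch under our own assumptions: the smoothness constant `lam`, the brute-force $O(n^2)$ neighbor search, and the toy data are not from the paper.

```python
import numpy as np

def knn_graph_laplacian(X, p):
    """0/1 weight matrix of (6) from a p-nearest-neighbor graph, and the
    graph Laplacian L = D - W.  X holds samples as rows."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                         # exclude self-neighbors
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[:p]] = 1.0                # x_j in N(x_i)
    W = np.maximum(W, W.T)                               # symmetrize: the "or" in (6)
    L = np.diag(W.sum(1)) - W
    return W, L

def manifold_adaptive_kernel(K, L, lam=1.0):
    """Deform a data-independent kernel matrix K by the graph Laplacian,
    following (7): K_M = K - K (I + lam*L*K)^{-1} lam*L*K."""
    n = K.shape[0]
    return K - K @ np.linalg.solve(np.eye(n) + lam * L @ K, lam * L @ K)

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 5))
K = X @ X.T                         # linear kernel, as used in the paper
W, L = knn_graph_laplacian(X, p=4)
K_M = manifold_adaptive_kernel(K, L)
```

The deformed matrix `K_M` remains symmetric positive semidefinite, i.e., it is still a valid kernel matrix, with similarities shrunk along directions that vary sharply over the graph.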
It is important to note that all the existing popular kernels (e.g., the Gaussian kernel, polynomial kernel, and linear kernel) can be transformed to manifold adaptive kernels. For text analysis, previous studies [23], [44] have shown that linear models are sufficient due to the large number of features of text data. Actually, a stronger conclusion can be found in [5], which shows that a linear mapping (function) can unfold any manifold structure in the data as long as the data points are linearly independent (this is usually true for text data because the number of features is usually larger than the number of samples). Thus, we simply use the linear kernel (transformed to a manifold adaptive kernel) in our text categorization experiments.
3.3 Convex TED in Reproducing Kernel Hilbert Space

In the following, we discuss how to perform convex TED in the manifold adaptive kernel space.

For given examples $x_1, \ldots, x_n \in \mathbb{R}^m$ and a positive definite Mercer kernel $K: \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$, there exists a unique RKHS $H$ of real-valued functions on $\mathbb{R}^m$. Let $K_t(s)$ be the function of $s$ obtained by fixing $t$ and letting $K_t(s) \doteq K(s, t)$. $H$ consists of all finite linear combinations of the form $\sum_{i=1}^{l} a_i K_{t_i}$ with $t_i \in \mathbb{R}^m$, together with limits of such functions as the $t_i$ become dense in $\mathbb{R}^m$. We have $\langle K_s, K_t \rangle_H = K(s, t)$ [36].

Let $\phi: \mathbb{R}^m \to H$ be a feature map from the input space $\mathbb{R}^m$ to $H$, with $K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle$. Let $\phi(X)$ denote the data matrix in the RKHS, that is, $\phi(X) = (\phi(x_1), \ldots, \phi(x_n))$. Similarly, we define $\phi(Z) = (\phi(z_1), \ldots, \phi(z_k))$. The convex TED optimization problem in the RKHS can be written as follows:

$$\min_{\beta,\; \alpha_i \in \mathbb{R}^n} \; \sum_{i=1}^{n} \left( \| \phi(x_i) - \phi(X) \alpha_i \|^2 + \sum_{j=1}^{n} \frac{\alpha_{i,j}^2}{\beta_j} \right) + \gamma \|\beta\|_1 \quad \text{s.t. } \beta_j \geq 0,\; j = 1, \ldots, n. \quad (8)$$
Let $\operatorname{diag}(\beta)$ be the diagonal matrix whose entries are $\beta_1, \ldots, \beta_n$. Thus,

$$\sum_{j=1}^{n} \frac{\alpha_{i,j}^2}{\beta_j} = \alpha_i^T \operatorname{diag}(\beta)^{-1} \alpha_i.$$

By some simple algebraic steps, we get

$$\begin{aligned} & \sum_{i=1}^{n} \left( \| \phi(x_i) - \phi(X) \alpha_i \|^2 + \sum_{j=1}^{n} \frac{\alpha_{i,j}^2}{\beta_j} \right) \\ &= \sum_{i=1}^{n} \Big( \big( \phi(x_i) - \phi(X) \alpha_i \big)^T \big( \phi(x_i) - \phi(X) \alpha_i \big) + \alpha_i^T \operatorname{diag}(\beta)^{-1} \alpha_i \Big) \\ &= \sum_{i=1}^{n} \Big( \phi(x_i)^T \phi(x_i) - 2 \alpha_i^T \phi(X)^T \phi(x_i) + \alpha_i^T \phi(X)^T \phi(X) \alpha_i + \alpha_i^T \operatorname{diag}(\beta)^{-1} \alpha_i \Big). \end{aligned}$$

Now, taking the derivative of the objective function (8) with respect to $\alpha_i$ and setting it to zero, we get

$$-2 \phi(X)^T \phi(x_i) + 2 \phi(X)^T \phi(X) \alpha_i + 2 \operatorname{diag}(\beta)^{-1} \alpha_i = 0.$$

Finally, we get

$$\alpha_i = \big( \operatorname{diag}(\beta)^{-1} + \phi(X)^T \phi(X) \big)^{-1} \phi(X)^T \phi(x_i). \quad (9)$$

We define an $n \times n$ kernel matrix $K$ such that $K_{ij} = K(x_i, x_j)$. Let $u_i$ be the $i$th column (or row, since $K$ is symmetric) vector of $K$:

$$u_i = \big( \phi(x_i)^T \phi(x_1), \ldots, \phi(x_i)^T \phi(x_n) \big)^T = \phi(X)^T \phi(x_i).$$
By noticing that $\phi(X)^T \phi(X) = K$, (9) can be rewritten as follows:

$$\alpha_i = \big( \operatorname{diag}(\beta)^{-1} + K \big)^{-1} u_i. \quad (10)$$

Once the $\alpha_i$'s are obtained, we can fix them and find the minimum solution for $\beta_j$. Again, we take the derivative of the objective function (8) with respect to $\beta_j$ and require the derivative to be zero. By noticing that $\beta_j$ is nonnegative, we have

$$\sum_{i=1}^{n} \left( -\frac{\alpha_{i,j}^2}{\beta_j^2} \right) + \gamma = 0. \quad (11)$$

Finally, we get

$$\beta_j = \sqrt{\frac{\sum_{i=1}^{n} \alpha_{i,j}^2}{\gamma}}. \quad (12)$$

So the $\alpha_{i,j}$ and $\beta_j$ can be iteratively computed. Since the objective function is convex, the globally optimal solution is guaranteed to be obtained.
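The alternating updates (10) and (12) can be sketched as follows. This is a hypothetical minimal implementation under our own choices: we initialize the selection weights to ones rather than the reconstruction coefficients (either starting point is fine for a convex objective), use a fixed iteration count in place of a convergence check, and add a small floor to keep the weights invertible.

```python
import numpy as np

def convex_ted_kernel(K, gamma=0.1, n_iter=30):
    """Alternating updates for convex TED in a kernel space.
    With the weights beta fixed, each alpha_i = (diag(beta)^{-1} + K)^{-1} u_i,
    where u_i is the i-th column of K, as in (10); with the alphas fixed,
    beta_j = sqrt(sum_i alpha_{ij}^2 / gamma), as in (12)."""
    n = K.shape[0]
    beta = np.ones(n)
    for _ in range(n_iter):
        A = np.linalg.solve(np.diag(1.0 / beta) + K, K)  # column i of A is alpha_i
        # alpha_{i,j} = A[j, i], so summing over i is a row-wise sum of A**2.
        beta = np.sqrt((A ** 2).sum(axis=1) / gamma)
        beta = np.maximum(beta, 1e-12)                   # keep diag(beta) invertible
    return beta

rng = np.random.default_rng(3)
X = rng.standard_normal((20, 6))
beta = convex_ted_kernel(X @ X.T)    # linear-kernel Gram matrix
top = np.argsort(-beta)[:5]          # indices of the 5 most informative points
```

Ranking by the returned weights and taking the top entries is exactly the selection rule used in step 3 of the algorithm below; points with weights driven toward zero are never selected.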
3.4 The Manifold Adaptive Experimental Design Algorithm

We summarize our manifold adaptive experimental design (MAED) algorithm as follows (also in Table 1):

1. Construct the manifold adaptive kernel. Construct a $p$ nearest neighbor graph $G$ with the weight matrix defined in (6). Calculate the graph Laplacian $L = D - W$. Let $K$ be any data-independent kernel (e.g., a Gaussian or linear kernel) with the associated kernel matrix $K$, that is, $K_{ij} = K(x_i, x_j)$. Let $\mathbf{k}_i$ be the $i$th column vector of $K$. Calculate the manifold adaptive kernel matrix $K_M$ as follows:

$$K_{M,ij} = K_M(x_i, x_j) = K_{ij} - \mathbf{k}_i^T (I + \lambda L K)^{-1} \lambda L \mathbf{k}_j. \quad (13)$$
2. Solve the manifold adaptive active learning optimization problem. Let $u_i$ be the $i$th column (or row, since $K_M$ is symmetric) vector of $K_M$. Initialize $\alpha_{i,j} = 1$, and iteratively compute

$$\beta_j = \sqrt{\frac{\sum_{i=1}^{n} \alpha_{i,j}^2}{\gamma}}, \quad j = 1, \ldots, n, \quad (14)$$

$$\alpha_i = \big( \operatorname{diag}(\beta)^{-1} + K_M \big)^{-1} u_i, \quad i = 1, \ldots, n, \quad (15)$$

until convergence.

3. Data selection. Rank the data points according to $\beta_j\ (j = 1, \ldots, n)$ in descending order, and select the top $k$ data points.
Once we have selected the most informative data points, any classification algorithm can be applied to perform pattern classification.
Constructing the $p$ nearest neighbor graph in the first step of MAED costs $O(pn^2)$. Computing the data-independent kernel matrix $K$ costs $O(n^2)$, and computing the manifold adaptive kernel matrix costs $O(n^3)$. The iterative optimization in the second step costs $O(tn^3)$, where $t$ is the number of iterations. In our experiments, the MAED algorithm converges very fast and $t$ is usually less than 20. The overall computational cost of MAED is $O(n^3)$, which is the same as that of the original convex TED algorithm in the kernel space.

710 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 24, NO. 4, APRIL 2012

TABLE 1
The Algorithm of Manifold Adaptive Experimental Design (MAED)
4 EXPERIMENTS

In this section, we evaluate the performance of our proposed algorithm and compare it with state-of-the-art active learning algorithms for text categorization.
4.1 Simple Toy Example

Our MAED algorithm is fundamentally based on TED. The difference between them is whether the geometric structure of the data is considered. To get an intuitive idea of how the two algorithms perform differently, we give a simple toy example in Fig. 1. The data set contains two circles. Eight points are selected by TED and MAED. Both algorithms use the Gaussian kernel. As can be seen, all the points selected by TED are from the small circle, while MAED selects five points from the big circle and three from the small circle. Clearly, the points selected by our MAED algorithm better represent the original data set.
4.2 Data and Experimental Settings

Our empirical study on text categorization was conducted on three real-world text corpora.

- The first data set is the 20 Newsgroups corpus,1 which contains 18,744 documents with 61,188 distinct words. This data set has 20 categories, each with around 1,000 documents.
- The second data set is a subset of the Reuters-21578 text data set.2 This subset has 2,919 documents, including the categories "acq," "crude," "trade," and "money," each with 2,025, 321, 298, and 245 documents, respectively. In this data set we have 10,499 distinct words.
- The third data set is a subset of the RCV1-v2 corpus [27]. RCV1 contains topic, region, and industry information for each document, and a hierarchical structure for topics and industries. A set of 9,625 documents with 29,992 distinct words is chosen for our experiments, including the categories "C15," "ECAT," "GCAT," and "MCAT," each with 2,022, 2,064, 2,901, and 2,638 documents, respectively.
The standard TF-IDF weighting scheme is used to generate the feature vector for each document:

$$\text{tf-idf} = (1 + \log tf) \times \log \frac{N}{df},$$

where $N$ is the number of documents in the corpus and $df$ is the number of documents containing a particular word. The experimental settings in this work are basically the same as those in [46]. We conduct one-against-all classification for each category and treat each problem as binary classification. We use the standard precision, recall, and F1 measures [44]. Precision is the number of correct assignments by the classifier divided by the total number of the classifier's assignments. Recall is the number of correct assignments by the classifier divided by the total number of correct assignments. The F1 measure combines recall ($r$) and precision ($p$) with equal weight in the following form:

$$F_1(r, p) = \frac{2rp}{r + p}.$$

The scores can be computed for the binary decisions on each individual category first and then averaged over categories. Alternatively, they can be computed globally over all the $n \times m$ binary decisions, where $m$ is the total number of test documents and $n$ is the number of categories under consideration. The former is called macroaveraging and the latter microaveraging. It is understood that microaveraged scores tend to be dominated by the classifier's performance on common categories, while macroaveraged scores are more influenced by the performance on rare categories. Providing both kinds of scores is more informative than providing either alone [44]. Another popular metric in our situation is AUC, the area under the Receiver Operating Characteristic (ROC) curve, which is used in [46]. We also report the AUC score in our experiments.
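The difference between the two averaging schemes can be seen in a small worked example (the counts below are invented for illustration): with one well-classified common category and one poorly classified rare category, the microaveraged F1 stays high while the macroaveraged F1 drops sharply.

```python
def f1(tp, fp, fn):
    """F1 = 2rp/(r+p) from true positives, false positives, false negatives."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_micro_f1(counts):
    """counts: list of (tp, fp, fn) per category.
    Macro-F1 averages per-category F1 scores; Micro-F1 pools the raw counts
    over all binary decisions before computing a single F1."""
    macro = sum(f1(*c) for c in counts) / len(counts)
    tp, fp, fn = map(sum, zip(*counts))
    micro = f1(tp, fp, fn)
    return macro, micro

counts = [(900, 50, 50),   # common category, strong classifier: F1 ~ 0.947
          (5, 10, 15)]     # rare category, weak classifier:    F1 ~ 0.286
macro, micro = macro_micro_f1(counts)   # macro ~ 0.617, micro ~ 0.935
```

The pooled counts are dominated by the common category, so Micro-F1 barely registers the failure on the rare one; this is exactly why the text recommends reporting both.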
1. people.csail.mit.edu/jrennie/20Newsgroups/.
2. /resources/testcollections/reuters21578/.

Fig. 1. Data selection by the active learning algorithms TED and MAED. The selected data points are marked as *. Clearly, the points selected by our MAED algorithm better represent the original data set. (a) Data. (b) TED. (c) MAED.

In each run of the experiments, an active learning method is applied to select a given number $k$ of training examples, $k \in \{5, 10, \ldots, 50\}$ on Reuters and RCV1 and $k \in \{20, 40, \ldots, 200\}$ on 20 Newsgroups; then a classifier is trained on these examples with their labels. The trained classifier is then used to predict the class labels of the remaining examples, and both Macro-F1 and Micro-F1 scores are computed based on the results. In order to randomize the experiments, in each run we restrict the training examples to be selected from a random candidate set of 50 percent of the total data. Strictly speaking, since the candidate set is available to all the active learning algorithms, the remaining 50 percent of the total data can be regarded as test data. Thus, we report the classification results on both the unlabeled set (all the unlabeled data) and the test set (the remaining 50 percent of the total data). For each combination of active learning method and number $k$, we compute the mean and standard deviation based on 10 randomized experiments. The following four active learning methods are evaluated and compared:
- Random Sampling uniformly selects examples as training data. We use this method as the baseline for active learning.
- Simple Margin is proposed in [43]. This method selects the example closest to the current decision boundary of the classifier, which is a standard SVM using the hinge loss.
- Convex TED is proposed in [46].
- Manifold Adaptive Experimental Design (MAED), as described in Section 3.4, is the new method proposed in this paper.

We note that all the methods use least-squares SVM (LSSVM) as the base classification method, except Simple Margin, which uses the hinge-loss SVM. In all the experiments we fix the regularization parameter to 0.1.

4.3 Text Categorization Results
In this section, we discuss the performance of the four different algorithms on text categorization. Before the experimental comparison, it is important to note that Random Sampling, Convex TED, and MAED are all classifier independent, while Simple Margin is classifier dependent. For the former three algorithms, the data selection is performed globally; in other words, the selected data points are used for all the binary classification tasks. However, for Simple Margin, since the active learning (data selection) process depends on the decision boundary, we have to select $k$ data points for labeling for each binary classification task. In our experiments, four categories are used; thus Simple Margin may select up to $4k$ data points if there is no overlap. Moreover, since Simple Margin is classifier dependent, it needs at least one example from each category to train the initial classifier. In our experiments, we randomly select one example from each category to train an initial SVM classifier for Simple Margin.

4.3.1 20 Newsgroups

We apply the above-mentioned four algorithms to text categorization on 20 Newsgroups. Given training size $k$, the average classification performance measured by Macro-F1 and Micro-F1 is reported in Table 2 (on all the unlabeled data) and Table 3 (on the test data). As can be seen, our MAED algorithm outperforms the other three algorithms in all cases. Random Sampling performs the worst in most cases. As we have mentioned, Simple Margin uses many more labels than the other algorithms. Even so, Convex TED outperforms Simple Margin in most cases.
TABLE 2
Text Categorization Results on 20 Newsgroups (On All the Unlabeled Data)

TABLE 3
Text Categorization Results on 20 Newsgroups (On the Test Data)