Multi-Domain Active Learning for Text Classification
Lianghao Li†  Xiaoming Jin†  Sinno Jialin Pan‡  Jian-Tao Sun§
†School of Software, Tsinghua University, Beijing 100084, P.R. China
‡Institute for Infocomm Research, Singapore 138632
§Microsoft Research Asia, Beijing 100080, P.R. China
xmjin@tsinghua.edu, jspan@i2r.a-star.edu.sg
ABSTRACT
Active learning has been proven to be effective in reducing labeling efforts for supervised learning. However, existing active learning work has mainly focused on training models for a single domain. In practical applications, it is common to simultaneously train classifiers for multiple domains. For example, some merchant web sites may need a set of classifiers to predict the sentiment polarity of product reviews collected from various domains (e.g., electronics, books, shoes). Though different domains have their own unique features, they may share some common latent features. If we apply active learning on each domain separately, some data instances selected from different domains may contain duplicate knowledge due to the common features. Therefore, how to choose the data to label from multiple domains is crucial to further reducing the human labeling effort in multi-domain learning. In this paper, we propose a novel multi-domain active learning framework that jointly selects data instances from all domains with duplicate information taken into account. In our solution, a shared subspace is first learned to represent the common latent features of the different domains. By considering the common and the domain-specific features together, the model loss reduction induced by each data instance can be decomposed into a common part and a domain-specific part. In this way, the duplicate information across domains can be encoded into the common part of the model loss reduction and taken into account when querying. We compare our method with state-of-the-art active learning approaches on several text classification tasks: sentiment classification, newsgroup classification and email spam filtering. The experiment results show that our method reduces the human labeling effort by 33.2%, 42.9% and 68.7% on the three tasks, respectively.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning—knowledge acquisition, concept learning; I.5.2 [Pattern Recognition]: Design Methodology—classifier design and evaluation

General Terms

Algorithms, Experimentation

Keywords

Active Learning, Transfer Learning, Text Classification

1. INTRODUCTION
Text classification has drawn much research attention in the literature. Typically, supervised classification algorithms require sufficient labeled data to train accurate classifiers, while data labeling can be expensive. Active learning has been proven to be effective in reducing the human labeling effort by actively choosing the most informative data to label. Existing active learning work has mainly focused on training models for a single domain. But in many applications, data of interest come from multiple domains and a group of classifiers need to be trained simultaneously for all the domains. For example, merchant web sites organize user reviews of many products. A sentiment classifier [3] for each product class (domain) is highly desirable to automatically organize reviews according to user demands. Since different words can be used to express sentiment in different domains [17], training a single classifier for all domains would not generalize well across the various domains. For instance, words like "blur", "fast" and "sharp" are used to comment on electronics products, while they do not carry opinion in the books domain. Therefore, each domain should have its own sentiment classifier. Email spam filtering is another example [8]. Since users may have different backgrounds and interests, it is reasonable to customize spam filters for individual users.

Active learning for multi-domain text classification is a novel research problem, and the algorithm for selecting data instances to label is not trivial. If we simply apply active learning on each domain separately, some data instances selected from different domains may contain duplicate information due to the inherent relationship among domains. For example, in sentiment classification, reviews containing common sentiment words like "wonderful" and "perfect" may be selected for labeling by the active learners of every domain, which causes redundant labeling effort. On the other hand, if we apply active learning to all domains together, the query strategy may be affected by the distribution gap between different domains. Therefore, how to measure the informativeness of data instances across domains is crucial. In this paper, we propose a novel, global-optimization-based active learning framework for multi-domain text classification. The proposed query strategy aims to select the unlabeled instances which, once labeled, can maximally reduce the model loss of all classifiers. In our solution, a shared subspace is first learned to represent the common latent features of the different domains. By splitting the feature space into a common part and a domain-specific part, the model loss reduction induced by each data candidate can be decomposed into the domain-specific loss reduction of the classifier on its corresponding domain, and the common loss reduction of the classifiers on all domains. By jointly querying instances, the common model loss of all classifiers can be reduced simultaneously, and redundant labeling effort can be saved.
It is worth noting that the problem setting of multi-domain classification is different from that of cross-domain classification. In cross-domain classification, the data of interest are assumed to come from a source domain and a target domain. Sufficient labeled data are available in the source domain, while no or few labeled data are available in the target domain. The goal is to train a classifier for the target domain by leveraging the labeled data of the source domain. In multi-domain classification, no domain is assumed to have sufficient labeled data. The goal is to simultaneously train classifiers for multiple domains by leveraging the common knowledge among them. Active learning for multi-domain classification aims to jointly select data to label for training accurate classifiers on all domains.
The main contributions of our work include: 1) We studied an important practical problem of active learning in multiple domains. To the best of our knowledge, this is the first work that aims to actively build text classifiers for multiple domains simultaneously. 2) We proposed an efficient multi-domain active learning framework and showed its effectiveness on three real-world applications: sentiment classification, newsgroup classification and email spam filtering. The experiment results on the three tasks demonstrate that our proposed method can save more than 33% of the labeling effort compared with state-of-the-art active learning approaches, and more than 50% compared with random query methods.
The rest of this paper is organized as follows: we begin by reviewing related work in the next section. After that, we describe the problem statement in Section 3 and present our solution in Section 4. The experiment results are discussed in Section 5. Finally, we conclude the paper and discuss future work in Section 6.
2. RELATED WORK

The performance of supervised classification relies heavily on labeled data. However, collecting sufficient training data is difficult and time-consuming. Active learning is an alternative learning framework which allows classification algorithms to choose the data they learn from. Existing active learning algorithms can generally be put into three categories: 1) uncertainty sampling [13, 25], which selects the data instances that are most uncertainly predicted by the current classifier; 2) query by committee [22], which selects the data instances about which the "committee" disagrees most; and 3) expected error reduction [20], which aims to select the instance that, once labeled, contributes the largest model loss reduction for the current classifier. Recently, Donmez and Carbonell proposed the proactive learning framework, which relaxes some unrealistic assumptions of active learning in practical applications [7]. Beygelzimer et al. proposed an importance weighting method to avoid label-sampling bias in active learning [2]. In [15] and [5], the authors proposed active learning methods for data with multiple views. In multi-view learning, every data instance is assumed to have several different descriptions, each of which can be used to learn the concepts of interest.
Transfer learning is another technique to save labeling effort for supervised learning. Dredze et al. developed a multi-domain learning method based on parameter combination [8]. Xie et al. proposed the LatentMap algorithm to leverage shared features for transfer learning [26]. Given an oracle and a lot of labeled data from a source domain, some researchers proposed to combine active learning and transfer learning to train an accurate classifier for a target domain [18, 23]. Shi et al. proposed to use the source-domain classifier to answer the target-domain queries as often as possible, and to query the oracle only when necessary [23]. In [18], Rai et al. considered using the source-domain classifier as an initial classifier for the target domain; the source-domain data are further used to rule out the target-domain queries which appear similar to the source-domain data. Different from their work, we aim to build classifiers for multiple domains together, while they target training the classifier of the target domain by using knowledge from the source domain.

Our work is also related to multi-task active learning, which has been studied to solve the problem where data instances are labeled in multiple ways for different tasks. Reichart et al. proposed a novel active learning method to label data instances with several linguistic annotations, such as named entities, syntactic parse trees, etc. [19]. Zhang tried to solve the multi-task active learning problem where the outputs of different tasks are coupled by constraints [27]. Harpale et al. proposed an active learning method for multi-task adaptive filtering [11]. An adaptive filtering system monitors a set of documents to find and deliver the relevant items to a particular task; its performance is boosted with the relevance feedback received on the delivered items. In [11], the items which lead to the maximal relevance feedback are selected for delivery. Different from their work, we focus on a single task with multiple domains. In our problem, different domains share the same target concepts but have different data distributions. In addition, our work aims at proposing an active learning method for classification problems instead of adaptive filtering or natural language annotation. This makes the optimization goal of our proposed query strategy different from theirs.
3. PROBLEM DEFINITION

In this section, we introduce some definitions and the problem statement.
Definition 1. (Domain) A domain consists of a set of data instances which are generated from the same data distribution P(x), where x ∈ X and X is a feature space.
For example, a set of user reviews for electronics products can be regarded as one domain, while reviews for different types of products, such as books and movies, can be regarded as the books and movies domains, respectively. Data instances from one domain are assumed to be independent and identically distributed (i.i.d.), but data distributions across domains may be different. In this paper, the domain each data instance belongs to is assumed to be known. The multi-domain classification problem is defined as follows:
Definition 2. (Multi-Domain Classification) Given a set of data instances collected from K different domains, where each domain has its own data distribution, let X be a feature space (in this work, we assume all domains share the same feature space) and Y be a pre-defined label set. The task is to train K classifiers f_ℓ : X → Y, ℓ = 1, 2, ..., K, one for each domain.
Based on the definitions above, we now define the problem we aim to address in this paper as follows:
Definition 3. (Active Learning for Multi-Domain Classification) Let P = {P_1, P_2, ..., P_K} be an unlabeled data pool which consists of data instances collected from K different domains. Here P_ℓ = {x_{ℓ1}, x_{ℓ2}, ..., x_{ℓN_ℓ}} contains the N_ℓ data instances that come from the ℓ'th domain. The task is to build K accurate classifiers f_ℓ : X → Y, ℓ = 1, 2, ..., K, by selecting as few data instances to label as possible.
Our active learning framework is based on pool-based sampling [13, 21]. In pool-based sampling, active learning is performed iteratively on an unlabeled data pool, which is usually assumed to be stationary [21]. Typically, in each iteration, the active learner scans the unlabeled data pool and chooses the most informative data candidates to label.
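As a rough illustration of pool-based sampling (the generic setting, not yet the method proposed in this paper), the Python sketch below trains a model on the labels collected so far, scores every instance left in the pool, and queries the top candidate; train, informativeness and label_oracle are hypothetical callables.

    def pool_based_active_learning(pool, label_oracle, train, informativeness, T=20):
        """Generic pool-based sampling: query the most informative candidate per round."""
        labeled = []                   # list of (instance, label) pairs
        model = train(labeled)         # initial model (e.g., from a few seed labels)
        for _ in range(T):
            best = max(pool, key=lambda x: informativeness(model, x))
            labeled.append((best, label_oracle(best)))
            pool.remove(best)
            model = train(labeled)     # re-train after each query
        return model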
4. OUR SOLUTION

In this section, we describe our solution for multi-domain active learning. The main notations are listed in Table 1.
Table 1: Notations

Symbols     Description
K           total number of domains
x_{ℓi}      the i'th labeled data instance in the ℓ'th domain
y_{ℓi}      ground truth of x_{ℓi}
θ           learned shared-subspace transformation matrix
w_ℓ         weight vector specific to the ℓ'th domain
v           weight vector associated with the shared subspace
f_{D_ℓ}     predictive function of the ℓ'th domain
L(·)        model loss of a classifier
λ_ℓ         domain weight specified by users
L_D         global model loss of all classifiers
V_ℓ         version space of the ℓ'th domain
W_ℓ         parameter space of the ℓ'th domain
V_{D_ℓ}     the size of V_ℓ
4.1 A General Optimization Framework

Recall that, in active learning for a single domain, an active learner attempts to select the most informative data instances to label, in order to train an accurate classifier with as little labeling effort as possible. In active learning for multiple domains, the goal is to choose data instances which are informative not only for their own domains but also for the other domains, so that all classifiers can benefit from the labeling.
Suppose that L(f_{D_ℓ}) is the model loss of classifier f_{D_ℓ}; the global model loss of all classifiers is then defined as:

    L_D = Σ_{ℓ=1}^{K} λ_ℓ L(f_{D_ℓ}),    (1)

where {λ_ℓ}_{ℓ=1}^{K} are user-specified weights for the different domains. The goal of our query strategy is to select an unlabeled instance x* which can maximally reduce the global model loss once labeled. The optimization objective can be formulated as:
    x* = arg max_{x*} [ L_D − L_{D+(x*,y*)} ]
       = arg max_{x*} Σ_{ℓ=1}^{K} λ_ℓ [ L(f_{D_ℓ}) − L(f_{D_ℓ+(x*,y*)}) ],    (2)

where D + (x*, y*) is the expanded training set after data instance x* and its ground truth y* are added. In some real-world applications, different domains may have different priorities. For example, users may require high classification performance or fast model convergence for some particular domains; in this case, one can assign larger weights to such domains. In many other scenarios, users do not have such requirements, and one can simply set the same weight for every domain. Without loss of generality, we set λ_ℓ = 1 for all domains in this paper.
In practice, we do not know the ground truth y* of data instance x* before querying, so we cannot estimate the model loss in (2) directly. Instead, we use the expected loss over all possible labels to approximate the true model loss. As a result, we replace (2) by the following objective:

    x* = arg max_{x*} Σ_{y∈Y} P̂(y|x*) Σ_{ℓ=1}^{K} [ L(f_{D_ℓ}) − L(f_{D_ℓ+(x*,y)}) ],    (3)

where P̂(y|x*) is the conditional probability of label y given data instance x*, estimated by the current classifier.
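The objective in (3) amounts to a simple scoring loop over candidates. The Python sketch below assumes binary labels and treats the probability estimator and the per-domain loss-reduction estimate (both made concrete in Section 4.3) as hypothetical callables; it is an illustration of the framework, not the authors' implementation.

    import numpy as np

    def expected_global_loss_reduction(x, classifiers, label_prob, loss_reduction):
        """Score one unlabeled candidate x by the objective in (3).

        classifiers    : list of per-domain models f_{D_l}, l = 1..K
        label_prob     : callable (x, y) -> estimated P(y | x), y in {-1, +1}
        loss_reduction : callable (f, x, y) -> estimate of L(f_D) - L(f_{D+(x,y)})
        """
        score = 0.0
        for y in (-1, +1):
            # Expected reduction of the global model loss, summed over all K domains.
            score += label_prob(x, y) * sum(loss_reduction(f, x, y) for f in classifiers)
        return score

    def select_query(pool, classifiers, label_prob, loss_reduction):
        # Pick the candidate index with the largest expected global loss reduction.
        scores = [expected_global_loss_reduction(x, classifiers, label_prob, loss_reduction)
                  for x in pool]
        return int(np.argmax(scores))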
4.2 Multi-Domain Classification with SVM

Before describing our solution for multi-domain active learning, we first present an SVM-based multi-domain classification method, which is used as the classification model in our optimization framework.

Support Vector Machines (SVMs) have been widely used for text classification [12, 25]. In this paper, we incorporate a shared subspace, which represents the common latent features, into SVM for multi-domain classification. The predictive function f_{D_ℓ} of the ℓ'th (ℓ ∈ {1, ..., K}) domain is defined as:

    f_{D_ℓ}(x_ℓ) = w_ℓ · Φ(x_ℓ) + v · Φ(θ x_ℓ),    (4)

which consists of two parts: one operates on the original feature space, and the other is derived from the shared subspace. Here D_ℓ is a training set, Φ is a feature map, w_ℓ and v are two weight vectors, and θ is a learned transformation matrix that maps the original feature space to the shared low-dimensional subspace. The shared parameters v and θ are leveraged to capture the common latent features across domains. Note that the idea of the formulation above is similar to that of multi-task learning [1, 9]. However, in this paper, we focus on proposing a novel active learning framework for multi-domain classification instead of a novel multi-domain classification method. In [1], Ando and Zhang proposed to learn the parameters {w_ℓ}'s, v and θ jointly by updating them iteratively. In each iteration, a singular value decomposition is required to update θ, which is not efficient, especially for active learning.

We propose to learn the parameters in two steps. In the first step, we apply Spectral Feature Alignment (SFA) [17], an unsupervised shared-subspace learning method, to estimate θ. Note that besides SFA, many other effective approaches to shared-subspace learning can be integrated into our framework, such as Structural Correspondence Learning (SCL) [4] and Maximum Mean Discrepancy Embedding (MMDE) [16]. In SFA, a set of domain-independent features is first identified, and a bipartite graph is constructed to model the co-occurrence between the domain-independent features and the domain-specific features. Then a spectral clustering algorithm is applied to the bipartite graph to co-align the two kinds of features into unified clusters. The space spanned by the unified clusters is then considered as the shared subspace across domains. In the second step, we estimate {w_ℓ}'s and v by solving the following SVM optimization problem:
    min_{{w_ℓ}, v}  (1/2)‖v‖² + (1/2) Σ_{ℓ=1}^{K} ‖w_ℓ‖²    (5)
    s.t.  y_{ℓi} ( w_ℓ · Φ(x_{ℓi}) + v · Φ(θ x_{ℓi}) ) ≥ 1,  ∀i,  ℓ = 1, ..., K,

where we introduce Φ_0(x) = 1 to replace the bias parameter of the SVM. Note that for text classification, data instances are often linearly separable due to the high dimensionality of the feature space. Therefore, in this paper, we present our framework for the linearly separable case and leave the non-separable case to future work. It can be shown that the optimization problem (5) can be directly linked to a standard SVM problem with a proper feature map [9] and solved by a standard SVM solver.
In our approach, the weight vector v is derived from the shared subspace and learned from the training data of all domains; therefore, it reflects the common discriminative information of all domains. The weight vectors {w_ℓ}'s are only affected by the training data of their corresponding domains, which implies that they reflect the domain-specific discriminative information. By splitting the feature space into these two parts, we can measure both the common and the domain-specific model loss reduction induced by each data instance.
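With a linear feature map Φ and a subspace matrix θ already estimated (for example by SFA), problem (5) can be handled by a standard linear SVM over block-structured augmented features, because the squared norm of the stacked weight vector u = [w_1, ..., w_K, v] equals Σ_ℓ ‖w_ℓ‖² + ‖v‖². The scikit-learn sketch below is one possible way to set this up; augment and train_multidomain_svm are illustrative names, the large C only approximates the hard-margin problem, and the code is not the authors' implementation.

    import numpy as np
    from sklearn.svm import SVC

    def augment(X, domain_ids, theta, K):
        """Map raw features into the block-structured space implied by (5).

        X          : (n, d) array of raw features
        domain_ids : length-n array with values in {0, ..., K-1}
        theta      : (k, d) shared-subspace transformation (e.g., learned by SFA)
        Returns an (n, K*d + k) array: block l holds x for instances of domain l,
        and the last k columns hold theta @ x for every instance.
        """
        n, d = X.shape
        k = theta.shape[0]
        Z = np.zeros((n, K * d + k))
        for i in range(n):
            l = domain_ids[i]
            Z[i, l * d:(l + 1) * d] = X[i]      # domain-specific part, weighted by w_l
            Z[i, K * d:] = theta @ X[i]         # shared part, weighted by v
        return Z

    def train_multidomain_svm(X, y, domain_ids, theta, K, C=1e3):
        # The learned weight vector of this single SVM is u = [w_1, ..., w_K, v].
        Z = augment(X, domain_ids, theta, K)
        clf = SVC(kernel="linear", C=C)
        clf.fit(Z, y)
        return clf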
4.3 Multi-Domain Active Learning

In this section, we describe our solution to the proposed optimization framework (3) based on the multi-domain SVM. According to (1), the global model loss can be decomposed into the model loss of the classifier in each domain, so our problem becomes measuring the model loss reduction {L(f_{D_ℓ}) − L(f_{D_ℓ+(x*,y)})}_{ℓ=1}^{K} of each classifier. As suggested by Tong and Koller [24], we can measure the model loss of each classifier by the size of its version space. A version space V_ℓ is the set of hypotheses that are consistent with the current training data instances [14]. For the ℓ'th domain, the version space V_ℓ is defined as:

    V_ℓ = { u_ℓ/‖u_ℓ‖ | u_ℓ ∈ W_ℓ,  ∀i: y_{ℓi} ( w_ℓ · Φ(x_{ℓi}) + v · Φ(θ x_{ℓi}) ) > 0 },    (6)

where u_ℓ = [w_ℓ, v] and W_ℓ is the parameter space. Since we can simply multiply a consistent hypothesis by a non-zero scale to get another one, we normalize the weight vectors to eliminate this degree of freedom.
For SVM, we can use the margin of the SVM as an indicator of the size of the version space. Suppose we have a pool of unlabeled instances; we can evaluate each candidate by adding it into D and re-training an SVM based on (5) to estimate the new margin. We then select for labeling the data candidate which contributes the largest reduction of all version spaces. However, this process is computationally very expensive, especially when the candidate pool is large. To make it more practical, we apply a heuristic idea proposed in [24] (cf. page 34) that simplifies the computation by relating the size of the new version space to the size of the current version space. Denoting by V_{D_ℓ} the size of the current version space, the size of the new version space V_{D_ℓ+(x*,y)} after adding (x*, y) into the training set can be approximated as:

    V_{D_ℓ+(x*,y)} ≈ ((1 + y f_{D_ℓ}(x*)) / 2) V_{D_ℓ}.    (7)

Based on the approximation above, the model loss reduction of each classifier in (3) can be rewritten as:

    L(f_{D_ℓ}) − L(f_{D_ℓ+(x*,y)}) = V_{D_ℓ} − V_{D_ℓ+(x*,y)} ≈ ((1 − y f_{D_ℓ}(x*)) / 2) V_{D_ℓ}.    (8)

An intuitive explanation of this estimate is that if data candidate x* can be correctly predicted by the current model, that is, y = sgn(f_{D_ℓ}(x*)), then the smaller the value of |f_{D_ℓ}(x*)| is, the less confident the current model is about x*. As a result, data candidate x* tends to be queried for labeling. On the other hand, if data candidate x* cannot be correctly predicted, then the larger the value of |f_{D_ℓ}(x*)| is, the larger the error the current model makes. In this case, querying x* can greatly improve the current model.
Recall that, given classifier f_{D_ℓ} of the ℓ'th domain, if data candidate x* is not from the ℓ'th domain, then, when queried, x* can only affect the version space of the ℓ'th domain via the shared subspace. Correspondingly, classifier f_{D_ℓ} can only make a prediction on data candidate x* through the common weight vector v. So we propose to use the following predictive function f_{D_ℓ}(x*) to calculate the model loss reduction in (8):

    f_{D_ℓ}(x*) = w_ℓ · Φ(x*) + v · Φ(θ x*)   if x* ∈ P_ℓ,
    f_{D_ℓ}(x*) = v · Φ(θ x*)                 if x* ∉ P_ℓ.

Therefore, the model loss reduction induced by each data candidate is decomposed into two parts: 1) the version space reduction of its own domain in the whole feature space, and 2) the version space reduction of the other domains in the shared subspace. In this way, the common model loss of all classifiers can be reduced together, and more labeling effort can be saved. Since we learn all the classifiers jointly, there is no guarantee that the solution of (5) leads to the maximum-margin solution for the classifier of each domain. However, because the low-dimensional subspace is shared by all domains, the hyperplane learned on it should be consistent with the data instances from all domains. Therefore, the hyperplane of each domain learned by (5) is a good approximation of the hyperplane learned on the labeled data from its corresponding domain only.
By using the size of the SVM margin as the indicator of V_{D_ℓ} and substituting (8) into (3), our final query strategy for multi-domain active learning can be written as:

    x* = arg max_{x*} Σ_{y=±1} P̂(y|x*) Σ_{ℓ=1}^{K} (1 − y f_{D_ℓ}(x*)) / (2‖u_ℓ‖).    (9)
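Putting (8), (9) and the piecewise prediction together, the score of one candidate can be computed as in the Python sketch below. It assumes linear features, uses ‖u_ℓ‖ = ‖[w_ℓ, v]‖ as the margin-based indicator of V_{D_ℓ}, and treats prob_model (the logistic-regression estimator described next) as a hypothetical object with a predict_proba_pos method; the names are illustrative.

    import numpy as np

    def query_score(x, domain_of_x, W, v, theta, prob_model):
        """Expected global version-space reduction of (9) for one candidate.

        x           : raw feature vector of the candidate
        domain_of_x : index l of the domain the candidate comes from
        W           : list of per-domain weight vectors w_l
        v           : weight vector on the shared subspace
        theta       : shared-subspace transformation matrix
        prob_model  : exposes predict_proba_pos(x) -> estimate of P(y = +1 | x)
        """
        shared = float(v @ (theta @ x))
        p_pos = prob_model.predict_proba_pos(x)
        score = 0.0
        for y, p_y in ((+1, p_pos), (-1, 1.0 - p_pos)):
            for l, w_l in enumerate(W):
                # Piecewise prediction: other domains see x only through the subspace.
                f_l = float(w_l @ x) + shared if l == domain_of_x else shared
                u_norm = np.sqrt(float(w_l @ w_l) + float(v @ v))   # ||u_l|| = ||[w_l, v]||
                score += p_y * (1.0 - y * f_l) / (2.0 * u_norm)
        return score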
Algorithm 1: Multi-Domain Active Learning
Input:  (1) A pool P of unlabeled instances collected from K domains,
        (2) number of initial training data in each domain M_ℓ,
        (3) number of iterations T,
        (4) number of queried instances per iteration S
Output: K classifiers

Randomly label M_ℓ data instances of each domain, and form the initial training set D;
Learn the low-dimensional shared subspace using SFA;
for t ← 1 to T do
    Train the K classifiers on the training set D using (5);
    foreach x_n ∈ P do
        Estimate the global model loss reduction via (9);
    end
    Query the labels Y* of the S unlabeled instances U* which have the largest global model loss reduction;
    Update the training set by D ← D ∪ (U*, Y*), and remove U* from P;
end
In order to calculate P̂(y|x*) in (9), we train a Logistic Regression classifier on all training data by maximizing the log-likelihood J(w_1, ..., w_K, v) = Σ_{ℓ,i} log σ( y_{ℓi} ( w_ℓ · x_{ℓi} + v · θ x_{ℓi} ) ), and use it to estimate the probabilities. The complete process of our proposed method is summarized in Algorithm 1. The proposed method is very efficient because it only needs to learn one SVM per iteration, and in each iteration it estimates the global model loss reduction induced by each candidate efficiently via (9).
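The loop below is a compact end-to-end sketch of Algorithm 1 under the same simplifying assumptions as the earlier sketches; oracle, learn_shared_subspace, train_models and score_candidate are hypothetical callables standing in for the human annotator, SFA, problem (5) and the score in (9), respectively, and pool_X is assumed to be a NumPy array.

    import numpy as np

    def multi_domain_active_learning(pool_X, pool_domains, oracle, learn_shared_subspace,
                                     train_models, score_candidate, K, M=10, T=50, S=5,
                                     seed=0):
        rng = np.random.default_rng(seed)
        unlabeled = list(range(len(pool_X)))
        labeled, labels = [], []

        # Initial training set: M randomly labeled instances per domain.
        for l in range(K):
            members = [i for i in unlabeled if pool_domains[i] == l]
            for i in rng.choice(members, size=M, replace=False):
                labeled.append(int(i)); labels.append(oracle(int(i))); unlabeled.remove(int(i))

        theta = learn_shared_subspace(pool_X)      # shared subspace, learned once

        for _ in range(T):
            model = train_models(pool_X[labeled], np.array(labels),
                                 [pool_domains[i] for i in labeled], theta)
            # Score every remaining candidate by its expected global loss reduction (9).
            scores = {i: score_candidate(pool_X[i], pool_domains[i], model, theta)
                      for i in unlabeled}
            # Query the S highest-scoring candidates and move them into the training set.
            for i in sorted(scores, key=scores.get, reverse=True)[:S]:
                labeled.append(i); labels.append(oracle(i)); unlabeled.remove(i)
        return model, theta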
For classification problems with more than two categories, one simple and effective way is to use the one-vs-all technique. Suppose we have C classes; we can train C binary classifiers {f^{ℓ,c}_{D_ℓ}}_{c=1}^{C}, where classifier f^{ℓ,c}_{D_ℓ} is used to predict whether an instance belongs to the c'th class or not. Our multi-domain active learning method can be applied accordingly.
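A minimal sketch of the one-vs-all wrapper described above; train_binary stands in for any binary trainer (such as the hypothetical train_multidomain_svm sketched earlier), and the returned models are assumed to expose a decision_function, as scikit-learn SVMs do.

    import numpy as np

    def train_one_vs_all(X, y, domain_ids, theta, K, train_binary):
        """Train one binary classifier per class c, predicting 'class c' vs. the rest."""
        models = {}
        for c in np.unique(y):
            y_c = np.where(y == c, 1, -1)
            models[c] = train_binary(X, y_c, domain_ids, theta, K)
        return models

    def predict_one_vs_all(models, z):
        # Pick the class whose binary classifier is most confident on the (augmented) input z.
        return max(models, key=lambda c: float(models[c].decision_function([z])[0]))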
5. EXPERIMENTS

In this section, we conduct experiments on three real-world applications (i.e., sentiment classification, newsgroup classification and email spam filtering) to evaluate the effectiveness of our method.
5.1 Datasets

5.1.1 Multi-Domain Sentiment Dataset

The Multi-Domain Sentiment Dataset [3] has been widely used as a benchmark dataset for domain adaptation and sentiment analysis. It contains a collection of product reviews from four product domains: Book (B), DVD (D), Electronics (E) and Kitchen (K). Each review has been annotated with positive or negative sentiment polarity according to the users' rating scores. A summary of this dataset is given in Table 2.

From this dataset, we construct five multi-domain sentiment classification tasks: B+D+E, B+D+K, B+E+K, D+E+K and B+D+E+K, where each boldfaced letter corresponds to a domain. For example, B+D+E denotes sentiment classification in the Book, DVD and Electronics domains.
Table 2: Summary of the Multi-Domain Sentiment Dataset

Domain       #Reviews   #Pos    #Neg    #Features
Book         6,465      3,264   3,201   17,465
DVD          5,585      2,807   2,778   15,437
Electronics  7,677      3,853   3,824   13,687
Kitchen      7,945      3,954   3,991   12,439
5.1.2 20 Newsgroups

The 20 Newsgroups dataset³ has been widely used for newsgroup classification and cross-domain text classification. As in previous work [6], we generate four newsgroup domains from the dataset by utilizing its hierarchical structure. Table 3 shows the generated newsgroup domains. For example, domain NG-1 contains documents from four sub-categories, which fall under four different top-categories, respectively. The classification task is defined at the top-category level, where our goal is to classify documents into one of the four top-categories: comp, rec, sci and talk. This domain-generation strategy ensures that the domains are different but related, because different domains consist of documents from different sub-categories that are under the same top-categories.

Table 3: Four Domains Generated from 20 Newsgroups

Domain   Newsgroups
NG-1     …
NG-2     comp.os.ms-windows.misc  rec.motorcycles  sci.electronics  talk.politics.mideast
NG-3     comp.sys.ibm.pc.hardware  rec.sport.baseball  …
NG-4     comp.sys.mac.hardware  rec.sport.hockey  …  talk.religion.misc

By using the generated domains, we construct four multi-domain newsgroup classification tasks: NG-123, NG-124, NG-134 and NG-234, where each digit denotes a domain. For example, NG-123 denotes multi-domain newsgroup classification in the NG-1, NG-2 and NG-3 domains.
5.1.3 Email Spam Filtering Dataset

The email spam filtering dataset released by the ECML/PKDD 2006 discovery challenge contains 15 separate inboxes for users u00–u14, where "u**" is a user id. Each inbox contains 200 spam and 200 non-spam emails. In our experiments, each inbox is regarded as a domain, and the learning task is to train a spam filter for each user to classify whether a new mail is spam or not. From this dataset, we construct four multi-domain spam filtering tasks: u00-u04, u05-u09, u10-u14 and u00-u14. For example, u00-u04 denotes email spam filtering over the u00–u04 domains.
5.2 Comparison Methods

In order to test the effectiveness of our method (referred to as MultiAL), we compare it with several active learning approaches. The first method performs single-domain active learning on each domain independently; we call it SingleAL. The second method merges the data of all domains into a unified pool and performs active learning in that pool to train a single classifier for prediction; we call this approach UnifiedAL. In addition, once the shared subspace is identified, we can embed data instances from all domains into the shared subspace and generate a new
³people.csail.mit.edu/jrennie/20Newsgroups/