Logistic Regression with an Auxiliary Data Source
Xuejun Liao xjliao@ee.duke.edu
Ya Xue yx10@ee.duke.edu
Lawrence Carin lcarin@ee.duke.edu
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708
Abstract
To achieve good generalization in supervised learning, the training and testing examples are usually required to be drawn from the same source distribution. In this paper we propose a method to relax this requirement in the context of logistic regression. Assuming $D^p$ and $D^a$ are two sets of examples drawn from two mismatched distributions, where $D^a$ are fully labeled and $D^p$ partially labeled, our objective is to complete the labels of $D^p$. We introduce an auxiliary variable $\mu$ for each example in $D^a$ to reflect its mismatch with $D^p$. Under an appropriate constraint the $\mu$'s are estimated as a byproduct, along with the classifier. We also present an active learning approach for selecting the labeled examples in $D^p$. The proposed algorithm, called "Migratory-Logit" or M-Logit, is demonstrated successfully on simulated as well as real data sets.
1. Introduction
In supervised learning problems, the goal is to design a classifier using the training examples (labeled data) $D^{tr}=\{(x_i^{tr},y_i^{tr})\}_{i=1}^{N^{tr}}$ such that the classifier predicts the label $y_i^p$ correctly for unlabeled primary test data $D^p=\{(x_i^p,y_i^p): y_i^p \text{ missing}\}_{i=1}^{N^p}$. The accuracy of the predictions is significantly affected by the quality of $D^{tr}$, which is assumed to contain essential information about $D^p$. A common assumption utilized by learning algorithms is that $D^{tr}$ are a sufficient sample of the same source distribution from which $D^p$ are drawn. Under this assumption, a classifier designed based on $D^{tr}$ will generalize well when it is tested on $D^p$. This assumption, however, is often violated in practice. First, in many applications labeling an observation is an expensive process, resulting in insufficient labeled data in $D^{tr}$ that are not able to characterize the statistics of the primary data. Second, $D^{tr}$ and $D^p$ are typically collected under different experimental conditions and therefore often exhibit differences in their statistics.
Methods to overcome the insufficiency of labeled data have been investigated in the past few years under the names "active learning" [Cohn et al., 1995, Krogh & Vedelsby, 1995] and "semi-supervised learning" [Nigam et al., 2000], which we do not discuss here, though we will revisit active learning in Section 5.

The problem of data mismatch has been studied in econometrics, where the available $D^{tr}$ are often a non-randomly selected sample of the true distribution of interest. Heckman (1979) developed a method to correct the sample-selection bias for linear regression models. The basic idea of Heckman's method is that if one can estimate the probability of an observation being selected into the sample, one can use this probability estimate to correct the selection bias.
Heckman's model has recently been extended to classification problems [Zadrozny, 2004], where it is assumed that the primary test data $D^p \sim \Pr(x,y)$ while the training examples $D^{tr}=D^a \sim \Pr(x,y|s=1)$, where the variable $s$ controls the selection of $D^a$: if $s=1$, $(x,y)$ is selected into $D^a$; if $s=0$, $(x,y)$ is not selected into $D^a$. Evidently, unless $s$ is independent of $(x,y)$, $\Pr(x,y|s=1) \neq \Pr(x,y)$ and hence $D^a$ are mismatched with $D^p$. By Bayes rule,

$$\Pr(x,y) = \frac{\Pr(s=1)}{\Pr(s=1|x,y)}\,\Pr(x,y|s=1) \qquad (1)$$

which implies that if one has access to $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$, one can correct the mismatch by weighting and resampling [Zadrozny et al., 2003, Zadrozny, 2004]. In the special case when $\Pr(s=1|x,y)=\Pr(s=1|x)$, one may estimate $\Pr(s=1|x)$ from a sufficient sample of $\Pr(x,s)$ if such a sample is available [Zadrozny, 2004]. In the general case, however, it is difficult to estimate $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$, as we do not have a sufficient sample of $\Pr(x,y,s)$ (if we do, we already have a sufficient sample of $\Pr(x,y)$, which contradicts the assumption of the problem).
In this paper we consider the case in which we have a fully labeled auxiliary data set $D^a$ and a partially labeled primary data set $D^p=D_l^p \cup D_u^p$, where $D_l^p$ are labeled and $D_u^p$ unlabeled. We assume $D^p$ and $D^a$ are drawn from two distributions that are mismatched. Our objective is to use a mixed training set $D^{tr}=D_l^p \cup D^a$ to train a classifier that predicts the labels of $D_u^p$ accurately. Assume $D^p \sim \Pr(x,y)$. In light of equation (1), we can write $D^a \sim \Pr(x,y|s=1)$ as long as the source distributions of $D^p$ and $D^a$ have the same domain of nonzero probability¹. As explained in the previous paragraph, it is difficult to correct the mismatch by directly estimating $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$. Therefore we take an alternative approach. We introduce an auxiliary variable $\mu_i$ for each $(x_i^a,y_i^a)\in D^a$ to reflect its mismatch with $D^p$ and to control its participation in the learning process. The $\mu$'s play a similar role as the weighting factors $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$ in (1). However, unlike the weighting factors, the auxiliary variables are estimated along with the classifier in the learning. We employ logistic regression as a specific classifier and develop our method in this context.
A related problem has been studied in [Wu & Dietterich, 2004], where the classifier is trained on two fixed and labeled data sets $D^p$ and $D^a$, where $D^a$ is of lower quality and provides weaker evidence for the classifier design. The problem is approached by minimizing a weighted sum of two separate loss functions, with one defined for the primary data and the other for the auxiliary data. Our method is distinct from that in [Wu & Dietterich, 2004] in two respects. First, we introduce an auxiliary variable $\mu_i$ for each $(x_i^a,y_i^a)\in D^a$ and the auxiliary variables are estimated along with the classifier. A large $\mu_i$ implies large mismatch of $(x_i^a,y_i^a)$ with $D^p$ and accordingly less participation of $x_i^a$ in learning the classifier. Second, we present an active learning strategy to define $D_l^p \subset D^p$ when $D^p$ is initially fully unlabeled.
The remainder of the paper is organized as follows. A detailed description of the proposed method is provided in Section 2, followed by description of a fast learning algorithm in Section 3 and a theoretical discussion in Section 4. In Section 5 we present a method to actively define $D_l^p$ when $D_l^p$ is initially empty. We demonstrate example results in Section 6. Finally, Section 7 contains the conclusions.

¹For any $\Pr(x,y|s=1) \neq 0$ and $\Pr(x,y) \neq 0$, there exists $\frac{\Pr(s=1)}{\Pr(s=1|x,y)} = \frac{\Pr(x,y)}{\Pr(x,y|s=1)} \in (0,\infty)$ such that equation (1) is satisfied. For $\Pr(x,y|s=1)=\Pr(x,y)=0$, any $\frac{\Pr(s=1)}{\Pr(s=1|x,y)} \neq 0$ makes equation (1) satisfied.
2. Migratory-Logit: Learning Jointly on the Primary and Auxiliary Data
We assume $D_l^p$ are fixed and nonempty, and without loss of generality, we assume $D_l^p$ are always indexed prior to $D_u^p$: $D_l^p=\{(x_i^p,y_i^p)\}_{i=1}^{N_l^p}$ and $D_u^p=\{(x_i^p,y_i^p): y_i^p \text{ missing}\}_{i=N_l^p+1}^{N^p}$. We use $N^a$, $N^p$, and $N_l^p$ to denote the size (number of data points) of $D^a$, $D^p$, and $D_l^p$, respectively. In Section 5 we discuss how to actively determine $D_l^p$ when $D_l^p$ is initially empty.
We consider the binary classification problem and the labels $y^a, y^p \in \{-1,1\}$. For notational simplicity, we let $x$ always include a 1 as its first element to accommodate a bias (intercept) term, thus $x^p, x^a \in \mathbb{R}^{d+1}$ where $d$ is the number of features. For a primary data point $(x_i^p,y_i^p)\in D_l^p$, we follow standard logistic regression to write

$$\Pr(y_i^p|x_i^p;w) = \sigma(y_i^p w^T x_i^p) \qquad (2)$$

where $w\in\mathbb{R}^{d+1}$ is a column vector of classifier parameters and $\sigma(\mu)=\frac{1}{1+\exp(-\mu)}$ is the sigmoid function.
For an auxiliary data point $(x_i^a,y_i^a)\in D^a$, we define

$$\Pr(y_i^a|x_i^a;w,\mu_i) = \sigma(y_i^a w^T x_i^a + y_i^a\mu_i) \qquad (3)$$

where $\mu_i$ is an auxiliary variable. Assuming the examples in $D_l^p$ and $D^a$ are drawn i.i.d., we have the log-likelihood function

$$\ell(w,\mu; D_l^p\cup D^a) = \sum_{i=1}^{N_l^p}\ln\sigma(y_i^p w^T x_i^p) + \sum_{i=1}^{N^a}\ln\sigma(y_i^a w^T x_i^a + y_i^a\mu_i) \qquad (4)$$

where $\mu=[\mu_1,\cdots,\mu_{N^a}]^T$ is a column vector of all auxiliary variables.
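For concreteness, the log-likelihood (4) can be evaluated as in the following sketch. This is our own illustration (not the authors' implementation), assuming NumPy arrays in which the rows of `Xp` and `Xa` are feature vectors already augmented with a leading 1:

```python
import numpy as np

def log_sigmoid(z):
    # numerically stable ln sigma(z) = -ln(1 + exp(-z))
    return -np.logaddexp(0.0, -z)

def m_logit_loglik(w, mu, Xp, yp, Xa, ya):
    """Log-likelihood (4) of M-Logit.

    Xp: (N_l^p, d+1) labeled primary inputs, yp: labels in {-1, +1}
    Xa: (N^a, d+1) auxiliary inputs, ya: labels in {-1, +1}
    mu: (N^a,) auxiliary variables, one per auxiliary example
    """
    primary = log_sigmoid(yp * (Xp @ w)).sum()              # sum_i ln sigma(y_i^p w^T x_i^p)
    auxiliary = log_sigmoid(ya * (Xa @ w) + ya * mu).sum()  # auxiliary terms with the extra intercepts
    return primary + auxiliary
```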
The auxiliary variable $\mu_i$ is introduced to reflect the mismatch of $(x_i^a,y_i^a)$ with $D^p$ and to control its participation in the learning of $w$. A larger $y_i^a\mu_i$ makes $\Pr(y_i^a|x_i^a;w,\mu_i)$ less sensitive to $w$. When $y_i^a\mu_i=\infty$, $\Pr(y_i^a|x_i^a;w,\mu_i)=1$ becomes completely independent of $w$. Geometrically, the $\mu_i$ is an extra intercept term that is uniquely associated with $x_i^a$ and causes it to migrate towards class $y_i^a$. If $(x_i^a,y_i^a)$ is mismatched with the primary data $D^p$, $w$ cannot make $\sum_{i=1}^{N_l^p}\ln\sigma(y_i^p w^T x_i^p)$ and $\ln\sigma(y_i^a w^T x_i^a)$ large at the same time. In this case $x_i^a$ will be given an appropriate $\mu_i$ to allow it to migrate towards class $y_i^a$, so that $w$ is less sensitive to $(x_i^a,y_i^a)$ and can focus more on fitting $D_l^p$. Evidently, if the $\mu$'s are allowed to change freely, their influence will override that of $w$ in fitting the auxiliary data $D^a$ and then $D^a$ will not participate in learning $w$. To prevent this from happening, we introduce constraints on $\mu_i$ and maximize the log-likelihood subject to the constraints:
$$\max_{w,\mu}\ \ell(w,\mu; D_l^p\cup D^a) \qquad (5)$$

subject to

$$\frac{1}{N^a}\sum_{i=1}^{N^a} y_i^a\mu_i \leq C,\quad C\geq 0 \qquad (6)$$

$$y_i^a\mu_i \geq 0,\quad i=1,2,\cdots,N^a \qquad (7)$$

where the inequalities in (7) reflect the fact that in order for $x_i^a$ to fit $y_i^a=1$ (or $y_i^a=-1$) we need to have $\mu_i>0$ (or $\mu_i<0$), if we want $\mu_i$ to exert a positive influence in the fitting process. Under the constraints in (7), a larger value of $y_i^a\mu_i$ represents a larger mismatch between $(x_i^a,y_i^a)$ and $D^p$ and accordingly makes $(x_i^a,y_i^a)$ play a less important role in determining $w$. The classifier resulting from solving the problem in (5)-(7) is referred to as "Migratory-Logit" or "M-Logit". The $C$ in (6) reflects the average mismatch between $D^a$ and $D^p$ and controls the average participation of $D^a$ in determining $w$. It can be learned from data if we have a reasonable amount of $D_l^p$. However, in practice we usually have no or very scarce $D_l^p$ to begin with. In this case, we must rely on other information to set $C$. We will come back to a more detailed discussion of $C$ in Section 4.
3. Fast Learning Algorithm

The optimization problem in (5), (6), and (7) is concave and any standard technique can be utilized to find the global maxima. However, there is a unique $\mu_i$ associated with every $(x_i^a,y_i^a)\in D^a$, and when $D^a$ is large using a standard method to estimate the $\mu$'s can consume most of the computational time.

In this section, we give a fast algorithm for training the M-Logit, by taking a block-coordinate ascent approach [Bertsekas, 1999], in which we alternately solve for $w$ and $\mu$, keeping one fixed when solving for the other. The algorithm draws its efficiency from the analytic solution of $\mu$, which we establish in the following theorem. Proof of the theorem is given in the appendix, and Section 4 contains a discussion that helps to understand the theorem from an intuitive perspective.

Theorem 1: Let $f(z)$ be a twice continuously differentiable function with second derivative $f''(z)<0$ for any $z\in\mathbb{R}$. Let $b_1\leq b_2\leq\cdots\leq b_N$, $R\geq 0$, and

$$n = \max\left\{m : m\,b_m - \sum_{i=1}^{m} b_i \leq R,\ 1\leq m\leq N\right\} \qquad (8)$$

Then the problem

$$\max_{\{z_i\}}\ \sum_{i=1}^{N} f(b_i+z_i) \qquad (9)$$

subject to

$$\sum_{i=1}^{N} z_i \leq R,\quad R\geq 0 \qquad (10)$$

$$z_i \geq 0,\quad i=1,2,\cdots,N \qquad (11)$$

has a unique global solution

$$z_i = \begin{cases} \frac{1}{n}\left(\sum_{j=1}^{n} b_j + R\right) - b_i, & 1\leq i\leq n \\ 0, & n<i\leq N \end{cases} \qquad (12)$$
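The solution (12), together with the threshold $n$ in (8), amounts to a water-filling computation that raises the smallest $b_i$ to a common level. The sketch below is our own NumPy illustration of the theorem, not code from the paper:

```python
import numpy as np

def waterfill(b, R):
    """Theorem 1: spend budget R (>= 0) raising the smallest b_i to a common level."""
    b = np.asarray(b, dtype=float)
    order = np.argsort(b)                  # work on b sorted ascending
    bs = b[order]
    N = len(bs)
    csum = np.cumsum(bs)
    m = np.arange(1, N + 1)
    # n = max{m : m*b_m - sum_{i<=m} b_i <= R}, equation (8); m = 1 is always feasible
    n = int(m[m * bs - csum <= R].max())
    level = (csum[n - 1] + R) / n          # common level (1/n)(sum_{j<=n} b_j + R)
    zs = np.zeros(N)
    zs[:n] = level - bs[:n]                # equation (12); zero for the remaining entries
    z = np.zeros(N)
    z[order] = zs                          # undo the sort
    return z
```

For instance, `waterfill([-3.0, 0.5, 2.0], R=4.0)` gives `[3.75, 0.25, 0.0]`: the whole budget is spent raising the two poorest entries to the common level 0.75, and the constraint $\sum_i z_i \leq R$ is met with equality.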
For a fixed $w$, the problem in (5)-(7) is simplified to maximizing $\sum_{i=1}^{N^a}\ln\sigma(y_i^a w^T x_i^a + y_i^a\mu_i)$ with respect to $\mu$, subject to $\frac{1}{N^a}\sum_{i=1}^{N^a} y_i^a\mu_i \leq C$, $C\geq 0$, and $y_i^a\mu_i\geq 0$ for $i=1,2,\cdots,N^a$. Clearly $\ln\sigma(z)$ is a twice continuously differentiable function of $z$ and its second derivative $\frac{\partial^2}{\partial z^2}\ln\sigma(z) = -\sigma(z)\sigma(-z) < 0$ for $-\infty<z<\infty$. Thus Theorem 1 applies. We first solve $\{y_i^a\mu_i\}$ using Theorem 1, then $\{\mu_i\}$ are trivially solved using the fact $y_i^a\in\{-1,1\}$. Assume $y_{k_1}^a w^T x_{k_1}^a \leq y_{k_2}^a w^T x_{k_2}^a \leq \cdots \leq y_{k_{N^a}}^a w^T x_{k_{N^a}}^a$, where $k_1,k_2,\cdots,k_{N^a}$ is a permutation of $1,2,\cdots,N^a$. Then we can write the solution of $\{\mu_i\}$ analytically,

$$\mu_{k_i} = \begin{cases} \frac{1}{n\,y_{k_i}^a}\left(\sum_{j=1}^{n} y_{k_j}^a w^T x_{k_j}^a + N^a C\right) - w^T x_{k_i}^a, & 1\leq i\leq n \\ 0, & n<i\leq N^a \end{cases} \qquad (13)$$

where

$$n = \max\left\{m : m\,y_{k_m}^a w^T x_{k_m}^a - \sum_{i=1}^{m} y_{k_i}^a w^T x_{k_i}^a \leq N^a C,\ 1\leq m\leq N^a\right\} \qquad (14)$$

For a fixed $\mu$, we use the standard gradient-based method [Bertsekas, 1999] to find $w$. The main procedures of the fast training algorithm for M-Logit are summarized in Table 1, where the gradient $\nabla_w\ell$ and the Hessian matrix $\nabla_w^2\ell$ are computed from (4).
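The resulting block-coordinate ascent (summarized in Table 1 below) might be sketched as follows. This is an illustrative simplification of ours, assuming NumPy: it reuses the `waterfill` routine sketched after Theorem 1 and takes a plain gradient step for $w$ in place of the Newton direction and line search of Table 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_m_logit(Xp, yp, Xa, ya, C, n_iter=200, lr=0.1):
    """Alternate the analytic mu-update (13)-(14) with an ascent step in w."""
    Na, d1 = Xa.shape
    w = np.zeros(d1)
    mu = np.zeros(Na)
    for _ in range(n_iter):
        # mu-step: water-fill y_i^a mu_i subject to sum_i y_i^a mu_i <= N^a C, each >= 0
        b = ya * (Xa @ w)              # the "wealth" each auxiliary point gets from w
        ymu = waterfill(b, Na * C)     # y_i^a mu_i, cf. equations (13)-(14)
        mu = ya * ymu                  # y_i^a in {-1,+1}, so mu_i = y_i^a (y_i^a mu_i)
        # w-step: gradient of (4); Table 1 uses a Newton direction with a line search instead
        grad = (sigmoid(-yp * (Xp @ w)) * yp) @ Xp \
             + (sigmoid(-(ya * (Xa @ w) + ymu)) * ya) @ Xa
        w = w + lr * grad
    return w, mu
```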
Table 1. Fast Learning Algorithm of M-Logit

Input: $D^a\cup D_l^p$ and $C$; Output: $w$ and $\{\mu_i\}_{i=1}^{N^a}$

1. Initialize $w$ and set $\mu_i=0$ for $i=1,2,\cdots,N^a$.
2. Compute the gradient $\nabla_w\ell$ and Hessian matrix $\nabla_w^2\ell$.
3. Compute the ascent direction $d = -(\nabla_w^2\ell)^{-1}\nabla_w\ell$.
4. Do a linear search for the step-size $\alpha^* = \arg\max_\alpha \ell(w+\alpha d)$.
5. Update $w$: $w \leftarrow w + \alpha^* d$.
6. Sort $\{y_i^a w^T x_i^a\}_{i=1}^{N^a}$ in ascending order. Assume the result is $y_{k_1}^a w^T x_{k_1}^a \leq y_{k_2}^a w^T x_{k_2}^a \leq \cdots \leq y_{k_{N^a}}^a w^T x_{k_{N^a}}^a$, where $k_1,k_2,\cdots,k_{N^a}$ is a permutation of $1,2,\cdots,N^a$.
7. Find the $n$ using (14).
8. Update the auxiliary variables $\{\mu_i\}_{i=1}^{N^a}$ using (13).
9. Check the convergence of $\ell$: exit and output $w$ and $\{\mu_i\}_{i=1}^{N^a}$ if converged; go back to 2 otherwise.

4. Auxiliary Variables and Choice of C

Theorem 1 and its constructive proof in the appendix offer some insight into the mechanism by which the mismatch between $D^a$ and $D^p$ is compensated through the auxiliary variables $\{\mu_i\}$. To make the description easier, we think of each data point $x_i^a\in D^a$ as getting a major "wealth" $y_i^a w^T x_i^a$ from $w$ and an additional wealth $y_i^a\mu_i$ from a given budget totaling $N^a C$ ($C$ represents the average budget for a single $x^a$). From the appendix, $N^a C$ is distributed among the auxiliary data $\{x_i^a\}$ by a "poorest-first" rule: the "poorest" $x_{k_1}^a$ (that which has the smallest $y_{k_1}^a w^T x_{k_1}^a$) gets a portion $y_{k_1}^a\mu_{k_1}$ from $N^a C$ first, and as soon as the total wealth $y_{k_1}^a w^T x_{k_1}^a + y_{k_1}^a\mu_{k_1}$ reaches the wealth of the second poorest $x_{k_2}^a$, $N^a C$ becomes equally distributed to $x_{k_1}^a$ and $x_{k_2}^a$ such that their total wealths are always equal. Then, as soon as $y_{k_1}^a w^T x_{k_1}^a + y_{k_1}^a\mu_{k_1} = y_{k_2}^a w^T x_{k_2}^a + y_{k_2}^a\mu_{k_2}$ reach the wealth of the third poorest, $N^a C$ becomes equally distributed among the three of them to make them equally rich. The distribution continues in this way until the budget $N^a C$ is used up. The "poorest-first" rule is essentially a result of the concavity of the logarithmic sigmoid function $\ln\sigma(\cdot)$. The goal is to maximize $\sum_{i=1}^{N^a}\ln\sigma(y_i^a w^T x_i^a + y_i^a\mu_i)$. The concavity of $\ln\sigma(\cdot)$ dictates that for any given portion of $N^a C$, distributing it to the poorest makes the maximum gain in $\ln\sigma$.
The $C$ is used as a means to compensate for the loss that $D^a$ may suffer from $w$. The classifier $w$ is responsible for correctly classifying both $D^a$ and $D^p$. Because $D^a$ and $D^p$ are mismatched, $w$ cannot satisfy both of them: one must suffer if the other is to gain. As $D^p$ is the primary data set, we want $w$ to classify $D^p$ as accurately as possible. The auxiliary variables are therefore introduced to represent compensations that $D^a$ get from $C$. When $x^a$ gets small wealth from $w$ and is poor, it is because $x^a$ is mismatched and in conflict with $D^p$ (assuming perfect separation of $D^a$, no conflict exists among themselves). By the "poorest-first" rule, the most mismatched $x^a$ gets compensation first.
A high compensation $y_i^a\mu_i$ whittles down the participation of $x_i^a$ in learning $w$. This is easily seen from the contribution of $(x_i^a,y_i^a)$ to $\nabla_w\ell$ and $\nabla_w^2\ell$, which are obtained from (4) as $\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)\,y_i^a x_i^a$ and $-\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)\,\sigma(y_i^a w^T x_i^a + y_i^a\mu_i)\,x_i^a {x_i^a}^T$, respectively. When $y_i^a\mu_i$ is large, $\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)$ is close to zero and hence the contributions of $(x_i^a,y_i^a)$ to $\nabla_w\ell$ and $\nabla_w^2\ell$ are ignorable. We in fact do not need an infinitely large $y_i^a\mu_i$ to make the contributions of $x_i^a$ ignorable, because $\sigma(\mu)$ is almost saturated at $\mu=\pm 6$. If $y_i^a w^T x_i^a = -6$, $\sigma(-y_i^a w^T x_i^a)=0.9975$, implying a large contribution of $(x_i^a,y_i^a)$ to $\nabla_w\ell$, which happens when $w$ assigns $x_i^a$ to the correct class $y_i^a$ with probability of only $\sigma(y_i^a w^T x_i^a)=\sigma(-6)=0.0025$. In this nearly worst case, a compensation of $y_i^a\mu_i=12$ can effectively remove the contribution of $(x_i^a,y_i^a)$ because $\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)=\sigma(6-12)=\sigma(-6)=0.0025$. To effectively remove the contributions of $N_m$ auxiliary data, one needs a total budget of $12N_m$, resulting in an average budget $C=12N_m/N^a$.
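The arithmetic behind the $C=12N_m/N^a$ heuristic is easy to verify numerically (a small sanity check of ours, not from the paper):

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigma(6.0))         # ~0.9975: gradient weight of a badly mismatched point with y w^T x = -6
print(sigma(6.0 - 12.0))  # ~0.0025: the same weight after a compensation y*mu = 12
# removing N_m such points costs about 12 each, hence the heuristic C = 12*N_m/N^a
```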
To make the right choice of $C$, $N_m/N^a$ should represent the rate at which $D^a$ are mismatched with $D^p$. This is so because we want $N^a C$ to be distributed only to that part of $D^a$ that is mismatched with $D^p$, thus permitting us to use the remaining part in learning $w$. The quantity $N_m/N^a$ is usually unknown in practice. However, $C=12N_m/N^a$ gives one a sense of at least what range $C$ should be in. As $0\leq N_m\leq N^a$, letting $0\leq C\leq 12$ is usually a reasonable choice. In our experience, the performance of M-Logit is relatively robust to $C$, and this will be demonstrated in Section 6.2 using an example data set.
5. Active Selection of $D_l^p$

In Section 2 we assumed that $D_l^p$ had already been determined. In this section we describe how $D_l^p$ can be actively selected from $D^p$, based on the Fisher information matrix [Fedorov, 1972, MacKay, 1992]. The approach is known as active learning [Cohn et al., 1995, Krogh & Vedelsby, 1995].
Let $Q$ denote the Fisher information matrix of $D_l^p\cup D^a$ about $w$. By definition of the Fisher information matrix [Cover & Thomas, 1991], $Q = E_{\{y_i^p\},\{y_i^a\}}\left[\frac{\partial\ell}{\partial w}\frac{\partial\ell}{\partial w^T}\right]$, and substituting (4) into this equation gives (a brief derivation is given in the appendix)

$$Q = \sum_{i=1}^{N_l^p}\sigma_i^p(1-\sigma_i^p)\,x_i^p{x_i^p}^T + \sum_{i=1}^{N^a}\sigma_i^a(1-\sigma_i^a)\,x_i^a{x_i^a}^T \qquad (15)$$

where $\sigma_i^p = \sigma(w^T x_i^p)$ for $i=1,2,\ldots,N_l^p$, $\sigma_i^a = \sigma(w^T x_i^a + \mu_i)$ for $i=1,2,\ldots,N^a$, and $w$ and $\{\mu_i\}$ represent the true classifier and auxiliary variables.
It is well known that the inverse Fisher information $Q^{-1}$ lower bounds the covariance matrix of the estimated $w$ [Cover & Thomas, 1991]. In particular, $[\det(Q)]^{-1}$ lower bounds the product of variances of the elements in $w$. The goal in selecting $D_l^p$ is to reduce the variances, or uncertainty, of $w$. Thus we seek the $D_l^p$ that maximizes $\det(Q)$.

The selection proceeds in a sequential manner. Initially $D_u^p=D^p$, $D_l^p$ is empty, and $Q=\sum_{i=1}^{N^a}\sigma_i^a(1-\sigma_i^a)\,x_i^a{x_i^a}^T$. Then, one at a time, a data point $x_i^p\in D_u^p$ is selected and moved from $D_u^p$ to $D_l^p$. This causes $Q$ to be updated as $Q \leftarrow Q + \sigma_i^p(1-\sigma_i^p)\,x_i^p(x_i^p)^T$. At each iteration, the selection is based on

$$\max_{x_i^p\in D_u^p}\det\!\left[Q+\sigma_i^p(1-\sigma_i^p)\,x_i^p(x_i^p)^T\right] = \max_{x_i^p\in D_u^p}\left[1+\sigma_i^p(1-\sigma_i^p)\,(x_i^p)^T Q^{-1}x_i^p\right] \qquad (16)$$

where the equality holds up to the common factor $\det(Q)$ (by the matrix determinant lemma), and where we assume the existence of $Q^{-1}$, which can often be assured by using sufficient auxiliary data $D^a$. Evaluation of (16) requires the true values of $w$ and $\{\mu_i\}$, which are not known a priori. We follow Fedorov (1972) and replace them with the $w$ and $\{\mu_i\}$ estimated from $D^a\cup D_l^p$, where $D_l^p$ are the primary labeled data selected up to the present.
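A sketch of the sequential selection, under the same NumPy conventions as the earlier sketches (our illustration; the function names are ours): `fisher_information` builds $Q$ of (15) and `select_next` scores candidates by criterion (16).

```python
import numpy as np

def fisher_information(Xp_l, Xa, w, mu):
    """Q in (15): contributions from the labeled primary data and the auxiliary data."""
    sp = 1.0 / (1.0 + np.exp(-(Xp_l @ w)))
    sa = 1.0 / (1.0 + np.exp(-(Xa @ w) - mu))
    return (Xp_l * (sp * (1 - sp))[:, None]).T @ Xp_l \
         + (Xa * (sa * (1 - sa))[:, None]).T @ Xa

def select_next(Q, Xu, w):
    """Pick the unlabeled primary point maximizing 1 + s(1-s) x^T Q^{-1} x, criterion (16)."""
    s = 1.0 / (1.0 + np.exp(-(Xu @ w)))
    Qinv = np.linalg.inv(Q)
    scores = 1.0 + s * (1.0 - s) * np.einsum('ij,jk,ik->i', Xu, Qinv, Xu)
    i = int(np.argmax(scores))
    Q_new = Q + s[i] * (1.0 - s[i]) * np.outer(Xu[i], Xu[i])  # rank-one update after labeling x_i
    return i, Q_new
```

In a full loop, one would retrain M-Logit after each selection (as described above) before scoring the next candidate.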
6. Results

In this section the performance of M-Logit is demonstrated and compared to standard logistic regression, using test error rate as the performance index. The M-Logit is trained using $D^a\cup D_l^p$, where $D_l^p$ are either randomly selected from $D^p$, or actively selected from $D^p$ using the method in Section 5. When $D_l^p$ are randomly selected, 50 independent trials are performed and the results are obtained as an average over the trials. Three logistic regression classifiers are trained using different combinations of $D^a$ and $D_l^p$: $D^a\cup D_l^p$, $D_l^p$ alone, and $D^a$ alone, where $D_l^p$ are identical to the $D_l^p$ used for M-Logit. The four classifiers are tested on $D_u^p = D^p\setminus D_l^p$, using the following decision rule: declare $y^p=-1$ if $\sigma(w^T x^p)\leq 0.5$ and $y^p=1$ otherwise, for any $x^p\in D_u^p$.

Throughout this section the $C$ for M-Logit is set to $C=6$ when the comparison is made to logistic regression. In addition, we present a comparison of M-Logit with different $C$'s, to examine the sensitivity of M-Logit's performance to $C$.

6.1. A Toy Example
In the first example, the primary data are simulated from two bivariate Gaussian distributions representing class "$-1$" and class "$+1$", respectively. In particular, we have $\Pr(x^p|y^p=-1)=\mathcal{N}(x^p;\mu_0,\Sigma)$ and $\Pr(x^p|y^p=+1)=\mathcal{N}(x^p;\mu_1,\Sigma)$, where the Gaussian parameters are $\mu_0=[0,0]^T$, $\mu_1=[2.3,2.3]^T$, and $\Sigma=\begin{bmatrix}1.75 & -0.433\\ -0.433 & 1.25\end{bmatrix}$. The auxiliary data $D^a$ are then a selected draw from the two Gaussian distributions, as described in [Zadrozny, 2004]. We take the selection probability $\Pr(s|x^p,y^p=-1)=\sigma(w_0+w_1 K(x^p,\mu_0^s;\Sigma))$ and $\Pr(s|x^p,y^p=+1)=\sigma(w_0+w_1 K(x^p,\mu_1^s;\Sigma))$, where $\sigma$ is the sigmoid function, $w_0=-1$, $w_1=\exp(1)$, $K(x^p,\mu_0^s;\Sigma)=\exp\{-0.5(x^p-\mu_0^s)^T\Sigma^{-1}(x^p-\mu_0^s)\}$ with $\mu_0^s=[2,1]^T$, and $K(x^p,\mu_1^s;\Sigma)=\exp\{-0.5(x^p-\mu_1^s)^T\Sigma^{-1}(x^p-\mu_1^s)\}$ with $\mu_1^s=[0,3]^T$. We obtain 150 samples of $D^p$ and 150 samples of $D^a$, which are shown in Figure 3.
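To make the selection-bias mechanism concrete, the toy data might be generated along the following lines (a sketch under the stated parameters; the seed and the acceptance-loop implementation of the selection probability are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)              # seed chosen arbitrarily
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.3, 2.3])
Sigma = np.array([[1.75, -0.433], [-0.433, 1.25]])
Sigma_inv = np.linalg.inv(Sigma)
mus0, mus1 = np.array([2.0, 1.0]), np.array([0.0, 3.0])
w0, w1 = -1.0, np.exp(1.0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
K = lambda x, m: np.exp(-0.5 * (x - m) @ Sigma_inv @ (x - m))

def draw_primary():
    y = rng.choice([-1, 1])                 # equiprobable classes
    x = rng.multivariate_normal(mu0 if y == -1 else mu1, Sigma)
    return x, y

def draw_auxiliary():
    # draw from the primary distribution, keep with the class-dependent selection probability
    while True:
        x, y = draw_primary()
        if rng.random() < sigma(w0 + w1 * K(x, mus0 if y == -1 else mus1)):
            return x, y

Dp = [draw_primary() for _ in range(150)]   # primary data
Da = [draw_auxiliary() for _ in range(150)] # selection-biased auxiliary data
```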
The M-Logit and logistic regression classifiers are trained and tested as explained at the beginning of this section. The test error rates are shown in Figure 1 and Figure 2, as a function of the number of primary labeled data used in training. The $D_l^p$ in Figure 1 are randomly selected and the $D_l^p$ in Figure 2 are actively selected as described in Section 5.
Figure 1. Test error rates of M-Logit and logistic regression on the toy data, as a function of the size of $D_l^p$. The primary labeled data $D_l^p$ are randomly selected from $D^p$. The error rates are an average over 50 independent trials of random selection of $D_l^p$.

Figure 2. Error rates of M-Logit and logistic regression on the toy data, as a function of the size of $D_l^p$. The primary labeled data $D_l^p$ are actively selected from $D^p$, using the method in Section 5.

Several observations are made from inspection of Figures 1 and 2.

• The M-Logit consistently outperforms the three standard logistic regression classifiers, by a considerable margin. This improvement is a result of properly fusing $D^a$ and $D_l^p$, with $D^a$ determining the classifier under the guidance of few $D_l^p$.
• The performance of the logistic regression trained on $D_l^p$ alone changes significantly with the size of $D_l^p$. This is understandable, considering that $D_l^p$ are the only examples determining the classifier. The abrupt drop of errors from iteration 11 to iteration 12 in Figure 2 may be because the label found at iteration 12 is critical to determining $w$.

• The logistic regression trained on $D^a$ alone performs significantly worse than M-Logit, reflecting a marked mismatch between $D^a$ and $D^p$.

• The logistic regression trained on $D^a\cup D_l^p$ improves, but mildly, as $D_l^p$ grows, and it is ultimately outperformed by the logistic regression trained on $D_l^p$ alone, demonstrating that some data in $D^a$ are mismatched with $D^p$ and hence cannot be correctly classified along with $D^p$, if the mismatch is not compensated.

• As $D_l^p$ grows, the logistic regression trained on $D_l^p$ alone finally approaches M-Logit, showing that without the interference of $D^a$, a sufficient $D_l^p$ can define a correct classifier.
• All four classifiers benefit from the actively selected $D_l^p$; this is consistent with the general observation with active learning [Cohn et al., 1995, Krogh & Vedelsby, 1995].

To better understand the active selection process, we show in Figure 3 the first few iterations of active learning. Iteration 0 corresponds to the initially empty $D_l^p$, and iterations 1, 5, 10, 13 respectively correspond to 1, 5, 10, 13 data points selected accumulatively from $D_u^p$ into $D_l^p$. Each time a new data point is selected, $w$ is re-trained, yielding the different decision boundaries. As can be seen in Figure 3, the decision boundary does not change much after 10 data are selected, demonstrating convergence.

In Figure 3, each auxiliary data point $x_i^a\in D^a$ is symbolically displayed with a size in proportion to $\exp(-y_i^a\mu_i/12)$; hence a small symbol of auxiliary data corresponds to a large $y_i^a\mu_i$ and hence small participation in determining $w$. The auxiliary data that cannot be correctly classified along with the primary data are de-emphasized by the M-Logit. Usually the auxiliary data near the decision boundary are de-emphasized.
6.2. Results on the Wisconsin Breast Cancer Databases

In the second example we consider the Wisconsin Breast Cancer Databases from the UCI Machine Learning Repository. The data set consists of 569 instances with feature dimensionality 30. We randomly partition the data set into two subsets, one with 228 data points and the other with 341 data points. The first is used as $D^p$, and the second as $D^a$. We artificially make $D^a$ mismatched with $D^p$ by introducing errors into the labels and adding noise to the features. Specifically, we make changes to 50% randomly chosen $(x_i^a,y_i^a)\in D^a$: we change the signs of $y_i^a$ and add 0 dB white Gaussian noise to $x_i^a$. We then proceed, as in Section 6.1, to training and testing the four classifiers.
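Our reading of this corruption step, as a sketch (the exact preprocessing is not specified beyond the description above; here 0 dB is interpreted as per-feature noise power equal to signal power):

```python
import numpy as np

def corrupt_auxiliary(Xa, ya, frac=0.5, rng=None):
    """Flip labels and add 0 dB (unit-SNR) white Gaussian noise to a random subset."""
    if rng is None:
        rng = np.random.default_rng(0)
    Xa, ya = Xa.copy(), ya.copy()
    idx = rng.choice(len(ya), size=int(frac * len(ya)), replace=False)
    ya[idx] = -ya[idx]                           # introduce label errors
    power = Xa[idx].var(axis=0)                  # per-feature signal power
    Xa[idx] += rng.standard_normal(Xa[idx].shape) * np.sqrt(power)  # noise power = signal power
    return Xa, ya
```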
We again consider both random $D_l^p$ and actively selected $D_l^p$. The test errors are summarized in Figures 4 and 5. The results are essentially consistent with those in Figures 1 and 2, extending the observations we made there to the real data here. It is particularly noted that the mismatch between $D^a$ and $D^p$ here is more prominent than in the toy data, as manifested by the error rates of the logistic regression trained alone on $D^a$. This makes M-Logit more advantageous in the comparison: not only does it give the best results, but it also converges faster than the others with the size of $D_l^p$.

To examine the effect of $C$ on the performance of M-Logit, we present in Figure 6 the test error rates of M-Logit using five different $C$'s: $C=2,4,6,8,10$. Here the $D_l^p$ are determined by active learning as described in Section 5. Clearly, the results for the 5 different $C$'s are almost indistinguishable. This relative insensitivity of M-Logit to $C$ may partly be attributed to the adaptivity brought about by active learning. With different $C$, the $D_l^p$ are also selected differently, thus counteracting the effect of $C$ and keeping M-Logit robust.