Logistic Regression with an Auxiliary Data Source
Xuejun Liao xjliao@ee.duke.edu
Ya Xue yx10@ee.duke.edu
Lawrence Carin lcarin@ee.duke.edu
Department of Electrical and Computer Engineering, Duke University, Durham, NC 27708
Abstract
To achieve good generalization in supervised learning, the training and testing examples are usually required to be drawn from the same source distribution. In this paper we propose a method to relax this requirement in the context of logistic regression. Assuming $D^p$ and $D^a$ are two sets of examples drawn from two mismatched distributions, where $D^a$ are fully labeled and $D^p$ partially labeled, our objective is to complete the labels of $D^p$. We introduce an auxiliary variable $\mu$ for each example in $D^a$ to reflect its mismatch with $D^p$. Under an appropriate constraint the $\mu$'s are estimated as a byproduct, along with the classifier. We also present an active learning approach for selecting the labeled examples in $D^p$. The proposed algorithm, called "Migratory-Logit" or M-Logit, is demonstrated successfully on simulated as well as real data sets.
1. Introduction
In supervised learning problems, the goal is to design a classifier using the training examples (labeled data) $D^{tr}=\{(x_i^{tr},y_i^{tr})\}_{i=1}^{N^{tr}}$ such that the classifier predicts the label $y_i^p$ correctly for unlabeled primary test data $D^p=\{(x_i^p,y_i^p): y_i^p \text{ missing}\}_{i=1}^{N^p}$. The accuracy of the predictions is significantly affected by the quality of $D^{tr}$, which is assumed to contain essential information about $D^p$. A common assumption utilized by learning algorithms is that $D^{tr}$ are a sufficient sample of the same source distribution from which $D^p$ are drawn. Under this assumption, a classifier designed based on $D^{tr}$ will generalize well when it is tested on $D^p$. This assumption, however, is often violated in practice. First, in many applications labeling an observation is an expensive process, resulting in insufficient labeled data in $D^{tr}$ that are not able to characterize the statistics of the primary data. Second, $D^{tr}$ and $D^p$ are typically collected under different experimental conditions and therefore often exhibit differences in their statistics.
Methods to overcome the insufficiency of labeled data have been investigated in the past few years under the names "active learning" [Cohn et al., 1995, Krogh & Vedelsby, 1995] and "semi-supervised learning" [Nigam et al., 2000], which we do not discuss here, though we will revisit active learning in Section 5.

The problem of data mismatch has been studied in econometrics, where the available $D^{tr}$ are often a non-randomly selected sample of the true distribution of interest. Heckman (1979) developed a method to correct the sample-selection bias for linear regression models. The basic idea of Heckman's method is that if one can estimate the probability of an observation being selected into the sample, one can use this probability estimate to correct the selection bias.
Heckman's model has recently been extended to classification problems [Zadrozny, 2004], where it is assumed that the primary test data $D^p \sim \Pr(x,y)$ while the training examples $D^{tr}=D^a \sim \Pr(x,y|s=1)$, where the variable $s$ controls the selection of $D^a$: if $s=1$, $(x,y)$ is selected into $D^a$; if $s=0$, $(x,y)$ is not selected into $D^a$. Evidently, unless $s$ is independent of $(x,y)$, $\Pr(x,y|s=1) \neq \Pr(x,y)$ and hence $D^a$ are mismatched with $D^p$. By Bayes rule,

$$\Pr(x,y) = \frac{\Pr(s=1)}{\Pr(s=1|x,y)}\,\Pr(x,y|s=1) \qquad (1)$$

which implies that if one has access to $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$, one can correct the mismatch by weighting and resampling [Zadrozny et al., 2003, Zadrozny, 2004]. In the special case when $\Pr(s=1|x,y)=\Pr(s=1|x)$, one may estimate $\Pr(s=1|x)$ from a sufficient sample of $\Pr(x,s)$ if such a sample is available [Zadrozny, 2004]. In the general case, however, it is difficult to estimate $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$, as we do not have a sufficient sample of $\Pr(x,y,s)$ (if we do, we already have a sufficient sample of $\Pr(x,y)$, which contradicts the assumption of the problem).
In this paper we consider the case in which we have a fully labeled auxiliary data set $D^a$ and a partially labeled primary data set $D^p=D_l^p \cup D_u^p$, where $D_l^p$ are labeled and $D_u^p$ unlabeled. We assume $D^p$ and $D^a$ are drawn from two distributions that are mismatched. Our objective is to use a mixed training set $D^{tr}=D_l^p \cup D^a$ to train a classifier that predicts the labels of $D_u^p$ accurately. Assume $D^p \sim \Pr(x,y)$. In light of equation (1), we can write $D^a \sim \Pr(x,y|s=1)$ as long as the source distributions of $D^p$ and $D^a$ have the same domain of nonzero probability¹. As explained in the previous paragraph, it is difficult to correct the mismatch by directly estimating $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$. Therefore we take an alternative approach. We introduce an auxiliary variable $\mu_i$ for each $(x_i^a,y_i^a)\in D^a$ to reflect its mismatch with $D^p$ and to control its participation in the learning process. The $\mu$'s play a similar role as the weighting factors $\frac{\Pr(s=1)}{\Pr(s=1|x,y)}$ in (1). However, unlike the weighting factors, the auxiliary variables are estimated along with the classifier in the learning. We employ logistic regression as a specific classifier and develop our method in this context.
A related problem has been studied in [Wu & Dietterich, 2004], where the classifier is trained on two fixed and labeled data sets $D^p$ and $D^a$, where $D^a$ is of lower quality and provides weaker evidence for the classifier design. The problem is approached by minimizing a weighted sum of two separate loss functions, with one defined for the primary data and the other for the auxiliary data. Our method is distinct from that in [Wu & Dietterich, 2004] in two respects. First, we introduce an auxiliary variable $\mu_i$ for each $(x_i^a,y_i^a)\in D^a$ and the auxiliary variables are estimated along with the classifier. A large $\mu_i$ implies large mismatch of $(x_i^a,y_i^a)$ with $D^p$ and accordingly less participation of $x_i^a$ in learning the classifier. Second, we present an active learning strategy to define $D_l^p \subset D^p$ when $D^p$ is initially fully unlabeled.
The remainder of the paper is organized as follows. A detailed description of the proposed method is provided in Section 2, followed by description of a fast learning algorithm in Section 3 and a theoretical discussion in Section 4. In Section 5 we present a method to actively define $D_l^p$ when $D_l^p$ is initially empty. We demonstrate example results in Section 6. Finally, Section 7 contains the conclusions.

¹For any $\Pr(x,y|s=1) \neq 0$ and $\Pr(x,y) \neq 0$, there exists $\frac{\Pr(s=1)}{\Pr(s=1|x,y)} = \frac{\Pr(x,y)}{\Pr(x,y|s=1)} \in (0,\infty)$ such that equation (1) is satisfied. For $\Pr(x,y|s=1)=\Pr(x,y)=0$, any $\frac{\Pr(s=1)}{\Pr(s=1|x,y)} \neq 0$ makes equation (1) satisfied.
2. Migratory-Logit: Learning Jointly on the Primary and Auxiliary Data
We assume $D_l^p$ are fixed and nonempty, and without loss of generality, we assume $D_l^p$ are always indexed prior to $D_u^p$: $D_l^p=\{(x_i^p,y_i^p)\}_{i=1}^{N_l^p}$ and $D_u^p=\{(x_i^p,y_i^p): y_i^p \text{ missing}\}_{i=N_l^p+1}^{N^p}$. We use $N^a$, $N^p$, and $N_l^p$ to denote the size (number of data points) of $D^a$, $D^p$, and $D_l^p$, respectively. In Section 5 we discuss how to actively determine $D_l^p$ when $D_l^p$ is initially empty.
We consider the binary classification problem and the labels $y^a, y^p \in \{-1,1\}$. For notational simplicity, we let $x$ always include a 1 as its first element to accommodate a bias (intercept) term, thus $x^p, x^a \in \mathbb{R}^{d+1}$ where $d$ is the number of features. For a primary data point $(x_i^p,y_i^p)\in D_l^p$, we follow standard logistic regression to write

$$\Pr(y_i^p|x_i^p;w) = \sigma(y_i^p w^T x_i^p) \qquad (2)$$

where $w\in\mathbb{R}^{d+1}$ is a column vector of classifier parameters and $\sigma(\mu)=\frac{1}{1+\exp(-\mu)}$ is the sigmoid function.
For an auxiliary data point $(x_i^a,y_i^a)\in D^a$, we define

$$\Pr(y_i^a|x_i^a;w,\mu_i) = \sigma(y_i^a w^T x_i^a + y_i^a\mu_i) \qquad (3)$$

where $\mu_i$ is an auxiliary variable. Assuming the examples in $D_l^p$ and $D^a$ are drawn i.i.d., we have the log-likelihood function

$$\ell(w,\mu; D_l^p\cup D^a) = \sum_{i=1}^{N_l^p}\ln\sigma(y_i^p w^T x_i^p) + \sum_{i=1}^{N^a}\ln\sigma(y_i^a w^T x_i^a + y_i^a\mu_i) \qquad (4)$$

where $\mu=[\mu_1,\cdots,\mu_{N^a}]^T$ is a column vector of all auxiliary variables.
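For concreteness, the log-likelihood (4) can be evaluated as in the following sketch. This is our own illustration (not the authors' implementation), assuming NumPy arrays in which the rows of `Xp` and `Xa` are feature vectors already augmented with a leading 1:

```python
import numpy as np

def log_sigmoid(z):
    # numerically stable ln sigma(z) = -ln(1 + exp(-z))
    return -np.logaddexp(0.0, -z)

def m_logit_loglik(w, mu, Xp, yp, Xa, ya):
    """Log-likelihood (4) of M-Logit.

    Xp: (N_l^p, d+1) labeled primary inputs, yp: labels in {-1, +1}
    Xa: (N^a, d+1) auxiliary inputs, ya: labels in {-1, +1}
    mu: (N^a,) auxiliary variables, one per auxiliary example
    """
    primary = log_sigmoid(yp * (Xp @ w)).sum()              # sum_i ln sigma(y_i^p w^T x_i^p)
    auxiliary = log_sigmoid(ya * (Xa @ w) + ya * mu).sum()  # auxiliary terms with the extra intercepts
    return primary + auxiliary
```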
The auxiliary variable $\mu_i$ is introduced to reflect the mismatch of $(x_i^a,y_i^a)$ with $D^p$ and to control its participation in the learning of $w$. A larger $y_i^a\mu_i$ makes $\Pr(y_i^a|x_i^a;w,\mu_i)$ less sensitive to $w$. When $y_i^a\mu_i=\infty$, $\Pr(y_i^a|x_i^a;w,\mu_i)=1$ becomes completely independent of $w$. Geometrically, the $\mu_i$ is an extra intercept term that is uniquely associated with $x_i^a$ and causes it to migrate towards class $y_i^a$. If $(x_i^a,y_i^a)$ is mismatched with the primary data $D^p$, $w$ cannot make $\sum_{i=1}^{N_l^p}\ln\sigma(y_i^p w^T x_i^p)$ and $\ln\sigma(y_i^a w^T x_i^a)$ large at the same time. In this case $x_i^a$ will be given an appropriate $\mu_i$ to allow it to migrate towards class $y_i^a$, so that $w$ is less sensitive to $(x_i^a,y_i^a)$ and can focus more on fitting $D_l^p$. Evidently, if the $\mu$'s are allowed to change freely, their influence will override that of $w$ in fitting the auxiliary data $D^a$ and then $D^a$ will not participate in learning $w$. To prevent this from happening, we introduce constraints on $\mu_i$ and maximize the log-likelihood subject to the constraints:
$$\max_{w,\mu}\ \ell(w,\mu; D_l^p\cup D^a) \qquad (5)$$

subject to

$$\frac{1}{N^a}\sum_{i=1}^{N^a} y_i^a\mu_i \leq C,\quad C\geq 0 \qquad (6)$$

$$y_i^a\mu_i \geq 0,\quad i=1,2,\cdots,N^a \qquad (7)$$

where the inequalities in (7) reflect the fact that in order for $x_i^a$ to fit $y_i^a=1$ (or $y_i^a=-1$) we need to have $\mu_i>0$ (or $\mu_i<0$), if we want $\mu_i$ to exert a positive influence in the fitting process. Under the constraints in (7), a larger value of $y_i^a\mu_i$ represents a larger mismatch between $(x_i^a,y_i^a)$ and $D^p$ and accordingly makes $(x_i^a,y_i^a)$ play a less important role in determining $w$. The classifier resulting from solving the problem in (5)-(7) is referred to as "Migratory-Logit" or "M-Logit". The $C$ in (6) reflects the average mismatch between $D^a$ and $D^p$ and controls the average participation of $D^a$ in determining $w$. It can be learned from data if we have a reasonable amount of $D_l^p$. However, in practice we usually have no or very scarce $D_l^p$ to begin with. In this case, we must rely on other information to set $C$. We will come back to a more detailed discussion of $C$ in Section 4.
3. Fast Learning Algorithm

The optimization problem in (5), (6), and (7) is concave and any standard technique can be utilized to find the global maxima. However, there is a unique $\mu_i$ associated with every $(x_i^a,y_i^a)\in D^a$, and when $D^a$ is large using a standard method to estimate the $\mu$'s can consume most of the computational time.

In this section, we give a fast algorithm for training the M-Logit, by taking a block-coordinate ascent approach [Bertsekas, 1999], in which we alternately solve for $w$ and $\mu$, keeping one fixed when solving for the other. The algorithm draws its efficiency from the analytic solution of $\mu$, which we establish in the following theorem. Proof of the theorem is given in the appendix, and Section 4 contains a discussion that helps to understand the theorem from an intuitive perspective.

Theorem 1: Let $f(z)$ be a twice continuously differentiable function with second derivative $f''(z)<0$ for any $z\in\mathbb{R}$. Let $b_1\leq b_2\leq\cdots\leq b_N$, $R\geq 0$, and

$$n = \max\left\{m : m\,b_m - \sum_{i=1}^{m} b_i \leq R,\ 1\leq m\leq N\right\} \qquad (8)$$

Then the problem

$$\max_{\{z_i\}}\ \sum_{i=1}^{N} f(b_i+z_i) \qquad (9)$$

subject to

$$\sum_{i=1}^{N} z_i \leq R,\quad R\geq 0 \qquad (10)$$

$$z_i \geq 0,\quad i=1,2,\cdots,N \qquad (11)$$

has a unique global solution

$$z_i = \begin{cases} \frac{1}{n}\left(\sum_{j=1}^{n} b_j + R\right) - b_i, & 1\leq i\leq n \\ 0, & n<i\leq N \end{cases} \qquad (12)$$
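The solution (12), together with the threshold $n$ in (8), amounts to a water-filling computation that raises the smallest $b_i$ to a common level. The sketch below is our own NumPy illustration of the theorem, not code from the paper:

```python
import numpy as np

def waterfill(b, R):
    """Theorem 1: spend budget R (>= 0) raising the smallest b_i to a common level."""
    b = np.asarray(b, dtype=float)
    order = np.argsort(b)                  # work on b sorted ascending
    bs = b[order]
    N = len(bs)
    csum = np.cumsum(bs)
    m = np.arange(1, N + 1)
    # n = max{m : m*b_m - sum_{i<=m} b_i <= R}, equation (8); m = 1 is always feasible
    n = int(m[m * bs - csum <= R].max())
    level = (csum[n - 1] + R) / n          # common level (1/n)(sum_{j<=n} b_j + R)
    zs = np.zeros(N)
    zs[:n] = level - bs[:n]                # equation (12); zero for the remaining entries
    z = np.zeros(N)
    z[order] = zs                          # undo the sort
    return z
```

For instance, `waterfill([-3.0, 0.5, 2.0], R=4.0)` gives `[3.75, 0.25, 0.0]`: the whole budget is spent raising the two poorest entries to the common level 0.75, and the constraint $\sum_i z_i \leq R$ is met with equality.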
For a fixed $w$, the problem in (5)-(7) is simplified to maximizing $\sum_{i=1}^{N^a}\ln\sigma(y_i^a w^T x_i^a + y_i^a\mu_i)$ with respect to $\mu$, subject to $\frac{1}{N^a}\sum_{i=1}^{N^a} y_i^a\mu_i \leq C$, $C\geq 0$, and $y_i^a\mu_i\geq 0$ for $i=1,2,\cdots,N^a$. Clearly $\ln\sigma(z)$ is a twice continuously differentiable function of $z$ and its second derivative $\frac{\partial^2}{\partial z^2}\ln\sigma(z) = -\sigma(z)\sigma(-z) < 0$ for $-\infty<z<\infty$. Thus Theorem 1 applies. We first solve $\{y_i^a\mu_i\}$ using Theorem 1, then $\{\mu_i\}$ are trivially solved using the fact $y_i^a\in\{-1,1\}$. Assume $y_{k_1}^a w^T x_{k_1}^a \leq y_{k_2}^a w^T x_{k_2}^a \leq \cdots \leq y_{k_{N^a}}^a w^T x_{k_{N^a}}^a$, where $k_1,k_2,\cdots,k_{N^a}$ is a permutation of $1,2,\cdots,N^a$. Then we can write the solution of $\{\mu_i\}$ analytically,

$$\mu_{k_i} = \begin{cases} \frac{1}{n\,y_{k_i}^a}\left(\sum_{j=1}^{n} y_{k_j}^a w^T x_{k_j}^a + N^a C\right) - w^T x_{k_i}^a, & 1\leq i\leq n \\ 0, & n<i\leq N^a \end{cases} \qquad (13)$$

where

$$n = \max\left\{m : m\,y_{k_m}^a w^T x_{k_m}^a - \sum_{i=1}^{m} y_{k_i}^a w^T x_{k_i}^a \leq N^a C,\ 1\leq m\leq N^a\right\} \qquad (14)$$

For a fixed $\mu$, we use the standard gradient-based method [Bertsekas, 1999] to find $w$. The main procedures of the fast training algorithm for M-Logit are summarized in Table 1, where the gradient $\nabla_w\ell$ and the Hessian matrix $\nabla_w^2\ell$ are computed from (4).
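The resulting block-coordinate ascent (summarized in Table 1 below) might be sketched as follows. This is an illustrative simplification of ours, assuming NumPy: it reuses the `waterfill` routine sketched after Theorem 1 and takes a plain gradient step for $w$ in place of the Newton direction and line search of Table 1.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_m_logit(Xp, yp, Xa, ya, C, n_iter=200, lr=0.1):
    """Alternate the analytic mu-update (13)-(14) with an ascent step in w."""
    Na, d1 = Xa.shape
    w = np.zeros(d1)
    mu = np.zeros(Na)
    for _ in range(n_iter):
        # mu-step: water-fill y_i^a mu_i subject to sum_i y_i^a mu_i <= N^a C, each >= 0
        b = ya * (Xa @ w)              # the "wealth" each auxiliary point gets from w
        ymu = waterfill(b, Na * C)     # y_i^a mu_i, cf. equations (13)-(14)
        mu = ya * ymu                  # y_i^a in {-1,+1}, so mu_i = y_i^a (y_i^a mu_i)
        # w-step: gradient of (4); Table 1 uses a Newton direction with a line search instead
        grad = (sigmoid(-yp * (Xp @ w)) * yp) @ Xp \
             + (sigmoid(-(ya * (Xa @ w) + ymu)) * ya) @ Xa
        w = w + lr * grad
    return w, mu
```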
Table 1. Fast Learning Algorithm of M-Logit

Input: $D^a\cup D_l^p$ and $C$; Output: $w$ and $\{\mu_i\}_{i=1}^{N^a}$

1. Initialize $w$ and set $\mu_i=0$ for $i=1,2,\cdots,N^a$.
2. Compute the gradient $\nabla_w\ell$ and Hessian matrix $\nabla_w^2\ell$.
3. Compute the ascent direction $d = -(\nabla_w^2\ell)^{-1}\nabla_w\ell$.
4. Do a linear search for the step-size $\alpha^* = \arg\max_\alpha \ell(w+\alpha d)$.
5. Update $w$: $w \leftarrow w + \alpha^* d$.
6. Sort $\{y_i^a w^T x_i^a\}_{i=1}^{N^a}$ in ascending order. Assume the result is $y_{k_1}^a w^T x_{k_1}^a \leq y_{k_2}^a w^T x_{k_2}^a \leq \cdots \leq y_{k_{N^a}}^a w^T x_{k_{N^a}}^a$, where $k_1,k_2,\cdots,k_{N^a}$ is a permutation of $1,2,\cdots,N^a$.
7. Find the $n$ using (14).
8. Update the auxiliary variables $\{\mu_i\}_{i=1}^{N^a}$ using (13).
9. Check the convergence of $\ell$: exit and output $w$ and $\{\mu_i\}_{i=1}^{N^a}$ if converged; go back to 2 otherwise.

4. Auxiliary Variables and Choice of C

Theorem 1 and its constructive proof in the appendix offer some insight into the mechanism by which the mismatch between $D^a$ and $D^p$ is compensated through the auxiliary variables $\{\mu_i\}$. To make the description easier, we think of each data point $x_i^a\in D^a$ as getting a major "wealth" $y_i^a w^T x_i^a$ from $w$ and an additional wealth $y_i^a\mu_i$ from a given budget totaling $N^a C$ ($C$ represents the average budget for a single $x^a$). From the appendix, $N^a C$ is distributed among the auxiliary data $\{x_i^a\}$ by a "poorest-first" rule: the "poorest" $x_{k_1}^a$ (that which has the smallest $y_{k_1}^a w^T x_{k_1}^a$) gets a portion $y_{k_1}^a\mu_{k_1}$ from $N^a C$ first, and as soon as the total wealth $y_{k_1}^a w^T x_{k_1}^a + y_{k_1}^a\mu_{k_1}$ reaches the wealth of the second poorest $x_{k_2}^a$, $N^a C$ becomes equally distributed to $x_{k_1}^a$ and $x_{k_2}^a$ such that their total wealths are always equal. Then, as soon as $y_{k_1}^a w^T x_{k_1}^a + y_{k_1}^a\mu_{k_1} = y_{k_2}^a w^T x_{k_2}^a + y_{k_2}^a\mu_{k_2}$ reach the wealth of the third poorest, $N^a C$ becomes equally distributed among the three of them to make them equally rich. The distribution continues in this way until the budget $N^a C$ is used up. The "poorest-first" rule is essentially a result of the concavity of the logarithmic sigmoid function $\ln\sigma(\cdot)$. The goal is to maximize $\sum_{i=1}^{N^a}\ln\sigma(y_i^a w^T x_i^a + y_i^a\mu_i)$. The concavity of $\ln\sigma(\cdot)$ dictates that for any given portion of $N^a C$, distributing it to the poorest makes the maximum gain in $\ln\sigma$.
The $C$ is used as a means to compensate for the loss that $D^a$ may suffer from $w$. The classifier $w$ is responsible for correctly classifying both $D^a$ and $D^p$. Because $D^a$ and $D^p$ are mismatched, $w$ cannot satisfy both of them: one must suffer if the other is to gain. As $D^p$ is the primary data set, we want $w$ to classify $D^p$ as accurately as possible. The auxiliary variables are therefore introduced to represent compensations that $D^a$ get from $C$. When $x^a$ gets small wealth from $w$ and is poor, it is because $x^a$ is mismatched and in conflict with $D^p$ (assuming perfect separation of $D^a$, no conflict exists among themselves). By the "poorest-first" rule, the most mismatched $x^a$ gets compensation first.
A high compensation $y_i^a\mu_i$ whittles down the participation of $x_i^a$ in learning $w$. This is easily seen from the contribution of $(x_i^a,y_i^a)$ to $\nabla_w\ell$ and $\nabla_w^2\ell$, which are obtained from (4) as $\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)\,y_i^a x_i^a$ and $-\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)\,\sigma(y_i^a w^T x_i^a + y_i^a\mu_i)\,x_i^a {x_i^a}^T$, respectively. When $y_i^a\mu_i$ is large, $\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)$ is close to zero and hence the contributions of $(x_i^a,y_i^a)$ to $\nabla_w\ell$ and $\nabla_w^2\ell$ are ignorable. We in fact do not need an infinitely large $y_i^a\mu_i$ to make the contributions of $x_i^a$ ignorable, because $\sigma(\mu)$ is almost saturated at $\mu=\pm 6$. If $y_i^a w^T x_i^a = -6$, $\sigma(-y_i^a w^T x_i^a)=0.9975$, implying a large contribution of $(x_i^a,y_i^a)$ to $\nabla_w\ell$, which happens when $w$ assigns $x_i^a$ to the correct class $y_i^a$ with probability of only $\sigma(y_i^a w^T x_i^a)=\sigma(-6)=0.0025$. In this nearly worst case, a compensation of $y_i^a\mu_i=12$ can effectively remove the contribution of $(x_i^a,y_i^a)$ because $\sigma(-y_i^a w^T x_i^a - y_i^a\mu_i)=\sigma(6-12)=\sigma(-6)=0.0025$. To effectively remove the contributions of $N_m$ auxiliary data, one needs a total budget of $12N_m$, resulting in an average budget $C=12N_m/N^a$.
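The arithmetic behind the $C=12N_m/N^a$ heuristic is easy to verify numerically (a small sanity check of ours, not from the paper):

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
print(sigma(6.0))         # ~0.9975: gradient weight of a badly mismatched point with y w^T x = -6
print(sigma(6.0 - 12.0))  # ~0.0025: the same weight after a compensation y*mu = 12
# removing N_m such points costs about 12 each, hence the heuristic C = 12*N_m/N^a
```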
To make the right choice of $C$, $N_m/N^a$ should represent the rate at which $D^a$ are mismatched with $D^p$. This is so because we want $N^a C$ to be distributed only to that part of $D^a$ that is mismatched with $D^p$, thus permitting us to use the remaining part in learning $w$. The quantity $N_m/N^a$ is usually unknown in practice. However, $C=12N_m/N^a$ gives one a sense of at least what range $C$ should be in. As $0\leq N_m\leq N^a$, letting $0\leq C\leq 12$ is usually a reasonable choice. In our experience, the performance of M-Logit is relatively robust to $C$, and this will be demonstrated in Section 6.2 using an example data set.
5. Active Selection of $D_l^p$

In Section 2 we assumed that $D_l^p$ had already been determined. In this section we describe how $D_l^p$ can be actively selected from $D^p$, based on the Fisher information matrix [Fedorov, 1972, MacKay, 1992]. The approach is known as active learning [Cohn et al., 1995, Krogh & Vedelsby, 1995].
Let $Q$ denote the Fisher information matrix of $D_l^p\cup D^a$ about $w$. By definition of the Fisher information matrix [Cover & Thomas, 1991], $Q = E_{\{y_i^p\},\{y_i^a\}}\left[\frac{\partial\ell}{\partial w}\frac{\partial\ell}{\partial w^T}\right]$, and substituting (4) into this equation gives (a brief derivation is given in the appendix)

$$Q = \sum_{i=1}^{N_l^p}\sigma_i^p(1-\sigma_i^p)\,x_i^p{x_i^p}^T + \sum_{i=1}^{N^a}\sigma_i^a(1-\sigma_i^a)\,x_i^a{x_i^a}^T \qquad (15)$$

where $\sigma_i^p = \sigma(w^T x_i^p)$ for $i=1,2,\ldots,N_l^p$, $\sigma_i^a = \sigma(w^T x_i^a + \mu_i)$ for $i=1,2,\ldots,N^a$, and $w$ and $\{\mu_i\}$ represent the true classifier and auxiliary variables.
It is well known that the inverse Fisher information $Q^{-1}$ lower bounds the covariance matrix of the estimated $w$ [Cover & Thomas, 1991]. In particular, $[\det(Q)]^{-1}$ lower bounds the product of variances of the elements in $w$. The goal in selecting $D_l^p$ is to reduce the variances, or uncertainty, of $w$. Thus we seek the $D_l^p$ that maximizes $\det(Q)$.

The selection proceeds in a sequential manner. Initially $D_u^p=D^p$, $D_l^p$ is empty, and $Q=\sum_{i=1}^{N^a}\sigma_i^a(1-\sigma_i^a)\,x_i^a{x_i^a}^T$. Then, one at a time, a data point $x_i^p\in D_u^p$ is selected and moved from $D_u^p$ to $D_l^p$. This causes $Q$ to be updated as $Q \leftarrow Q + \sigma_i^p(1-\sigma_i^p)\,x_i^p(x_i^p)^T$. At each iteration, the selection is based on

$$\max_{x_i^p\in D_u^p}\det\!\left[Q+\sigma_i^p(1-\sigma_i^p)\,x_i^p(x_i^p)^T\right] = \max_{x_i^p\in D_u^p}\left[1+\sigma_i^p(1-\sigma_i^p)\,(x_i^p)^T Q^{-1}x_i^p\right] \qquad (16)$$

where the equality holds up to the common factor $\det(Q)$ (by the matrix determinant lemma), and where we assume the existence of $Q^{-1}$, which can often be assured by using sufficient auxiliary data $D^a$. Evaluation of (16) requires the true values of $w$ and $\{\mu_i\}$, which are not known a priori. We follow Fedorov (1972) and replace them with the $w$ and $\{\mu_i\}$ estimated from $D^a\cup D_l^p$, where $D_l^p$ are the primary labeled data selected up to the present.
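A sketch of the sequential selection, under the same NumPy conventions as the earlier sketches (our illustration; the function names are ours): `fisher_information` builds $Q$ of (15) and `select_next` scores candidates by criterion (16).

```python
import numpy as np

def fisher_information(Xp_l, Xa, w, mu):
    """Q in (15): contributions from the labeled primary data and the auxiliary data."""
    sp = 1.0 / (1.0 + np.exp(-(Xp_l @ w)))
    sa = 1.0 / (1.0 + np.exp(-(Xa @ w) - mu))
    return (Xp_l * (sp * (1 - sp))[:, None]).T @ Xp_l \
         + (Xa * (sa * (1 - sa))[:, None]).T @ Xa

def select_next(Q, Xu, w):
    """Pick the unlabeled primary point maximizing 1 + s(1-s) x^T Q^{-1} x, criterion (16)."""
    s = 1.0 / (1.0 + np.exp(-(Xu @ w)))
    Qinv = np.linalg.inv(Q)
    scores = 1.0 + s * (1.0 - s) * np.einsum('ij,jk,ik->i', Xu, Qinv, Xu)
    i = int(np.argmax(scores))
    Q_new = Q + s[i] * (1.0 - s[i]) * np.outer(Xu[i], Xu[i])  # rank-one update after labeling x_i
    return i, Q_new
```

In a full loop, one would retrain M-Logit after each selection (as described above) before scoring the next candidate.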
6. Results

In this section the performance of M-Logit is demonstrated and compared to standard logistic regression, using test error rate as the performance index. The M-Logit is trained using $D^a\cup D_l^p$, where $D_l^p$ are either randomly selected from $D^p$, or actively selected from $D^p$ using the method in Section 5. When $D_l^p$ are randomly selected, 50 independent trials are performed and the results are obtained as an average over the trials. Three logistic regression classifiers are trained using different combinations of $D^a$ and $D_l^p$: $D^a\cup D_l^p$, $D_l^p$ alone, and $D^a$ alone, where $D_l^p$ are identical to the $D_l^p$ used for M-Logit. The four classifiers are tested on $D_u^p = D^p\setminus D_l^p$, using the following decision rule: declare $y^p=-1$ if $\sigma(w^T x^p)\leq 0.5$ and $y^p=1$ otherwise, for any $x^p\in D_u^p$.

Throughout this section the $C$ for M-Logit is set to $C=6$ when the comparison is made to logistic regression. In addition, we present a comparison of M-Logit with different $C$'s, to examine the sensitivity of M-Logit's performance to $C$.

6.1. A Toy Example
In the first example, the primary data are simulated from two bivariate Gaussian distributions representing class "$-1$" and class "$+1$", respectively. In particular, we have $\Pr(x^p|y^p=-1)=\mathcal{N}(x^p;\mu_0,\Sigma)$ and $\Pr(x^p|y^p=+1)=\mathcal{N}(x^p;\mu_1,\Sigma)$, where the Gaussian parameters are $\mu_0=[0,0]^T$, $\mu_1=[2.3,2.3]^T$, and $\Sigma=\begin{bmatrix}1.75 & -0.433\\ -0.433 & 1.25\end{bmatrix}$. The auxiliary data $D^a$ are then a selected draw from the two Gaussian distributions, as described in [Zadrozny, 2004]. We take the selection probability $\Pr(s|x^p,y^p=-1)=\sigma(w_0+w_1 K(x^p,\mu_0^s;\Sigma))$ and $\Pr(s|x^p,y^p=+1)=\sigma(w_0+w_1 K(x^p,\mu_1^s;\Sigma))$, where $\sigma$ is the sigmoid function, $w_0=-1$, $w_1=\exp(1)$, $K(x^p,\mu_0^s;\Sigma)=\exp\{-0.5(x^p-\mu_0^s)^T\Sigma^{-1}(x^p-\mu_0^s)\}$ with $\mu_0^s=[2,1]^T$, and $K(x^p,\mu_1^s;\Sigma)=\exp\{-0.5(x^p-\mu_1^s)^T\Sigma^{-1}(x^p-\mu_1^s)\}$ with $\mu_1^s=[0,3]^T$. We obtain 150 samples of $D^p$ and 150 samples of $D^a$, which are shown in Figure 3.
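To make the selection-bias mechanism concrete, the toy data might be generated along the following lines (a sketch under the stated parameters; the seed and the acceptance-loop implementation of the selection probability are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)              # seed chosen arbitrarily
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.3, 2.3])
Sigma = np.array([[1.75, -0.433], [-0.433, 1.25]])
Sigma_inv = np.linalg.inv(Sigma)
mus0, mus1 = np.array([2.0, 1.0]), np.array([0.0, 3.0])
w0, w1 = -1.0, np.exp(1.0)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
K = lambda x, m: np.exp(-0.5 * (x - m) @ Sigma_inv @ (x - m))

def draw_primary():
    y = rng.choice([-1, 1])                 # equiprobable classes
    x = rng.multivariate_normal(mu0 if y == -1 else mu1, Sigma)
    return x, y

def draw_auxiliary():
    # draw from the primary distribution, keep with the class-dependent selection probability
    while True:
        x, y = draw_primary()
        if rng.random() < sigma(w0 + w1 * K(x, mus0 if y == -1 else mus1)):
            return x, y

Dp = [draw_primary() for _ in range(150)]   # primary data
Da = [draw_auxiliary() for _ in range(150)] # selection-biased auxiliary data
```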
The M-Logit and logistic regression classifiers are trained and tested as explained at the beginning of this section. The test error rates are shown in Figure 1 and Figure 2, as a function of the number of primary labeled data used in training. The $D_l^p$ in Figure 1 are randomly selected and the $D_l^p$ in Figure 2 are actively selected as described in Section 5.
Figure 1. Test error rates of M-Logit and logistic regression on the toy data, as a function of the size of $D_l^p$. The primary labeled data $D_l^p$ are randomly selected from $D^p$. The error rates are an average over 50 independent trials of random selection of $D_l^p$.

Figure 2. Error rates of M-Logit and logistic regression on the toy data, as a function of the size of $D_l^p$. The primary labeled data $D_l^p$ are actively selected from $D^p$, using the method in Section 5.

Several observations are made from inspection of Figures 1 and 2.

• The M-Logit consistently outperforms the three standard logistic regression classifiers, by a considerable margin. This improvement is a result of properly fusing $D^a$ and $D_l^p$, with $D^a$ determining the classifier under the guidance of few $D_l^p$.
• The performance of the logistic regression trained on $D_l^p$ alone changes significantly with the size of $D_l^p$. This is understandable, considering that $D_l^p$ are the only examples determining the classifier. The abrupt drop of errors from iteration 11 to iteration 12 in Figure 2 may be because the label found at iteration 12 is critical to determining $w$.

• The logistic regression trained on $D^a$ alone performs significantly worse than M-Logit, reflecting a marked mismatch between $D^a$ and $D^p$.

• The logistic regression trained on $D^a\cup D_l^p$ improves, but mildly, as $D_l^p$ grows, and it is ultimately outperformed by the logistic regression trained on $D_l^p$ alone, demonstrating that some data in $D^a$ are mismatched with $D^p$ and hence cannot be correctly classified along with $D^p$, if the mismatch is not compensated.

• As $D_l^p$ grows, the logistic regression trained on $D_l^p$ alone finally approaches M-Logit, showing that without the interference of $D^a$, a sufficient $D_l^p$ can define a correct classifier.
• All four classifiers benefit from the actively selected $D_l^p$; this is consistent with the general observation with active learning [Cohn et al., 1995, Krogh & Vedelsby, 1995].

To better understand the active selection process, we show in Figure 3 the first few iterations of active learning. Iteration 0 corresponds to the initially empty $D_l^p$, and iterations 1, 5, 10, 13 respectively correspond to 1, 5, 10, 13 data points selected accumulatively from $D_u^p$ into $D_l^p$. Each time a new data point is selected, $w$ is re-trained, yielding the different decision boundaries. As can be seen in Figure 3, the decision boundary does not change much after 10 data are selected, demonstrating convergence.

In Figure 3, each auxiliary data point $x_i^a\in D^a$ is symbolically displayed with a size in proportion to $\exp(-y_i^a\mu_i/12)$; hence a small symbol of auxiliary data corresponds to a large $y_i^a\mu_i$ and hence small participation in determining $w$. The auxiliary data that cannot be correctly classified along with the primary data are de-emphasized by the M-Logit. Usually the auxiliary data near the decision boundary are de-emphasized.
6.2. Results on the Wisconsin Breast Cancer Databases

In the second example we consider the Wisconsin Breast Cancer Databases from the UCI Machine Learning Repository. The data set consists of 569 instances with feature dimensionality 30. We randomly partition the data set into two subsets, one with 228 data points and the other with 341 data points. The first is used as $D^p$, and the second as $D^a$. We artificially make $D^a$ mismatched with $D^p$ by introducing errors into the labels and adding noise to the features. Specifically, we make changes to 50% randomly chosen $(x_i^a,y_i^a)\in D^a$: we change the signs of $y_i^a$ and add 0 dB white Gaussian noise to $x_i^a$. We then proceed, as in Section 6.1, to training and testing the four classifiers.
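Our reading of this corruption step, as a sketch (the exact preprocessing is not specified beyond the description above; here 0 dB is interpreted as per-feature noise power equal to signal power):

```python
import numpy as np

def corrupt_auxiliary(Xa, ya, frac=0.5, rng=None):
    """Flip labels and add 0 dB (unit-SNR) white Gaussian noise to a random subset."""
    if rng is None:
        rng = np.random.default_rng(0)
    Xa, ya = Xa.copy(), ya.copy()
    idx = rng.choice(len(ya), size=int(frac * len(ya)), replace=False)
    ya[idx] = -ya[idx]                           # introduce label errors
    power = Xa[idx].var(axis=0)                  # per-feature signal power
    Xa[idx] += rng.standard_normal(Xa[idx].shape) * np.sqrt(power)  # noise power = signal power
    return Xa, ya
```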
We again consider both random $D_l^p$ and actively selected $D_l^p$. The test errors are summarized in Figures 4 and 5. The results are essentially consistent with those in Figures 1 and 2, extending the observations we made there to the real data here. It is particularly noted that the mismatch between $D^a$ and $D^p$ here is more prominent than in the toy data, as manifested by the error rates of the logistic regression trained alone on $D^a$. This makes M-Logit more advantageous in the comparison: not only does it give the best results, but it also converges faster than the others with the size of $D_l^p$.

To examine the effect of $C$ on the performance of M-Logit, we present in Figure 6 the test error rates of M-Logit using five different $C$'s: $C=2,4,6,8,10$. Here the $D_l^p$ are determined by active learning as described in Section 5. Clearly, the results for the 5 different $C$'s are almost indistinguishable. This relative insensitivity of M-Logit to $C$ may partly be attributed to the adaptivity brought about by active learning. With different $C$, the $D_l^p$ are also selected differently, thus counteracting the effect of $C$ and keeping M-Logit robust.