Analysis of Representations for Domain Adaptation
Shai Ben-David School of Computer Science University of Waterloo
shai@cs.uwaterloo.ca John Blitzer, Koby Crammer, and Fernando Pereira Department of Computer and Information Science University of Pennsylvania
{blitzer,crammer,pereira}@cis.upenn.edu
Abstract
Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaptation. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set.
1 Introduction
We are all familiar with the situation in which someone learns to perform a task on training examples drawn from some domain (the source domain), but then needs to perform the same task on a related domain (the target domain). In this situation, we expect the task performance in the target domain to depend on both the performance in the source domain and the similarity between the two domains.

This situation arises often in machine learning. For example, we might want to adapt for a new user (the target domain) a spam filter trained on the email of a group of previous users (the source domain), under the assumption that users generally agree on what is spam and what is not. Then, the challenge is that the distributions of emails for the first set of users and for the new user are different. Intuitively, one might expect that the closer the two distributions are, the better the filter trained on the source domain will do on the target domain.

Many other instances of this situation arise in natural language processing. In general, labeled data for tasks like part-of-speech tagging, parsing, or information extraction are drawn from a limited set of document types and genres in a given language because of availability, cost, and project goals. However, applications for the trained systems often involve somewhat different document types and genres. Nevertheless, part-of-speech, syntactic structure, or entity mention decisions are to a large extent stable across different types and genres since they depend on general properties of the language under consideration.
Discriminative learning methods for classification are based on the assumption that training and test data are drawn from the same distribution. This assumption underlies both theoretical estimates of generalization error and the many experimental evaluations of learning methods. However, the assumption does not hold for domain adaptation [5,7,13,6]. For the situations we outlined above, the challenge is the difference in instance distribution between the source and target domains. We will approach this challenge by investigating how a common representation between the two domains can make the two domains appear to have similar distributions, enabling effective domain adaptation. We formalize this intuition with a bound on the target generalization error of a classifier trained from labeled data in the source domain. The bound is stated in terms of a representation function, and it shows that a representation function should be designed to minimize domain divergence, as well as classifier error.

While many authors have analyzed adaptation from multiple sets of labeled training data [3,5,7,13], our theory applies to the setting in which the target domain has no labeled training data, but plentiful unlabeled data exists for both target and source domains. As we suggested above, this setting realistically captures the problems widely encountered in real-world applications of machine learning. Indeed, recent empirical work in natural language processing [11,6] has been targeted at exactly this setting.
We show experimentally that the heuristic choices made by the recently proposed structural correspondence learning algorithm [6] do lead to lower values of the relevant quantities in our theoretical analysis, providing insight as to why this algorithm achieves its empirical success. Our theory also points to an interesting new algorithm for domain adaptation: one which directly minimizes a trade-off between source-target similarity and source training error.
The remainder of this paper is structured as follows: In the next section we formally define domain adaptation. Section 3 gives our main theoretical results. We discuss how to compute the bound in section 4. Section 5 shows how the bound behaves for the structural correspondence learning representation [6] on natural language data. We discuss our findings, including a new algorithm for domain adaptation based on our theory, in section 6 and conclude in section 7.
2 Background and Problem Setup
Let X be an instance set. In the case of [6], this could be all English words, together with the possible contexts in which they occur. Let Z be a feature space (R^d is a typical choice) and {0,1} be the label set for binary classification.
A learning problem is specified by two parameters: a distribution D over X and a (stochastic) target function f: X → [0,1]. The value of f(x) corresponds to the probability that the label of x is 1. A representation function R is a function which maps instances to features, R: X → Z. A representation R induces a distribution over Z and a (stochastic) target function from Z to [0,1] as follows:

$$\Pr_{\tilde{D}}[B] \;\stackrel{\text{def}}{=}\; \Pr_{D}\bigl[R^{-1}(B)\bigr] \qquad\qquad \tilde{f}(z) \;\stackrel{\text{def}}{=}\; \mathbb{E}_{D}\bigl[f(x)\mid R(x)=z\bigr]$$

for any B ⊆ Z such that R^{-1}(B) is D-measurable. In words, the probability of an event B under D̃ is the probability of the inverse image of B under R according to D, and the probability that the label of z is 1 according to f̃ is the mean of the probabilities of the instances x that z represents. Note that f̃(z) may be a stochastic function even if f(x) is not. This is because the function R can map two instances with different f-labels to the same feature representation. In summary, our learning setting is defined by fixed but unknown D and f, and our choice of representation function R and hypothesis class H ⊆ {g: Z → {0,1}} of deterministic hypotheses to be used to approximate the function f.
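As a concrete illustration of these definitions (ours, not part of the original formulation), the following Python sketch estimates the induced distribution D̃ and labeling function f̃ from a finite labeled sample, for an arbitrary hypothetical representation function R:

from collections import defaultdict

def induced_distribution_and_labels(samples, R):
    """Estimate the induced distribution and labeling function from a sample.

    samples: list of (x, y) pairs with y in {0, 1}, drawn i.i.d. from D and
             labeled according to the (possibly stochastic) target f.
    R:       a representation function mapping instances x to feature values z
             (hypothetical; any hashable feature value works here).
    """
    counts = defaultdict(int)        # number of sampled instances mapped to each z
    label_sums = defaultdict(float)  # sum of labels of instances mapped to each z
    for x, y in samples:
        z = R(x)
        counts[z] += 1
        label_sums[z] += y
    n = len(samples)
    # Empirical induced distribution over Z: mass of z is the fraction of
    # instances whose image under R is z (the inverse-image probability).
    D_tilde = {z: counts[z] / n for z in counts}
    # Empirical induced labeling function: mean label of instances mapped to z.
    f_tilde = {z: label_sums[z] / counts[z] for z in counts}
    return D_tilde, f_tilde

Even if every observed label y is 0 or 1, f_tilde[z] can be strictly between 0 and 1 when R collapses differently labeled instances onto the same z, which is exactly the stochasticity noted above.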
2.1 Domain Adaptation
We now formalize the problem of domain adaptation. A domain is a distribution D on the instance set X. Note that this is not the domain of a function. To avoid confusion, we will always mean a specific distribution over the instance set when we say domain. Unlike in inductive transfer, where the tasks we wish to perform may be related but different, in domain adaptation we perform the same task in multiple domains. This is quite common in natural language processing, where we might be performing the same syntactic analysis task, such as tagging or parsing, but on domains with very different vocabularies [6,11].

We assume two domains, a source domain and a target domain. We denote by D_S the source distribution of instances and D̃_S the induced distribution over the feature space Z. We use parallel notation, D_T, D̃_T, for the target domain. f: X → [0,1] is the labeling rule, common to both domains, and f̃ is the induced image of f under R.
A predictor is a function h from the feature space Z to [0,1]. We denote the probability, according to the distribution D̃_S, that a predictor h disagrees with f by

$$\epsilon_S(h) \;=\; \mathbb{E}_{z\sim\tilde{D}_S}\,\mathbb{E}_{y\sim\tilde{f}(z)}\bigl[\,y \neq h(z)\,\bigr] \;=\; \mathbb{E}_{z\sim\tilde{D}_S}\bigl|\,\tilde{f}(z)-h(z)\,\bigr|.$$

Similarly, ε_T(h) denotes the expected error of h with respect to D̃_T.
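As a quick illustration with made-up numbers (not from the paper): if D̃_S places mass 0.5 on each of two feature values z_1, z_2 with f̃(z_1) = 0.9 and f̃(z_2) = 0.2, and a hypothesis h predicts h(z_1) = 1, h(z_2) = 0, then ε_S(h) = 0.5·|0.9 − 1| + 0.5·|0.2 − 0| = 0.15.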
3 Generalization Bounds for Domain Adaptation
We now proceed to develop a bound on the target domain generalization performance of a classifier trained in the source domain. As we alluded to in section 1, the bound consists of two terms. The first term bounds the performance of the classifier on the source domain. The second term is a measure of the divergence between the induced source marginal D̃_S and the induced target marginal D̃_T. A natural measure of divergence for distributions is the L1 or variational distance. This is defined as

$$d_{L_1}(D, D') \;=\; 2\sup_{B\in\mathcal{B}}\bigl|\Pr_{D}[B]-\Pr_{D'}[B]\bigr|$$

where 𝓑 is the set of measurable subsets under D and D'. Unfortunately the variational distance between real-valued distributions cannot be computed from finite samples [2,9] and therefore is not useful to us when investigating representations for domain adaptation on real-world data.
A key part of our theory is the observation that in many realistic domain adaptation scenarios, we do not need such a powerful measure as variational distance. Instead we can restrict our notion of domain distance to be measured only with respect to functions in our hypothesis class.
3.1 The A-distance and labeling function complexity
We make use of a special measure of distance between probability distributions, the A-distance, as introduced in [9]. Given a domain X and a collection 𝒜 of subsets of X, let D, D' be probability distributions over X, such that every set in 𝒜 is measurable with respect to both distributions. The A-distance between such distributions is defined as

$$d_{\mathcal{A}}(D, D') \;=\; 2\sup_{A\in\mathcal{A}}\bigl|\Pr_{D}[A]-\Pr_{D'}[A]\bigr|$$
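To make the definition concrete, here is a small Python sketch (ours, not from the paper) that computes the empirical A-distance between two samples for a finite collection of candidate sets; the one-dimensional data and threshold sets are made-up examples:

def a_distance(sample_p, sample_q, collection):
    """Empirical A-distance between two samples.

    sample_p, sample_q: lists of points drawn from D and D' respectively.
    collection:         a list of membership predicates, one per set A in the
                        collection of measurable subsets.
    Returns 2 * max over A of |Pr_p[A] - Pr_q[A]|, with probabilities replaced
    by empirical frequencies.
    """
    def frequency(sample, in_A):
        return sum(1 for x in sample if in_A(x)) / len(sample)

    return 2.0 * max(abs(frequency(sample_p, in_A) - frequency(sample_q, in_A))
                     for in_A in collection)

# Made-up one-dimensional example with threshold sets A_t = {x : x > t}.
source = [0.1, 0.2, 0.4, 0.5]
target = [0.6, 0.7, 0.8, 0.9]
collection = [(lambda x, t=t: x > t) for t in (0.0, 0.25, 0.5, 0.75)]
print(a_distance(source, target, collection))  # prints 2.0 (maximal separation at t = 0.5)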
In order to use the A-distance, we need to limit the complexity of the true function f in terms of our hypothesis class H. We say that a function f̃: Z → [0,1] is λ-close to a function class H with respect to distributions D̃_S and D̃_T if

$$\inf_{h\in H}\,\bigl[\epsilon_S(h)+\epsilon_T(h)\bigr]\;\le\;\lambda.$$
A function f̃ is λ-close to H when there is a single hypothesis h ∈ H which performs well on both domains. This embodies our domain adaptation assumption, and we will assume that our induced labeling function f̃ is λ-close to our hypothesis class H for a small λ.
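As an illustration with made-up numbers: if some single h ∈ H attains ε_S(h) = 0.05 and ε_T(h) = 0.07, then the infimum above is at most 0.12, so f̃ is λ-close to H for any λ ≥ 0.12.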
We briefly note that in standard learning theory, it is possible to achieve bounds with no explicit assumption on labeling function complexity. If H has bounded capacity (e.g., a finite VC-dimension), then uniform convergence theory tells us that whenever f̃ is not λ-close to H, large training samples have poor empirical error for every h ∈ H. This is not the case for domain adaptation. If the training data is generated by some D_S and we wish to use some H as a family of predictors for labels in the target domain, D_T, then one can construct a function which agrees with some h ∈ H with respect to D̃_S and yet is far from H with respect to D̃_T. Nonetheless we believe that such examples do not occur for realistic domain adaptation problems when the hypothesis class H is sufficiently rich, since for most domain adaptation problems of interest the labeling function is 'similarly simple' for both the source and target domains.
3.2 Bound on the target domain error
We require one last piece of notation before we state and prove the main theorems of this work: the correspondence between functions and characteristic subsets. For a binary-valued function g(z), we let Z_g ⊆ Z be the subset whose characteristic function is g,

$$Z_g = \{z\in Z : g(z)=1\}.$$

In a slight abuse of notation, for a binary function class H we will write d_H(·,·) to indicate the A-distance on the class of subsets whose characteristic functions are functions in H. Now we can state our main theoretical result.
Theorem 1  Let R be a fixed representation function from X to Z and H be a hypothesis space of VC-dimension d. If a random labeled sample of size m is generated by applying R to a D_S-i.i.d. sample labeled according to f, then with probability at least 1 − δ, for every h ∈ H:

$$\epsilon_T(h) \;\le\; \hat{\epsilon}_S(h) + \sqrt{\frac{4}{m}\Bigl(d\log\frac{2em}{d}+\log\frac{4}{\delta}\Bigr)} + d_H(\tilde{D}_S,\tilde{D}_T) + \lambda$$

where e is the base of the natural logarithm.
Proof: Let h* = argmin_{h∈H} (ε_T(h) + ε_S(h)), and let λ_T and λ_S be the errors of h* with respect to D̃_T and D̃_S respectively. Notice that λ = λ_T + λ_S.

$$
\begin{aligned}
\epsilon_T(h) &\le \lambda_T + \Pr_{\tilde{D}_T}\bigl[Z_h\,\Delta\,Z_{h^*}\bigr] \\
&\le \lambda_T + \Pr_{\tilde{D}_S}\bigl[Z_h\,\Delta\,Z_{h^*}\bigr] + \Bigl|\Pr_{\tilde{D}_S}\bigl[Z_h\,\Delta\,Z_{h^*}\bigr] - \Pr_{\tilde{D}_T}\bigl[Z_h\,\Delta\,Z_{h^*}\bigr]\Bigr| \\
&\le \lambda_T + \Pr_{\tilde{D}_S}\bigl[Z_h\,\Delta\,Z_{h^*}\bigr] + d_H(\tilde{D}_S,\tilde{D}_T) \\
&\le \lambda_T + \lambda_S + \epsilon_S(h) + d_H(\tilde{D}_S,\tilde{D}_T) \\
&\le \lambda + \epsilon_S(h) + d_H(\tilde{D}_S,\tilde{D}_T)
\end{aligned}
$$
The theorem now follows by a standard application of Vapnik-Chervonenkis theory [14] to bound the true ε_S(h) by its empirical estimate ε̂_S(h). Namely, if S is an i.i.d. sample of size m, then with probability exceeding 1 − δ,

$$\epsilon_S(h) \;\le\; \hat{\epsilon}_S(h) + \sqrt{\frac{4}{m}\Bigl(d\log\frac{2em}{d}+\log\frac{4}{\delta}\Bigr)}.$$

In practice we do not observe the induced distributions D̃_S and D̃_T directly, but only finite unlabeled samples drawn from them. Combining theorem 1 with the sample-based approximation of the A-distance [9] gives a bound stated in terms of observable quantities:

Theorem 2  Let R be a fixed representation function from X to Z and H be a hypothesis space of VC-dimension d. If a random labeled sample of size m is generated by applying R to a D_S-i.i.d. sample labeled according to f, and Ũ_S, Ũ_T are unlabeled samples of size m' each, drawn from D̃_S and D̃_T respectively, then with probability at least 1 − δ, for every h ∈ H:

$$\epsilon_T(h) \;\le\; \hat{\epsilon}_S(h) + \sqrt{\frac{4}{m}\Bigl(d\log\frac{2em}{d}+\log\frac{4}{\delta}\Bigr)} + \lambda + d_H(\tilde{U}_S,\tilde{U}_T) + 4\sqrt{\frac{d\log(2m')+\log\frac{4}{\delta}}{m'}}$$
Let us briefly examine the bound from theorem 2, with an eye toward feature representations R. Under the assumption of subsection 3.1, we assume that λ is small for reasonable R. Thus the two main terms of interest are the first and fourth terms, since the representation R directly affects them. The first term is the empirical training error. The fourth term is the sample A-distance between domains for hypothesis class H. Looking at these two terms, we see that a good representation R is one which achieves low values for both training error and domain A-distance simultaneously.
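For intuition about magnitudes, the following sketch (ours) evaluates the right-hand side of the Theorem 1 bound for illustrative, made-up values of each quantity; the finite-unlabeled-sample term of theorem 2 would be added analogously:

import math

def theorem1_bound(emp_source_error, m, d, delta, dist_H, lam):
    """Right-hand side of the Theorem 1 bound for illustrative quantities.

    emp_source_error: empirical source error of the hypothesis h
    m:      number of labeled source examples
    d:      VC-dimension of the hypothesis class H
    delta:  confidence parameter
    dist_H: A-distance d_H between induced source and target distributions
    lam:    lambda, the combined error of the best joint hypothesis
    """
    complexity = math.sqrt((4.0 / m) * (d * math.log(2.0 * math.e * m / d)
                                        + math.log(4.0 / delta)))
    return emp_source_error + complexity + dist_H + lam

# Made-up values, chosen only to show the relative size of the terms.
print(theorem1_bound(emp_source_error=0.05, m=2500, d=20,
                     delta=0.05, dist_H=0.3, lam=0.02))   # roughly 0.83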
4 Computing the A-distance for Signed Linear Classifiers
In this section we discuss practical considerations in computing the A-distance on real data. Ben-David et al. [9] show that the A-distance can be approximated arbitrarily well with increasing sample size. Recalling the relationship between sets and their characteristic functions, it should be clear that computing the A-distance is closely related to learning a classifier. In fact they are identical. The set A_h (with characteristic function h ∈ H) which maximizes the H-distance between D̃_S and D̃_T has a characteristic function h. Then h is the classifier which achieves minimum error on the binary classification problem of discriminating between points generated by the two distributions.

To see this, suppose we have two samples Ũ_S and Ũ_T, each of size m', from D̃_S and D̃_T respectively. Define the error of a classifier h on the task of discriminating between points sampled from different distributions as
$$\mathrm{err}(h) \;=\; \frac{1}{2m'}\left[\sum_{z\in\tilde{U}_S} h(z) \;+\; \sum_{z\in\tilde{U}_T}\bigl(1-h(z)\bigr)\right].$$

Since signed linear classifiers are closed under complementation, the sample H-distance satisfies d_H(Ũ_S, Ũ_T) = 2(1 − 2 min_{h∈H} err(h)), so minimizing the discrimination error is equivalent to computing the distance.
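The reduction can be made operational with any linear learner. The sketch below (ours, not the authors' code) labels source points 0 and target points 1, fits a linear classifier as a stand-in for the best discriminator in H, and converts its error into a distance via the identity above; the function name and arguments are hypothetical, and in practice a held-out error estimate would be preferable to the training error used here:

import numpy as np
from sklearn.linear_model import SGDClassifier

def proxy_h_distance(source_feats, target_feats):
    """Estimate the sample H-distance between two unlabeled samples.

    source_feats, target_feats: arrays of shape (m', k) holding the feature
    representations R(x) of unlabeled source and target instances.
    """
    X = np.vstack([source_feats, target_feats])
    # Label source points 0 and target points 1, as in the definition of err(h).
    y = np.concatenate([np.zeros(len(source_feats)), np.ones(len(target_feats))])
    # A linear classifier stands in for the best discriminator in H; its
    # training error approximates min_h err(h).
    clf = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3)
    clf.fit(X, y)
    err = np.mean(clf.predict(X) != y)
    # d_H = 2 * (1 - 2 * min_h err(h)) when H is closed under complementation.
    return 2.0 * (1.0 - 2.0 * err)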
5 Experiments

5.1 Random Projections

A classical result guarantees that random projections approximate well distances in the original high-dimensional space, as long as the projected dimension d is sufficiently large. Arriaga and Vempala [1] show that one can achieve good prediction with random projections as long as the margin is sufficiently large.
5.2 Structural Correspondence Learning
Blitzer et al. [6] describe a heuristic method for domain adaptation that they call structural correspondence learning (henceforth also SCL). SCL uses unlabeled data from both domains to induce correspondences among features in the two domains. Its first step is to identify a small set of domain-independent "pivot" features which occur frequently in the unlabeled data of both domains. Other features are then represented using their relative co-occurrence counts with the pivot features. Finally they use a low-rank approximation to the co-occurrence count matrix as a projection matrix P. The intuition is that by capturing the important correlations, features from the source and target domains which behave similarly for PoS tagging will be represented similarly in the projected space.
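To convey the flavor of the projection step, here is a simplified sketch (ours): it builds the co-occurrence matrix directly from binary feature vectors and takes a truncated SVD, whereas the actual SCL method of [6] involves additional machinery (e.g., trained pivot predictors), so this should be read as an approximation of the idea rather than a reimplementation. The function name and the arguments unlabeled_X, pivot_idx, and k are hypothetical:

import numpy as np

def scl_style_projection(unlabeled_X, pivot_idx, k=25):
    """Low-rank projection in the spirit of SCL (simplified sketch).

    unlabeled_X: (n, F) binary feature matrix pooled from both domains.
    pivot_idx:   indices of the pivot features (frequent in both domains).
    k:           rank of the approximation / dimensionality of the projection.
    Returns a projection matrix P of shape (F, k); an instance with feature
    vector x is then represented by x @ P (possibly alongside x itself).
    """
    # Co-occurrence counts between every feature and every pivot feature.
    counts = unlabeled_X.T @ unlabeled_X[:, pivot_idx]          # shape (F, |pivots|)
    # Normalize rows so each feature is described by its relative co-occurrence.
    C = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    # Low-rank approximation via SVD; the top-k left singular vectors form P.
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :k]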
5.3 Results
We use as our source data set 100 sentences (about 2500 words) of PoS-tagged Wall Street Journal text. The target domain test set is the same set as in [6]. We use one million words (500 thousand from each domain) of unlabeled data to estimate the A-distance between the financial and biomedical domains.
The results in this section are intended to illustrate the different parts of theorem 2 and how they can affect the target domain generalization error. We give two types of results. The first are pictorial and appear in figures 1(a), 1(b) and 2(a). These are intended to illustrate either the A-distance (figures 1(a) and 2(a)) or the empirical error (figure 1(b)) for different representations. The second type are empirical and appear in 2(b). In this case we use the Huber loss as a proxy for the empirical training error.
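The text does not specify which Huber-style loss is meant; as one plausible reading, the sketch below implements the modified Huber loss often used with linear classifiers, evaluated on margins y·f(x):

import numpy as np

def modified_huber_loss(margins):
    """Modified Huber loss, a smooth convex surrogate for the 0-1 training error.

    margins: array of y * f(x) values, with y in {-1, +1} and f a real-valued
    linear scorer.  This is one common variant; the paper does not say which
    Huber-style loss was used.
    """
    m = np.asarray(margins, dtype=float)
    per_example = np.where(m >= 1.0, 0.0,
                    np.where(m >= -1.0, (1.0 - m) ** 2, -4.0 * m))
    return per_example.mean()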
Figure 1(a) shows one hundred random instances projected onto the space spanned by the best two discriminating projections from the SCL projection matrix for part of the financial and biomedical dataset. Instances from the WSJ are depicted as filled red squares, whereas those from MEDLINE are depicted as empty blue circles. An approximating linear discriminator is also shown. Note, however, that the discriminator performs poorly, and recall that if the best discriminator performs poorly the A-distance is low. On the other hand, figure 1(b) shows the best two discriminating components for the task of discriminating between nouns and verbs. Note that in this case, a good discriminating divider is easy to find, even in such a low-dimensional space. Thus the pictures lead us to believe that SCL finds a representation which results both in small empirical classification error and small A-distance. In this case theorem 2 predicts good performance.