Extracting and Composing Robust Features with Denoising Autoencoders

更新时间:2023-07-14 14:38:03 阅读：评论：0

Extracting and Composing Robust Features with Denoising

Autoencoders

Pascal Vincent vincentp@iro.umontreal.ca Hugo Larochelle larocheh@iro.umontreal.ca Yoshua Bengio bengioy@iro.umontreal.ca Pierre-Antoine Manzagol manzagop@iro.umontreal.ca Universit´e de Montr´e al,Dept.IRO,CP6128,Succ.Centre-Ville,Montral,Qubec,H3C3J7,Canada

Abstract

Previous work has shown that the diﬃcul-

ties in learning deep generative or discrim-

inative models can be overcome by an ini-

tial unsupervid learning step that maps in-

puts to uful intermediate reprentations.

We introduce and motivate a new training

principle for unsupervid learning of a rep-

rentation bad on the idea of making the

learned reprentations robust to partial cor-

ruption of the input pattern.This approach

can be ud to train autoencoders,and the

denoising autoencoders can be stacked to ini-

tialize deep architectures.The algorithm can

be motivated from a manifold learning and

information theoretic perspective or from a

80后儿歌generative model perspective.Comparative

experiments clearly show the surprising ad-

vantage of corrupting the input of autoen-

coders on a pattern classiﬁcation benchmark

suite.

1.Introduction

Recent theoretical studies indicate that deep architec-tures(Bengio&Le Cun,2007;Bengio,2007)may be needed to eﬃciently model complex distributions and achieve better generalization performance on challeng-ing recognition tasks.The belief that additional levels of functional composition will yield incread repre-ntational and modeling power is not new(McClel-land et al.,1986;Hinton,1989;Utgoﬀ&Stracuzzi, 2002).However,in practice,learning in deep archi-tectures has proven to be diﬃcult.One needs only Appearing in Proceedings of the25th International Confer-ence on Machine Learning,Helsinki,Finland,2008.Copy-right2008by the author(s)/owner(s).to ponder the diﬃcult problem of inference in deep directed graphical models,due to“explaining away”. Also looking back at the history of multi-layer neural networks,their diﬃcult optimization(Bengio et al., 2007;Bengio,2007)has long prevented reaping the ex-pected beneﬁts of going beyond one or two hidden lay-ers.However this situation has recently changed with the successful approach of(Hinton e

t al.,2006;Hinton &Salakhutdinov,2006;Bengio et al.,2007;Ranzato et al.,2007;Lee et al.,2008)for training Deep Belief Networks and stacked autoencoders.

One key ingredient to this success appears to be the u of an unsupervid training criterion to perform a layer-by-layer initialization:each layer is atﬁrst trained to produce a higher level(hidden)repren-tation of the obrved patterns,bad on the repre-ntation it receives as input from the layer below, by optimizing a local unsupervid criterion.Each level produces a reprentation of the input pattern that is more abstract than the previous level’s,be-cau it is obtained by composing more operations. This initialization yields a starting point,from which a globalﬁne-tuning of the model’s parameters is then performed using another training criterion appropriate for the task at hand.This technique has been shown empirically to avoid getting stuck in the kind of poor solutions one typically reaches with random initializa-tions.While unsupervid learning of a mapping that produces“good”intermediate reprentations of the input pattern ems to be key,little is understood re-garding what constitutes“good”reprentations for initializing deep architectures,or what explicit crite-ria may guide learning such reprentations.We know of only a few algorithms that em to work well for this purpo:Restricted Boltzmann Machines(RBMs) trained with contrastive divergence on one hand,and various types of autoencoders on the other.

峨眉3099

The prent rearch begins with the question of what

explicit criteria a good intermediate reprentation should satisfy.Obviously,it should at a minimum re-tain a certain amount of“information”about its input, while at the same time being constrained to a given a real-valued vector of a given size in the ca of an autoencoder).A supplemental criterion that has been propod for such models is sparsity of the reprentation(Ranzato et al.,2008;Lee et al.,2008). Here we hypothesize and investigate an additional spe-ciﬁc criterion:robustness to partial destruction of the ,partially destroyed inputs should yield almost the same reprentation.It is motivated by the following informal reasoning:a good repren-tation is expected to capture stable structures in the form of dependencies and regularities characteristic of the(unknown)distribution of its obrved input.For high dimensional redundant input(such as images)at least,such structures are likely to depend on evidence gathered from a combination of many input dimen-sions.They should thus be recoverable from partial obrvation only.A hallmark of this is our human ability to recognize partially occluded or corrupted im-ages.Further evidence is our ability to form a high level concept associated to multiple modalities(such as image and sound)and recall it even when some of the modalities are missing.

To validate our hypothesis and asss its ufulness as one of the guiding principles in learning dee

p architec-tures,we propo a modiﬁcation to the autoencoder framework to explicitly integrate robustness to par-tially destroyed inputs.Section2describes the algo-rithm in details.Section3discuss links with other approaches in the literature.Section4is devoted to a clor inspection of the model from diﬀerent theo-retical standpoints.In ction5we verify empirically if the algorithm leads to a diﬀerence in performance. Section6concludes the study.

2.Description of the Algorithm

2.1.Notation and Setup

Let X and Y be two random variables with joint prob-ability density p(X,Y),with marginal distributions p(X)and p(Y).Throughout the text,we will u the following notation:Expectation:E E p(X)[f(X)]=

p(x)f(x)d x.Entropy:I H(X)=I H(p)= E E p(X)[−log p(X)].Conditional entropy:I H(X|Y)= E E p(X,Y)[−log p(X|Y)].Kullback-Leibler divergence:

I D KL(p q)=E E p(X)[log p(X)

q(X)

]

.Cross-entropy:I H(p q)= E E p(X)[−log q(X)]=I H(p)+I D KL(p q).Mutual infor-mation:I(X;Y)=I H(X)−I H(X|Y).Sigmoid:s(x)=

1+e−x and s(x)=(s(x1),...,s(x d))T.Bernoulli dis-

tribution with meanµ:Bµ(x).and by extension

Bµ(x)=(Bµ

(x1),...,Bµ

(x d)).

The tup we consider is the typical supervid learn-

ing tup with a training t of n(input,target)pairs

D n={(x(1),t(1))...,(x(n),t(n))},that we suppo

to be an i.i.d.sample from an unknown distribution

q(X,T)with corresponding marginals q(X)and q(T).

2.2.The Basic Autoencoder

We begin by recalling the traditional autoencoder

model such as the one ud in(Bengio et al.,2007)

to build deep networks.An autoencoder takes an

input vector x∈[0,1]d,andﬁrst maps it to a hid-

den reprentation y∈[0,1]d through a deterministic

mapping y=fθ(x)=s(Wx+b),parameterized by

θ={W,b}.W is a d ×d weight matrix and b

is a bias vector.The resulting latent reprentation

y is then mapped back to a“reconstructed”vector

z∈[0,1]d in input space z=gθ (y)=s(W y+b )

withθ ={W ,b }.The weight matrix W of the

rever mapping may optionally be constrained by

W =W T,in which ca the autoencoder is said to

have tied weights.Each training x(i)is thus mapped

to a corresponding y(i)and a reconstruction z(i).The

parameters of this model are optimized to minimize

the average reconstruction error:

θ ,θ =arg min

θ,θ

i=1

x(i),z(i)

=arg min

θ,θ

i=1

x(i),gθ (fθ(x(i)))

(1)

where L is a loss function such as the traditional

squared error L(x,z)= x−z 2.An alternative loss,

suggested by the interpretation of x and z as either

bit vectors or vectors of bit probabilities(Bernoullis)

曲棍球is the reconstruction cross-entropy:

L I H(x,z)=I H(B x B z)

=−

k=1

[x k log z k+(1−x k)log(1−z k)](2)

Note that if x is a binary vector,L I H(x,z)is a negative

log-likelihood for the example x,given the Bernoulli

parameters z.Equation1with L=L I H can be written

θ ,θ =arg min

θ,θ

E E q0(X)[L I H(X,gθ (fθ(X)))](3)

where q0(X)denotes the empirical distribution asso-

ciated to our n training inputs.This optimization will

typically be carried out by stochastic gradient descent.

2.3.The Denoising Autoencoder

To test our hypothesis and enforce robustness to par-tially destroyed inputs we modify the basic autoen-coder we just described.We will now train it to recon-struct a clean “repaired”input from a corrupted ,par-tially destroyed one.This is done by ﬁrst corrupting the initial input x to get a partially destroyed version ˜x by means of a stochastic mapping ˜x ∼q D (˜x |x ).In our experiments,we considered the following corrupt-ing process,parameterized by the desired proportion νof “destruction”:for each input x ,a ﬁxed number νd of components are chon at random,and their value is forced to 0,while the others are left untouched.All information about the chon components is thus re-moved from that particuler input pattern,and the au-toencoder will be trained to “ﬁll-in”the artiﬁcially introduced “blanks”.Note that alternative corrupting

nois could be considered 1.The corrupted input ˜x

is then mapped,as with the basic autoencoder,to a hid-den reprentation y =f θ(˜x )=s (W ˜x +b )from which we reconstruct a z =g θ (y )=s (W

y +b )(e ﬁgure 1for a schematic reprentation of the process).As before the parameters are trained to minimize the av-erage reconstruction error L I H (x ,z )=I H (B x B z )over a training have z as clo as possible to the uncorrupted input x .But the key diﬀerence is that z

is now a deterministic function of ˜x

rather than x and thus the result of a stochastic mapping of x

手工五角星的折法z

Figure 1.An example x is corrupted to ˜x

.The autoen-coder then maps it to y and attempts to reconstruct x .

Let us deﬁne the joint distribution

q 0(X, X,Y )=q 0(X )q D ( X |X )δf θ(e X )

(Y )(4)

where δu (v )puts mass 0when u =v .Thus Y is a

deterministic function of X

.q 0(X, X,Y )is param-eterized by θ.The objective function minimized by stochastic gradient descent becomes:

arg min θ,θ

E E q 0(X,e X ) L I H X,g θ (f θ( X ))

(5)So from the point of view of the stochastic gradient de-scent algorithm,in addition to picking an input sam-ple from the training t,we will also produce a ran-dom corrupted version of it,and take a gradient step

The approach we describe and our analysis is not spe-ciﬁc to a particular kind of corrupting noi.

towards reconstructing the uncorrupted version from the corrupted version.Note that in this way,the au-toencoder cannot learn the identity,unlike the basic autoencoder,thus removing the constraint that d <d or the need to regularize speciﬁcally to avoid such a trivial solution.

2.4.Layer-wi Initialization and Fine Tuning The basic autoencoder has been ud as a building block to train deep networks (Bengio et al.,2007),with the reprentation of the k -th layer ud as input for the (k +1)-th,and the (k +1)-th layer trained after the k -th has been trained.After a few layers have been trained,the parameters are ud as initialization for a network optimized with respect to a supervid train-ing criterion.This greedy layer-wi procedure has been shown to yield signiﬁcantly better local minima than random initialization of deep networks ,achieving better generalization on a number of tasks (Larochelle et al.,2007).

The procedure to train a deep network using the de-noising autoencoder is similar.The only diﬀerence is how each layer is ,to minimize the crite-rion in eq.5instead of eq.3.Note that the corrup-tion process q D is only ud during training,but not for propagating reprentations from the raw input to higher-level reprentations.Note also that when layer k is trained,it receives as input the uncorrupted out-put of the previous layers.

3.Relationship to Other Approaches

Our training procedure for the denoising autoencoder involves learning to recover a clean input from a cor-rupted version,a task known as denoising .The prob-lem of image denoising,in particular,has been exten-sively studied in the image processing community and many recent developments rely on machine learning approaches (Roth and Black (2005);Elad and Aharon (2006);Hammond and Simoncelli (2007)).A particular form of gated autoencoders has also been ud for denoising in Memivic (2007).Denoising us-ing autoencoders was actually introduced much ear-lier (LeCun,1987;Gallinari et al.,1987),as an alter-native to Hopﬁeld models (Hopﬁeld,1982).Our ob-jective however is fundamentally diﬀerent from that of developing a competitive image denoising algorithm.We investigate explicit robustness to corrupting noi as a novel criterion guiding the learning of suitable in-termediate reprentations to initialize a deep network.Thus our corruption+de

noising procedure is applied not only on the input,but also recursively to interme-diate reprentations.

The approach also bears some remblance to the well known technique of augmenting the training data with stochastically“transformed”augment-ing a training t by transforming original bitmaps through small rotations,translations,and scalings is known to improveﬁnal classiﬁcation performance.In contrast to this technique our approach does not u any prior knowledge of image topology,nor does it pro-duce extra labeled examples for supervid training. We u corrupted patterns in a spe-ciﬁc to images)unsupervid initialization step,while the supervid training pha us the unmodiﬁed orig-inal data.

There is a well known link between“training with noi”and regularization:they are equivalent for small additive noi(Bishop,1995).By contrast,our cor-ruption process is a large,non-additive,destruction of information.We train autoencoders to”ﬁll in the blanks”,not merely be smooth functions(regulariza-tion).Also in our experience,regularized autoencoders (i.e.with weight decay)do not yield the quantitative jump in performance and the striking qualitative dif-ference obrved in theﬁlters that we get with denois-ing autoencoders.

There are also similarities with the work of(Doi et al., 2006)on robust coding over noisy channels.In their framework,a linear encoder is to encode a clean input for optimal transmission over a noisy channel to a de-coder that reconstructs the input.This work was later extended to robustness to noi in the input,in a pro-posal for a model of retinal coding(Doi&Lewicki, 2007).Though some of the inspiration behind our work comes from neural coding and computation,our goal is not to account for experimental data of neu-ronal activity as in(Doi&Lewicki,2007).Also,the non-linearity of our denoising autoencoder is crucial for its u in initializing a deep neural network.

It may be objected that,if our goal is to handle missing values correctly,we could have more naturally deﬁned a proper latent variable generative model,and infer the posterior over the latent(hidden)reprentation in the prence of missing inputs.But this usually requires a costly marginalization2which has to be carried out for each new example.By contrast,our approach tries to learn a fast and robust deterministic mapping fθfrom examples of already corrupted inputs.The bur-den is on learning such a constrained mapping during training,rather than on unconstrained inference at u time.We expect this may force the model to capture implicit invariances in the data,and result in interest-2as for RBMs,where it is exponential in the number of missing values ing features.Also note that in ction4.2we will e how our learning algorithm for the denoising autoen-coder can be viewed as a form of variational inference in a particular generative model.

4.Analysis of Denoising Autoencoders The above intuitive motivation for the denoising au-toencoder was given with the perspective of discover-ing robust reprentations.In the following,which can be skipped without hurting the remainder of the paper, we propo alternative perspectives on the algorithm.

4.1.Manifold Learning Perspective

The process of mapping a corrupted example to an uncorrupted one can be visualized in Figure2,with a low-dimensional manifold near which the data con-centrate.We learn a stochastic operator p(X| X)that maps an X to an X,p(X| X)=B g

(fθ(e X))

(X).The corrupted examples will be much more likely to be outside and farther from the manifold than the uncor-rupted ones.Hence the stochastic operator p(X| X) learns a map that tends to go from lower probability points X to high probability points X,generally on or near the manifold.Note that when X is farther from the manifold,p(X| X)should learn to make big-ger steps,to reach the manifold.

At the limit we e that the operator should map even far away points to a small volume near the manifold.

The denoising autoencoder can thus be en as a way to deﬁne and learn a manifold.The intermediate rep-rentation Y=f(X)can be interpreted as a coordi-nate system for points on the manifold(this is most clear if we force the dimension of Y to be smaller than the dimension of X).More generally,one can think of Y=f(X)as a reprentation of X which is well suited to capture the main variations in the ,on the manifold.When additional criteria(such as sparsity) are introduced in the learning model,one can no longer directly view Y=f(X)as an explicit low-dimensional coordinate system for points on the manifold,but it retains the property of capturing the main factors of variation in the data.

4.2.Top-down,Generative Model Perspective In this ction we recover the training criterion for our denoising autoencoder(eq.5)from a generative model perspective.Speciﬁcally we show that training the denoising autoencoder as described in ction2.3 is equivalent to maximizing a variational bound on a particular generative model.

Consider the generative model p(X, X,Y)= p(Y)p(X|Y)p( X|X)where p(X|Y)=B s(W Y+b )and

p ( X

|X )=q D ( X |X ).p (Y )is a uniform prior over Y ∈[0,1]d

.This deﬁnes a generative model with pa-rameter t θ ={W ,b }.We will u the previ-ously deﬁned q 0(X, X,Y )=q 0(X )q D ( X |X )δf θ(e X )

(Y )(equation 4)as an auxiliary model in the context of a variational approximation of the log-likelihood of p ( X

).Note that we abu notation to make it lighter,and u the same letters X , X

and Y for diﬀerent ts of random variables reprenting the same quan-tity under diﬀerent distributions:p or q 0.Keep in mind that whereas we had the dependency structure

X → X

→Y for q or q 0,we have Y →X → X for p .Since p contains a corruption operation at the last

generative stage,we propo to ﬁt p ( X

)

to corrupted training samples.Performing maximum likelihood ﬁt-ting for samples drawn from q 0( X

)corresponds to min-imizing the cross-entropy,or maximizing

H =max θ

{−I H (q 0( X ) p ( X ))}=max θ

{E E q 0(e X )[log p ( X )]}.(6)

Let q

(X,Y | X )be a conditional density,the quan-tity L (q , X )=E E q (X,Y |e X ) log p (X,e X,Y )q (X,Y |e X ) is a lower

bound on log p ( X

)since the following can be shown to be true for any q :

log p ( X

)

=L (q , X )+I D KL (q (X,Y | X ) p (X,Y | X ))Also it is easy to verify that the bound is tight when q (X,Y | X

)=p (X,Y | X ),where the I D KL becomes 0.We can thus write log p ( X

)=max q L (q , X ),and conquently rewrite equation 6as

H =max θ

{E E q 0(e X )[max q

L (q

, X )]}=max θ ,q

{E E q 0(e X )[L (q , X )]}

(7)

Figure 2.Manifold learning perspective.Suppo

training data (×)concentrate near a low-dimensional man-ifold.Corrupted examples (.

)obtained by applying cor-ruption process q D (e X

|X )will lie farther from the manifold.The model learns with p (X |e X

)to “project them back”onto the manifold.Intermediate reprentation Y can be inter-preted as a coordinate system for points on the manifold.

where we moved the maximization outside of the ex-pectation becau an unconstrained q (X,Y | X

)can in principle perfectly model the conditional distribu-tion needed to maximize L (q , X

)for any X .Now if we replace the maximization over an unconstrained q by the maximization over the parameters θof our q 0(appearing in f θthat maps an x to a y ),we get

a lower bound on H :H ≥max θ ,θ{E E q 0(e X )[L (q 0, X )]}

Maximizing this lower bound,we ﬁnd

arg max θ,θ

{E E q 0(e X )[L (q 0, X )]}=arg max θ,θ E E q 0(X,e X,Y )

log p (X, X,

Y )q 0(X,Y | X

)

=arg max θ,θ

E E q 0(X,e X,Y ) log p (X, X,Y )

+E E q 0(e X )

H [q 0

(X,Y | X )] =arg max θ,θ

E E q 0(X,e X,Y ) log p (X, X,Y )

备课要求Note that θonly occurs in Y =f θ( X

南京镇江),and θ only occurs in p (X |Y ).The last line is therefore obtained

becau q 0(X | X

)∝q D ( X |X )q 0(X )(none of which de-pends on (θ,θ )),and q 0(Y | X

)is ,its entropy is constant,irrespective of (θ,θ ).Hence the

entropy of q 0(X,Y | X

)

=q 0(Y | X )q 0(X | X ),does not vary with (θ,θ ).Finally,following from above,we obtain our training criterion (eq.5):

arg max θ,θ E E q 0(e X )[L (q 0, X )]=arg max θ,θ

E E q 0(X,e X,Y )[log[p (Y )p (X |Y )p ( X |X )]]=arg max θ,θ

E E q 0(X,e X,Y )[log p (X |Y )]

=arg max θ,θ

E E q 0(X,e X )[log p (X |Y =f θ( X ))]

=arg min θ,θ

E E q 0(X,e X ) L I H

X,g θ (f θ( X ))

where the third line is obtained becau (θ,θ )have no inﬂuence on E E q 0(X,e X,Y )[log p (Y )]becau we cho p (Y )stant,nor on

E E q 0(X,e X )[log p ( X |X )],and the last line is obtained

by inspection of the deﬁnition of L I H in eq.2,when

p (X |Y =f θ( X ))is a B g θ

各种乐器图片(f θ(e X

)).4.3.Other Theoretical Perspectives

Information Theoretic Perspective:Consider

X ∼q (X ),q unknown,Y =f θ( X

里番目录

).It can easily be shown (Vincent et al.,2008)that minimizing the expected reconstruction error amounts to maximizing

本文发布于:2023-07-14 14:38:03，感谢您对本站的认可！

本文链接：https://www.wtabcd.cn/fanwen/fan/82/1096253.html

上一篇：语言学概论试卷5

下一篇：英汉语言对比研究及对比分析综述

标签：折法要求镇江图片手工备课南京乐器

留言与评论（共有 0 条评论）