Cognitive Sciences EPrint Archive (CogPrints) 6223 (2008)
cogprints/6223/
Understanding Slow Feature Analysis:
A Mathematical Framework
Henning Sprekeler∗  h.sprekeler@biologie.hu-berlin.de
Laurenz Wiskott  l.wiskott@biologie.hu-berlin.de
Institute for Theoretical Biology
and Bernstein Center for Computational Neuroscience Berlin
Humboldt University Berlin
Unter den Linden 6
10099 Berlin, Germany
∗ corresponding author; now at: Laboratory for Computational Neuroscience, École Polytechnique Fédérale de Lausanne, Station 15, 1015 Lausanne, Switzerland
Abstract
Slow feature analysis is an algorithm for unsupervised learning of invariant representations from data with temporal correlations. Here, we present a mathematical analysis of slow feature analysis for the case where the input-output functions are not restricted in complexity. We show that the optimal functions obey a partial differential eigenvalue problem of a type that is common in theoretical physics. This analogy allows the transfer of mathematical techniques and intuitions from physics to concrete applications of slow feature analysis, thereby providing the means for analytical predictions and a better understanding of simulation results. We put particular emphasis on the situation where the input data are generated from a set of statistically independent sources. The dependence of the optimal functions on the sources is calculated analytically for the cases where the sources have Gaussian or uniform distribution.
Keywords: slow feature analysis, unsupervised learning, invariant representations, statistically independent sources, theoretical analysis
1. Introduction
Reliable recognition of objects in spite of changes in position, size, or illumination is a problem that, although highly relevant for computer vision, has not been solved in a general manner. The key problem is to establish representations of the data that are invariant with respect to typical changes in the appearance of the objects while remaining selective for object identity.
One approach for the unsupervised learning of such invariant representations is based on the exploitation of temporal correlations in the training data. The basic idea is that representations that are invariant with respect to typical transformations in a given set of time-dependent training data should lead to temporally slowly varying output signals. Using this observation as a rationale, it is possible to learn invariant representations by favoring representations that generate slowly varying signals over others that generate quickly varying signals. There are several implementations of this principle, with approaches ranging from gradient descent (Stone and Bray, 1995; Kayser et al., 2001), over temporally nonlocal Hebbian learning (Mitchison, 1991; Földiák, 1991; Wallis and Rolls, 1997; Sprekeler et al., 2007), to batch learning (Wiskott and Sejnowski, 2002). Here, we focus on Slow Feature Analysis (SFA) as introduced by Wiskott (1998).
SFA has been applied to a set of problems, ranging from biological modeling of receptive fields in early visual processing (Berkes and Wiskott, 2005) and hippocampal place and head direction cells (Franzius et al., 2007a) to technical applications such as invariant object recognition (Berkes, 2005; Franzius et al., 2007b). On the conceptual side, it has turned out that SFA is closely related to independent component analysis techniques that rely on second-order statistics (Blaschke et al., 2006). SFA or variations thereof can therefore also be used for problems of blind source separation (Blaschke et al., 2007).
Previous studies have shown that SFA is amenable to analytical considerations (Wiskott, 2003; Franzius et al., 2007a). Here, we extend this line of research and present a mathematical framework for the case where SFA has access to an unlimited function space. The aim of this article is threefold. Firstly, the mathematical framework allows us to make analytical predictions for the input-output functions found by SFA in concrete applications. Parts of the theory have been presented and used for this purpose earlier (Franzius et al., 2007a). Secondly, the theory shows that SFA bears close analogies to standard systems in physics, which provide an intuitive interpretation of the functions found by SFA. Thirdly, the theory makes specific predictions for the case where the input data are generated from a set of statistically independent sources.
The structure of the paper is as follows. In section 2 we introduce the optimization problem that underlies SFA. In sections 3 and 4 we develop the mathematical framework and study the case of statistically independent sources. In section 5, we briefly discuss analogies to standard systems in physics and try to convey an intuitive understanding of the solutions of SFA based on these analogies.
2. Slow Feature Analysis
Slow Feature Analysis is based on the following learning task: Given a multi-dimensional input signal, we want to find scalar input-output functions that generate output signals that vary as slowly as possible but carry significant information. To ensure the latter, we require the output signals to be uncorrelated and have unit variance. In mathematical terms, this can be stated as follows:
Optimization problem 1: Given a function space F and an N-dimensional input signal x(t), find a set of J real-valued input-output functions g_j(x) such that the output signals y_j(t) := g_j(x(t)) minimize

∆(y_j) = ⟨ẏ_j²⟩_t    (1)

under the constraints

⟨y_j⟩_t = 0    (zero mean),    (2)
⟨y_j²⟩_t = 1    (unit variance),    (3)
⟨y_i y_j⟩_t = 0  for all i < j    (decorrelation and order),    (4)

with ⟨·⟩_t and ẏ indicating temporal averaging and the derivative of y, respectively.
Equation (1) introduces the ∆-value, which is a measure of the temporal slowness (or rather 'fastness') of the signal y(t). The constraints (2) and (3) avoid the trivial constant solution. Constraint (4) ensures that different functions g_j code for different aspects of the input. Note that the decorrelation constraint is asymmetric: The function g_1 is the slowest function in F, while the function g_2 is the slowest function that fulfills the constraint of generating a signal that is uncorrelated to the output signal of g_1. Therefore, the resulting sequence of functions is ordered according to the slowness of their output signals on the training data.

It is important to note that although the objective is the slowness of the output signal, the functions g_j are instantaneous functions of the input, so that slowness cannot be achieved by low-pass filtering. Slow output signals can only be obtained if the input signal contains slowly varying features that can be extracted by the functions g_j.
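As a concrete illustration of the objective (1) and the constraints (2)-(4), the following minimal Python sketch estimates the ∆-value of sampled signals with a finite-difference derivative and checks the constraints empirically. The function names and the toy signals are our own illustrative choices, not part of the original formulation.

```python
import numpy as np

def delta_value(y, dt=1.0):
    """Estimate the Delta-value <y_dot^2>_t of a sampled signal y(t),
    using a finite-difference approximation of the temporal derivative."""
    y_dot = np.diff(y) / dt
    return np.mean(y_dot ** 2)

def check_constraints(Y):
    """Check zero mean, unit variance, and decorrelation for output signals
    stored as the rows of Y (shape J x T)."""
    return Y.mean(axis=1), Y.var(axis=1), np.cov(Y)

# Toy example: a slow and a fast unit-variance signal on the same time axis
t = np.linspace(0, 2 * np.pi, 1000)
dt = t[1] - t[0]
slow = np.sqrt(2) * np.sin(t)        # one oscillation  -> small Delta-value (~1)
fast = np.sqrt(2) * np.sin(10 * t)   # ten oscillations -> large Delta-value (~100)

print(delta_value(slow, dt), delta_value(fast, dt))
print(check_constraints(np.vstack([slow, fast])))
```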
Depending on the dimensionality of the function space F, the solution of the optimization problem requires different techniques. If F is finite-dimensional, the problem can be reduced to a (generalized) eigenvalue problem (Wiskott and Sejnowski, 2002; Berkes and Wiskott, 2005). If F is infinite-dimensional, the problem requires variational calculus and is in general difficult to solve. Here, we consider this second case for the special situation that there are no constraints on F apart from sufficient differentiability and integrability. Although strictly speaking this case cannot be implemented numerically, it has the advantage that it permits the analytical derivation of partial differential equations for the optimal functions and predictions for the behavior of systems that implement very high-dimensional function spaces such as hierarchical systems (Franzius et al., 2007a). It also yields an intuition as to how the structure of the input signal is reflected by the optimal solutions.
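For the finite-dimensional case mentioned above, a minimal sketch of linear SFA as a generalized eigenvalue problem is given below. The variable names and the use of scipy.linalg.eigh are our own illustrative choices; see Wiskott and Sejnowski (2002) for the algorithm itself.

```python
import numpy as np
from scipy.linalg import eigh

def linear_sfa(X, n_components=2, dt=1.0):
    """Sketch of linear SFA.  X has shape (T, N): T time steps, N input dimensions.
    Returns the smallest generalized eigenvalues (the Delta-values) and the
    corresponding weight vectors w_j, so that y_j(t) = w_j . (x(t) - <x>_t)."""
    Xc = X - X.mean(axis=0)              # enforce the zero-mean constraint
    Xdot = np.diff(Xc, axis=0) / dt      # finite-difference temporal derivative
    A = Xdot.T @ Xdot / Xdot.shape[0]    # <x_dot x_dot^T>_t
    B = Xc.T @ Xc / Xc.shape[0]          # <x x^T>_t
    # Generalized eigenvalue problem A w = lambda B w; the smallest eigenvalues
    # correspond to the slowest (smallest-Delta) output signals.
    eigvals, eigvecs = eigh(A, B)
    return eigvals[:n_components], eigvecs[:, :n_components]
```

Because eigh returns eigenvectors that are orthonormal with respect to B, the extracted output signals satisfy the unit-variance and decorrelation constraints up to sampling effects.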
3. A Mathematical Framework for SFA
In this section, we present a rigorous mathematical framework for SFA for the case of an unrestricted function space F. The key results are that the output signals extracted by SFA are independent of the representation of the input signals and that the optimal functions for SFA are the solutions of a partial differential eigenvalue problem.
3.1 Representations of the input signals
The assumption that SFA has access to an unrestricted function space F has important theoretical implications. For restricted (but possibly still infinite-dimensional) function spaces, coordinate changes in the space of the input data generally alter the results, because they effectively change the function space from which the solutions are taken. As an example, assume that the input signal x = (x, y) is two-dimensional and the function space is the space of linear functions. Then, a change of the coordinate system to (x′, y′) = (x³, y), if still allowing only linear functions in the new coordinates, leads to a very different function space in the variables x and y. Thus the optimal functions generate different optimal output signals y_j for the different coordinate systems. The optimization problem with a restricted function space is generally not invariant with respect to coordinate changes of the input signals.
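To make this example concrete, one can run the linear SFA sketch from the end of section 2 on the same trajectory expressed in the two coordinate systems; the signals below are invented purely for illustration. The slowest linear feature of (x, y) and of (x′, y′) = (x³, y) generally differ, confirming that the restricted problem is not coordinate invariant.

```python
import numpy as np
# reuses linear_sfa from the sketch at the end of section 2

t = np.linspace(0, 20, 5000)
x = np.sin(t) + 0.3 * np.sin(7 * t)        # illustrative slow + fast mixture
y = np.sin(5 * t)
X_orig = np.column_stack([x, y])           # coordinates (x, y)
X_new = np.column_stack([x ** 3, y])       # coordinates (x', y') = (x^3, y)

_, W1 = linear_sfa(X_orig, n_components=1)
_, W2 = linear_sfa(X_new, n_components=1)
y1 = (X_orig - X_orig.mean(0)) @ W1        # slowest linear feature of (x, y)
y2 = (X_new - X_new.mean(0)) @ W2          # slowest linear feature of (x^3, y)
print(np.corrcoef(y1.ravel(), y2.ravel())[0, 1])  # generally not ±1: outputs differ
```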
For an unrestricted function space, the situation is different, because the concatenation of any function in F with the inversion of the coordinate change is again an element of the function space. The set of output signals that can be generated by the function space is then invariant with respect to invertible coordinate changes of the input signals. Because the slowness of a function is measured in terms of its output signal, the optimal functions will of course depend on the coordinate system used, but the output signals will be the same.
This is particularly interesting in situations where the high-dimensional input signal does not cover the whole space of possible values, but lies on a low-dimensional manifold. For illustration, consider the example of a video showing a single rotating object. In this case, the set of possible images can be parameterized by three angles that characterize the orientation of the object in space. Therefore, the images lie on a three-dimensional manifold in the space of all possible images. Because there are no data outside the input manifold, we are generally only interested in the behavior of the optimal functions within the input manifold, that is, in the reaction of the system to all images that are possible within the given training scenario. The equivalence of different coordinate systems then implies that it is not important whether we take the (high-dimensional) video sequence or
the (3-dimensional) time-dependent abstract angles as input signals. The output signal is the same. Of course the low-dimensional representation is much more amenable to analytical predictions and to intuitive interpretations of the system behavior. We have previously used this simplification to predict the behavior of a hierarchical model of visual processing that reproduces the behavior of several cell types in the hippocampal formation of rodents commonly associated with spatial navigation (Franzius et al., 2007a).
Another situation in which the coordinate invariance is useful is the case of nonlinear blind source separation. Here, the input data are assumed to be a nonlinear mixture of some underlying sources. The task is to reconstruct the sources from the data without knowledge of the mixture or the sources. A natural prerequisite for the reconstruction is that the mixture is invertible. The mixture can then be interpreted as a nonlinear coordinate change, which, due to the equivalence of different coordinate systems, is immaterial to the optimization problem above. From the theoretical perspective, we can thus simply assume that we had the sources themselves as input signals and try to make predictions about how they are mixed in the optimal output signals found by SFA. If we can infer the sources (or good representatives thereof) from the optimal output signals under this condition, we can infer the sources from the output signals, no matter how they are mixed in the input data. Thus, SFA may be an interesting way of solving certain nonlinear blind source separation problems.
It is important to bear in mind that the theory developed in the following is valid for an arbitrary choice of the input coordinate system, so that x(t) can stand for concrete input data (e.g., video sequences) as well as abstract representations (e.g., the angles that denote the orientation of the object in the video). Note, however, that as the input data (or the manifold they lie on) become very high-dimensional, the resulting equations may be tedious to solve.
3.2 Further assumptions and notation
We assume that the input signal x(t) is ergodic, so that we can replace time averages by ensemble averages with a suitable probability density. Because the optimization problem underlying SFA relies on the temporal structure of the training data as reflected by its derivative, a statistical description of the training signal x must incorporate not only the probability distribution for the values of x, but rather the joint distribution p_{x,ẋ}(x, ẋ) of the input signal x and its derivative ẋ. We assume that p_{x,ẋ}(x, ẋ) is known and that we can define the marginal and conditional probability densities
p_x(x) := ∫ p_{x,ẋ}(x, ẋ) d^N ẋ,    (5)

p_{ẋ|x}(ẋ|x) := p_{x,ẋ}(x, ẋ) / p_x(x),    (6)

and the corresponding averages

⟨f(x, ẋ)⟩_{x,ẋ} := ∫ p_{x,ẋ}(x, ẋ) f(x, ẋ) d^N x d^N ẋ,    (7)

⟨f(x)⟩_x := ∫ p_x(x) f(x) d^N x,    (8)

⟨f(x, ẋ)⟩_{ẋ|x}(x) := ∫ p_{ẋ|x}(ẋ|x) f(x, ẋ) d^N ẋ.    (9)
We assume throughout that all averages taken exist. This introduces integrability constraints on the functions of which the average is taken. The function space is thus not completely unrestricted. The functions are restricted to be integrable in
the sense that the averages above exist. In addition, they should be differentiable, simply to assure that the temporal derivative of their output signal exists.
We use greek letters to denote the index of vector components. Partial derivatives with respect to a component x_µ are written as ∂_µ. For example, the divergence of a vector field v(x) takes the short form

div v(x) := Σ_µ ∂v_µ(x)/∂x_µ = Σ_µ ∂_µ v_µ(x).    (10)

We use the convention that within products, ∂_µ acts on all functions to its right. If we want ∂_µ to act locally, we use square brackets. This convention can be illustrated by the product rule

∂_µ f(x) g(x) = [∂_µ f(x)] g(x) + f(x) [∂_µ g(x)].    (11)
3.3 Reformulation of the optimization problem
To describe the ∆-value in terms of the probability distribution p_{x,ẋ}(x, ẋ), we need to express the temporal derivative of the output signal y(t) = g(x(t)) in terms of the input signals x(t) and their derivatives. This is readily done by the chain rule

ẏ(t) = d/dt g(x(t)) = Σ_µ ẋ_µ(t) ∂_µ g(x(t)).    (12)
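As a quick numerical sanity check of equation (12), the sketch below compares the finite-difference derivative of y(t) = g(x(t)) with the chain-rule expression Σ_µ ẋ_µ ∂_µ g; the trajectory and the quadratic test function are invented for illustration.

```python
import numpy as np

t = np.linspace(0, 10, 2000)
dt = t[1] - t[0]
x = np.stack([np.sin(t), np.cos(3 * t)], axis=1)       # 2-d trajectory x(t), shape (T, 2)
g = lambda x: x[:, 0] ** 2 + 0.5 * x[:, 0] * x[:, 1]   # test function g(x) = x1^2 + x1*x2/2
grad_g = lambda x: np.stack([2 * x[:, 0] + 0.5 * x[:, 1],
                             0.5 * x[:, 0]], axis=1)   # [dg/dx1, dg/dx2]

y_dot_direct = np.gradient(g(x), dt)                 # finite-difference derivative of y(t)
x_dot = np.gradient(x, dt, axis=0)                   # finite-difference derivative of x(t)
y_dot_chain = np.sum(x_dot * grad_g(x), axis=1)      # chain rule, equation (12)
print(np.max(np.abs(y_dot_direct - y_dot_chain)))    # small, up to discretization error
```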
We can now rewrite the objective function (1) by replacing the time average ⟨·⟩_t by the ensemble average ⟨·⟩_{x,ẋ}:

∆(g_j) = ⟨ẏ_j(t)²⟩_t    (by (1))    (13)
       = ⟨ Σ_{µ,ν} ẋ_µ [∂_µ g_j(x)] ẋ_ν [∂_ν g_j(x)] ⟩_{x,ẋ}    (by (12))    (14)
       = ⟨ Σ_{µ,ν} ⟨ẋ_µ ẋ_ν⟩_{ẋ|x} [∂_µ g_j(x)] [∂_ν g_j(x)] ⟩_x    (15)
       = ⟨ Σ_{µ,ν} K_{µν}(x) [∂_µ g_j(x)] [∂_ν g_j(x)] ⟩_x,    (16)
where K_{µν}(x) := ⟨ẋ_µ ẋ_ν⟩_{ẋ|x} is the matrix of the second moments of the conditional velocity distribution p_{ẋ|x}(ẋ|x) and reflects the dynamical structure of the input signal.
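In practice, K_{µν}(x) can be estimated from sampled data by conditioning on a discretization of the input space. The sketch below does so for a one-dimensional toy signal and verifies that the ensemble-average expression (16) reproduces the direct temporal average (1); the binning scheme, the signal, and the test function are our own illustrative choices.

```python
import numpy as np

def estimate_K_1d(x, x_dot, n_bins=20):
    """Estimate K(x) = <x_dot^2>_{x_dot|x} for a scalar signal by averaging
    x_dot^2 within bins of x.  Returns the per-bin estimates and bin indices."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    K = np.array([np.mean(x_dot[idx == b] ** 2) if np.any(idx == b) else 0.0
                  for b in range(n_bins)])
    return K, idx

# Illustrative scalar signal x(t) and a smooth test function g
t = np.linspace(0, 100, 20000)
dt = t[1] - t[0]
x = np.sin(t)
x_dot = np.gradient(x, dt)
g = lambda x: np.cos(np.pi * x)
dg = lambda x: -np.pi * np.sin(np.pi * x)

# Delta via the direct temporal average (1) ...
delta_direct = np.mean(np.gradient(g(x), dt) ** 2)
# ... and via the ensemble average with the estimated K(x), equation (16)
K, idx = estimate_K_1d(x, x_dot)
delta_ensemble = np.mean(K[idx] * dg(x) ** 2)
print(delta_direct, delta_ensemble)   # agree up to binning and sampling error
```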
An elegant reformulation of the optimization problem can be obtained by introducing the following scalar product (f, g) between functions f, g ∈ F:

(f, g) := ⟨f(x) g(x)⟩_x.    (17)

With this definition, the function space F becomes a Hilbert space and the slowness objective for a function g_j can be written as

∆(g_j) = Σ_{µ,ν} (∂_µ g_j, K_{µν} ∂_ν g_j).    (18)

Note that we restrict the action of the partial derivatives to the argument of the scalar product they appear in.
Replacing the temporal averages by ensemble averages and using the scalar product (17), the original optimization problem becomes