lstm原⽂_LSTM艾青诗选摘抄
:《LongShort-TermMemory》的翻译并解读LSTM:《Long Short-Term Memory》的翻译并解读
⽬录
Long Short-Term Memory
Abstract
1 INTRODUCTION
2 PREVIOUS WORK
3 CONSTANT ERROR BACKPROP
3.1 EXPONENTIALLY DECAYING ERROR
3.2 CONSTANT ERROR FLOW: NAIVE APPROACH
4 LONG SHORT-TERM MEMORY
5 EXPERIMENTS
5.1 EXPERIMENT 1: EMBEDDED REBER GRAMMAR
Long Short-Term Memory
地址02:www.bioinf.jku.at/publications/older/2604.pdf
论⽂原⽂地址01
地址01:/pdf/1506.04214.pdf地址02
Abstract
Learning to store information over extended time intervals via recurrent backpropagation takes a very long time, mostly due to insucient, decaying error back ow. We brie y review Hochreiter's 1991 analysis of this problem, then address it by introducing a novel, ecient, gradient-bad method called Long Short-Term Memory" (LSTM). Truncating the gradient where this does not do harm, LSTM can learn to bridge minimal time lags in excess of 1000 discrete time steps by enforcing constant error ow through constant error carrouls" within special units. Multiplicative gate units learn to open and clo access to the constant error ow. LSTM is local in space and time; its computational complexity per time step and weight is
O(1). Our experiments with arti cial data involve local, distributed, real-valued, and noisy pattern reprentations. In comparisons with RTRL, BPTT, R初三记叙文
ecurrent Cascade-Correlation, Elman nets, and Neural Sequence Chunking, LSTM leads to many more successful runs, and learns much faster. LSTM also solves complex, arti cial long time lag tasks that have never been solved by previous recurrent network algorithms.通过周期性的反向传播学习,在扩展的时间间隔内存储信息需要很长的时间,这主要是由于不确定的、衰减的错误导致的。我们简要回顾了Hochreiter在1991年对这个问题的分析,然后介绍了⼀种新颖的、独特的、基于梯度的⽅法,称为LSTM (LSTM)。在不造成伤害的情况下截断梯度,LSTM可以学习在超过1000个离散时间步长的最⼩时间滞后上桥接,⽅法是通过在特殊单元内的“恒定误差轮盘”强制执⾏恒定误差。乘性门单元学习打开和关闭访问的恒定误差低。LSTM在空间和时间上都是局部的;其每时间步长的计算复杂度和权值为O(1)。我们对⼈⼯数据的实验包括局部的、分布式的、实值的和有噪声的模式表⽰。在与RTRL、BPTT、周期性级联相关、Elman⽹和神经序列分块的⽐较中,LSTM带来了更多的成功运⾏,并且学习速度更快。LSTM还解决了以前的递归⽹络算法所不能解决的复杂、⼈⼯的长时间滞后问题。
1 INTRODUCTION
Recurrent networks can in principle u their feedback connections to store reprentations of recent input events in form of activations (short-term memory", as oppod to long-term memory" e
mbodied by slowly changing weights). This is potentially signicant for many applications, including s和是多音字吗
peech processing, non-Markovian control, and music composition (e.g., Mozer 1992). The most widely ud algorithms for learning what to put in short-term memory, however, take too
much time or do not work well at all, especially when minimal time lags between inputs and corresponding teacher signals are long. Although theoretically fascinating, existing methods do not provide clear practical advantages over, say, backprop in feedforward nets with limited time windows. This paper will review an analysis of the problem and suggest a remedy.
递归⽹络原则上可以使⽤它们的反馈连接以激活的形式存储最近输⼊事件的表⽰(“短期记忆”,⽽不是“长期记忆”,后者由缓慢变化的权重表⽰)。这对许多应⽤程序都有潜在的重要性,包括语⾳处理、⾮马尔可夫控制和⾳乐作曲(例如,Mozer 1992)。然⽽,最⼴泛使⽤的学习短期记忆的算法要么花费了太多时间,要么根本就不能很好地⼯作,尤其是在输⼊和相应教师信号之间的最⼩时滞很长时。虽然理论上很吸引⼈,但现有的⽅法并没有提供明显的实际优势,例如,在有限时间窗⼝的前馈⽹络中,backprop。本⽂将对这⼀问题进⾏分析,并提出解决办法。The problem. With conventional Back-Propagation Through Time" (BPTT, e.g., Williams and Zipr 1992, Werbos 1988) or Real-Time Recurrent Learning" (RTRL, e.g., Robinson and Fallside 1987),黄玫瑰代表什么意思
error signals owing backwards
in time" tend to either (1) blow up or (2) vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights (Hochreiter 1991). Ca (1) may lead to oscillating weights, while in ca (2) learning to bridge long time lags takes a prohibitive amount of time, or does not work at all (e ction 3). 这个问题。与传统反向传播通过时
间”(BPTT,例如,1992年威廉姆斯和拉链,Werbos 1988)或实时复发性学习”(RTRL,例如,罗宾逊和Fallside 1987),误差信号在时间上向后由于”倾向于(1)炸毁或(2):消失的时间演化backpropagated误差指数的⼤⼩取决于重量(Hochreiter 1991)。情形(1)可能会导致权值的振荡,⽽情形(2)学习如何桥接长时间滞后的情况会花费⼤量的时间,或者根本不起作⽤(参见第3节)。The remedy. This paper prents Long Short-Term Memory" (LSTM), a novel recurrent network architecture in conjunction with an appropriate gradient-bad learning algorithm. LSTM is designed to overcome the error back- ow problems. It can learn to bridge time intervals in excess of 1000 steps even in ca of noisy, incompressible input quences, without loss of short time lag capabilities. This is achieved by an ecient, gradient-bad algorithm for an architecture enforcing constant (thus neither exploding nor vanishing) error ow through internal states of special units (provided the gradient computation is truncated at
提出了⼀种新的递归⽹络结certain architecture-specic points | this does not aect long-term error ow t
hough). 补救措施。本⽂提出了⼀种新的递归⽹络结构——长短时记忆(LSTM),并结合适当的梯度学习算法
构——长短时记忆(LSTM),并结合适当的梯度学习算法。LSTM的设计就是为了克服这些错误的反向问题。它可以学习桥接超过1000步的时间间隔,即使在有噪声、不可压缩的输⼊序列的情况下,也不会损失短时间延迟能⼒。这是通过⼀种特殊的、基于梯度的算法来实现的,它针对的是⼀种通过特殊单元的内部状态来执⾏常量(因此既不会爆炸也不会消失)的错误(假设梯度计算在某些特定的体系结构点|被截断,但这并不影响长期的错误)。Outline of paper. Section 2 will brie y review previous work. Section 3 begins with an outline of the detailed analysis of vanishing errors due to Hochreiter (1991). It will then introduce a naive approach to constant error backprop for didactic purpos, and highlight its problems concerning information storage and retrieval. The problems will lead to the LSTM architecture as described in Section 4. Section 5 will prent numerous experiments and comparisons with competing methods. LSTM outperforms them, and also learns to solve complex, articial tasks no other recur如何申请资质
rent net algorithm has solved. Section 6 will discuss LSTM's limitations and advantages. The appendix contains a detailed description of the algorithm (A.1), and explicit error ow formulae (A.2).第⼆部分将简要回顾以前的⼯作。第3节以详细分析Hochreiter(1991)所造成的消失误差的⼤纲开始。然后,它将介绍⼀种⽤于教学⽬的的幼稚的不断错误⽀持⽅法,并突出其在
信息存储和检索⽅⾯的问题。这些问题将导致第4节中描述的LSTM体系结构。第5节将提供⼤量的实验和与竞争⽅法的⽐较。LSTM⽐它们做得更好,⽽且还学会了解决复杂的⼈⼯任务,这是其他递归⽹络算法所不能解决的。第6节将讨论LSTM的局限性和优点。附录中有算法的详细描述(a .1),以及公式的显式误差(a .2)。
2 PREVIOUS WORK
This ction will focus on recurrent nets with time-varying inputs (as oppod to nets with stationary inputs and xpoint-bad gradient calculations, e.g., Almeida 1987, Pineda 1987).
本节将集中讨论具有时变输⼊的递归⽹络(⽽不是具有固定输⼊和基于x点的梯度计算的⽹络,例如Almeida 1987和Pineda 1987)。Gradient-descent variants. The approaches of Elman (1988), Fahlman (1991), Williams (1989), Schmidhuber (1992a), Pearlmutter (1989), and many of the related algorithms in Pearlmutter's comprehensive overview (1995) suer from the same problems as BPTT and RTRL (e Sections 1 and 3).
梯度下降法变体。Elman(1988)、Fahlman(1991)、Williams(1989)、Schmidhuber (1992a)、Pearlmutter(1989)的⽅法,以及Pearlmutter的综合综述(1995)中的许多相关算法,都是从与BPTT和RTRL相同的问题中提出的(见第1节和第3节)
Time-delays. Other methods that em practical for short time lags only are Time-Delay Neural Networks (Lang et al. 1990) and Plate's method (Plate 1993), which updates unit activations bad on a weighted sum of old activations (e also de Vries and Principe 1991). Lin et al. (1995) propo variants of time-delay networks called NARX networks.
时间延迟。其他似乎只适⽤于短时间滞后的⽅法有时滞神经⽹络(Lang et al. 1990)和Plate法(Plate 19月计划表格模板
93),后者基于旧激活的加权和更新单位激活(参见de Vries和Principe 1991)。Lin等⼈(1995)提出了时延⽹络的变体NARX⽹络。
Time constants. To deal with long time lags, Mozer (1992) us time constants in uencing changes of unit activations (deVries and Principe's above-mentioned approach (1991) may in fact be viewed as a mixture of TDNN and time constants). For long time lags, however, the time constants need external ne tuning (Mozer 1992). Sun et al.'s alternative approach (1993) updates the activation of a recurrent unit by adding the old activation and the (scaled) current net input. The net input, however, tends to perturb the stored information, which makes long-term storage impractical.
时间常量。为了处理长时间滞后,Mozer(1992)使⽤时间常数来表⽰单位激活的变化(deVries and Principe’s上述⽅法(1991)实际上可以看作是TDNN和时间常数的混合物)。然⽽,对于长时间滞后,时
间常数需要外部ne调谐(Mozer 1992)。Sun等⼈的替代⽅法(1993)通过添加旧的激活和(缩放的)当前净输⼊来更新⼀个经常性单元的激活。然⽽,净输⼊往往会⼲扰所存储的信息,这使得长期存储变得不切实际。
Ring's approach. Ring (1993) also propod a method for bridging long time lags. Whenever a unit in his network receives con icting error signals, he adds a higher order unit in uencing appropriate connections. Although his approach can sometimes be extremely fast, to bridge a time lag involving 100 steps may require the addition of 100 units. Also, Ring's
net does not generalize to unen lag durations.
环的⽅法。Ring(1993)也提出了⼀种桥接长时间滞后的⽅法。当他的⽹络中的⼀个单元接收到通信错误信号时,他就增加⼀个更⾼阶的单元来建⽴适当的连接。虽然他的⽅法有时⾮常快,但要跨越100步的时间延迟可能需要增加100个单元。同样,环⽹也不能推⼴到看不见的滞后时间。
Bengio et al.'s approaches. Bengio et al. (1994) investigate methods such as simulated annealing, multi-grid random arch, time-weighted pudo-Newton optimization, and discrete error propagation. Their latch" and 2-quence" problems are very similar to problem 3a with minimal time lag 100 (e Experiment 3). Bengio and Frasconi (1994) also propo an EM approach for propagati
ng targets. With n so-called state networks", at a given time, their system can be in one of only n dierent states. See also beginning of Section 5. But to solve continuous problems such as the adding problem" (Section
5.4), their system would require an unacceptable number of states (i.e., state networks).
Bengio等⼈的⽅法。Bengio等⼈(1994)研究了模拟退⽕、多⽹格随机搜索、时间加权伪⽜顿优化和离散误差传播等⽅法。他们的“闩锁”和“2-序列”问题与3a问题⾮常相似,只有最⼩的滞后时间100(见实验3)。Bengio和Frasconi(1994)也提出了⼀种EM⽅法来传播⽬标。对于n个所谓的“状态⽹络”,在给定的时间内,它们的系统只能处于n种不同状态中的⼀种。参见第5节的开头。但是,为了解决诸如“添加问题”(第5.4节)之类的连续问题,它们的系统将需要不可接受的状态数(即状态的⽹络)。
Kalman lters. Puskorius and Feldkamp (1994) u Kalman lter techniques to improve recurrent net performance. Since they u a derivative discount factor impod to decay exponentially the eects of past dynamic derivatives," there is no reason to believe that their Kalman Filter Trained Recurrent Networks will be uful for very long minimal time lags. Second order nets. We will e that LSTM us multiplicative units (MUs) to protect error ow from unwanted perturbations. It is not the rst recurrent net method using MUs though. For instance, Watrous and Kuhn (1992) u MUs in cond order nets. Some dierences to LSTM are:
(1) Watrous and Kuhn's architecture does not enforce constant error ow and is not designed to solve long time lag
problems.
(2) It has fully connected cond-order sigma-pi units, while the LSTM architecture's MUs are ud only to gate access
to constant error ow.
(3) Watrous and Kuhn's algorithm costs O(W2 ) operations per time step, ours only O(W), where W is the number of
weights. See also Miller and Giles (1993) for additional work on MUs.
Kalman lters. Puskorius and Feldkamp (1994)使⽤Kalman lter技术来提⾼经常性净绩效。由于他们使⽤⼀个衍⽣品折扣因⼦来指数衰减过去动态衍⽣品的影响,“我们没有理由相信他们的卡尔曼滤波训练的递归⽹络在很长⼀段时间内都是有⽤的。”⼆阶⽹络。我们将看到LSTM使⽤乘法单位(MUs)来保护错误不受不必要的⼲扰。但它不是第⼀个使⽤MUs的递归⽹络⽅法。例如,Watrous和Kuhn(1992)在⼆阶⽹中使⽤MUs。LSTM的⼀些不同之处是:(1)Watrous和Kuhn的架构不强制恒定的错误,也不是为
了解决长时间滞后的问题⽽设计的。
(2)它具有完全连通的⼆阶sigma-pi单元,⽽LSTM体系结构的MUs仅⽤于对恒定误差低的门访问。
(3) Watrous和Kuhn的算法每时间步需要O(W2)个操作,我们的算法只需要O(W)个操作,其中W是权值的个数。有关MUs的其他⼯
作也见Miller和Giles(1993)。
Simple weight guessing. To avoid long time lag problems of gradient-bad approaches we may simply randomly initialize all network weights until the resulting net happens to classify all training quences correctly. In fact, recently we discovered (Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997) that simple weight guessing solves many of the problems in (Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995) faster than the algorithms propod therein. This does not mean that weight guessing is a good algorithm. It just means that the problems are very simple. More realistic tasks require either many free parameters (e.g., input weights) or high weight precision (e.g., for continuous-valued parameters), such that guessing becomes completely infeasible.
简单的猜测。为了避免基于梯度的⽅法的长时间滞后问题,我们可以简单地随机初始化所有⽹络权值,直到最终得到的⽹络正确地对所有训练序列进⾏分类。事实上,最近我们发现(Schmidhuber and Hochreiter 1996, Hochreiter and Schmidhuber 1996, 1997)简单的重量猜测解决了(Bengio 1994, Bengio and Frasconi 1994, Miller and Giles 1993, Lin et al. 1995)中的许多问题,⽐其中提出的算法更快。这并不意味着猜测权重是⼀个好的算法。这意味着问题很简单。更实际的任务需要许多⾃由参数(例如,输⼊权值)或较可爱日语
⾼的权值精度(例如,连续值参数),这样猜测就变得完全不可⾏的。
Adaptive quence chunkers. Schmidhuber's hierarchical chunker systems (1992b, 1993) do have a capability to bridge arbitrary time lags, but only if there is local predictability across the subquences causing the time lags (e also Mozer 1992). For instance, in his postdoctoral thesis (1993), Schmidhuber us hierarchical recurrent nets to rapidly solve certain grammar learning tasks involving minimal time lags in excess of 1000 steps. The performance of chunker systems, however, deteriorates as the noi level increas and the input quences become less compressible. LSTM does not suer from this problem.⾃适应序列chunkers。Schmidhuber的分层chunker系统(1992新年动态图片
b, 1993)确实具有桥接任意时间滞后的能⼒,但前提是⼦序列具有局部可预测性,从⽽导致时间滞后(参见Mozer 1992)。例如,在他的博⼠后论⽂(1993)中,Schmidhuber使⽤层次递归⽹络来快速解决某些语法学习任务,这些任务涉及的时间延迟最⼩,超过了1000步。然⽽,随着噪声⽔平的提⾼和输⼊序列的可压缩性的降低,chunker系统的性能会下降。LSTM不能解决这个问题。
3 CONSTANT ERROR BACKPROP 固定误差⽀持
3.1 EXPONENTIALLY DECAYING ERROR 指数衰减误差
Conventional BPTT (e.g. Williams and Zipr 1992). Output unit k's target at time t is denoted by dk (t). Using mean squared error, k's error signal is
传统的BPTT(如Williams和Zipr 1992)。输出单元k在t时刻的⽬标⽤dk (t)表⽰,利⽤均⽅误差,k的误差信号为
The corresponding contribution to wjl 's total weight update is #j (t)yl (t 1), where is the learning ra关注的近义词
te, and l stands for an arbitrary unit connected to unit j. Outline of Hochreiter's analysis (1991, page 19-21). Suppo we have a fully connected net who non-input unit indices range from 1 to n. Let us focus on local error ow from unit u to unit v (later we will e
that the analysis immediately extends to global error ow). The error occurring at an arbitrary unit u at time step t is propagated back into time" for q time steps, to an arbitrary unit v. This will scale the error by the following fact
wjl的总权重更新的相应贡献是#j (t)yl (t 1),其中为学习率,l表⽰连接到j单元的任意单元。Hochreiter
分析概要(1991年,第19-21页)。假设我们有⼀个完全连通的⽹络,它的⾮输⼊单位指数范围从1到n。让我们关注从单位u到单位v的局部误差ow(稍后我们将看到分析⽴即扩展到全局误差ow)。发⽣在任意单位u上的时间步长t的误差被传播回时间中,对于q时间步长,传播回任意单位v