Natural Language Processing: Sequence Tagging (BiLSTM-CRF)
Tagging Scheme
IOBES: Inside, Outside, Beginning, End, Single
Bidirectional LSTM Networks
By utilizing a bidirectional LSTM, we can efficiently make use of past features (via forward states) and future features (via backward states) for a specific time frame.
We train the bidirectional LSTM networks using BPTT, with special treatment at the beginning and the end of the data points, such as resetting the hidden states to 0 at the beginning of each sentence. At time step $t$ for the NER task:
the input layer represents the features at time $t$: one-hot word features, dense vector features, or sparse features.
the output layer represents a probability distribution (produced by a softmax) over labels at time $t$. It has the same dimensionality as the number of labels; the label with the maximum probability is used as the output at time step $t$.
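As a toy illustration, the per-timestep softmax over hypothetical BiLSTM output scores (random stand-in values, not real network outputs) can be sketched as:

```python
import numpy as np

# Hypothetical emission scores from a BiLSTM for a 4-word sentence
# over 5 labels (random stand-ins for real network outputs).
n_words, n_labels = 4, 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n_words, n_labels))   # shape (n, k)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

probs = softmax(scores)        # each row is a probability distribution over labels
pred = probs.argmax(axis=1)    # independent decision at each time step
```

Note that `pred` makes an independent decision at every time step, which is exactly the limitation the CRF layer addresses below.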
Why use the CRF Networks?
Although using the $h_t$ as features to make independent tagging decisions for each output $y_t$ is successful in POS tagging, the independent classification decisions are limiting when there are strong dependencies across output labels. In NER, the “grammar” that characterizes interpretable sequences of tags imposes several hard constraints that would be impossible to model with independence assumptions.
The first way is to predict a distribution of tags at each time step and then use beam-like decoding to find optimal tag sequences, such as the Maximum Entropy classifier (Ratnaparkhi, 1996) and Maximum Entropy Markov Models (McCallum et al., 2000).
The second way is to focus on the sentence level instead of individual positions, leading to Conditional Random Field (CRF) models, in which the inputs and outputs are directly connected, as opposed to LSTM networks, where memory cells/recurrent components are employed.
As shown in the rightmost figure below, ignoring the dependencies between tags at different positions leads to incorrect taggings.
For sequence tagging, the input word sequence is the observation sequence and the tagged sequence is the hidden state sequence. Sequence tagging models that assume state independence cannot handle hard constraints between different states; MEMM and CRF are good at solving this kind of problem.
MEMM assumes the current state depends only on the previous state (the Markov assumption); CRF makes no such Markov assumption and considers global state information when predicting each state.
HMM (Markov assumption + observation-independence assumption) -> MEMM (Markov assumption) -> CRF
CRF Networks
BiLSTM-CRF networks
Let $x = (x_1, x_2, \cdots, x_n)$ and $y = (y_1, y_2, \cdots, y_n)$ denote an input sequence and a sequence of predicted tags respectively, where $n$ is the length of the input sentence.

Emission score
We consider $P \in \mathbb{R}^{n \times k}$ to be the matrix of scores output by the BiLSTM network, where $k$ is the number of distinct tags; the element $P[i, j]$ corresponds to the score of the $j$-th tag of the $i$-th word in the sentence.
In BiLSTM-CRF networks, the emission scores come from the BiLSTM layer. For instance, according to the figure above, the score of $w_0$ labeled as B-Person is 1.5.
For a sentence of length $n$, the emission matrix $P$ consists of $n$ hidden states of dimension $k$.
Transition score
CRF introduces a transition matrix $A$, which is position independent and measures the score of transitioning from the $i$-th tag to the $j$-th tag via the element $A[i, j]$.
In order to make the transition score matrix more robust, we add the START and END tags of a sentence to the set of possible tags. $A$ is therefore a square matrix of size $(k+2) \times (k+2)$.
Here is an example of the transition score matrix including the extra START and END labels.
| Transition Matrix | START | B-Person | I-Person | B-Organization | I-Organization | O | END |
|---|---|---|---|---|---|---|---|
| START | 0 | 0.8 | 0.007 | 0.7 | 0.0008 | 0.9 | 0.08 |
| B-Person | 0 | 0.6 | 0.9 | 0.2 | 0.0006 | 0.6 | 0.009 |
| I-Person | -1 | 0.5 | 0.53 | 0.55 | 0.0003 | 0.85 | 0.008 |
| B-Organization | 0.9 | 0.5 | 0.0003 | 0.25 | 0.8 | 0.77 | 0.006 |
| I-Organization | -0.9 | 0.45 | 0.007 | 0.7 | 0.65 | 0.76 | 0.2 |
| O | 0 | 0.65 | 0.0007 | 0.7 | 0.0008 | 0.9 | 0.08 |
| END | | | | | | | |
As shown in the table above, the transition matrix has learned some useful constraints:
The label of the first word in a sentence should start with “B-” or “O”, not “I-”, etc.
Where does the transition matrix come from?
The transition matrix is a parameter of the CRF layer. It is initialized with random values, which gradually become more and more reasonable as training proceeds.
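A sketch of that initialization (tag names taken from the example table above; the random scale is an assumption):

```python
import numpy as np

tags = ["B-Person", "I-Person", "B-Organization", "I-Organization", "O"]
k = len(tags)
# The CRF transition matrix A is a trainable (k+2) x (k+2) parameter;
# the two extra rows/columns are the added START and END tags.
rng = np.random.default_rng(42)
A = rng.normal(scale=0.1, size=(k + 2, k + 2))  # random init, refined by training
tag_index = {t: i for i, t in enumerate(tags + ["START", "END"])}
```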
Decoding
For a sentence with 5 words $x_1, x_2, x_3, x_4, x_5$, the real tags are:
"START B-Person I-Person O B-Organization O END"
Here, we add two more extra words which denote the start and the end of the sentence: $x_0, x_6$. A linear-chain CRF defines a global score $s(x, y)$ that consists of 2 parts, such that:
$$s(x, y) = s_e(x, y) + s_t(y) = \sum_{i=1}^{n} P[x_i, y_i] + \sum_{i=0}^{n} A[y_i, y_{i+1}]$$
Emission Score
where $P_0$ and $P_6$ are simply set to zero, and $P_1, \cdots, P_5$ come from the previous BiLSTM. For the example sentence:

$$s_e(x, y) = P[x_0, \text{START}] + P[x_1, \text{B-Person}] + \cdots + P[x_6, \text{END}]$$
Transition Score
the transition scores $A$ are actually the parameters of the CRF layer. For the example sentence:

$$s_t(y) = A[\text{START}, \text{B-Person}] + A[\text{B-Person}, \text{I-Person}] + \cdots + A[\text{O}, \text{END}]$$
Illustration of the scoring of a sentence with a linear-chain CRF:
The path PER-O-LOC has a score of: 1 + 10 + 4 + 3 + 2 + 11 + 0 = 31
The path PER-PER-LOC has a score of: 1 + 10 + 2 + 4 - 2 + 11 + 0 = 26
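The path-score computation can be sketched as follows (all scores below are made-up toy values, not the ones from the figure):

```python
import numpy as np

def path_score(P, A, y, start, end):
    """s(x, y) = sum of emission scores P[i, y_i] plus transition scores,
    including START -> y_1 and y_n -> END (START/END emissions are zero)."""
    n = len(y)
    emission = sum(P[i, y[i]] for i in range(n))
    transition = A[start, y[0]] + A[y[-1], end]
    transition += sum(A[y[i], y[i + 1]] for i in range(n - 1))
    return emission + transition

# Toy setup: 3 words, 2 real tags (0 and 1); indices 2 and 3 are START/END.
P = np.array([[1.0, 0.2],
              [0.5, 2.0],
              [0.3, 0.4]])             # made-up emission scores
A = np.zeros((4, 4))
A[:2, :2] = [[0.1, 0.5],
             [0.2, 0.3]]              # made-up transition scores
START, END = 2, 3
s = path_score(P, A, [0, 1, 0], START, END)  # 3.3 emission + 0.7 transition
```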
Now that we understand the scoring function of the CRF, we need to do 2 things:
Find the sequence of tags with the best score.
Compute a probability distribution over all the sequences of tags (the total score).
The simplest way to compute the total score is to enumerate all the possible paths and sum their scores. However, this is very inefficient, with $O(k^n)$ complexity. The recurrent nature of our formula makes it a perfect candidate for dynamic programming.
Viterbi algorithm for the optimal path
Let's suppose that $c(y_t, x)$ is the best score over time steps $t, \cdots, n$ for tag sequences that start with tag $y_t$:

$$\begin{aligned} c(y_t, x) &= \max_{y_{t+1}, \cdots, y_n} s(y_t, \cdots, y_n, x) \\ &= \max_{y_{t+1}, \cdots, y_n} P[x_t, y_t] + A[y_t, y_{t+1}] + s(y_{t+1}, \cdots, y_n, x) \\ &= \max_{y_{t+1}} \left( P[x_t, y_t] + A[y_t, y_{t+1}] + c(y_{t+1}, x) \right) \end{aligned}$$
The best score and path are:

$$s(x, y^*) = c(y_0 = \text{START}, x), \qquad y^* = \arg\max_{\tilde{y} \in Y} s(x, \tilde{y})$$

As we perform $n$ such steps, the final cost is $O(nk^2)$, much less than $O(k^n)$. This is the same Viterbi algorithm used for CRF decoding (finding the most likely state sequence given the model and the observations).
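A minimal numpy sketch of this Viterbi recursion (hypothetical random toy scores, checked against brute-force enumeration):

```python
import itertools
import numpy as np

def viterbi(P, A, start, end):
    """Backward recursion c(y_t) = max over the next tag, as in the text.
    P: (n, k) emissions; A: transitions over k real tags plus START/END."""
    n, k = P.shape
    c = np.zeros((n, k))                 # c[t, y]: best suffix score starting at t with tag y
    back = np.zeros((n, k), dtype=int)   # backpointers to the best next tag
    c[n - 1] = P[n - 1] + A[:k, end]     # last step: emission plus y_n -> END
    for t in range(n - 2, -1, -1):
        cand = P[t, :, None] + A[:k, :k] + c[t + 1][None, :]  # indexed [y_t, y_{t+1}]
        back[t] = cand.argmax(axis=1)
        c[t] = cand.max(axis=1)
    first = int(np.argmax(A[start, :k] + c[0]))
    best = A[start, first] + c[0, first]
    path = [first]
    for t in range(n - 1):
        path.append(int(back[t][path[-1]]))
    return best, path

def brute_force(P, A, start, end):
    """O(k^n) enumeration of all paths, for checking only."""
    n, k = P.shape
    best, argbest = -np.inf, None
    for y in itertools.product(range(k), repeat=n):
        s = A[start, y[0]] + A[y[-1], end]
        s += sum(P[i, y[i]] for i in range(n))
        s += sum(A[y[i], y[i + 1]] for i in range(n - 1))
        if s > best:
            best, argbest = s, list(y)
    return best, argbest

rng = np.random.default_rng(1)
k, n = 3, 4
P = rng.normal(size=(n, k))
A = rng.normal(size=(k + 2, k + 2))
START, END = k, k + 1
best_v, path_v = viterbi(P, A, START, END)
best_b, path_b = brute_force(P, A, START, END)
assert np.isclose(best_v, best_b) and path_v == path_b
```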
Dynamic programming for the normalizer
The model's training objective is to maximize the probability of the target tag sequence. A softmax is generally used to convert scores into probabilities:

$$p(y \mid x) = \frac{1}{Z} \exp(s(x, y)), \qquad Z = \sum_{\tilde{y}} \exp(s(x, \tilde{y}))$$

The softmax denominator requires the sum of exponentiated scores over all possible tag sequences, also called the normalizer (partition function). Below we describe a forward-recursion dynamic-programming algorithm for computing it efficiently.
Suppose that at time $t$ the total score $Z_t(y_t)$ of all possible tag sequences ending with tag $y_t$ is known, i.e.

$$\begin{aligned} Z_t(y_t) &= \sum_{y_1, \cdots, y_{t-1}} \exp(s(y_1, \cdots, y_t, x)) \\ &= \sum_{y_{t-1}} \exp(P[x_t, y_t] + A[y_{t-1}, y_t]) \sum_{y_1, \cdots, y_{t-2}} \exp(s(y_1, \cdots, y_{t-1}, x)) \\ &= \sum_{y_{t-1}} \exp(P[x_t, y_t] + A[y_{t-1}, y_t]) \cdot Z_{t-1}(y_{t-1}) \end{aligned}$$
For a sequence with $n$ total time steps, the total score over all possible tag sequences is

$$Z = \sum_{y_n} Z_n(y_n)$$
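A minimal numpy sketch of this forward recursion (toy sizes, made-up random scores), checked against the $O(k^n)$ enumeration:

```python
import itertools
import numpy as np

def partition(P, A, start, end):
    """Forward recursion Z_t(y_t) = sum_{y_{t-1}} exp(P + A) * Z_{t-1}.
    The base case folds in the START transition; the final sum folds in END."""
    n, k = P.shape
    Z = np.exp(P[0] + A[start, :k])                 # Z_1(y_1)
    for t in range(1, n):
        Z = np.exp(P[t]) * (np.exp(A[:k, :k]).T @ Z)  # sum over y_{t-1}
    return float(np.sum(Z * np.exp(A[:k, end])))    # close with y_n -> END

def partition_brute(P, A, start, end):
    """O(k^n) enumeration of all paths, for checking only."""
    n, k = P.shape
    total = 0.0
    for y in itertools.product(range(k), repeat=n):
        s = A[start, y[0]] + A[y[-1], end]
        s += sum(P[i, y[i]] for i in range(n))
        s += sum(A[y[i], y[i + 1]] for i in range(n - 1))
        total += np.exp(s)
    return float(total)

rng = np.random.default_rng(2)
k, n = 3, 4
P = rng.normal(size=(n, k))
A = rng.normal(size=(k + 2, k + 2))
START, END = k, k + 1
assert np.isclose(partition(P, A, START, END), partition_brute(P, A, START, END))
```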
Dynamic programming for the log normalizer
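In practice, to avoid numerical overflow, the forward recursion is carried out in log space with the log-sum-exp trick, computing $\log Z$ directly. A minimal numpy sketch under that assumption (hand-rolled `logsumexp`, toy random scores, checked against direct enumeration):

```python
import itertools
import numpy as np

def logsumexp(a, axis=0):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = a.max(axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)

def log_partition(P, A, start, end):
    """log Z via the same forward recursion, carried out in log space:
    log Z_t(y_t) = P[t, y_t] + logsumexp_{y_{t-1}}(A[y_{t-1}, y_t] + log Z_{t-1})."""
    n, k = P.shape
    logZ = P[0] + A[start, :k]
    for t in range(1, n):
        logZ = P[t] + logsumexp(A[:k, :k] + logZ[:, None], axis=0)
    return float(logsumexp(logZ + A[:k, end], axis=0))

# Sanity check against direct enumeration (feasible only at toy sizes).
rng = np.random.default_rng(3)
k, n = 3, 4
P = rng.normal(size=(n, k))
A = rng.normal(size=(k + 2, k + 2))
START, END = k, k + 1
lz = log_partition(P, A, START, END)
total = 0.0
for y in itertools.product(range(k), repeat=n):
    s = A[START, y[0]] + A[y[-1], END]
    s += sum(P[i, y[i]] for i in range(n))
    s += sum(A[y[i], y[i + 1]] for i in range(n - 1))
    total += np.exp(s)
assert np.isclose(lz, np.log(total))
```

Working in log space keeps every intermediate quantity bounded, whereas the exp-space recursion overflows for long sentences or large scores.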