Sequence Labeling with BiLSTM and CRF

CRF Fundamentals

A conditional random field (CRF) is a class of discriminative algorithms, particularly well suited to prediction tasks in which contextual information or neighboring states influence the current state, such as sequence labeling:

> A discriminative model computes a conditional probability, whereas a generative model computes a joint probability distribution.

  • Part-of-Speech (POS) tagging: determine the part of speech of each word in a sentence, such as noun, adjective, or adverb
  • Named Entity Recognition (NER): determine which kind of entity each word in a sentence belongs to, such as an organization, institution, or person name

HMM as a Generative Model

Given a sentence \(S\) and its output tag sequence \(T\), the HMM joint probability is: \[ \begin{align} P(T|S) &= \frac{P(S|T)\cdot P(T)}{P(S)}\\ P(S,T) &= P(S|T)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|T)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|t_i)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|t_i)\cdot P(t_i|t_{i-1})\\ \end{align} \]
> First expand with Bayes' rule, then simplify using the following assumptions:
- the words are assumed conditionally independent of each other, giving \(\prod_{i=1}^{n}P(w_i|T)\)
- each word's probability depends only on its own tag, giving the emission probabilities \(\prod_{i=1}^{n}P(w_i|t_i)\)
- the Markov assumption with bi-grams gives the transition probabilities \(P(t_i|t_{i-1})\)


The objective function:

\[ (\hat{t}_1,\hat{t}_2,\dots,\hat{t}_n)=\arg\max\prod_{i=1}^{n}P(w_i|t_i)\cdot P(t_i|t_{i-1}) \]

In summary, the HMM assumes two kinds of features: the relation between the current tag and the previous tag, and the relation between the current word and the current tag.
Training an HMM amounts to learning these two probability matrices from the training set, of sizes (t, t) and (w, t) respectively, where w is the number of words and t is the number of tags; a sketch follows.
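
To make the HMM learning step concrete, here is a minimal sketch (not from the original post) that estimates the two probability tables by counting over a tiny hand-made tagged corpus; the corpus, tag names, and helper names are invented for illustration:

```python
from collections import defaultdict

# Toy tagged corpus (invented for illustration).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

transition_counts = defaultdict(lambda: defaultdict(int))  # counts of t_{i-1} -> t_i
emission_counts = defaultdict(lambda: defaultdict(int))    # counts of t_i -> w_i

for sentence in corpus:
    prev_tag = "<START>"
    for word, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word] += 1
        prev_tag = tag

def normalize(counts):
    # Turn counts into conditional probabilities P(column | row).
    return {row: {col: c / sum(cols.values()) for col, c in cols.items()}
            for row, cols in counts.items()}

P_transition = normalize(transition_counts)  # approx. P(t_i | t_{i-1})
P_emission = normalize(emission_counts)      # approx. P(w_i | t_i)

print(P_transition["DET"])  # {'NOUN': 1.0}
print(P_emission["NOUN"])   # {'dog': 0.5, 'cat': 0.5}
```

Decoding the objective above, i.e. finding the tag sequence with the highest product of emission and transition probabilities, is done with the Viterbi algorithm, the same dynamic program used later for the CRF layer.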

CRF as a Discriminative Model

The CRF does not make the assumptions above. Instead, it uses feature functions to express features more abstractly, and is no longer limited to the HMM's two kinds of features.

Feature Functions

In a conditional random field, a feature function \(f_j\) takes as input:

- the sentence \(S\)
- the position \(i\) of a word in the sentence
- the tag \(l_i\) of the current word
- the tag \(l_{i-1}\) of the previous word

and outputs a real value, 0 or 1.

> The above describes a linear-chain conditional random field: each feature function depends only on the current tag and the previous tag, not on arbitrary tags in the sequence. For example, given the previous word "很" ("very"), a feature function helps decide the part of speech of the current word "简单" ("simple").

Assigning each feature function \(f_j\) a weight \(\lambda_j\), we can compute a "score" for a sentence \(s\) with a tag sequence \(l\): \[score(l|s)=\sum_{j=1}^{m}\sum_{i=1}^{n}\lambda_jf_j(s,i,l_i,l_{i-1})\]
> Here \(i\) ranges over positions in the sentence and \(j\) over the feature functions.

The "score" is then turned into a probability distribution: \[p(l|s)=\frac{\exp\big[\sum_{j=1}^{m}\sum_{i=1}^{n}\lambda_jf_j(s,i,l_i,l_{i-1})\big]}{\sum_{l^{\prime}}\exp\big[\sum_{j=1}^{m}\sum_{i=1}^{n}\lambda_jf_j(s,i,l^{\prime}_i,l^{\prime}_{i-1})\big]}\]
> \(l^{\prime}\) ranges over all possible tag sequences.

特征方程示例: \(f_1(s,i,l_i,l_{i-1})\) 表示 \(l_i\) 是否为 副词;若第 \(i\) 个单词以 -ly 结尾,该值为 1,否则为 0。即若该特征对应的权重 \(\lambda_i\) 较大,说明偏向于将该特征的单词标注为 “副词”

\(f_2(s,i,l_i,l_{i-1})\) indicates whether \(l_i\) is a verb; it equals 1 if \(i=1\) and the sentence ends with a question mark, and 0 otherwise.

\(f_3(s,i,l_i,l_{i-1})\) indicates whether \(l_i\) is an adjective; it equals 1 if \(l_{i-1}\) is a noun, and 0 otherwise.

\(f_4(s,i,l_i,l_{i-1})\) indicates whether \(l_i\) is a preposition; it equals 1 if \(l_{i-1}\) is also a preposition, and 0 otherwise. A negative weight for this feature expresses that a preposition should not follow a preposition.

So, to build a conditional random field you define a set of features, assign each a weight, sum them over the whole sequence, and convert the result into a probability, as sketched below.
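
To make this concrete, here is a brute-force sketch (not from the original post) with a couple of hand-written feature functions; the features, weights, tags, and sentence are invented for illustration, and a real implementation would never enumerate all tag sequences:

```python
import math
from itertools import product

TAGS = ["NOUN", "VERB", "ADV"]

# Illustrative feature functions f_j(s, i, l_i, l_prev) -> 0 or 1.
def f1(s, i, l_i, l_prev):
    # word ending in "-ly" tagged as an adverb
    return 1 if l_i == "ADV" and s[i].endswith("ly") else 0

def f2(s, i, l_i, l_prev):
    # verb directly after a noun
    return 1 if l_i == "VERB" and l_prev == "NOUN" else 0

features = [f1, f2]
weights = [1.2, 0.8]  # lambda_j

def score(labels, sentence):
    # score(l|s) = sum_j sum_i lambda_j * f_j(s, i, l_i, l_{i-1})
    total = 0.0
    for i in range(len(sentence)):
        l_prev = labels[i - 1] if i > 0 else "<START>"
        total += sum(w * f(sentence, i, labels[i], l_prev)
                     for w, f in zip(weights, features))
    return total

def probability(labels, sentence):
    # p(l|s) = exp(score(l|s)) / sum over all l' of exp(score(l'|s)), brute force
    z = sum(math.exp(score(list(l), sentence))
            for l in product(TAGS, repeat=len(sentence)))
    return math.exp(score(labels, sentence)) / z

sentence = ["dogs", "bark", "loudly"]
print(probability(["NOUN", "VERB", "ADV"], sentence))
```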

Loss Function

Putting this together, given training data \(D=\big[(x^1,y^1),(x^2,y^2),\dots,(x^m,y^m)\big]\), where \(m\) is the number of sentences,
the parameters \(\lambda\) are estimated by maximum likelihood: \[ \begin{align} L(\lambda,D) &= \log\Big(\prod_{k=1}^{m}P(y^k|x^k,\lambda)\Big)\\ &= \sum_{k=1}^{m}\Big[\log\frac{1}{Z(x^k)}+\sum_{j=1}^{n}\sum_{i=1}^{l}\lambda_jf_j(x^k,i,y^k_i,y^k_{i-1})\Big] \end{align} \]
> Here \(k\) indexes the sentences (\(m\) in total), \(i\) indexes the words of a sentence (\(l\) in total), \(j\) indexes the features (\(n\) in total), and \(\frac{1}{Z(x^k)}\) is the normalization (partition) term.

The negative of this log-likelihood can then be minimized with gradient descent to obtain the parameters \(\lambda\).

Relationship to the HMM

If for every tag pair we define an indicator feature \(f(x,i,y_i,y_{i-1})=1\) with weight \(\lambda=\log p(y_i|y_{i-1})\), and for every word/tag pair an indicator feature with weight \(\lambda=\log p(x_i|y_i)\), the HMM can be recovered from the CRF: the HMM is a special case of the CRF.
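
Written out with the score defined earlier (a short derivation, not spelled out in the original post), these indicator features and weights make the CRF score of a tag sequence equal the log of the HMM joint probability: \[ \begin{align} score(y|x) &= \sum_{i=1}^{n}\log p(y_i|y_{i-1}) + \sum_{i=1}^{n}\log p(x_i|y_i)\\ &= \log\Big(\prod_{i=1}^{n} p(y_i|y_{i-1})\cdot p(x_i|y_i)\Big)\\ &= \log P(x,y) \end{align} \] so exponentiating and normalizing gives the same distribution over tag sequences as the HMM.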

Analogy with logistic regression:

Logistic regression is a log-linear model for classification, while the CRF is a log-linear model for sequence labeling.

The key question: what serves as the feature functions in the CRF model?

  • The feature functions constrain the output tag sequence, i.e. they model the relations between tags. They can be learned as a model parameter of shape (tag_size, tag_size), where entry (i, j) represents the relation tag i <- tag j (the score of transitioning to i from j); a minimal sketch follows.
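
A minimal PyTorch sketch of this idea (mirroring the `transitions` parameter that appears in the full implementation later in this post):

```python
import torch
import torch.nn as nn

tag_size = 5
# Learnable transition scores: transitions[i, j] is the score of
# transitioning *to* tag i *from* tag j. Randomly initialized and
# updated by gradient descent like any other parameter.
transitions = nn.Parameter(torch.randn(tag_size, tag_size))
```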

Named Entity Recognition with BiLSTM+CRF

Reference: CRF Layer on the Top of BiLSTM

- The word representations of the input sequence are processed by a BiLSTM, which produces, for each word, a score for every entity tag.
- This sequence of score distributions is then fed into a CRF layer, which outputs the final tag sequence.
- The BiLSTM layer alone can already assign an entity tag to each word.
- The CRF layer adds constraints on top of the BiLSTM output; these constraints are the CRF feature functions.

Transition and Emission Matrices

  • The output of the BiLSTM layer is the emission score matrix \(E\) of shape (seq_len, tag_size), where \(E_{i,j}\) is the score of the \(i\)-th word belonging to the \(j\)-th tag, e.g. \(E_{w_0,B-Person}=1.5\).

  • \(t_{i,j}\) denotes the transition score, e.g. \(t_{B-Person,I-Person}=0.9\) means the transition B-Person --> I-Person has a score of 0.9. Every pair of tags has a transition score.

    • Two extra tags, START and END, mark the beginning and the end of a sentence.

|  | START | B-Person | I-Person | B-Organization | I-Organization | O | END |
| --- | --- | --- | --- | --- | --- | --- | --- |
| START | 0 | 0.8 | 0.007 | 0.7 | 0.0008 | 0.9 | 0.08 |
| B-Person | 0 | 0.6 | 0.9 | 0.2 | 0.0006 | 0.6 | 0.009 |
| I-Person | -1 | 0.5 | 0.53 | 0.55 | 0.0003 | 0.85 | 0.008 |
| B-Organization | 0.9 | 0.5 | 0.0003 | 0.25 | 0.8 | 0.77 | 0.006 |
| I-Organization | -0.9 | 0.45 | 0.007 | 0.7 | 0.65 | 0.76 | 0.2 |
| O | 0 | 0.65 | 0.0007 | 0.7 | 0.0008 | 0.9 | 0.08 |
| END | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

    • As the table shows, the transition matrix has learned some useful patterns:
      • a sentence should start with B-Person or B-Organization rather than I-Person, and so on.
    • The transition matrix is a model parameter: it is randomly initialized before training and updated as training proceeds, rather than being set by hand.

Loss Function

For a sentence of 5 words, the possible tag sequences are, for example:

```
1)  START B-Person B-Person B-Person B-Person B-Person END
2)  START B-Person I-Person B-Person B-Person B-Person END
......
10) START B-Person I-Person O B-Organization O END
......
N)  O O O O O O O
```
Suppose each possible sequence has a score \(P_i\), and there are \(N\) paths in total; then the sum of all path scores is \(P_{total}=P_1+P_2+\dots+P_{N}=e^{S_1}+e^{S_2}+\dots+e^{S_N}\).
Suppose the 10th path is the true tag path; then its score \(P_{10}\) should be the largest, and the quantity to maximize is \(\frac{P_{RealPath}}{P_1+P_2+\dots+P_{N}}\).
Taking the negative log turns this into a loss that gradient descent can minimize: \[ \begin{align} \text{Loss} &= -\log \frac{P_{RealPath}}{P_1+P_2+\dots+P_{N}}\\ &= -\log \frac{e^{S_{RealPath}}}{e^{S_1}+e^{S_2}+\dots+e^{S_N}}\\ &= -\big(S_{RealPath}-\log(e^{S_1}+e^{S_2}+\dots+e^{S_N})\big)\\ \end{align} \] where \(S_{RealPath}\) is the score of the real path and \(\log(e^{S_1}+e^{S_2}+\dots+e^{S_N})\) is the log-sum-exp of all path scores.
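
A tiny numeric sanity check of this formula (the path scores are invented for illustration):

```python
import torch

# Illustrative path scores S_1..S_4; suppose the 2nd path is the real path.
path_scores = torch.tensor([2.0, 5.0, 1.5, 3.0])
real_path_score = path_scores[1]

# Loss = log(e^{S_1} + ... + e^{S_N}) - S_RealPath
loss = torch.logsumexp(path_scores, dim=0) - real_path_score
print(loss)  # ~0.195: small, because the real path already has the highest score
```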

Score of the Real Path

The real path is START B-Person I-Person O B-Organization O END. To compute its score \(P_i=e^{S_i}\), we first need \(S_i\):

- the sentence above has 5 words, \(w_1,w_2,w_3,w_4,w_5\)
- plus the start and end tags START, END

\(S_i=\text{EmissionScore}+\text{TransitionScore}\)

- \(EmissionScore=x_{0,START}+x_{1,B-Person}+x_{2,I-Person}+x_{3,O}+x_{4,B-Organization}+x_{5,O}+x_{6,END}\)
- \(x_{i,label}\) is the score of the \(i\)-th word taking the tag label, read directly from the BiLSTM emission matrix \(E\)
- \(x_{0,START}\) and \(x_{6,END}\) are simply set to 0

  • \(TransitionScore=t_{START \to B-Person} + t_{B-Person \to I-Person} + t_{I-Person \to O} + t_{O \to B-Organization} + t_{B-Organization \to O} + t_{O \to END}\)
    • these scores come from the CRF layer's transition matrix
```python
def _score_sentence(self, feats, tags):
    """
    feats: emission matrix, the output of the LSTM layer, shape (seq_len, num_tags)
    tags: the ground-truth tag sequence

    The example code has no batch dimension and processes one sequence at a time.
    """
    score = torch.zeros(1)
    tags = torch.cat([
        torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags
    ])
    for i, feat in enumerate(feats):
        # transition score from the previous tag to the current tag,
        # plus the emission score of the current tag
        score = score + \
            self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    # finally, the transition from the last tag to STOP_TAG
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score
```

How do we compute the total score of all possible paths?

  • Suppose we have a sequence of length \(n\), e.g. {w0, w1, w2}, and \(m\) tags, e.g. {l1, l2}, together with the emission matrix \(x_{ij}\) and the transition matrix \(t_{ij}\).

  • For two consecutive words \((w_t,w_{t+1})\) taking the tag pair \((l_a, l_b)\), the score of that step is \(x_{t,a}+x_{t+1,b}+t_{a,b}\): the score of word t taking tag a, the score of word t+1 taking tag b, and the transition score a -> b.

  • The paths are all tag combinations (l1,l1,l1), (l1,l1,l2), (l1,l2,l1), ..., \(m^n\) in total (8 in this example). The total score is the log-sum-exp over the scores of all these paths; a brute-force sketch follows.
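
Before the dynamic-programming version, here is a brute-force sketch (illustrative, not from the post) that enumerates all \(m^n\) paths and takes the log-sum-exp of their scores; it is only feasible for toy sizes, but is useful for checking the forward algorithm below. Note that this sketch indexes the transition matrix as [from, to], whereas the model code later uses [to, from]:

```python
import torch
from itertools import product

def all_paths_score(emissions, transitions):
    """
    emissions: (seq_len, tag_size), the x_{ij} emission scores
    transitions: (tag_size, tag_size), transitions[a, b] = score of a -> b
    Returns log( sum over all tag paths of exp(path score) ).
    """
    seq_len, tag_size = emissions.shape
    path_scores = []
    for path in product(range(tag_size), repeat=seq_len):  # m^n paths
        s = emissions[0, path[0]]
        for t in range(1, seq_len):
            s = s + transitions[path[t - 1], path[t]] + emissions[t, path[t]]
        path_scores.append(s)
    return torch.logsumexp(torch.stack(path_scores), dim=0)

# 3 words, 2 tags -> 2^3 = 8 paths
emissions = torch.randn(3, 2)
transitions = torch.randn(2, 2)
print(all_paths_score(emissions, transitions))
```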

```python
import torch

def log_sum_exp(vec):
    """
    [3,4,5] --> log( e^3 + e^4 + e^5 )
            --> log( e^5 * (e^(3-5) + e^(4-5) + e^(5-5)) )
            --> 5 + log( e^(3-5) + e^(4-5) + e^(5-5) )

    Subtracting the max first keeps the exponentials numerically stable.
    """
    max_score, idx = torch.max(vec, 1)
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(
        torch.sum(torch.exp(vec - max_score_broadcast)))


vec = torch.tensor([[3., 4., 5.]])
log_sum_exp(vec)
# tensor([5.4076])
```

Dynamic-programming solution:

  • Keep a vector \(D\) holding the accumulated score of the current word taking each tag; the entries for the current word can be derived from the previous word's \(D\) vector:
    • word w0 has no predecessor, so there is no transition score: \(D=[\log(e^{x_{01}}), \log(e^{x_{02}})]\)
    • word w1 (on the right-hand side, \(d_1,d_2\) are the values from the previous step):
      • score of choosing tag l1: \(d_1 = \log(e^{d_1+x_{11}+t_{11}}+e^{d_2 + x_{11}+t_{21}})\)
      • score of choosing tag l2: \(d_2 = \log(e^{d_1+x_{12}+t_{12}}+e^{d_2+x_{12}+t_{22}})\)
    • likewise for word w2:
      • score of choosing tag l1: \(d_1 = \log(e^{d_1+x_{21}+t_{11}}+e^{d_2+x_{21}+t_{21}})\)
      • score of choosing tag l2: \(d_2 = \log(e^{d_1+x_{22}+t_{12}}+e^{d_2+x_{22}+t_{22}})\)
    • from the \(D\) vector of the last word, the total score over all paths is \(\log(e^{d_1} + e^{d_2})\)
```python
def _forward_alg(self, feats):
    """
    feats: (seq_len, tag_size)
    """
    # scores of the previous word taking each tag
    init_alphas = torch.full((1, self.tag_size), -10000.)
    init_alphas[0][self.tag2idx[START_TAG]] = 0.  # start tag

    forward_var = init_alphas

    for feat in feats:
        alphas_t = []  # scores of the current word taking each tag
        for next_tag in range(self.tag_size):

            # emission score of the current word taking this tag
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tag_size)

            # transition scores from every previous tag to this tag
            trans_score = self.transitions[next_tag].view(1, -1)

            # add the previous word's scores, then take log-sum-exp:
            # this gives the score of the current word taking this tag
            next_tag_var = forward_var + trans_score + emit_score

            alphas_t.append(log_sum_exp(next_tag_var).view(1))

        forward_var = torch.cat(alphas_t).view(1, -1)  # becomes the input of the next step
    terminal_var = forward_var + self.transitions[self.tag2idx[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha
```

The model loss is therefore: \[ \begin{align} \text{Loss} = \log(e^{S_1}+e^{S_2}+\dots+e^{S_N}) - S_{RealPath} \end{align} \] where \(\log(e^{S_1}+e^{S_2}+\dots+e^{S_N})\) is the score of all paths and \(S_{RealPath}\) is the score of the real path.

```python
def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)        # emission matrix
    forward_score = self._forward_alg(feats)         # score of all possible paths
    gold_score = self._score_sentence(feats, tags)   # real path score
    return forward_score - gold_score
```

Making Predictions

How do we make predictions once the model is trained?

> Usually a model's forward method produces the prediction, and the loss is just the cross-entropy or MSE between the prediction and the target; no extra parameters are introduced in that step. In a CRF model, however, computing the loss between prediction and target also involves the transition matrix as a parameter, which is why the dedicated loss function above is needed.

Decoding with the Viterbi algorithm: the input passes through the LSTM layer to obtain the emission matrix, this is combined with the transition matrix learned during training (the feature functions), and the best path is then selected among all possible paths.

```python
def _viterbi_decode(self, feats):
    # backpointers[i][j] records which previous tag lies on the best path
    # ending with word i taking tag j
    backpointers = []

    # Initialize the viterbi variables in log space
    init_vvars = torch.full((1, self.tagset_size), -10000.)
    init_vvars[0][self.tag_to_ix[START_TAG]] = 0

    # best scores for each tag at the previous step
    forward_var = init_vvars
    for feat in feats:
        bptrs_t = []  # holds the backpointers for this step
        viterbivars_t = []  # holds the viterbi variables for this step

        for next_tag in range(self.tagset_size):
            # scores of reaching next_tag from each previous tag
            next_tag_var = forward_var + self.transitions[next_tag]

            # index of the previous tag on the best path to the current tag
            best_tag_id = argmax(next_tag_var)
            bptrs_t.append(best_tag_id)

            # score of that best path
            viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))

        # add the emission scores
        forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
        backpointers.append(bptrs_t)

    # Transition to STOP_TAG
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    best_tag_id = argmax(terminal_var)

    path_score = terminal_var[0][best_tag_id]  # score of the best path

    # follow the backpointers to recover the best path
    best_path = [best_tag_id]
    for bptrs_t in reversed(backpointers):
        best_tag_id = bptrs_t[best_tag_id]
        best_path.append(best_tag_id)
    # Pop off the start tag (we don't want to return that to the caller)
    start = best_path.pop()
    assert start == self.tag_to_ix[START_TAG]  # Sanity check
    best_path.reverse()
    return path_score, best_path
```

Complete BiLSTM+CRF Code

```python
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
# device(type='cuda')
```
```python
def argmax(vec):
    """
    vec: (1, n)
    """
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    """
    word seq --> idx seq
    """
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


def log_sum_exp(vec):
    """
    [3,4,5] --> log( e^3 + e^4 + e^5 )
            --> log( e^5 * (e^(3-5) + e^(4-5) + e^(5-5)) )
            --> 5 + log( e^(3-5) + e^(4-5) + e^(5-5) )
    """
    # vec must be a float tensor: torch.exp is not implemented for long/int
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(
        torch.sum(torch.exp(vec - max_score_broadcast)))
```
```python
class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim // 2,
                            num_layers=1,
                            bidirectional=True)

        # Maps the output of the LSTM into tag space.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Matrix of transition parameters. Entry i,j is the score of
        # transitioning *to* i *from* j.
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two statements enforce the constraint that we never transfer
        # to the start tag and we never transfer from the stop tag
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        # Do the forward algorithm to compute the partition function
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # Iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(1, -1).expand(
                    1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the
                # scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        # Gives the score of a provided tag sequence
        score = torch.zeros(1)
        tags = torch.cat([
            torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags
        ])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we don't want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        feats = self._get_lstm_features(sentence)
        forward_score = self._forward_alg(feats)
        gold_score = self._score_sentence(feats, tags)
        return forward_score - gold_score

    def forward(self, sentence):  # don't confuse this with _forward_alg above.
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence)

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq
```
```python
START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5
HIDDEN_DIM = 4

# Make up some training data
training_data = [
    ("the wall street journal reported today that apple corporation made money"
     .split(), "B I I I O O O B I O O".split()),
    ("georgia tech is a university in georgia".split(),
     "B I O O O O B".split())
]

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Check predictions before training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]],
                                 dtype=torch.long)
    print(model(precheck_sent))

for epoch in range(300):
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Run our forward pass.
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()

# Check predictions after training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    print(model(precheck_sent))

# (tensor(2.6907), [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1])
# (tensor(20.4906), [0, 1, 1, 1, 2, 2, 2, 0, 1, 2, 2])
```
