Sequence Labeling with BiLSTM and CRF

CRF Fundamentals

A conditional random field (CRF) is a class of discriminative algorithms, particularly well suited to prediction tasks in which contextual information or neighboring states influence the current state, such as sequence labeling:

> A discriminative model computes a conditional probability, whereas a generative model computes a joint probability distribution.

  • Part-of-Speech (POS) tagging: determine the part of speech of each word in a sentence, such as noun, adjective, or adverb
  • Named Entity Recognition (NER): determine which kind of entity each word in a sentence belongs to, such as an organization, institution, or person name

HMM as a Generative Model

Given a sentence \(S\) and its output tag sequence \(T\), the HMM joint probability is: \[ \begin{align} P(T|S) &= \frac{P(S|T)\cdot P(T)}{P(S)}\\ P(S,T) &= P(S|T)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|T)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|t_i)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|t_i)\cdot P(t_i|t_{i-1})\\ \end{align} \]
> First expand with Bayes' rule, then simplify using the following assumptions:
- the words are assumed conditionally independent of each other, giving \(\prod_{i=1}^{n}P(w_i|T)\)
- each word's probability depends only on its own tag, giving the emission probabilities \(\prod_{i=1}^{n}P(w_i|t_i)\)
- the Markov assumption with bi-grams gives the transition probabilities \(P(t_i|t_{i-1})\)


The objective function:

\[ (\hat{t}_1,\hat{t}_2,\dots,\hat{t}_n)=\arg\max\prod_{i=1}^{n}P(w_i|t_i)\cdot P(t_i|t_{i-1}) \]

In summary, the HMM assumes two kinds of features: the relation between the current tag and the previous tag, and the relation between the current word and the current tag.
Training an HMM amounts to learning these two probability matrices from the training set, of sizes (t, t) and (w, t) respectively, where w is the number of words and t is the number of tags; a sketch follows.
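
To make the HMM learning step concrete, here is a minimal sketch (not from the original post) that estimates the two probability tables by counting over a tiny hand-made tagged corpus; the corpus, tag names, and helper names are invented for illustration:

```python
from collections import defaultdict

# Toy tagged corpus (invented for illustration).
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]

transition_counts = defaultdict(lambda: defaultdict(int))  # counts of t_{i-1} -> t_i
emission_counts = defaultdict(lambda: defaultdict(int))    # counts of t_i -> w_i

for sentence in corpus:
    prev_tag = "<START>"
    for word, tag in sentence:
        transition_counts[prev_tag][tag] += 1
        emission_counts[tag][word] += 1
        prev_tag = tag

def normalize(counts):
    # Turn counts into conditional probabilities P(column | row).
    return {row: {col: c / sum(cols.values()) for col, c in cols.items()}
            for row, cols in counts.items()}

P_transition = normalize(transition_counts)  # approx. P(t_i | t_{i-1})
P_emission = normalize(emission_counts)      # approx. P(w_i | t_i)

print(P_transition["DET"])  # {'NOUN': 1.0}
print(P_emission["NOUN"])   # {'dog': 0.5, 'cat': 0.5}
```

Decoding the objective above, i.e. finding the tag sequence with the highest product of emission and transition probabilities, is done with the Viterbi algorithm, the same dynamic program used later for the CRF layer.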

CRF as a Discriminative Model

The CRF does not make the assumptions above. Instead, it uses feature functions to express features more abstractly, and is no longer limited to the HMM's two kinds of features.

Feature Functions

In a conditional random field, a feature function \(f_j\) takes as input:

- the sentence \(S\)
- the position \(i\) of a word in the sentence
- the tag \(l_i\) of the current word
- the tag \(l_{i-1}\) of the previous word

and outputs a real value, 0 or 1.

> The above describes a linear-chain conditional random field: each feature function depends only on the current tag and the previous tag, not on arbitrary tags in the sequence. For example, given the previous word "很" ("very"), a feature function helps decide the part of speech of the current word "简单" ("simple").

Assigning each feature function \(f_j\) a weight \(\lambda_j\), we can compute a "score" for a sentence \(s\) with a tag sequence \(l\): \[score(l|s)=\sum_{j=1}^{m}\sum_{i=1}^{n}\lambda_jf_j(s,i,l_i,l_{i-1})\]
> Here \(i\) ranges over positions in the sentence and \(j\) over the feature functions.

The "score" is then turned into a probability distribution: \[p(l|s)=\frac{\exp\big[\sum_{j=1}^{m}\sum_{i=1}^{n}\lambda_jf_j(s,i,l_i,l_{i-1})\big]}{\sum_{l^{\prime}}\exp\big[\sum_{j=1}^{m}\sum_{i=1}^{n}\lambda_jf_j(s,i,l^{\prime}_i,l^{\prime}_{i-1})\big]}\]
> \(l^{\prime}\) ranges over all possible tag sequences.

特征方程示例: \(f_1(s,i,l_i,l_{i-1})\) 表示 \(l_i\) 是否为 副词;若第 \(i\) 个单词以 -ly 结尾,该值为 1,否则为 0。即若该特征对应的权重 \(\lambda_i\) 较大,说明偏向于将该特征的单词标注为 “副词”

\(f_2(s,i,l_i,l_{i-1})\) indicates whether \(l_i\) is a verb; it equals 1 if \(i=1\) and the sentence ends with a question mark, and 0 otherwise.

\(f_3(s,i,l_i,l_{i-1})\) indicates whether \(l_i\) is an adjective; it equals 1 if \(l_{i-1}\) is a noun, and 0 otherwise.

\(f_4(s,i,l_i,l_{i-1})\) indicates whether \(l_i\) is a preposition; it equals 1 if \(l_{i-1}\) is also a preposition, and 0 otherwise. A negative weight for this feature expresses that a preposition should not follow a preposition.

So, to build a conditional random field you define a set of features, assign each a weight, sum them over the whole sequence, and convert the result into a probability, as sketched below.
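
To make this concrete, here is a brute-force sketch (not from the original post) with a couple of hand-written feature functions; the features, weights, tags, and sentence are invented for illustration, and a real implementation would never enumerate all tag sequences:

```python
import math
from itertools import product

TAGS = ["NOUN", "VERB", "ADV"]

# Illustrative feature functions f_j(s, i, l_i, l_prev) -> 0 or 1.
def f1(s, i, l_i, l_prev):
    # word ending in "-ly" tagged as an adverb
    return 1 if l_i == "ADV" and s[i].endswith("ly") else 0

def f2(s, i, l_i, l_prev):
    # verb directly after a noun
    return 1 if l_i == "VERB" and l_prev == "NOUN" else 0

features = [f1, f2]
weights = [1.2, 0.8]  # lambda_j

def score(labels, sentence):
    # score(l|s) = sum_j sum_i lambda_j * f_j(s, i, l_i, l_{i-1})
    total = 0.0
    for i in range(len(sentence)):
        l_prev = labels[i - 1] if i > 0 else "<START>"
        total += sum(w * f(sentence, i, labels[i], l_prev)
                     for w, f in zip(weights, features))
    return total

def probability(labels, sentence):
    # p(l|s) = exp(score(l|s)) / sum over all l' of exp(score(l'|s)), brute force
    z = sum(math.exp(score(list(l), sentence))
            for l in product(TAGS, repeat=len(sentence)))
    return math.exp(score(labels, sentence)) / z

sentence = ["dogs", "bark", "loudly"]
print(probability(["NOUN", "VERB", "ADV"], sentence))
```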

Loss Function

Putting this together, given training data \(D=\big[(x^1,y^1),(x^2,y^2),\dots,(x^m,y^m)\big]\), where \(m\) is the number of sentences,
the parameters \(\lambda\) are estimated by maximum likelihood: \[ \begin{align} L(\lambda,D) &= \log\Big(\prod_{k=1}^{m}P(y^k|x^k,\lambda)\Big)\\ &= \sum_{k=1}^{m}\Big[\log\frac{1}{Z(x^k)}+\sum_{j=1}^{n}\sum_{i=1}^{l}\lambda_jf_j(x^k,i,y^k_i,y^k_{i-1})\Big] \end{align} \]
> Here \(k\) indexes the sentences (\(m\) in total), \(i\) indexes the words of a sentence (\(l\) in total), \(j\) indexes the features (\(n\) in total), and \(\frac{1}{Z(x^k)}\) is the normalization (partition) term.

The negative of this log-likelihood can then be minimized with gradient descent to obtain the parameters \(\lambda\).

Relationship to the HMM

If for every tag pair we define an indicator feature \(f(x,i,y_i,y_{i-1})=1\) with weight \(\lambda=\log p(y_i|y_{i-1})\), and for every word/tag pair an indicator feature with weight \(\lambda=\log p(x_i|y_i)\), the HMM can be recovered from the CRF: the HMM is a special case of the CRF.
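
Written out with the score defined earlier (a short derivation, not spelled out in the original post), these indicator features and weights make the CRF score of a tag sequence equal the log of the HMM joint probability: \[ \begin{align} score(y|x) &= \sum_{i=1}^{n}\log p(y_i|y_{i-1}) + \sum_{i=1}^{n}\log p(x_i|y_i)\\ &= \log\Big(\prod_{i=1}^{n} p(y_i|y_{i-1})\cdot p(x_i|y_i)\Big)\\ &= \log P(x,y) \end{align} \] so exponentiating and normalizing gives the same distribution over tag sequences as the HMM.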

Analogy with logistic regression:

Logistic regression is a log-linear model for classification, while the CRF is a log-linear model for sequence labeling.

The key question: what serves as the feature functions in the CRF model?

  • The feature functions constrain the output tag sequence, i.e. they model the relations between tags. They can be learned as a model parameter of shape (tag_size, tag_size), where entry (i, j) represents the relation tag i <- tag j (the score of transitioning to i from j); a minimal sketch follows.
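
A minimal PyTorch sketch of this idea (mirroring the `transitions` parameter that appears in the full implementation later in this post):

```python
import torch
import torch.nn as nn

tag_size = 5
# Learnable transition scores: transitions[i, j] is the score of
# transitioning *to* tag i *from* tag j. Randomly initialized and
# updated by gradient descent like any other parameter.
transitions = nn.Parameter(torch.randn(tag_size, tag_size))
```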

Named Entity Recognition with BiLSTM+CRF

Reference: CRF Layer on the Top of BiLSTM

- The word representations of the input sequence are processed by a BiLSTM, which produces, for each word, a score for every entity tag.
- This sequence of score distributions is then fed into a CRF layer, which outputs the final tag sequence.
- The BiLSTM layer alone can already assign an entity tag to each word.
- The CRF layer adds constraints on top of the BiLSTM output; these constraints are the CRF feature functions.

Transition and Emission Matrices

  • The output of the BiLSTM layer is the emission score matrix \(E\) of shape (seq_len, tag_size), where \(E_{i,j}\) is the score of the \(i\)-th word belonging to the \(j\)-th tag, e.g. \(E_{w_0,B-Person}=1.5\).

  • \(t_{i,j}\) denotes the transition score, e.g. \(t_{B-Person,I-Person}=0.9\) means the transition B-Person --> I-Person has a score of 0.9. Every pair of tags has a transition score.

    • Two extra tags, START and END, mark the beginning and the end of a sentence.

|  | START | B-Person | I-Person | B-Organization | I-Organization | O | END |
| --- | --- | --- | --- | --- | --- | --- | --- |
| START | 0 | 0.8 | 0.007 | 0.7 | 0.0008 | 0.9 | 0.08 |
| B-Person | 0 | 0.6 | 0.9 | 0.2 | 0.0006 | 0.6 | 0.009 |
| I-Person | -1 | 0.5 | 0.53 | 0.55 | 0.0003 | 0.85 | 0.008 |
| B-Organization | 0.9 | 0.5 | 0.0003 | 0.25 | 0.8 | 0.77 | 0.006 |
| I-Organization | -0.9 | 0.45 | 0.007 | 0.7 | 0.65 | 0.76 | 0.2 |
| O | 0 | 0.65 | 0.0007 | 0.7 | 0.0008 | 0.9 | 0.08 |
| END | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

    • As the table shows, the transition matrix has learned some useful patterns:
      • a sentence should start with B-Person or B-Organization rather than I-Person, and so on.
    • The transition matrix is a model parameter: it is randomly initialized before training and updated as training proceeds, rather than being set by hand.

Loss Function

For a sentence of 5 words, the possible tag sequences are, for example:

```
1)  START B-Person B-Person B-Person B-Person B-Person END
2)  START B-Person I-Person B-Person B-Person B-Person END
......
10) START B-Person I-Person O B-Organization O END
......
N)  O O O O O O O
```
Suppose each possible sequence has a score \(P_i\), and there are \(N\) paths in total; then the sum of all path scores is \(P_{total}=P_1+P_2+\dots+P_{N}=e^{S_1}+e^{S_2}+\dots+e^{S_N}\).
Suppose the 10th path is the true tag path; then its score \(P_{10}\) should be the largest, and the quantity to maximize is \(\frac{P_{RealPath}}{P_1+P_2+\dots+P_{N}}\).
Taking the negative log turns this into a loss that gradient descent can minimize: \[ \begin{align} \text{Loss} &= -\log \frac{P_{RealPath}}{P_1+P_2+\dots+P_{N}}\\ &= -\log \frac{e^{S_{RealPath}}}{e^{S_1}+e^{S_2}+\dots+e^{S_N}}\\ &= -\big(S_{RealPath}-\log(e^{S_1}+e^{S_2}+\dots+e^{S_N})\big)\\ \end{align} \] where \(S_{RealPath}\) is the score of the real path and \(\log(e^{S_1}+e^{S_2}+\dots+e^{S_N})\) is the log-sum-exp of all path scores.
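
A tiny numeric sanity check of this formula (the path scores are invented for illustration):

```python
import torch

# Illustrative path scores S_1..S_4; suppose the 2nd path is the real path.
path_scores = torch.tensor([2.0, 5.0, 1.5, 3.0])
real_path_score = path_scores[1]

# Loss = log(e^{S_1} + ... + e^{S_N}) - S_RealPath
loss = torch.logsumexp(path_scores, dim=0) - real_path_score
print(loss)  # ~0.195: small, because the real path already has the highest score
```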

Score of the Real Path

The real path is START B-Person I-Person O B-Organization O END. To compute its score \(P_i=e^{S_i}\), we first need \(S_i\):

- the sentence above has 5 words, \(w_1,w_2,w_3,w_4,w_5\)
- plus the start and end tags START, END

\(S_i=\text{EmissionScore}+\text{TransitionScore}\)

- \(EmissionScore=x_{0,START}+x_{1,B-Person}+x_{2,I-Person}+x_{3,O}+x_{4,B-Organization}+x_{5,O}+x_{6,END}\)
- \(x_{i,label}\) is the score of the \(i\)-th word taking the tag label, read directly from the BiLSTM emission matrix \(E\)
- \(x_{0,START}\) and \(x_{6,END}\) are simply set to 0

  • \(TransitionScore=t_{START \to B-Person} + t_{B-Person \to I-Person} + t_{I-Person \to O} + t_{O \to B-Organization} + t_{B-Organization \to O} + t_{O \to END}\)
    • these scores come from the CRF layer's transition matrix
```python
def _score_sentence(self, feats, tags):
    """
    feats: emission matrix, the output of the LSTM layer, shape (seq_len, num_tags)
    tags: the ground-truth tag sequence

    The example code has no batch dimension and processes one sequence at a time.
    """
    score = torch.zeros(1)
    tags = torch.cat([
        torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags
    ])
    for i, feat in enumerate(feats):
        # transition score from the previous tag to the current tag,
        # plus the emission score of the current tag
        score = score + \
            self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
    # finally, the transition from the last tag to STOP_TAG
    score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
    return score
```

How do we compute the total score of all possible paths?

  • Suppose we have a sequence of length \(n\), e.g. {w0, w1, w2}, and \(m\) tags, e.g. {l1, l2}, together with the emission matrix \(x_{ij}\) and the transition matrix \(t_{ij}\).

  • For two consecutive words \((w_t,w_{t+1})\) taking the tag pair \((l_a, l_b)\), the score of that step is \(x_{t,a}+x_{t+1,b}+t_{a,b}\): the score of word t taking tag a, the score of word t+1 taking tag b, and the transition score a -> b.

  • The paths are all tag combinations (l1,l1,l1), (l1,l1,l2), (l1,l2,l1), ..., \(m^n\) in total (8 in this example). The total score is the log-sum-exp over the scores of all these paths; a brute-force sketch follows.
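
Before the dynamic-programming version, here is a brute-force sketch (illustrative, not from the post) that enumerates all \(m^n\) paths and takes the log-sum-exp of their scores; it is only feasible for toy sizes, but is useful for checking the forward algorithm below. Note that this sketch indexes the transition matrix as [from, to], whereas the model code later uses [to, from]:

```python
import torch
from itertools import product

def all_paths_score(emissions, transitions):
    """
    emissions: (seq_len, tag_size), the x_{ij} emission scores
    transitions: (tag_size, tag_size), transitions[a, b] = score of a -> b
    Returns log( sum over all tag paths of exp(path score) ).
    """
    seq_len, tag_size = emissions.shape
    path_scores = []
    for path in product(range(tag_size), repeat=seq_len):  # m^n paths
        s = emissions[0, path[0]]
        for t in range(1, seq_len):
            s = s + transitions[path[t - 1], path[t]] + emissions[t, path[t]]
        path_scores.append(s)
    return torch.logsumexp(torch.stack(path_scores), dim=0)

# 3 words, 2 tags -> 2^3 = 8 paths
emissions = torch.randn(3, 2)
transitions = torch.randn(2, 2)
print(all_paths_score(emissions, transitions))
```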

```python
import torch

def log_sum_exp(vec):
    """
    [3,4,5] --> log( e^3 + e^4 + e^5 )
            --> log( e^5 * (e^(3-5) + e^(4-5) + e^(5-5)) )
            --> 5 + log( e^(3-5) + e^(4-5) + e^(5-5) )

    Subtracting the max first keeps the exponentials numerically stable.
    """
    max_score, idx = torch.max(vec, 1)
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(
        torch.sum(torch.exp(vec - max_score_broadcast)))


vec = torch.tensor([[3., 4., 5.]])
log_sum_exp(vec)
# tensor([5.4076])
```

Dynamic-programming solution:

  • Keep a vector \(D\) holding the accumulated score of the current word taking each tag; the entries for the current word can be derived from the previous word's \(D\) vector:
    • word w0 has no predecessor, so there is no transition score: \(D=[\log(e^{x_{01}}), \log(e^{x_{02}})]\)
    • word w1 (on the right-hand side, \(d_1,d_2\) are the values from the previous step):
      • score of choosing tag l1: \(d_1 = \log(e^{d_1+x_{11}+t_{11}}+e^{d_2 + x_{11}+t_{21}})\)
      • score of choosing tag l2: \(d_2 = \log(e^{d_1+x_{12}+t_{12}}+e^{d_2+x_{12}+t_{22}})\)
    • likewise for word w2:
      • score of choosing tag l1: \(d_1 = \log(e^{d_1+x_{21}+t_{11}}+e^{d_2+x_{21}+t_{21}})\)
      • score of choosing tag l2: \(d_2 = \log(e^{d_1+x_{22}+t_{12}}+e^{d_2+x_{22}+t_{22}})\)
    • from the \(D\) vector of the last word, the total score over all paths is \(\log(e^{d_1} + e^{d_2})\)
```python
def _forward_alg(self, feats):
    """
    feats: (seq_len, tag_size)
    """
    # scores of the previous word taking each tag
    init_alphas = torch.full((1, self.tag_size), -10000.)
    init_alphas[0][self.tag2idx[START_TAG]] = 0.  # start tag

    forward_var = init_alphas

    for feat in feats:
        alphas_t = []  # scores of the current word taking each tag
        for next_tag in range(self.tag_size):

            # emission score of the current word taking this tag
            emit_score = feat[next_tag].view(1, -1).expand(1, self.tag_size)

            # transition scores from every previous tag to this tag
            trans_score = self.transitions[next_tag].view(1, -1)

            # add the previous word's scores, then take log-sum-exp:
            # this gives the score of the current word taking this tag
            next_tag_var = forward_var + trans_score + emit_score

            alphas_t.append(log_sum_exp(next_tag_var).view(1))

        forward_var = torch.cat(alphas_t).view(1, -1)  # becomes the input of the next step
    terminal_var = forward_var + self.transitions[self.tag2idx[STOP_TAG]]
    alpha = log_sum_exp(terminal_var)
    return alpha
```

The model loss is therefore: \[ \begin{align} \text{Loss} = \log(e^{S_1}+e^{S_2}+\dots+e^{S_N}) - S_{RealPath} \end{align} \] where \(\log(e^{S_1}+e^{S_2}+\dots+e^{S_N})\) is the score of all paths and \(S_{RealPath}\) is the score of the real path.

```python
def neg_log_likelihood(self, sentence, tags):
    feats = self._get_lstm_features(sentence)        # emission matrix
    forward_score = self._forward_alg(feats)         # score of all possible paths
    gold_score = self._score_sentence(feats, tags)   # real path score
    return forward_score - gold_score
```

Making Predictions

How do we make predictions once the model is trained?

> Usually a model's forward method produces the prediction, and the loss is just the cross-entropy or MSE between the prediction and the target; no extra parameters are introduced in that step. In a CRF model, however, computing the loss between prediction and target also involves the transition matrix as a parameter, which is why the dedicated loss function above is needed.

Decoding with the Viterbi algorithm: the input passes through the LSTM layer to obtain the emission matrix, this is combined with the transition matrix learned during training (the feature functions), and the best path is then selected among all possible paths.

```python
def _viterbi_decode(self, feats):
    # backpointers[i][j] records which previous tag lies on the best path
    # ending with word i taking tag j
    backpointers = []

    # Initialize the viterbi variables in log space
    init_vvars = torch.full((1, self.tagset_size), -10000.)
    init_vvars[0][self.tag_to_ix[START_TAG]] = 0

    # best scores for each tag at the previous step
    forward_var = init_vvars
    for feat in feats:
        bptrs_t = []  # holds the backpointers for this step
        viterbivars_t = []  # holds the viterbi variables for this step

        for next_tag in range(self.tagset_size):
            # scores of reaching next_tag from each previous tag
            next_tag_var = forward_var + self.transitions[next_tag]

            # index of the previous tag on the best path to the current tag
            best_tag_id = argmax(next_tag_var)
            bptrs_t.append(best_tag_id)

            # score of that best path
            viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))

        # add the emission scores
        forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
        backpointers.append(bptrs_t)

    # Transition to STOP_TAG
    terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
    best_tag_id = argmax(terminal_var)

    path_score = terminal_var[0][best_tag_id]  # score of the best path

    # follow the backpointers to recover the best path
    best_path = [best_tag_id]
    for bptrs_t in reversed(backpointers):
        best_tag_id = bptrs_t[best_tag_id]
        best_path.append(best_tag_id)
    # Pop off the start tag (we don't want to return that to the caller)
    start = best_path.pop()
    assert start == self.tag_to_ix[START_TAG]  # Sanity check
    best_path.reverse()
    return path_score, best_path
```

Complete BiLSTM+CRF Code

```python
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(1)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device
# device(type='cuda')
```
```python
def argmax(vec):
    """
    vec: (1, n)
    """
    _, idx = torch.max(vec, 1)
    return idx.item()


def prepare_sequence(seq, to_ix):
    """
    word seq --> idx seq
    """
    idxs = [to_ix[w] for w in seq]
    return torch.tensor(idxs, dtype=torch.long)


def log_sum_exp(vec):
    """
    [3,4,5] --> log( e^3 + e^4 + e^5 )
            --> log( e^5 * (e^(3-5) + e^(4-5) + e^(5-5)) )
            --> 5 + log( e^(3-5) + e^(4-5) + e^(5-5) )
    """
    # vec must be a float tensor: torch.exp is not implemented for long/int
    max_score = vec[0, argmax(vec)]
    max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
    return max_score + torch.log(
        torch.sum(torch.exp(vec - max_score_broadcast)))
```
```python
class BiLSTM_CRF(nn.Module):
    def __init__(self, vocab_size, tag_to_ix, embedding_dim, hidden_dim):
        super(BiLSTM_CRF, self).__init__()
        self.embedding_dim = embedding_dim
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        self.tag_to_ix = tag_to_ix
        self.tagset_size = len(tag_to_ix)

        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim,
                            hidden_dim // 2,
                            num_layers=1,
                            bidirectional=True)

        # Maps the output of the LSTM into tag space.
        self.hidden2tag = nn.Linear(hidden_dim, self.tagset_size)

        # Matrix of transition parameters. Entry i,j is the score of
        # transitioning *to* i *from* j.
        self.transitions = nn.Parameter(
            torch.randn(self.tagset_size, self.tagset_size))

        # These two statements enforce the constraint that we never transfer
        # to the start tag and we never transfer from the stop tag
        self.transitions.data[tag_to_ix[START_TAG], :] = -10000
        self.transitions.data[:, tag_to_ix[STOP_TAG]] = -10000

        self.hidden = self.init_hidden()

    def init_hidden(self):
        return (torch.randn(2, 1, self.hidden_dim // 2),
                torch.randn(2, 1, self.hidden_dim // 2))

    def _forward_alg(self, feats):
        # Do the forward algorithm to compute the partition function
        init_alphas = torch.full((1, self.tagset_size), -10000.)
        # START_TAG has all of the score.
        init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

        # Wrap in a variable so that we will get automatic backprop
        forward_var = init_alphas

        # Iterate through the sentence
        for feat in feats:
            alphas_t = []  # The forward tensors at this timestep
            for next_tag in range(self.tagset_size):
                # broadcast the emission score: it is the same regardless of
                # the previous tag
                emit_score = feat[next_tag].view(1, -1).expand(
                    1, self.tagset_size)
                # the ith entry of trans_score is the score of transitioning to
                # next_tag from i
                trans_score = self.transitions[next_tag].view(1, -1)
                # The ith entry of next_tag_var is the value for the
                # edge (i -> next_tag) before we do log-sum-exp
                next_tag_var = forward_var + trans_score + emit_score
                # The forward variable for this tag is log-sum-exp of all the
                # scores.
                alphas_t.append(log_sum_exp(next_tag_var).view(1))
            forward_var = torch.cat(alphas_t).view(1, -1)
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        alpha = log_sum_exp(terminal_var)
        return alpha

    def _get_lstm_features(self, sentence):
        self.hidden = self.init_hidden()
        embeds = self.word_embeds(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        lstm_out = lstm_out.view(len(sentence), self.hidden_dim)
        lstm_feats = self.hidden2tag(lstm_out)
        return lstm_feats

    def _score_sentence(self, feats, tags):
        # Gives the score of a provided tag sequence
        score = torch.zeros(1)
        tags = torch.cat([
            torch.tensor([self.tag_to_ix[START_TAG]], dtype=torch.long), tags
        ])
        for i, feat in enumerate(feats):
            score = score + \
                self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
        score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def _viterbi_decode(self, feats):
        backpointers = []

        # Initialize the viterbi variables in log space
        init_vvars = torch.full((1, self.tagset_size), -10000.)
        init_vvars[0][self.tag_to_ix[START_TAG]] = 0

        # forward_var at step i holds the viterbi variables for step i-1
        forward_var = init_vvars
        for feat in feats:
            bptrs_t = []  # holds the backpointers for this step
            viterbivars_t = []  # holds the viterbi variables for this step

            for next_tag in range(self.tagset_size):
                # next_tag_var[i] holds the viterbi variable for tag i at the
                # previous step, plus the score of transitioning
                # from tag i to next_tag.
                # We don't include the emission scores here because the max
                # does not depend on them (we add them in below)
                next_tag_var = forward_var + self.transitions[next_tag]
                best_tag_id = argmax(next_tag_var)
                bptrs_t.append(best_tag_id)
                viterbivars_t.append(next_tag_var[0][best_tag_id].view(1))
            # Now add in the emission scores, and assign forward_var to the set
            # of viterbi variables we just computed
            forward_var = (torch.cat(viterbivars_t) + feat).view(1, -1)
            backpointers.append(bptrs_t)

        # Transition to STOP_TAG
        terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
        best_tag_id = argmax(terminal_var)
        path_score = terminal_var[0][best_tag_id]

        # Follow the back pointers to decode the best path.
        best_path = [best_tag_id]
        for bptrs_t in reversed(backpointers):
            best_tag_id = bptrs_t[best_tag_id]
            best_path.append(best_tag_id)
        # Pop off the start tag (we don't want to return that to the caller)
        start = best_path.pop()
        assert start == self.tag_to_ix[START_TAG]  # Sanity check
        best_path.reverse()
        return path_score, best_path

    def neg_log_likelihood(self, sentence, tags):
        feats = self._get_lstm_features(sentence)
        forward_score = self._forward_alg(feats)
        gold_score = self._score_sentence(feats, tags)
        return forward_score - gold_score

    def forward(self, sentence):  # don't confuse this with _forward_alg above.
        # Get the emission scores from the BiLSTM
        lstm_feats = self._get_lstm_features(sentence)

        # Find the best path, given the features.
        score, tag_seq = self._viterbi_decode(lstm_feats)
        return score, tag_seq
```
```python
START_TAG = "<START>"
STOP_TAG = "<STOP>"
EMBEDDING_DIM = 5
HIDDEN_DIM = 4

# Make up some training data
training_data = [
    ("the wall street journal reported today that apple corporation made money"
     .split(), "B I I I O O O B I O O".split()),
    ("georgia tech is a university in georgia".split(),
     "B I O O O O B".split())
]

word_to_ix = {}
for sentence, tags in training_data:
    for word in sentence:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)

tag_to_ix = {"B": 0, "I": 1, "O": 2, START_TAG: 3, STOP_TAG: 4}

model = BiLSTM_CRF(len(word_to_ix), tag_to_ix, EMBEDDING_DIM, HIDDEN_DIM)
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Check predictions before training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    precheck_tags = torch.tensor([tag_to_ix[t] for t in training_data[0][1]],
                                 dtype=torch.long)
    print(model(precheck_sent))

for epoch in range(300):
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Get our inputs ready for the network, that is,
        # turn them into Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = torch.tensor([tag_to_ix[t] for t in tags], dtype=torch.long)

        # Step 3. Run our forward pass.
        loss = model.neg_log_likelihood(sentence_in, targets)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss.backward()
        optimizer.step()

# Check predictions after training
with torch.no_grad():
    precheck_sent = prepare_sequence(training_data[0][0], word_to_ix)
    print(model(precheck_sent))

# (tensor(2.6907), [1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1])
# (tensor(20.4906), [0, 1, 1, 1, 2, 2, 2, 0, 1, 2, 2])
```
