Sequence Labeling with HMM and the Viterbi Algorithm

The HMM Generative Model

Given a sentence \(S=w_1w_2\ldots w_n\) and the corresponding output tag sequence \(T=t_1t_2\ldots t_n\), the HMM models the joint probability: \[ \begin{align} P(T|S) &= \frac{P(S|T)\cdot P(T)}{P(S)}\\ P(S,T) &= P(S|T)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|T)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|t_i)\cdot P(T)\\ &= \prod_{i=1}^{n}P(w_i|t_i)\cdot P(t_i|t_{i-1}) \end{align} \]

The first line expands the posterior with Bayes' rule; the joint probability is then simplified using the following assumptions:

- The assumption that words are mutually independent given the tags yields \(\prod_{i=1}^{n}P(w_i|T)\)
- The assumption that each word's probability depends only on its own tag yields the emission probabilities \(\prod_{i=1}^{n}P(w_i|t_i)\)
- The Markov assumption, using a bi-gram over tags, yields the transition probabilities \(P(t_i|t_{i-1})\)


The objective function:

\[ (\hat{t}_1,\hat{t}_2,\ldots,\hat{t}_n)=\arg\max_{t_1,\ldots,t_n}\prod_{i=1}^{n}P(w_i|t_i)\cdot P(t_i|t_{i-1}) \]
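Multiplying many small probabilities underflows floating point quickly, so the implementations below maximize the logarithm of this objective instead, which turns the product into a sum:

\[ (\hat{t}_1,\hat{t}_2,\ldots,\hat{t}_n)=\arg\max_{t_1,\ldots,t_n}\sum_{i=1}^{n}\left[\log P(w_i|t_i)+\log P(t_i|t_{i-1})\right] \]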


In summary, the HMM relies on two kinds of features: the relation between the current tag and the previous tag, and the relation between the current word and its own tag.
Training an HMM amounts to estimating these two probability matrices from the training set, of sizes (t, t) and (t, w) respectively, where w is the number of words and t is the number of tags.
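Both matrices (together with the distribution \(\pi\) of sentence-initial tags) are estimated by maximum likelihood, i.e. by counting co-occurrences in the corpus and normalizing, which is exactly what the counting code below does:

\[ \hat{P}(t_i|t_{i-1})=\frac{\mathrm{count}(t_{i-1},t_i)}{\mathrm{count}(t_{i-1})},\qquad \hat{P}(w_i|t_i)=\frac{\mathrm{count}(t_i,w_i)}{\mathrm{count}(t_i)} \]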


Hands-On: Part-of-Speech Tagging

  • Part-of-speech tagging (POS tagging) determines the part of speech of each word in a sentence: predicates, function words, pronouns, interjections, and so on
  • It is essentially a classification problem: classify the words of a sentence by part of speech
  • It therefore requires a POS-annotated corpus, with each sentence \(s=w_1w_2...w_n\) paired with its tags \(t=z_1z_2...z_n\)

    > Corpus format: one word/tag pair per line; special symbols such as `.` mark the end of a sentence
# Corpus sample: readlines(50) reads whole lines until about 50 bytes have been consumed
open('../datasets/pos_tagging_data.txt', 'r').readlines(50)
['Newsweek/NNP\n',
 ',/,\n',
 'trying/VBG\n',
 'to/TO\n',
 'keep/VB\n',
 'pace/NN\n',
 'with/IN\n']
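The tags follow the Penn Treebank tag set: NNP is a proper noun, VBG a gerund, TO the word "to", VB a base-form verb, NN a common noun, and IN a preposition.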

Building the Vocabulary

# Build dictionaries to map tokens and tags to integer ids and back

tag2id, id2tag = {}, {}
word2id, id2word = {}, {}

for line in open('../datasets/pos_tagging_data.txt', 'r'):
    items = line.split('/')
    word, tag = items[0], items[1].rstrip()

    if word not in word2id:
        word2id[word] = len(word2id)
        id2word[len(id2word)] = word
    if tag not in tag2id:
        tag2id[tag] = len(tag2id)
        id2tag[len(id2tag)] = tag  # note: len(id2tag), not len(tag2id), which was just incremented

M = len(word2id)  # vocabulary size
N = len(tag2id)   # number of distinct tags

print(M, N)

Emission and Transition Matrices

  • Compute the emission and transition matrices from the corpus
import numpy as np

pi = np.zeros(N)      # probability of each tag appearing at the start of a sentence
A = np.zeros((N, M))  # A[i][j]: probability of word j given tag i (emission)
B = np.zeros((N, N))  # B[i][j]: probability that tag i is followed by tag j (transition)

prev_tag = ""
for line in open('../datasets/pos_tagging_data.txt', 'r'):
    items = line.split('/')
    wordId, tagId = word2id[items[0]], tag2id[items[1].rstrip()]

    if prev_tag == "":  # start of a sentence
        pi[tagId] += 1
        A[tagId][wordId] += 1
    else:
        A[tagId][wordId] += 1
        B[tag2id[prev_tag]][tagId] += 1

    if items[0] == ".":  # end of sentence: reset the previous tag
        prev_tag = ""
    else:
        prev_tag = items[1].rstrip()

# Normalize the counts into probabilities
pi = pi / sum(pi)
for i in range(N):
    A[i] /= sum(A[i])
    B[i] /= sum(B[i])
pi
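As a quick sanity check (an illustrative addition, not part of the original notebook): pi and every row of A should now be valid probability distributions. A tag that never precedes another tag within a sentence leaves an all-zero row in B, which the normalization above turns into NaN, so B is worth inspecting as well:

# pi and the rows of A should each sum to 1
assert np.isclose(pi.sum(), 1.0)
assert np.allclose(A.sum(axis=1), 1.0)

# A tag with no observed successors (row summed to 0 before dividing)
# shows up as a NaN row in B
print(np.isnan(B).any(axis=1))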

Finding the Best Tagging with the Viterbi Algorithm

def log_(v):
    # Add a small epsilon before taking the log to avoid log(0)
    if v == 0:
        return np.log(v + 0.000001)
    return np.log(v)
def viterbi(x, pi, A, B):  # x is an input sentence, e.g. "I like playing soccer"
    x = [word2id[word] for word in x.split(" ")]
    T = len(x)

    dp = np.zeros((T, N))  # float by default
    ptr = np.array([[0 for x in range(N)] for y in range(T)])  # integer backpointers

    for j in range(N):
        dp[0][j] = log_(pi[j]) + log_(A[j][x[0]])  # smoothing handled inside log_

    for i in range(1, T):
        for j in range(N):
            dp[i][j] = float('-inf')
            for k in range(N):
                score = dp[i - 1][k] + log_(B[k][j]) + log_(A[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k

    # Decoding: recover the best tag sequence
    best_seq = [0] * T
    # Step 1: the tag of the last word
    best_seq[T - 1] = np.argmax(dp[T - 1])

    # Step 2: walk backwards through the backpointers
    for i in range(T - 2, -1, -1):
        best_seq[i] = ptr[i + 1][best_seq[i + 1]]

    return [id2tag[id] for id in best_seq]
x = "I like play soccer"
viterbi(x, pi, A, B)
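With N tags and a sentence of length T, the triple loop costs O(T·N²) time, which is the standard complexity of Viterbi decoding; the NER section below implements the same recurrence with vectorized NumPy operations.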

Hands-On: Named Entity Recognition

Based on the People's Daily (人民日报) corpus; the vocabulary and tag set have already been preprocessed.

Deriving the Transition and Emission Matrices from the Corpus

Each line of the corpus is one clause; within a clause, the annotated characters are separated by spaces, and each character is joined to its tag by /.

import pickle

with open("../datasets/ner/renmin/vocab.pkl", 'rb') as inp:
    token2idx = pickle.load(inp)
    idx2token = pickle.load(inp)

with open("../datasets/ner/renmin/tags.pkl", "rb") as inp:
    tag2idx = pickle.load(inp)
    idx2tag = pickle.load(inp)

# Model dimensions
N = len(tag2idx)    # number of tags
M = len(token2idx)  # vocabulary size
import codecs


def train_hmm(data_file):
    input_data = codecs.open(data_file, 'r', 'utf-8')

    pi = np.zeros(N)      # probability of each tag at the start of a clause
    A = np.zeros((N, M))  # A[i][j]: probability of token j given tag i (emission)
    B = np.zeros((N, N))  # B[i][j]: probability that tag i is followed by tag j (transition)

    for line in input_data.readlines():
        line = line.strip().split()
        tokens = [token2idx[string.split('/')[0].strip()] for string in line]
        tags = [tag2idx[string.split('/')[1].strip()] for string in line]

        for idx in range(len(tokens)):
            if idx == 0:
                pi[tags[idx]] += 1
                A[tags[idx]][tokens[idx]] += 1
            else:
                A[tags[idx]][tokens[idx]] += 1
                B[tags[idx - 1]][tags[idx]] += 1

    # Normalize the counts into probabilities
    pi = pi / sum(pi)
    A = A / A.sum(axis=-1).reshape(-1, 1)
    B = B / B.sum(axis=-1).reshape(-1, 1)

    return pi, A, B

# data_file = '../datasets/ner/renmin/renmin4.txt'
# pi, A, B = train_hmm(data_file)

# with open('../models/ner/hmm.pkl', 'wb') as output:
#     pickle.dump(pi, output)
#     pickle.dump(A, output)
#     pickle.dump(B, output)
with open('../models/ner/hmm.pkl', 'rb') as inp:
    pi = pickle.load(inp)
    A = pickle.load(inp)
    B = pickle.load(inp)
import pandas as pd

# Transition matrix: given the previous token's tag, the probability
# distribution over the next token's tag

index = [
    'B_nt', 'M_nt', 'E_nt', 'B_nr', 'M_nr', 'E_nr', 'B_ns', 'M_ns', 'E_ns', 'O'
]
transitions = pd.DataFrame(B, index=idx2tag.values(), columns=idx2tag.values())
transitions.reindex(index, axis=0).reindex(index, axis=1).round(2).style.applymap(
    lambda v: 'background-color: %s' % '#B0C4DE'
    if v > 0 else 'background-color: %s' % '#FFFFFF')

The table above is the transition matrix learned from the corpus: each cell holds the conditional probability of a tag given the tag that precedes it (nonzero cells are highlighted).

# Tag distribution for the first token of a clause

start_status = pd.DataFrame(pi, index=idx2tag.values())
start_status.reindex(index, axis=0).round(2).style.applymap(
    lambda v: 'background-color: %s' % '#B0C4DE'
    if v > 0 else 'background-color: %s' % '#FFFFFF')

The probability distribution over tags for clause-initial characters.


Evaluating Performance

import pickle

pickle_path = '../datasets/ner/renmin/renmindata.pkl'
with open(pickle_path, 'rb') as inp:
    word2id = pickle.load(inp)
    id2word = pickle.load(inp)
    tag2id = pickle.load(inp)
    id2tag = pickle.load(inp)
    x_train = pickle.load(inp)
    y_train = pickle.load(inp)

    # test set
    x_test = pickle.load(inp)
    y_test = pickle.load(inp)

    x_valid = pickle.load(inp)
    y_valid = pickle.load(inp)

print("train len:", len(x_train))
print("test len:", len(x_test))
print("valid len:", len(x_valid))
train len: 24271
test len: 7585
valid len: 6068
x_test[0], y_test[0]
(array([ 3, 33,  5,  2,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0]),
 array([3, 3, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
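Each sample is zero-padded to a fixed length of 60; the nonzero prefix is the actual token sequence, which is why the evaluation code below truncates each sample with sum(x > 0) before decoding.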
def log_(v):
    # Smoothed log, redefined here to work elementwise on arrays
    return np.log(v + 0.000001)


# Viterbi decoding, vectorized with NumPy
def viterbi_decode(x, pi, A, B):
    T = len(x)
    N = len(tag2idx)

    dp = np.full((T, N), float('-inf'))
    ptr = np.zeros_like(dp, dtype=np.int32)

    dp[0] = log_(pi) + log_(A[:, x[0]])

    for i in range(1, T):
        # v[k][j]: score of being in tag k at step i-1 and moving to tag j
        v = dp[i - 1].reshape(-1, 1) + log_(B)
        dp[i] = np.max(v, axis=0) + log_(A[:, x[i]])
        ptr[i] = np.argmax(v, axis=0)

    # Backtrack through the pointers
    best_seq = [0] * T
    best_seq[-1] = np.argmax(dp[-1])
    for i in range(T - 2, -1, -1):
        best_seq[i] = ptr[i + 1][best_seq[i + 1]]

    return best_seq
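To gain confidence in the vectorized recurrence, a brute-force comparison can enumerate every tag sequence for a very short input and verify that none scores higher than the Viterbi path. This is an illustrative sketch, not part of the original notebook; path_score and x_check are names introduced here:

from itertools import product

def path_score(path, x):
    # Log-probability of a tag path for token ids x under the HMM
    s = log_(pi[path[0]]) + log_(A[path[0], x[0]])
    for i in range(1, len(x)):
        s += log_(B[path[i - 1], path[i]]) + log_(A[path[i], x[i]])
    return s

x_check = x_test[0][:3]  # a short prefix of one test sentence
brute_best = max(product(range(len(tag2idx)), repeat=len(x_check)),
                 key=lambda p: path_score(p, x_check))
viterbi_best = viterbi_decode(x_check, pi, A, B)
assert np.isclose(path_score(brute_best, x_check), path_score(viterbi_best, x_check))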
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report


def test(x_test, y_test):
    preds, labels = [], []
    for index, x in enumerate(x_test):
        x = x[:sum(x > 0)]  # strip the zero padding
        y_pred = viterbi_decode(x, pi, A, B)
        y_true = y_test[index][:sum(x > 0)]

        preds.extend(y_pred)
        labels.extend(y_true)

    # Evaluation metrics
    precision = precision_score(labels, preds, average='macro')
    recall = recall_score(labels, preds, average='macro')
    f1 = f1_score(labels, preds, average='macro')
    report = classification_report(labels, preds)
    print(report)
test(x_test, y_test)
              precision    recall  f1-score   support

           0       0.73      0.69      0.71      2151
           1       0.74      0.75      0.74      8090
           2       0.89      0.78      0.83      3965
           3       0.94      0.96      0.95     77532
           4       0.92      0.83      0.87      3964
           5       0.83      0.80      0.81      4522
           6       0.66      0.70      0.68      2691
           7       0.80      0.76      0.78      4524
           8       0.88      0.84      0.86      3654
           9       0.75      0.74      0.74      2146

    accuracy                           0.90    113239
   macro avg       0.81      0.78      0.80    113239
weighted avg       0.90      0.90      0.90    113239
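Label 3 is the O tag (compare the decoded outputs below), and it accounts for the bulk of the 113,239 test tokens; with classes this imbalanced, the macro-averaged F1 of 0.80 is a more honest summary of entity recognition quality than the 0.90 overall accuracy.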

Prediction


Text Vectorization

import re, pickle

# Text vectorization
class Tokenizer:
    def __init__(self, vocab_file):
        with open(vocab_file, 'rb') as inp:
            self.token2idx = pickle.load(inp)
            self.idx2token = pickle.load(inp)

    def encode(self, text, maxlen):
        # Split on Chinese punctuation, giving one sub-sequence per clause
        seqs = re.split('[,。!?、‘’“”:]', text.strip())

        # Map characters to ids, falling back to the [unknown] token
        seq_ids = []
        for seq in seqs:
            token_ids = []
            if seq:
                for char in seq:
                    if char not in self.token2idx:
                        token_ids.append(self.token2idx['[unknown]'])
                    else:
                        token_ids.append(self.token2idx[char])
                seq_ids.append(token_ids)

        # Pad (or truncate) every sub-sequence to the same length
        num_samples = len(seq_ids)
        x = np.full((num_samples, maxlen), 0., dtype=np.int64)
        for idx, s in enumerate(seq_ids):
            trunc = np.array(s[:maxlen], dtype=np.int64)
            x[idx, :len(trunc)] = trunc
        return x
vocab_file = "../datasets/ner/renmin/vocab.pkl"
tokenizer = Tokenizer(vocab_file)
text = "新冠肺炎疫情发生后,以习近平同志为核心的党中央将疫情防控作为头等大事来抓,习近平\
总书记亲自指挥、亲自部署,坚持把人民生命安全和身体健康放在第一位,领导全党全军全国各族人民打好疫情\
防控的人民战争、总体战、阻击战。经过艰苦卓绝的努力,武汉保卫战、湖北保卫战取得决定性成果,疫情防控\
阻击战取得重大战略成果,统筹推进疫情防控和经济社会发展工作取得积极成效。"

tokenizer.encode(text, maxlen=30);

Running the Prediction

def predict(input_ids):
    res = []
    for idx, x in enumerate(input_ids):
        x = x[x > 0]  # drop the padding
        y_pred = viterbi_decode(x, pi, A, B)
        res.append(y_pred)
    return res
input_ids = tokenizer.encode(text, maxlen=30)
predict(input_ids)
[[3, 3, 3, 3, 3, 3, 3, 3, 3],
 [3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  9,
  1,
  0,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3],
 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
 [3, 3, 3, 3],
 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
 [3, 3, 3],
 [3, 3, 3],
 [3, 3, 3, 3, 3, 3, 3, 3, 3],
 [9, 1, 1, 1, 1],
 [5, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
 [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]

Converting Prediction Vectors to Tag Sequences

class Parser:
    def __init__(self, tags_file):
        with open(tags_file, "rb") as inp:
            self.tag2idx = pickle.load(inp)
            self.idx2tag = pickle.load(inp)

    def decode(self, text, paths):
        seqs = re.split('[,。!?、‘’“”:]', text)
        labels = [[self.idx2tag[idx] for idx in seq] for seq in paths]

        res = []
        # note: zip relies on empty segments (dropped by the tokenizer)
        # occurring only at the end of the text, as they do here
        for sent, tags in zip(seqs, labels):
            print(tags)
            tags = self._correct_tags(tags)
            print(tags)
            print('-' * 100)
            res.append(list(zip(sent, tags)))
        return res

    def _correct_tags(self, tags):
        # Check that each B/M/E tag belongs to a well-formed entity;
        # indices left on the stack are reset to 'O'
        stack = []
        for idx, tag in enumerate(tags):
            if tag.startswith("B"):
                stack.append(idx)
            elif tag.startswith("M") and stack and tags[
                    stack[-1]] == 'B_' + tag[2:]:
                continue
            elif tag.startswith("E") and stack and tags[
                    stack[-1]] == 'B_' + tag[2:]:
                stack.pop()
            else:
                stack.append(idx)

        for idx in stack:
            tags[idx] = 'O'
        return tags
tags_file = "../datasets/ner/renmin/tags.pkl"
parser = Parser(tags_file)

paths = predict(input_ids)

parser.decode(text, paths)
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B_nt', 'M_nt', 'E_nt', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B_nt', 'M_nt', 'E_nt', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O']
['O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O']
['O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O']
['O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['B_nt', 'M_nt', 'M_nt', 'M_nt', 'M_nt']
['O', 'M_nt', 'M_nt', 'M_nt', 'M_nt']
----------------------------------------------------------------------------------------------------
['B_ns', 'E_ns', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['B_ns', 'E_ns', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------





[[('新', 'O'),
  ('冠', 'O'),
  ('肺', 'O'),
  ('炎', 'O'),
  ('疫', 'O'),
  ('情', 'O'),
  ('发', 'O'),
  ('生', 'O'),
  ('后', 'O')],
 [('以', 'O'),
  ('习', 'O'),
  ('近', 'O'),
  ('平', 'O'),
  ('同', 'O'),
  ('志', 'O'),
  ('为', 'O'),
  ('核', 'O'),
  ('心', 'O'),
  ('的', 'O'),
  ('党', 'B_nt'),
  ('中', 'M_nt'),
  ('央', 'E_nt'),
  ('将', 'O'),
  ('疫', 'O'),
  ('情', 'O'),
  ('防', 'O'),
  ('控', 'O'),
  ('作', 'O'),
  ('为', 'O'),
  ('头', 'O'),
  ('等', 'O'),
  ('大', 'O'),
  ('事', 'O'),
  ('来', 'O'),
  ('抓', 'O')],
 [('习', 'O'),
  ('近', 'O'),
  ('平', 'O'),
  ('总', 'O'),
  ('书', 'O'),
  ('记', 'O'),
  ('亲', 'O'),
  ('自', 'O'),
  ('指', 'O'),
  ('挥', 'O')],
 [('亲', 'O'), ('自', 'O'), ('部', 'O'), ('署', 'O')],
 [('坚', 'O'),
  ('持', 'O'),
  ('把', 'O'),
  ('人', 'O'),
  ('民', 'O'),
  ('生', 'O'),
  ('命', 'O'),
  ('安', 'O'),
  ('全', 'O'),
  ('和', 'O'),
  ('身', 'O'),
  ('体', 'O'),
  ('健', 'O'),
  ('康', 'O'),
  ('放', 'O'),
  ('在', 'O'),
  ('第', 'O'),
  ('一', 'O'),
  ('位', 'O')],
 [('领', 'O'),
  ('导', 'O'),
  ('全', 'O'),
  ('党', 'O'),
  ('全', 'O'),
  ('军', 'O'),
  ('全', 'O'),
  ('国', 'O'),
  ('各', 'O'),
  ('族', 'O'),
  ('人', 'O'),
  ('民', 'O'),
  ('打', 'O'),
  ('好', 'O'),
  ('疫', 'O'),
  ('情', 'O'),
  ('防', 'O'),
  ('控', 'O'),
  ('的', 'O'),
  ('人', 'O'),
  ('民', 'O'),
  ('战', 'O'),
  ('争', 'O')],
 [('总', 'O'), ('体', 'O'), ('战', 'O')],
 [('阻', 'O'), ('击', 'O'), ('战', 'O')],
 [('经', 'O'),
  ('过', 'O'),
  ('艰', 'O'),
  ('苦', 'O'),
  ('卓', 'O'),
  ('绝', 'O'),
  ('的', 'O'),
  ('努', 'O'),
  ('力', 'O')],
 [('武', 'O'), ('汉', 'M_nt'), ('保', 'M_nt'), ('卫', 'M_nt'), ('战', 'M_nt')],
 [('湖', 'B_ns'),
  ('北', 'E_ns'),
  ('保', 'O'),
  ('卫', 'O'),
  ('战', 'O'),
  ('取', 'O'),
  ('得', 'O'),
  ('决', 'O'),
  ('定', 'O'),
  ('性', 'O'),
  ('成', 'O'),
  ('果', 'O')],
 [('疫', 'O'),
  ('情', 'O'),
  ('防', 'O'),
  ('控', 'O'),
  ('阻', 'O'),
  ('击', 'O'),
  ('战', 'O'),
  ('取', 'O'),
  ('得', 'O'),
  ('重', 'O'),
  ('大', 'O'),
  ('战', 'O'),
  ('略', 'O'),
  ('成', 'O'),
  ('果', 'O')],
 [('统', 'O'),
  ('筹', 'O'),
  ('推', 'O'),
  ('进', 'O'),
  ('疫', 'O'),
  ('情', 'O'),
  ('防', 'O'),
  ('控', 'O'),
  ('和', 'O'),
  ('经', 'O'),
  ('济', 'O'),
  ('社', 'O'),
  ('会', 'O'),
  ('发', 'O'),
  ('展', 'O'),
  ('工', 'O'),
  ('作', 'O'),
  ('取', 'O'),
  ('得', 'O'),
  ('积', 'O'),
  ('极', 'O'),
  ('成', 'O'),
  ('效', 'O')]]
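Note the '武汉保卫战' segment above: _correct_tags reset the leading B_nt to O but left the following M_nt tags untouched, because during the scan each M_nt still matched the B_nt sitting on the stack. Running the correction a second time over the already-corrected tags does remove the orphaned M_nt tags: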
ttt = [('武', 'O'), ('汉', 'M_nt'), ('保', 'M_nt'), ('卫', 'M_nt'), ('战', 'M_nt')]
tttags = [tmp[1] for tmp in ttt]
tttags
['O', 'M_nt', 'M_nt', 'M_nt', 'M_nt']
parser._correct_tags(tttags)
['O', 'O', 'O', 'O', 'O']
# re.split drops the separators; list them to see what was removed
seqs = re.findall('[,。!?、‘’“”:]', text.strip())
seqs
[',', ',', '、', ',', ',', '、', '、', '。', ',', '、', ',', ',', '。']
text[9]  # the comma that closes the first segment
','
text[10]  # the first character of the second segment
'以'
import string
string.punctuation  # ASCII punctuation only; Chinese punctuation is not included
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
class Tokenizer:
    def __init__(self, vocab_file):
        with open(vocab_file, 'rb') as inp:
            self.token2idx = pickle.load(inp)
            self.idx2token = pickle.load(inp)

    def encode(self, text):
        # Split on Chinese punctuation while keeping each segment's
        # (start, end) offsets, so predicted tags can be aligned back
        # to positions in the original text
        sep = set(',。!?、‘’“”:')

        res = {}
        start = 0
        for idx, char in enumerate(text):
            if char in sep:
                if idx > start:  # skip empty segments
                    res[(start, idx)] = [
                        self.token2idx.get(c, self.token2idx['[unknown]'])
                        for c in text[start:idx]
                    ]
                start = idx + 1
        if start < len(text):  # trailing segment without a closing separator
            res[(start, len(text))] = [
                self.token2idx.get(c, self.token2idx['[unknown]'])
                for c in text[start:]
            ]
        return res
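Each value of the returned dict can be fed to viterbi_decode, and its key locates the segment in text. The cells below reach the same goal more directly: re.finditer exposes the span of every separator, so each segment's characters can be paired with their absolute offsets.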
res = []
start = 0
for m in re.finditer('[,。!?、‘’“”:]', text):
    end = m.span()[0]  # the separator's start is the segment's end
    seg = text[start:end]
    seg_ids = [token2idx[token] for token in seg]
    seg_labels = viterbi_decode(seg_ids, pi, A, B)
    # Pair each predicted tag id with the character's absolute offset
    res.append(list(zip(list(range(start, end)), seg_labels)))
    start = m.span()[1]  # the next segment starts after the separator
res
[[(0, 3), (1, 3), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3), (7, 3), (8, 3)],
 [(10, 3),
  (11, 3),
  (12, 3),
  (13, 3),
  (14, 3),
  (15, 3),
  (16, 3),
  (17, 3),
  (18, 3),
  (19, 3),
  (20, 9),
  (21, 1),
  (22, 0),
  (23, 3),
  (24, 3),
  (25, 3),
  (26, 3),
  (27, 3),
  (28, 3),
  (29, 3),
  (30, 3),
  (31, 3),
  (32, 3),
  (33, 3),
  (34, 3),
  (35, 3)],
 [(37, 3),
  (38, 3),
  (39, 3),
  (40, 3),
  (41, 3),
  (42, 3),
  (43, 3),
  (44, 3),
  (45, 3),
  (46, 3)],
 [(48, 3), (49, 3), (50, 3), (51, 3)],
 [(53, 3),
  (54, 3),
  (55, 3),
  (56, 3),
  (57, 3),
  (58, 3),
  (59, 3),
  (60, 3),
  (61, 3),
  (62, 3),
  (63, 3),
  (64, 3),
  (65, 3),
  (66, 3),
  (67, 3),
  (68, 3),
  (69, 3),
  (70, 3),
  (71, 3)],
 [(73, 3),
  (74, 3),
  (75, 3),
  (76, 3),
  (77, 3),
  (78, 3),
  (79, 3),
  (80, 3),
  (81, 3),
  (82, 3),
  (83, 3),
  (84, 3),
  (85, 3),
  (86, 3),
  (87, 3),
  (88, 3),
  (89, 3),
  (90, 3),
  (91, 3),
  (92, 3),
  (93, 3),
  (94, 3),
  (95, 3)],
 [(97, 3), (98, 3), (99, 3)],
 [(101, 3), (102, 3), (103, 3)],
 [(105, 3),
  (106, 3),
  (107, 3),
  (108, 3),
  (109, 3),
  (110, 3),
  (111, 3),
  (112, 3),
  (113, 3)],
 [(115, 9), (116, 1), (117, 1), (118, 1), (119, 1)],
 [(121, 5),
  (122, 7),
  (123, 3),
  (124, 3),
  (125, 3),
  (126, 3),
  (127, 3),
  (128, 3),
  (129, 3),
  (130, 3),
  (131, 3),
  (132, 3)],
 [(134, 3),
  (135, 3),
  (136, 3),
  (137, 3),
  (138, 3),
  (139, 3),
  (140, 3),
  (141, 3),
  (142, 3),
  (143, 3),
  (144, 3),
  (145, 3),
  (146, 3),
  (147, 3),
  (148, 3)],
 [(150, 3),
  (151, 3),
  (152, 3),
  (153, 3),
  (154, 3),
  (155, 3),
  (156, 3),
  (157, 3),
  (158, 3),
  (159, 3),
  (160, 3),
  (161, 3),
  (162, 3),
  (163, 3),
  (164, 3),
  (165, 3),
  (166, 3),
  (167, 3),
  (168, 3),
  (169, 3),
  (170, 3),
  (171, 3),
  (172, 3)]]
# Print every character with its predicted tag, located via the preserved offsets
for seg in res:
    for idx, tag_id in seg:
        print(text[idx], idx2tag[tag_id])
新 O
冠 O
肺 O
炎 O
疫 O
情 O
发 O
生 O
后 O
以 O
习 O
近 O
平 O
同 O
志 O
为 O
核 O
心 O
的 O
党 B_nt
中 M_nt
央 E_nt
将 O
疫 O
情 O
防 O
控 O
作 O
为 O
头 O
等 O
大 O
事 O
来 O
抓 O
习 O
近 O
平 O
总 O
书 O
记 O
亲 O
自 O
指 O
挥 O
亲 O
自 O
部 O
署 O
坚 O
持 O
把 O
人 O
民 O
生 O
命 O
安 O
全 O
和 O
身 O
体 O
健 O
康 O
放 O
在 O
第 O
一 O
位 O
领 O
导 O
全 O
党 O
全 O
军 O
全 O
国 O
各 O
族 O
人 O
民 O
打 O
好 O
疫 O
情 O
防 O
控 O
的 O
人 O
民 O
战 O
争 O
总 O
体 O
战 O
阻 O
击 O
战 O
经 O
过 O
艰 O
苦 O
卓 O
绝 O
的 O
努 O
力 O
武 B_nt
汉 M_nt
保 M_nt
卫 M_nt
战 M_nt
湖 B_ns
北 E_ns
保 O
卫 O
战 O
取 O
得 O
决 O
定 O
性 O
成 O
果 O
疫 O
情 O
防 O
控 O
阻 O
击 O
战 O
取 O
得 O
重 O
大 O
战 O
略 O
成 O
果 O
统 O
筹 O
推 O
进 O
疫 O
情 O
防 O
控 O
和 O
经 O
济 O
社 O
会 O
发 O
展 O
工 O
作 O
取 O
得 O
积 O
极 O
成 O
效 O

基于HMM和Viterbi算法的序列标注

https://hunlp.com/posts/553.html

作者

ฅ´ω`ฅ

发布于

2021-06-06

更新于

2021-06-06

许可协议


评论