HMM
A generative model
Given a sentence \(S\) and its corresponding tag sequence \(T\), the joint probability under the HMM is:
\[
\begin{align}
P(T|S) &= \frac{P(S|T)\cdot P(T)}{P(S)}\\
P(S,T) &= P(S|T)\cdot P(T)\\
&= \prod_{i=1}^{n}P(w_i|T)\cdot P(T)\\
&= \prod_{i=1}^{n}P(w_i|t_i)\cdot P(T)\\
&= \prod_{i=1}^{n}P(w_i|t_i)\cdot \prod_{i=1}^{n}P(t_i|t_{i-1})
\end{align}
\]
First expand with Bayes' rule, then simplify using the following assumptions:
- Words are assumed conditionally independent of each other, giving \(\prod_{i=1}^{n}P(w_i|T)\)
- Each word's probability depends only on its own tag, giving the emission probabilities \(\prod_{i=1}^{n}P(w_i|t_i)\)
- The Markov assumption, using a bi-gram over tags, gives the transition probabilities \(\prod_{i=1}^{n}P(t_i|t_{i-1})\)
Objective function:
\[
(\hat{t}_1,\hat{t}_2,\dots,\hat{t}_n)=\arg\max_{t_1,\dots,t_n}\prod_{i=1}^{n}P(w_i|t_i)\cdot P(t_i|t_{i-1})
\]
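To make the objective concrete, here is a minimal sketch (all probabilities are made-up toy numbers): enumerate every possible tag sequence for a short observation and pick the one maximizing \(\prod_{i}P(w_i|t_i)\,P(t_i|t_{i-1})\) — exactly what the Viterbi algorithm below computes without brute force.

```python
import itertools
import numpy as np

# Toy HMM with 2 tags and 2 word types (numbers are made up)
pi = np.array([0.7, 0.3])            # P(t_1)
trans = np.array([[0.7, 0.3],        # P(t_i | t_{i-1})
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],         # P(w_i | t_i)
                 [0.2, 0.8]])

def seq_prob(tags, words):
    # prod_i P(w_i|t_i) * P(t_i|t_{i-1}), with pi for the first tag
    p = pi[tags[0]] * emit[tags[0]][words[0]]
    for i in range(1, len(words)):
        p *= trans[tags[i - 1]][tags[i]] * emit[tags[i]][words[i]]
    return p

words = [0, 1, 0]
# Brute-force argmax over all 2^3 tag sequences
best = max(itertools.product([0, 1], repeat=len(words)),
           key=lambda tags: seq_prob(tags, words))
print(best)  # → (0, 1, 0)
```

With longer sentences this enumeration grows exponentially, which is why the dynamic-programming Viterbi decoder is used instead.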
In summary, the HMM assumes two kinds of features: the relation between the current tag and the previous tag, and the relation between the current word and its tag. Training an HMM amounts to estimating these two probability matrices from the training set; their sizes are (t, t) and (t, w), where w is the number of distinct words and t is the number of tags.
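The counting described above can be sketched on a toy tagged corpus (ids and numbers are made up for illustration): accumulate tag→word and tag→tag counts, then normalize each row into a probability distribution.

```python
import numpy as np

# Toy tagged corpus: each sentence is a list of (word_id, tag_id) pairs,
# with 2 tags and 3 word types (all made up)
corpus = [[(0, 0), (1, 1), (2, 0)],
          [(0, 0), (2, 1)]]
N, M = 2, 3
emission = np.zeros((N, M))    # (t, w) counts
transition = np.zeros((N, N))  # (t, t) counts

for sent in corpus:
    for i, (w, t) in enumerate(sent):
        emission[t][w] += 1
        if i > 0:
            transition[sent[i - 1][1]][t] += 1

# Normalize each row so it sums to 1
emission /= emission.sum(axis=1, keepdims=True)
transition /= transition.sum(axis=1, keepdims=True)
print(emission)
print(transition)
```

This is the same maximum-likelihood counting that the real training code below performs on the annotated corpus.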
Hands-on: part-of-speech tagging
Part-Of-Speech (POS) tagging assigns a word class to each word in a sentence: predicates, function words, pronouns, interjections, and so on.
It is essentially a classification problem: classify the words of a sentence by their part of speech.
This requires a POS-annotated corpus, which provides each sentence \(s=w_1w_2...w_n\) together with its tags \(t=z_1z_2...z_n\).
> Corpus format: one word/tag pair per line; a special symbol (the corpus uses `.`) marks the end of a sentence.
```python
open('../datasets/pos_tagging_data.txt', 'r').readlines(50)
```
['Newsweek/NNP\n',
',/,\n',
'trying/VBG\n',
'to/TO\n',
'keep/VB\n',
'pace/NN\n',
'with/IN\n']
Build the vocabulary
```python
tag2id, id2tag = {}, {}
word2id, id2word = {}, {}
for line in open('../datasets/pos_tagging_data.txt', 'r'):
    items = line.split('/')
    word, tag = items[0], items[1].rstrip()
    if word not in word2id:
        word2id[word] = len(word2id)
        id2word[len(id2word)] = word
    if tag not in tag2id:
        tag2id[tag] = len(tag2id)
        # note: len(id2tag), not len(tag2id) — tag2id has already grown by one
        id2tag[len(id2tag)] = tag

M = len(word2id)
N = len(tag2id)
print(M, N)
```
Estimate the emission and transition matrices
```python
import numpy as np

pi = np.zeros(N)       # initial tag counts
A = np.zeros((N, M))   # emission counts: tag -> word
B = np.zeros((N, N))   # transition counts: tag -> tag

prev_tag = ""
for line in open('../datasets/pos_tagging_data.txt', 'r'):
    items = line.split('/')
    wordId, tagId = word2id[items[0]], tag2id[items[1].rstrip()]
    if prev_tag == "":   # first word of a sentence
        pi[tagId] += 1
        A[tagId][wordId] += 1
    else:
        A[tagId][wordId] += 1
        B[tag2id[prev_tag]][tagId] += 1
    if items[0] == ".":
        prev_tag = ""
    else:
        prev_tag = items[1].rstrip()

# Normalize counts into probabilities
pi = pi / sum(pi)
for i in range(N):
    A[i] /= sum(A[i])
    B[i] /= sum(B[i])
```
Finding the best tag sequence with the Viterbi algorithm
```python
def log_(v):
    # avoid -inf for zero probabilities by clamping to a small epsilon
    if v == 0:
        return np.log(v + 0.000001)
    return np.log(v)
```
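The epsilon matters because unseen (word, tag) pairs have probability 0, and `np.log(0)` is `-inf`, which would wipe out an entire path's score; the clamp keeps scores finite and comparable. A quick self-contained check (redefining `log_` so it runs standalone):

```python
import numpy as np

def log_(v):
    # same smoothing as above: clamp zero probabilities to 1e-6
    if v == 0:
        return np.log(v + 0.000001)
    return np.log(v)

print(log_(0))  # finite, roughly log(1e-6) ≈ -13.8
print(log_(1))  # → 0.0
```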
```python
def viterbi(x, pi, A, B):
    x = [word2id[word] for word in x.split(" ")]
    T = len(x)
    dp = np.zeros((T, N))
    ptr = np.array([[0 for x in range(N)] for y in range(T)])
    # initialization: first word
    for j in range(N):
        dp[0][j] = log_(pi[j]) + log_(A[j][x[0]])
    # recursion over positions and tags
    for i in range(1, T):
        for j in range(N):
            dp[i][j] = float('-inf')
            for k in range(N):
                score = dp[i - 1][k] + log_(B[k][j]) + log_(A[j][x[i]])
                if score > dp[i][j]:
                    dp[i][j] = score
                    ptr[i][j] = k
    # backtrace the best path
    best_seq = [0] * T
    best_seq[T - 1] = np.argmax(dp[T - 1])
    for i in range(T - 2, -1, -1):
        best_seq[i] = ptr[i + 1][best_seq[i + 1]]
    return [id2tag[id] for id in best_seq]
```
```python
x = "I like play soccer"
viterbi(x, pi, A, B)
```
Hands-on: named entity recognition
Based on the People's Daily (人民日报) corpus; the vocabulary and tag set have already been prepared.
Estimate the transition and emission matrices from the corpus.
Each line of the corpus is one clause; within a clause the annotated characters are separated by spaces, and each character is joined to its tag with `/`.
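For instance, a clause in this format might look like the following (a made-up sample line using the tag set shown below); splitting on spaces and then on `/` recovers the characters and their tags:

```python
# Hypothetical corpus line in the char/tag space-separated format
line = "中/B_ns 国/E_ns 政/O 府/O"
pairs = [item.split('/') for item in line.strip().split()]
chars = [p[0] for p in pairs]
tags = [p[1] for p in pairs]
print(chars)  # → ['中', '国', '政', '府']
print(tags)   # → ['B_ns', 'E_ns', 'O', 'O']
```

This is the same per-line parsing that `train_hmm` below performs before counting.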
```python
import pickle

with open("../datasets/ner/renmin/vocab.pkl", 'rb') as inp:
    token2idx = pickle.load(inp)
    idx2token = pickle.load(inp)

with open("../datasets/ner/renmin/tags.pkl", "rb") as inp:
    tag2idx = pickle.load(inp)
    idx2tag = pickle.load(inp)

N = len(tag2idx)
M = len(token2idx)
```
```python
import codecs

def train_hmm(data_file):
    input_data = codecs.open(data_file, 'r', 'utf-8')
    pi = np.zeros(N)
    A = np.zeros((N, M))
    B = np.zeros((N, N))
    for line in input_data.readlines():
        line = line.strip().split()
        tokens = [token2idx[string.split('/')[0].strip()] for string in line]
        tags = [tag2idx[string.split('/')[1].strip()] for string in line]
        for idx in range(len(tokens)):
            if idx == 0:
                pi[tags[idx]] += 1
                A[tags[idx]][tokens[idx]] += 1
            else:
                A[tags[idx]][tokens[idx]] += 1
                B[tags[idx - 1]][tags[idx]] += 1
    pi = pi / sum(pi)
    A = A / A.sum(axis=-1).reshape(-1, 1)
    B = B / B.sum(axis=-1).reshape(-1, 1)
    return pi, A, B
```
```python
with open('../models/ner/hmm.pkl', 'rb') as inp:
    pi = pickle.load(inp)
    A = pickle.load(inp)
    B = pickle.load(inp)
```
```python
import pandas as pd

index = ['B_nt', 'M_nt', 'E_nt', 'B_nr', 'M_nr', 'E_nr',
         'B_ns', 'M_ns', 'E_ns', 'O']
transitions = pd.DataFrame(B, index=idx2tag.values(), columns=idx2tag.values())
transitions.reindex(index, axis=0).reindex(index, axis=1).round(2).style.applymap(
    lambda v: 'background-color: %s' % '#B0C4DE' if v > 0
    else 'background-color: %s' % '#FFFFFF')
```
The table above is the transition matrix estimated from the corpus: each cell gives the conditional probability of a tag given the previous tag.
```python
start_status = pd.DataFrame(pi, index=idx2tag.values())
start_status.reindex(index, axis=0).round(2).style.applymap(
    lambda v: 'background-color: %s' % '#B0C4DE' if v > 0
    else 'background-color: %s' % '#FFFFFF')
```
This shows the probability distribution over tags for the first character of a sentence (the initial distribution \(\pi\)).
Evaluating performance
```python
import pickle

pickle_path = '../datasets/ner/renmin/renmindata.pkl'
with open(pickle_path, 'rb') as inp:
    word2id = pickle.load(inp)
    id2word = pickle.load(inp)
    tag2id = pickle.load(inp)
    id2tag = pickle.load(inp)
    x_train = pickle.load(inp)
    y_train = pickle.load(inp)
    x_test = pickle.load(inp)
    y_test = pickle.load(inp)
    x_valid = pickle.load(inp)
    y_valid = pickle.load(inp)

print("train len:", len(x_train))
print("test len:", len(x_test))
print("valid len:", len(x_valid))
```
train len: 24271
test len: 7585
valid len: 6068
(array([ 3, 33, 5, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0]),
array([3, 3, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]))
```python
def log_(v):
    return np.log(v + 0.000001)

def viterbi_decode(x, pi, A, B):
    # Vectorized Viterbi: dp and ptr rows are updated for all tags at once
    T = len(x)
    N = len(tag2idx)
    dp = np.full((T, N), float('-inf'))
    ptr = np.zeros_like(dp, dtype=np.int32)
    dp[0] = log_(pi) + log_(A[:, x[0]])
    for i in range(1, T):
        v = dp[i - 1].reshape(-1, 1) + log_(B)
        dp[i] = np.max(v, axis=0) + log_(A[:, x[i]])
        ptr[i] = np.argmax(v, axis=0)
    best_seq = [0] * T
    best_seq[-1] = np.argmax(dp[-1])
    for i in range(T - 2, -1, -1):
        best_seq[i] = ptr[i + 1][best_seq[i + 1]]
    return best_seq
```
```python
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

def test(x_test, y_test):
    preds, labels = [], []
    for index, x in enumerate(x_test):
        x = x[:sum(x > 0)]   # strip padding
        y_pred = viterbi_decode(x, pi, A, B)
        y_true = y_test[index][:sum(x > 0)]
        preds.extend(y_pred)
        labels.extend(y_true)
    precision = precision_score(labels, preds, average='macro')
    recall = recall_score(labels, preds, average='macro')
    f1 = f1_score(labels, preds, average='macro')
    report = classification_report(labels, preds)
    print(report)
```
precision recall f1-score support
0 0.73 0.69 0.71 2151
1 0.74 0.75 0.74 8090
2 0.89 0.78 0.83 3965
3 0.94 0.96 0.95 77532
4 0.92 0.83 0.87 3964
5 0.83 0.80 0.81 4522
6 0.66 0.70 0.68 2691
7 0.80 0.76 0.78 4524
8 0.88 0.84 0.86 3654
9 0.75 0.74 0.74 2146
accuracy 0.90 113239
macro avg 0.81 0.78 0.80 113239
weighted avg 0.90 0.90 0.90 113239
Making predictions
Text vectorization
```python
import re, pickle

class Tokenizer:
    def __init__(self, vocab_file):
        with open(vocab_file, 'rb') as inp:
            self.token2idx = pickle.load(inp)
            self.idx2token = pickle.load(inp)

    def encode(self, text, maxlen):
        # Split on Chinese punctuation, map chars to ids, pad to maxlen
        seqs = re.split('[,。!?、‘’“”:]', text.strip())
        seq_ids = []
        for seq in seqs:
            token_ids = []
            if seq:
                for char in seq:
                    if char not in self.token2idx:
                        token_ids.append(self.token2idx['[unknown]'])
                    else:
                        token_ids.append(self.token2idx[char])
                seq_ids.append(token_ids)
        num_samples = len(seq_ids)
        x = np.full((num_samples, maxlen), 0., dtype=np.int64)
        for idx, s in enumerate(seq_ids):
            trunc = np.array(s[:maxlen], dtype=np.int64)
            x[idx, :len(trunc)] = trunc
        return x
```
```python
vocab_file = "../datasets/ner/renmin/vocab.pkl"
tokenizer = Tokenizer(vocab_file)
```
```python
text = "新冠肺炎疫情发生后,以习近平同志为核心的党中央将疫情防控作为头等大事来抓,习近平\
总书记亲自指挥、亲自部署,坚持把人民生命安全和身体健康放在第一位,领导全党全军全国各族人民打好疫情\
防控的人民战争、总体战、阻击战。经过艰苦卓绝的努力,武汉保卫战、湖北保卫战取得决定性成果,疫情防控\
阻击战取得重大战略成果,统筹推进疫情防控和经济社会发展工作取得积极成效。"
tokenizer.encode(text, maxlen=30);
```
Run the prediction
```python
def predict(input_ids):
    res = []
    for idx, x in enumerate(input_ids):
        x = x[x > 0]   # strip padding
        y_pred = viterbi_decode(x, pi, A, B)
        res.append(y_pred)
    return res
```
```python
input_ids = tokenizer.encode(text, maxlen=30)
predict(input_ids)
```
[[3, 3, 3, 3, 3, 3, 3, 3, 3],
[3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
9,
1,
0,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3,
3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3],
[3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3],
[9, 1, 1, 1, 1],
[5, 7, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
[3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]
Converting predicted ids back to tag sequences
```python
class Parser:
    def __init__(self, tags_file):
        with open(tags_file, "rb") as inp:
            self.tag2idx = pickle.load(inp)
            self.idx2tag = pickle.load(inp)

    def decode(self, text, paths):
        seqs = re.split('[,。!?、‘’“”:]', text)
        labels = [[self.idx2tag[idx] for idx in seq] for seq in paths]
        res = []
        for sent, tags in zip(seqs, labels):
            print(tags)
            tags = self._correct_tags(tags)
            print(tags)
            print('-' * 100)
            res.append(list(zip(sent, tags)))
        return res

    def _correct_tags(self, tags):
        # Reset any entity fragment that does not form a complete
        # B_ -> M_ -> E_ span of the same type back to 'O'
        stack = []
        for idx, tag in enumerate(tags):
            if tag.startswith("B"):
                stack.append(idx)
            elif tag.startswith("M") and stack and tags[
                    stack[-1]] == 'B_' + tag[2:]:
                continue
            elif tag.startswith("E") and stack and tags[
                    stack[-1]] == 'B_' + tag[2:]:
                stack.pop()
            else:
                stack.append(idx)
        for idx in stack:
            tags[idx] = 'O'
        return tags
```
```python
tags_file = "../datasets/ner/renmin/tags.pkl"
parser = Parser(tags_file)
paths = predict(input_ids)
parser.decode(text, paths)
```
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B_nt', 'M_nt', 'E_nt', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B_nt', 'M_nt', 'E_nt', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O']
['O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O']
['O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O']
['O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['B_nt', 'M_nt', 'M_nt', 'M_nt', 'M_nt']
['O', 'M_nt', 'M_nt', 'M_nt', 'M_nt']
----------------------------------------------------------------------------------------------------
['B_ns', 'E_ns', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['B_ns', 'E_ns', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
----------------------------------------------------------------------------------------------------
[[('新', 'O'),
('冠', 'O'),
('肺', 'O'),
('炎', 'O'),
('疫', 'O'),
('情', 'O'),
('发', 'O'),
('生', 'O'),
('后', 'O')],
[('以', 'O'),
('习', 'O'),
('近', 'O'),
('平', 'O'),
('同', 'O'),
('志', 'O'),
('为', 'O'),
('核', 'O'),
('心', 'O'),
('的', 'O'),
('党', 'B_nt'),
('中', 'M_nt'),
('央', 'E_nt'),
('将', 'O'),
('疫', 'O'),
('情', 'O'),
('防', 'O'),
('控', 'O'),
('作', 'O'),
('为', 'O'),
('头', 'O'),
('等', 'O'),
('大', 'O'),
('事', 'O'),
('来', 'O'),
('抓', 'O')],
[('习', 'O'),
('近', 'O'),
('平', 'O'),
('总', 'O'),
('书', 'O'),
('记', 'O'),
('亲', 'O'),
('自', 'O'),
('指', 'O'),
('挥', 'O')],
[('亲', 'O'), ('自', 'O'), ('部', 'O'), ('署', 'O')],
[('坚', 'O'),
('持', 'O'),
('把', 'O'),
('人', 'O'),
('民', 'O'),
('生', 'O'),
('命', 'O'),
('安', 'O'),
('全', 'O'),
('和', 'O'),
('身', 'O'),
('体', 'O'),
('健', 'O'),
('康', 'O'),
('放', 'O'),
('在', 'O'),
('第', 'O'),
('一', 'O'),
('位', 'O')],
[('领', 'O'),
('导', 'O'),
('全', 'O'),
('党', 'O'),
('全', 'O'),
('军', 'O'),
('全', 'O'),
('国', 'O'),
('各', 'O'),
('族', 'O'),
('人', 'O'),
('民', 'O'),
('打', 'O'),
('好', 'O'),
('疫', 'O'),
('情', 'O'),
('防', 'O'),
('控', 'O'),
('的', 'O'),
('人', 'O'),
('民', 'O'),
('战', 'O'),
('争', 'O')],
[('总', 'O'), ('体', 'O'), ('战', 'O')],
[('阻', 'O'), ('击', 'O'), ('战', 'O')],
[('经', 'O'),
('过', 'O'),
('艰', 'O'),
('苦', 'O'),
('卓', 'O'),
('绝', 'O'),
('的', 'O'),
('努', 'O'),
('力', 'O')],
[('武', 'O'), ('汉', 'M_nt'), ('保', 'M_nt'), ('卫', 'M_nt'), ('战', 'M_nt')],
[('湖', 'B_ns'),
('北', 'E_ns'),
('保', 'O'),
('卫', 'O'),
('战', 'O'),
('取', 'O'),
('得', 'O'),
('决', 'O'),
('定', 'O'),
('性', 'O'),
('成', 'O'),
('果', 'O')],
[('疫', 'O'),
('情', 'O'),
('防', 'O'),
('控', 'O'),
('阻', 'O'),
('击', 'O'),
('战', 'O'),
('取', 'O'),
('得', 'O'),
('重', 'O'),
('大', 'O'),
('战', 'O'),
('略', 'O'),
('成', 'O'),
('果', 'O')],
[('统', 'O'),
('筹', 'O'),
('推', 'O'),
('进', 'O'),
('疫', 'O'),
('情', 'O'),
('防', 'O'),
('控', 'O'),
('和', 'O'),
('经', 'O'),
('济', 'O'),
('社', 'O'),
('会', 'O'),
('发', 'O'),
('展', 'O'),
('工', 'O'),
('作', 'O'),
('取', 'O'),
('得', 'O'),
('积', 'O'),
('极', 'O'),
('成', 'O'),
('效', 'O')]]
Checking how `_correct_tags` handles the misaligned fragment above:

```python
ttt = [('武', 'O'), ('汉', 'M_nt'), ('保', 'M_nt'), ('卫', 'M_nt'), ('战', 'M_nt')]
tttags = [tmp[1] for tmp in ttt]
tttags
```
['O', 'M_nt', 'M_nt', 'M_nt', 'M_nt']
```python
parser._correct_tags(tttags)
```
['O', 'O', 'O', 'O', 'O']
```python
seqs = re.findall('[,。!?、‘’“”:]', text.strip())
seqs
```
[',', ',', '、', ',', ',', '、', '、', '。', ',', '、', ',', ',', '。']
The misalignment comes from padding to a fixed `maxlen`: offsets into the padded matrix no longer match positions in the original text. Rewrite `Tokenizer.encode` to record each segment's `(start, end)` offsets instead, so predictions can be aligned back to the text:

```python
class Tokenizer:
    def __init__(self, vocab_file):
        with open(vocab_file, 'rb') as inp:
            self.token2idx = pickle.load(inp)
            self.idx2token = pickle.load(inp)

    def encode(self, text):
        # Map each punctuation-delimited segment to its token ids,
        # keyed by the segment's (start, end) span in the original text
        res = {}
        start = 0
        for m in re.finditer('[,。!?、‘’“”:]', text):
            end = m.span()[0]
            if end > start:
                res[(start, end)] = [
                    self.token2idx.get(char, self.token2idx['[unknown]'])
                    for char in text[start:end]
                ]
            start = m.span()[1]
        return res
```
```python
text = "新冠肺炎疫情发生后,以习近平同志为核心的党中央将疫情防控作为头等大事来抓,习近平\
总书记亲自指挥、亲自部署,坚持把人民生命安全和身体健康放在第一位,领导全党全军全国各族人民打好疫情\
防控的人民战争、总体战、阻击战。经过艰苦卓绝的努力,武汉保卫战、湖北保卫战取得决定性成果,疫情防控\
阻击战取得重大战略成果,统筹推进疫情防控和经济社会发展工作取得积极成效。"
```
```python
res = []
start = 0
for m in re.finditer('[,。!?、‘’“”:]', text):
    end = m.span()[0]
    seg = text[start:end]
    seg_ids = [token2idx[token] for token in seg]
    seg_labels = viterbi_decode(seg_ids, pi, A, B)
    # keep the original character positions alongside the predicted tags
    res.append(list(zip(list(range(start, end)), seg_labels)))
    start = m.span()[1]
res
```
[[(0, 3), (1, 3), (2, 3), (3, 3), (4, 3), (5, 3), (6, 3), (7, 3), (8, 3)],
[(10, 3),
(11, 3),
(12, 3),
(13, 3),
(14, 3),
(15, 3),
(16, 3),
(17, 3),
(18, 3),
(19, 3),
(20, 9),
(21, 1),
(22, 0),
(23, 3),
(24, 3),
(25, 3),
(26, 3),
(27, 3),
(28, 3),
(29, 3),
(30, 3),
(31, 3),
(32, 3),
(33, 3),
(34, 3),
(35, 3)],
[(37, 3),
(38, 3),
(39, 3),
(40, 3),
(41, 3),
(42, 3),
(43, 3),
(44, 3),
(45, 3),
(46, 3)],
[(48, 3), (49, 3), (50, 3), (51, 3)],
[(53, 3),
(54, 3),
(55, 3),
(56, 3),
(57, 3),
(58, 3),
(59, 3),
(60, 3),
(61, 3),
(62, 3),
(63, 3),
(64, 3),
(65, 3),
(66, 3),
(67, 3),
(68, 3),
(69, 3),
(70, 3),
(71, 3)],
[(73, 3),
(74, 3),
(75, 3),
(76, 3),
(77, 3),
(78, 3),
(79, 3),
(80, 3),
(81, 3),
(82, 3),
(83, 3),
(84, 3),
(85, 3),
(86, 3),
(87, 3),
(88, 3),
(89, 3),
(90, 3),
(91, 3),
(92, 3),
(93, 3),
(94, 3),
(95, 3)],
[(97, 3), (98, 3), (99, 3)],
[(101, 3), (102, 3), (103, 3)],
[(105, 3),
(106, 3),
(107, 3),
(108, 3),
(109, 3),
(110, 3),
(111, 3),
(112, 3),
(113, 3)],
[(115, 9), (116, 1), (117, 1), (118, 1), (119, 1)],
[(121, 5),
(122, 7),
(123, 3),
(124, 3),
(125, 3),
(126, 3),
(127, 3),
(128, 3),
(129, 3),
(130, 3),
(131, 3),
(132, 3)],
[(134, 3),
(135, 3),
(136, 3),
(137, 3),
(138, 3),
(139, 3),
(140, 3),
(141, 3),
(142, 3),
(143, 3),
(144, 3),
(145, 3),
(146, 3),
(147, 3),
(148, 3)],
[(150, 3),
(151, 3),
(152, 3),
(153, 3),
(154, 3),
(155, 3),
(156, 3),
(157, 3),
(158, 3),
(159, 3),
(160, 3),
(161, 3),
(162, 3),
(163, 3),
(164, 3),
(165, 3),
(166, 3),
(167, 3),
(168, 3),
(169, 3),
(170, 3),
(171, 3),
(172, 3)]]
```python
for seg in res:
    for idx, tag_id in seg:
        print(text[idx], idx2tag[tag_id])
```
新 O
冠 O
肺 O
炎 O
疫 O
情 O
发 O
生 O
后 O
以 O
习 O
近 O
平 O
同 O
志 O
为 O
核 O
心 O
的 O
党 B_nt
中 M_nt
央 E_nt
将 O
疫 O
情 O
防 O
控 O
作 O
为 O
头 O
等 O
大 O
事 O
来 O
抓 O
习 O
近 O
平 O
总 O
书 O
记 O
亲 O
自 O
指 O
挥 O
亲 O
自 O
部 O
署 O
坚 O
持 O
把 O
人 O
民 O
生 O
命 O
安 O
全 O
和 O
身 O
体 O
健 O
康 O
放 O
在 O
第 O
一 O
位 O
领 O
导 O
全 O
党 O
全 O
军 O
全 O
国 O
各 O
族 O
人 O
民 O
打 O
好 O
疫 O
情 O
防 O
控 O
的 O
人 O
民 O
战 O
争 O
总 O
体 O
战 O
阻 O
击 O
战 O
经 O
过 O
艰 O
苦 O
卓 O
绝 O
的 O
努 O
力 O
武 B_nt
汉 M_nt
保 M_nt
卫 M_nt
战 M_nt
湖 B_ns
北 E_ns
保 O
卫 O
战 O
取 O
得 O
决 O
定 O
性 O
成 O
果 O
疫 O
情 O
防 O
控 O
阻 O
击 O
战 O
取 O
得 O
重 O
大 O
战 O
略 O
成 O
果 O
统 O
筹 O
推 O
进 O
疫 O
情 O
防 O
控 O
和 O
经 O
济 O
社 O
会 O
发 O
展 O
工 O
作 O
取 O
得 O
积 O
极 O
成 O
效 O