In the first chapter we gave a quick overview of the NER task and the baseline model (Bert-Bilstm-CRF explained with code). This chapter is organized by how the problems are tackled: we look at how multi-task learning and adversarial transfer learning address issues in entity recognition such as blurry entity boundaries and the scarcity of labeled samples in vertical domains. Github-DSXiangLi/ChineseNER provides two models, bert_bilstm_crf_mtl (multi-task) and bert_bilstm_crf_adv (adversarial transfer), supporting joint training of any NER+NER or CWS+NER combination. The repo also ships a bert_bilstm_crf_mtl model trained on MSRA+MSR together with the serving code, so it can be used out of the box~
Multi-Task Learning
References 1, 2, and 3 below are all about using multi-task learning to improve NER. In short, multi-task learning brings two benefits:
- Extra information: it helps learn features that are hard to extract from the main task alone
- More general text features: information that multiple tasks all need is the general information; this can also be viewed as regularization, or as different tasks bringing in noise and producing a bagging-like effect
MTL comes in many model structures; we will mainly use the first three, hard sharing, asymmetric sharing, and customized sharing. Let's look in detail at the various ways MTL is applied to the NER task.
Word Boundary Enhancement: NER + CWS
paper: Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning, 2016
Jointly training a Chinese word segmentation (CWS) task with NER mainly reflects the first benefit above, 'extra information': the word-boundary labels in the CWS samples are used to improve the accuracy of NER boundary detection. Below is the model structure from Ref 1, which is basically the asymmetric sharing described above. NER and CWS share the character embedding; besides the character embedding and NER-specific features, the NER CRF layer also consumes the last CWS layer, which carries the segmentation information. I am somewhat skeptical about using the asymmetric structure here: if CWS and NER were segmentation and entity annotations of the same samples, asymmetric sharing would indeed be more reasonable, but in the paper one dataset is news text and the other is social-media text, so asymmetric sharing seems likely to introduce more noise than hard sharing. We test this later with the MSRA and MSR data.
Because the CWS and NER datasets differ a lot in size, the authors propose subsampling the larger dataset in each iteration, which noticeably speeds up convergence. My datasets are of similar size, so I did not do this; subsampling, or using different batch sizes plus task weights, should have a similar effect.
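As a toy illustration only (not taken from the paper or the repo), per-epoch subsampling of the larger dataset could look like this, where the task id 0/1 matches the task ordering used in the training code further below:

import random

def mixed_epoch(ner_samples, cws_samples, ratio=1.0):
    # hypothetical helper: subsample the (larger) CWS set each epoch so both tasks
    # contribute a comparable number of batches, then shuffle them together
    n = min(len(cws_samples), int(len(ner_samples) * ratio))
    cws_subset = random.sample(cws_samples, n)
    mixed = [(x, 0) for x in ner_samples] + [(x, 1) for x in cws_subset]  # attach task_id
    random.shuffle(mixed)
    return mixed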
Cross-Domain Semi-Supervised Learning: NER + NER
paper: A Unified Model for Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media, 2017
Jointly learning NER tasks from different domains mainly reflects the second benefit, 'more general text features': out-of-domain labeled samples and in-domain unlabeled samples are used to help the in-domain labeled samples learn more general text and entity features.
For out-of-domain to in-domain transfer, the main problem is the discrepancy between samples: since the end goal is to help the in-domain text learn a reasonable representation, out-of-domain samples that differ too much from the target domain should be penalized. The authors compare three ways of measuring the similarity $func(x, IN)$ between a sample x and the target domain, listed below; the cosine distance works best (a small sketch follows the list).
- cross-entropy: the entropy of x under a target-domain n-gram language model
- Gaussian: average the embeddings of all target-domain text to build $v_{IN}$, then compute the Euclidean distance between $v_x$ and $v_{IN}$
- Polynomial kernel: the cosine distance between $v_x$ and $v_{IN}$
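For intuition only, here is a minimal numpy sketch of the cosine variant, assuming $v_x$ is the mean of a sample's token embeddings and $v_{IN}$ is the centroid of the target-domain embeddings (the function name and details are mine, not the paper's):

import numpy as np

def domain_similarity(sample_token_vecs, in_domain_doc_vecs):
    # v_x: mean token embedding of the sample; v_in: target-domain centroid
    v_x = sample_token_vecs.mean(axis=0)
    v_in = in_domain_doc_vecs.mean(axis=0)
    return float(np.dot(v_x, v_in) /
                 (np.linalg.norm(v_x) * np.linalg.norm(v_in) + 1e-8))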
For semi-supervised learning on in-domain unlabeled samples, the model predictions are used directly as the labels, so predictions with low confidence need to be penalized. The authors define $confid(x)$ as the relative improvement of the best prediction over the second-best one. The confidence is dynamic: in each iteration the unlabeled samples have to be predicted first to obtain $confid(x)$.
The whole model is a joint training of in-domain labeled/unlabeled samples and out-of-domain labeled samples; the similarity and confidence above are used to adjust the per-sample learning rate in each iteration, $lr = lr_0 \cdot weight(x, t)$.
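A rough sketch of the two quantities, with the exact definitions simplified (the margin-based confidence and the scalar lr scaling below are my approximations, not the paper's exact formulas):

import numpy as np

def confid(token_probs):
    # relative improvement of the best prediction over the second best, averaged over tokens
    top2 = np.sort(token_probs, axis=-1)[:, -2:]          # per-token top-2 probabilities
    margin = (top2[:, 1] - top2[:, 0]) / (top2[:, 0] + 1e-8)
    return float(margin.mean())

def sample_lr(lr0, weight):
    # lr = lr0 * weight(x, t): weight is the domain similarity for out-of-domain
    # labeled samples and the prediction confidence for in-domain unlabeled ones
    return lr0 * weight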
One contribution of this paper is the use of unlabeled samples. Personally, though, I think it is unlikely to be used directly in practice: NER is a token-level classification task, so label noise disturbs the overall performance quite a lot. Using $confid(x)$ as an active-learning selection strategy, i.e. picking samples for annotators to label, is a more practical idea worth trying.
The other contribution is adjusting the learning rate with domain similarity. It does take domain differences into account, but the solution is still fairly simple: it can only reduce, not remove, the impact of the domain gap. We treat it here only as a prelude to how adversarial transfer learning handles the domain-gap problem.
Model Implementation
model/bert_bilstm_crf_mtl in the repo implements a multi-task joint-training framework based on bert-bilstm-crf. Depending on whether the input datasets are NER+NER or NER+CWS, it realizes the cross-domain learning or the word-boundary enhancement described above. The main MTL parameters are task_weight, which controls the loss weights of the two tasks, and asymmetry, which controls whether the structure is hard sharing (the tasks only share BERT) or asymmetric sharing (task2 consumes task1's hidden output). By default, the order in which the datasets are passed in determines task1 and task2.
import tensorflow as tf
# pretrain_bert_embedding, load_bert_checkpoint, bilstm, crf_layer, crf_decode
# and add_layer_summary are helper functions defined elsewhere in the repo

def build_graph(features, labels, params, is_training):
    input_ids = features['token_ids']
    label_ids = features['label_ids']
    input_mask = features['mask']
    segment_ids = features['segment_ids']
    seq_len = features['seq_len']
    task_ids = features['task_ids']

    embedding = pretrain_bert_embedding(input_ids, input_mask, segment_ids, params['pretrain_dir'],
                                        params['embedding_dropout'], is_training)
    load_bert_checkpoint(params['pretrain_dir'])  # load pretrained bert weights from checkpoint

    mask1 = tf.equal(task_ids, 0)
    mask2 = tf.equal(task_ids, 1)
    batch_size = tf.shape(task_ids)[0]

    with tf.variable_scope(params['task_list'][0], reuse=tf.AUTO_REUSE):
        task_params = params[params['task_list'][0]]
        lstm_output1 = bilstm(embedding, params['cell_type'], params['rnn_activation'],
                              params['hidden_units_list'], params['keep_prob_list'],
                              params['cell_size'], params['dtype'], is_training)
        logits = tf.layers.dense(lstm_output1, units=task_params['label_size'], activation=None,
                                 use_bias=True, name='logits')
        add_layer_summary(logits.name, logits)
        trans1, loglikelihood1 = crf_layer(logits, label_ids, seq_len, task_params['label_size'], is_training)
        pred_ids1 = crf_decode(logits, trans1, seq_len, task_params['idx2tag'], is_training, mask1)
        loss1 = tf.reduce_sum(tf.boolean_mask(-loglikelihood1, mask1, axis=0)) * params['task_weight'][0]
        tf.summary.scalar('loss', loss1)

    with tf.variable_scope(params['task_list'][1], reuse=tf.AUTO_REUSE):
        task_params = params[params['task_list'][1]]
        lstm_output2 = bilstm(embedding, params['cell_type'], params['rnn_activation'],
                              params['hidden_units_list'], params['keep_prob_list'],
                              params['cell_size'], params['dtype'], is_training)
        if params['asymmetry']:
            # if asymmetry, task2 is the main task and also uses task1's information
            lstm_output2 = tf.concat([lstm_output1, lstm_output2], axis=-1)
        logits = tf.layers.dense(lstm_output2, units=task_params['label_size'], activation=None,
                                 use_bias=True, name='logits')
        add_layer_summary(logits.name, logits)
        trans2, loglikelihood2 = crf_layer(logits, label_ids, seq_len, task_params['label_size'], is_training)
        pred_ids2 = crf_decode(logits, trans2, seq_len, task_params['idx2tag'], is_training, mask2)
        loss2 = tf.reduce_sum(tf.boolean_mask(-loglikelihood2, mask2, axis=0)) * params['task_weight'][1]
        tf.summary.scalar('loss', loss2)

    loss = (loss1 + loss2) / tf.cast(batch_size, dtype=params['dtype'])
    pred_ids = tf.where(tf.equal(task_ids, 0), pred_ids1, pred_ids2)  # at inference time all pred_ids belong to a single task
    return loss, pred_ids, task_ids
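For reference, the MTL-related entries in params might look roughly like this. The key names are the ones read by the code above; the values (task names, label sizes, tag maps) are purely illustrative, not the repo's actual config:

# illustrative values only -- see the repo's config for the real schema
params_mtl = {
    'task_list': ['msra', 'msr'],      # dataset order defines task1 and task2
    'task_weight': [1.0, 0.5],         # per-task loss weights
    'asymmetry': False,                # False = hard sharing, True = task2 reuses task1's lstm output
    'msra': {'label_size': 3, 'idx2tag': {0: 'O', 1: 'B-LOC', 2: 'I-LOC'}},  # dummy NER tag map
    'msr': {'label_size': 4, 'idx2tag': {0: 'B', 1: 'M', 2: 'E', 3: 'S'}},   # dummy CWS tag map
}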
I tested hard and asymmetric multi-task learning on NER(MSRA)+NER(people_daily) and on NER+CWS(MSR), comparing against the Bert-Bilstm+CRF benchmark from the previous chapter. Overall, MTL brings essentially no improvement on MSRA, but a very noticeable gain of about 3~4% F1 on the much smaller people_daily task. However, the asymmetric multi-task structure used in the papers above did not bring a clear improvement; hard sharing, which only introduces the auxiliary task to help fine-tune BERT, actually works somewhat better. That said, MTL results depend heavily on the choice of tasks, so these conclusions do not transfer directly to other tasks.
Adversarial Transfer Learning
One problem the multi-task learning above leaves unsolved is that hard and asymmetric sharing place no constraint on the shared layers. During joint training, the differences between tasks bring extra noise along with the information gain, and the shared layers extract task-specific private features along with the general ones. When the auxiliary task differs too much from the main task, or is too noisy, MTL can actually hurt the main task.
The task difference here can be the word-granularity gap between segmentation and entity recognition, the textual gap between NER tasks from different domains, and so on. Weighting the learning rate by domain similarity, as mentioned earlier, only mitigates the problem rather than solving it. Let's see how adversarial learning separates task-specific features/noise from the general features.
Gradient Reversal (GRL)
paper: Adversarial Transfer Learning for Chinese Named Entity Recognition with Self-Attention Mechanism, 2018
This is where the third MTL structure above, customized sharing, comes in. Take the NER+CWS task as an example: keep the NER tower and the CWS tower from before, and add an extra shared tower. Ideally all general features, e.g. word-boundary information at the same granularity, are learned by the shared tower, while the NER/CWS task-specific private features are learned by the NER/CWS towers respectively. The authors add an adversarial mechanism to the shared tower to constrain it to keep general features as much as possible. The model structure is shown below [since we use BERT to extract features, the self-attention layer can be ignored for now].
The shared tower in the middle is a task discriminator. Setting gradient reversal aside for a moment, it extracts bidirectional text features from the input, passes them through a max-pooling layer to get a $2 \cdot d_h$-dimensional feature, and then uses a softmax to classify whether the sample comes from the NER or the CWS task, a binary classification problem (or multi-class if there are more tasks).
From a propensity-score point of view, if the softmax probabilities all stay around 0.5, the features learned by the shared tower cannot distinguish the tasks, which is exactly the kind of general feature we want. To achieve this, the authors introduce a minimax adversarial mechanism: the softmax discriminator tries its best to identify the task, while the shared BiLSTM feature extractor tries its best to extract task-confusing general features.
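Reconstructed roughly from Refs 4 and 5 (so treat the exact form with a grain of salt), the adversarial loss is a minimax objective, where $D$ is the softmax discriminator with parameters $\theta_D$, $\theta_s$ are the shared BiLSTM parameters, $d_i^k$ indicates which task sample $x_i^k$ comes from, and $\lambda$ weights the adversarial term:

$$
L_{adv} = \min_{\theta_s}\Big(\lambda \max_{\theta_D} \sum_{k=1}^{K}\sum_{i=1}^{N_k} d_i^k \log\big[D\big(E_s(x_i^k)\big)\big]\Big)
$$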
Here K indexes the tasks, $N_k$ is the number of samples of task k, $E_s$ is the BiLSTM used to extract the shared information, and $x_i^k$ is the i-th sample of task k; the formula above is written for the multi-class case.
The authors use a gradient reversal layer (GRL) to implement the minimax. When the gradient of the softmax features used to identify the task is backpropagated through the gradient reversal layer, its sign is flipped ($-1 \times$ gradient) before updating the shared BiLSTM parameters; it is a bit like an alternative engineering implementation of a GAN in which the generator and discriminator are trained synchronously with the same number of steps. There were comments earlier that gradient reversal seems odd, because the goal is for the share-bilstm to learn general features, not 'inverted' features that classify CWS as NER and NER as CWS. Personally I don't think that happens, because the minimax mechanism is in place: in actual training the task discriminator does reach a dynamic equilibrium around probability = 0.5 and cross-entropy ≈ 0.7 after a while.
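The flip_gradient helper used in the model code below implements this reversal. A common TF1-style trick (identity in the forward pass, gradient scaled by a negative factor in the backward pass) relies on tf.stop_gradient; this is a minimal sketch assuming the repo does something similar, not its exact implementation:

import tensorflow as tf

def flip_gradient(x, weight=1.0):
    # forward: returns x unchanged; backward: gradient is multiplied by -weight,
    # pushing the shared encoder to confuse the task discriminator
    flipped = -weight * x
    return flipped + tf.stop_gradient(x - flipped)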
Model Implementation
def build_graph(features, labels, params, is_training):
    input_ids = features['token_ids']
    label_ids = features['label_ids']
    input_mask = features['mask']
    segment_ids = features['segment_ids']
    seq_len = features['seq_len']
    task_ids = features['task_ids']

    embedding = pretrain_bert_embedding(input_ids, input_mask, segment_ids, params['pretrain_dir'],
                                        params['embedding_dropout'], is_training)
    load_bert_checkpoint(params['pretrain_dir'])  # load pretrained bert weights from checkpoint

    mask1 = tf.equal(task_ids, 0)
    mask2 = tf.equal(task_ids, 1)
    batch_size = tf.shape(task_ids)[0]

    with tf.variable_scope('task_discriminator', reuse=tf.AUTO_REUSE):
        share_output = bilstm(embedding, params['cell_type'], params['rnn_activation'],
                              params['hidden_units_list'], params['keep_prob_list'],
                              params['cell_size'], params['dtype'], is_training)  # batch * max_seq * (2*hidden)
        share_max_pool = tf.reduce_max(share_output, axis=1, name='share_max_pool')  # batch * (2*hidden), keep the most salient feature per dimension
        # gradient reversal: identity forward, negated (scaled) gradient backward, so the
        # shared bilstm is trained to confuse the task discriminator
        share_max_pool = flip_gradient(share_max_pool, params['shrink_gradient_reverse'])
        share_max_pool = tf.layers.dropout(share_max_pool, rate=params['share_dropout'],
                                           seed=1234, training=is_training)
        add_layer_summary(share_max_pool.name, share_max_pool)
        logits = tf.layers.dense(share_max_pool, units=len(params['task_list']), activation=None,
                                 use_bias=True, name='logits')  # batch * num_task
        add_layer_summary(logits.name, logits)
        adv_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=features['task_ids'], logits=logits)
        adv_loss = tf.reduce_mean(adv_loss, name='loss')
        tf.summary.scalar('loss', adv_loss)

    with tf.variable_scope('task1_{}'.format(params['task_list'][0]), reuse=tf.AUTO_REUSE):
        task_params = params[params['task_list'][0]]
        lstm_output = bilstm(embedding, params['cell_type'], params['rnn_activation'],
                             params['hidden_units_list'], params['keep_prob_list'],
                             params['cell_size'], params['dtype'], is_training)
        lstm_output = tf.concat([share_output, lstm_output], axis=-1)  # batch * max_seq * (4*hidden)
        logits = tf.layers.dense(lstm_output, units=task_params['label_size'], activation=None,
                                 use_bias=True, name='logits')
        add_layer_summary(logits.name, logits)
        trans1, loglikelihood1 = crf_layer(logits, label_ids, seq_len, task_params['label_size'], is_training)
        pred_ids1 = crf_decode(logits, trans1, seq_len, task_params['idx2tag'], is_training, mask1)
        loss1 = tf.reduce_sum(tf.boolean_mask(-loglikelihood1, mask1, axis=0)) * params['task_weight'][0]
        tf.summary.scalar('loss', loss1)

    with tf.variable_scope('task2_{}'.format(params['task_list'][1]), reuse=tf.AUTO_REUSE):
        task_params = params[params['task_list'][1]]
        lstm_output = bilstm(embedding, params['cell_type'], params['rnn_activation'],
                             params['hidden_units_list'], params['keep_prob_list'],
                             params['cell_size'], params['dtype'], is_training)
        lstm_output = tf.concat([share_output, lstm_output], axis=-1)  # batch * max_seq * (4*hidden)
        logits = tf.layers.dense(lstm_output, units=task_params['label_size'], activation=None,
                                 use_bias=True, name='logits')
        add_layer_summary(logits.name, logits)
        trans2, loglikelihood2 = crf_layer(logits, label_ids, seq_len, task_params['label_size'], is_training)
        pred_ids2 = crf_decode(logits, trans2, seq_len, task_params['idx2tag'], is_training, mask2)
        loss2 = tf.reduce_sum(tf.boolean_mask(-loglikelihood2, mask2, axis=0)) * params['task_weight'][1]
        tf.summary.scalar('loss', loss2)

    # total loss = per-task CRF losses + weighted adversarial loss
    loss = (loss1 + loss2) / tf.cast(batch_size, dtype=params['dtype']) + adv_loss * params['lambda']
    pred_ids = tf.where(tf.equal(task_ids, 0), pred_ids1, pred_ids2)
    return loss, pred_ids, task_ids
Here we compare the results of adv against mtl... I cannot rule out that this is because we already use the powerful BERT as the bottom feature extractor, and the three tasks here are not that different from one another; after all, MTL already gives a very clear improvement on people_daily, so the gap between adv and mtl looks more like random fluctuation. I will give it another try when I have a more specialized vertical-domain dataset~
Reference
- 【CWS+NER MTL】Improving Named Entity Recognition for Chinese Social Media with Word Segmentation Representation Learning, 2016
- 【Cross-Domain LR Adjust】A Unified Model for Cross-Domain and Semi-Supervised Named Entity Recognition in Chinese Social Media, 2017
- 【MTL】Multi-Task Learning for Sequence Tagging: An Empirical Study, 2018
- 【CWS+NER Adv MTL】Adversarial Transfer Learning for Chinese Named Entity Recognition with Self-Attention Mechanism, 2018
- 【Adv MTL】Adversarial Multi-task Learning for Text Classification, 2017
- Dual Adversarial Neural Transfer for Low-Resource Named Entity Recognition, 2019
- 【GRL】Unsupervised Domain Adaptation by Backpropagation, 2015
- 【GRL】Domain-Adversarial Training of Neural Networks, 2016
- www.zhihu.com/question/26…