Text similarity aims to determine whether two pieces of text are semantically similar. It is an important research direction in natural language processing and plays a key role in information retrieval, news recommendation, intelligent customer service, and other areas, which gives it high commercial value.
Several public Chinese text-similarity datasets from academia have, backed by the accompanying papers, been used to evaluate existing public text-similarity models fairly comprehensively and are widely regarded as authoritative. This open project therefore collects these authoritative datasets for a comprehensive evaluation of model performance, aiming to give researchers and developers a platform for academic and technical exchange, to further raise the level of text-similarity research, and to promote its application and development in NLP.
The competition provides three different text datasets. We are required to produce predictions for each of the three datasets separately and then upload them together in a single archive.
Dataset | Description | Train size | Dev size | Test size |
---|---|---|---|---|
LCQMC | Chinese question pairs from Baidu Knows | 238,766 | 8,802 | 12,500 |
BQ Corpus | Question pairs from the banking and finance domain | 100,000 | 10,000 | 10,000 |
PAWS-X | Paraphrase pairs from Google | 49,401 | 2,000 | 2,000 |
From the data, this is a typical text-similarity problem, which can also be framed as a binary classification problem. The solution below is split into two parts: data analysis and model building.
Taking the first sentence as the target, I computed length statistics for each dataset and found that the appropriate maximum length differs between datasets. To keep things simple, I directly used the 97.5th percentile as each dataset's maximum length. The results are as follows:
Dataset | Truncation length |
---|---|
paws-x | 88 |
lcqmc | 22 |
bq_corpus | 30 |
```python
# Length statistics, using paws-x as an example
import pandas as pd
import numpy as np

train = pd.read_csv('data/paws-x/train.tsv', sep='\t', names=['text_a', 'text_b', 'label'])
train['len_a'] = train['text_a'].apply(lambda x: len(x))
p = np.percentile(train['len_a'].tolist(), [75, 90, 97.5])  # 75th / 90th / 97.5th percentiles
```
```python
import numpy as np
import pandas as pd
import jieba
import distance
from tqdm import tqdm
from gensim import corpora, models, similarities
from gensim.test.utils import common_texts
from gensim.models import Word2Vec, TfidfModel
```
This approach is relatively simple and serves as the baseline for the task. The rough steps are: compute the edit distance between the two sentences, segment them with jieba, count characters and words, compute the shared-word ratios and a TF-IDF-weighted word-match score, add the length-difference features, and finally train a LightGBM classifier with 5-fold cross-validation.
```python
def cut(content):
    try:
        seg_list = jieba.lcut(content, cut_all=True)
    except AttributeError as ex:
        print(content)
        raise ex
    return seg_list

# Ratio of shared words
def rate(words_1, words_2):
    int_list = list(set(words_1).intersection(set(words_2)))
    return len(int_list) / len(set(words_1))

# Edit distance
def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

def data_anaysis(df):
    # Edit distance
    df['edit_dist'] = df.apply(lambda row: edit_distance(row['text_a'], row['text_b']), axis=1)
    # Word segmentation
    df['words_a'] = df['text_a'].apply(lambda x: cut(x))
    df['words_b'] = df['text_b'].apply(lambda x: cut(x))
    # Character counts
    df['text_a_len'] = df['text_a'].apply(lambda x: len(x))
    df['text_b_len'] = df['text_b'].apply(lambda x: len(x))
    # Word counts
    df['words_a_len'] = df['words_a'].apply(lambda x: len(x))
    df['words_b_len'] = df['words_b'].apply(lambda x: len(x))
    # Shared-word ratios
    df['rate_a'] = df.apply(lambda row: rate(row['words_a'], row['words_b']), axis=1)
    df['rate_b'] = df.apply(lambda row: rate(row['words_b'], row['words_a']), axis=1)
    return df

train = pd.read_csv('data/paws-x-zh/train.tsv', sep='\t', names=['text_a', 'text_b', 'label'])
test = pd.read_csv('data/paws-x-zh/test.tsv', sep='\t', names=['text_a', 'text_b', 'label'])
# train = train[train['label'].isin(['0','1'])]
test['label'] = -1
train = train.dropna()
test = test.dropna()
train = data_anaysis(train)
test = data_anaysis(test)
test

# TF-IDF-weighted word-match ratio
# (`stop_words` and the word-weight dict `weights` are built in the full source, not shown here)
def tfidf_word_match_share(row, stops):
    q1words = {}
    q2words = {}
    for word in row['words_a']:
        if word not in stops:
            q1words[word] = 1
    for word in row['words_b']:
        if word not in stops:
            q2words[word] = 1
    if len(q1words) == 0 or len(q2words) == 0:
        # Some pairs contain nothing but stop words
        return 0
    shared_weights = [weights.get(w, 0) for w in q1words.keys() if w in q2words] + \
                     [weights.get(w, 0) for w in q2words.keys() if w in q1words]
    total_weights = [weights.get(w, 0) for w in q1words] + [weights.get(w, 0) for w in q2words]
    R = np.sum(shared_weights) / np.sum(total_weights)
    return R

train['tfidf_word_match'] = train.apply(lambda row: tfidf_word_match_share(row, stop_words), axis=1)
test['tfidf_word_match'] = test.apply(lambda row: tfidf_word_match_share(row, stop_words), axis=1)

# Final feature engineering
train['text_len_diff'] = abs(train['text_a_len'] - train['text_b_len'])
train['word_len_diff'] = abs(train['words_a_len'] - train['words_b_len'])
test['text_len_diff'] = abs(test['text_a_len'] - test['text_b_len'])
test['word_len_diff'] = abs(test['words_a_len'] - test['words_b_len'])

from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb

# Modelling ('word_match' is a plain word-overlap feature computed in the full source)
features = ['text_len_diff', 'word_len_diff', 'word_match', 'tfidf_word_match']
X = train[features]
y = train['label']
test_features = test[features]
model = lgb.LGBMClassifier(num_leaves=128, max_depth=10, learning_rate=0.01, n_estimators=2000,
                           subsample=0.8, feature_fraction=0.8, reg_alpha=0.5, reg_lambda=0.5,
                           random_state=2022, metric='auc', boosting_type='gbdt',
                           subsample_freq=1, bagging_fraction=0.8)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2022)
prob = []
mean_acc = 0
for k, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(k)
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y.iloc[train_index], y.iloc[test_index]
    # Train on this fold
    print(y_val)
    model = model.fit(X_train, y_train, eval_set=[(X_val, y_val)], eval_metric='auc', verbose=True)
    # Predict on the test set
    test_y_pred = model.predict_proba(test_features)
    prob.append(test_y_pred)
```
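The `weights` dictionary used by `tfidf_word_match_share` (and the plain `word_match` feature) is built elsewhere in the full source. As a rough idea of how such a count-based weight table can be constructed, here is a minimal sketch; the `get_weight` smoothing constants are assumptions, not the values used in the repository.
```python
# A minimal sketch of building count-based word weights for tfidf_word_match_share.
# The smoothing constants below are assumptions, not the repository's actual values.
from collections import Counter
from itertools import chain

def get_weight(count, eps=10000, min_count=2):
    # Very rare words get weight 0; frequent words get smaller weights.
    return 0 if count < min_count else 1.0 / (count + eps)

all_words = list(chain.from_iterable(train['words_a'])) + list(chain.from_iterable(train['words_b']))
counts = Counter(all_words)
weights = {word: get_weight(count) for word, count in counts.items()}
```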
A traditional neural-network solution has two parts: the embedding layer and the inference network built on top of it.
Usually the embedding can either be learned during network training or obtained from Word2Vec.
Word2Vec is the word-vector method everyone knows best. We first segment the sentence, train a vector for each word, and then fuse the word vectors into a single sentence vector.
There are several different ways to build the sentence vector: max pooling, mean pooling, IDF-weighted averaging, and SIF (weighted averaging followed by removing the first principal component).
```python
import re
import math
from collections import Counter
from sklearn.decomposition import TruncatedSVD

# Load the stop-word list
def get_stopwords():
    stop_words = []
    with open('baidu_stopwords.txt', 'r', encoding='utf-8') as f:
        for line in f.readlines():
            stop_words.append(line.replace('\n', ''))
    return stop_words

# jieba segmentation
def cut(content, stop_words):
    # Strip punctuation and symbols
    content = re.sub("[\s+\.\!\/_,$%^*(+\"\']+|[+——!,。?、~@#¥%……&*()]", "", content)
    result = []
    try:
        seg_list = jieba.lcut(content, cut_all=True)
        for i in seg_list:
            if i not in stop_words:
                result.append(i)
    except AttributeError as ex:
        print(content)
        raise ex
    return result

# Ratio of shared words
def rate(words_1, words_2):
    int_list = list(set(words_1).intersection(set(words_2)))
    return len(int_list) / len(set(words_1))

# Edit distance
def edit_distance(s1, s2):
    return distance.levenshtein(s1, s2)

def data_anaysis(df, stop_words):
    # Edit distance
    df['edit_dist'] = df.apply(lambda row: edit_distance(row['text_a'], row['text_b']), axis=1)
    # Word segmentation
    df['words_a'] = df['text_a'].apply(lambda x: cut(x, stop_words))
    df['words_b'] = df['text_b'].apply(lambda x: cut(x, stop_words))
    # Character counts
    df['text_a_len'] = df['text_a'].apply(lambda x: len(x))
    df['text_b_len'] = df['text_b'].apply(lambda x: len(x))
    # Word counts
    df['words_a_len'] = df['words_a'].apply(lambda x: len(x))
    df['words_b_len'] = df['words_b'].apply(lambda x: len(x))
    # Shared-word ratios
    df['rate_a'] = df.apply(lambda row: rate(row['words_a'], row['words_b']), axis=1)
    df['rate_b'] = df.apply(lambda row: rate(row['words_b'], row['words_a']), axis=1)
    return df

# Load stop words and data
stop_words = get_stopwords()
train = pd.read_csv('data/paws-x-zh/train.tsv', sep='\t', names=['text_a', 'text_b', 'label'])
test = pd.read_csv('data/paws-x-zh/test.tsv', sep='\t', names=['text_a', 'text_b', 'label'])
test['label'] = -1
train = train.dropna()
test = test.dropna()
train = data_anaysis(train, stop_words)
test = data_anaysis(test, stop_words)

# Train the word vectors
context = []
for i in tqdm(range(len(train))):
    row = train.iloc[i]
    context.append(row['words_a'])
    context.append(row['words_b'])
for i in tqdm(range(len(test))):
    row = test.iloc[i]
    context.append(row['words_a'])
    context.append(row['words_b'])
wv_model = Word2Vec(sentences=context, vector_size=100, window=5, min_count=1, workers=4)
wv_model.train(context, total_examples=1, epochs=1)

# Count in how many sentences each word appears (document frequency)
count_list = []
words_num = 0
for i in tqdm(range(len(train))):
    count_list += list(set(train.iloc[i]['words_a']))
    count_list += list(set(train.iloc[i]['words_b']))
    words_num += 2
for i in tqdm(range(len(test))):
    count_list += list(set(test.iloc[i]['words_a']))
    count_list += list(set(test.iloc[i]['words_b']))
    words_num += 2
count = Counter(count_list)

# IDF table
idf = {}
for k, v in tqdm(dict(count).items()):
    idf[k] = math.log(words_num / (v + 1))

# Sentence vectors via pooling
def text_to_wv(model, data, operation='max_pooling', key='wv'):
    full_wv_a = []
    full_wv_b = []
    # Convert every sentence into a pooled word-vector representation
    for i in tqdm(range(len(data))):
        row = data.iloc[i]
        wv_a = []
        for w in row['words_a']:
            wv_a.append(model.wv[w])
        if operation == 'max_pooling':
            full_wv_a.append(np.amax(wv_a, axis=0))
        elif operation == 'mean_pooling':
            full_wv_a.append(np.mean(wv_a, axis=0))
        wv_b = []
        for w in row['words_b']:
            wv_b.append(model.wv[w])
        if operation == 'max_pooling':
            full_wv_b.append(np.amax(wv_b, axis=0))
        elif operation == 'mean_pooling':
            full_wv_b.append(np.mean(wv_b, axis=0))
    data[key + '_a'] = full_wv_a
    data[key + '_b'] = full_wv_b

# IDF-weighted sentence vectors
def idf_to_wv(model, data, idf):
    full_wv_a = []
    full_wv_b = []
    for i in tqdm(range(len(data))):
        row = data.iloc[i]
        wv_a = []
        for w in row['words_a']:
            wv_a.append(model.wv[w] * idf[w])
        full_wv_a.append(np.mean(wv_a, axis=0))
        wv_b = []
        for w in row['words_b']:
            wv_b.append(model.wv[w] * idf[w])
        full_wv_b.append(np.mean(wv_b, axis=0))
    data['idf_wv_a'] = full_wv_a
    data['idf_wv_b'] = full_wv_b

# Max-pooled sentence vectors
text_to_wv(wv_model, train, 'max_pooling', 'max_wv')
text_to_wv(wv_model, test, 'max_pooling', 'max_wv')
# Mean-pooled sentence vectors
text_to_wv(wv_model, train, 'mean_pooling', 'mean_wv')
text_to_wv(wv_model, test, 'mean_pooling', 'mean_wv')
# IDF-weighted mean sentence vectors
idf_to_wv(wv_model, train, idf)
idf_to_wv(wv_model, test, idf)

# SIF sentence vectors
# Compute the principal components (npc = number of components)
def compute_pc(X, npc):
    svd = TruncatedSVD(n_components=npc, n_iter=5, random_state=0)
    svd.fit(X)
    return svd.components_

# Remove the principal component(s)
def remove_pc(X, npc=1):
    pc = compute_pc(X, npc)
    if npc == 1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX

# SIF word weights
def sif_weight(count, a=3e-5):
    # Total word frequency
    word_num = 0
    for k, v in dict(count).items():
        word_num += v
    # Re-weight each word
    sif = {}
    for k, v in dict(count).items():
        sif[k] = a / (a + v / word_num)
    return sif

# SIF-weighted sentence vectors
def sif_to_wv(model, data, sif):
    full_wv_a = []
    full_wv_b = []
    for i in tqdm(range(len(data))):
        row = data.iloc[i]
        wv_a = []
        for w in row['words_a']:
            wv_a.append(model.wv[w] * sif[w])
        full_wv_a.append(np.mean(wv_a, axis=0))
        wv_b = []
        for w in row['words_b']:
            wv_b.append(model.wv[w] * sif[w])
        full_wv_b.append(np.mean(wv_b, axis=0))
    # Remove the first principal component
    full_wv_a = remove_pc(np.array(full_wv_a))
    full_wv_b = remove_pc(np.array(full_wv_b))
    data['sif_wv_a'] = list(full_wv_a)
    data['sif_wv_b'] = list(full_wv_b)

# Update the word weights and build the SIF vectors
sif = sif_weight(count)
sif_to_wv(wv_model, train, sif)
sif_to_wv(wv_model, test, sif)

print(train[['max_wv_a', 'max_wv_b', 'mean_wv_a', 'mean_wv_b',
             'idf_wv_a', 'idf_wv_b', 'sif_wv_a', 'sif_wv_b']][:5])
```
In general, for machine learning (e.g., as extra features for the LightGBM classifier above) or for unsupervised similarity computation, we can use the sentence vectors directly. For a neural network, we instead initialize the embedding layer with the trained word vectors; below is a PyTorch example.
```python
import torch
from torch import nn

# Initialize the word-vector matrix (config.vocab_size / config.embed_dim must match the Word2Vec model)
word_vectors = torch.randn([config.vocab_size, config.embed_dim])
# Copy the Word2Vec vectors into the matrix
for i in range(0, config.vocab_size):
    word_vectors[i, :] = torch.from_numpy(wv_model.wv[i])
# Create the embedding layer (inside the model's __init__) and initialize it with the matrix
self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)
```
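For the unsupervised route mentioned above, the sentence vectors can be compared directly with cosine similarity. A minimal sketch using the mean-pooled vectors built earlier (the 0.5 threshold is an arbitrary assumption):
```python
# Unsupervised similarity from the mean-pooled sentence vectors (threshold 0.5 is an assumption).
def cos_sim_np(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

train['mean_wv_sim'] = train.apply(lambda row: cos_sim_np(row['mean_wv_a'], row['mean_wv_b']), axis=1)
train['mean_wv_pred'] = (train['mean_wv_sim'] > 0.5).astype(int)
# Rough sanity check against the labels
print((train['mean_wv_pred'] == train['label'].astype(int)).mean())
```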
Once the embedding is in place, what remains is the inference network. For the network architecture I implemented three different designs:
Architecture | Notes |
---|---|
Siamese Network | 1. A Siamese model whose encoder core can be a CNN or an RNN 2. The two encodings are concatenated and fed to the inference layers |
InferSent | 1. Similar to the Siamese network, but the RNN encodings are crossed by concatenation, element-wise multiplication, and subtraction 2. The crossed results are concatenated for the final inference |
ESIM | 1. A more complex network built from RNN encoding, attention, composition, and inference 2. Several formulas are involved; the code is below, see the paper for the details |
Since the full code is rather long, only the network definitions are listed below; see the source repository for the complete code.
```python
class LinModel(nn.Module):
    def __init__(self, in_features, out_features):
        super(LinModel, self).__init__()
        self.fc_1 = nn.Sequential(nn.Linear(in_features, 256), nn.ReLU(), nn.Dropout(0.02))
        self.fc_2 = nn.Sequential(nn.Linear(256, 32), nn.ReLU(), nn.Dropout(0.02))
        self.fc_3 = nn.Sequential(nn.Linear(32, 4), nn.ReLU(), nn.Dropout(0.02))
        self.fc_4 = nn.Sequential(nn.Linear(4, out_features))
        self.softmax = nn.Softmax(1)

    def forward(self, X):
        X = self.fc_1(X)
        X = self.fc_2(X)
        X = self.fc_3(X)
        output = self.fc_4(X)
        return self.softmax(output)


class SiamCNN(nn.Module):
    def __init__(self, wv_mode, config):
        super(SiamCNN, self).__init__()
        self.device = config.device
        word_vectors = torch.randn([config.vocab_size, config.embed_dim])
        for i in range(0, config.vocab_size):
            word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Embedding layer initialized from the word vectors
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
        if config.update_embed is False:
            self.embedding.weight.requires_grad = False
        self.conv_1 = nn.Sequential(
            nn.Conv1d(in_channels=config.seq_len, out_channels=16, kernel_size=2, stride=1),
            nn.ReLU(), nn.MaxPool1d(3))
        self.conv_2 = nn.Sequential(
            nn.Conv1d(in_channels=config.seq_len, out_channels=16, kernel_size=3, stride=1),
            nn.ReLU(), nn.MaxPool1d(3))
        self.conv_3 = nn.Sequential(
            nn.Conv1d(in_channels=config.seq_len, out_channels=16, kernel_size=5, stride=1),
            nn.ReLU(), nn.MaxPool1d(3))
        self.flattern = nn.Flatten()
        # Pooling layer
        self.max_pool = nn.MaxPool1d(3)
        # Linear head
        self.lin_model = LinModel(1552, 2)

    # Cosine similarity between two vectors
    def cos_sim(self, vector_a, vector_b):
        """Compute the cosine similarity between vector_a and vector_b."""
        return torch.tensor([torch.cosine_similarity(vector_a, vector_b, 0, 1e-8)])

    def forward_one(self, text):
        # Encode one sentence
        x = self.embedding(text)
        conv_1 = self.conv_1(x)
        conv_2 = self.conv_2(x)
        conv_3 = self.conv_3(x)
        # Concatenate the convolution outputs
        x = torch.cat([conv_1, conv_2, conv_3], 2)
        x = x.view(x.size(0), -1)
        return self.lin_model(x)

    def forward(self, words_a, words_b):
        # words_a: (batch_size, seq_len), e.g. (32, 27)
        # Encode sentence A
        x_a = self.forward_one(words_a)
        # Encode sentence B
        x_b = self.forward_one(words_b)
        return x_a, x_b
```
```python
# LinModel is identical to the one defined for SiamCNN above.

class SiamLSTM(nn.Module):
    def __init__(self, wv_mode, config):
        super(SiamLSTM, self).__init__()
        self.device = config.device
        word_vectors = torch.randn([config.vocab_size, config.embed_dim])
        for i in range(0, config.vocab_size):
            word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Embedding layer initialized from the word vectors
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
        if config.update_embed is False:
            self.embedding.weight.requires_grad = False
        # RNN encoder
        self.rnn = nn.LSTM(input_size=config.embed_dim, hidden_size=10, num_layers=1)
        # Linear head
        self.lin_model = LinModel(270, 2)

    def forward_one(self, text):
        x = self.embedding(text)   # embedding lookup
        x = x.transpose(0, 1)      # the RNN expects (L, B, H)
        x, _ = self.rnn(x)
        x = x.transpose(0, 1)      # back to (B, L, H)
        x = x.contiguous().view(x.size(0), -1)
        return self.lin_model(x)

    def forward(self, words_a, words_b):
        # Encode sentence A
        x_a = self.forward_one(words_a)
        # Encode sentence B
        x_b = self.forward_one(words_b)
        return x_a, x_b
```
```python
# LinModel is identical to the one defined for SiamCNN above.

class InferSent(nn.Module):
    def __init__(self, wv_mode, config):
        super(InferSent, self).__init__()
        self.device = config.device
        word_vectors = torch.randn([config.vocab_size, config.embed_dim])
        for i in range(0, config.vocab_size):
            word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Embedding layer initialized from the word vectors
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
        if config.update_embed is False:
            self.embedding.weight.requires_grad = False
        # Two-layer bidirectional RNN
        self.rnn = nn.LSTM(input_size=config.embed_dim, hidden_size=10, num_layers=2, bidirectional=True)
        # Linear head
        self.lin_model = LinModel(2160, 2)

    def forward(self, words_a, words_b):
        # Encode sentence A
        x_a = self.embedding(words_a)   # embedding lookup
        x_a = x_a.transpose(0, 1)       # the RNN expects (L, B, H)
        x_a, _ = self.rnn(x_a)
        x_a = x_a.transpose(0, 1)       # back to (B, L, H)
        # Encode sentence B
        x_b = self.embedding(words_b)
        x_b = x_b.transpose(0, 1)
        x_b, _ = self.rnn(x_b)
        x_b = x_b.transpose(0, 1)
        '''
        Three ways of crossing the two encodings, shapes:
          sentence A encoding x_a: (128, 27, 20)
          sentence B encoding x_b: (128, 27, 20)
          concatenation       X_1: (128, 27, 40)
          multiplication      X_2: (128, 27, 20)
          subtraction         X_3: (128, 27, 20)
        '''
        # 1. concatenation
        X_1 = torch.cat([x_a, x_b], 2)
        # 2. element-wise multiplication
        X_2 = torch.mul(x_a, x_b)
        # 3. subtraction
        X_3 = torch.sub(x_a, x_b)
        # Concatenate the three results and flatten
        X = torch.cat([X_1, X_2, X_3], 2)   # (128, 27, 80)
        X = X.view(X.size(0), -1)           # (128, 2160)
        # Linear inference
        output = self.lin_model(X)
        return output
```
```python
import torch.nn.functional as F


class RNNDropout(nn.Dropout):
    # Zero out some embedding dimensions across the whole sequence
    def forward(self, sequences_batch):   # (B, L, D)
        # An all-ones tensor of shape (B, D)
        ones = sequences_batch.data.new_ones(sequences_batch.shape[0], sequences_batch.shape[-1])
        # Random dropout mask (B, D)
        dropout_mask = nn.functional.dropout(ones, self.p, self.training, inplace=False)
        # Apply the mask to the data (B, L, D); the mask needs an extra dimension
        return dropout_mask.unsqueeze(1) * sequences_batch


# Stacked bidirectional RNN
class StackedBRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, dropout_rate=0,
                 dropout_output=False, rnn_type=nn.LSTM, concat_layers=False):
        super().__init__()
        self.dropout_output = dropout_output
        self.dropout_rate = dropout_rate
        self.num_layers = num_layers
        self.concat_layers = concat_layers   # use the last layer's output or the concatenation of all layers
        self.rnns = nn.ModuleList()
        # Stack the RNN layers in a ModuleList
        for i in range(num_layers):
            # From the second layer on, the input is the bidirectional hidden size
            if i != 0:
                input_size = 2 * hidden_size
            self.rnns.append(rnn_type(input_size, hidden_size, num_layers=1, bidirectional=True))

    def forward(self, x):              # (B, L, D)
        # To the layout the RNN expects
        x = x.transpose(0, 1)          # (L, B, D)
        # Keep the output of every layer; the input x is element 0
        outputs = [x]
        for i in range(self.num_layers):
            rnn_input = outputs[-1]
            # dropout
            if self.dropout_rate > 0:
                rnn_input = F.dropout(rnn_input, p=self.dropout_rate, training=self.training)
            # Feed the previous layer's output into the current layer
            rnn_output = self.rnns[i](rnn_input)[0]   # only the output, not (h_n, c_n)
            outputs.append(rnn_output)
        if self.concat_layers:
            # Concatenate the outputs of all layers (skip element 0, which is the input)
            output = torch.cat(outputs[1:], 2)   # (L, B, D)
        else:
            # Use only the last layer
            output = outputs[-1]                 # (L, B, D)
        # Back to batch-first
        output = output.transpose(0, 1)          # (B, L, D)
        # dropout on the output
        if self.dropout_output and self.dropout_rate > 0:
            output = F.dropout(output, p=self.dropout_rate, training=self.training)   # (B, L, D)
        # transpose leaves the tensor non-contiguous in memory; make it contiguous
        return output.contiguous()


class BidirectionalAttention(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, v1, v1_mask, v2, v2_mask):
        '''
        v1      (B, L, H)
        v1_mask (B, L)
        v2      (B, R, H)
        v2_mask (B, R)
        '''
        # 1. Similarity matrix v1 @ v2^T
        similarity_matrix = v1.bmm(v2.transpose(2, 1).contiguous())   # (B, L, R)
        # 2./3. There is no point attending to PAD positions, so mask them before the softmax
        # Mask the weights of v1's PAD positions: v1_mask (B, L) -> (B, L, 1)
        v2_v1_attn = F.softmax(
            similarity_matrix.masked_fill(v1_mask.unsqueeze(2), -1e7), dim=1)   # (B, L, R)
        # Mask the weights of v2's PAD positions: v2_mask (B, R) -> (B, 1, R)
        v1_v2_attn = F.softmax(
            similarity_matrix.masked_fill(v2_mask.unsqueeze(1), -1e7), dim=2)   # (B, L, R)
        # 4. Attention
        # Influence of sentence b on a: (B, L, R) @ (B, R, H)
        attented_v1 = v1_v2_attn.bmm(v2)                   # (B, L, H)
        # Influence of sentence a on b: (B, L, R) -> (B, R, L) @ (B, L, H)
        attented_v2 = v2_v1_attn.transpose(1, 2).bmm(v1)   # (B, R, H)
        # Zero out the PAD positions of each result
        attented_v1 = attented_v1.masked_fill(v1_mask.unsqueeze(2), 0)
        attented_v2 = attented_v2.masked_fill(v2_mask.unsqueeze(2), 0)
        return attented_v1, attented_v2


class ESIM(nn.Module):
    def __init__(self, wv_mode, config: Config):
        super(ESIM, self).__init__()
        # ----------------------- encoding ---------------------#
        word_vectors = torch.randn([config.vocab_size, config.embed_dim])
        for i in range(0, config.vocab_size):
            word_vectors[i, :] = torch.from_numpy(wv_mode.wv[i])
        # Embedding layer initialized from the word vectors
        self.embedding = nn.Embedding.from_pretrained(word_vectors, freeze=config.update_embed)  # (32, 27, 100)
        if config.update_embed is False:
            self.embedding.weight.requires_grad = False
        # Dropout applied to the embeddings
        self.rnn_dropout = RNNDropout(config.dropout)
        rnn_size = config.hidden_size
        if config.concat_layers is True:
            rnn_size //= config.num_layers
        config.hidden_size = rnn_size // 2 * 2 * 2   # first *2: bidirectional, second *2: layer concatenation
        self.input_encoding = StackedBRNN(input_size=config.embed_dim,
                                          hidden_size=rnn_size // 2,
                                          num_layers=config.num_layers,
                                          rnn_type=nn.LSTM,
                                          concat_layers=config.concat_layers)
        # ----------------------- encoding ---------------------#
        # ----------------------- attention --------------------#
        self.attention = BidirectionalAttention()
        # ----------------------- attention --------------------#
        # ----------------------- composition ------------------#
        self.projection = nn.Sequential(
            nn.Linear(4 * config.hidden_size, config.hidden_size),
            nn.ReLU())
        self.composition = StackedBRNN(input_size=config.hidden_size,
                                       hidden_size=rnn_size // 2,
                                       num_layers=config.num_layers,
                                       rnn_type=nn.LSTM,
                                       concat_layers=config.concat_layers)
        # ----------------------- composition ------------------#
        # ----------------------- inference --------------------#
        self.classification = nn.Sequential(
            nn.Dropout(p=config.dropout),
            nn.Linear(4 * config.hidden_size, config.hidden_size),
            nn.Tanh(),
            nn.Dropout(p=config.dropout))
        self.out = nn.Linear(config.hidden_size, config.num_labels)
        # ----------------------- inference --------------------#

    def forward(self, words_a, words_b):
        '''
        Dimension legend:
          B: batch_size
          L: length of sentence a
          R: length of sentence b
          D: embedding size
          H: hidden size
        '''
        query = words_a   # (B, L)
        doc = words_b     # (B, R)
        # ----------------------- encoding ---------------------#
        # Build the masks: positions equal to 0 are PAD
        # query: [2,3,4,5,0,0,0] -> query_mask: [F,F,F,F,T,T,T]
        query_mask = (query == 0)   # (B, L)
        doc_mask = (doc == 0)       # (B, R)
        # Embedding lookup
        query = self.embedding(query)   # (B, L, D)
        doc = self.embedding(doc)       # (B, R, D)
        # Dropout on the embeddings
        query = self.rnn_dropout(query)   # (B, L, D)
        doc = self.rnn_dropout(doc)       # (B, R, D)
        # Encode with the stacked bidirectional RNN
        query = self.input_encoding(query)   # (B, L, H)
        doc = self.input_encoding(doc)       # (B, R, H)
        # ----------------------- encoding ---------------------#
        # ----------------------- attention --------------------#
        '''
        1. compute the similarity matrix of the two sentences
        2. mask the PAD positions
        3. softmax
        4. compute the attention
        '''
        attended_query, attended_doc = self.attention(query, query_mask, doc, doc_mask)
        # ----------------------- attention --------------------#
        # ----------------------- enhancement ------------------#
        # Concatenate the encodings with their attention-enhanced versions (the "m" of the paper)
        enhanced_query = torch.cat([query, attended_query,
                                    query - attended_query,
                                    query * attended_query], dim=-1)   # (B, L, 4*H)
        enhanced_doc = torch.cat([doc, attended_doc,
                                  doc - attended_doc,
                                  doc * attended_doc], dim=-1)         # (B, R, 4*H)
        # ----------------------- enhancement ------------------#
        # ----------------------- composition ------------------#
        # Project the enhanced tensors (the F(m) of the paper)
        projected_query = self.projection(enhanced_query)   # (B, L, H)
        projected_doc = self.projection(enhanced_doc)       # (B, R, H)
        # Compose with the bidirectional RNN
        query = self.composition(projected_query)   # (B, L, H)
        doc = self.composition(projected_doc)       # (B, R, H)
        # ----------------------- composition ------------------#
        # ----------------------- pooling ----------------------#
        '''
        1. mean pooling
        2. max pooling
        3. concatenate the 4 resulting tensors
        '''
        # Some positions are PAD, so plain mean pooling would be biased;
        # invert the masks and divide by the real sentence lengths (PAD positions count as 0)
        reverse_query_mask = 1. - query_mask.float()   # (B, L)
        reverse_doc_mask = 1. - doc_mask.float()       # (B, R)
        # Mean pooling
        query_avg = torch.sum(query * reverse_query_mask.unsqueeze(2), dim=1) / (
            torch.sum(reverse_query_mask, dim=1, keepdim=True) + 1e-8)   # (B, H)
        doc_avg = torch.sum(doc * reverse_doc_mask.unsqueeze(2), dim=1) / (
            torch.sum(reverse_doc_mask, dim=1, keepdim=True) + 1e-8)     # (B, H)
        # Keep max pooling from picking PAD positions (whose values might exceed the real ones)
        query = query.masked_fill(query_mask.unsqueeze(2), -1e7)
        doc = doc.masked_fill(doc_mask.unsqueeze(2), -1e7)
        # Max pooling
        query_max, _ = query.max(dim=1)   # (B, H)
        doc_max, _ = doc.max(dim=1)       # (B, H)
        # Concatenate
        X = torch.cat([query_avg, query_max, doc_avg, doc_max], dim=-1)
        # ----------------------- pooling ----------------------#
        # ----------------------- inference --------------------#
        X = self.classification(X)
        output = self.out(X)
        # ----------------------- inference --------------------#
        return output
```
Looking at the overall scores, accuracy improves by roughly 0.02 at each step from the Siamese network to InferSent to ESIM, so on this text-similarity task: tfidf < Siamese < InferSent < ESIM. Note that the hyperparameters were barely tuned, so these accuracies are only indicative.
Category | Model | Details | Scores |
---|---|---|---|
tfidf | tfidf.py | 1. Character-count difference 2. Baidu stop-word list 3. Word-count difference after removing stop words 4. TF-IDF word-match features | bq_corpus: 0.6533 lcqmc: 0.7343 paws-x: 0.5585 score: 0.6487 |
SiamCNN | SiamCNN_LSTM.py | 1. gensim word vectors used to initialize the embedding layer of the deep model 2. Siamese CNN + linear layers | bq_corpus: 0.6849 lcqmc: 0.753 paws-x: 0.5405 score: 0.6595 |
SiamLSTM | SiamCNN_LSTM.py | 1. Same as SiamCNN with the model swapped for a Siamese LSTM + linear layers | bq_corpus: 0.6964 lcqmc: 0.77 paws-x: 0.5735 score: 0.68 |
InferSent | InferSent.py | 1. Same as SiamCNN with the model swapped for InferSent | bq_corpus: 0.7264 lcqmc: 0.778 paws-x: 0.6055 score: 0.7033 |
ESIM | ESIM.py | 1. PAD mapped to index 0 in the vocabulary 2. Same as SiamCNN with the model swapped for ESIM | bq_corpus: 0.7557 lcqmc: 0.7744 paws-x: 0.632 score: 0.7207 |
These days hardly any NLP task is solved without BERT, mainly because it gives us much better model performance. BERT can handle many NLP tasks; for this one we can use BertForSequenceClassification. And since BERT is end-to-end, I will make a series of modifications ("hacks") on top of it along the way.
Note that BERT is a large pre-trained model and is extremely GPU-hungry, so consider your own GPU environment before running it. The free GPU resources provided by Baidu AI Studio are recommended.
The individual modifications are described below:
The first step is to do nothing special and simply run BERT as-is, fully trusting it to solve the task for us. The rough flow is: read the paired data, encode each sentence pair with the tokenizer, fine-tune the sequence-classification model, and predict.
The final score is 0.8299, roughly 0.1 higher than ESIM, so even a plain BERT run clearly beats the traditional neural networks. The training part of the code is below; the rest is in the gitee repository.
```python
def train(config: Config, train_dataloader: DataLoader, dev_dataloader: DataLoader):
    # Build the model
    model = AutoModelForSequenceClassification.from_pretrained(config.model_path, num_classes=config.num_labels)
    # Optimizer
    opt = optimizer.AdamW(learning_rate=config.learning_rate, parameters=model.parameters())
    # Loss and metric
    loss_fn = nn.loss.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()
    # Training loop
    for epoch in range(config.epochs):
        model.train()
        for iter_id, mini_batch in enumerate(train_dataloader):
            input_ids = mini_batch['input_ids']
            token_type_ids = mini_batch['token_type_ids']
            attention_mask = mini_batch['attention_mask']
            labels = mini_batch['labels']
            # -------- this call raised an error (a Paddle bug, PR submitted to fix it) -------- #
            # logits = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
            # ------------------------------------ working call ------------------------------- #
            logits = model(input_ids=input_ids, token_type_ids=token_type_ids)
            # Loss
            loss = loss_fn(logits, labels)
            # Accuracy on the current batch
            probs = paddle.nn.functional.softmax(logits, axis=1)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()
            # Backward pass
            loss.backward()
            opt.step()
            opt.clear_grad()
            # Log
            if iter_id % config.print_loss == 0:
                print('epoch:{}, iter_id:{}, loss:{}, acc:{}'.format(epoch, iter_id, loss, acc))
        # Validate after each epoch
        avg_val_loss, acc = evaluation(model, loss_fn, metric, dev_dataloader)
        print('-' * 50)
        print('epoch: {}, val_loss: {}, val_acc: {}'.format(epoch, avg_val_loss, acc))
        print('-' * 50)
    return model
```
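The `evaluation` helper called above is not shown in this excerpt; here is a minimal sketch of what it might look like using the same Paddle APIs (the version in the repository may differ).
```python
# A minimal sketch of the `evaluation` helper (the repository version may differ).
def evaluation(model, loss_fn, metric, dev_dataloader):
    model.eval()
    metric.reset()
    losses = []
    with paddle.no_grad():
        for mini_batch in dev_dataloader:
            logits = model(input_ids=mini_batch['input_ids'],
                           token_type_ids=mini_batch['token_type_ids'])
            losses.append(loss_fn(logits, mini_batch['labels']).item())
            probs = paddle.nn.functional.softmax(logits, axis=1)
            metric.update(metric.compute(probs, mini_batch['labels']))
    model.train()
    return sum(losses) / len(losses), metric.accumulate()
```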
Adversarial training here means adding a small amount of noise to part of the model's parameters during training and training on the batch again, so each batch is effectively trained twice (once with the normal parameters, once with the perturbed ones). The benefit is better generalization. The idea, using FGM, is sketched below.
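The `extra_fgm.FGM` class used in the training code further down is not shown in this article. As a reference for the idea, here is a minimal FGM sketch written against PyTorch-style parameter access; the embedding parameter name `word_embeddings` is an assumption, and the Paddle version in the repository differs in API details.
```python
import torch

class FGM:
    """Fast Gradient Method: add a gradient-direction perturbation to the embedding weights."""
    def __init__(self, model, emb_name='word_embeddings'):
        self.model = model
        self.emb_name = emb_name   # substring identifying the embedding parameters (assumption)
        self.backup = {}

    def attack(self, epsilon=1.0):
        for name, param in self.model.named_parameters():
            if param.requires_grad and self.emb_name in name and param.grad is not None:
                self.backup[name] = param.data.clone()
                norm = torch.norm(param.grad)
                if norm != 0 and not torch.isnan(norm):
                    # r = epsilon * g / ||g||, added onto the embedding weights
                    param.data.add_(epsilon * param.grad / norm)

    def restore(self):
        # Put the original embedding weights back after the adversarial forward/backward pass
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}
```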
The final score is 0.8304, a small improvement over the plain BERT run (0.8299), so on this problem adversarial training does help the model. The training part of the code is below; the rest is in the gitee repository.
```python
# Training with optional FGM adversarial training
def train(config: Config, train_dataloader: DataLoader, dev_dataloader: DataLoader):
    # Build the model
    model = AutoModelForSequenceClassification.from_pretrained(config.model_path, num_classes=config.num_labels)
    # Optimizer
    opt = optimizer.AdamW(learning_rate=config.learning_rate, parameters=model.parameters())
    # Loss and metric
    loss_fn = nn.loss.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()
    # Enable adversarial training if configured
    if conf.adv == 'fgm':
        adver_method = extra_fgm.FGM(model)
    best_acc = 0
    # Training loop
    for epoch in range(config.epochs):
        model.train()
        for iter_id, mini_batch in enumerate(train_dataloader):
            input_ids = mini_batch['input_ids']
            token_type_ids = mini_batch['token_type_ids']
            attention_mask = mini_batch['attention_mask']
            labels = mini_batch['labels']
            logits = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
            # Loss
            loss = loss_fn(logits, labels)
            # Accuracy on the current batch
            probs = paddle.nn.functional.softmax(logits, axis=1)
            correct = metric.compute(probs, labels)
            metric.update(correct)
            acc = metric.accumulate()
            opt.clear_grad()
            loss.backward()
            # Adversarial step
            if conf.adv == 'fgm':
                # Forward pass on x + r, then backprop so the adversarial gradients
                # are accumulated on top of the normal ones
                adver_method.attack(epsilon=conf.eps)
                logits_adv = model(input_ids=input_ids, token_type_ids=token_type_ids,
                                   attention_mask=attention_mask)
                loss_adv = loss_fn(logits_adv, labels)
                loss_adv.backward()
                # Restore the original embeddings
                adver_method.restore()
            # Update the parameters
            opt.step()
            # Log
            if iter_id % config.print_loss == 0:
                print('epoch:{}, iter_id:{}, loss:{}, acc:{}'.format(epoch, iter_id, loss, acc))
        # Validate after each epoch
        avg_val_loss, acc = evaluation(model, loss_fn, metric, dev_dataloader)
        print('-' * 50)
        print('epoch: {}, val_loss: {}, val_acc: {}'.format(epoch, avg_val_loss, acc))
        print('-' * 50)
        # Keep the best model
        if best_acc < acc:
            best_acc = acc
            # Save the model
            model.save_pretrained('./checkpoint/' + conf.dataset + '/' + conf.model_path + conf.model_suffix)
            conf.tokenizer.save_pretrained('./checkpoint/' + conf.dataset + '/' + conf.model_path + conf.model_suffix)
    return model
```
For text data the most common augmentation is EDA: creating new sentences via synonym replacement, random insertion, deletion, and so on. I tried this kind of augmentation and it did not work well. I believe the core problem is that it distorts the training data, so the model ends up learning from distorted text.
In the end I used an "expansion" style of augmentation: assuming that if A is related to B and B is related to C, then A is also related to C. This expansion yields more training data without distorting it. The training itself is unchanged; only the augmentation changes, so only the augmentation code is shown:
```python
from collections import defaultdict

# Data augmentation: expand pairs that share the same text_a
def aug_group_by_a(df):
    aug_data = defaultdict(list)
    # Treat each distinct text_a as the pivot sentence "a"
    for g, data in df.groupby(by=['text_a']):
        if len(data) < 2:
            continue
        for i in range(len(data)):
            for j in range(i + 1, len(data)):
                # text_b and label of pair (a, b)
                row_i_text = data.iloc[i, 1]
                row_i_label = data.iloc[i, 2]
                # text_b and label of pair (a, c)
                row_j_text = data.iloc[j, 1]
                row_j_label = data.iloc[j, 2]
                # If both pairs are negative, nothing can be inferred about (b, c)
                if row_i_label == row_j_label == 0:
                    continue
                # (b, c) is positive only if both (a, b) and (a, c) are positive
                aug_label = 1 if row_i_label == row_j_label == 1 else 0
                aug_data['text_a'].append(row_i_text)
                aug_data['text_b'].append(row_j_text)
                aug_data['label'].append(aug_label)
    return pd.DataFrame(aug_data)
```
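A tiny toy example of what the function produces (the sentences are made up): two positive pairs sharing the same `text_a` yield a new positive pair, a positive/negative combination yields a negative pair, and two negative pairs are skipped.
```python
# Toy example (made-up sentences): A-B=1 and A-C=1 give (B, C, 1); 1/0 gives 0; 0/0 is skipped.
toy = pd.DataFrame({
    'text_a': ['怎么开通花呗', '怎么开通花呗', '怎么开通花呗'],
    'text_b': ['花呗如何开通', '如何打开花呗', '花呗还款日期'],
    'label':  [1, 1, 0],
})
print(aug_group_by_a(toy))
#   text_a: 花呗如何开通  text_b: 如何打开花呗  label: 1
#   text_a: 花呗如何开通  text_b: 花呗还款日期  label: 0
#   text_a: 如何打开花呗  text_b: 花呗还款日期  label: 0
```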
The final score of this experiment is 0.832, a slight improvement over adversarial training alone, which confirms that adversarial training plus data augmentation effectively improves the model.
Initially the maximum sequence length was set per single sentence, and truncation/padding was applied after concatenating the two sentences. Experiments showed a problem with this: truncation tends to keep all of sentence A while cutting off sentence B, so the model cannot weigh the two sentences properly.
To fix this I made two changes: set the length budget per sentence rather than per concatenated pair, and truncate and pad the two sentences separately (symmetrically) before feeding them to the encoder.
Part of the code is shown below:
```python
# Read and encode the data
def read_data(config: Config):
    if config.operation == 'train':
        train = pd.read_csv('data/data52714/' + config.dataset + '/train.tsv', sep='\t',
                            names=['text_a', 'text_b', 'label'])
        dev = pd.read_csv('data/data52714/' + config.dataset + '/dev.tsv', sep='\t',
                          names=['text_a', 'text_b', 'label'])
        test_size = len(dev) / (len(train) + len(dev))
        # Drop malformed labels
        if len(set(train['label'])) > 2:
            train = train[train['label'].isin(['0', '1'])]
        train['label'] = train['label'].astype('int')
        train = train.dropna()
        if len(set(dev['label'])) > 2:
            dev = dev[dev['label'].isin(['0', '1'])]
        dev['label'] = dev['label'].astype('int')
        dev = dev.dropna()
        # Merge train and dev before re-splitting
        data = pd.concat([train, dev])
        # Data augmentation to enlarge the training data
        if config.need_data_aug is True:
            aug_train = aug_group_by_a(train)
            aug_dev = aug_group_by_a(dev)
            data = pd.concat([data, aug_train, aug_dev])
        # Random split
        X = data[['text_a', 'text_b']]
        y = data['label']
        X_train, X_dev, y_train, y_dev = train_test_split(
            X, y, random_state=config.random_seed, test_size=test_size)
        X_train['label'] = y_train
        X_dev['label'] = y_dev
        # Tokenize
        tokenizer = config.tokenizer
        data_df = {'train': X_train, 'dev': X_dev}
        full_data_dict = {}
        for k, df in data_df.items():
            inputs = defaultdict(list)
            for i, row in tqdm(df.iterrows(), desc='encode {} data'.format(k), total=len(df)):
                seq_a = row[0]
                seq_b = row[1]
                label = row[2]
                inputs_dict = tokenizer.encode(seq_a, seq_b,
                                               return_special_tokens_mask=True,
                                               return_token_type_ids=True,
                                               return_attention_mask=True,
                                               max_seq_len=config.max_seq_len,
                                               pad_to_max_seq_len=True)
                inputs['input_ids'].append(inputs_dict['input_ids'])
                inputs['token_type_ids'].append(inputs_dict['token_type_ids'])
                inputs['attention_mask'].append(inputs_dict['attention_mask'])
                inputs['labels'].append(label)
            full_data_dict[k] = inputs
        return full_data_dict['train'], full_data_dict['dev']
    elif config.operation == 'predict':
        test = pd.read_csv('data/data52714/' + config.dataset + '/test.tsv', sep='\t',
                           names=['text_a', 'text_b'])
        test['label'] = 0
        # Tokenize
        tokenizer = config.tokenizer
        data_df = {'test': test}
        full_data_dict = {}
        for k, df in data_df.items():
            inputs = defaultdict(list)
            for i, row in tqdm(df.iterrows(), desc='encode {} data'.format(k), total=len(df)):
                seq_a = row[0]
                seq_b = row[1]
                label = row[2]
                inputs_dict = tokenizer.encode(seq_a, seq_b,
                                               return_special_tokens_mask=True,
                                               return_token_type_ids=True,
                                               return_attention_mask=True,
                                               max_seq_len=config.max_seq_len,
                                               pad_to_max_seq_len=True)
                inputs['input_ids'].append(inputs_dict['input_ids'])
                inputs['token_type_ids'].append(inputs_dict['token_type_ids'])
                inputs['attention_mask'].append(inputs_dict['attention_mask'])
                inputs['labels'].append(label)
            full_data_dict[k] = inputs
        return full_data_dict['test'], len(test)
    else:
        raise Exception('Invalid operation!')
```
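The "truncate each sentence separately" step itself is not visible in the excerpt above. Here is a minimal sketch of one way it could be done at the character level before calling the tokenizer; the helper name and the budget-splitting rule are assumptions, not the repository's exact implementation.
```python
# A sketch of per-sentence symmetric truncation before tokenization (hypothetical helper;
# the budget split is an assumption). It works at the character level, which is a reasonable
# approximation for Chinese BERT-style tokenizers.
def truncate_pair(seq_a, seq_b, max_seq_len):
    budget = max_seq_len - 3          # reserve [CLS] and two [SEP]
    half = budget // 2
    if len(seq_a) + len(seq_b) <= budget:
        return seq_a, seq_b
    if len(seq_a) <= half:            # the shorter side donates its unused budget
        return seq_a, seq_b[:budget - len(seq_a)]
    if len(seq_b) <= half:
        return seq_a[:budget - len(seq_b)], seq_b
    return seq_a[:half], seq_b[:budget - half]

# Usage inside the encoding loop above:
# seq_a, seq_b = truncate_pair(row[0], row[1], config.max_seq_len)
```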
Final score: 0.855, a large improvement over the previous model. This supports the guess: the model should be given as balanced information as possible from both sentences.
Finally, a simple five-fold ensemble of the model. The training loop is below, and a sketch of the prediction-averaging step follows it.
```python
# Five-fold training
def train(config: Config):
    # Cross-validated training
    for k in range(config.k_flod):
        k += 4   # fold-index offset (as in the original run)
        # Data
        train_dataloader, dev_dataloader = create_dataloader(conf)
        # Model
        model = AutoModelForSequenceClassification.from_pretrained(config.model_path, num_classes=config.num_labels)
        # Optimizer with warmup + linear decay and weight decay (bias / norm excluded)
        num_training_steps = len(train_dataloader) * config.epochs
        lr_scheduler = LinearDecayWithWarmup(config.learning_rate, num_training_steps, 0.1)
        decay_params = [
            p.name for n, p in model.named_parameters()
            if not any(nd in n for nd in ["bias", "norm"])
        ]
        opt = optimizer.AdamW(learning_rate=lr_scheduler,
                              parameters=model.parameters(),
                              weight_decay=0.01,
                              apply_decay_param_fun=lambda x: x in decay_params)
        # Loss and metric
        loss_fn = nn.loss.CrossEntropyLoss()
        metric = paddle.metric.Accuracy()
        # Enable adversarial training if configured
        if conf.adv == 'fgm':
            adver_method = extra_fgm.FGM(model)
        best_acc = 0
        # Training loop
        for epoch in range(config.epochs):
            model.train()
            for iter_id, mini_batch in enumerate(train_dataloader):
                input_ids = mini_batch['input_ids']
                token_type_ids = mini_batch['token_type_ids']
                attention_mask = mini_batch['attention_mask']
                labels = mini_batch['labels']
                logits = model(input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)
                # Loss
                loss = loss_fn(logits, labels)
                # Accuracy on the current batch
                probs = paddle.nn.functional.softmax(logits, axis=1)
                correct = metric.compute(probs, labels)
                metric.update(correct)
                acc = metric.accumulate()
                loss.backward()
                # Adversarial step
                if conf.adv == 'fgm':
                    # Forward/backward on x + r so the adversarial gradients are accumulated
                    adver_method.attack(epsilon=conf.eps)
                    logits_adv = model(input_ids=input_ids, token_type_ids=token_type_ids,
                                       attention_mask=attention_mask)
                    loss_adv = loss_fn(logits_adv, labels)
                    loss_adv.backward()
                    # Restore the original embeddings
                    adver_method.restore()
                # Update the parameters
                opt.step()
                lr_scheduler.step()
                opt.clear_grad()
                # Log
                if iter_id % config.print_loss == 0:
                    print('k:{}, epoch:{}, iter_id:{}, loss:{}, acc:{}'.format(k, epoch, iter_id, loss, acc))
            # Validate after each epoch
            avg_val_loss, avg_val_acc = evaluation(model, loss_fn, metric, dev_dataloader)
            print('-' * 50)
            print('k:{}, epoch: {}, val_loss: {}, val_acc: {}'.format(k, epoch, avg_val_loss, avg_val_acc))
            print('-' * 50)
        model.save_pretrained('./checkpoint/' + conf.dataset + '/k_flod/' + conf.model_path + '_' + str(k))
        conf.tokenizer.save_pretrained('./checkpoint/' + conf.dataset + '/k_flod/' + conf.model_path + '_' + str(k))
    return model
```
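The fusion step itself is not shown above. Here is a minimal sketch of averaging the folds' probabilities at prediction time; the checkpoint paths follow the training code, while `test_dataloader`, `fold_ids`, and the AutoModel import are assumptions.
```python
# A sketch of k-fold probability averaging (paths follow the training code above;
# `test_dataloader`, `fold_ids`, and the import are assumptions).
import numpy as np
from paddlenlp.transformers import AutoModelForSequenceClassification

def predict_kfold(conf, test_dataloader, fold_ids):
    fold_probs = []
    for k in fold_ids:   # the ids used when the checkpoints were saved
        path = './checkpoint/' + conf.dataset + '/k_flod/' + conf.model_path + '_' + str(k)
        model = AutoModelForSequenceClassification.from_pretrained(path)
        model.eval()
        probs = []
        with paddle.no_grad():
            for batch in test_dataloader:
                logits = model(input_ids=batch['input_ids'], token_type_ids=batch['token_type_ids'])
                probs.append(paddle.nn.functional.softmax(logits, axis=1).numpy())
        fold_probs.append(np.concatenate(probs, axis=0))
    # Average the class probabilities over folds and take the argmax as the final label
    return np.mean(fold_probs, axis=0).argmax(axis=1)
```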
Final score: 0.864, ranked 38th. Further improvement would require a great deal more research and training time, so the study of this competition stops here.
Summing up the whole process, the final solution is Ernie-Gram fine-tuned with FGM adversarial training, with the two sentences truncated and padded separately, ensembled with five-fold fusion.
The final scores are as follows:
Category | Model | Details | Scores |
---|---|---|---|
Plain Ernie-Gram | PaddleBERT.py | 1. Ernie-Gram run as-is | bq_corpus: 0.8412 lcqmc: 0.8639 paws-x: 0.7845 score: 0.8299 |
Plain BERT + data augmentation | TorchBERT | 1. chinese-bert-wwm-ext run as-is 2. With data augmentation | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.7495 score: 0.8112 |
Adversarial training | TorchBERTFGM | 1. Adversarial training added on top of the augmented version | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.76 score: 0.8147 |
Adversarial training | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.786 score: 0.8304 |
Adversarial training + data augmentation | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training 2. Plus data augmentation | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.791 score: 0.832 |
Per-dataset settings | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram-FGM trained with separate settings for each of the three datasets 2. Smaller batch size and learning rate for paws-x | bq_corpus: 0.8363 lcqmc: paws-x: 0.8605 score: |
BERT + head | Ernie-Gram-FGM-Header.ipynb | 1. A three-layer linear head added on top of Ernie-Gram-FGM | bq_corpus: 0.8363 lcqmc: 0.8596 paws-x: 0.852 score: 0.8493 |
Separate padding | Ernie-Gram-分开填充.ipynb | 1. Based on Ernie-Gram-FGM 2. The two sentences are padded separately (symmetrically) at the encoding stage | bq_corpus: 0.8353 lcqmc: 0.8696 paws-x: 0.86 score: 0.855 |
5-fold | 多模型五折.ipynb | 1. Separate padding 2. Five-fold fusion | bq_corpus: 0.848 lcqmc: 0.875 paws-x: 0.869 score: 0.864 |
Across the whole pipeline, the models used and their scores are listed below (this is simply the merge of the tables above):
Category | Model | Details | Scores |
---|---|---|---|
tfidf | tfidf.py | 1. Character-count difference 2. Baidu stop-word list 3. Word-count difference after removing stop words 4. TF-IDF word-match features | bq_corpus: 0.6533 lcqmc: 0.7343 paws-x: 0.5585 score: 0.6487 |
SiamCNN | SiamCNN_LSTM.py | 1. gensim word vectors used to initialize the embedding layer of the deep model 2. Siamese CNN + linear layers | bq_corpus: 0.6849 lcqmc: 0.753 paws-x: 0.5405 score: 0.6595 |
SiamLSTM | SiamCNN_LSTM.py | 1. Same as SiamCNN with the model swapped for a Siamese LSTM + linear layers | bq_corpus: 0.6964 lcqmc: 0.77 paws-x: 0.5735 score: 0.68 |
InferSent | InferSent.py | 1. Same as SiamCNN with the model swapped for InferSent | bq_corpus: 0.7264 lcqmc: 0.778 paws-x: 0.6055 score: 0.7033 |
ESIM | ESIM.py | 1. PAD mapped to index 0 in the vocabulary 2. Same as SiamCNN with the model swapped for ESIM | bq_corpus: 0.7557 lcqmc: 0.7744 paws-x: 0.632 score: 0.7207 |
Plain Ernie-Gram | PaddleBERT.py | 1. Ernie-Gram run as-is | bq_corpus: 0.8412 lcqmc: 0.8639 paws-x: 0.7845 score: 0.8299 |
Plain BERT + data augmentation | TorchBERT | 1. chinese-bert-wwm-ext run as-is 2. With data augmentation | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.7495 score: 0.8112 |
Adversarial training | TorchBERTFGM | 1. Adversarial training added on top of the augmented version | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.76 score: 0.8147 |
Adversarial training | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.786 score: 0.8304 |
Adversarial training + data augmentation | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram with adversarial training 2. Plus data augmentation | bq_corpus: 0.8227 lcqmc: 0.8614 paws-x: 0.791 score: 0.832 |
Per-dataset settings | Ernie-Gram-FGM.ipynb | 1. Ernie-Gram-FGM trained with separate settings for each of the three datasets 2. Smaller batch size and learning rate for paws-x | bq_corpus: 0.8363 lcqmc: paws-x: 0.8605 score: |
BERT + head | Ernie-Gram-FGM-Header.ipynb | 1. A three-layer linear head added on top of Ernie-Gram-FGM | bq_corpus: 0.8363 lcqmc: 0.8596 paws-x: 0.852 score: 0.8493 |
Separate padding | Ernie-Gram-分开填充.ipynb | 1. Based on Ernie-Gram-FGM 2. The two sentences are padded separately (symmetrically) at the encoding stage | bq_corpus: 0.8353 lcqmc: 0.8696 paws-x: 0.86 score: 0.855 |
5-fold | 多模型五折.ipynb | 1. Separate padding 2. Five-fold fusion | bq_corpus: 0.848 lcqmc: 0.875 paws-x: 0.869 score: 0.864 |
For reasons of space, only the core code is given in this article; the complete code is in my gitee repository. Even so, there is still plenty of room for improvement: