
A Summary of Text Similarity and Text Matching Models (with Keras Code)


1. Summary of text similarity and text matching models

DSSM explained

ESIM explained

ABCNN explained

BiMPM explained

DIIN explained

DRCN explained

    https://blog.csdn.net/u012526436/article/details/90179466

2. Methods for computing short-text similarity

https://blog.csdn.net/baidu_26550817/article/details/80171532
Longest common subsequence
Edit distance
Number of shared words / sequence length
word2vec + cosine similarity
Sentence2Vector
https://blog.csdn.net/qjzcy/article/details/51882959?spm=0.0.0.0.zFx7Qk
DSSM (deep structured semantic models) (BOW/CNN/RNN)
https://www.cnblogs.com/qniguoym/p/7772561.html

lstm+topic
https://blog.csdn.net/qjzcy/article/details/52269382

Example from Baidu AI:
http://ai.baidu.com/tech/nlp/simnet 
http://ai.baidu.com/docs#/NLP-API/c150c35a
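
The first three measures in the list above are simple enough to sketch in plain Python. The following is a minimal illustration, not code from any of the linked articles. The function names are made up here, and dividing the shared-token count by the longer sequence length is an assumption, since the list does not say which length is meant.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edit_distance(a, b):
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,           # delete x
                            curr[j - 1] + 1,       # insert y
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def shared_token_ratio(a, b):
    """Distinct shared tokens divided by the longer sequence length (assumed)."""
    if not a or not b:
        return 0.0
    return len(set(a) & set(b)) / max(len(a), len(b))


s1 = '你在干嘛呢'
s2 = '你在干什么呢'
print(lcs_length(s1, s2))          # 4
print(edit_distance(s1, s2))       # 2
print(shared_token_ratio(s1, s2))  # 4 shared characters out of 6
```

Each function treats a string as a sequence of characters, which suits Chinese text without word segmentation; passing lists of words works the same way.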

3. Text classification

Naive Bayes
Support vector machine (SVM)
Logistic regression
http://sklearn.apachecn.org/cn/0.19.0/auto_examples/text/document_classification_20newsgroups.html#sphx-glr-auto-examples-text-document-classification-20newsgroups-py
fastText
BiLSTM
CNN
RCNN
https://github.com/keras-team/keras/tree/master/examples
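
As a rough illustration of the first classifier in this list, here is a tiny multinomial naive Bayes text classifier in plain Python. It is a sketch under simple assumptions (whitespace tokenization, add-one smoothing, a made-up four-sentence training set), not the scikit-learn example linked above. The class name TinyNaiveBayes and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, text):
        """Log prior plus summed log likelihoods for every class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for token in text.split():
                if token in self.vocab:  # skip words never seen in training
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy training data, invented for this example
train_texts = [
    'the team won the match',
    'a great goal in the final game',
    'the new phone has a fast chip',
    'this laptop ships with a better screen',
]
train_labels = ['sports', 'sports', 'tech', 'tech']

clf = TinyNaiveBayes().fit(train_texts, train_labels)
print(clf.predict('the final match'))  # sports
print(clf.predict('a fast laptop'))    # tech
```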


4. Sequence labeling

HMM
CRF
LSTM+CRF
seq2seq
seq2seq+attention
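
For the HMM entry in this list, the most likely label sequence is typically decoded with the Viterbi algorithm. The sketch below is a generic plain-Python version run on a made-up toy model (two hidden states, three observation symbols). The probabilities and names are illustrative assumptions, not values from this article.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, probability) for an HMM and an observation sequence."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev_state = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = (best[t - 1][prev_state] * trans_p[prev_state][s]
                          * emit_p[s][observations[t]])
            back[t][s] = prev_state
    # Trace back from the most probable final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model, invented for this example
states = ['Healthy', 'Fever']
start_p = {'Healthy': 0.6, 'Fever': 0.4}
trans_p = {
    'Healthy': {'Healthy': 0.7, 'Fever': 0.3},
    'Fever': {'Healthy': 0.4, 'Fever': 0.6},
}
emit_p = {
    'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    'Fever': {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
}

path, prob = viterbi(['normal', 'cold', 'dizzy'], states, start_p, trans_p, emit_p)
print(path)  # ['Healthy', 'Healthy', 'Fever']
```

In a real tagger the states would be labels such as part-of-speech or entity tags, the observations would be words, and the products would usually be replaced by sums of log probabilities to avoid underflow on long sequences.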

5. Keras implementations of selected models

1. LSTM for text similarity:

    from keras.layers import (Input, Embedding, LSTM, Dense, Dropout,
                              BatchNormalization, concatenate)
    from keras.models import Model

    def get_model(nb_words, EMBEDDING_DIM, embedding_matrix, MAX_SEQUENCE_LENGTH,
                  num_lstm, rate_drop_lstm, rate_drop_dense, num_dense, act):
        sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        # embedding: frozen pretrained vectors, shared by both inputs
        embedding_layer = Embedding(nb_words,
                                    EMBEDDING_DIM,
                                    weights=[embedding_matrix],
                                    input_length=MAX_SEQUENCE_LENGTH,
                                    trainable=False)
        embedded_sequences_1 = embedding_layer(sequence_1_input)
        embedded_sequences_2 = embedding_layer(sequence_2_input)
        # lstm: one encoder applied to both sentences (shared weights)
        lstm_layer = LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm)
        x1 = lstm_layer(embedded_sequences_1)
        y1 = lstm_layer(embedded_sequences_2)
        # classifier
        merged = concatenate([x1, y1])
        merged = Dropout(rate_drop_dense)(merged)
        merged = BatchNormalization()(merged)
        merged = Dense(num_dense, activation=act)(merged)
        merged = Dropout(rate_drop_dense)(merged)
        merged = BatchNormalization()(merged)
        preds = Dense(1, activation='sigmoid')(merged)
        model = Model(inputs=[sequence_1_input, sequence_2_input],
                      outputs=preds)
        model.compile(loss='binary_crossentropy',
                      optimizer='nadam',
                      metrics=['acc'])
        model.summary()
        return model

2. BiLSTM for text similarity

    from keras.layers import (Input, Embedding, LSTM, Bidirectional, Dense, Dropout,
                              BatchNormalization, concatenate)
    from keras.models import Model

    def get_model(nb_words, EMBEDDING_DIM, embedding_matrix, MAX_SEQUENCE_LENGTH,
                  num_lstm, rate_drop_lstm, rate_drop_dense, num_dense, act):
        # embedding: frozen pretrained vectors, shared by both inputs
        embedding_layer = Embedding(nb_words,
                                    EMBEDDING_DIM,
                                    weights=[embedding_matrix],
                                    input_length=MAX_SEQUENCE_LENGTH,
                                    trainable=False)
        # bidirectional encoder applied to both sentences (shared weights)
        lstm_layer = Bidirectional(LSTM(num_lstm, dropout=rate_drop_lstm, recurrent_dropout=rate_drop_lstm))
        sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        embedded_sequences_1 = embedding_layer(sequence_1_input)
        x1 = lstm_layer(embedded_sequences_1)
        sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        embedded_sequences_2 = embedding_layer(sequence_2_input)
        y1 = lstm_layer(embedded_sequences_2)
        # classifier
        merged = concatenate([x1, y1])
        merged = Dropout(rate_drop_dense)(merged)
        merged = BatchNormalization()(merged)
        merged = Dense(num_dense, activation=act)(merged)
        merged = Dropout(rate_drop_dense)(merged)
        merged = BatchNormalization()(merged)
        preds = Dense(1, activation='sigmoid')(merged)
        model = Model(inputs=[sequence_1_input, sequence_2_input],
                      outputs=preds)
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['acc'])
        model.summary()
        return model

3. ESIM for text similarity

    from keras.layers import (Input, LSTM, Bidirectional, Dense, Dropout, Concatenate,
                              BatchNormalization, GlobalAvgPool1D, GlobalMaxPool1D)
    from keras.models import Model
    from keras.optimizers import Adam

    # create_pretrained_embedding, soft_attention_alignment, submult and
    # apply_multiple are helper functions that the article does not show.

    def get_model(embedding_matrix_file, MAX_SEQUENCE_LENGTH, num_lstm, rate_drop_dense, num_dense):
        sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        # embedding
        embedding_layer = create_pretrained_embedding(embedding_matrix_file, mask_zero=False)
        bn = BatchNormalization(axis=2)
        embedded_sequences_1 = bn(embedding_layer(sequence_1_input))
        embedded_sequences_2 = bn(embedding_layer(sequence_2_input))
        # encode
        encode = Bidirectional(LSTM(num_lstm, return_sequences=True))
        encode_sequences_1 = encode(embedded_sequences_1)
        encode_sequences_2 = encode(embedded_sequences_2)
        # attention
        alignd_sequences_1, alignd_sequences_2 = soft_attention_alignment(encode_sequences_1, encode_sequences_2)
        # compose
        combined_sequences_1 = Concatenate()(
            [encode_sequences_1, alignd_sequences_2, submult(encode_sequences_1, alignd_sequences_2)])
        combined_sequences_2 = Concatenate()(
            [encode_sequences_2, alignd_sequences_1, submult(encode_sequences_2, alignd_sequences_1)])
        compose = Bidirectional(LSTM(num_lstm, return_sequences=True))
        compare_sequences_1 = compose(combined_sequences_1)
        compare_sequences_2 = compose(combined_sequences_2)
        # aggregate
        rep_sequences_1 = apply_multiple(compare_sequences_1, [GlobalAvgPool1D(), GlobalMaxPool1D()])
        rep_sequences_2 = apply_multiple(compare_sequences_2, [GlobalAvgPool1D(), GlobalMaxPool1D()])
        # classifier
        merged = Concatenate()([rep_sequences_1, rep_sequences_2])
        dense = BatchNormalization()(merged)
        dense = Dense(num_dense, activation='elu')(dense)
        dense = BatchNormalization()(dense)
        dense = Dropout(rate_drop_dense)(dense)
        dense = Dense(num_dense, activation='elu')(dense)
        dense = BatchNormalization()(dense)
        dense = Dropout(rate_drop_dense)(dense)
        out_ = Dense(1, activation='sigmoid')(dense)
        model = Model(inputs=[sequence_1_input, sequence_2_input], outputs=out_)
        # Newer Keras versions name this argument learning_rate instead of lr
        model.compile(optimizer=Adam(lr=1e-3), loss='binary_crossentropy', metrics=['binary_crossentropy', 'accuracy'])
        return model
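
The ESIM listing above calls helpers such as soft_attention_alignment and submult without defining them. Assuming they follow the usual ESIM formulation (dot-product attention between the two encoded sequences, then a difference and an element-wise product as comparison features), the computation for a single sentence pair can be sketched in plain Python as below. This illustrates the idea on nested lists; it is not the Keras helper code used by the article, and every function body here is an assumption.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def weighted_sum(weights, vectors):
    """Sum of the vectors, each scaled by its matching weight."""
    dim = len(vectors[0])
    return [sum(w * vec[k] for w, vec in zip(weights, vectors)) for k in range(dim)]


def soft_align(a, b):
    """Soft-align two sequences of vectors.

    Returns (a_aligned, b_aligned): a_aligned holds one vector per position
    of b, built from a; b_aligned holds one vector per position of a, built
    from b.
    """
    scores = [[dot(u, v) for v in b] for u in a]           # len(a) x len(b)
    b_aligned = [weighted_sum(softmax(row), b) for row in scores]
    columns = [list(col) for col in zip(*scores)]           # len(b) x len(a)
    a_aligned = [weighted_sum(softmax(col), a) for col in columns]
    return a_aligned, b_aligned


def sub_mult(u, v):
    """Comparison features: element-wise difference, then element-wise product."""
    return [x - y for x, y in zip(u, v)] + [x * y for x, y in zip(u, v)]


# Toy "encoded" sentences: 3 and 2 positions, 2 dimensions each
enc_1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
enc_2 = [[2.0, 0.0], [0.0, 2.0]]

aligned_1, aligned_2 = soft_align(enc_1, enc_2)
# Same composition as combined_sequences_1 in the listing:
# [encoding, aligned counterpart, difference, product] at every position
combined_1 = [u + v + sub_mult(u, v) for u, v in zip(enc_1, aligned_2)]
print(len(combined_1), len(combined_1[0]))  # 3 8
```

Each position of sentence 1 ends up with a vector four times the encoder width, which is what the second BiLSTM in the listing then consumes.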

4. DSSM for text similarity

    from keras import backend as K
    from keras.layers import (Input, Embedding, LSTM, Bidirectional, Dense, Dropout,
                              Concatenate, Multiply, Subtract, Maximum, Lambda)
    from keras.models import Model
    from keras.optimizers import Adam

    # Attention is a custom module (not shown in the article) that provides the Attention layer.

    def get_model(embedding_matrix, nb_words, EMBEDDING_DIM, MAX_SEQUENCE_LENGTH, num_lstm, rate_drop_dense):
        att1_layer = Attention.Attention(MAX_SEQUENCE_LENGTH)
        sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')  # word-level features of encoded question 1
        sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')  # word-level features of encoded question 2
        # embedding
        embedding_layer = Embedding(nb_words,
                                    EMBEDDING_DIM,
                                    weights=[embedding_matrix],
                                    input_length=MAX_SEQUENCE_LENGTH,
                                    trainable=False)
        embedded_sequences_1 = embedding_layer(sequence_1_input)
        embedded_sequences_2 = embedding_layer(sequence_2_input)
        # encode
        lstm1_layer = Bidirectional(LSTM(num_lstm))
        encode_sequences_1 = lstm1_layer(embedded_sequences_1)
        encode_sequences_2 = lstm1_layer(embedded_sequences_2)
        # lstm
        lstm0_layer = LSTM(num_lstm, return_sequences=True)
        lstm2_layer = LSTM(num_lstm)
        v1ls = lstm2_layer(lstm0_layer(embedded_sequences_1))
        v2ls = lstm2_layer(lstm0_layer(embedded_sequences_2))
        v1 = Concatenate(axis=1)([att1_layer(embedded_sequences_1), encode_sequences_1])
        v2 = Concatenate(axis=1)([att1_layer(embedded_sequences_2), encode_sequences_2])
        # sequence_1c_input = Input(shape=(MAX_SEQUENCE_LENGTH_CHAR,), dtype='int32')  # character-level features of encoded question 1
        # sequence_2c_input = Input(shape=(MAX_SEQUENCE_LENGTH_CHAR,), dtype='int32')  # character-level features of encoded question 2
        # embedding_char_layer = Embedding(char_words,
        #                                  EMBEDDING_DIM)
        # embedded_sequences_1c = embedding_char_layer(sequence_1c_input)
        # embedded_sequences_2c = embedding_char_layer(sequence_2c_input)
        # x1c = lstm1_layer(embedded_sequences_1c)
        # x2c = lstm1_layer(embedded_sequences_2c)
        # v1c = Concatenate(axis=1)([att1_layer(embedded_sequences_1c), x1c])
        # v2c = Concatenate(axis=1)([att1_layer(embedded_sequences_2c), x2c])
        # compose
        mul = Multiply()([v1, v2])
        sub = Lambda(lambda x: K.abs(x))(Subtract()([v1, v2]))
        maximum = Maximum()([Multiply()([v1, v1]), Multiply()([v2, v2])])
        # mulc = Multiply()([v1c, v2c])
        # subc = Lambda(lambda x: K.abs(x))(Subtract()([v1c, v2c]))
        # maximumc = Maximum()([Multiply()([v1c, v1c]), Multiply()([v2c, v2c])])
        sub2 = Lambda(lambda x: K.abs(x))(Subtract()([v1ls, v2ls]))
        # matchlist = Concatenate(axis=1)([mul, sub, mulc, subc, maximum, maximumc, sub2])
        matchlist = Concatenate(axis=1)([mul, sub, maximum, sub2])
        matchlist = Dropout(rate_drop_dense)(matchlist)
        matchlist = Concatenate(axis=1)(
            [Dense(32, activation='relu')(matchlist), Dense(48, activation='sigmoid')(matchlist)])
        res = Dense(1, activation='sigmoid')(matchlist)
        # model = Model(inputs=[sequence_1_input, sequence_2_input,
        #                       sequence_1c_input, sequence_2c_input], outputs=res)
        model = Model(inputs=[sequence_1_input, sequence_2_input], outputs=res)
        # Newer Keras versions name this argument learning_rate instead of lr
        model.compile(optimizer=Adam(lr=0.001), loss="binary_crossentropy", metrics=['acc'])
        model.summary()
        return model

5. Decomposable Attention for text similarity

    from keras.layers import (Input, Dense, Dropout, Concatenate, BatchNormalization,
                              GlobalAvgPool1D, GlobalMaxPool1D)
    from keras.models import Model
    from keras.optimizers import Adam

    # create_pretrained_embedding, time_distributed, soft_attention_alignment,
    # submult and apply_multiple are helper functions that the article does not show.

    def get_model(embedding_matrix_file, MAX_SEQUENCE_LENGTH,
                  rate_drop_projction, num_projction, hidden_projction,
                  rate_drop_compare, num_compare,
                  rate_drop_dense, num_dense):
        sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        # embedding
        embedding_layer = create_pretrained_embedding(embedding_matrix_file, mask_zero=False)
        embedded_sequences_1 = embedding_layer(sequence_1_input)
        embedded_sequences_2 = embedding_layer(sequence_2_input)
        # projection (the hidden layer is optional)
        projection_layers = []
        if hidden_projction > 0:
            projection_layers.extend([
                Dense(hidden_projction, activation='elu'),
                Dropout(rate=rate_drop_projction),
            ])
        projection_layers.extend([
            Dense(num_projction, activation=None),
            Dropout(rate=rate_drop_projction),
        ])
        encode_sequences_1 = time_distributed(embedded_sequences_1, projection_layers)
        encode_sequences_2 = time_distributed(embedded_sequences_2, projection_layers)
        # attention
        alignd_sequences_1, alignd_sequences_2 = soft_attention_alignment(encode_sequences_1, encode_sequences_2)
        # compare
        combined_sequences_1 = Concatenate()(
            [encode_sequences_1, alignd_sequences_2, submult(encode_sequences_1, alignd_sequences_2)])
        combined_sequences_2 = Concatenate()(
            [encode_sequences_2, alignd_sequences_1, submult(encode_sequences_2, alignd_sequences_1)])
        compare_layers = [
            Dense(num_compare, activation='elu'),
            Dropout(rate_drop_compare),
            Dense(num_compare, activation='elu'),
            Dropout(rate_drop_compare),
        ]
        compare_sequences_1 = time_distributed(combined_sequences_1, compare_layers)
        compare_sequences_2 = time_distributed(combined_sequences_2, compare_layers)
        # aggregate
        rep_sequences_1 = apply_multiple(compare_sequences_1, [GlobalAvgPool1D(), GlobalMaxPool1D()])
        rep_sequences_2 = apply_multiple(compare_sequences_2, [GlobalAvgPool1D(), GlobalMaxPool1D()])
        # classifier
        merged = Concatenate()([rep_sequences_1, rep_sequences_2])
        dense = BatchNormalization()(merged)
        dense = Dense(num_dense, activation='elu')(dense)
        dense = Dropout(rate_drop_dense)(dense)
        dense = BatchNormalization()(dense)
        dense = Dense(num_dense, activation='elu')(dense)
        dense = Dropout(rate_drop_dense)(dense)
        out_ = Dense(1, activation='sigmoid')(dense)
        model = Model(inputs=[sequence_1_input, sequence_2_input], outputs=out_)
        # Newer Keras versions name this argument learning_rate instead of lr
        model.compile(optimizer=Adam(lr=1e-3), loss='binary_crossentropy', metrics=['binary_crossentropy', 'accuracy'])
        return model

6. A simple multi-head self-attention network for text similarity

    from keras.layers import Input, GlobalAveragePooling1D, Concatenate, Dropout, Dense
    from keras.models import Model

    # create_pretrained_embedding and the Attention module are custom code
    # that the article does not show.

    def get_model(embedding_matrix_file, MAX_SEQUENCE_LENGTH, rate_drop_dense):
        sequence_1_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        sequence_2_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
        # embedding
        embedding_layer = create_pretrained_embedding(embedding_matrix_file, mask_zero=False)
        embedded_sequences_1 = embedding_layer(sequence_1_input)
        embedded_sequences_2 = embedding_layer(sequence_2_input)
        # position embedding (disabled)
        # embedded_sequences_1 = pos_embed.Position_Embedding()(embedded_sequences_1)
        # embedded_sequences_2 = pos_embed.Position_Embedding()(embedded_sequences_2)
        # self-attention: each sentence serves as its own query, key and value
        O_seq_1 = Attention.Attention(8, 16)([embedded_sequences_1, embedded_sequences_1, embedded_sequences_1])
        O_seq_2 = Attention.Attention(8, 16)([embedded_sequences_2, embedded_sequences_2, embedded_sequences_2])
        # aggregate (ESIM-style pooling, disabled)
        # rep_sequences_1 = apply_multiple(compare_sequences_1, [GlobalAvgPool1D(), GlobalMaxPool1D()])
        # rep_sequences_2 = apply_multiple(compare_sequences_2, [GlobalAvgPool1D(), GlobalMaxPool1D()])
        rep_sequences_1 = GlobalAveragePooling1D()(O_seq_1)
        rep_sequences_2 = GlobalAveragePooling1D()(O_seq_2)
        # classifier
        merged = Concatenate()([rep_sequences_1, rep_sequences_2])
        O_seq = Dropout(rate_drop_dense)(merged)
        outputs = Dense(1, activation='sigmoid')(O_seq)
        model = Model(inputs=[sequence_1_input, sequence_2_input], outputs=outputs)
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

 

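7. TF-IDF

TF-IDF computation
This approach directly computes the similarity between the two sentence vectors in the TF-IDF matrix. In effect it takes the cosine of the angle between the two vectors, i.e. their dot product divided by the product of their norms:

cosθ = (a·b) / (|a| × |b|)
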
    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np
    from scipy.linalg import norm

    def tfidf_similarity(s1, s2):
        def add_space(s):
            return ' '.join(list(s))

        # Insert spaces between characters so that each character becomes a token
        s1, s2 = add_space(s1), add_space(s2)
        # Build the TF-IDF matrix
        cv = TfidfVectorizer(tokenizer=lambda s: s.split())
        corpus = [s1, s2]
        vectors = cv.fit_transform(corpus).toarray()
        # Cosine similarity of the two TF-IDF vectors
        return np.dot(vectors[0], vectors[1]) / (norm(vectors[0]) * norm(vectors[1]))

    s1 = '你在干嘛呢'
    s2 = '你在干什么呢'
    print(tfidf_similarity(s1, s2))

Here the vectors variable holds the TF-IDF values. Its contents are:

    [[0.         0.         0.4090901  0.4090901  0.57496187 0.4090901  0.4090901 ]
     [0.49844628 0.49844628 0.35464863 0.35464863 0.         0.35464863 0.35464863]]

The output is:

0.5803329846765686
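
The reported score can be cross-checked from the printed vectors with a few lines of plain Python. This sketch only re-applies the cosine formula to the rounded values shown above, so it agrees with the scikit-learn result up to rounding. The helper name cosine is chosen here for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Rounded TF-IDF rows copied from the output above
v1 = [0.0, 0.0, 0.4090901, 0.4090901, 0.57496187, 0.4090901, 0.4090901]
v2 = [0.49844628, 0.49844628, 0.35464863, 0.35464863, 0.0, 0.35464863, 0.35464863]
print(round(cosine(v1, v2), 4))  # 0.5803
```

Because TfidfVectorizer L2-normalizes each row by default, both norms are very close to 1 and the score is essentially the dot product of the two rows.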


Related references:

https://blog.csdn.net/qq_24140919/article/details/89469318

 
