
Kaggle competition practice: text classification with Keras, based on Yoon Kim's "Convolutional Neural Networks for Sentence Classification"


Competition link

Paper link: "Convolutional Neural Networks for Sentence Classification"

Yoon Kim, "Convolutional Neural Networks for Sentence Classification" (EMNLP 2014).

(Figure: the CNN sentence-classification architecture from Kim's paper.)

The figure above is the classic convolutional neural network model for sentence classification, so I implemented that model in Keras.

There are also some other tutorials that walk through examples of using convolutional neural networks.


The competition task is to classify text into one of five classes.

Preprocessing uses nltk, gensim, and pandas for the data handling.

pandas reads the tab-separated csv/tsv text files, after which columns and rows can be accessed by name and index.
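A minimal sketch of that access pattern (the column names shown are what I'd expect from the competition data, e.g. Phrase and Sentiment):

import pandas as pd

# read the tab-separated training file; header=0 takes column names from the first row
train_df = pd.read_csv('./data/train.tsv', sep='\t', header=0)
print(train_df.columns.tolist())   # e.g. ['PhraseId', 'SentenceId', 'Phrase', 'Sentiment']
print(train_df['Phrase'][0])       # access a phrase by column name and row index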


nltk is a very useful library for natural language processing; after installing it with pip you need to run nltk.download() to fetch its data packages.
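A minimal setup sketch; rather than the interactive nltk.download() menu, the two data packages this code actually relies on can be fetched directly:

import nltk

nltk.download('stopwords')   # stop-word lists used by stopwords.words('english')
nltk.download('punkt')       # tokenizer models used by word_tokenize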

The stop-word list (extended with punctuation marks) is used to filter the tokens produced by word_tokenize, and SnowballStemmer then reduces each remaining word to its stem.
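Roughly, for one sentence the pipeline behaves like this (the example sentence and stems are only illustrative; exact output depends on the nltk version):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer

stop_words = set(stopwords.words('english'))
stop_words.update(['.', ',', '"', "'", ':', ';', '(', ')', '[', ']', '{', '}'])
stemmer = SnowballStemmer('english')

tokens = word_tokenize("A series of escapades demonstrating the adage .")
filtered = [w for w in tokens if w not in stop_words]   # drops 'of', 'the', '.'
stemmed = [stemmer.stem(w) for w in filtered]           # e.g. 'demonstrating' -> 'demonstr'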

The code also uses Keras's sequence preprocessing (pad_sequences), which automatically truncates texts longer than a given length and zero-pads shorter ones.
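A small sketch of that behaviour (numbers made up):

from keras.preprocessing import sequence

seqs = [[3, 8, 5], [7, 2, 9, 4, 1, 6, 6, 2]]
padded = sequence.pad_sequences(seqs, maxlen=5)
# the short sequence is left-padded with zeros and the long one is truncated
# (from the front by default), so padded.shape == (2, 5)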

The new Embedding layer is then used to map each word to an n-dimensional vector; after it, a sentence of length l becomes an l x n matrix, one row per word.
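As a rough illustration of the shapes (vocabulary size and dimensions are made up here):

from keras.layers import Input, Embedding

inp = Input(shape=(10,))                              # token-id sequences of length l = 10
emb = Embedding(input_dim=5000, output_dim=128)(inp)
# emb has shape (batch, 10, 128): each of the 10 words becomes a 128-dimensional row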

A 1D convolution is then applied along the sequence. Note that each filter is really a two-dimensional kernel of width n (spanning the whole embedding dimension), just as a "2D" convolution on images is really a three-dimensional kernel once the channel dimension is counted.
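A quick shape check of that 1D convolution (again with made-up numbers):

from keras.layers import Input, Embedding, Conv1D

inp = Input(shape=(10,))
emb = Embedding(input_dim=5000, output_dim=128)(inp)   # (batch, 10, 128)
conv = Conv1D(32, 3)(emb)                              # (batch, 10 - 3 + 1, 32) = (batch, 8, 32)
# each of the 32 filters is effectively a 3 x 128 kernel that slides only along
# the sentence axis, which is why Keras calls it a 1D convolution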

The code is below and is not hard to follow. The commented-out part is an LSTM implementation whose results are about the same; both reach roughly 0.62 accuracy in the competition.

import numpy as np
import pandas as pd
from gensim import corpora
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
import keras
from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.models import Model
#from keras.layers import Dense, Activation, Convolution2D, MaxPooling2D, Flatten
from keras.layers import *
from keras.optimizers import Adam
from keras import callbacks
from keras import backend as K
from keras import metrics
from keras import regularizers

np.random.seed(0)

if __name__ == "__main__":
    # load data
    train_df = pd.read_csv('./data/train.tsv', sep='\t', header=0)
    test_df = pd.read_csv('./data/test.tsv', sep='\t', header=0)
    raw_docs_train = train_df['Phrase'].values
    raw_docs_test = test_df['Phrase'].values
    sentiment_train = train_df['Sentiment'].values
    num_labels = len(np.unique(sentiment_train))

    # text pre-processing: tokenize, drop stop words / punctuation, stem
    stop_words = set(stopwords.words('english'))
    stop_words.update(['.', ',', '"', "'", ':', ';', '(', ')', '[', ']', '{', '}'])
    stemmer = SnowballStemmer('english')
    print stemmer
    print "pre-processing train docs..."
    processed_docs_train = []
    for doc in raw_docs_train:
        tokens = word_tokenize(doc)
        filtered = [word for word in tokens if word not in stop_words]
        stemmed = [stemmer.stem(word) for word in filtered]
        processed_docs_train.append(stemmed)

    print "pre-processing test docs..."
    processed_docs_test = []
    for doc in raw_docs_test:
        tokens = word_tokenize(doc)
        filtered = [word for word in tokens if word not in stop_words]
        stemmed = [stemmer.stem(word) for word in filtered]
        processed_docs_test.append(stemmed)

    print len(processed_docs_train), len(processed_docs_test)

    # build one dictionary over train + test so both share the same token ids
    processed_docs_all = np.concatenate((processed_docs_train, processed_docs_test), axis=0)
    print len(processed_docs_all)
    dictionary = corpora.Dictionary(processed_docs_all)
    dictionary_size = len(dictionary.keys())
    print "dictionary size: ", dictionary_size
    #dictionary.save('dictionary.dict')
    #corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

    print "converting to token ids..."
    word_id_train, word_id_len = [], []
    for doc in processed_docs_train:
        word_ids = [dictionary.token2id[word] for word in doc]
        word_id_train.append(word_ids)
        word_id_len.append(len(word_ids))

    word_id_test, word_ids = [], []
    for doc in processed_docs_test:
        word_ids = [dictionary.token2id[word] for word in doc]
        word_id_test.append(word_ids)
        word_id_len.append(len(word_ids))

    # sequence length = mean + 2 * std of the document lengths
    seq_len = np.round((np.mean(word_id_len) + 2 * np.std(word_id_len))).astype(int)
    print seq_len, np.mean(word_id_len), 2 * np.std(word_id_len)

    # pad / truncate sequences to seq_len and one-hot encode the labels
    word_id_train = sequence.pad_sequences(np.array(word_id_train), maxlen=seq_len)
    word_id_test = sequence.pad_sequences(np.array(word_id_test), maxlen=seq_len)
    y_train_enc = np_utils.to_categorical(sentiment_train, num_labels)

    # #LSTM
    # print "fitting LSTM ..."
    # # model = Sequential()
    # # model.add(Embedding(dictionary_size, 128, dropout=0.2))
    # # model.add(LSTM(128, dropout_W=0.2, dropout_U=0.2))
    # # model.add(Dense(num_labels))
    # # model.add(Activation('softmax'))
    # seq_len = 12
    # dictionary_size = 10000
    # num_labels = 10

    # CNN: parallel Conv1D branches with kernel sizes 2..6, each max-pooled over time
    myInput = Input(shape=(seq_len,))
    print myInput.shape
    WORD_VECSIZE = 128
    x = Embedding(output_dim=WORD_VECSIZE, input_dim=dictionary_size, dropout=0.2)(myInput)
    print x.shape
    filterNum = 64
    b = Conv1D(filterNum/2, 2)(x)
    c = Conv1D(filterNum/4, 3)(x)
    d = Conv1D(filterNum/4, 4)(x)
    e = Conv1D(filterNum/4, 5)(x)
    f = Conv1D(filterNum/8, 6)(x)
    # b = Conv1D(filterNum/8, 2)(x)
    # c = Conv1D(filterNum/4, 3)(x)
    # d = Conv1D(filterNum/4, 4)(x)
    # e = Conv1D(filterNum/2, 5)(x)
    # f = Conv1D(filterNum, 6)(x)
    ba = Activation('relu')(b)
    ca = Activation('relu')(c)
    da = Activation('relu')(d)
    ea = Activation('relu')(e)
    fa = Activation('relu')(f)
    print ba.shape, fa.shape
    # max-pool each feature map over its whole length (seq_len - kernel_size + 1)
    b2 = MaxPooling1D(pool_size=(seq_len - 1))(ba)
    c2 = MaxPooling1D(pool_size=(seq_len - 2))(ca)
    d2 = MaxPooling1D(pool_size=(seq_len - 3))(da)
    e2 = MaxPooling1D(pool_size=(seq_len - 4))(ea)
    f2 = MaxPooling1D(pool_size=(seq_len - 5))(fa)
    fb = Flatten()(b2)
    fc = Flatten()(c2)
    fd = Flatten()(d2)
    fe = Flatten()(e2)
    ff = Flatten()(f2)
    all_flatten = concatenate([fb, fc, fd, fe, ff])
    # flatten = Flatten()(all_pool)
    dp = Dropout(0.5)(all_flatten)
    # fc1 = Dense(64, activation='relu')(dp)
    # dp2 = Dropout(0.5)(fc1)
    out = Dense(num_labels, activation='softmax', kernel_regularizer=regularizers.l2(0.005))(dp)
    # out = Dense(NUM_CLASS, activation='softmax')(dp)
    model = Model(inputs=myInput, outputs=out)
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  # metrics=['accuracy', metrics.categorical_accuracy])
                  metrics=['accuracy'])
    model.fit(word_id_train, y_train_enc, nb_epoch=5, batch_size=256, verbose=1)

    test_pred = model.predict(word_id_test)
    test_pred = test_pred.tolist()
    test_label = [i.index(max(i)) for i in test_pred]

    # make a submission
    test_df['Sentiment'] = np.array(test_label).reshape(-1, 1)
    header = ['PhraseId', 'Sentiment']
    test_df.to_csv('./lstm_sentiment.csv', columns=header, index=False, header=True)

Later I also modified the model following "A C-LSTM Neural Network for Text Classification" (arXiv preprint), adding an LSTM after the CNN; the result was about the same as before, improving by only 0.001.

The model is almost identical; it just adds one LSTM layer.

myInput = Input(shape=(seq_len,))
print myInput.shape
WORD_VECSIZE = 128
x = Embedding(output_dim=WORD_VECSIZE, input_dim=dictionary_size)(myInput)
print x.shape
filterNum = 128
b = Conv1D(filterNum/2, 2)(x)
c = Conv1D(filterNum/2, 3)(x)
d = Conv1D(filterNum, 4)(x)
e = Conv1D(filterNum, 5)(x)
f = Conv1D(filterNum, 6)(x)
# b = Conv1D(filterNum/8, 2)(x)
# c = Conv1D(filterNum/4, 3)(x)
# d = Conv1D(filterNum/4, 4)(x)
# e = Conv1D(filterNum/2, 5)(x)
# f = Conv1D(filterNum, 6)(x)
ba = Activation('relu')(b)
ca = Activation('relu')(c)
da = Activation('relu')(d)
ea = Activation('relu')(e)
fa = Activation('relu')(f)
b2 = MaxPooling1D(pool_size=(seq_len - 1))(ba)
c2 = MaxPooling1D(pool_size=(seq_len - 2))(ca)
d2 = MaxPooling1D(pool_size=(seq_len - 3))(da)
e2 = MaxPooling1D(pool_size=(seq_len - 4))(ea)
f2 = MaxPooling1D(pool_size=(seq_len - 5))(fa)
print b2.shape, f2.shape
# each pooled branch has length 1, so the channel-wise concatenation is a
# single-timestep sequence fed to the LSTM
all_pool = concatenate([b2, c2, d2, e2, f2])
# flatten = Flatten()(all_pool)
# print all_pool.shape
# res = Reshape(1)
# print type(res), type(all_flatten)
lstm = LSTM(128, return_sequences=True)(all_pool)
print lstm.shape
flatten = Flatten()(lstm)
dp = Dropout(0.5)(flatten)
# fc1 = Dense(64, activation='relu')(dp)
# dp2 = Dropout(0.5)(fc1)
out = Dense(num_labels, activation='softmax', kernel_regularizer=regularizers.l2(0.005))(dp)
# out = Dense(NUM_CLASS, activation='softmax')(dp)
model = Model(inputs=myInput, outputs=out)
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              # metrics=['accuracy', metrics.categorical_accuracy])
              metrics=['accuracy'])
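Training and prediction are then the same as for the first model, roughly:

model.fit(word_id_train, y_train_enc, nb_epoch=5, batch_size=256, verbose=1)
test_pred = model.predict(word_id_test)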

