CNN Text Classification in Practice
1: Preprocess the text data so that every sample has the same length and the same embedding dimension.
2: Build the convolutional model, paying attention to the design of the kernel sizes.
3: Pool each convolved feature map down to a single feature.
4: Concatenate the resulting features and complete the classification task.
Step 1: Import the libraries
- import warnings
- warnings.filterwarnings('ignore')
- import numpy as np
- import matplotlib.pyplot as plt
- import tensorflow as tf
- from tensorflow import keras
- from tensorflow.keras import layers
- from tensorflow.keras.preprocessing.sequence import pad_sequences
Step 2: Set the parameters and load the data
- num_features = 3000  # vocabulary size: number of distinct words kept
- sequence_length = 300  # maximum review length
- embedding_dimension = 100  # word-embedding dimension
- (x_train, y_train), (x_test, y_test) = keras.datasets.imdb.load_data(num_words=num_features)
- print(x_train.shape)
- print(y_train.shape)
- print(x_test.shape)
- print(y_test.shape)
- print(x_train[0])
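Each sample in x_train is a list of integer word indices rather than raw text. As an optional sanity check, a review can be mapped back to words with keras.datasets.imdb.get_word_index(); this is a minimal sketch, assuming the default index_from=3 offset that load_data applies (indices 0-2 are reserved for padding/start/unknown markers).
- # Optional: decode the first review back to words.
- # load_data() shifts word indices by 3 (0 = padding, 1 = start, 2 = unknown).
- word_index = keras.datasets.imdb.get_word_index()
- index_to_word = {idx + 3: word for word, idx in word_index.items()}
- index_to_word.update({0: '<pad>', 1: '<start>', 2: '<unk>'})
- print(' '.join(index_to_word.get(i, '<unk>') for i in x_train[0]))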
Step 3: Pad the text to a fixed length
- x_train = pad_sequences(x_train, maxlen=sequence_length)
- x_test = pad_sequences(x_test, maxlen=sequence_length)
- print(x_train.shape)
- print(x_test.shape)
- print(y_train.shape)
- print(y_test.shape)
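By default, pad_sequences pads and truncates at the front of each sequence ('pre'). A tiny example makes the behavior concrete (the toy sequences below are made up purely for illustration):
- # Toy illustration of the default 'pre' padding/truncating behavior
- demo = pad_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], maxlen=4)
- print(demo)
- # [[ 0  1  2  3]
- #  [ 0  0  4  5]
- #  [ 7  8  9 10]]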
Step 4: Build the convolution sub-model
- # Several kernel sizes in parallel
- filter_size = [3, 4, 5]
- def convolution():
-     inn = layers.Input(shape=(sequence_length, embedding_dimension, 1))  # 3-D per sample
-     cnns = []
-     for size in filter_size:
-         conv = layers.Conv2D(filters=64,
-                              kernel_size=(size, embedding_dimension),
-                              strides=1, padding='valid',
-                              activation='relu')(inn)
-         # Pool each convolved feature map down to a single feature
-         pool = layers.MaxPool2D(pool_size=(sequence_length - size + 1, 1), padding='valid')(conv)
-         cnns.append(pool)
-     # Concatenate the resulting features
-     outt = layers.concatenate(cnns)
-
-     model = keras.Model(inputs=inn, outputs=outt)
-     return model
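As a quick check of the shapes worked out in the calculation further below, the sub-model can be run on a dummy batch; this is an optional sketch, not part of the original tutorial:
- # Optional shape check: each branch pools to (1, 1, 64) and the three
- # branches concatenate along the channel axis to (1, 1, 192).
- dummy = tf.random.normal((2, sequence_length, embedding_dimension, 1))
- print(convolution()(dummy).shape)  # expected: (2, 1, 1, 192)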

Step 5: Build the full network
- def cnn_mulfilter():
-     model = keras.Sequential([
-         # Map each of the num_features distinct words to a word vector;
-         # this embedding matrix is itself trained with the network
-         layers.Embedding(input_dim=num_features,
-                          output_dim=embedding_dimension,
-                          input_length=sequence_length),
-         # The convolution layers expect a 3-D input per sample, so reshape
-         layers.Reshape((sequence_length, embedding_dimension, 1)),
-         convolution(),
-         layers.Flatten(),
-         layers.Dense(10, activation='relu'),
-         layers.Dropout(0.2),
-         layers.Dense(1, activation=tf.nn.sigmoid)
-     ])
-     model.compile(optimizer=tf.optimizers.Adam(),
-                   loss=tf.losses.BinaryCrossentropy(),
-                   metrics=['accuracy'])
-     return model
-
- model = cnn_mulfilter()
- model.summary()

Partial output:
- Model: "sequential"
- _________________________________________________________________
- Layer (type) Output Shape Param #
- =================================================================
- embedding_1 (Embedding) (None, 300, 100) 300000
- _________________________________________________________________
- reshape_1 (Reshape) (None, 300, 100, 1) 0
- _________________________________________________________________
- model (Functional) (None, 1, 1, 192) 76992
The calculation works as follows:
1. Embedding layer: the parameters are the entries of the trainable embedding matrix E, so the parameter count is
vocabulary size num_features * embedding dimension embedding_dimension = 3000 * 100 = 300,000.
Its output shape is (batch, sequence_length, embedding_dimension) = (None, 300, 100).
2. model (Functional) layer: after convolving with the three kernel heights 3, 4, and 5, the output heights are
- 300 - 3 + 1 = 298, 300 - 4 + 1 = 297, 300 - 5 + 1 = 296
-
- i.e. the feature maps are (298, 1, 64), (297, 1, 64), (296, 1, 64).
-
- After max pooling each of these becomes
-
- (1, 1, 64), (1, 1, 64), (1, 1, 64)
-
- and after concatenation
-
- (1, 1, 64 * 3) = (1, 1, 192).
So the output shape is (batch, 1, 1, 192) = (None, 1, 1, 192),
and the parameter count is (3 * 100 + 1) * 64 + (4 * 100 + 1) * 64 + (5 * 100 + 1) * 64 = 76,992.
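Both counts can be verified with a couple of lines of plain Python (a small check, not part of the original tutorial):
- # Quick arithmetic check of the parameter counts reported by model.summary()
- print(num_features * embedding_dimension)                           # 300000
- print(sum((k * embedding_dimension + 1) * 64 for k in (3, 4, 5)))   # 76992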
Step 6: Train and evaluate the model
- history = model.fit(x_train, y_train, batch_size=64, epochs=5, validation_split=0.1)
-
- model.evaluate(x_test,y_test)
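The final Dense layer outputs a sigmoid probability of the positive class, so individual reviews can be classified by thresholding at 0.5. A minimal sketch (the choice of the first three test reviews is arbitrary):
- # Optional: predict on a few test reviews and compare with the true labels
- probs = model.predict(x_test[:3])
- print(probs.ravel(), (probs.ravel() > 0.5).astype(int), y_test[:3])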
Step 7: Plot the training curves
- plt.plot(history.history['accuracy'])
- plt.plot(history.history['val_accuracy'])
- plt.legend(['training', 'validation'], loc='upper left')
- plt.show()
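history.history also records the loss values under 'loss' and 'val_loss', so the loss curves can be plotted the same way (an optional addition):
- # Optional: plot the training/validation loss as well
- plt.plot(history.history['loss'])
- plt.plot(history.history['val_loss'])
- plt.legend(['training loss', 'validation loss'], loc='upper right')
- plt.show()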