This section uses the classic IMDB movie-review dataset for a sentiment-classification task. The IMDB dataset contains 50,000 user reviews labeled as negative or positive: reviews with an IMDB rating < 5 are labeled 0 (negative), and reviews with a rating ≥ 7 are labeled 1 (positive). 25,000 reviews are used for the training set and 25,000 for the test set.
The IMDB dataset can be loaded directly through the datasets utility provided by Keras:
import os
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, losses, optimizers, Sequential
from tensorflow.keras.datasets import imdb
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # suppress TensorFlow info logs
batchsz = 128  # batch size
total_words = 10000  # vocabulary size N_vocab
max_review_len = 80  # maximum sentence length s; longer sentences are truncated, shorter ones padded
embedding_len = 100  # word-vector feature length f
# Load the IMDB dataset; the data are digit-encoded, one number per word
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=total_words)
# Print the shapes of the inputs and the labels
print(x_train.shape, len(x_train[0]), y_train.shape)
print(x_test.shape, len(x_test[0]), y_test.shape)
The output is as follows:
(25000,) 218 (25000,)
(25000,) 68 (25000,)
As you can see, x_train and x_test are one-dimensional arrays of length 25,000, and each element is a variable-length list holding one digit-encoded sentence. For example, the first training sentence contains 218 words and the first test sentence contains 68 words; every sentence begins with a start-of-sentence marker ID.
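As a quick check (a minimal sketch, assuming the loading code above has run), we can peek at the first few token IDs; with the default encoding of imdb.load_data(), every sequence begins with the start marker ID 1:
print(x_train[0][:5])  # e.g. [1, 14, 22, 16, 43]; the leading 1 is the start marker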
So how is each word encoded as a number? We can inspect the encoding table to see the scheme, for example:
# Digit-encoding table
word_index = imdb.get_word_index()
# Print each word in the table with its corresponding number
for k, v in word_index.items():
    print(k, v)
The output lists every word paired with its numeric index.
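Rather than printing the whole table, a single word can be looked up directly; in this raw table, indices start at 1 with the most frequent word:
print(word_index['the'])  # 1: 'the' is the most frequent word in the corpus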
Since the table's keys are words and its values are IDs, we invert the table here, after shifting the IDs to make room for the special-marker IDs:
# The first 4 IDs are reserved for special markers
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown word
word_index["<UNUSED>"] = 3
# Invert the encoding table
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
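A quick sanity check on the inverted table (a minimal sketch): after the +3 shift, ID 1 maps back to the start marker and 'the' now sits at ID 4:
print(reverse_word_index[1], reverse_word_index[4])  # <START> the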
A digit-encoded sentence can be converted back to a string with the following function:
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
# Convert the first sentence back to a string
print(decode_review(x_train[0]))
The output is as follows:
<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for <UNK> and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also <UNK> to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the <UNK> list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all
Since sentences vary in length, we set a length threshold by hand. For sentences longer than the threshold, part of the words are truncated, either from the beginning or from the end of the sentence; sentences shorter than the threshold are padded at the beginning or the end. Truncation and padding are conveniently handled by the keras.preprocessing.sequence.pad_sequences() function, for example:
# Truncate and pad the sentences to equal length; here long sentences keep their tail and short ones are padded at the front
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_review_len)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_review_len)
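This behavior comes from the function's defaults padding='pre' and truncating='pre'; a toy example (a minimal sketch) makes it concrete:
seqs = [[1, 2, 3, 4, 5, 6], [7, 8]]
print(keras.preprocessing.sequence.pad_sequences(seqs, maxlen=4))
# [[3 4 5 6]   <- long sequence: leading words truncated, tail kept
#  [0 0 7 8]]  <- short sequence: zero-padded at the front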
After truncation and padding to equal length, the arrays are wrapped into Dataset objects and the usual dataset-processing steps are attached:
# Build the datasets: shuffle, batch, and drop the final batch that falls short of batchsz
db_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
db_train = db_train.shuffle(1000).batch(batchsz, drop_remainder=True)
db_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))
db_test = db_test.batch(batchsz, drop_remainder=True)
# Print dataset statistics
print('x_train shape:', x_train.shape, tf.reduce_max(y_train), tf.reduce_min(y_train))
print('x_test shape:', x_test.shape)
The output shows that x_train now has shape (25000, 80) and x_test has shape (25000, 80), with label maximum 1 and minimum 0. After truncation and padding, every sentence has length 80, exactly the threshold we set. The drop_remainder=True argument drops the last batch, whose actual size may be smaller than the preset batch size.
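With these settings the number of batches is fully determined: 25,000 samples at batch size 128 give 195 full batches, and the final 40 samples are discarded. A quick check (a minimal sketch, assuming db_train was built as above):
print(25000 // batchsz, 25000 % batchsz)  # 195 40
print(len(list(db_train)))                # 195 batches after drop_remainder=True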
We now create a custom model class MyRNN, subclassing the Model base class; it needs an Embedding layer, two RNN layers, and a classification layer:
class MyRNN(keras.Model):
    # Build the multi-layer network Cell by Cell
    def __init__(self, units):
        super(MyRNN, self).__init__()
        # [b, 64], build the Cells' initial state vectors, reused across batches
        self.state0 = [tf.zeros([batchsz, units])]
        self.state1 = [tf.zeros([batchsz, units])]
        # Word-vector encoding: [b, 80] => [b, 80, 100]
        self.embedding = layers.Embedding(total_words, embedding_len,
                                          input_length=max_review_len)
        # Build 2 Cells
        self.rnn_cell0 = layers.SimpleRNNCell(units, dropout=0.5)
        self.rnn_cell1 = layers.SimpleRNNCell(units, dropout=0.5)
        # Build the classification network for the Cells' output features (binary classification)
        # [b, 80, 100] => [b, 64] => [b, 1]
        self.outlayer = layers.Dense(1)
Here the word-vector length is n = 100 and the RNN state-vector length h equals the units parameter; since the classification network performs binary classification, its output is a single node.
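The shapes can be verified with a single Cell step (a minimal sketch under these sizes): one timestep of input with n = 100 features and a state vector of length h = 64 produce an output of length 64:
import tensorflow as tf
from tensorflow.keras import layers
cell = layers.SimpleRNNCell(64)
h0 = [tf.zeros([4, 64])]          # batch of 4, state-vector length h = 64
xt = tf.random.normal([4, 100])   # one timestep of word vectors, n = 100
out, h1 = cell(xt, h0)
print(out.shape)                  # (4, 64)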
The forward pass works as follows: the input sequence is word-vector encoded by the Embedding layer, then passed through the two RNN layers timestep by timestep to extract semantic features; the last layer's state vector at the final timestep is fed into the classification network, and a Sigmoid activation produces the output probability. The code is as follows:
    def call(self, inputs, training=None):
        x = inputs  # [b, 80]
        # Get the word vectors: embedding: [b, 80] => [b, 80, 100]
        x = self.embedding(x)
        # Pass through the 2 RNN Cells: [b, 80, 100] => [b, 64]
        state0 = self.state0
        state1 = self.state1
        for word in tf.unstack(x, axis=1):  # word: [b, 100]
            out0, state0 = self.rnn_cell0(word, state0, training=training)
            out1, state1 = self.rnn_cell1(out0, state1, training=training)
        # The last layer's final output feeds the classification network: [b, 64] => [b, 1]
        x = self.outlayer(out1, training=training)
        # Apply the activation, p(y is pos|x)
        prob = tf.sigmoid(x)
        return prob
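As a hypothetical smoke test (assuming the class above is defined), one random batch can be run through the model to confirm the output shape; the batch must equal batchsz because the initial state vectors were built with that size:
model = MyRNN(units=64)
x = tf.random.uniform([batchsz, max_review_len], maxval=total_words, dtype=tf.int32)
print(model(x).shape)  # (128, 1): one probability per review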
For simplicity, we train the network with Keras's Compile & Fit interface, using the RMSprop optimizer with a learning rate of 0.001, the binary cross-entropy loss BinaryCrossentropy, and accuracy as the test metric. The code is as follows:
# Training and testing
def main():
    units = 64  # RNN state-vector length h
    epochs = 50  # number of training epochs
    model = MyRNN(units)
    # Compile
    model.compile(optimizer=optimizers.RMSprop(0.001),
                  loss=losses.BinaryCrossentropy(),
                  metrics=['accuracy'])
    # Train and validate
    model.fit(db_train, epochs=epochs, validation_data=db_test)
    # Test
    model.evaluate(db_test)
After a fixed 20 epochs of training, the network reaches 80.1% accuracy on the test set. The complete code is listed below:
import os
import ssl
import tensorflow as tf
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, losses, optimizers, Sequential
from tensorflow.keras.datasets import imdb
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # suppress TensorFlow info logs
ssl._create_default_https_context = ssl._create_unverified_context  # skip SSL verification when downloading the dataset
batchsz = 128  # batch size
total_words = 10000  # vocabulary size N_vocab
max_review_len = 80  # maximum sentence length s; longer sentences are truncated, shorter ones padded
embedding_len = 100  # word-vector feature length f
# Load the IMDB dataset; the data are digit-encoded, one number per word
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=total_words)
# Print the shapes of the inputs and the labels
print(x_train.shape, len(x_train[0]), y_train.shape)
print(x_test.shape, len(x_test[0]), y_test.shape)
# Digit-encoding table
word_index = imdb.get_word_index()
# Print each word in the table with its corresponding number
# for k, v in word_index.items():
#     print(k, v)
# The first 4 IDs are reserved for special markers
word_index = {k: (v + 3) for k, v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown word
word_index["<UNUSED>"] = 3
# Invert the encoding table
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
# # Convert the first sentence back to a string
# print(decode_review(x_train[0]))
# Truncate and pad the sentences to equal length; here long sentences keep their tail and short ones are padded at the front
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=max_review_len)
x_test = keras.preprocessing.sequence.pad_sequences(x_test, maxlen=max_review_len)
# Build the datasets: shuffle, batch, and drop the final batch that falls short of batchsz
db_train = tf.data.Dataset.from_tensor_slices((x_train, y_train))
db_train = db_train.shuffle(1000).batch(batchsz, drop_remainder=True)
db_test = tf.data.Dataset.from_tensor_slices((x_test, y_test))
db_test = db_test.batch(batchsz, drop_remainder=True)
# Print dataset statistics
print('x_train shape:', x_train.shape, tf.reduce_max(y_train), tf.reduce_min(y_train))
print('x_test shape:', x_test.shape)
class MyRNN(keras.Model):
    # Build the multi-layer network Cell by Cell
    def __init__(self, units):
        super(MyRNN, self).__init__()
        # [b, 64], build the Cells' initial state vectors, reused across batches
        self.state0 = [tf.zeros([batchsz, units])]
        self.state1 = [tf.zeros([batchsz, units])]
        # Word-vector encoding: [b, 80] => [b, 80, 100]
        self.embedding = layers.Embedding(total_words, embedding_len,
                                          input_length=max_review_len)
        # Build 2 Cells
        self.rnn_cell0 = layers.SimpleRNNCell(units, dropout=0.5)
        self.rnn_cell1 = layers.SimpleRNNCell(units, dropout=0.5)
        # Build the classification network for the Cells' output features (binary classification)
        # [b, 80, 100] => [b, 64] => [b, 1]
        self.outlayer = Sequential([
            layers.Dense(units),
            layers.Dropout(rate=0.5),
            layers.ReLU(),
            layers.Dense(1)])
    def call(self, inputs, training=None):
        x = inputs  # [b, 80]
        # Get the word vectors: embedding: [b, 80] => [b, 80, 100]
        x = self.embedding(x)
        # Pass through the 2 RNN Cells: [b, 80, 100] => [b, 64]
        state0 = self.state0
        state1 = self.state1
        for word in tf.unstack(x, axis=1):  # word: [b, 100]
            out0, state0 = self.rnn_cell0(word, state0, training=training)
            out1, state1 = self.rnn_cell1(out0, state1, training=training)
        # The last layer's final output feeds the classification network: [b, 64] => [b, 1]
        x = self.outlayer(out1, training=training)
        # Apply the activation, p(y is pos|x)
        prob = tf.sigmoid(x)
        return prob
# Training and testing
def main():
    units = 64  # RNN state-vector length h
    epochs = 50  # number of training epochs
    model = MyRNN(units)
    # Compile
    model.compile(optimizer=optimizers.RMSprop(0.001),
                  loss=losses.BinaryCrossentropy(),
                  metrics=['accuracy'])
    # Train and validate
    model.fit(db_train, epochs=epochs, validation_data=db_test)
    # Test
    model.evaluate(db_test)
if __name__ == '__main__':
    main()
As you can see, after 45 epochs of training the accuracy peaks at 80.08%.
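To actually use the trained network for prediction, a hypothetical inference sketch could look like the following (it assumes main() is modified to end with return model, a change from the listing above):
model = main()                               # hypothetical: main() now returns the trained model
probs = model.predict(db_test.take(1))       # probabilities for one test batch, shape (128, 1)
labels = (probs[:, 0] > 0.5).astype(int)     # threshold at 0.5: 1 = positive, 0 = negative
print(labels[:10])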