
Recurrent Neural Networks for Text Sentiment Classification: Completing Sentiment Classification with an LSTM


1. Using an LSTM for Text Sentiment Classification

In the previous section we used word embeddings to build a toy-level text sentiment classifier. Now we add an LSTM layer to that model and observe the classification performance.

To get better results, we make the following changes to the previous model:

  1. MAX_LEN = 200

  2. When building the dataset, turn the task into binary classification: pos is labeled 1 and neg is labeled 0. Otherwise, 25,000 samples are not enough data to split across 10 classes.

  3. When instantiating the LSTM, use dropout=0.5; once model.eval() is called, dropout is automatically disabled (see the short sketch after this list).
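
As a quick illustration of point 3 (a minimal sketch, not part of the original model): a dropout layer only zeroes activations while the module is in training mode; after model.eval() it acts as the identity. The dropout argument of nn.LSTM behaves the same way, applied between stacked LSTM layers during training only.

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()   # training mode: roughly half the values are zeroed, the rest are rescaled to 2.0
print(drop(x))

drop.eval()    # eval mode: dropout is a no-op and the input passes through unchanged
print(drop(x))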

1.1 Modifying the model

import torch
import pickle
import torch.nn as nn
import torch.nn.functional as F

ws = pickle.load(open('./model/ws.pkl', 'rb'))
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


class IMDBLstmModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embedding_dim = 200
        self.hidden_size = 64
        self.num_layer = 2
        self.bidirectional = True
        self.bi_num = 2 if self.bidirectional else 1
        self.dropout = 0.5
        # Everything above is a hyperparameter and can be changed freely
        self.embedding = nn.Embedding(len(ws), self.embedding_dim, padding_idx=ws.PAD)  # [len(ws), 200]
        self.lstm = nn.LSTM(self.embedding_dim, self.hidden_size, self.num_layer,
                            bidirectional=self.bidirectional, dropout=self.dropout)
        # Two fully connected layers with a ReLU activation in between
        self.fc = nn.Linear(self.hidden_size * self.bi_num, 20)
        self.fc2 = nn.Linear(20, 2)

    def forward(self, x):
        x = self.embedding(x)
        x = x.permute(1, 0, 2)  # swap axes: [batch, seq, emb] -> [seq, batch, emb]
        h_0, c_0 = self.init_hidden_state(x.size(1))
        _, (h_n, c_n) = self.lstm(x, (h_0, c_0))
        # Keep only the result of the last LSTM step: concatenate the final hidden
        # states of the forward and backward directions
        out = torch.cat([h_n[-2, :, :], h_n[-1, :, :]], dim=-1)
        out = self.fc(out)
        out = F.relu(out)
        out = self.fc2(out)
        return F.log_softmax(out, dim=-1)

    def init_hidden_state(self, batch_size):
        h_0 = torch.rand(self.num_layer * self.bi_num, batch_size, self.hidden_size).to(device)
        c_0 = torch.rand(self.num_layer * self.bi_num, batch_size, self.hidden_size).to(device)
        return h_0, c_0
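
Before writing the training loop, a quick shape check helps confirm the wiring. This is a minimal sketch (not part of the original tutorial); it reuses the ws and device objects defined above and feeds a random batch of token indices:

imdb_model = IMDBLstmModel().to(device)
dummy_input = torch.randint(0, len(ws), (4, 200)).to(device)  # [batch_size=4, MAX_LEN=200]
with torch.no_grad():
    out = imdb_model(dummy_input)
print(out.shape)  # expected: torch.Size([4, 2]), log-probabilities over the two classes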

1.2 Training and test code

To speed up the program, the model can be run on the GPU, which requires the following changes:

  1. device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

  2. model.to(device)

  3. Besides the above, every tensor involved in the computation must also be converted to a CUDA tensor:

    1. the initial h_0 and c_0

    2. the input and target of the training and test sets

  4. At the end, tensor.cpu() converts a result back to an ordinary (CPU) tensor; a minimal sketch of this pattern follows this list.
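
In isolation the pattern looks like the sketch below; the random batch is only a placeholder standing in for one (target, input, input_length) batch from the dataloader and is not part of the original code:

import torch
import torch.nn.functional as F

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = IMDBLstmModel().to(device)                        # 1./2. move the model parameters to the device

input = torch.randint(0, len(ws), (64, 200)).to(device)  # 3. move each batch's input ...
target = torch.randint(0, 2, (64,)).to(device)           #    ... and target to the device
output = model(input)                                     # h_0 / c_0 are created on `device` inside the model
loss = F.nll_loss(output, target)

pred = output.max(dim=-1)[-1]
acc = pred.eq(target).cpu().numpy().mean()                # 4. move back to the CPU before converting to numpy
print(loss.item(), acc)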

from torch import optim

# get_dataloader is assumed to be the one from the earlier word-embedding section,
# where each batch is (target, input, input_length)
train_batch_size = 64
test_batch_size = 5000

# imdb_model = IMDBLstmModel(MAX_LEN)  # base model
imdb_model = IMDBLstmModel().to(device)  # run on the GPU to speed things up
# imdb_model.load_state_dict(torch.load("model/
optimizer = optim.Adam(imdb_model.parameters())
criterion = nn.CrossEntropyLoss()


def train(epoch):
    mode = True
    imdb_model.train(mode)
    train_dataloader = get_dataloader(mode, train_batch_size)
    for idx, (target, input, input_length) in enumerate(train_dataloader):
        target = target.to(device)
        input = input.to(device)
        optimizer.zero_grad()
        output = imdb_model(input)
        loss = F.nll_loss(output, target)  # labels must start from 0, e.g. [0, 9], not [1, 10]
        loss.backward()
        optimizer.step()
        if idx % 10 == 0:
            pred = torch.max(output, dim=-1, keepdim=False)[-1]
            acc = pred.eq(target.data).cpu().numpy().mean() * 100.  # eq checks element-wise equality
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}\t ACC: {:.6f}'.format(
                epoch, idx * len(input), len(train_dataloader.dataset),
                100. * idx / len(train_dataloader), loss.item(), acc))
            torch.save(imdb_model.state_dict(), "model/mnist_net.pkl")
            torch.save(optimizer.state_dict(), 'model/mnist_optimizer.pkl')


def test():
    mode = False
    imdb_model.eval()
    test_dataloader = get_dataloader(mode, test_batch_size)
    with torch.no_grad():
        for idx, (target, input, input_length) in enumerate(test_dataloader):
            target = target.to(device)
            input = input.to(device)
            output = imdb_model(input)
            test_loss = F.nll_loss(output, target, reduction="mean")
            pred = torch.max(output, dim=-1, keepdim=False)[-1]
            correct = pred.eq(target.data).sum()
            acc = 100. * pred.eq(target.data).cpu().numpy().mean()
            print('idx: {} Test set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.2f}%)\n'.format(
                idx, test_loss, correct, target.size(0), acc))


if __name__ == "__main__":
    test()
    for i in range(10):
        train(i)
        test()

1.3 Final training output

...
Train Epoch: 9 [20480/25000 (82%)] Loss: 0.017165 ACC: 100.000000
Train Epoch: 9 [21120/25000 (84%)] Loss: 0.021572 ACC: 98.437500
Train Epoch: 9 [21760/25000 (87%)] Loss: 0.058546 ACC: 98.437500
Train Epoch: 9 [22400/25000 (90%)] Loss: 0.045248 ACC: 98.437500
Train Epoch: 9 [23040/25000 (92%)] Loss: 0.027622 ACC: 98.437500
Train Epoch: 9 [23680/25000 (95%)] Loss: 0.097722 ACC: 95.312500
Train Epoch: 9 [24320/25000 (97%)] Loss: 0.026713 ACC: 98.437500
Train Epoch: 9 [15600/25000 (100%)] Loss: 0.006082 ACC: 100.000000
idx: 0 Test set: Avg. loss: 0.8794, Accuracy: 4053/5000 (81.06%)
idx: 1 Test set: Avg. loss: 0.8791, Accuracy: 4018/5000 (80.36%)
idx: 2 Test set: Avg. loss: 0.8250, Accuracy: 4087/5000 (81.74%)
idx: 3 Test set: Avg. loss: 0.8380, Accuracy: 4074/5000 (81.48%)
idx: 4 Test set: Avg. loss: 0.8696, Accuracy: 4027/5000 (80.54%)

The model's test accuracy stabilizes at around 81%.

You can change the code above to use a GRU, or stack more LSTM layers, and see how the results change; a hedged sketch of the GRU variant follows.
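
As one possible starting point, the sketch below swaps nn.LSTM for nn.GRU. It is an illustrative assumption, not part of the original tutorial; a GRU has no cell state, so only h_0 is passed in and only h_n comes back:

class IMDBGruModel(IMDBLstmModel):
    # Same architecture as IMDBLstmModel, with the recurrent layer replaced by a GRU (hypothetical variant)
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(self.embedding_dim, self.hidden_size, self.num_layer,
                          bidirectional=self.bidirectional, dropout=self.dropout)

    def forward(self, x):
        x = self.embedding(x).permute(1, 0, 2)
        h_0, _ = self.init_hidden_state(x.size(1))  # no cell state for a GRU
        _, h_n = self.gru(x, h_0)
        out = torch.cat([h_n[-2, :, :], h_n[-1, :, :]], dim=-1)
        return F.log_softmax(self.fc2(F.relu(self.fc(out))), dim=-1)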

Full code:

Directory structure: main.py, model.py, dataset.py, word_squence.py and lib.py sit in the project root, with a model/ directory holding the saved vocabulary and weights and the IMDB data under ../data/aclImdb/.

main.py

# Because of how pickle works, Word2Sequence must be imported here
from word_squence import Word2Sequence
import pickle
import os
from dataset import tokenlize
from tqdm import tqdm  # show the progress of the current iteration

TRAIN_PATH = r"../data/aclImdb/train"

if __name__ == '__main__':
    ws = Word2Sequence()
    temp_data_path = [os.path.join(TRAIN_PATH, 'pos'), os.path.join(TRAIN_PATH, 'neg')]
    for data_path in temp_data_path:
        # get the path of every file
        file_paths = [os.path.join(data_path, file_name) for file_name in os.listdir(data_path)]
        for file_path in tqdm(file_paths):
            sentence = tokenlize(open(file_path, errors='ignore').read())
            ws.fit(sentence)
    ws.build_vocab(max=10, max_features=10000)
    pickle.dump(ws, open('./model/ws.pkl', 'wb'))
    print(len(ws.dict))

model.py

  1. """
  2. 定义模型
  3. 模型优化方法:
  4. # 为使得结果更好 添加一个新的全连接层,作为输出,激活函数处理
  5. # 把双向LSTM的output传给一个单向LSTM再进行处理
  6. lib.max_len = 200
  7. lib.embedding_dim = 100 # 用长度为100的向量表示一个词
  8. lib.hidden_size = 128 # 每个隐藏层中LSTM单元个数
  9. lib.num_layer = 2 # 隐藏层数量
  10. lib.bidirectional = True # 是否双向LSTM
  11. lib.dropout = 0.3 # 在训练时以一定的概率使神经元失活,实际上就是让对应神经元的输出为0
  12. lib.device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
  13. """
  14. import torch.nn as nn
  15. from lib import ws
  16. import torch.nn.functional as F
  17. from torch.optim import Adam
  18. from dataset import get_dataloader
  19. from tqdm import tqdm
  20. import torch
  21. import numpy as np
  22. import lib
  23. import os
  24. class Mymodel(nn.Module):
  25. def __init__(self):
  26. super().__init__()
  27. # nn.Embedding(num_embeddings - 词嵌入字典大小即一个字典里要有多少个词,embedding_dim - 每个词嵌入向量的大小。)
  28. self.embedding = nn.Embedding(len(ws), 100)
  29. # 加入LSTM
  30. self.lstm = nn.LSTM(input_size=lib.embedding_dim, hidden_size=lib.hidden_size, num_layers=lib.num_layer,
  31. batch_first=True, bidirectional=lib.bidirectional, dropout=lib.dropout)
  32. self.fc = nn.Linear(lib.hidden_size * 2, 2)
  33. def forward(self, input):
  34. """
  35. :param input: 形状[batch_size, max_len]
  36. :return:
  37. """
  38. x = self.embedding(input) # 进行embedding,形状[batch_size, max_len, 100]
  39. # x [batch_size, max_len, num_direction*hidden_size]
  40. # h_n[num_direction * num_layer, batch_size, hidden_size]
  41. x, (h_n, c_n) = self.lstm(x)
  42. # 获取两个方向最后一次的output(正向最后一个和反向第一个)进行concat
  43. # output = x[:,-1,:hidden_size] 前向,等同下方
  44. output_fw = h_n[-2, :, :] # 正向最后一次输出
  45. # output = x[:,0,hidden_size:] 反向,等同下方
  46. output_bw = h_n[-1, :, :] # 反向最后一次输出
  47. # 只要最后一个lstm单元处理的结果,这里去掉了hidden state
  48. output = torch.cat([output_fw, output_bw], dim=-1) # [batch_size, hidden_size*num_direction]
  49. out = self.fc(output)
  50. return F.log_softmax(out, dim=-1)
  51. model = Mymodel()
  52. optimizer = Adam(model.parameters(), lr=0.01)
  53. if os.path.exists('./model/model.pkl'):
  54. model.load_state_dict(torch.load('./model/model.pkl'))
  55. optimizer.load_state_dict(torch.load('./model/optimizer.pkl'))
  56. # 训练
  57. def train(epoch):
  58. for idx, (input, target) in enumerate(get_dataloader(train=True)):
  59. output = model(input)
  60. optimizer.zero_grad()
  61. loss = F.nll_loss(output, target)
  62. loss.backward()
  63. optimizer.step()
  64. print(loss.item())
  65. print('当前第%d轮,idx为%d 损失为:%lf, ' % (epoch, idx, loss.item()))
  66. # 保存模型
  67. if idx % 100 == 0:
  68. torch.save(model.state_dict(), './model/model.pkl')
  69. torch.save(optimizer.state_dict(), './model/optimizer.pkl')
  70. # 评估
  71. def test():
  72. acc_list = []
  73. loss_list = []
  74. # 开启模型评估模式
  75. model.eval()
  76. # 获取测试集数据
  77. test_dataloader = get_dataloader(train=False)
  78. # tqdm(total = 总数,ascii = #,desc=描述)
  79. for idx, (input, target) in tqdm(enumerate(test_dataloader), total=len(test_dataloader), ascii=True, desc='评估:'):
  80. with torch.no_grad():
  81. output = model(input)
  82. # 计算当前损失
  83. cur_loss = F.nll_loss(output, target)
  84. loss_list.append(cur_loss)
  85. pred = output.max(dim=-1)[-1]
  86. # 计算当前准确率
  87. cur_acc = pred.eq(target).float().mean()
  88. acc_list.append(cur_acc)
  89. print('准确率为:%lf, 损失为:%lf' % (np.mean(acc_list), np.mean(loss_list)))
  90. if __name__ == '__main__':
  91. for i in tqdm(range(10)):
  92. train(i)
  93. test()

dataset.py:

import torch
from torch.utils.data import Dataset, DataLoader
import os
import re

"""
Prepare the dataset
"""
TRAIN_PATH = r"..\data\aclImdb\train"
TEST_PATH = r"..\data\aclImdb\test"


# tokenization
def tokenlize(content):
    content = re.sub(r"<.*?>", " ", content)
    filters = ['!', '"', '#', '$', '%', '&', '\(', '\)', '\*', '\+', ',', '-', '\.', '/', ':', ';', '<', '=', '>', '\?',
               '@', '\[', '\\', '\]', '^', '_', '`', '\{', '\|', '\}', '~', '\t', '\n', '\x97', '\x96', '”', '“', ]
    content = re.sub("|".join(filters), " ", content)
    tokens = [i.strip().lower() for i in content.split()]
    return tokens


class ImbdDateset(Dataset):
    def __init__(self, train=True):
        self.train_data_path = TRAIN_PATH
        self.test_data_path = TEST_PATH
        # `train` decides whether the train or the test set is read
        data_path = self.train_data_path if train else self.test_data_path
        # collect all file paths
        # temp_data_path = [data_path + '/pos', data_path + '/neg']
        temp_data_path = [os.path.join(data_path, 'pos'), os.path.join(data_path, 'neg')]
        self.total_file_path = []  # paths of all pos and neg review files
        # get every file name and join it with its directory
        for path in temp_data_path:
            file_name_list = os.listdir(path)
            file_path_list = [os.path.join(path, i) for i in file_name_list if i.endswith('.txt')]
            self.total_file_path.extend(file_path_list)

    def __getitem__(self, index):
        # path for this index
        file_path = self.total_file_path[index]
        # label: the parent directory name ('pos' or 'neg')
        label_str = file_path.split('\\')[-2]
        label = 0 if label_str == 'neg' else 1
        # content
        tokens = tokenlize(open(file_path, errors='ignore').read())
        return tokens, label

    def __len__(self):
        return len(self.total_file_path)


def get_dataloader(train=True):
    imdb_dataset = ImbdDateset(train)
    data_loader = DataLoader(imdb_dataset, shuffle=True, batch_size=128, collate_fn=collate_fn)
    return data_loader


# custom collate_fn
def collate_fn(batch):
    """
    :param batch: [(tokens, label), (tokens, label), ...], batch_size items
    :return:
    """
    content, label = list(zip(*batch))
    from lib import ws, max_len
    content = [ws.transform(i, max_len=max_len) for i in content]
    content = torch.LongTensor(content)
    label = torch.LongTensor(label)
    return content, label


if __name__ == '__main__':
    for idx, (input, target) in enumerate(get_dataloader()):
        print(idx)
        print(input)
        print(target)
        break

word_squence.py

import numpy as np

"""
Build the vocabulary: convert sentences into index sequences, and back again
"""


class Word2Sequence(object):
    # two special class attributes marking the unknown token and the padding token
    UNK_TAG = 'UNK'
    PAD_TAG = 'PAD'
    UNK = 0
    PAD = 1

    def __init__(self):
        self.dict = {
            # maps words to their indices
            self.UNK_TAG: self.UNK,
            self.PAD_TAG: self.PAD
        }
        self.count = {}  # word frequency counts

    def fit(self, sentence):
        """
        Add a single sentence to the counts.
        :param sentence: [word1, word2, ...]
        :return:
        """
        for word in sentence:
            # count word frequencies: get returns 0 for an unseen word, then we add 1
            self.count[word] = self.count.get(word, 0) + 1

    def build_vocab(self, min=5, max=None, max_features=None):
        """
        Build the vocabulary.
        :param min: minimum word frequency
        :param max: maximum word frequency
        :param max_features: total number of words to keep
        :return:
        """
        # drop low-frequency words, i.e. keep words with count > min
        if min is not None:
            self.count = {word: value for word, value in self.count.items() if value > min}
        # drop high-frequency words, i.e. keep words with count < max
        if max is not None:
            self.count = {word: value for word, value in self.count.items() if value < max}
        # limit the number of words kept
        if max_features is not None:
            # sorted returns a list [(key1, value1), (key2, value2), ...]; reverse=True sorts by count in descending order
            temp = sorted(self.count.items(), key=lambda x: x[-1], reverse=True)[:max_features]
            self.count = dict(temp)
        for word in self.count:
            self.dict[word] = len(self.dict)
        # build the inverse dictionary
        # zip is faster than {value: word for word, value in self.dict.items()}
        self.inverse_dict = dict(zip(self.dict.values(), self.dict.keys()))

    def transform(self, sentence, max_len=None):
        """
        Convert a sentence into a sequence of indices.
        :param sentence: [word1, word2, ...]
        :param max_len: pad or truncate the sentence to this length
        :return:
        """
        if max_len is not None:
            # pad if the sentence is shorter than max_len
            if max_len > len(sentence):
                sentence = sentence + [self.PAD_TAG] * (max_len - len(sentence))
            # truncate if the sentence is longer than max_len
            if max_len < len(sentence):
                sentence = sentence[:max_len]
        # dict.get(key, default) returns the default when the key is missing,
        # so unknown words map to UNK
        return [self.dict.get(word, self.UNK) for word in sentence]

    def inverse_transform(self, indices):
        """
        Convert a sequence of indices back into words.
        :param indices: [1, 2, 3, ...]
        :return:
        """
        return [self.inverse_dict.get(idx) for idx in indices]

    def __len__(self):
        return len(self.dict)


if __name__ == '__main__':
    ws = Word2Sequence()
    ws.fit(["我", "是", "我"])
    ws.fit(["我", "是", "谁"])
    ws.build_vocab(min=1, max_features=5)
    print(ws.dict)
    ret = ws.transform(['我', '爱', '北京'], max_len=10)
    print(ret)
    print(ws.inverse_transform(ret))
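
For reference, the __main__ demo above should print roughly the following (assuming the insertion-ordered dicts of Python 3.7+): only 我 and 是 have a count above min=1, 爱 and 北京 fall back to UNK, and the sentence is padded with PAD up to length 10.

{'UNK': 0, 'PAD': 1, '我': 2, '是': 3}
[2, 0, 0, 1, 1, 1, 1, 1, 1, 1]
['我', 'UNK', 'UNK', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD', 'PAD']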

lib.py

import pickle
import torch

ws = pickle.load(open('./model/ws.pkl', 'rb'))
max_len = 200
embedding_dim = 100   # represent each word with a vector of length 100
hidden_size = 128     # number of LSTM units in each hidden layer
num_layer = 2         # number of hidden layers
bidirectional = True  # whether to use a bidirectional LSTM
dropout = 0.3         # randomly deactivate neurons during training, i.e. set their outputs to 0
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

 
