
Lesson 6: NLP Text Classification

Text Classification

An Introduction to NLP Text Classification

Text classification is one of the most basic and practical NLP applications; it is used for tasks such as spam detection, sentiment classification, and topic classification.
For sentiment classification there is the classic IMDB movie-review dataset:
fig1
The reviews can be manually labeled as positive ("pos") or negative ("neg"). Small deep-learning models commonly used for text classification include WordAveraging, RNN, and CNN; all three achieve good results.

Preparing the IMDB Dataset

In this sentiment-classification task the data consist of review text plus the two sentiment labels "pos" and "neg". We again use torchtext to build the dataset: a TEXT Field defines how the movie reviews are processed, and a LABEL Field handles the two sentiment classes.
The TEXT Field is created with tokenize='spacy', which means the English sentences are tokenized with the spaCy tokenizer. If the tokenize argument is not specified, the default is to split on whitespace.
First, install spaCy:

pip install -U spacy  # -U (upgrade): upgrade to the latest version if already installed
python -m spacy download en

spaCy is a very fast, industrial-strength natural language processing library that supports many core NLP tasks. Official site: spacy.io
Its main features include tokenization, part-of-speech tagging, lemmatization, named entity recognition, noun-phrase extraction, and more.
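As a quick illustration of these features (a minimal sketch; the sample sentence is arbitrary and the "en" shortcut assumes the English model downloaded above):

import spacy

nlp = spacy.load("en")  # the English model installed above
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")
for tok in doc:
    print(tok.text, tok.pos_, tok.lemma_)            # token, POS tag, lemma
print([(ent.text, ent.label_) for ent in doc.ents])  # named entities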


Setting the Random Seed

Import torchtext and set a random seed so the experiments are reproducible:

import torch
from torchtext import data

SEED=1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
# use deterministic cuDNN algorithms so convolution results are reproducible
torch.backends.cudnn.deterministic = True


TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)

Downloading and Splitting the IMDB Dataset

torchtext ships with many common NLP datasets. The following code downloads the IMDB dataset and splits it into train/test torchtext.datasets objects, processing the data with the Fields defined above:

from torchtext import datasets
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

The IMDB dataset contains 50,000 movie reviews in total, each labeled as positive or negative.
Check how many examples the training and test sets contain:

print('Number of training examples: ',len(train_data))
print('Number of testing examples: ',len(test_data))

The result is:
Number of training examples: 25000
Number of testing examples: 25000
We can also inspect a single example with vars(). As a reminder from the Python notes, vars() returns a dictionary of an object's attributes and their values; called without an argument it behaves like locals() at the call site.

# inspect one example

# reminder from the Python notes: vars() returns a dict of an object's
# attributes and their values; with no argument it behaves like locals()

vars(train_data.examples[0])

fig2
Since there are only train and test splits so far, we also need to create a validation set; .split() does this (a sketch with an explicit split_ratio follows the code below):

# we only have train/test, so we create a new validation set with .split()
# the default is split_ratio=0.7; setting split_ratio changes the proportions,
#   e.g. split_ratio=0.8 means 80% training data and 20% validation data
# passing random_state makes the split identical on every run

import random
train_data, valid_data = train_data.split(random_state=random.seed(SEED))
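If a different ratio is wanted, a sketch with an explicit split_ratio (the 0.8 here is only an example, not the value used in the rest of this lesson) would be:

# assumption: 0.8 means 80% training data and 20% validation data
train_data, valid_data = train_data.split(split_ratio=0.8,
                                          random_state=random.seed(SEED))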

Check the dataset sizes after re-splitting:

print('Number of training examples: ',len(train_data))
print('Number of testing examples: ',len(test_data))
print('Number of validation examples: ',len(valid_data))

The result is:
Number of training examples: 17500
Number of testing examples: 25000
Number of validation examples: 7500

Building the Vocabularies

We build two vocabularies: one for the review text and one for the sentiment labels. While building the text vocabulary we also load the pretrained word vectors glove.6B.100d. glove.6B is a set of word vectors trained at Stanford (about 862 MB), and glove.6B.100d is its 100-dimensional version. TEXT.build_vocab matches the words in our vocabulary against the GloVe entries and assembles the word vectors we need; later they can be retrieved through TEXT.vocab.vectors.


Downloading and loading glove.6B.100d
A quick and convenient download is available on Kaggle (glove.6B.100d); it is a word-vector file stored as plain text:
fig3
Load the glove.6B.100d vectors:

from torchtext.vocab import Vectors
# name is the file name, cache is the directory that contains it
vectors = Vectors(name='glove.6B.100d.txt', cache='.vector_cache')

Building vocabularies with torchtext is straightforward; here we build them from the training data train_data:

from torchtext.vocab import Vectors
glove_vector=Vectors(name='glove.6B.100d.txt',cache='./DataSet/glove.6B')

MAX_VOCAB_SIZE=25000

TEXT.build_vocab(train_data, 
                 max_size=MAX_VOCAB_SIZE,
                 vectors=glove_vector)

LABEL.build_vocab(train_data)

Using the itos (index-to-string) list we can inspect the words in each vocabulary; a quick inspection is sketched below:
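A minimal sketch of that inspection (the indices and the word 'film' are arbitrary examples):

# the first ten entries of the text vocabulary; with the default specials,
# index 0 is <unk> and index 1 is <pad>
print(TEXT.vocab.itos[:10])
# the label vocabulary contains only the two classes
print(LABEL.vocab.itos)
# stoi maps the other way, from token to index
print(TEXT.vocab.stoi['film'])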
fig4
Print the number of tokens in each vocabulary:

print("Unique tokens in TEXT vocabulary: ",len(TEXT.vocab))
print("Unique tokens in LABEL vocabulary: ",len(LABEL.vocab))

This gives:
Unique tokens in TEXT vocabulary: 25002
Unique tokens in LABEL vocabulary: 2
The TEXT vocabulary has 2 more tokens than MAX_VOCAB_SIZE (25,000); recalling Lesson 5, this is because <unk> and <pad> are added automatically.
TEXT.vocab.freqs.most_common shows the most frequent words in the training data:

# most common words in the training set
print(TEXT.vocab.freqs.most_common(20))

fig5

Generating Batches with torchtext

The final step of data preparation is creating the iterators; each iteration returns one batch of data. Language models use BPTTIterator, while text classification usually uses BucketIterator. BucketIterator also groups sentences of similar length into the same batch so that each batch does not contain too much padding.
Note: seq_len differs from batch to batch.

"""
The final step of data preparation is creating the iterators; each iteration returns one batch of examples.
We use BucketIterator, which groups sentences of similar length into the same batch so that each batch does not contain too much padding.
seq_len differs from batch to batch.
"""
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE,
    device=device)

With the iterator objects in hand, we can generate a batch:

it=iter(train_iterator)
batch=next(it)

A batch contains two objects, text and label, with shapes:

  • batch.text: seq_len * batch_size
  • batch.label: batch_size

The current batch is:

"""
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 975x64]
	[.label]:[torch.FloatTensor of size 64]
"""

Calling next again, batch=next(it), gives:

"""
[torchtext.data.batch.Batch of size 64]
	[.text]:[torch.LongTensor of size 1110x64]
	[.label]:[torch.FloatTensor of size 64]
"""

This confirms that seq_len varies across batches.
Inspect the batch's text and label objects:

batch.text,batch.label
"""
(tensor([[ 374,  873,  935,  ...,   11, 1677,  170],
         [ 173,   53, 9854,  ...,  533,   98,   31],
         [  44,  288,   15,  ...,    2, 2327,  200],
         ...,
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1],
         [   1,    1,    1,  ...,    1,    1,    1]]),
 tensor([0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 1., 1., 1., 1.,
         1., 1., 0., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 1., 0., 1., 1.,
         1., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0., 0., 0., 1.,
         0., 1., 1., 0., 1., 1., 1., 0., 1., 0.]))
"""

Everything has already been numericalized automatically into tensors of vocabulary indices (loosely called "one-hot" encoding).
Decode the first sentence in the batch:

print(" ".join(TEXT.vocab.itos[i] for i in batch.text[:,0]))
"""
For those who are too young to know this or for those who have forgotten , the Disney company went almost down the tubes by the end of the 1980s . People were NOT seeing their movies anymore and the company was not producing the usual wholesome material .... at least no what people expected . A major problem : <unk> /><br />Yes , the idiots running the Disney movies during that decade would produce films with swear words - including the Lord 's name in vain , if you can believe that - interspersed in these " family films . " In fact that happens twice here in the first 20 minutes ! < br /><br />This movie , in addition to the language problems , has a nasty tone to it , too , which made it unlikeable almost right from the beginning . Thankfully , Disney woke up and has produced a lot of great material since these decadent ' 80s movies . ( " <unk> " is Disney , just under another name . ) <pad> <pad> <pad> <pad> <pad> <pad> (the remaining tokens are all <pad>, filling this review out to the batch's seq_len)
"""

That completes the dataset preparation; next comes the model implementation.

WordAveraging

WordAveraging looks simple, but it turns out to be very well suited to text classification, and in many cases it matches or outperforms more complex models such as LSTMs.
The model structure is as follows:
fig6
It simply averages the word vectors of the words in a sentence, turning the sentence into a fixed-length vector with embed_size elements, and then feeds that vector into a fully connected network for classification.
The averaging can be implemented with 2-D average pooling; the process can be described in the following steps.
Take the sentence "I hate this film": each word is first mapped to a fixed-length (embed_size) word vector, and the vectors are stacked into a 2-D tensor:
fig7
The red region marks the pooling kernel, so the kernel size should be (seq_len, 1):
fig8

fig9
Once the kernel has swept over the 2-D tensor of word vectors, we obtain a single vector with embed_size elements; this vector is then fed into a fully connected NN that extracts further features and produces the classification (a small sanity check of this pooling step is sketched below).
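As a small sanity check (a minimal sketch with made-up shapes, not part of the original code), avg_pool2d with a (seq_len, 1) kernel is just the mean over the sequence dimension:

import torch
import torch.nn.functional as F

batch_size, seq_len, embed_size = 2, 4, 5
embedded = torch.randn(batch_size, seq_len, embed_size)

# average pooling with kernel (seq_len, 1) collapses the sequence dimension...
pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)  # [batch_size, embed_size]
# ...which is exactly the mean over dim=1
print(torch.allclose(pooled, embedded.mean(dim=1)))  # True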

Model Definition

The model is defined as follows:

# define the model
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAverage(nn.Module):
    def __init__(self,vocab_size,embed_size,output_size):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,embed_size)
        self.linear=nn.Linear(embed_size,output_size)
    
    def forward(self,text):
        # text: [seq_len,batch_size]
        embeded=self.embed(text) #[seq_len,batch_size,embed_size]
        
        # reorder the dimensions
        embeded=embeded.permute(1,0,2) #[batch_size,seq_len,embed_size]
        
        # average over the sequence: kernel_size is (seq_len,1)
        pooled=F.avg_pool2d(embeded,(embeded.shape[1],1)) #[batch_size,1,embed_size]
        # remove the redundant dimension
        pooled=pooled.squeeze()
        
        return self.linear(pooled) #[batch_size,output_size:1]

Instantiate the model with its hyperparameters. Since the loss will be binary cross-entropy (BCE), the model output preds and the target labels should both have shape [batch_size] (a tiny shape check is sketched after the code below):

VOCAB_SIZE=len(TEXT.vocab)
EMBEDDING_SIZE=100
# two classes, so binary cross-entropy is used
OUTPUT_SIZE=1

USE_CUDA=torch.cuda.is_available()

model=WordAverage(VOCAB_SIZE,
                 EMBEDDING_SIZE,
                 OUTPUT_SIZE)
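As a tiny check of the shape convention mentioned above (random numbers, purely illustrative; loss_fn_demo is just a throwaway name), BCEWithLogitsLoss takes raw logits and float labels of the same shape:

loss_fn_demo = nn.BCEWithLogitsLoss()
logits = torch.randn(4)                    # [batch_size] raw scores from the model
targets = torch.tensor([1., 0., 1., 0.])   # [batch_size] float labels
print(loss_fn_demo(logits, targets))       # a single scalar loss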

tensor.numel() returns the number of elements in a tensor, so we can check the number of weights in the embed layer:

model.embed.weight.numel()

In the same way, we can count all trainable parameters of the model:

# count the number of trainable model parameters
def count_parameters(model):
    # numel() returns the number of elements in a parameter tensor
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

count_parameters(model)

Loading the glove.6B Word Vectors into the Model

As mentioned earlier, TEXT.vocab.vectors holds the pretrained word vectors built from glove.6B.100d:

# the glove.6B.100d vectors stored on the TEXT object
TEXT.vocab.vectors.size()
#torch.Size([25002, 100])

TEXT.vocab.vectors

Now copy this tensor into the model's embedding layer, which helps the model converge faster:

# use glove.6B.100d as the pretrained embedding
pretrained_embedding=TEXT.vocab.vectors
model.embed.weight.data.copy_(pretrained_embedding)
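One optional refinement, not part of the original code but a common practice worth noting: the embedding rows for <unk> and <pad> can be zeroed so they start out uninformative (this assumes the default torchtext specials at indices 0 and 1):

# assumption: default specials, <unk> and <pad>, exist in the vocabulary
UNK_IDX = TEXT.vocab.stoi['<unk>']
PAD_IDX = TEXT.vocab.stoi['<pad>']
model.embed.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_SIZE)
model.embed.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_SIZE)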

Training

The loss is binary cross-entropy (BCEWithLogitsLoss; see the PyTorch notes for details) and the optimizer is Adam:

# train the model
LEARNING_RATE=0.001

optimizer=torch.optim.Adam(model.parameters(),lr=LEARNING_RATE)

#logits+BinaryCrossEntropy
loss_fn=nn.BCEWithLogitsLoss()

if USE_CUDA:
    model=model.cuda()

Define a function to compute the accuracy:

# compute prediction accuracy
def binary_accuracy(preds,y):
    # preds and y are both [batch_size]
    # round the sigmoid outputs to 0/1
    rounded_preds=torch.round(torch.sigmoid(preds))
    correct=(rounded_preds==y).float()
    return correct.sum()/len(correct)

Define a function that runs one epoch of training. As always, each update has five steps: forward pass, loss computation, backward pass for the gradients, parameter update, and zeroing the gradients:

# one epoch of training
def train(model,iterator,optimizer,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.train()
    
    for batch in iterator:
        # batch.text [seq_len,batch_size]
        preds=model.forward(batch.text).squeeze()
        
        # preds and batch.label are both [batch_size]
        loss=loss_fn(preds,batch.label)
        
        acc=binary_accuracy(preds,batch.label)
        
        loss.backward()
        optimizer.step()
        model.zero_grad()
        
        epoch_loss+=loss.item()*len(batch.label)
        epoch_acc+=acc.item()*len(batch.label)
        
        total_len+=len(batch.label)
        
    return epoch_loss/total_len,epoch_acc/total_len

Define a function to evaluate the model on the validation set. It is essentially the training loop with the data swapped for the validation set, no gradient computation and no parameter updates; note the switch between training and evaluation modes:

# evaluate for one epoch
def evaluate(model,iterator,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.eval()
    
    with torch.no_grad():
        for batch in iterator:
            # batch.text [seq_len,batch_size]
            preds=model.forward(batch.text).squeeze()
        
            # preds and batch.label are both [batch_size]
            loss=loss_fn(preds,batch.label)
        
            acc=binary_accuracy(preds,batch.label)

        
            epoch_loss+=loss.item()*len(batch.label)
            epoch_acc+=acc.item()*len(batch.label)
        
            total_len+=len(batch.label)
        
    model.train()
        
    return epoch_loss/total_len,epoch_acc/total_len

The training loop is as follows:

NUM_EPOCHS=10
best_valid_acc=0.

for epoch in range(NUM_EPOCHS):
    train_loss,train_acc=train(model,train_iterator,optimizer,loss_fn)
    valid_loss,valid_acc=evaluate(model,valid_iterator,loss_fn)
    
    if valid_acc>best_valid_acc:
        best_valid_acc=valid_acc
        torch.save(model.state_dict(),"wordaveraging.pth")
        
    print("Epoch",epoch,"TrainLoss",train_loss,"TrainAcc",train_acc)
    print("Epoch",epoch,"ValidLoss",valid_loss,"ValidAcc",valid_acc)

Predicting Sentiment with the Model

First tokenize the sentence with spaCy and load the saved model parameters; we also wrap the prediction procedure in a function:

# prediction

# tokenize the sentence with spaCy first
import spacy
nlp=spacy.load("en")

model.load_state_dict(torch.load("wordaveraging.pth"))

def predict_sentiment(model,sentence):
    tokenized=[tok.text for tok in nlp.tokenizer(sentence)]
    indexed=[TEXT.vocab.stoi[t] for t in tokenized]
    tensor=torch.tensor(indexed,dtype=torch.long) # [seq_len]
    tensor=tensor.unsqueeze(1) # add a dimension to get [seq_len,batch_size] with batch_size=1
    
    pred=torch.round(torch.sigmoid(model.forward(tensor)))
    
    # closer to 0 means "neg", closer to 1 means "pos"
    return pred.item()

Call the function to classify the sentiment of a sentence; outputs closer to 0 mean "neg" and closer to 1 mean "pos":

predict_sentiment(model,"this film is very horrible!")
predict_sentiment(model,"this film is very nice!")

fig10

RNN

A language model can be adapted for text classification with only small changes. Recall from the language-model lesson that the recurrent network computes
$h_{t}=\mathrm{RNN}(x_{t-1},h_{t-1})$
so one idea is to take the last hidden state vector and feed it into a fully connected network to obtain the classification (a minimal sketch of this idea follows).
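A minimal sketch of that idea with a plain single-layer nn.RNN (purely illustrative; the sizes are made up, and the model actually used below is a two-layer bidirectional LSTM):

import torch
import torch.nn as nn

# hypothetical sizes, just to show the shapes
rnn = nn.RNN(input_size=100, hidden_size=100)   # input expected as [seq_len, batch, input_size]
fc = nn.Linear(100, 1)

x = torch.randn(20, 64, 100)                    # an already-embedded batch: seq_len=20, batch=64
output, hidden = rnn(x)                         # hidden: [num_layers=1, batch, hidden_size]
logits = fc(hidden.squeeze(0))                  # [batch, 1], one score per sentence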

Model Definition

In practice a single-layer vanilla RNN classifies text poorly. To improve accuracy, the model uses a two-layer bidirectional LSTM instead: the two final hidden states of the second LSTM layer (one per direction) are concatenated into a single vector, which is then fed into a fully connected layer.
Dropout is applied to this concatenated vector; dropout randomly deactivates connections during training, which effectively reduces the number of parameters and mitigates overfitting.
The model is defined as follows:

# besides language modeling, an RNN can also be used for text classification
# model definition
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNModel(nn.Module):
    def __init__(self,vocab_size,embed_size,hidden_size,dropout,output_size):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,embed_size)
        
        # two-layer bidirectional LSTM
        self.lstm=nn.LSTM(embed_size,hidden_size,num_layers=2,bidirectional=True)
        
        self.linear=nn.Linear(2*hidden_size,output_size)
        
        # randomly drop connections during training to mitigate overfitting
        self.dropout=nn.Dropout(dropout)
    
    def forward(self,text):
        # text: [seq_len,batch_size]
        embeded=self.embed(text) #[seq_len,batch_size,embed_size]
        
        output,(hidden,cell)=self.lstm(embeded) #hidden [4,batch_size,hidden_size]
        """
        output has shape (seq_len, batch, num_directions * hidden_size)
        h_n has shape (num_layers * num_directions, batch, hidden_size)
        c_n has shape (num_layers * num_directions, batch, hidden_size)
        """
        # concatenate the two final hidden states of the top LSTM layer, one per direction
        # hidden[-1]:[batch_size,hidden_size]
        # hidden[-2]:[batch_size,hidden_size]
        hidden=torch.cat((hidden[-1],hidden[-2]),dim=1) # [batch_size,2*hidden_size]
        
        # dropout: randomly deactivate some of the connections into the Linear layer
        hidden=self.dropout(hidden) #[batch_size,2*hidden_size]
        
        return self.linear(hidden) #[batch_size,output_size:1]

Instantiate the model with its hyperparameters:

VOCAB_SIZE=len(TEXT.vocab)
EMBEDDING_SIZE=100
HIDDEN_SIZE=100
OUTPUT_SIZE=1
DROPOUT=0.5

model=RNNModel(VOCAB_SIZE,
              EMBEDDING_SIZE,
              HIDDEN_SIZE,
              DROPOUT,
              OUTPUT_SIZE)

Model Training and Prediction

Since this model has the same input and output shapes as WordAveraging, the rest of the code can be reused as is:

# same input/output shapes as WordAveraging, so the rest of the code can be reused

# use glove.6B.100d as the pretrained embedding
pretrained_embedding=TEXT.vocab.vectors
model.embed.weight.data.copy_(pretrained_embedding)

# train the model
LEARNING_RATE=0.001

optimizer=torch.optim.Adam(model.parameters(),lr=LEARNING_RATE)

#logits+BinaryCrossEntropy
loss_fn=nn.BCEWithLogitsLoss()

if USE_CUDA:
    model=model.cuda()
    
# compute prediction accuracy
def binary_accuracy(preds,y):
    # preds and y are both [batch_size]
    # round the sigmoid outputs to 0/1
    rounded_preds=torch.round(torch.sigmoid(preds))
    correct=(rounded_preds==y).float()
    return correct.sum()/len(correct)

# one epoch of training
def train(model,iterator,optimizer,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.train()  
    for i,batch in enumerate(iterator):
        # batch.text [seq_len,batch_size]
        preds=model.forward(batch.text).squeeze()        
        # preds and batch.label are both [batch_size]
        loss=loss_fn(preds,batch.label)     
        acc=binary_accuracy(preds,batch.label)        
        loss.backward()
        
        print("train:",i,"loss:",loss.item(),"acc:",acc.item())
        
        optimizer.step()
        model.zero_grad()        
        epoch_loss+=loss.item()*len(batch.label)
        epoch_acc+=acc.item()*len(batch.label)        
        total_len+=len(batch.label) 
    return epoch_loss/total_len,epoch_acc/total_len

# evaluate for one epoch
def evaluate(model,iterator,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.eval()
    with torch.no_grad():
        for i,batch in enumerate(iterator):  # enumerate so the print below can report the batch index
            # batch.text [seq_len,batch_size]
            preds=model.forward(batch.text).squeeze()
            # preds and batch.label are both [batch_size]
            loss=loss_fn(preds,batch.label)
            acc=binary_accuracy(preds,batch.label)
            
            print("valid:",i,"loss:",loss.item(),"acc:",acc.item())
            
            epoch_loss+=loss.item()*len(batch.label)
            epoch_acc+=acc.item()*len(batch.label)     
            total_len+=len(batch.label)   
    model.train()
    return epoch_loss/total_len,epoch_acc/total_len

NUM_EPOCHS=10
best_valid_acc=0.

for epoch in range(NUM_EPOCHS):
    train_loss,train_acc=train(model,train_iterator,optimizer,loss_fn)
    valid_loss,valid_acc=evaluate(model,valid_iterator,loss_fn)
    
    if valid_acc>best_valid_acc:
        best_valid_acc=valid_acc
        torch.save(model.state_dict(),"rnnmodel.pth")
        
    print("Epoch",epoch,"TrainLoss",train_loss,"TrainAcc",train_acc)
    print("Epoch",epoch,"ValidLoss",valid_loss,"ValidAcc",valid_acc)
    
# prediction
# tokenize the sentence with spaCy first
import spacy
nlp=spacy.load("en")

model.load_state_dict(torch.load("rnnmodel.pth"))

def predict_sentiment(model,sentence):
    tokenized=[tok.text for tok in nlp.tokenizer(sentence)]
    indexed=[TEXT.vocab.stoi[t] for t in tokenized]
    tensor=torch.tensor(indexed,dtype=torch.long) # [seq_len]
    tensor=tensor.unsqueeze(1) # add a dimension to get [seq_len,batch_size] with batch_size=1
    pred=torch.round(torch.sigmoid(model.forward(tensor)))
    # closer to 0 means "neg", closer to 1 means "pos"
    return pred.item()

print(predict_sentiment(model,"this film is very nice!"))

CNN

Using a convolutional network for text classification is, to some extent, similar to an n-gram model: the CNN's filters are good at extracting local features. The main pipeline of text convolution is:
fig11
Experiments show that both the CNN and WordAveraging run faster than the LSTM.
Convolving over the word vectors uses Conv2d, and the final step of turning each tensor into a vector uses MaxPool1d, so we first review the relevant details of these two modules:


Conv2d

parameters:
in_channels (int) – number of channels in the input tensor
out_channels (int) – number of channels in the output tensor
kernel_size (int or tuple) – size of the convolution kernel
stride (int or tuple, optional) – stride of the convolution. Default: 1
padding (int or tuple, optional) – zero-padding added to both sides of the input. Default: 0

shape:
Input:[N,C_in,H_in,W_in]
Output:[N,C_out,H_out,W_out]

where $H$ and $W$ satisfy
$H_{out}=\frac{H_{in}+2 \cdot padding[0]-kernel[0]}{stride[0]}+1$
$W_{out}=\frac{W_{in}+2 \cdot padding[1]-kernel[1]}{stride[1]}+1$
A convolutional layer contains multiple filters; each filter has as many kernels as the input tensor has channels, the number of output channels equals the number of filters, and the kernels within one filter are all different.
Intuitively, different filters extract different local features, while the different kernels inside a filter gather local information from the different input channels; their results are summed to give that filter's local-feature response.
The convolution process looks like this (a numeric shape check follows below):
fig12
A kernel_size-sized region of the input tensor (red) is convolved into one vector of the output tensor (blue); it is easy to see that this convolutional layer has 5 filters and each filter has 3 kernels.
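A quick numeric check of those formulas (the sizes are arbitrary, chosen to match the figure's 5 filters with 3 kernels each):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=(2, 4), stride=1, padding=0)
x = torch.randn(8, 3, 10, 6)   # [N, C_in, H_in, W_in]
y = conv(x)
# H_out = (10 + 2*0 - 2)/1 + 1 = 9,  W_out = (6 + 2*0 - 4)/1 + 1 = 3
print(y.shape)                 # torch.Size([8, 5, 9, 3])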
MaxPool1d

kernel_size (int)
stride (int)
padding (int) – padding is added at both edges of the last dimension

Input [N,C,L_in]
Output [N,C,L_out]

Pooling operates on the dimensions after the channel dimension, so it does not change the number of channels. max_pool1d operates on tensors of shape $[N,C,L_{in}]$, and the length of the last dimension becomes
$L_{out}=\frac{L_{in}+2 \cdot padding-kernel}{stride}+1$
As an extra note on MaxPool2d:
MaxPool2d is very similar to Conv2d except that it does not change the number of channels, again because pooling acts only on the dimensions after the channel dimension.
$H$ and $W$ satisfy the same formulas:
$H_{out}=\frac{H_{in}+2 \cdot padding[0]-kernel[0]}{stride[0]}+1$
$W_{out}=\frac{W_{in}+2 \cdot padding[1]-kernel[1]}{stride[1]}+1$
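And the corresponding check for MaxPool1d (again with arbitrary sizes):

import torch
import torch.nn as nn

pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
x = torch.randn(8, 5, 10)      # [N, C, L_in]
y = pool(x)
# L_out = (10 + 2*1 - 3)//2 + 1 = 5, and the channel count stays at 5
print(y.shape)                 # torch.Size([8, 5, 5])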


Model Definition

The model is defined as follows:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNModel(nn.Module):
    def __init__(self,vocab_size,embed_size,num_filters,filter_size,output_size):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,embed_size)
        
        self.conv=nn.Conv2d(in_channels=1,out_channels=num_filters,
                           kernel_size=(filter_size,embed_size))
        
        self.linear=nn.Linear(num_filters,output_size)
        
    def forward(self,text):
        # text [seq_len,batch_size]
        text=text.permute(1,0) # [batch_size,seq_len]
        
        embeded=self.embed(text) # [batch_size,seq_len,embedding_size]
        # add a channel dimension so the tensor can be convolved
        embeded=embeded.unsqueeze(1) # [batch_size,C_in:1,seq_len,embedding_size]
        
        conved=F.relu(self.conv(embeded)) #[batch_size,num_filters,seq_len-filter_size+1,1]
        
        conved=conved.squeeze(dim=3) #[batch_size,num_filters,seq_len-filter_size+1]
        
        # Maxpooling1D  
        pooled=F.max_pool1d(conved,conved.shape[2]) #[batch_size,num_filters,1]
        
        pooled=pooled.squeeze(dim=2) #[batch_size,num_filters]
        
        return self.linear(pooled) #[batch_size,output_size:1]

Instantiate the model with its hyperparameters:

VOCAB_SIZE=len(TEXT.vocab)
EMBEDDING_SIZE=100
NUM_FILTERS=100
FILTER_SIZE=3
OUTPUT_SIZE=1

USE_CUDA=torch.cuda.is_available()

model=CNNModel(VOCAB_SIZE,
              EMBEDDING_SIZE,
              NUM_FILTERS,
              FILTER_SIZE,
              OUTPUT_SIZE)

Model Training and Prediction

The input and output shapes match the earlier models, so the previous code can be reused directly:

# same input/output shapes, so the earlier code can be reused

# use glove.6B.100d as the pretrained embedding
pretrained_embedding=TEXT.vocab.vectors
model.embed.weight.data.copy_(pretrained_embedding)

# train the model
LEARNING_RATE=0.001

optimizer=torch.optim.Adam(model.parameters(),lr=LEARNING_RATE)

#logits+BinaryCrossEntropy
loss_fn=nn.BCEWithLogitsLoss()

if USE_CUDA:
    model=model.cuda()
    
# compute prediction accuracy
def binary_accuracy(preds,y):
    # preds and y are both [batch_size]
    # round the sigmoid outputs to 0/1
    rounded_preds=torch.round(torch.sigmoid(preds))
    correct=(rounded_preds==y).float()
    return correct.sum()/len(correct)

# one epoch of training
def train(model,iterator,optimizer,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.train()  
    for i,batch in enumerate(iterator):
        # batch.text [seq_len,batch_size]
        preds=model.forward(batch.text).squeeze()        
        # preds and batch.label are both [batch_size]
        loss=loss_fn(preds,batch.label)     
        acc=binary_accuracy(preds,batch.label)        
        loss.backward()
        optimizer.step()
        model.zero_grad()        
        epoch_loss+=loss.item()*len(batch.label)
        epoch_acc+=acc.item()*len(batch.label)        
        total_len+=len(batch.label) 
    return epoch_loss/total_len,epoch_acc/total_len

# evaluate for one epoch
def evaluate(model,iterator,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            # batch.text [seq_len,batch_size]
            preds=model.forward(batch.text).squeeze()
            # preds and batch.label are both [batch_size]
            loss=loss_fn(preds,batch.label)
            acc=binary_accuracy(preds,batch.label)
            epoch_loss+=loss.item()*len(batch.label)
            epoch_acc+=acc.item()*len(batch.label)     
            total_len+=len(batch.label)   
    model.train()
    return epoch_loss/total_len,epoch_acc/total_len

NUM_EPOCHS=10
best_valid_acc=0.

for epoch in range(NUM_EPOCHS):
    train_loss,train_acc=train(model,train_iterator,optimizer,loss_fn)
    valid_loss,valid_acc=evaluate(model,valid_iterator,loss_fn)
    
    if valid_acc>best_valid_acc:
        best_valid_acc=valid_acc
        torch.save(model.state_dict(),"cnnmodel.pth")
        
    print("Epoch",epoch,"TrainLoss",train_loss,"TrainAcc",train_acc)
    print("Epoch",epoch,"ValidLoss",valid_loss,"ValidAcc",valid_acc)

The training log is as follows:

Epoch 0 TrainLoss 0.4693069986002786 TrainAcc 0.7857142857551574
Epoch 0 ValidLoss 0.33368088204860685 ValidAcc 0.8596000000317892
Epoch 1 TrainLoss 0.2804485825266157 TrainAcc 0.8878285713876997
Epoch 1 ValidLoss 0.29516944955984753 ValidAcc 0.8752000000317891
Epoch 2 TrainLoss 0.19896689712660653 TrainAcc 0.9249142857006618
Epoch 2 ValidLoss 0.27966635626157127 ValidAcc 0.8842666666984558
Epoch 3 TrainLoss 0.12819375698396138 TrainAcc 0.959142857170105
Epoch 3 ValidLoss 0.27108019375801085 ValidAcc 0.8925333333651225
Epoch 4 TrainLoss 0.07056697996514184 TrainAcc 0.9833142857415336
Epoch 4 ValidLoss 0.2870686835686366 ValidAcc 0.8922666666984558
Epoch 5 TrainLoss 0.03335566938860076 TrainAcc 0.9956
Epoch 5 ValidLoss 0.30490205529530845 ValidAcc 0.8917333333651225
Epoch 6 TrainLoss 0.015325350893821034 TrainAcc 0.9990857142857142
Epoch 6 ValidLoss 0.3244790541966756 ValidAcc 0.8921333333651225
Epoch 7 TrainLoss 0.0075449678119804174 TrainAcc 0.9998857142857143
Epoch 7 ValidLoss 0.3434333708842595 ValidAcc 0.8928000000317892
Epoch 8 TrainLoss 0.004382576360766377 TrainAcc 1.0
Epoch 8 ValidLoss 0.36203826684951784 ValidAcc 0.8912000000317891
Epoch 9 TrainLoss 0.0028507052502461843 TrainAcc 1.0
Epoch 9 ValidLoss 0.3737186617692312 ValidAcc 0.8922666666984558

Likewise, load the saved parameters and classify a sentence:

# prediction
# tokenize the sentence with spaCy first
import spacy
nlp=spacy.load("en")

VOCAB_SIZE=len(TEXT.vocab)
EMBEDDING_SIZE=100
NUM_FILTERS=100
FILTER_SIZE=3
OUTPUT_SIZE=1

USE_CUDA=torch.cuda.is_available()

model=CNNModel(VOCAB_SIZE,
              EMBEDDING_SIZE,
              NUM_FILTERS,
              FILTER_SIZE,
              OUTPUT_SIZE)

model.load_state_dict(torch.load("cnnmodel.pth"))

def predict_sentiment(model,sentence):
    tokenized=[tok.text for tok in nlp.tokenizer(sentence)]
    indexed=[TEXT.vocab.stoi[t] for t in tokenized]
    tensor=torch.tensor(indexed,dtype=torch.long) # [seq_len]
    tensor=tensor.unsqueeze(1) # add a dimension to get [seq_len,batch_size] with batch_size=1
    pred=torch.round(torch.sigmoid(model.forward(tensor)))
    # closer to 0 means "neg", closer to 1 means "pos"
    return pred.item()

print(predict_sentiment(model,"this film is very nice!"))

fig13

Multi-Scale CNN

In the previous model the 2-D tensor of each sentence (with a single channel) was convolved with 100 filters, but every filter had the same kernel_size. A single scale like this cannot capture information at different granularities, so the improved model runs convolutions at several scales in parallel over the same tensor, implemented with nn.ModuleList.
We set up 3 parallel convolutional layers, each with 100 filters, but with kernel sizes of 3, 4 and 5 respectively:
fig14
nn.ModuleList gives an iterable object, so the 3 convolutional layers can later be applied with a for loop:
fig15

Model Definition

Let us first walk through the process.

  • 1. The input batch (a batch of sentences) has shape:
    [seq_len,batch_size]
  • 2. After reordering the dimensions with tensor.permute():
    [batch_size,seq_len]
  • 3. After looking up the word vectors in the embed layer, the tensor becomes:
    [batch_size,seq_len,embedding_size]
  • 4. To match PyTorch's Conv2d convention, a channel dimension is added, giving:
    [batch_size,C_in:1,seq_len,embedding_size]
  • 5. This tensor is passed through 3 different convolutional layers in parallel, configured as:
        nn.Conv2d(
            in_channels=1,
            out_channels=num_filters,
            kernel_size=(fs,embed_size)) for fs in filter_sizes
        )

num_filters is 100 and filter_sizes is [3,4,5], so each convolutional layer outputs a tensor of shape:
[batch_size,num_filters,seq_len-filter_size+1,1]
with filter_size equal to 3, 4 and 5 respectively.

  • 6. Removing the redundant dimension from these 3 tensors gives, for each one:
    [batch_size,num_filters,seq_len-filter_size+1]
  • 7. Each tensor goes through max_pool1d with kernel_size equal to its own seq_len-filter_size+1, producing 3 new tensors, each of shape:
    [batch_size,num_filters,1]
  • 8. After removing the redundant dimension:
    [batch_size,num_filters]
  • 9. The 3 [batch_size,num_filters] tensors form a list and are concatenated along the num_filters axis with cat, giving:
    [batch_size,3*num_filters]
  • 10. This vector can now be classified by a fully connected layer.
    The model is defined as follows:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNsModel(nn.Module):
    def __init__(self,vocab_size,embed_size,num_filters,filter_sizes,output_size):
        super().__init__()
        self.embed=nn.Embedding(vocab_size,embed_size)
        
        """
        define CNNs with different filter_sizes and connect them in parallel
        3 different filter sizes are used
        """
        self.convs=nn.ModuleList(
        nn.Conv2d(
            in_channels=1,
            out_channels=num_filters,
            kernel_size=(fs,embed_size)) for fs in filter_sizes
        )
            
        self.linear=nn.Linear(num_filters*len(filter_sizes),output_size)
        
    def forward(self,text):
        # text [seq_len,batch_size]
        text=text.permute(1,0) # [batch_size,seq_len]
        
        embeded=self.embed(text) # [batch_size,seq_len,embedding_size]
        # add a channel dimension so the tensor can be convolved
        embeded=embeded.unsqueeze(1) # [batch_size,C_in:1,seq_len,embedding_size]
        
        """
        each element of the list goes through the following operations
        conved=F.relu(self.conv(embeded)) #[batch_size,num_filters,seq_len-filter_size+1,1]
        conved=conved.squeeze(dim=3) #[batch_size,num_filters,seq_len-filter_size+1]
        """
        
        conved=[F.relu(conv(embeded)).squeeze(dim=3) for conv in self.convs]
        # gives a list of 3 tensors of shape [batch_size,num_filters,seq_len-filter_size+1]
        
        """
        each element of the list goes through the following operations
        pooled=F.max_pool1d(conved,conved.shape[2]) #[batch_size,num_filters,1]
        pooled=pooled.squeeze(dim=2) #[batch_size,num_filters]
        """
        # apply max_pool1d to each of the 3 tensors
        pooled=[F.max_pool1d(conv,conv.shape[2]).squeeze(dim=2) for conv in conved]
        # gives a list of 3 tensors of shape [batch_size,num_filters]
        
        # concatenate
        pooled=torch.cat(pooled,dim=1) #[batch_size,3*num_filters]
        
        return self.linear(pooled) #[batch_size,output_size:1]

Instantiate the model:

VOCAB_SIZE=len(TEXT.vocab)
EMBEDDING_SIZE=100
NUM_FILTERS=100
FILTER_SIZES=[3,4,5]
OUTPUT_SIZE=1

USE_CUDA=torch.cuda.is_available()

model=CNNsModel(VOCAB_SIZE,
              EMBEDDING_SIZE,
              NUM_FILTERS,
              FILTER_SIZES,
              OUTPUT_SIZE)

Model Training and Prediction

The input and output shapes are the same, so the earlier code can be reused directly:

# same input/output shapes, so the earlier code can be reused

# use glove.6B.100d as the pretrained embedding
pretrained_embedding=TEXT.vocab.vectors
model.embed.weight.data.copy_(pretrained_embedding)

# train the model
LEARNING_RATE=0.001

optimizer=torch.optim.Adam(model.parameters(),lr=LEARNING_RATE)

#logits+BinaryCrossEntropy
loss_fn=nn.BCEWithLogitsLoss()

if USE_CUDA:
    model=model.cuda()
    
# compute prediction accuracy
def binary_accuracy(preds,y):
    # preds and y are both [batch_size]
    # round the sigmoid outputs to 0/1
    rounded_preds=torch.round(torch.sigmoid(preds))
    correct=(rounded_preds==y).float()
    return correct.sum()/len(correct)

# one epoch of training
def train(model,iterator,optimizer,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.train()  
    for i,batch in enumerate(iterator):
        # batch.text [seq_len,batch_size]
        preds=model.forward(batch.text).squeeze()        
        # preds and batch.label are both [batch_size]
        loss=loss_fn(preds,batch.label)     
        acc=binary_accuracy(preds,batch.label)        
        loss.backward()
        optimizer.step()
        model.zero_grad()        
        epoch_loss+=loss.item()*len(batch.label)
        epoch_acc+=acc.item()*len(batch.label)        
        total_len+=len(batch.label) 
    return epoch_loss/total_len,epoch_acc/total_len

# evaluate for one epoch
def evaluate(model,iterator,loss_fn):
    epoch_loss,epoch_acc=0.,0.
    total_len=0.
    model.eval()
    with torch.no_grad():
        for batch in iterator:
            # batch.text [seq_len,batch_size]
            preds=model.forward(batch.text).squeeze()
            # preds and batch.label are both [batch_size]
            loss=loss_fn(preds,batch.label)
            acc=binary_accuracy(preds,batch.label)
            epoch_loss+=loss.item()*len(batch.label)
            epoch_acc+=acc.item()*len(batch.label)     
            total_len+=len(batch.label)   
    model.train()
    return epoch_loss/total_len,epoch_acc/total_len

NUM_EPOCHS=10
best_valid_acc=0.

for epoch in range(NUM_EPOCHS):
    train_loss,train_acc=train(model,train_iterator,optimizer,loss_fn)
    valid_loss,valid_acc=evaluate(model,valid_iterator,loss_fn)
    
    if valid_acc>best_valid_acc:
        best_valid_acc=valid_acc
        torch.save(model.state_dict(),"multiscalecnnmodel.pth")
        
    print("Epoch",epoch,"TrainLoss",train_loss,"TrainAcc",train_acc)
    print("Epoch",epoch,"ValidLoss",valid_loss,"ValidAcc",valid_acc)

The training log is as follows:

Epoch 0 TrainLoss 0.4474612478664943 TrainAcc 0.7874857142720904
Epoch 0 ValidLoss 0.30799763492743176 ValidAcc 0.8752
Epoch 1 TrainLoss 0.2471248833332743 TrainAcc 0.8999428571428572
Epoch 1 ValidLoss 0.2623157759666443 ValidAcc 0.8953333333651224
Epoch 2 TrainLoss 0.15377467893872943 TrainAcc 0.9449714286123003
Epoch 2 ValidLoss 0.24951276066303252 ValidAcc 0.9005333333651224
Epoch 3 TrainLoss 0.07325759186914989 TrainAcc 0.9808571428162711
Epoch 3 ValidLoss 0.2766515511910121 ValidAcc 0.8937333333651225
Epoch 4 TrainLoss 0.028904751653330667 TrainAcc 0.9963428571428572
Epoch 4 ValidLoss 0.2817218486944834 ValidAcc 0.9004000000317891
Epoch 5 TrainLoss 0.010430116872170141 TrainAcc 0.9997142857142857
Epoch 5 ValidLoss 0.3035568665663401 ValidAcc 0.9002666666984558
Epoch 6 TrainLoss 0.004805163683529411 TrainAcc 0.9999428571428571
Epoch 6 ValidLoss 0.3214325322945913 ValidAcc 0.8994666666984558
Epoch 7 TrainLoss 0.002724978632958872 TrainAcc 1.0
Epoch 7 ValidLoss 0.3378067107359568 ValidAcc 0.8984000000317891
Epoch 8 TrainLoss 0.0018055682155702795 TrainAcc 1.0
Epoch 8 ValidLoss 0.3515970369974772 ValidAcc 0.8981333333651225
Epoch 9 TrainLoss 0.0012723242744271245 TrainAcc 1.0
Epoch 9 ValidLoss 0.36390119013786315 ValidAcc 0.8982666666984558

Load the model and classify a sentence:

# prediction
# tokenize the sentence with spaCy first
import spacy
nlp=spacy.load("en")

VOCAB_SIZE=len(TEXT.vocab)
EMBEDDING_SIZE=100
NUM_FILTERS=100
FILTER_SIZES=[3,4,5]
OUTPUT_SIZE=1

USE_CUDA=torch.cuda.is_available()

model=CNNsModel(VOCAB_SIZE,
              EMBEDDING_SIZE,
              NUM_FILTERS,
              FILTER_SIZES,
              OUTPUT_SIZE)

model.load_state_dict(torch.load("multiscalecnnmodel.pth"))

def predict_sentiment(model,sentence):
    tokenized=[tok.text for tok in nlp.tokenizer(sentence)]
    indexed=[TEXT.vocab.stoi[t] for t in tokenized]
    tensor=torch.tensor(indexed,dtype=torch.long) # [seq_len]
    tensor=tensor.unsqueeze(1) # add a dimension to get [seq_len,batch_size] with batch_size=1
    pred=torch.round(torch.sigmoid(model.forward(tensor)))
    # closer to 0 means "neg", closer to 1 means "pos"
    return pred.item()

print(predict_sentiment(model,"this film is very nice!"))

fig16
