[NLP Experiment] Surname Classification Task: Feedforward Neural Networks

Author: 煮酒与君饮 | 2024-07-12 07:15:02


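NLP Experiment Series: Table of Contents

Chapter 1: Surname Classification Task with Feedforward Neural Networks
 Chapter 2: Machine Translation with Recurrent Neural Networks
 Chapter 3: Machine Translation with Transformer

(This post documents NLP experiments by a beginner. The content is simple and mainly useful as a reference for other beginners; corrections from experts are welcome.)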

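I. Goals

Through "Example: Surname Classification with a Multilayer Perceptron", learn how multilayer perceptrons are applied to multi-class classification
Understand how each type of neural network layer affects the size and shape of the data tensor it computes
Try the SurnameClassifier model with dropout and see how it changes the results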

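II. Technical Background

2.1 Multilayer Perceptron (MLP)

2.1.1 Basic structure of the MLP

The multilayer perceptron (MLP) is considered one of the most basic building blocks of neural networks. It stacks several layers with a nonlinearity between each pair of layers, giving vectorized input and output with nonlinear computation in between. Its components are as follows:
[Figure: structure of the MLP]
1. Input layer
The input is a fixed-length vector whose length is determined by the dimensionality of the input data. It is usually a numeric vector; text or images can be converted to numeric vectors through preprocessing.
2. Hidden layers
The hidden layers perform the nonlinear processing that lets the model learn complex feature patterns. A hidden layer takes the vector produced by a linear transformation of the previous layer's output (or of the original input, for the first hidden layer) and applies a nonlinear activation function to it. There can be several hidden layers. The width of each layer is a design choice; for small tasks, 32, 64, or 128 units are common starting points. (Powers of two are a popular convention because they tend to map well onto parallel hardware.)
3. Output layer
The output is a fixed-length vector or a single value, depending on the task: for multi-class classification its size is the number of classes; for regression it can be a single scalar.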

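2.1.2 Forward pass of the MLP

To implement the forward pass of the MLP, a few more components are needed:
1. Activation function:
The activation function sits in the hidden layers and applies a nonlinear transformation to its input. Common choices are sigmoid, tanh, ReLU, and Leaky ReLU. An activation function is necessary: without it, two stacked linear layers are mathematically equivalent to a single linear layer, so the model cannot learn complex feature patterns.

2. softmax:
softmax is widely used, especially in multi-class classification. It maps a vector of real-valued inputs to values between 0 and 1 that sum to 1. Because the class probabilities in a multi-class problem also sum to 1, softmax can be applied to the vector produced by the final layer to obtain a probability for each class. In natural language processing tasks with a very large vocabulary, however, computing the full softmax is expensive, and cheaper alternatives such as hierarchical softmax are used instead.

[Figure: softmax computation diagram]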
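To make the softmax description concrete, here is a minimal sketch in plain Python (no PyTorch). The function name `softmax` and the sample logits are illustrative assumptions, not code from the original experiment. Subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math

def softmax(logits):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    m = max(logits)                             # shift by the max for numerical stability
    exps = [math.exp(z - m) for z in logits]    # all exponents are <= 0, so no overflow
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-class problem
probs = softmax([2.0, 1.0, 0.1])
print(probs)       # larger logits receive larger probabilities
print(sum(probs))  # approximately 1.0
```

In PyTorch the same operation is `F.softmax(output, dim=1)`, which is what the MLP code in this section calls when `apply_softmax=True`.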

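3. MLP forward pass:
Below is an example forward-pass implementation in PyTorch. This MLP uses one hidden layer, the ReLU activation function, and an optional softmax.

The code is as follows (example):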

import torch.nn as nn
import torch.nn.functional as F

class MultilayerPerceptron(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(MultilayerPerceptron, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)  # two linear layers in total

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the MLP
        Args:
            x_in (torch.Tensor): an input data tensor.
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate = F.relu(self.fc1(x_in))  # first linear layer followed by the ReLU activation
        output = self.fc2(intermediate)

        if apply_softmax:
            output = F.softmax(output, dim=1)  # optionally turn the logits into class probabilities
        return output

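In summary, an MLP maps vectors from one linear layer to the next and places a nonlinearity (activation function) between each pair of linear layers to break the linear relationship. This lets the model "warp" the vector space and solve problems that are not linearly separable.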
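The claim that stacked linear layers collapse into one can be checked with a small numeric sketch. It uses plain Python lists instead of PyTorch, omits biases for brevity, and the matrices `W1`, `W2` and the input `x` are made-up values chosen only for illustration.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, vi) for vi in v]

# Hypothetical weights of two linear layers (no bias) and one input vector
W1 = [[1.0, -2.0], [0.5, 1.0]]
W2 = [[2.0, 1.0], [-1.0, 3.0]]
x = [1.0, 2.0]

stacked = matvec(W2, matvec(W1, x))           # two linear layers in a row
collapsed = matvec(matmul(W2, W1), x)         # one layer with weights W2 @ W1
with_relu = matvec(W2, relu(matvec(W1, x)))   # ReLU between the layers

print(stacked, collapsed)  # identical: the two layers act as one
print(with_relu)           # differs: the nonlinearity breaks the equivalence
```

The first two results agree because matrix multiplication is associative; only the version with ReLU in between computes something a single linear layer cannot.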

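2.1.3 Useful PyTorch features

1. Inspecting a model with PyTorch:
PyTorch can print the layers of a model from its definition. In practice this interactivity is a quick way to check the model's structure.

The code is as follows (example):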

batch_size = 2 # number of samples input at once
input_dim = 3
hidden_dim = 100
output_dim = 4

# Initialize model
mlp = MultilayerPerceptron(input_dim, hidden_dim, output_dim)
print(mlp)  
# Printing the model lists each layer and its parameters, a quick sanity check made possible by PyTorch's interactivity

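[Figure: printed structure of the MLP]
2. Inspecting tensors with PyTorch:
With a small helper function we can display a tensor's type, shape, and values, which makes it easy to check the data and the model at each step.

The code is as follows (example):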

import torch
def describe(x):
    print("Type: {}".format(x.type()))
    print("Shape/size: {}".format(x.shape))
    print("Values: \n{}".format(x))

x_input = torch.rand(batch_size, input_dim)
describe(x_input)  # show the type, shape, and values of the tensor

Details of the model's input tensor:

[Figure: describe() output for the input tensor]

The code is as follows (example):

y_output = mlp(x_input, apply_softmax=False)  # run the input batch through the MLP
describe(y_output)

Details of the model's output tensor:

[Figure: describe() output for the output tensor]

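2.2 Convolutional Neural Network (CNN)

Some of the ideas behind CNNs go back to Kunihiko Fukushima's work in 1980; convolution and pooling, for example, were inspired by experiments on the cat visual system. LeNet-5, proposed by LeCun in 1998, later brought convolutional networks to wider attention. The following sections outline the main idea of CNNs and what they are used for.

2.2.1 The idea behind CNNs:

CNNs build on the mathematical convolution operation. In classical image filtering or denoising, a fixed, hand-designed kernel is used. When large amounts of data are available, learning the kernel parameters from the data is usually the better approach.

How a CNN computes:

[Figure: convolution of an input matrix with a kernel]
Convolving the input matrix with a kernel yields an output matrix, the feature map. The kernel size, the positions at which the kernel is applied to the input (in the figure above, every position), and the size of the output matrix are all controlled by hyperparameters, which are described in more detail below.

CNNs are therefore well suited to detecting spatial substructure (and, as a result, to creating meaningful spatial substructure). They do this by scanning the input tensor with a small number of weights (the kernels), which are optimized during training to extract the features the task needs. This scanning can shrink the output, keeps the most important features of the data, and needs far fewer parameters than a fully connected layer over the same input.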

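2.2.2 Basic structure of a CNN:

Basic CNN structure
1. Convolutional layer:
The convolutional layer is the core of a CNN; its job is to extract local features from the input. It contains a set of learnable filters (also called kernels). Each filter slides over the input (the convolution operation), adds a bias term, and produces one output feature map. In this way the network learns features ranging from low-level edges and textures to higher-level patterns. Parameter sharing and local connectivity keep the number of parameters small and help generalization.
2. Pooling layer:
The most common forms are max pooling and average pooling. Pooling reduces the spatial dimensions (height and width) of the feature map and the amount of computation, while adding some robustness to small shifts and scale changes. Max pooling, for example, outputs the largest value in each local region of the feature map, which keeps the strongest responses and can reduce the risk of overfitting.
3. Fully connected layer:
Fully connected layers usually sit at the end of a CNN, after a series of convolutional and pooling layers. They combine the extracted features to produce the final classification or regression prediction. Every neuron is connected to all neurons in the previous layer: the feature maps are flattened into a vector, multiplied by a weight matrix, a bias is added, and an activation function (such as ReLU) introduces nonlinearity before the final prediction is output. (This part can be viewed as an MLP placed at the end of the network.)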
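The pooling step described above can be sketched for the one-dimensional case in a few lines of plain Python. The helper names `max_pool1d` and `avg_pool1d` and the sample feature map are illustrative assumptions; in PyTorch the corresponding layers are `nn.MaxPool1d` and `nn.AvgPool1d`.

```python
def max_pool1d(values, kernel_size, stride=None):
    """Keep the largest value in each window; stride defaults to kernel_size."""
    if stride is None:
        stride = kernel_size
    return [max(values[i:i + kernel_size])
            for i in range(0, len(values) - kernel_size + 1, stride)]

def avg_pool1d(values, kernel_size, stride=None):
    """Average the values in each window; stride defaults to kernel_size."""
    if stride is None:
        stride = kernel_size
    return [sum(values[i:i + kernel_size]) / kernel_size
            for i in range(0, len(values) - kernel_size + 1, stride)]

# Hypothetical feature map of length 6
feature_map = [1, 3, 2, 5, 4, 0]
pooled_max = max_pool1d(feature_map, kernel_size=2)
pooled_avg = avg_pool1d(feature_map, kernel_size=2)
print(pooled_max)  # [3, 5, 4]
print(pooled_avg)  # [2.0, 3.5, 2.0]
```

Both variants halve the length of the feature map here; max pooling keeps the strongest response in each window, while average pooling smooths it.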

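2.2.3 Forward pass of a CNN:

Before implementing the forward pass, we need to look at the hyperparameters of a convolution:
1. kernel_size: the spatial size of the kernel. For 2D convolutions it is given as a pair, e.g. (3, 3) for a 3x3 kernel (PyTorch orders the pair as (height, width)). It determines how large a local region of the input the kernel "sees". A larger kernel_size captures features over a wider range at a higher computational cost; a smaller one focuses on fine detail and is cheaper.
2. stride: how far the kernel moves at each step. A larger stride shrinks the output feature map and lowers the cost but may lose spatial information; a smaller stride keeps more detail at a higher cost.
3. channels: the depth of the input, that is, how many feature values each position carries (for example, 3 for an RGB image). In NLP the channel dimension usually corresponds to the size of each token's vector (the one-hot or embedding dimension), while the sequence length is the dimension the kernel slides along. The kernel must cover every input channel.
4. padding: extra values (usually zeros, sometimes a mirror of the border) added around the input to keep or change the size of the output feature map. Common strategies are "same" padding, where the output size depends only on the input size and the stride, and "valid" padding (no padding), where it also depends on the kernel size.
[Figure: convolution with padding]

5. dilation: the spacing between kernel elements, i.e. the step between the input positions that neighboring kernel elements touch. Increasing dilation enlarges the receptive field without increasing kernel_size, which helps capture more global features and is useful for longer sequences.
[Figure: dilated convolution]

6. filters: the number of output feature maps, i.e. the number of new channels the convolutional layer produces. It sets how many kinds of features the layer can learn and directly affects model capacity. More filters can learn richer features but add parameters and computation.

The kernel's number of input channels is determined by the channels of the input. For example, if the input has three channels, the kernel has three input channels and convolves all three at once.

The number of output channels is determined by the number of kernels. For example, to get six output channels, six different kernels are applied to the input independently, each producing one output channel.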
$height_{out} = \lfloor (height_{in} - height_{kernel} + 2 \cdot padding) / stride \rfloor + 1$
$width_{out} = \lfloor (width_{in} - width_{kernel} + 2 \cdot padding) / stride \rfloor + 1$
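The output-size formula can be turned into a small helper for the one-dimensional case. This is a sketch in plain Python: the function name `conv1d_output_length` is an assumption, and the dilation term follows the output-length formula documented for PyTorch's `nn.Conv1d`. The input length of 17 in the second example is a hypothetical value, not taken from the dataset.

```python
def conv1d_output_length(length_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1D convolution (floor division, as in the formula above)."""
    effective_kernel = dilation * (kernel_size - 1) + 1  # span covered by a dilated kernel
    return (length_in + 2 * padding - effective_kernel) // stride + 1

# Three stacked convolutions with kernel_size=3 on a sequence of length 7
lengths = [7]
for _ in range(3):
    lengths.append(conv1d_output_length(lengths[-1], kernel_size=3))
print(lengths)  # [7, 5, 3, 1]

# A stack with strides (1, 2, 2, 1) and kernel_size=3,
# applied to a hypothetical input length of 17
stack = [17]
for stride in (1, 2, 2, 1):
    stack.append(conv1d_output_length(stack[-1], kernel_size=3, stride=stride))
print(stack)  # [17, 15, 7, 3, 1]

print(conv1d_output_length(7, kernel_size=3, padding=1))   # 7: padding preserves the length
print(conv1d_output_length(7, kernel_size=3, dilation=2))  # 3: the kernel spans 5 positions
```

Tracing the lengths this way shows why a final length of 1 matters: only then can the length dimension be squeezed away before a fully connected layer.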

Tensor size changes in convolution

batch_size = 2
one_hot_size = 10
sequence_width = 7
data = torch.randn(batch_size, one_hot_size, sequence_width)
conv1 = nn.Conv1d(in_channels=one_hot_size, out_channels=16,
               kernel_size=3)  # length after convolution: 7-3+1=5
intermediate1 = conv1(data)
print(data.size())
print(intermediate1.size())

[Figure: printed sizes of data and intermediate1]

conv2 = nn.Conv1d(in_channels=16, out_channels=32, kernel_size=3)
# length after convolution: 5-3+1=3 (channels: 16 -> 32)
conv3 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3)
# length after convolution: 3-3+1=1 (channels: 32 -> 64)

intermediate2 = conv2(intermediate1)
intermediate3 = conv3(intermediate2)

print(intermediate2.size())
print(intermediate3.size())

[Figure: printed sizes of intermediate2 and intermediate3]

Forward pass

class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vector
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size to use throughout network
        """
        super(SurnameClassifier, self).__init__()
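        # Unlike the MLP, the feature extractor is built from convolution kernels rather than linear layers
        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels, 
                      out_channels=num_channels, kernel_size=3),
            # ELU activation: unlike ReLU, its gradient is nonzero for negative inputs,
            # which avoids "dead" neurons and can speed up learning
            nn.ELU(),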
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3),
            nn.ELU()
        )
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier
        
        Args:
            x_surname (torch.Tensor): an input data tensor. 
                x_surname.shape should be (batch, initial_num_channels, max_surname_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        features = self.convnet(x_surname).squeeze(dim=2)
       
        prediction_vector = self.fc(features)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector

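III. Implementation

3.1 Processing the surname dataset

Before processing the dataset, we should understand its properties:
The first property is that it is quite imbalanced. The top three nationalities account for more than 60% of the data: 27% English, 21% Russian, and 14% Arabic. The remaining 15 nationalities appear with decreasing frequency, a long-tailed pattern that is also typical of language data.
The second property is that there is a valid and intuitive relationship between nationality and surname orthography (spelling). Some spelling variants are tied closely to a country of origin (such as "O'Neill", "Antonopoulos", "Nagasawa", or "Zhu").
Knowing this helps us design the model and the data transformations, and it also helps us analyze the training and prediction results more accurately.
1. Splitting the dataset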

import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

args = Namespace(
    raw_dataset_csv="data/surnames/surnames.csv",
    train_proportion=0.7,
    val_proportion=0.15,
    test_proportion=0.15,
    output_munged_csv="data/surnames/surnames_with_splits.csv",
    seed=1337
)
# Read raw data
surnames = pd.read_csv(args.raw_dataset_csv, header=0) 
# Unique classes
set(surnames.nationality)  # the set drops duplicates, leaving the unique nationality classes
# Splitting train by nationality
# Create dict
by_nationality = collections.defaultdict(list) 
for _, row in surnames.iterrows(): 
    by_nationality[row.nationality].append(row.to_dict())  # group the rows (as dicts) by nationality
# Create split data
final_list = []
np.random.seed(args.seed)
for _, item_list in sorted(by_nationality.items()):
    np.random.shuffle(item_list)  # shuffle the rows of this nationality
    n = len(item_list)  # number of rows for this nationality
    n_train = int(args.train_proportion*n)
    n_val = int(args.val_proportion*n)
    n_test = int(args.test_proportion*n)
    
    # Give data point a split attribute
    for item in item_list[:n_train]:
        item['split'] = 'train'
    for item in item_list[n_train:n_train+n_val]:
        item['split'] = 'val'
    for item in item_list[n_train+n_val:]:
        item['split'] = 'test'  
    
    # Add to final list
    final_list.extend(item_list)
 
final_surnames = pd.DataFrame(final_list) 
# Write munged data to CSV
final_surnames.to_csv(args.output_munged_csv, index=False)
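One detail of the split code is worth spelling out: the test slice takes everything after the train and validation rows, so any rows lost to `int()` truncation end up in the test set, and the computed `n_test` is never used. The sketch below reproduces only that arithmetic in plain Python; the helper name `split_sizes` is an assumption for illustration.

```python
def split_sizes(n, train_proportion=0.7, val_proportion=0.15):
    """Sizes of the train/val/test slices for one nationality with n rows."""
    n_train = int(train_proportion * n)   # truncated, as in the split code
    n_val = int(val_proportion * n)       # truncated as well
    n_test = n - n_train - n_val          # the test slice is "everything that is left"
    return n_train, n_val, n_test

print(split_sizes(10))  # (7, 1, 2): the leftover row goes to the test split
print(split_sizes(3))   # (2, 0, 1): very small classes may get no validation rows
```

For rare nationalities this means the validation split can be empty, which is one more reason to look at per-split class counts after writing the CSV.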

[Figure: data before processing]

[Figure: data after processing]

# Unique classes
set(surnames.nationality)  # unique nationality classes

[Figure: the nationalities in the dataset (the classes to predict)]

final_surnames.split.value_counts()

Check the sizes of the training, validation, and test sets

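3.2 Building the Vocabulary

The Vocabulary class handles vocabulary management for text preprocessing and for preparing neural network inputs. Its core job is to build, store, and look up the mapping between tokens and their unique indices, and it supports serialization and deserialization so that it can be saved and reloaded.
The constructor accepts an existing token-to-index mapping, a flag for whether to add an unknown token (UNK), and the symbol to use for UNK. Its methods add one or many tokens to the mapping, look up the index of a token, and look up the token for an index. When a token is unknown and UNK is enabled, the UNK index is returned.
The length of the object is the vocabulary size.
Functionally, the Vocabulary prepares for the vectorization step below: it provides the token-to-index mapping built from the dataset, so that each character of a surname can be converted to its index.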

class Vocabulary(object):
    """Class to process text and extract vocabulary for mapping"""

    def __init__(self, token_to_idx=None, add_unk=True, unk_token="<UNK>"):
        """
        Args:
            token_to_idx (dict): a pre-existing map of tokens to indices
            add_unk (bool): a flag that indicates whether to add the UNK token
            unk_token (str): the UNK token to add into the Vocabulary
        """

        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx

        self._idx_to_token = {idx: token 
                              for token, idx in self._token_to_idx.items()}  # reverse mapping: index -> token
        
        self._add_unk = add_unk
        self._unk_token = unk_token
        
        self.unk_index = -1
        if add_unk:
            self.unk_index = self.add_token(unk_token)  # register UNK; in a fresh vocabulary it gets index 0
        
        
    def to_serializable(self):
        """ returns a dictionary that can be serialized """
        return {'token_to_idx': self._token_to_idx, 
                'add_unk': self._add_unk, 
                'unk_token': self._unk_token}  # a plain dict that can be serialized (e.g. to JSON)

    @classmethod
    def from_serializable(cls, contents):
        """ instantiates the Vocabulary from a serialized dictionary """
        return cls(**contents)

    def add_token(self, token):  # add one token to the mapping
        """Update mapping dicts based on the token.

        Args:
            token (str): the item to add into the Vocabulary
        Returns:
            index (int): the integer corresponding to the token
        """
        try:
            index = self._token_to_idx[token]
        except KeyError:   # token not present yet: add a new token/index pair
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index
    
    def add_many(self, tokens):  # add every token in a list
        """Add a list of tokens into the Vocabulary
        
        Args:
            tokens (list): a list of string tokens
        Returns:
            indices (list): a list of indices corresponding to the tokens
        """
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        """Retrieve the index associated with the token 
          or the UNK index if token isn't present.
        
        Args:
            token (str): the token to look up 
        Returns:
            index (int): the index corresponding to the token
        Notes:
            `unk_index` needs to be >=0 (having been added into the Vocabulary) 
              for the UNK functionality 
        """
        if self.unk_index >= 0:  # an UNK token has been added to the vocabulary
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]

    def lookup_index(self, index):
        """Return the token associated with the index
        
        Args: 
            index (int): the index to look up
        Returns:
            token (str): the token corresponding to the index
        Raises:
            KeyError: if the index is not in the Vocabulary
        """
        if index not in self._idx_to_token:
            raise KeyError("the index (%d) is not in the Vocabulary" % index)
        return self._idx_to_token[index]

    def __str__(self):
        return "<Vocabulary(size=%d)>" % len(self)

    def __len__(self):
        return len(self._token_to_idx)
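The core behaviour of the class, a two-way mapping with an UNK fallback, can be sketched with plain dictionaries. This is a stripped-down illustration rather than a replacement for the class: the module-level names `token_to_idx`, `add_token`, and `lookup_token` mirror the methods above, and the sample surname is arbitrary.

```python
token_to_idx = {}

def add_token(token):
    """Return the token's index, assigning the next free index if it is new."""
    return token_to_idx.setdefault(token, len(token_to_idx))

unk_index = add_token("@")   # UNK is added first, so it receives index 0
for ch in "Zhu":
    add_token(ch)

# Reverse mapping, built once all tokens are known
idx_to_token = {idx: token for token, idx in token_to_idx.items()}

def lookup_token(token):
    """Index of a known token, or the UNK index for anything unseen."""
    return token_to_idx.get(token, unk_index)

print(token_to_idx)       # {'@': 0, 'Z': 1, 'h': 2, 'u': 3}
print(lookup_token("h"))  # 2
print(lookup_token("x"))  # 0, the UNK index
print(idx_to_token[3])    # 'u'
```

Adding the same token twice returns the existing index, which is why the vocabulary can be built by simply looping over every character in the training data.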

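3.3 Building the Vectorizer

The SurnameVectorizer class coordinates the Vocabulary instances and turns raw text into numeric vectors the model can use. It uses two Vocabulary objects, one for surname characters and one for nationalities. Its vectorize method converts a surname into a collapsed one-hot vector over the character vocabulary.
The class can also be created from a DataFrame or from a serialized dictionary, which simplifies preprocessing and input preparation, and it can be serialized to save and restore the vocabulary configuration.
In short, the Vectorizer converts text into indices and vectors that the model can train on, and it can be serialized so the same mapping is reused later.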

class SurnameVectorizer(object):
    """ The Vectorizer which coordinates the Vocabularies and puts them to use"""
    def __init__(self, surname_vocab, nationality_vocab):
        """
        Args:
            surname_vocab (Vocabulary): maps characters to integers
            nationality_vocab (Vocabulary): maps nationalities to integers
        """
        self.surname_vocab = surname_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname):
        """
        Args:
            surname (str): the surname

        Returns:
            one_hot (np.ndarray): a collapsed one-hot encoding 
        """
        vocab = self.surname_vocab
        # One slot per vocabulary entry, so every known character has a position in the vector
        one_hot = np.zeros(len(vocab), dtype=np.float32)
        for token in surname:
            one_hot[vocab.lookup_token(token)] = 1  # set the slot at this character's index

        return one_hot

    @classmethod
    def from_dataframe(cls, surname_df):
        """Instantiate the vectorizer from the dataset dataframe
        
        Args:
            surname_df (pandas.DataFrame): the surnames dataset
        Returns:
            an instance of the SurnameVectorizer
        """
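        # Create empty vocabularies for surname characters and nationalities
        surname_vocab = Vocabulary(unk_token="@")
        nationality_vocab = Vocabulary(add_unk=False)
        # Add the characters and the nationality of every row, then return a vectorizer holding both vocabularies
        for index, row in surname_df.iterrows():
            for letter in row.surname:
                surname_vocab.add_token(letter)  # add each character of the surname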
            nationality_vocab.add_token(row.nationality)

        return cls(surname_vocab, nationality_vocab) 

    @classmethod
    def from_serializable(cls, contents):
        # Rebuild the two Vocabulary instances from the serialized dictionary
        surname_vocab = Vocabulary.from_serializable(contents['surname_vocab'])
        nationality_vocab =  Vocabulary.from_serializable(contents['nationality_vocab'])
        return cls(surname_vocab=surname_vocab, nationality_vocab=nationality_vocab)

    def to_serializable(self):
        """Create the serializable dictionary for caching
        
        Returns:
            contents (dict): the serializable dictionary
        """
        return {'surname_vocab': self.surname_vocab.to_serializable(),
                'nationality_vocab': self.nationality_vocab.to_serializable()}
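A "collapsed" one-hot vector records only which characters occur, not their order or how often they occur. The sketch below shows the consequence with plain Python lists; the function name `collapsed_one_hot` and the four-entry vocabulary are assumptions made up for the illustration.

```python
def collapsed_one_hot(surname, token_to_idx, unk_index=0):
    """Set one slot per distinct character; unseen characters map to the UNK slot."""
    vector = [0.0] * len(token_to_idx)
    for ch in surname:
        vector[token_to_idx.get(ch, unk_index)] = 1.0
    return vector

# Hypothetical character vocabulary with "@" as the UNK token
vocab = {"@": 0, "a": 1, "n": 2, "b": 3}

print(collapsed_one_hot("anna", vocab))  # [0.0, 1.0, 1.0, 0.0]
print(collapsed_one_hot("nan", vocab))   # [0.0, 1.0, 1.0, 0.0], same vector
print(collapsed_one_hot("bob", vocab))   # [1.0, 0.0, 0.0, 1.0], "o" falls back to UNK
```

Two different strings with the same character set produce the same input, which is a known limitation of this representation for the MLP; a model that sees one column per character position, such as the CNN, does not lose that information.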

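3.4 Building the Dataset

SurnameDataset is the PyTorch dataset for the surname classification task; it combines preprocessing with data loading.
It takes the surname DataFrame and a SurnameVectorizer instance, selects the train/validation/test split, vectorizes each example, and computes class weights to counter the class imbalance.
Class methods load the dataset from a CSV file and either create a new vectorizer or load a saved one, so the dataset and vectorizer can be persisted and reused. By subclassing Dataset, the class implements the interface PyTorch needs (__len__ and __getitem__) for indexed access. The generate_batches function wraps the DataLoader, yields batches ready for training, and moves each tensor to the chosen device (CPU or GPU).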

import json

import torch
from torch.utils.data import Dataset, DataLoader

class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        """
        Args:
            surname_df (pandas.DataFrame): the dataset
            vectorizer (SurnameVectorizer): vectorizer instantiated from dataset
        """
        # Store the DataFrame and the vectorizer
        self.surname_df = surname_df
        self._vectorizer = vectorizer
        # Select the rows of each split and record them in a lookup dictionary
        self.train_df = self.surname_df[self.surname_df.split=='train']
        self.train_size = len(self.train_df)

        self.val_df = self.surname_df[self.surname_df.split=='val']
        self.validation_size = len(self.val_df)

        self.test_df = self.surname_df[self.surname_df.split=='test']
        self.test_size = len(self.test_df)

        self._lookup_dict = {'train': (self.train_df, self.train_size),
                             'val': (self.val_df, self.validation_size),
                             'test': (self.test_df, self.test_size)}

        self.set_split('train')
        
        # Class weights
        class_counts = surname_df.nationality.value_counts().to_dict()  # number of rows per nationality
        # to_dict() turns the counts into a {nationality: count} dictionary that is easy to sort
        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])  # sort by the nationality's vocabulary index
        sorted_counts = sorted(class_counts.items(), key=sort_key)
        # sorted_counts lists the (nationality, count) pairs in the order of the
        # nationality indices in nationality_vocab, so weight i belongs to class i
        frequencies = [count for _, count in sorted_counts]  # counts in class-index order
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32) 
        # Kept as a tensor so it can be passed to a weighted loss (e.g. the weight argument of nn.CrossEntropyLoss) and moved to the GPU

    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        """Load dataset and make a new vectorizer from scratch
        
        Args:
            surname_csv (str): location of the dataset
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split=='train']
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @classmethod
    def load_dataset_and_load_vectorizer(cls, surname_csv, vectorizer_filepath):
        """Load dataset and the corresponding vectorizer. 
        Used in the case in the vectorizer has been cached for re-use
        
        Args:
            surname_csv (str): location of the dataset
            vectorizer_filepath (str): location of the saved vectorizer
        Returns:
            an instance of SurnameDataset
        """
        surname_df = pd.read_csv(surname_csv)
        vectorizer = cls.load_vectorizer_only(vectorizer_filepath)
        return cls(surname_df, vectorizer)

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        """a static method for loading the vectorizer from file
        
        Args:
            vectorizer_filepath (str): the location of the serialized vectorizer
        Returns:
            an instance of SurnameVectorizer
        """
        with open(vectorizer_filepath) as fp:  # load only the vectorizer from the given path
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        """saves the vectorizer to disk using json
        
        Args:
            vectorizer_filepath (str): the location to save the vectorizer
        """
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)  # convert to a serializable dict, then write it as JSON

    def get_vectorizer(self):
        """ returns the vectorizer """
        return self._vectorizer

    def set_split(self, split="train"):
        """ selects the splits in the dataset using a column in the dataframe """
        self._target_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        """the primary entry point method for PyTorch datasets
        
        Args:
            index (int): the index to the data point 
        Returns:
            a dictionary holding the data point's:
                features (x_surname)
                label (y_nationality)
        """
        row = self._target_df.iloc[index]  # fetch the data point at this position

        surname_vector = \
            self._vectorizer.vectorize(row.surname)  

        nationality_index = \
            self._vectorizer.nationality_vocab.lookup_token(row.nationality)

        return {'x_surname': surname_vector,
                'y_nationality': nationality_index}

    def get_num_batches(self, batch_size):
        """Given a batch size, return the number of batches in the dataset
        
        Args:
            batch_size (int)
        Returns:
            number of batches in the dataset
        """
        return len(self) // batch_size   # number of full batches

    
def generate_batches(dataset, batch_size, shuffle=True,
                     drop_last=True, device="cpu"): 
    """
    A generator function which wraps the PyTorch DataLoader. It will 
      ensure each tensor is on the right device location.
    """
    dataloader = DataLoader(dataset=dataset, batch_size=batch_size,
                            shuffle=shuffle, drop_last=drop_last)
    # Move every tensor of the batch dictionary to the target device and yield the result
    for data_dict in dataloader:
        out_data_dict = {}
        for name, tensor in data_dict.items():
            out_data_dict[name] = data_dict[name].to(device) 
        yield out_data_dict
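The class-weight computation in `__init__` boils down to "one over the class count, listed in class-index order". The following sketch shows that idea in plain Python with a toy label list; the labels, their counts, and the index order in `class_order` are hypothetical.

```python
from collections import Counter

# Hypothetical, imbalanced label column
labels = ["English"] * 6 + ["Russian"] * 3 + ["Arabic"] * 1
counts = Counter(labels)

# Assumed index order of the classes (index 0, 1, 2), standing in for
# the order defined by nationality_vocab in the dataset class
class_order = ["Arabic", "English", "Russian"]

# Inverse-frequency weights: rare classes get larger weights
weights = [1.0 / counts[name] for name in class_order]
print(weights)  # weight of "Arabic" is 1.0, of "English" 1/6, of "Russian" 1/3
```

Sorting by class index matters because a weighted loss reads the weight for class i from position i; a list in `value_counts()` order would silently assign weights to the wrong classes.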

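3.5 Multilayer perceptron implementation

The MLP is defined as a class. The constructor defines the model structure, two linear layers,
and forward defines how data flows through it: the linear transformations, the nonlinear activation, and whether softmax is applied to produce class probabilities.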

class SurnameClassifier(nn.Module):
    """ A 2-layer Multilayer Perceptron for classifying surnames """
    def __init__(self, input_dim, hidden_dim, output_dim):
        """
        Args:
            input_dim (int): the size of the input vectors
            hidden_dim (int): the output size of the first Linear layer
            output_dim (int): the output size of the second Linear layer
        """
        super(SurnameClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_in, apply_softmax=False):
        """The forward pass of the classifier
        
        Args:
            x_in (torch.Tensor): an input data tensor. 
                x_in.shape should be (batch, input_dim)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, output_dim)
        """
        intermediate_vector = F.relu(self.fc1(x_in))
        prediction_vector = self.fc2(intermediate_vector)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector

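3.6 Convolutional neural network implementation

The CNN is also defined as a class. The constructor sets up the architecture (convolutional layers, activation layers, and a fully connected layer), and forward defines how data flows through the model to produce the prediction.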

class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vector
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size to use throughout network
        """
        super(SurnameClassifier, self).__init__()
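        # Unlike the MLP, the feature extractor is built from convolution kernels rather than linear layers
        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels, 
                      out_channels=num_channels, kernel_size=3),
            # ELU activation: unlike ReLU, its gradient is nonzero for negative inputs,
            # which avoids "dead" neurons and can speed up learning
            nn.ELU(),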
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3),
            nn.ELU()
        )
        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier
        
        Args:
            x_surname (torch.Tensor): an input data tensor. 
                x_surname.shape should be (batch, initial_num_channels, max_surname_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        features = self.convnet(x_surname).squeeze(dim=2)
       
        prediction_vector = self.fc(features)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector


3.7 Training the model

Training state tracked during training

def make_train_state(args):
    return {'stop_early': False,
            'early_stopping_step': 0,
            'early_stopping_best_val': 1e8,
            'learning_rate': args.learning_rate,
            'epoch_index': 0,
            'train_loss': [],
            'train_acc': [],
            'val_loss': [],
            'val_acc': [],
            'test_loss': -1,
            'test_acc': -1,
            'model_filename': args.model_state_file}
        
def update_train_state(args, model, train_state):  # same as in exp4_2
    """Handle the training state updates.

    Components:
     - Early Stopping: Prevent overfitting.
     - Model Checkpoint: Model is saved if the model is better

    :param args: main arguments
    :param model: model to train
    :param train_state: a dictionary representing the training state values
    :returns:
        a new train_state
    """

    # Save one model at least
    if train_state['epoch_index'] == 0:
        torch.save(model.state_dict(), train_state['model_filename'])
        train_state['stop_early'] = False

    # Save model if performance improved
    elif train_state['epoch_index'] >= 1:
        loss_tm1, loss_t = train_state['val_loss'][-2:]   # validation losses of the previous and the current epoch

        # If loss worsened
        if loss_t >= train_state['early_stopping_best_val']:
            # Update step
            train_state['early_stopping_step'] += 1
        # Loss decreased
        else:
            # Save the best model and record the new best validation loss
            if loss_t < train_state['early_stopping_best_val']:
                torch.save(model.state_dict(), train_state['model_filename'])
                train_state['early_stopping_best_val'] = loss_t

            # Reset early stopping step
            train_state['early_stopping_step'] = 0

        # Stop early ?
        train_state['stop_early'] = \
            train_state['early_stopping_step'] >= args.early_stopping_criteria

    return train_state
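The early stopping rule can be traced on a list of validation losses without any model. This is a simplified sketch in plain Python that assumes the best validation loss is updated whenever the loss improves; the function name `early_stopping_epoch` and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch at which training stops, best loss), or (None, best) if it never stops."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss        # new best: this is where a checkpoint would be saved
            bad_epochs = 0
        else:
            bad_epochs += 1    # no improvement over the best loss so far
        if bad_epochs >= patience:
            return epoch, best
    return None, best

# Hypothetical validation losses: the best value is reached at epoch 2
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop_epoch, best_loss = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_loss)  # 5 0.7
```

Note that epochs 4 and 5 improve on the previous epoch but not on the best loss, so they still count toward the patience limit. Comparing against the best value rather than the last one is what makes the criterion robust to small oscillations.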

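Before training, we store the relevant parameters and paths in a Namespace: data file paths, model hyperparameters, training hyperparameters, and runtime options. For example, it specifies the location of the data CSV file, where the vectorizer and model state are saved, the learning rate, batch size, number of epochs, and the early stopping criterion.
Keeping the parameters in one place makes them easier to tune and organize.

Hyperparameter settings: multilayer perceptron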

args = Namespace(
    # Data and path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/surname_mlp",
    # Model hyper parameters
    hidden_dim=300,
    # Training  hyper parameters
    seed=1337,
    num_epochs=100,
    early_stopping_criteria=5,  # stop once validation loss has failed to improve for 5 consecutive epochs
    learning_rate=0.001,
    batch_size=64,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)

Hyperparameter settings: CNN

args = Namespace(
    # Data and Path information
    surname_csv="data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="model_storage/ch4/cnn",
    # Model hyper parameters
    hidden_dim=100,
    num_channels=256,
    # Training hyper parameters
    seed=1337,
    learning_rate=0.001,
    batch_size=128,
    num_epochs=100,
    early_stopping_criteria=5,
    dropout_p=0.1,
    # Runtime options
    cuda=False,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
    catch_keyboard_interrupt=True
)


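Before running, we expand the file paths, choose the device, create the output directory, and fix the random seeds for reproducibility. This kind of initialization is common in machine learning projects and helps training run smoothly.

Environment setup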

import os

if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir,
                                        args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir,
                                         args.model_state_file)
    
    print("Expanded filepaths: ")
    print("\t{}".format(args.vectorizer_file))
    print("\t{}".format(args.model_state_file))
    
# Check CUDA
if not torch.cuda.is_available():
    args.cuda = False

args.device = torch.device("cuda" if args.cuda else "cpu")
print("Using CUDA: {}".format(args.cuda))

def set_seed_everywhere(seed, cuda):
    np.random.seed(seed)
    torch.manual_seed(seed)
    if cuda:
        torch.cuda.manual_seed_all(seed)
        
def handle_dirs(dirpath):
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
        
# Set seed for reproducibility
set_seed_everywhere(args.seed, args.cuda)

# handle dirs
handle_dirs(args.save_dir)

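Model save paths and device in use

[Figure]
Before training starts, we also need to turn the raw text data into a dataset the model can consume (the Dataset and Vectorizer classes handle this), and instantiate the model from its initialization arguments. This amounts to building the model's skeleton; once data flows through it, training begins.

Dataset and classifier initialization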

if args.reload_from_files:
    # training from a checkpoint
    print("Reloading!")
    dataset = SurnameDataset.load_dataset_and_load_vectorizer(args.surname_csv,
                                                              args.vectorizer_file)
else:
    # create dataset and vectorizer
    print("Creating fresh!")
    dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
    dataset.save_vectorizer(args.vectorizer_file)
    
vectorizer = dataset.get_vectorizer()
classifier = SurnameClassifier(input_dim=len(vectorizer.surname_vocab), 
                               hidden_dim=args.hidden_dim, 
                               output_dim=len(vectorizer.nationality_vocab))  # input dim = size of the surname vocabulary; output dim = number of nationality classes


[Figure]

Computing the model's prediction accuracy (on predictions made during training)

def compute_accuracy(y_pred, y_target):
    y_pred_indices = y_pred.max(dim=1)[1]
    n_correct = torch.eq(y_pred_indices, y_target).sum().item()
    return n_correct / len(y_pred_indices) * 100
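
To make the accuracy computation concrete, here is a minimal plain-Python analogue of `compute_accuracy`. It is only a sketch: the helper name `accuracy_percent` and the toy scores are made up for illustration, and nested lists stand in for the PyTorch tensors used above.

```python
def accuracy_percent(scores, targets):
    """Percentage of rows whose highest-scoring class equals the target.

    scores:  list of per-class score lists, one list per example
    targets: list of integer class indices, one per example
    """
    # argmax over each row, mirroring y_pred.max(dim=1)[1]
    predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
    n_correct = sum(int(p == t) for p, t in zip(predicted, targets))
    return n_correct / len(targets) * 100


# Toy batch of 4 examples with 3 classes (hypothetical values)
scores = [
    [0.1, 2.0, -1.0],   # argmax = 1
    [1.5, 0.2, 0.3],    # argmax = 0
    [0.0, 0.1, 0.9],    # argmax = 2
    [0.3, 0.2, 0.1],    # argmax = 0
]
targets = [1, 0, 1, 0]

print(accuracy_percent(scores, targets))  # 3 of 4 correct -> 75.0
```

Note that the model outputs do not need to pass through softmax first: softmax preserves the ordering of the scores, so the argmax is the same either way.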

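Training procedure

Overall, the code below implements a loop that trains and validates the classifier.
Training the classifier consists of the following steps:
Zero the gradients: reset the model's gradients to zero before backpropagation computes new ones.
Forward pass: compute the predictions with the model.
Compute the loss: use the loss function to measure the gap between the predictions and the true labels.
Backward pass: compute the gradients.
Optimizer update: update the model parameters according to the gradients.
Track metrics: keep a running average of the per-batch loss and accuracy, and refresh the information shown on the training progress bar.
Validation follows a similar procedure, but skips the backward pass and the parameter update.
Two further techniques help the training process:
Early stopping and learning-rate scheduling:
The early-stopping criterion (stop_early) decides whether to end training ahead of schedule, to avoid overfitting.
The scheduler adjusts the learning rate based on the validation loss, with the aim of better convergence.
Visualization: after each epoch, the train_state dictionary is updated with the training and validation loss and accuracy, and tqdm progress bars show how far training has progressed.
This loop reflects a typical machine learning training workflow: a training phase, a validation phase, performance monitoring, early stopping, and learning-rate adjustment.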
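
The early-stopping logic mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the `update_train_state` helper used in this post (whose body is not shown in this section); the function name, the state keys, and the validation losses below are all hypothetical.

```python
def update_early_stopping(state, val_loss, patience):
    """Update a small early-stopping state dict in place and return it.

    state holds the best validation loss so far and the number of
    consecutive epochs without improvement.
    """
    if val_loss < state["best_val_loss"]:
        # Improvement: remember it and reset the counter
        state["best_val_loss"] = val_loss
        state["bad_epochs"] = 0
    else:
        # No improvement this epoch
        state["bad_epochs"] += 1
    # Stop once the counter reaches or exceeds the patience threshold
    state["stop_early"] = state["bad_epochs"] >= patience
    return state


state = {"best_val_loss": float("inf"), "bad_epochs": 0, "stop_early": False}
val_losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]  # made-up per-epoch validation losses

stopped_at = None
for epoch, val_loss in enumerate(val_losses):
    update_early_stopping(state, val_loss, patience=3)
    if state["stop_early"]:
        stopped_at = epoch
        break

print(stopped_at, state["best_val_loss"])
```

With a patience of 3, the loop stops after three consecutive epochs that fail to beat the best loss of 0.8, so the final 0.7 is never reached. That is the trade-off of a small patience value: less wasted compute, but a risk of stopping just before a late improvement.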

epoch_bar = tqdm_notebook(desc='training routine', 
                          total=args.num_epochs,
                          position=0)

dataset.set_split('train')
train_bar = tqdm_notebook(desc='split=train',
                          total=dataset.get_num_batches(args.batch_size), 
                          position=1, 
                          leave=True)
dataset.set_split('val')
val_bar = tqdm_notebook(desc='split=val',
                        total=dataset.get_num_batches(args.batch_size), 
                        position=1, 
                        leave=True)

try:
    for epoch_index in range(args.num_epochs):
        train_state['epoch_index'] = epoch_index

        # Iterate over training dataset

        # setup: batch generator, set loss and acc to 0, set train mode on

        dataset.set_split('train')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.0
        running_acc = 0.0
        classifier.train()

        for batch_index, batch_dict in enumerate(batch_generator):
            # the training routine is these 5 steps:

            # --------------------------------------
            # step 1. zero the gradients
            optimizer.zero_grad()

            # step 2. compute the output
            y_pred = classifier(batch_dict['x_surname'])

            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # step 4. use loss to produce gradients
            loss.backward()

            # step 5. use optimizer to take gradient step
            optimizer.step()
            # -----------------------------------------
            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)

            # update bar
            train_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            train_bar.update()

        train_state['train_loss'].append(running_loss)
        train_state['train_acc'].append(running_acc)

        # Iterate over val dataset

        # setup: batch generator, set loss and acc to 0; set eval mode on
        dataset.set_split('val')
        batch_generator = generate_batches(dataset, 
                                           batch_size=args.batch_size, 
                                           device=args.device)
        running_loss = 0.
        running_acc = 0.
        classifier.eval()

        for batch_index, batch_dict in enumerate(batch_generator):

            # compute the output
            y_pred =  classifier(batch_dict['x_surname'])

            # step 3. compute the loss
            loss = loss_func(y_pred, batch_dict['y_nationality'])
            loss_t = loss.item()
            running_loss += (loss_t - running_loss) / (batch_index + 1)

            # compute the accuracy
            acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
            running_acc += (acc_t - running_acc) / (batch_index + 1)
            val_bar.set_postfix(loss=running_loss, acc=running_acc, 
                            epoch=epoch_index)
            val_bar.update()

        train_state['val_loss'].append(running_loss)
        train_state['val_acc'].append(running_acc)

        train_state = update_train_state(args=args, model=classifier,
                                         train_state=train_state)

        scheduler.step(train_state['val_loss'][-1])

        if train_state['stop_early']:
            break

        train_bar.n = 0
        val_bar.n = 0
        epoch_bar.update()
except KeyboardInterrupt:
    print("Exiting loop")
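
The update `running_loss += (loss_t - running_loss) / (batch_index + 1)` in the loop above is an incremental mean: after each batch, `running_loss` equals the average of all batch losses seen so far, without storing them. A small stand-alone check (the function name and the loss values are made up for illustration):

```python
def running_mean(values):
    """Incrementally average values using the same update as the training loop."""
    running = 0.0
    for index, value in enumerate(values):
        # Move the running value toward the new one by 1/(n) of the gap
        running += (value - running) / (index + 1)
    return running


batch_losses = [2.0, 1.0, 3.0, 2.0]  # hypothetical per-batch losses
incremental = running_mean(batch_losses)
direct = sum(batch_losses) / len(batch_losses)
print(incremental, direct)
```

One caveat: this is the mean of per-batch averages. If the last batch is smaller than the others and is not dropped, its examples carry slightly more weight than in a true per-example average.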

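Training process of the convolutional neural network model:

[Figure]

This training run shows slight overfitting toward the end; later on, we take steps to delay its onset.

Training process of the multilayer perceptron model:

3.8 Multilayer Perceptron: Prediction

We load the trained MLP parameters and run the model on the test split of the dataset, computing loss and accuracy to evaluate its performance. (The procedure is largely the same as validation, minus the progress-bar visualization.)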


classifier.load_state_dict(torch.load(train_state['model_filename']))

classifier = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
loss_func = nn.CrossEntropyLoss(dataset.class_weights)

dataset.set_split('test')
batch_generator = generate_batches(dataset, 
                                   batch_size=args.batch_size, 
                                   device=args.device)
running_loss = 0.
running_acc = 0.
classifier.eval()

with torch.no_grad():  # evaluation only: no need to track gradients
    for batch_index, batch_dict in enumerate(batch_generator):
        # compute the output
        y_pred = classifier(batch_dict['x_surname'])

        # compute the loss
        loss = loss_func(y_pred, batch_dict['y_nationality'])
        loss_t = loss.item()
        running_loss += (loss_t - running_loss) / (batch_index + 1)

        # compute the accuracy
        acc_t = compute_accuracy(y_pred, batch_dict['y_nationality'])
        running_acc += (acc_t - running_acc) / (batch_index + 1)

train_state['test_loss'] = running_loss
train_state['test_acc'] = running_acc


print("Test loss: {};".format(train_state['test_loss']))
print("Test Accuracy: {}".format(train_state['test_acc']))

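Test-set loss and accuracy

[Figure]

Having computed the test-set accuracy and loss, we now check the model with actual predictions. The function below predicts a nationality from a new surname: given a surname as input, it returns the predicted nationality together with its probability.
The code first uses the vectorizer to convert the input surname into a numeric vector and wraps it in a PyTorch tensor. It then calls the classifier on the vectorized surname, finds the index of the most probable nationality in the output, and maps that index back to text through the vectorizer. The result is returned as a dictionary containing the nationality and its probability.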

def predict_nationality(surname, classifier, vectorizer):  # same as in the MLP test
    """Predict the nationality from a new surname
    
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
    Returns:
        a dictionary with the most likely nationality and its probability
    """
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(0)
    result = classifier(vectorized_surname, apply_softmax=True)

    probability_values, indices = result.max(dim=1)
    index = indices.item()

    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)
    probability_value = probability_values.item()

    return {'nationality': predicted_nationality, 'probability': probability_value}


new_surname = input("Enter a surname to classify: ")
classifier = classifier.cpu()
prediction = predict_nationality(new_surname, classifier, vectorizer)
print("{} -> {} (p={:0.2f})".format(new_surname,
                                    prediction['nationality'],
                                    prediction['probability']))

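Single most probable nationality prediction

[Figure]
The model handles Chinese characters poorly because the training data lacks such examples, so the prediction is far from ideal (here a Chinese surname is predicted as Korean). Next we try the pinyin spelling.

After switching to the romanized spelling, the prediction improves markedly, since the dataset contains only Latin-alphabet surnames.
Next we test a compound surname:
[Figure]
The model clearly does poorly on the compound surname (though this surname may genuinely resemble Japanese ones). Next, let's look at several candidate predictions for a single surname.

To see more of the model's candidate predictions, we can output the k nationalities with the highest predicted probability.
The implementation is much like the single-nationality prediction. The difference is that it calls torch.topk to return several probabilities and indices, so multiple predictions can be displayed (k is configurable but cannot exceed the number of classes).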

def predict_topk_nationality(surname, classifier, vectorizer, k=5):
    """Predict the top K nationalities from a new surname
    
    Args:
        surname (str): the surname to classify
        classifier (SurnameClassifier): an instance of the classifier
        vectorizer (SurnameVectorizer): the corresponding vectorizer
        k (int): the number of top nationalities to return
    Returns:
        list of dictionaries, each dictionary is a nationality and a probability
    """
    
    vectorized_surname = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(dim=0)
    prediction_vector = classifier(vectorized_surname, apply_softmax=True)
    probability_values, indices = torch.topk(prediction_vector, k=k)
    
    # returned size is 1,k
    probability_values = probability_values[0].detach().numpy()
    indices = indices[0].detach().numpy()
    
    results = []
    for kth_index in range(k):
        nationality = vectorizer.nationality_vocab.lookup_index(indices[kth_index])
        probability_value = probability_values[kth_index]
        results.append({'nationality': nationality, 
                        'probability': probability_value})
    return results

new_surname = input("Enter a surname to classify: ")

k = int(input("How many of the top predictions to see? "))
if k > len(vectorizer.nationality_vocab):
    print("Sorry! That's more than the # of nationalities we have.. defaulting you to max size :)")
    k = len(vectorizer.nationality_vocab)
    
predictions = predict_topk_nationality(new_surname, classifier, vectorizer, k=k)

print("Top {} predictions:".format(k))
print("===================")
for prediction in predictions:
    print("{} -> {} (p={:0.2f})".format(new_surname,
                                        prediction['nationality'],
                                        prediction['probability']))
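
To show what `apply_softmax=True` followed by `torch.topk` computes, here is a small plain-Python sketch of both steps. It is an illustration only: the function names, the four labels, and the raw scores are hypothetical, and lists stand in for tensors.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probabilities, labels, k):
    """Return the k (label, probability) pairs with the highest probability."""
    k = min(k, len(probabilities))  # k cannot exceed the number of classes
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    return [(labels[i], probabilities[i]) for i in ranked[:k]]


labels = ["English", "Italian", "Russian", "Arabic"]  # hypothetical classes
logits = [1.0, 3.0, 0.5, 2.0]                         # hypothetical raw scores

probs = softmax(logits)
top2 = top_k(probs, labels, k=2)
for name, p in top2:
    print("{} (p={:0.2f})".format(name, p))
```

Clamping `k` inside the helper plays the same role as the `if k > len(vectorizer.nationality_vocab)` check above.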

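Top-5 nationality predictions by probability

[Figure]
With pinyin (alphabetic) input, the model's predictions are fairly accurate, and it captures surname features of different nationalities.

A new problem shows up: if a surname occurs in several countries, that is, if it is fairly common across nationalities, the prediction is also unsatisfactory. (Here I meant the Alan of Alan Walker, which should map to English, yet English ranks fifth with a probability of only 0.05. Alice, another very common name, behaves similarly.) Addressing this may require handling the surname characteristics of specific countries.

[Figure]
A surname with more distinctive features is predicted correctly (Taylor Swift -> English).

3.9 Convolutional Neural Network: Prediction

Next, let's see how the CNN model performs on the surname-to-nationality prediction task.
The CNN's prediction workflow is similar to the MLP's, so the code is not repeated here.

Test-set loss and accuracy

[Figure]
The CNN's accuracy is considerably higher than the MLP's, which suggests the CNN is better suited to this kind of task (capturing nationality-related features of surnames without explicit feature encoding).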

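Single most probable nationality prediction

[Figure]
Inspecting the dataset shows that it lacks this kind of data, so the prediction is unsatisfactory. For better results, one could add a Chinese-surname dataset to the training data.

The last two predictions show that the model makes a fair number of errors. In particular, it cannot give the right answer for two kinds of surname: very common ones (lacking clear nationality features) and very unusual ones (compound surnames).

Top-5 nationality predictions by probability

[Figure]
For the surname above, which has clearly Russian features, the model gives an accurate prediction (a probability of 100%). The model has evidently learned Russian surname features well, and Russian accounts for 21% of the dataset. Let's see what changes for a language with a smaller share.

The model did not predict this Chinese surname very accurately, but it still picked up some of its features. Compared with the MLP, the CNN assigns Chinese surnames a higher and more accurate probability.

[Figure]
For more common surnames (which carry few nationality features and are hard to learn), the model cannot give a clear-cut answer, and further tuning is needed. The surname Alice originally appeared in both Italian and English, yet the predicted probability for Italian reaches 0.59 while English gets only 0.07. This may be caused by the uneven share of each nationality's surnames in the dataset.
[Figure]

I tried two more Arabic surnames. For the first, Adam, the predicted probabilities are fairly evenly spread; the second, Amjad, gets a high probability for Arabic. One reason is that Arabic makes up a large share of the dataset; another is that names carrying more nationality-specific features (like Amjad) are easier to predict.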
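
Since class imbalance is suspected above, it may help to see how per-class loss weights can counteract it. The test code in this post passes `dataset.class_weights` to `nn.CrossEntropyLoss`, but how those weights are computed is not shown in this section. The sketch below uses one common scheme (inverse class frequency) purely as an assumption; the function name and the toy label counts are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1 and frequent classes below 1, so
    mistakes on rare classes cost more in a weighted loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Toy label distribution (made up, not the real dataset proportions)
labels = ["Russian"] * 6 + ["English"] * 3 + ["Chinese"] * 1
weights = inverse_frequency_weights(labels)
for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print("{}: {:0.3f}".format(label, weight))
```

Weighting the loss does not add information about rare classes; it only changes how strongly their errors are penalized, so more data for under-represented nationalities would still be the more direct remedy.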

3.10 CNN Model with Dropout

Finally, we add Dropout layers to the CNN to guard against overfitting and improve predictive performance. The changes are as follows:

class SurnameClassifier(nn.Module):
    def __init__(self, initial_num_channels, num_classes, num_channels, dropout_p = 0.1):
        """
        Args:
            initial_num_channels (int): size of the incoming feature vector
            num_classes (int): size of the output prediction vector
            num_channels (int): constant channel size to use throughout network
            dropout_p (float): dropout probability used by every Dropout layer
        """
        super(SurnameClassifier, self).__init__()
        self.dropout_p = dropout_p
        # unlike the MLP, the CNN extracts features with convolution kernels instead of linear layers
        self.convnet = nn.Sequential(
            nn.Conv1d(in_channels=initial_num_channels, 
                      out_channels=num_channels, kernel_size=3),
            nn.ELU(),
            nn.Dropout(self.dropout_p),  # add a Dropout layer
            
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Dropout(self.dropout_p),  # Dropout after every convolutional layer
            
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3, stride=2),
            nn.ELU(),
            nn.Dropout(self.dropout_p),
            
            nn.Conv1d(in_channels=num_channels, out_channels=num_channels, 
                      kernel_size=3),
            nn.ELU(),
            nn.Dropout(self.dropout_p)
        )

        self.fc = nn.Linear(num_channels, num_classes)

    def forward(self, x_surname, apply_softmax=False):
        """The forward pass of the classifier
        
        Args:
            x_surname (torch.Tensor): an input data tensor. 
                x_surname.shape should be (batch, initial_num_channels, max_surname_length)
            apply_softmax (bool): a flag for the softmax activation
                should be false if used with the Cross Entropy losses
        Returns:
            the resulting tensor. tensor.shape should be (batch, num_classes)
        """
        features = self.convnet(x_surname).squeeze(dim=2)
        
        # apply dropout once more before the fully connected layer
        features = F.dropout(features, p=self.dropout_p, training=self.training)
       
        prediction_vector = self.fc(features)

        if apply_softmax:
            prediction_vector = F.softmax(prediction_vector, dim=1)

        return prediction_vector
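
The `squeeze(dim=2)` call in `forward` only removes the length dimension if the four convolutions shrink it to exactly 1, so it is worth checking how each layer changes the sequence length. The sketch below applies the standard Conv1d output-length formula to the kernel sizes and strides used above. The input length of 17 is an assumption chosen for illustration; the actual maximum surname length depends on the dataset.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution (the formula documented for nn.Conv1d)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def convnet_out_len(length):
    """Trace the sequence length through the four Conv1d layers of the classifier."""
    # (kernel_size, stride) for each layer, copied from the model definition
    layers = [(3, 1), (3, 2), (3, 2), (3, 1)]
    for kernel_size, stride in layers:
        length = conv1d_out_len(length, kernel_size, stride)
    return length


# Assumed input length of 17: 17 -> 15 -> 7 -> 3 -> 1
print(convnet_out_len(17))
# A longer input leaves more than one position, so squeeze(dim=2) would not remove the dimension
print(convnet_out_len(21))
```

ELU and Dropout do not change tensor shapes, so only the convolutions matter here. If the final length is not 1, the Linear layer receives a tensor of the wrong shape, and a pooling step over the length dimension would be needed instead of `squeeze`.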


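Training process

[Figure]

Compared with the earlier CNN without Dropout layers, this model overfits noticeably less: the validation loss stays closer to the training loss.

Test-set loss and accuracy

[Figure]
The loss went down, but so did the accuracy. One possible reason is that Dropout lowers the model's effective capacity, which weakens its ability to learn and keeps it from fully capturing complex features in the data.
Dropout also introduces randomness: on every iteration the model is effectively a different sub-network. This can make training noisier, especially early on, and the model may need more time to settle on a good solution.
With a dataset this small, Dropout may additionally cause underfitting, because the internal representation changes from iteration to iteration, which makes it harder to learn the dataset's features precisely.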
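
The randomness described above is easy to see in a plain-Python sketch of "inverted" dropout, the variant PyTorch uses: during training each activation is zeroed with probability p and the survivors are scaled by 1/(1-p), while in evaluation mode the input passes through unchanged. The function below is a hypothetical stand-in for illustration, not the library implementation.

```python
import random


def dropout(values, p, training, rng):
    """Inverted dropout on a list of activations (requires 0 <= p < 1)."""
    if not training or p == 0.0:
        return list(values)          # evaluation mode: identity
    keep_scale = 1.0 / (1.0 - p)     # rescale so the expected value is unchanged
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


rng = random.Random(0)               # fixed seed so the sketch is repeatable
activations = [1.0] * 10000

train_out = dropout(activations, p=0.1, training=True, rng=rng)
eval_out = dropout(activations, p=0.1, training=False, rng=rng)

dropped_fraction = train_out.count(0.0) / len(train_out)
train_mean = sum(train_out) / len(train_out)
print(dropped_fraction, train_mean)
```

Roughly 10% of the activations are dropped in training mode, and the rescaling keeps their mean close to 1. This is why `classifier.eval()` matters before testing: it switches every Dropout layer to the identity behavior.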

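Single-nationality prediction

[Figure]

[Figure]
The model now also predicts surnames from under-represented classes fairly accurately, while staying accurate on surnames from well-represented classes.

However, the model still cannot handle complex surnames or very common ones well enough to predict them precisely. (More precise predictions would likely require a larger and more comprehensive dataset.)

Multi-nationality prediction

[Figure]

Judging by the top-k probabilities, some surnames that the plain CNN predicted poorly now get more accurate nationality probabilities.

For surnames without distinctive features the problem remains, but the top-5 probabilities are more evenly spread, which is one sign of better generalization.
[Figure]
One issue did appear: a surname that the plain CNN predicted well is predicted worse here. This again suggests that with a small dataset, adding Dropout layers can hurt what the model learns.

Summary

This post walked through using a multilayer perceptron and a convolutional neural network for a natural language processing task. Taking surname classification as the example, it covered the whole process from data preprocessing through model implementation to training and prediction.
In the technical background part, we covered the ideas behind the models, their structure, and roughly how data propagates forward through a neural network.
In the implementation part, we broke the workflow into classes (dataset, vocabulary, vectorizer, network model, and so on) and used them to process the dataset, train the model, and make predictions. The key point is that the workflow is worked through in detail and covers the basic pipeline and the essential code of an NLP task. For a beginner in NLP, this task and its code lay a foundation for hands-on practice, and later tasks can build on this post for further study.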

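Disclaimer: This article was contributed by a user and does not represent the position of 【wpsshop博客】. Copyright belongs to the original author, and this site assumes no legal liability. If you find infringing content, please contact us. When reposting, please cite the source: https://www.wpsshop.cn/w/煮酒与君饮/article/detail/813009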