Dive into Deep Learning (20): The VGG Network (Runner-up in the ILSVRC 2014 Classification Task)

Author: AllinToyou | 2024-06-11 20:58:20

Visual Geometry Group (VGG)

Paper: "Very Deep Convolutional Networks for Large-Scale Image Recognition"

arXiv: [1409.1556] Very Deep Convolutional Networks for Large-Scale Image Recognition

intro: ICLR 2015

homepage: Visual Geometry Group Home Page

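1. Networks Using Blocks: VGG

Although AlexNet showed that deep neural networks work well, it did not offer a general template to guide later researchers in designing new networks. Much as chip designers moved from placing transistors to logic elements and then to logic blocks, neural network design has become progressively more abstract: researchers went from thinking about individual neurons, to whole layers, and now to blocks, that is, repeated patterns of layers.

The idea of using blocks first appeared in the VGG network from the Visual Geometry Group (VGG) at the University of Oxford. With loops and subroutines, these repeated structures are easy to implement in any modern deep learning framework.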

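1.1 The VGG Block

The basic building block of a classic convolutional neural network is the following sequence:

a convolutional layer with padding to preserve the resolution;
a nonlinear activation function such as ReLU;
a pooling layer such as max-pooling.

A VGG block is similar: it consists of a sequence of convolutional layers followed by a max-pooling layer for spatial downsampling. In the original VGG paper (Simonyan and Zisserman, 2014), the authors used convolutional layers with $3\times3$ kernels and padding 1 (preserving height and width), and max-pooling layers with a $2\times2$ window and stride 2 (halving the resolution after each block). In Section 2.1 we define a function named vgg_block that implements one VGG block.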
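
As a quick check of the sizes quoted above, the standard output-size formula for convolution and pooling layers, floor((n + 2p - k) / s) + 1, can be evaluated directly. This is only a sketch; the helper name out_size is my own and not part of any library.

```python
def out_size(n: int, kernel: int, padding: int = 0, stride: int = 1) -> int:
    """Spatial output size of a conv/pool layer: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


n = 224  # input height/width, assumed here for illustration
after_conv = out_size(n, kernel=3, padding=1, stride=1)  # 3x3 conv, padding 1
after_pool = out_size(after_conv, kernel=2, stride=2)    # 2x2 max-pool, stride 2
print(after_conv, after_pool)  # 224 112
```

The convolution leaves the resolution unchanged, so only the pooling layer at the end of the block changes the spatial size.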

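[Why can two 3x3 kernels replace one 5x5 kernel?]

Reference: "一文读懂VGG网络" (Understanding the VGG Network in One Article)

A 5x5 convolution can be viewed as a small fully connected network sliding over 5x5 regions. We can first apply a 3x3 convolution filter and then connect its 3x3 output with a fully connected layer, which can itself be viewed as a 3x3 convolutional layer. In this way, two stacked 3x3 convolutions cover the same 5x5 receptive field and can replace one 5x5 convolution.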
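
The argument can be checked with a little arithmetic. The sketch below compares the receptive field and the weight count of two stacked 3x3 convolutions against a single 5x5 convolution. The helper names and the channel count C = 64 are assumptions chosen only for illustration, and biases are ignored.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k - 1)."""
    return 1 + sum(k - 1 for k in kernel_sizes)


def conv_weights(kernel, channels):
    """Weights of one kernel x kernel conv with `channels` in and out (no bias)."""
    return kernel * kernel * channels * channels


C = 64  # hypothetical channel count
rf_stacked = receptive_field([3, 3])
rf_single = receptive_field([5])
w_stacked = 2 * conv_weights(3, C)
w_single = conv_weights(5, C)

print(rf_stacked, rf_single)  # 5 5
print(w_stacked, w_single)    # 73728 102400
```

With the same receptive field, the stacked version needs 18C² weights instead of 25C², and it applies an extra nonlinearity between the two layers.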

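1.2 The VGG Network

Like AlexNet and LeNet, the VGG network can be divided into two parts: the first consists mainly of convolutional and pooling layers, and the second consists of fully connected layers, as shown in the figure below.

The VGG network connects several VGG blocks in succession (defined in code by the vgg_block function). The hyperparameter variable conv_arch specifies the number of convolutional layers and the number of output channels in each VGG block. The fully connected part is the same as in AlexNet.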
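
To make the role of conv_arch concrete, the sketch below takes the VGG-11 value used in Section 2.2 and derives the layer count and the size of the flattened feature vector from it. The 224x224 input size is an assumption that matches the shape check later in the post.

```python
# (number of conv layers, output channels) for each VGG block
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

num_conv_layers = sum(num_convs for num_convs, _ in conv_arch)
num_fc_layers = 3  # the fully connected part has three Linear layers

size = 224  # assumed input height/width
for _ in conv_arch:
    size //= 2  # every block ends with a stride-2 max-pool

flat_features = conv_arch[-1][1] * size * size

print(num_conv_layers + num_fc_layers)  # 11
print(size, flat_features)              # 7 25088
```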

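2. Implementing VGG-11 in PyTorch

2.1 Defining the VGG Block

# The function takes three arguments: the number of convolutional layers num_convs, the number of input channels in_channels, and the number of output channels out_channels.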
import torch
from torch import nn
from d2l import torch as d2l

def vgg_block(num_convs, in_channels, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

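2.2 Implementing the VGG Architecture

'''
The original VGG network has 5 convolutional blocks: the first two each contain one convolutional layer, and the last three each contain two.
The first block has 64 output channels, and each subsequent block doubles the number of output channels until it reaches 512.
Because the network uses 8 convolutional layers and 3 fully connected layers, it is usually called VGG-11.'''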
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

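# VGG-11
# Each block halves the height and width, and the number of output channels doubles until it reaches 512.
# For a 224x224 input, the final feature map is 7x7 with 512 channels; it is then flattened and fed to the fully connected layers.
def vgg(conv_arch):
    conv_blks = []
    in_channels = 1  # Fashion-MNIST images have a single channel
    # Convolutional part
    for (num_convs, out_channels) in conv_arch:
        conv_blks.append(vgg_block(num_convs, in_channels, out_channels))
        in_channels = out_channels

    return nn.Sequential(*conv_blks, nn.Flatten(),
                         # Fully connected part
                         nn.Linear(out_channels * 7 * 7, 4096), nn.ReLU(),
                         nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
                         nn.Dropout(0.5), nn.Linear(4096, 10))

net = vgg(conv_arch)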

# Build a single-channel sample with height and width 224 and inspect the output shape of each layer
X = torch.randn(size=(1, 1, 224, 224))
for blk in net:
    X = blk(X)
    print(blk.__class__.__name__, 'output shape:\t', X.shape)

Sequential output shape:	 torch.Size([1, 64, 112, 112])
Sequential output shape:	 torch.Size([1, 128, 56, 56])
Sequential output shape:	 torch.Size([1, 256, 28, 28])
Sequential output shape:	 torch.Size([1, 512, 14, 14])
Sequential output shape:	 torch.Size([1, 512, 7, 7])
Flatten output shape:	 torch.Size([1, 25088])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 4096])
ReLU output shape:	 torch.Size([1, 4096])
Dropout output shape:	 torch.Size([1, 4096])
Linear output shape:	 torch.Size([1, 10])
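
The shapes above also show where the parameters sit. The sketch below counts them by hand, without building the model; the helper names are my own, and the count assumes the single-channel input and 10 output classes used in this post.

```python
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))


def conv_params(c_in, c_out, k=3):
    """Weights plus biases of one k x k convolution."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


conv_total, c_in = 0, 1  # single-channel input, as in vgg()
for num_convs, c_out in conv_arch:
    for _ in range(num_convs):
        conv_total += conv_params(c_in, c_out)
        c_in = c_out

fc_total = (linear_params(512 * 7 * 7, 4096)
            + linear_params(4096, 4096)
            + linear_params(4096, 10))

print(conv_total, fc_total)  # 9219328 119586826
```

Under these assumptions the fully connected part holds far more parameters than the convolutional part, mostly because of the first Linear layer with its 25088 inputs.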

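2.3 Training the Model

Because VGG-11 is more computationally expensive than AlexNet, we build a network with fewer channels, which is still sufficient for training on the Fashion-MNIST dataset.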

ratio = 16
small_conv_arch = [(pair[0], pair[1] // ratio) for pair in conv_arch]
net = vgg(small_conv_arch)
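
For reference, this is what the reduced configuration evaluates to. The snippet repeats the list comprehension with unpacked names for readability; first_fc_inputs is a name I introduce for the input width of the first fully connected layer.

```python
conv_arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))
ratio = 16

# Keep the number of conv layers per block, shrink the channels by `ratio`
small_conv_arch = [(num_convs, out_channels // ratio)
                   for num_convs, out_channels in conv_arch]
print(small_conv_arch)  # [(1, 4), (1, 8), (2, 16), (2, 32), (2, 32)]

# With a 224x224 input the last feature map is still 7x7
first_fc_inputs = small_conv_arch[-1][1] * 7 * 7
print(first_fc_inputs)  # 1568
```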

# Load the data
from torchvision import transforms
import torchvision
from torch.utils import data

batch_size = 256

def get_dataloader_workers():
    """Use four worker processes to read the data."""
    return 4

def load_data_fashion_mnist(batch_size, resize=None):
    """Download the Fashion-MNIST dataset and load it into memory."""
    trans = [transforms.ToTensor()]
    if resize:
        # Given an int, transforms.Resize scales the shorter edge to that size and keeps the aspect ratio
        trans.insert(0, transforms.Resize(resize))
    trans = transforms.Compose(trans)  # Compose chains several transforms
    mnist_train = torchvision.datasets.FashionMNIST(root="./data",
                                                    train=True,
                                                    transform=trans,
                                                    download=True)
    mnist_test = torchvision.datasets.FashionMNIST(root="./data",
                                                   train=False,
                                                   transform=trans,
                                                   download=True)
    return (data.DataLoader(mnist_train, batch_size, shuffle=True,
                            num_workers=get_dataloader_workers()),
            # The test set does not need to be shuffled
            data.DataLoader(mnist_test, batch_size, shuffle=False,
                            num_workers=get_dataloader_workers()))

train_iter, test_iter = load_data_fashion_mnist(batch_size=batch_size)

def evaluate_accuracy_gpu(net, data_iter, device=None):  #@save
    """Compute the accuracy of the model on a dataset using a GPU."""
    if isinstance(net, torch.nn.Module):
        net.eval()  # switch to evaluation mode
        if not device:
            device = next(iter(net.parameters())).device
    # Number of correct predictions, total number of predictions
    metric = d2l.Accumulator(2)
    for X, y in data_iter:
        if isinstance(X, list):
            # Needed for BERT fine-tuning (covered later)
            X = [x.to(device) for x in X]
        else:
            X = X.to(device)
        y = y.to(device)
        metric.add(d2l.accuracy(net(X), y), y.numel())
    return metric[0] / metric[1]


#@save
def train(net, train_iter, test_iter, num_epochs, lr, device):
    """Train the model on a GPU."""
    def init_weights(m):
        if type(m) == nn.Linear or type(m) == nn.Conv2d:
            nn.init.xavier_uniform_(m.weight)

    net.apply(init_weights)
    print('training on', device)
    net.to(device)  # move the network to the GPU
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    loss = nn.CrossEntropyLoss()
    animator = d2l.Animator(xlabel='epoch', xlim=[1, num_epochs],
                            legend=['train loss', 'train acc', 'test acc'])
    timer, num_batches = d2l.Timer(), len(train_iter)
    for epoch in range(num_epochs):
        # Sum of training loss, number of correct training predictions, number of examples
        metric = d2l.Accumulator(3)
        net.train()
        for i, (X, y) in enumerate(train_iter):
            timer.start()
            optimizer.zero_grad()
            X, y = X.to(device), y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            l.backward()
            optimizer.step()
            with torch.no_grad():
                metric.add(l * X.shape[0], d2l.accuracy(y_hat, y), X.shape[0])
            timer.stop()
            train_l = metric[0] / metric[2]
            train_acc = metric[1] / metric[2]
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (train_l, train_acc, None))
        test_acc = evaluate_accuracy_gpu(net, test_iter)
        animator.add(epoch + 1, (None, None, test_acc))
    print(f'loss {train_l:.3f}, train acc {train_acc:.3f}, '
          f'test acc {test_acc:.3f}')
    print(f'{metric[2] * num_epochs / timer.sum():.1f} examples/sec '
          f'on {str(device)}')
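
The training code relies on two helpers from the d2l package, d2l.accuracy and d2l.Accumulator. For readers without d2l installed, the sketch below shows simplified stand-ins that behave the way the code above uses them. These are my own approximations, not the library's source.

```python
import torch


def accuracy(y_hat: torch.Tensor, y: torch.Tensor) -> float:
    """Number of rows whose arg-max class equals the label."""
    return float((y_hat.argmax(dim=1) == y).sum())


class Accumulator:
    """Keep running sums over n quantities."""

    def __init__(self, n: int):
        self.data = [0.0] * n

    def add(self, *args):
        self.data = [a + float(b) for a, b in zip(self.data, args)]

    def __getitem__(self, idx):
        return self.data[idx]


# Toy batch: 4 examples, 3 classes (values made up for illustration)
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 3.0],
                       [0.5, 1.5, 0.1],
                       [1.0, 0.9, 0.8]])
labels = torch.tensor([0, 2, 0, 0])

metric = Accumulator(2)  # correct predictions, total predictions
metric.add(accuracy(logits, labels), labels.numel())
print(metric[0] / metric[1])  # 0.75
```

Accumulating counts rather than per-batch averages keeps the final ratio correct even when the last batch is smaller than the others.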

lr, num_epochs, batch_size = 0.05, 10, 128
train_iter, test_iter = load_data_fashion_mnist(batch_size, resize=224)
train(net, train_iter, test_iter, num_epochs, lr, d2l.try_gpu())

loss 0.207, train acc 0.922, test acc 0.912
1425.0 examples/sec on cuda:0
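
The reported throughput can be turned into a rough time estimate. The sketch below assumes the standard 60,000-image Fashion-MNIST training set. The timer in train only wraps the forward, backward and update steps, so this is compute time and excludes data loading and evaluation.

```python
examples_per_sec = 1425.0  # reported above
train_examples = 60_000    # size of the Fashion-MNIST training set
num_epochs = 10

seconds_per_epoch = train_examples / examples_per_sec
total_seconds = seconds_per_epoch * num_epochs
print(f"{seconds_per_epoch:.1f} s per epoch, {total_seconds / 60:.1f} min in total")
# 42.1 s per epoch, 7.0 min in total
```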

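import matplotlib.pyplot as plt

# Check predictions on test data
def predict(net, test_iter, n=16):  #@save
    """Predict labels for the first n test images and plot them in a 4x4 grid."""
    X, y = next(iter(test_iter))  # take a single batch

    device = next(net.parameters()).device  # run on the device the model lives on
    net.eval()  # disable dropout for inference
    with torch.no_grad():
        preds = d2l.get_fashion_mnist_labels(net(X.to(device)).argmax(axis=1))
    trues = d2l.get_fashion_mnist_labels(y)
    # Title: true label on the first line, predicted label on the second
    titles = [true + '\n' + pred for true, pred in zip(trues, preds)]

    # Plot
    fig, ax = plt.subplots(nrows=4, ncols=4, sharex=True, sharey=True)
    ax = ax.flatten()
    for i in range(min(n, len(ax))):
        img = X[i].reshape(224, 224)  # images were resized to 224x224 when loading
        ax[i].imshow(img)
        ax[i].set(title=titles[i])
    ax[0].set_xticks([])
    ax[0].set_yticks([])
    plt.tight_layout()
    plt.show()


predict(net, test_iter)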

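2.4 Summary

On Fashion-MNIST, the VGG network reaches a test accuracy of about 91% (0.912 in the run above), higher than the LeNet and AlexNet results mentioned in the two earlier blog posts.
Training ran on a GPU at 1425 examples/s, faster than AlexNet and slower than LeNet. Note, however, that this speed is for a reduced network: my GPU ran out of memory, so I divided the number of output channels by 16.
One takeaway from VGG: if we treat network components as building blocks, designing a network becomes much like assembling a small building out of bricks.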
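
The building-block view means that deeper variants only differ in conv_arch. As an illustration, the sketch below lists configurations in the same (num_convs, out_channels) format; the VGG-16 and VGG-19 entries follow the per-block layer counts of configurations D and E in the VGG paper, and the dictionary name vgg_configs is my own.

```python
vgg_configs = {
    "VGG-11": ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512)),
    "VGG-16": ((2, 64), (2, 128), (3, 256), (3, 512), (3, 512)),
    "VGG-19": ((2, 64), (2, 128), (4, 256), (4, 512), (4, 512)),
}

# Depth = conv layers across all blocks + 3 fully connected layers
depths = {name: sum(n for n, _ in arch) + 3
          for name, arch in vgg_configs.items()}
print(depths)  # {'VGG-11': 11, 'VGG-16': 16, 'VGG-19': 19}
```

Any of these tuples could be passed to the vgg function from Section 2.2 in place of conv_arch, at a correspondingly higher compute and memory cost.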

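3. Evolution of the Architectures

LeNet (1995)

2 convolution + pooling layers
2 fully connected layers

AlexNet

Bigger and deeper
ReLU, Dropout, data augmentation

VGG

A bigger and deeper AlexNet (repeated VGG blocks)

Comparison of classification models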

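The vertical axis shows accuracy and the horizontal axis shows the number of samples processed per unit time (throughput). The comparison shows that VGG has lower throughput but higher accuracy.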