CVPR-2018
Building deeper and larger CNNs is a primary trend for solving major visual recognition tasks, but such CNNs are heavy, requiring computation at billions of FLOPs. The authors instead pursue the best accuracy under very limited computational budgets of tens to hundreds of MFLOPs, i.e., a lightweight CNN, focusing on common mobile platforms such as drones, robots, and smartphones.
Inspired by MobileNet (depthwise separable convolution) and ResNeXt (group convolution), and targeting two drawbacks, namely that 1×1 (pointwise) convolutions are computationally expensive and that group convolution allows no information exchange between groups, the authors propose pointwise group convolution and channel shuffle.
Note: depthwise separable convolution = depthwise convolution + pointwise convolution.
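As a quick illustration (a minimal sketch, not from the paper; the input shape and channel counts below are arbitrary placeholders), a depthwise separable 3×3 convolution in Keras looks like this:

```python
from keras.layers import Conv2D, DepthwiseConv2D, Input
from keras.models import Model

# placeholder input: 56x56 feature map with 64 channels
inputs = Input(shape=(56, 56, 64))
# depth-wise convolution: one 3x3 filter per input channel, no cross-channel mixing
x = DepthwiseConv2D(kernel_size=3, padding='same')(inputs)
# point-wise convolution: 1x1 conv that mixes information across channels
x = Conv2D(filters=128, kernel_size=1)(x)
model = Model(inputs, x)
```

The expensive part here is the pointwise 1×1 convolution, which is exactly what ShuffleNet turns into a group convolution.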
Efficient Model Designs
GoogLeNet, SqueezeNet, ResNet, SENet, NASNet
Group Convolution
AlexNet, ResNeXt, Xception, MobileNet
Channel Shuffle Operation
Two prior works are mentioned: cuda-convnet [20] (shuffle first, then group convolution) and [41]; the authors remark that [41] "did not specially investigate the effectiveness of channel shuffle itself and its usage in tiny model design" (well played).
GConv stands for group convolution. (a) is the conventional stack of GConvs; (b) and (c) are equivalent and differ from (a) in that the feature map is shuffled along the channel dimension between the two GConvs. Why?
Group convolution has a side effect: outputs from a certain channel are only derived from a small fraction of input channels (the "no cross talk" annotation in the figure above).
This property blocks information flow between channel groups and weakens representation.
Hence the authors' improvements in (b) and (c). Neat!
1) Improved bottleneck unit
As in ResNet, the unit is still the 1×1 → 3×3 → 1×1 combination, except that the 1×1 convolutions become GConvs and the 3×3 convolution becomes a depthwise convolution, so the trailing 3×3 → 1×1 pair forms a depthwise separable convolution (a rough Keras sketch follows below).
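To make the unit concrete, here is a rough Keras sketch of the stride-1 ShuffleNet unit (my own sketch, not the authors' code): the grouped 1×1 convolution is emulated with a channel split plus one Conv2D per group, and `channel_shuffle` is the function quoted at the end of this post.

```python
from keras import backend as K
from keras.layers import (Activation, Add, BatchNormalization, Concatenate,
                          Conv2D, DepthwiseConv2D, Lambda)

def group_conv1x1(x, out_channels, groups):
    """1x1 group conv: split channels into groups, convolve each slice, concat.
    Assumes the channel counts are divisible by `groups`."""
    in_channels = K.int_shape(x)[-1]
    group_in = in_channels // groups
    group_out = out_channels // groups
    outs = []
    for g in range(groups):
        sliced = Lambda(lambda t, i=g: t[:, :, :, i * group_in:(i + 1) * group_in])(x)
        outs.append(Conv2D(group_out, 1, use_bias=False)(sliced))
    return Concatenate()(outs)

def shufflenet_unit(x, groups=3, bottleneck_ratio=4):
    """Stride-1 ShuffleNet unit: GConv -> shuffle -> DWConv -> GConv -> add."""
    in_channels = K.int_shape(x)[-1]
    mid_channels = in_channels // bottleneck_ratio
    shortcut = x
    # 1x1 point-wise group conv + BN + ReLU
    x = group_conv1x1(x, mid_channels, groups)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    # channel shuffle (channel_shuffle is defined at the end of this post)
    x = Lambda(channel_shuffle, arguments={'groups': groups})(x)
    # 3x3 depth-wise conv + BN (no ReLU after the depth-wise conv)
    x = DepthwiseConv2D(3, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    # 1x1 point-wise group conv + BN, then the identity shortcut
    x = group_conv1x1(x, in_channels, groups)
    x = BatchNormalization()(x)
    x = Add()([shortcut, x])
    return Activation('relu')(x)
```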
2) Computational cost comparison
Comparing the computational costs: assume an input of size c × h × w, bottleneck channels = m, and number of groups = g.
| | Parameters | Computational complexity |
|---|---|---|
| ResNet | $2cm + 9m^2$ | $hw(2cm + 9m^2)$ |
| ResNeXt | $2cm + 9m^2/g$ | $hw(2cm + 9m^2/g)$ |
| ShuffleNet | $2cm/g + 9m$ | $hw(2cm/g + 9m)$ |
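To make the table concrete, here is a quick numerical sanity check of the three complexity formulas (the values of c, m, h, w, g below are arbitrary illustrative numbers, not taken from the paper):

```python
# illustrative numbers: input c x h x w, bottleneck channels m, g groups
c, m, h, w, g = 240, 60, 28, 28, 3

resnet     = h * w * (2 * c * m + 9 * m ** 2)       # two dense 1x1 convs + dense 3x3
resnext    = h * w * (2 * c * m + 9 * m ** 2 // g)  # the 3x3 becomes a group conv
shufflenet = h * w * (2 * c * m // g + 9 * m)       # group 1x1 convs + depth-wise 3x3

print(resnet, resnext, shufflenet)
# ShuffleNet's cost is far below the other two for the same c and m,
# which is why it can afford wider feature maps at equal complexity.
```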
3) Structure when the resolution is reduced
The differences are as follows:
The second point is quite clever: the add is replaced with a concatenate to double the channels. Then again, having seen the Inception family, this form is nothing to be shocked about! (A sketch of this stride-2 unit follows below.)
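For reference, a hedged sketch of the stride-2 (downsampling) unit, reusing the imports, `group_conv1x1`, and `channel_shuffle` from the stride-1 sketch above; this is my own approximation of the unit, not the authors' implementation. The shortcut goes through a 3×3 average pooling with stride 2 and is concatenated with the residual branch instead of being added, so the output channel count increases.

```python
from keras.layers import AveragePooling2D

def shufflenet_unit_stride2(x, out_channels, groups=3, bottleneck_ratio=4):
    """Stride-2 ShuffleNet unit: avg-pool shortcut + concatenate instead of add."""
    in_channels = K.int_shape(x)[-1]
    mid_channels = out_channels // bottleneck_ratio
    # shortcut path: 3x3 average pooling with stride 2 (keeps in_channels)
    shortcut = AveragePooling2D(pool_size=3, strides=2, padding='same')(x)
    # residual path: GConv -> shuffle -> stride-2 DWConv -> GConv
    x = group_conv1x1(x, mid_channels, groups)
    x = BatchNormalization()(x)
    x = Activation('relu')(x)
    x = Lambda(channel_shuffle, arguments={'groups': groups})(x)
    x = DepthwiseConv2D(3, strides=2, padding='same', use_bias=False)(x)
    x = BatchNormalization()(x)
    # residual branch outputs (out_channels - in_channels) channels so that the
    # concatenation with the shortcut yields out_channels in total
    x = group_conv1x1(x, out_channels - in_channels, groups)
    x = BatchNormalization()(x)
    x = Concatenate()([shortcut, x])
    return Activation('relu')(x)
```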
4) On the use of depthwise separable convolution
We know that the depthwise convolution is used for the 3×3 convolution. It is worth noting that although it greatly reduces the number of parameters,
"we find it difficult to efficiently implement on low power mobile devices, which may result from a worse computation/memory access ratio compared with other dense operations." (Xception also mentions this point.)
So the authors replace the regular 3×3 with a depthwise convolution only inside the bottleneck unit.
The bottleneck ratio is 1:4. Following MobileNet, ShuffleNet 1× can be scaled to ShuffleNet s× (scaling the number of filters by s), which scales the parameters / overall complexity by roughly s². One thing to note: at the same complexity, a different g also implies a different feature-map width inside each bottleneck unit, because a larger g cuts more of the cost and therefore allows more channels.
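A quick back-of-the-envelope check of the s× scaling (illustrative numbers only, not taken from the paper): scaling the number of filters by s scales both c and m, and since the pointwise group-convolution terms dominate, the per-unit cost shrinks by roughly s²:

```python
def unit_flops(c, m, h, w, g):
    """Complexity of one ShuffleNet unit: hw(2cm/g + 9m)."""
    return h * w * (2 * c * m / g + 9 * m)

# illustrative numbers
base = unit_flops(c=240, m=60, h=28, w=28, g=3)
s = 0.5
scaled = unit_flops(c=240 * s, m=60 * s, h=28, w=28, g=3)
print(scaled / base)  # ~0.26, close to s**2 = 0.25 since the point-wise terms dominate
```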
1)Pointwise Group Convolutions
Smaller models tend to benefit more from groups (at the same complexity they get wider feature maps). As the number of groups grows, ShuffleNet 1× improves by 1.2%, ShuffleNet 0.5× by 3.5%, and ShuffleNet 0.25× by 4.4% (comparing the most accurate group setting against g = 1).
Going from ShuffleNet 1× down to ShuffleNet 0.25×, one can see that the smaller the model, the more it benefits from the enlarged feature maps.
2)Channel Shuffle vs. No Shuffle
Cross-group information interchange, compared under the three complexity levels.
The more groups there are, the more pronounced the benefit of shuffle.
Comparing performance at the same complexity:
At the same complexity, the structure with fewer parameters can afford wider feature maps, hence higher accuracy.
1) MobileNet vs. ShuffleNet head-to-head
Comparing performance at similar complexity:
It is clear that ShuffleNet uses its parameters more efficiently and has a better-optimized architecture design.
Although the ShuffleNet network is specially designed for small models (< 150 MFLOPs), it still shows its strength above 150 MFLOPs.
The last row, "shallow", means halving the number of bottleneck units in stages 2-4.
Note the "with SE" rows:
with Squeeze-and-Excitation (SE) added, inference on mobile becomes 25-40% slower.
2) Comparison with other common structures
Comparing complexity at similar accuracy:
Giving it a run on COCO (object detection):
We conjecture that this significant gain is partly due to ShuffleNet’s simple design of architecture without bells and whistles.
Trying it on a mobile device (an ARM platform): the theoretical speedup and the actually measured speedup still differ somewhat.
Empirically g = 3 usually has a proper trade-off between accuracy and actual inference time.
It achieves ~13× actual speedup over AlexNet (~18× theoretical) at comparable accuracy.
A few implementation notes:
```python
import numpy as np
from keras import backend as K


def channel_shuffle(x, groups):
    """
    Parameters
    ----------
    x: Input tensor with `channels_last` data format
    groups: int, number of groups per channel

    Returns
    -------
    channel shuffled output tensor

    Examples
    --------
    Example for a 1D array with 3 groups
    >>> d = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8])
    >>> x = np.reshape(d, (3, 3))
    >>> x = np.transpose(x, [1, 0])
    >>> x = np.reshape(x, (9,))
    '[0 1 2 3 4 5 6 7 8] --> [0 3 6 1 4 7 2 5 8]'
    """
    height, width, in_channels = x.shape.as_list()[1:]
    channels_per_group = in_channels // groups

    # reshape to (batch, h, w, groups, channels_per_group), swap the last two
    # axes, then flatten back: this interleaves channels across the groups
    x = K.reshape(x, [-1, height, width, groups, channels_per_group])
    x = K.permute_dimensions(x, (0, 1, 2, 4, 3))  # transpose
    x = K.reshape(x, [-1, height, width, in_channels])

    return x
```
The way to call such a self-defined layer in Keras is:

```python
x = Lambda(channel_shuffle, arguments={'groups': groups}, name='%s/channel_shuffle' % prefix)(x)
```

That is, wrap the function with `Lambda` and pass the extra parameters via `arguments`!