
Deep Learning: Deep Feedforward Networks (Part 3) - Element-wise Nonlinear Functions

Hidden Units

So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model.
The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

  • Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice).
  • We describe here some of the basic intuitions motivating each type of hidden units. These intuitions can be used to suggest when to try out each of these units. It is usually impossible to predict in advance which will work best. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.
  • Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z)=max{0,z} is not differentiable at z = 0. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly.
  • Hidden units that are not differentiable are usually non-differentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z.
  • Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error (a small sketch of this convention follows this list).
  • The important point is that in practice one can safely disregard the non-differentiability of the hidden unit activation functions described below.
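
As a small illustration of the one-sided derivative convention, here is a minimal NumPy sketch of a hand-written rectifier gradient. The function names are made up for this example, and returning the left derivative (0) at z = 0 is just one common convention, not a statement about any particular library.

    import numpy as np

    def relu(z):
        # g(z) = max{0, z}; not differentiable at z = 0
        return np.maximum(0.0, z)

    def relu_grad(z):
        # Slope 1 where the unit is active, 0 elsewhere. At z = 0 the true
        # derivative is undefined; this sketch returns the left derivative (0)
        # instead of reporting an error.
        return (z > 0).astype(z.dtype)

    z = np.array([-1.0, 0.0, 2.0])
    print(relu(z))       # [0. 0. 2.]
    print(relu_grad(z))  # [0. 0. 1.]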

Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W^T x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
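
To make this generic form concrete, here is a minimal NumPy sketch of a single hidden layer computed this way; the shapes and names (W, b, relu) are chosen for the example only.

    import numpy as np

    def relu(z):
        # Element-wise nonlinearity g(z) = max{0, z}
        return np.maximum(0.0, z)

    def hidden_layer(x, W, b, g=relu):
        # Affine transformation z = W^T x + b, then element-wise g(z)
        z = W.T @ x + b
        return g(z)

    # Example: 4 inputs, 3 hidden units
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 3))   # one column of W per hidden unit
    b = np.full(3, 0.1)           # small positive bias, as recommended for ReLUs below
    x = rng.normal(size=4)
    h = hidden_layer(x, W, b)     # h has shape (3,)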

Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z)=max{0,z}.

  • Rectified linear units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active.
  • The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.
  • Rectified linear units are typically used on top of an affine transformation:

    h = g(W^T x + b)

    When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small, positive value, such as 0.1. This makes it very likely that the rectified linear units will be initially active for most inputs in the training set, allowing the derivatives to pass through.

  • One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero.

  • A variety of generalizations of rectified linear units guarantee that they receive gradient everywhere.
    (1) Three generalizations of rectified linear units are based on using a non-zero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = 1 to obtain g(z) = |z|. It is used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. (A code sketch of these variants, together with maxout, appears after this list.)
    (2) Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes αi to a small value like 0.01.
    (3) A parametric ReLU, or PReLU, treats α_i as a learnable parameter (He et al., 2015).
    (4) Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:
    g(z)_i = max_{j ∈ G^(i)} z_j
    where G^(i) is the set of indices into the inputs for group i.

    This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.
    A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units.
    With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, absolute value rectification function, or the leaky or parametric ReLU, or can learn to implement a totally different function altogether.
    Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low.
  • Rectified linear units and all of these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear.
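
The following NumPy sketch implements the rectifier and the generalizations listed above: the slope parameter alpha covers the standard ReLU (alpha = 0), absolute value rectification (alpha = 1), the leaky ReLU (small fixed alpha), and the parametric ReLU (learned alpha), while the maxout function takes the maximum over groups of k pre-activations. Names and example values are illustrative only.

    import numpy as np

    def rectified_linear(z, alpha=0.0):
        # g(z, alpha)_i = max(0, z_i) + alpha_i * min(0, z_i)
        # alpha = 0    -> standard ReLU
        # alpha = 1    -> absolute value rectification, g(z) = |z|
        # alpha = 0.01 -> leaky ReLU; a learned alpha gives a parametric ReLU
        return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

    def maxout(z, k):
        # Divide the pre-activations z into groups of k values and output the
        # maximum of each group: g(z)_i = max_{j in G^(i)} z_j
        return z.reshape(-1, k).max(axis=1)

    z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0, -1.0])
    print(rectified_linear(z))          # ReLU
    print(rectified_linear(z, 0.01))    # leaky ReLU
    print(rectified_linear(z, 1.0))     # absolute value rectification
    print(maxout(z, k=3))               # two maxout units, each over 3 values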

Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

    g(z) = σ(z)

or the hyperbolic tangent activation function

    g(z) = tanh(z)

  • Unlike piecewise linear units, sigmoidal units saturate across most of their domain—they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0.
  • The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged.
  • Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.
  • When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to the identity function near 0, training a deep neural network ŷ = w^T tanh(U^T tanh(V^T x)) resembles training a linear model ŷ = w^T U^T V^T x so long as the activations of the network can be kept small. This makes training the tanh network easier. (A short numerical sketch after this list illustrates how strongly both functions' derivatives saturate.)
  • Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
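
The following NumPy sketch (illustrative only) evaluates the derivatives of the two sigmoidal activations at a few points; the numbers make the saturation discussed above concrete, and show that tanh has slope 1 at 0 while the sigmoid's derivative never exceeds 0.25.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def d_sigmoid(z):
        s = sigmoid(z)
        return s * (1.0 - s)           # at most 0.25, and vanishes as |z| grows

    def d_tanh(z):
        return 1.0 - np.tanh(z) ** 2   # equals 1 at z = 0, vanishes as |z| grows

    z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
    print(d_sigmoid(z))   # ~[0.00005, 0.105, 0.25, 0.105, 0.00005]
    print(d_tanh(z))      # ~[0.0000, 0.0707, 1.0, 0.0707, 0.0000]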

Other Hidden Units

Many other types of hidden units are possible, but are used less frequently.
In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones.
During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that new hidden unit types are usually published only if they are clearly demonstrated to provide a significant improvement.

  • If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W^T x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix W of the original layer. The factored approach is to compute h = g(V^T U^T x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low-rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network. (A small worked example of this parameter saving appears after this list.)
  • Softmax units are another kind of unit that is usually used as an output but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch. These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory.
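
Here is a small worked example of the factored linear layer and its parameter count, under assumed sizes n = p = 1000 and q = 50 (all names are illustrative):

    import numpy as np

    n, p, q = 1000, 1000, 50        # inputs, outputs, rank of the factorization

    rng = np.random.default_rng(0)
    U = rng.normal(size=(n, q))     # first layer: purely linear, no activation
    V = rng.normal(size=(q, p))     # second layer, followed by the nonlinearity
    b = np.zeros(p)

    def factored_layer(x):
        # h = g(V^T U^T x + b), using the rectifier as the example g
        return np.maximum(0.0, V.T @ (U.T @ x) + b)

    x = rng.normal(size=n)
    h = factored_layer(x)           # h has shape (p,)

    full_params = n * p             # a single dense W: 1,000,000 parameters
    factored_params = (n + p) * q   # U and V together: 100,000 parameters
    print(full_params, factored_params)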

A few other reasonably common hidden unit types include:

  1. Radial basis function or RBF unit:
    h_i = exp(-(1/σ_i^2) ||W_{:,i} - x||^2)

    This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.
  2. Softplus:
    g(a) = ζ(a) = log(1 + e^a)

    This is a smooth version of the rectifier, originally introduced for function approximation and for the conditional distributions of undirected probabilistic models. The use of the softplus is generally discouraged.
    The softplus demonstrates that the performance of hidden unit types can be very counterintuitive—one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.
  3. Hard tanh:
    g(a) = max(-1, min(1, a))

    This is shaped similarly to the tanh and the rectifier, but unlike the latter it is bounded. (All three of these units are sketched in code below.)
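
The three units above can be written down directly; here is a minimal NumPy sketch (function names and the example template are made up for illustration):

    import numpy as np

    def rbf_unit(x, w, sigma):
        # h = exp(-||w - x||^2 / sigma^2): most active when x is near the template w
        return np.exp(-np.sum((w - x) ** 2) / sigma ** 2)

    def softplus(a):
        # zeta(a) = log(1 + e^a), a smooth version of the rectifier
        return np.logaddexp(0.0, a)   # numerically stable form of log(1 + e^a)

    def hard_tanh(a):
        # max(-1, min(1, a)): shaped like tanh but piecewise linear and bounded
        return np.clip(a, -1.0, 1.0)

    a = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
    print(softplus(a))                                   # smooth, always positive
    print(hard_tanh(a))                                  # clipped to [-1, 1]
    print(rbf_unit(np.zeros(3), np.ones(3), sigma=1.0))  # exp(-3) ~ 0.05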