My Machine Learning Side Quest: Loss Functions (CE Loss Function)

Author: 爱喝兽奶帝天荒 | 2024-08-01 03:21:51


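Semantic Segmentation

Semantic segmentation combines image classification, object detection, and image segmentation. It partitions an image into regions that each carry semantic meaning and identifies the semantic class of every region, carrying out semantic inference from the low level to the high level, and finally produces a segmentation map with a semantic label for every pixel. The goal when designing a loss function is for the loss and its gradient to change together, where the gradient is taken with respect to the output of the network's last weighted layer. With a constant learning rate, we want a large loss and a large gradient when the prediction is far from the ground truth, and a small loss and a small gradient when the prediction is close to it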

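Cross-Entropy-Based Loss Functions

1. Loss Function CE for Multi-Class Tasks

The most commonly used loss function is the pixel-wise cross-entropy loss (cross entropy loss, ce). It examines every pixel individually and compares the class prediction for that pixel (a probability distribution vector) with the one-hot encoded label vector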

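Suppose each pixel can belong to one of $5$ classes; the predicted probability distribution vector then also has length $5$
The loss for each pixel is
$\pmb{loss_{pixel}}=-\sum_{class}y_{true}^{class}log(y_{pred}^{class})$
Let $y_{pred}=softmax(x)$. The back-propagated gradient with respect to each logit is then $\frac{\partial(loss_{pixel})}{\partial x^{class}}=y_{pred}^{class}-y_{true}^{class}$, which is exactly the prediction error of that class, so during optimization the gradient is small when the loss is small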
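The softmax cross-entropy gradient can be checked numerically. The sketch below is only an illustration: the logits are made-up values for a single pixel with 5 classes, and the names `x`, `y_true`, and `analytic_grad` are chosen here, not taken from the original post. It compares the gradient from autograd with the closed form $y_{pred}-y_{true}$.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for one pixel with 5 classes (made-up values)
x = torch.tensor([2.0, 1.0, 0.1, -1.0, 0.5], requires_grad=True)
# One-hot label: the true class is index 1
y_true = torch.tensor([0.0, 1.0, 0.0, 0.0, 0.0])

# Per-pixel cross entropy computed directly from the formula
y_pred = F.softmax(x, dim=0)
loss_pixel = -(y_true * torch.log(y_pred)).sum()
loss_pixel.backward()

# Closed-form gradient with respect to the logits
analytic_grad = (y_pred - y_true).detach()

print("autograd:", x.grad)
print("analytic:", analytic_grad)
```

Only the entry of the true class is negative; the other entries equal the predicted probabilities, so all entries shrink toward zero as the prediction approaches the one-hot label.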

The loss for the whole image is the mean of the losses over all pixels
$\pmb{loss_{ce}}=\frac{1}{n}\sum_{pixel=1}^{n}loss_{pixel}$

F.cross_entropy(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction, label_smoothing=self.label_smoothing)

PyTorch API
This case is equivalent to the combination of torch.nn.LogSoftmax and torch.nn.NLLLoss.
$\text{LogSoftmax}(x_{i}) = \log\left(\frac{\exp(x_i) }{ \sum_j \exp(x_j)} \right)$
$\text{NLL}(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_{y_n} x_{n,y_n}, \quad w_{c} = \text{weight}[c] \cdot \mathbb{1}\{c \not= \text{ignore\_index}\}$
It is useful when training a classification problem with C classes.
If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is
particularly useful when you have an unbalanced training set.
The input is expected to contain raw, unnormalized scores for each class.
input has to be a Tensor of size (C) for unbatched input, (N, C) or (N, C, d_1, d_2, ..., d_K) with $K\geq 1$ for the K-dimensional case.
$\begin{aligned} C &= \text{number of classes} \\ N &= \text{batch size} \end{aligned}$
The target that this criterion expects should contain either of the following.
Class indices in the range [0, C), where C is the number of classes (not one-hot; dtype is long). If ignore_index is specified, the loss also accepts this class index (this index may not necessarily be in the class range).
Class probabilities, with the same shape as the input and each value in [0, 1] (dtype is float).
The unreduced (i.e. with reduction set to 'none') loss for this case can be described as
$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_{y_n} \log \frac{\exp(x_{n,y_n})}{\sum_{c=1}^C \exp(x_{n,c})} \cdot \mathbb{1}\{y_n \not= \text{ignore\_index}\}$
x is the input, y is the target, w is the weight,
C is the number of classes, and N spans the minibatch dimension as well as d_1, ..., d_k for the K-dimensional case.
The performance of this criterion is generally better when target contains class
indices, as this allows for optimized computation. Consider providing target as
class probabilities only when a single class label per minibatch item is too restrictive.
Output: if reduction is 'none', same shape as the target; otherwise, a scalar.
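The effect of `weight` and `ignore_index` can be seen on a tiny batch. The following is a minimal sketch with made-up logits and hypothetical class weights; it recomputes the default 'mean' reduction by hand, where the weighted losses of the kept samples are divided by the sum of the weights actually used.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)                 # N=4 samples, C=3 classes (made-up data)
target = torch.tensor([0, 2, 1, -100])     # the last sample carries ignore_index
weight = torch.tensor([1.0, 2.0, 0.5])     # hypothetical per-class weights

loss = F.cross_entropy(logits, target, weight=weight, ignore_index=-100)

# Manual version: weighted negative log-probabilities of the kept samples,
# normalized by the sum of the weights of those samples
log_probs = F.log_softmax(logits, dim=1)
keep = target != -100
kept_target = target[keep]
picked = log_probs[keep].gather(1, kept_target.unsqueeze(1)).squeeze(1)
w = weight[kept_target]
manual = -(w * picked).sum() / w.sum()

print(float(loss), float(manual))
```

The ignored sample contributes neither to the numerator nor to the normalizer, which is why `-100` may lie outside the class range.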

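Mathematically, torch.nn.CrossEntropyLoss is equivalent to torch.nn.LogSoftmax followed by torch.nn.NLLLoss, but the API implementations differ: the targets of torch.nn.NLLLoss cannot be probability values, whereas those of torch.nn.CrossEntropyLoss can. In that sense torch.nn.CrossEntropyLoss covers a superset of the cases handled by the LogSoftmax plus NLLLoss combination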

import numpy as np
import torch
import torch.nn as nn

ce = nn.CrossEntropyLoss()
ls = nn.LogSoftmax(dim=1)
nll = nn.NLLLoss()


# Re-implement the API by hand to check our understanding of it
def cross_entropy(inputs, targets):
    inputs = inputs.numpy()
    targets = targets.numpy()
    outputs = 0.
    weight = 1.
    if targets.dtype == np.int64:
        # Mode 1: targets hold class indices, shape (N, H, W)
        assert inputs.ndim == 4 and targets.ndim == 3
        for k in range(targets.shape[0]):
            for i in range(targets.shape[-2]):
                for j in range(targets.shape[-1]):
                    logits = inputs[k, :, i, j]
                    log_softmax = np.log(np.exp(logits) / np.sum(np.exp(logits)))
                    outputs += -weight * log_softmax[int(targets[k, i, j])]
    elif targets.dtype == np.float32:
        # Mode 2: targets hold class probabilities, same shape as inputs
        assert inputs.shape == targets.shape
        for k in range(targets.shape[0]):
            for i in range(targets.shape[-2]):
                for j in range(targets.shape[-1]):
                    logits = inputs[k, :, i, j]
                    log_softmax = np.log(np.exp(logits) / np.sum(np.exp(logits)))
                    outputs += -weight * np.sum(log_softmax * targets[k, :, i, j])
    else:
        raise TypeError(f'targets must have dtype int64 or float32, not {targets.dtype}')

    # Mean over every pixel of every image in the batch
    return float(outputs / (targets.shape[0] * targets.shape[-2] * targets.shape[-1]))


# Cross-entropy mode 1: target elements are class indices in [0, C-1] -> int64
# Cross-entropy mode 2: target elements are class probabilities in [0, 1] -> float32
inputs = torch.rand(1, 5, 5, 5)
targets = torch.randint(0, 5, (1, 5, 5))
# targets = torch.nn.Softmax(dim=1)(torch.rand(1, 5, 5, 5))

outputs = ce(inputs, targets)
print(f'ce {outputs:.6f}')

if targets.dtype == torch.int64:
    outputs = nll(ls(inputs), targets)
    print(f'logsoftmax+nll {outputs:.6f}')

outputs = cross_entropy(inputs, targets)
print(f'cross_entropy {outputs:.6f}')

"""
Output of one run (the inputs are random, so the value differs between runs,
but the three results agree):
ce 0.725609
logsoftmax+nll 0.725609
cross_entropy 0.725609
"""


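2. Loss Function BCE for Binary Classification Tasks

The binary cross-entropy loss (binary cross entropy loss, bce) applies when the target has only two classes
$\pmb{loss_{bce}}=-y_{true}log(y_{pred})-(1-y_{true})log(1-y_{pred})$
If $y_{pred}=sigmoid(x)$, the back-propagated gradient is $\frac{d(loss_{bce})}{dx}=y_{pred}-y_{true}$, which is proportional to the error, so during optimization the gradient is small when the loss is small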
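The gradient $y_{pred}-y_{true}$ can be confirmed with a finite-difference check that needs only the standard library. This is a small sketch; the helper names `sigmoid`, `bce_from_logit`, and `numeric_grad` and the sample logits are chosen here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_from_logit(x, y_true):
    # Binary cross entropy with y_pred = sigmoid(x)
    y_pred = sigmoid(x)
    return -y_true * math.log(y_pred) - (1 - y_true) * math.log(1 - y_pred)


def numeric_grad(x, y_true, h=1e-6):
    # Central finite difference of the loss with respect to the logit x
    return (bce_from_logit(x + h, y_true) - bce_from_logit(x - h, y_true)) / (2 * h)


for y_true in (0.0, 1.0):
    for x in (-2.0, 0.0, 3.0):
        print(y_true, x, numeric_grad(x, y_true), sigmoid(x) - y_true)
```

In every row the last two columns agree up to the finite-difference error.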

F.binary_cross_entropy(input, target, weight=self.weight, reduction=self.reduction)

PyTorch API
weight (Tensor, optional): a manual rescaling weight given to the loss of each batch element.
This is used for measuring the error of a reconstruction in for example an auto-encoder.
The unreduced (i.e. with reduction set to 'none') loss can be described as
$\ell(x, y) = L = \{l_1,\dots,l_N\}^\top, \quad l_n = - w_n \left[ y_n \cdot \log x_n + (1 - y_n) \cdot \log (1 - x_n) \right],$
N is the batch size.
targets y should be numbers between 0 and 1.
If reduction is not 'none' (default 'mean'), then
$\ell(x, y) = \begin{cases} \operatorname{mean}(L), & \text{if reduction} = \text{'mean'}; \\ \operatorname{sum}(L), & \text{if reduction} = \text{'sum'}. \end{cases}$

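When BCE is used for regression-like tasks such as image reconstruction, the ground-truth labels are not binary and may take any value in $[0, 1]$. For example, the label may hold the probabilities of two classes such as foreground and background, which sum to $1$. In this case the cross entropy is still minimized when the prediction exactly equals the ground-truth label, but the minimum is no longer $0$; it equals the entropy of the label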
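A short numeric illustration of the soft-label case, assuming a made-up label value of $0.3$: scanning a grid of predictions shows that the loss is smallest at the label itself, and that the smallest value is positive.

```python
import math


def bce(y_true, y_pred):
    return -y_true * math.log(y_pred) - (1 - y_true) * math.log(1 - y_pred)


y_true = 0.3  # hypothetical soft label
candidates = [i / 100 for i in range(1, 100)]

# Prediction on the grid that gives the smallest loss
best = min(candidates, key=lambda p: bce(y_true, p))
# Loss at the optimum, which is the entropy of the label
min_loss = bce(y_true, y_true)

print(best, min_loss)
```

With a hard label of 0 or 1 the same scan would push the prediction toward the edge of the grid and the loss toward zero.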

import torch
import numpy as np

bce = torch.nn.BCELoss()


# Re-implement the API by hand to check our understanding of it
def binary_cross_entropy(inputs, targets):
    # Flatten (N, C, H, W) into (N*C, H*W)
    inputs = inputs.numpy()
    inputs = inputs.reshape((inputs.shape[0]*inputs.shape[1], inputs.shape[-2]*inputs.shape[-1]))
    targets = targets.numpy()
    targets = targets.reshape((targets.shape[0]*targets.shape[1], targets.shape[-2]*targets.shape[-1]))
    outputs = 0.
    weight = 1.
    for i in range(targets.shape[0]):
        temp = 0
        for j in range(targets.shape[1]):
            temp += -1. * weight * (targets[i, j]*np.log(inputs[i, j]) + (1-targets[i, j])*np.log(1-inputs[i, j]))
        # Mean over the pixels of one channel
        outputs += (temp / targets.shape[1])

    # Mean over all channels of all images
    return outputs / targets.shape[0]


inputs = torch.rand((1, 2, 2, 2))
targets = torch.tensor([[[[0, 1.], [1., 0]], [[0, 1.], [1., 0]]]])
# targets = torch.nn.Softmax(dim=1)(torch.rand(1, 2, 2, 2))

print(f'bce {bce(inputs, targets):.6f}',
      f'binary_cross_entropy {binary_cross_entropy(inputs, targets):.6f}',
      sep="\n")

"""
Output of one run (the inputs are random, so the value differs between runs,
but the two results agree):
bce 0.586063
binary_cross_entropy 0.586063
"""

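3. Weighted Loss for Imbalanced Sample Counts

Cross-entropy loss evaluates the class prediction of every pixel separately and then averages the loss over all pixels, so in effect every pixel in the image is learned with equal weight. If the classes are unevenly distributed in the image, training may be dominated by the classes with many pixels: the model mainly learns the features of the majority-class samples, and the resulting model is biased toward predicting pixels as that class

The fully convolutional network (FCN) and U-Net papers weight each value of the output probability distribution vector so that the model pays more attention to the under-represented samples, which alleviates the class imbalance within an image

For example, if the ratio of positive to negative samples in a binary classification task is $1 : 99$ and the model predicts every sample as negative, the accuracy is still $99\%$, yet the result is meaningless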
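The $1 : 99$ example can be reproduced in a few lines. This is a toy sketch with made-up labels; recall is added here only to show what accuracy hides.

```python
# 1 positive sample and 99 negative samples
labels = [1] * 1 + [0] * 99
# A degenerate model that predicts every sample as negative
preds = [0] * len(labels)

accuracy = sum(p == t for p, t in zip(preds, labels)) / len(labels)
true_positives = sum(p == 1 and t == 1 for p, t in zip(preds, labels))
recall = true_positives / sum(labels)

print(accuracy, recall)
```

The accuracy is high even though not a single positive sample is found.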

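To compensate for this imbalance, different weights are assigned to the losses of positive and negative samples, which gives the weighted binary classification loss (weighted loss)
$\pmb{loss_{weighted}}=-pos_{weighted}\times y_{true}log(y_{pred})-(1-y_{true})log(1-y_{pred})\\ \pmb{pos_{weighted}}=\frac{neg_{num}}{pos_{num}}$
Let $y_{pred}=sigmoid(x)$. The back-propagated gradient is then $\frac{d(loss_{weighted})}{dx}=(1-y_{true})y_{pred}-pos_{weighted}\times y_{true}(1-y_{pred})$, which is proportional to the error. For a positive sample it is $pos_{weighted}(y_{pred}-1)$, scaled by $pos_{weighted}$ (greater than $1$ when positive samples are the minority), while for a negative sample it is $y_{pred}$, unchanged. The gradient is therefore still small when the loss is small, and the optimization effect of the scarce positive samples is amplified relative to the negative ones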
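In PyTorch this kind of weighting is available through the `pos_weight` argument of `F.binary_cross_entropy_with_logits`. The sketch below uses made-up logits with one positive and three negative samples, and compares the library result with the loss and gradient formulas for a positive-class weight of $neg_{num}/pos_{num}$; the variable names are chosen here for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-1.5, 0.3, 2.0, -0.2], requires_grad=True)  # made-up logits
y_true = torch.tensor([1.0, 0.0, 0.0, 0.0])                   # 1 positive, 3 negatives

neg_num = float((y_true == 0).sum())
pos_num = float((y_true == 1).sum())
pos_weight = torch.tensor([neg_num / pos_num])                # 3.0 for this toy batch

loss = F.binary_cross_entropy_with_logits(
    x, y_true, pos_weight=pos_weight, reduction="sum")
loss.backward()

# The same loss and its gradient written out from the formulas
y_pred = torch.sigmoid(x).detach()
manual_loss = (-pos_weight * y_true * torch.log(y_pred)
               - (1 - y_true) * torch.log(1 - y_pred)).sum()
analytic_grad = (1 - y_true) * y_pred - pos_weight * y_true * (1 - y_pred)

print(float(loss), float(manual_loss))
print(x.grad, analytic_grad)
```

The gradient of the single positive sample is three times what plain BCE would give, while the negative samples are untouched.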

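4. Focal Loss for Imbalance Between Easy and Hard Samples

Sometimes it is not enough to correct for the imbalance in pixel counts between classes; pixels also need to be divided into hard-to-learn and easy-to-learn samples. The model predicts easy samples correctly with little effort, and simply getting the large number of easy samples right already lowers the loss considerably, so the hard samples end up neglected. The model therefore needs to pay more attention to the hard samples

Samples of different difficulty can be given different weights
$-(1-y_{pred})^{\gamma}\times y_{true}log(y_{pred})-y_{pred}^{\gamma}(1-y_{true})\times log(1-y_{pred})$
For example, with $\gamma=2$ and a positive sample, a prediction of $0.95$ is an easy sample: $(1-0.95)^2=0.0025$, so its loss shrinks to $\frac{1}{400}$ of the original. A prediction of $0.5$ is a hard sample: $(1-0.5)^2=0.25$, so its loss shrinks only to $\frac{1}{4}$ of the original, a far smaller relative reduction. Hard samples thus carry more of the total loss, and the model concentrates on learning them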

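Taking both the positive/negative imbalance and the sample difficulty into account gives the focal loss
$\pmb{loss_{focal}}=-\alpha(1-y_{pred})^{\gamma}\times y_{true}log(y_{pred})-(1-\alpha)y_{pred}^{\gamma}(1-y_{true})\times log(1-y_{pred})\\ \pmb{default\;\gamma=2}$
Its gradient behaves similarly to that of the weighted loss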
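The focal loss formula translates directly into code. The sketch below is a scalar, standard-library version meant only to check the down-weighting factors for an easy and a hard positive sample; the default `alpha=0.25` is an assumption made here (the text only fixes $\gamma=2$), and `alpha=1.0` is passed in the check so that only the $(1-y_{pred})^{\gamma}$ factor remains.

```python
import math


def bce(y_true, y_pred):
    return -y_true * math.log(y_pred) - (1 - y_true) * math.log(1 - y_pred)


def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    # alpha=0.25 is an assumed default, not a value given in the text
    pos = -alpha * (1 - y_pred) ** gamma * y_true * math.log(y_pred)
    neg = -(1 - alpha) * y_pred ** gamma * (1 - y_true) * math.log(1 - y_pred)
    return pos + neg


# Ratio of focal loss to plain BCE for a positive sample, with alpha switched off
easy_ratio = focal_loss(1, 0.95, alpha=1.0) / bce(1, 0.95)
hard_ratio = focal_loss(1, 0.5, alpha=1.0) / bce(1, 0.5)

print(easy_ratio, hard_ratio)
```

The easy sample is scaled by roughly $\frac{1}{400}$ and the hard one by $\frac{1}{4}$, matching the worked numbers.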

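Similarity-Based Loss Functions

1. Soft Dice Loss

Another commonly used loss is based on the Dice coefficient (soft dice loss, sd). The coefficient measures the overlap between two samples and ranges over $[0, 1]$, where $1$ means complete overlap
$Dice=\frac{2|A\cap B|}{|A|+|B|}=\frac{2TP}{2TP+FP+FN}$
$|A\cap B|$ denotes the elements common to sets $A$ and $B$, and $|A|$ and $|B|$ denote the number of elements in $A$ and $B$ respectively. The factor $2$ in the numerator keeps the value within $[0, 1]$. In practice $|A\cap B|$ is computed as the element-wise product of the prediction mask and the label mask, followed by a sum over the resulting matrix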
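Both forms of the Dice coefficient can be evaluated on a pair of small binary masks. The masks below are made up for illustration; the intersection is the sum of the element-wise product, and the result is compared with the $TP$/$FP$/$FN$ form and with the harmonic mean of precision and recall.

```python
import numpy as np

# Made-up 3x3 binary masks
pred = np.array([[1, 1, 0],
                 [0, 1, 0],
                 [0, 0, 0]])
true = np.array([[1, 0, 0],
                 [0, 1, 1],
                 [0, 0, 0]])

# Set form: |A ∩ B| is the sum of the element-wise product
intersection = (pred * true).sum()
dice_sets = 2 * intersection / (pred.sum() + true.sum())

# Count form: TP, FP, FN
tp = intersection
fp = pred.sum() - tp
fn = true.sum() - tp
dice_counts = 2 * tp / (2 * tp + fp + fn)

# Harmonic mean of precision and recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(dice_sets, dice_counts, f1)
```

All three numbers coincide, since $|A|=TP+FP$ and $|B|=TP+FN$.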

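In the Dice coefficient, $TP$ is the number of true positives, $FP$ the number of false positives, and $FN$ the number of false negatives. Since $precision=\frac{TP}{TP+FP}$ and $recall=\frac{TP}{TP+FN}$, the Dice coefficient accounts for both precision and recall
Each class must be predicted as a whole so that the prediction for every class overlaps the ground-truth label as much as possible, that is, $TP$ is as large as possible while $FP$ and $FN$ are as small as possible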

For each class, $1 - Dice$ is computed; summing over the classes and averaging gives the final soft dice loss
$\pmb{loss_{sd}}=\frac{1}{n}\sum_{class=1}^{n}\left\{1-\frac{2\sum_{pixel}(y_{true}y_{pred})}{\sum_{pixel}(y_{true}+y_{pred})}\right\}$
For binary classification, let $y_{pred}=sigmoid(x)$. The back-propagated gradient is then
$\frac{d(loss_{sd}^{pixel})}{dy^{pixel}}=\frac{1}{2}\sum_{class=1}^{2}\left\{-\frac{2[y_{true}^{pixel}(y_{true}^{pixel}+y_{pred}^{pixel})-y_{true}^{pixel}y_{pred}^{pixel}]}{(y_{true}^{pixel}+y_{pred}^{pixel})^2}\right\}=\frac{1}{2}\sum_{class=1}^{2}\begin{cases} 0&,y_{true}^{pixel}=0\\ \frac{-2}{(1+y_{pred}^{pixel})^2}&,y_{true}^{pixel}=1 \end{cases}$

$\frac{d(loss_{sd}^{pixel})}{dx^{pixel}}=\frac{d(loss_{sd}^{pixel})}{dy^{pixel}}\times\frac{e^{-x^{pixel}}}{(e^{-x^{pixel}}+1)^2}$

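As $x^{pixel}$ increases, the loss (plotted in blue) tends to zero and the gradient (plotted in red) tends to zero; as $x^{pixel}$ decreases, the loss tends to one and the gradient again tends to zero. This resembles mean squared error (mse) on a sigmoid output: the gradient is small whether the prediction is close to the true value or close to the wrong one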
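The saturation at both ends can be tabulated for a single pixel whose label is $y_{true}=1$. This is a standard-library sketch of the per-pixel Dice term $1-\frac{2y_{pred}}{1+y_{pred}}$ with $y_{pred}=sigmoid(x)$; the function names are chosen here for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dice_pixel_loss(x):
    # Per-pixel Dice term for a pixel whose label is y_true = 1
    y_pred = sigmoid(x)
    return 1.0 - 2.0 * y_pred / (1.0 + y_pred)


def dice_pixel_grad(x):
    # Chain rule: d(loss)/d(y_pred) times d(y_pred)/dx, with sigmoid'(x) = y(1 - y)
    y_pred = sigmoid(x)
    return -2.0 / (1.0 + y_pred) ** 2 * y_pred * (1.0 - y_pred)


for x in (-10.0, 0.0, 10.0):
    print(f"x={x:6.1f}  loss={dice_pixel_loss(x):.6f}  grad={dice_pixel_grad(x):.6f}")
```

The gradient is largest in magnitude near $x=0$ and almost vanishes at $x=-10$, where the prediction is confidently wrong.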

import torch
import torch.nn as nn


def _take_channels(*xs, ignore_channels=None):
    # Drop the channels listed in ignore_channels from every tensor in xs
    if ignore_channels is None:
        return xs
    else:
        channels = [channel for channel in range(xs[0].shape[1]) if channel not in ignore_channels]
        xs = [torch.index_select(x, dim=1, index=torch.tensor(channels).to(x.device)) for x in xs]
        return xs


def _threshold(x, threshold=None):
    # Binarize x when a threshold is given; binarized values carry no gradient
    if threshold is not None:
        return (x > threshold).type(x.dtype)
    else:
        return x


class DiceLoss(nn.Module):
    def __init__(self, eps=1, threshold=None, ignore_channels=None):
        # threshold defaults to None so that probs stay soft and the loss stays differentiable
        super().__init__()
        self.eps = eps
        self.threshold = threshold
        self.ignore_channels = ignore_channels

    def forward(self, probs, targets):
        assert probs.shape[0] == targets.shape[0]

        probs = _threshold(probs, threshold=self.threshold)
        pr, gt = _take_channels(probs, targets, ignore_channels=self.ignore_channels)

        tp = torch.sum(gt * pr)
        fp = torch.sum(pr) - tp
        fn = torch.sum(gt) - tp
        # eps is stored on the instance and smooths the ratio when both masks are empty
        score = (2 * tp + self.eps) / (2 * tp + fn + fp + self.eps)

        # Return the loss 1 - Dice rather than the Dice score itself
        return 1 - score

2. Soft IoU Loss

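The IoU coefficient is also called the Jaccard similarity. Its formula closely resembles that of the Dice coefficient; the difference is that $TP$ is counted only once
$IoU=\frac{TP}{TP+FP+FN}=\frac{|A\cap B|}{|A|+|B|-|A\cap B|}=\frac{|A\cap B|}{|A\cup B|}$
For the mask of each class, $1 - IoU$ is computed; summing over the classes and averaging gives the loss based on the IoU coefficient (soft iou loss, si)
$\pmb{loss_{si}}=\frac{1}{n}\sum_{class=1}^{n}\left\{1-\frac{\sum_{pixel}(y_{true}y_{pred})}{\sum_{pixel}(y_{true}+y_{pred}-y_{true}y_{pred})}\right\}$
Its gradient behaves similarly to that of the soft dice loss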
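A per-class version of the $loss_{si}$ formula can be sketched as follows. This is one possible reading of the formula, not the author's implementation: `soft_iou_loss` and its `eps` smoothing term are names and choices made here, the inputs are random, and the sums run over the batch and spatial dimensions of each class before averaging over classes.

```python
import torch
import torch.nn.functional as F


def soft_iou_loss(probs, onehot, eps=1e-6):
    # probs, onehot: (N, C, H, W); sum over batch and spatial dims for each class
    dims = (0, 2, 3)
    intersection = (probs * onehot).sum(dim=dims)
    union = (probs + onehot - probs * onehot).sum(dim=dims)
    iou = (intersection + eps) / (union + eps)   # eps (assumed) avoids 0/0 for absent classes
    return (1.0 - iou).mean()                    # average of 1 - IoU over classes


torch.manual_seed(0)
logits = torch.randn(2, 3, 4, 4, requires_grad=True)   # made-up network output
labels = torch.randint(0, 3, (2, 4, 4))                # made-up class indices
onehot = F.one_hot(labels, num_classes=3).permute(0, 3, 1, 2).float()

loss = soft_iou_loss(F.softmax(logits, dim=1), onehot)
loss.backward()

# A perfect prediction gives a loss of (almost exactly) zero
perfect = soft_iou_loss(onehot, onehot)
print(float(loss), float(perfect))
```

Because the probabilities are never thresholded, gradients flow back to the logits.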

import torch
import torch.nn as nn


def _take_channels(*xs, ignore_channels=None):
    ...  # same as above
def _threshold(x, threshold=None):
    ...  # same as above


class IouLoss(nn.Module):
    def __init__(self, eps=1, threshold=None, ignore_channels=None):
        # threshold defaults to None so that probs stay soft and the loss stays differentiable
        super().__init__()
        self.eps = eps
        self.threshold = threshold
        self.ignore_channels = ignore_channels

    def forward(self, probs, targets):
        probs = _threshold(probs, threshold=self.threshold)
        pr, gt = _take_channels(probs, targets, ignore_channels=self.ignore_channels)

        intersection = torch.sum(gt * pr)
        union = torch.sum(gt) + torch.sum(pr) - intersection + self.eps
        score = (intersection + self.eps) / union

        # Return the loss 1 - IoU rather than the IoU score itself
        return 1 - score

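Summary

Cross-entropy losses treat every pixel as an independent sample to predict, whereas similarity losses look at the final prediction in a more holistic way. The two families target different situations and each has its strengths and weaknesses; in practice they can be used together so that they complement each other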
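One simple way to use both families together is a weighted sum. The sketch below is an assumption about how such a combination might look, not a recipe from the text: `combined_loss`, `soft_dice_loss`, the `dice_weight` factor, and the `eps` smoothing term are all hypothetical names and choices, and the inputs are random.

```python
import torch
import torch.nn.functional as F


def soft_dice_loss(probs, onehot, eps=1e-6):
    # Per-class soft Dice over batch and spatial dims, averaged over classes
    dims = (0, 2, 3)
    intersection = (probs * onehot).sum(dim=dims)
    denominator = (probs + onehot).sum(dim=dims)
    dice = (2 * intersection + eps) / (denominator + eps)
    return (1.0 - dice).mean()


def combined_loss(logits, labels, dice_weight=1.0):
    # Pixel-wise cross entropy plus a region-level Dice term (dice_weight is hypothetical)
    ce = F.cross_entropy(logits, labels)
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(labels, num_classes=logits.shape[1]).permute(0, 3, 1, 2).float()
    return ce + dice_weight * soft_dice_loss(probs, onehot)


torch.manual_seed(0)
logits = torch.randn(2, 3, 4, 4, requires_grad=True)   # made-up network output
labels = torch.randint(0, 3, (2, 4, 4))                # made-up class indices

loss = combined_loss(logits, labels)
loss.backward()
print(float(loss))
```

Setting `dice_weight=0.0` recovers plain cross entropy, which makes it easy to compare the two settings in an experiment.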

