The Derivative of Cross Entropy with Softmax

Softmax function

The softmax function normalizes the entries of a vector into [0, 1] (and they sum to 1). It usually appears in classification problems. With softmax, a vector with large values is projected into a small range, from 0 to 1, which helps avoid gradient explosion and vanishing.

The formula of softmax is as follows:

$$z = [z_1, z_2, \cdots, z_n] \\ a = \text{softmax}(z) = \left[\frac{e^{z_1}}{\sum_k e^{z_k}}, \frac{e^{z_2}}{\sum_k e^{z_k}}, \cdots, \frac{e^{z_n}}{\sum_k e^{z_k}}\right]$$
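As a concrete illustration, here is a minimal NumPy sketch of softmax (the max-subtraction step is an extra numerical-stability trick I've assumed, not part of the formula above; it does not change the result because softmax is invariant to adding a constant to every entry):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged
    # because e^(z_i - c) / sum e^(z_k - c) = e^z_i / sum e^z_k.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)          # ~[0.659 0.242 0.099]
print(a.sum())    # 1.0
```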

By the way, the softmax function can be considered a high-dimensional generalization of the sigmoid function, and the sigmoid function can be considered a 2-dimensional version of softmax. The formula of sigmoid is as follows:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

[Figure: sigmoid function from -5 to 5]
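To make that connection concrete, a small sketch (reusing the NumPy softmax above) checks that softmax over the two logits $[x, 0]$ reproduces $\sigma(x)$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

# softmax([x, 0]) = [sigmoid(x), sigmoid(-x)], i.e. sigmoid is the
# 2-class special case of softmax.
x = 1.5
print(sigmoid(x))                      # ~0.8176
print(softmax(np.array([x, 0.0]))[0])  # same value
```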

Cross entropy

The cross-entropy of two probability distributions $p$ and $q$ is defined as follows:

$$H(p, q) = -\sum_i p_i\log q_i$$

In classification problems, $p$ is usually a one-hot vector, which has exactly one position equal to 1 and all others equal to 0. So, if $k$ is the index of that position, the definition of cross-entropy simplifies as follows:

$$H(p, q) = -p_k\log q_k = -\log q_k$$

We also use cross-entropy as the loss function. It is very intuitive: as $q_k$ increases or decreases, the loss changes in the opposite direction.
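A minimal sketch of this simplification (the predicted distribution `q` and the one-hot label `p` below are made-up example values):

```python
import numpy as np

def cross_entropy(p, q):
    # General definition: H(p, q) = -sum_i p_i * log(q_i)
    return -np.sum(p * np.log(q))

q = np.array([0.659, 0.242, 0.099])  # predicted probabilities, e.g. a softmax output
p = np.array([0.0, 1.0, 0.0])        # one-hot label, true class k = 1

# With a one-hot p, only the term at index k survives, so H(p, q) = -log(q_k).
print(cross_entropy(p, q))           # ~1.419
print(-np.log(q[1]))                 # same value
```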

Calculate the derivative of the loss function

The simplest classification network is as follows:

[Figure: cross_entropy — the simplest classification network]

Here $z$ is the output of the preceding part of the network, $a$ is the result of applying softmax to $z$, $\hat y$ is the correct answer, and $loss$ is the cross entropy of $a$ and $\hat y$.

The formulas are as follows:

$$\begin{split} loss & = \text{cross entropy}(a, \hat y) \\ & = \text{cross entropy}(\text{softmax}(z), \hat y) \end{split}$$

In order to perform backpropagation, we need to calculate the derivative of $loss$:

$$\frac{\partial l}{\partial \mathbf z} = \frac{\partial l}{\partial \mathbf a}\frac{\partial \mathbf a}{\partial \mathbf z}$$

Here $l$ is a scalar, while $\mathbf a$ and $\mathbf z$ are vectors.

First, we need to know how to calculate the derivative of a scalar $y$ with respect to a vector $\mathbf x$:

$$\frac{\partial y}{\partial \mathbf x} = \left[\frac{\partial y}{\partial x_1},\frac{\partial y}{\partial x_2},\cdots, \frac{\partial y}{\partial x_n}\right]$$

Second, how to calculate the derivative of a vector $\mathbf y$ with respect to a scalar $x$:

$$\frac{\partial \mathbf y}{\partial x} = \begin{bmatrix}\frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x} \\ \vdots \\ \frac{\partial y_n}{\partial x}\end{bmatrix}$$

And finally, how to calculate the derivative of a vector $\mathbf y$ with respect to a vector $\mathbf x$:

$$\frac{\partial \mathbf y}{\partial \mathbf x} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_n}\end{bmatrix}$$

The first two can be derived as special cases of the last one.

According to the above formulas, we have the following results:

$$\begin{split} \frac{\partial l}{\partial \mathbf a} & = \left[\frac{\partial l}{\partial a_1},\frac{\partial l}{\partial a_2},\cdots, \frac{\partial l}{\partial a_k}, \cdots, \frac{\partial l}{\partial a_n}\right] \\ & = \left[0, 0, \cdots, -\frac{1}{a_k}, \cdots, 0\right] \end{split}$$

$$\frac{\partial \mathbf a}{\partial \mathbf z} = \begin{bmatrix} \frac{\partial a_1}{\partial z_1} & \frac{\partial a_1}{\partial z_2} & \cdots & \frac{\partial a_1}{\partial z_n} \\ \frac{\partial a_2}{\partial z_1} & \frac{\partial a_2}{\partial z_2} & \cdots & \frac{\partial a_2}{\partial z_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial a_n}{\partial z_1} & \frac{\partial a_n}{\partial z_2} & \cdots & \frac{\partial a_n}{\partial z_n}\end{bmatrix}$$

Furthermore, when $i = j$,

$$\begin{split} \frac{\partial a_i}{\partial z_j} & = \frac{\partial \frac{e^{z_i}}{\sum_k e^{z_k}}}{\partial z_j} \\ & = \frac{e^{z_i}\sum_k e^{z_k} - e^{z_i}e^{z_i}}{(\sum_k e^{z_k})^2} \\ & = a_i - a_i^2 \\ & = a_i(1-a_i) \end{split}$$

When $i \neq j$,

$$\begin{split} \frac{\partial a_i}{\partial z_j} & = \frac{\partial \frac{e^{z_i}}{\sum_k e^{z_k}}}{\partial z_j} \\ & = \frac{0 - e^{z_i}e^{z_j}}{(\sum_k e^{z_k})^2} \\ & = -a_ia_j \end{split}$$

So,

$$\frac{\partial \mathbf a}{\partial \mathbf z} = \begin{bmatrix} a_1(1-a_1) & -a_1a_2 & \cdots & -a_1a_n \\ -a_2a_1 & a_2(1-a_2) & \cdots & -a_2a_n \\ \vdots & \vdots & \ddots & \vdots \\ -a_na_1 & -a_na_2 & \cdots & a_n(1-a_n) \end{bmatrix}$$
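As a sanity check (a sketch that is not part of the original derivation), the matrix above is exactly $\text{diag}(\mathbf a) - \mathbf a\mathbf a^\top$, and we can compare it against a finite-difference approximation of the Jacobian:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)

# Analytic Jacobian: a_i(1 - a_i) on the diagonal, -a_i a_j off the diagonal.
jac_analytic = np.diag(a) - np.outer(a, a)

# Central finite differences as a numerical reference.
eps = 1e-6
jac_numeric = np.zeros((len(z), len(z)))
for j in range(len(z)):
    dz = np.zeros_like(z)
    dz[j] = eps
    jac_numeric[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.allclose(jac_analytic, jac_numeric, atol=1e-6))  # True
```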

But actually, since only $\frac{\partial l}{\partial a_k} = -\frac{1}{a_k}$ is nonzero and all other components are 0, we only need the row $\frac{\partial a_k}{\partial \mathbf z}$.

Finally, the derivative of $l$ with respect to $\mathbf z$ is as follows:

$$\begin{split} \frac{\partial l}{\partial \mathbf z} & = \frac{\partial l}{\partial \mathbf a}\frac{\partial \mathbf a}{\partial \mathbf z} \\ & = -\frac{1}{a_k}\left[-a_ka_1, \cdots, a_k(1-a_k), \cdots, -a_ka_n\right] \\ & = [a_1, a_2, \cdots, a_k-1, \cdots, a_n] \\ & = \mathbf a - \hat{\mathbf y} \end{split}$$

What a beautiful answer!
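As a final check (again only a sketch, with a made-up one-hot label), the analytic gradient $\mathbf a - \hat{\mathbf y}$ agrees with a finite-difference gradient of the loss:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(z, y):
    # Cross entropy of softmax(z) against a one-hot label y.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])  # one-hot label, true class k = 1

grad_analytic = softmax(z) - y

# Central finite differences as a numerical reference.
eps = 1e-6
grad_numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    grad_numeric[i] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(grad_analytic)
print(np.allclose(grad_analytic, grad_numeric, atol=1e-6))  # True
```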

End.
