The derivative of cross entropy with softmax
Softmax function
The softmax function normalizes all entries of a vector into the range [0, 1] (and they sum to 1). It commonly appears in classification problems. With softmax, a vector containing very large numbers can be mapped into a small range, from 0 to 1, which helps keep the subsequent computation numerically stable and avoid exploding or vanishing gradients.
The formula of softmax is as follows:
$$
z = [z_1, z_2, \cdots, z_n] \\
a = \text{softmax}(z) = [\frac{e^{z_1}}{\sum{e^{z_k}}}, \frac{e^{z_2}}{\sum{e^{z_k}}}, \cdots, \frac{e^{z_n}}{\sum{e^{z_k}}}]
$$
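For concreteness, here is a minimal NumPy sketch of this formula. The subtraction of the maximum is an extra detail added here for numerical stability; since softmax is invariant to adding a constant to every entry, it does not change the result.

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; softmax is unchanged by
    # adding a constant to every entry.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.0, 1.0, 0.1])
a = softmax(z)
print(a)        # roughly [0.659, 0.242, 0.099]
print(a.sum())  # 1.0
```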
By the way, the softmax function can be considered a higher-dimensional generalization of the sigmoid function, and the sigmoid function can be considered a two-dimensional version of softmax. The formula of the sigmoid is as follows:
$$
\sigma(x) = \frac{1}{1+e^{-x}}
$$
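To see the relationship concretely, take a two-dimensional input $[x, 0]$; the first component of its softmax is exactly $\sigma(x)$:
$$
\text{softmax}([x, 0])_1 = \frac{e^{x}}{e^{x} + e^{0}} = \frac{1}{1 + e^{-x}} = \sigma(x)
$$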
Cross entropy
The cross-entropy of two probability distributions $p$ and $q$ is defined as follows:
$$
H(p, q) = -\sum p_i\log q_i
$$
In classification problems, $p$ is usually a one-hot vector, which has exactly one entry equal to 1 and all others equal to 0. So the definition of cross-entropy simplifies as follows (where $k$ is the index of the entry that equals 1):
$$
H(p, q)=-p_k\log q_k = -\log q_k
$$
We also use cross-entropy as the loss function. It is very intuitive: as $q_k$ increases or decreases, the loss changes in the opposite direction.
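As a sanity check of this behavior, here is a minimal NumPy sketch of the loss for a one-hot target. The small epsilon is an assumption added only to avoid $\log 0$.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    # p: target distribution (one-hot), q: predicted distribution.
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -np.sum(p * np.log(q + eps))

p = np.array([0.0, 1.0, 0.0])       # the correct class is index 1
q_good = np.array([0.1, 0.8, 0.1])  # high probability on the correct class
q_bad = np.array([0.6, 0.2, 0.2])   # low probability on the correct class

print(cross_entropy(p, q_good))  # ~0.223 (= -log 0.8)
print(cross_entropy(p, q_bad))   # ~1.609 (= -log 0.2)
```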
Calculating the derivative of the loss function
The simplest classification network is as follows:
Here $z$ is the output of the preceding network, $a$ is the result of applying softmax to $z$, $\hat y$ is the correct answer (a one-hot vector), and $loss$ is the cross entropy of $a$ and $\hat y$.
The formulas are as follows:
$$
\begin{equation}
\begin{split}
loss
& = \text{cross entropy}(a, \hat y) \\
& = \text{cross entropy}(\text{softmax}(z), \hat y)
\end{split}
\end{equation}
$$
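Combining the two definitions, the loss for a one-hot target with correct class $k$ can also be written directly in terms of $z$:
$$
loss = -\log\frac{e^{z_k}}{\sum_j e^{z_j}} = -z_k + \log\sum_j e^{z_j}
$$
This form leads to the same result derived below.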
To perform backpropagation, we need to calculate the derivative of the $loss$ with respect to $\mathbf z$:
$$
\begin{equation}
\begin{split}
\frac{\partial l}{\partial \mathbf z}
& = \frac{\partial l}{\partial \mathbf a}\frac{\partial \mathbf a}{\partial \mathbf z}
\end{split}
\end{equation}
$$
Here $l$ is a scalar, while $\mathbf a$ and $\mathbf z$ are vectors.
First, we need to know how to calculate the derivative of a scalar $y$ with respect to a vector $\mathbf x$:
$$
\frac{\partial y}{\partial \mathbf x} = [\frac{\partial y}{\partial x_1},\frac{\partial y}{\partial x_2},\cdots, \frac{\partial y}{\partial x_n}]
$$
Second, how to calculate the derivative of a vector $\mathbf y$ with respect to a scalar $x$:
$$
\frac{\partial \mathbf y}{\partial x} =\begin{bmatrix}\frac{\partial y_1}{\partial x} \\ \frac{\partial y_2}{\partial x}\\ \vdots \\ \frac{\partial y_n}{\partial x}\end{bmatrix}
$$
And finally, how to calculate the derivative of a vector $\mathbf y$ with respect to a vector $\mathbf x$:
$$
\frac{\partial \mathbf y}{\partial \mathbf x} =
\begin{bmatrix}
\frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n}\\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_n}{\partial x_1} & \frac{\partial y_n}{\partial x_2} & \cdots & \frac{\partial y_n}{\partial x_n}\end{bmatrix}
$$
The first two cases can be viewed as special cases of the last one.
According to the formulas above, we obtain the following results:
$$
\begin{split}
\frac{\partial l}{\partial \mathbf a}
& = [\frac{\partial l}{\partial a_1},\frac{\partial l}{\partial a_2},\cdots, \frac{\partial l}{\partial a_k}, \cdots, \frac{\partial l}{\partial a_n}] \\
& = [0, 0, \cdots, -\frac{1}{a_k}, \cdots, 0]
\end{split}
$$
$$
\frac{\partial \mathbf a}{\partial \mathbf z} =
\begin{bmatrix}
\frac{\partial a_1}{\partial z_1} & \frac{\partial a_1}{\partial z_2} & \cdots & \frac{\partial a_1}{\partial z_n}\\
\frac{\partial a_2}{\partial z_1} & \frac{\partial a_2}{\partial z_2} & \cdots & \frac{\partial a_2}{\partial z_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial a_n}{\partial z_1} & \frac{\partial a_n}{\partial z_2} & \cdots & \frac{\partial a_n}{\partial z_n}\end{bmatrix}
$$
Now we compute the entries of this Jacobian. When $i = j$,
$$
\begin{split}
\frac{\partial a_i}{\partial z_j}
& = \frac{\partial \frac{e^{z_i}}{\sum{e^{z_k}}}}{\partial z_j} \\
& = \frac{e^{z_i}\sum e^{z_k} - e^{z_i}e^{z_i}}{(\sum e^{z_k})^2} \\
& = a_i - a_i^2 \\
& = a_i(1-a_i)
\end{split}
$$
When $i \neq j$,
$$
\begin{split}
\frac{\partial a_i}{\partial z_j}
& = \frac{\partial \frac{e^{z_i}}{\sum{e^{z_k}}}}{\partial z_j} \\
& = \frac{0-e^{z_i}e^{z_j}}{(\sum{e^{z_k}})^2} \\
& = -a_ia_j
\end{split}
$$
So,
$$
\frac{\partial \mathbf a}{\partial \mathbf z} =
\begin{bmatrix}
a_1(1-a_1) & -a_1a_2 & \cdots & -a_1a_n\\
-a_2a_1 & a_2(1-a_2) & \cdots & -a_2a_n \\
\vdots & \vdots & \ddots & \vdots \\
-a_na_1 & -a_na_2 & \cdots & a_n(1-a_n) \\
\end{bmatrix}
$$
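As a quick check of this matrix, the sketch below compares the closed form $\operatorname{diag}(\mathbf a) - \mathbf a\mathbf a^{\mathsf T}$ (which is exactly the matrix above) against a finite-difference approximation; the helper functions are re-declared so the snippet runs on its own.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def softmax_jacobian(z):
    # Closed form from above: a_i(1 - a_i) on the diagonal, -a_i a_j elsewhere,
    # i.e. diag(a) - a a^T.
    a = softmax(z)
    return np.diag(a) - np.outer(a, a)

def numerical_jacobian(f, x, eps=1e-6):
    # J[i, j] ≈ ∂y_i / ∂x_j, matching the layout convention above.
    y = f(x)
    J = np.zeros((y.size, x.size))
    for j in range(x.size):
        x_plus = x.copy()
        x_plus[j] += eps
        J[:, j] = (f(x_plus) - y) / eps
    return J

z = np.array([2.0, 1.0, 0.1])
print(np.max(np.abs(softmax_jacobian(z) - numerical_jacobian(softmax, z))))  # tiny, ~1e-7
```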
But actually, since only $\frac{\partial l}{\partial a_k} = -\frac{1}{a_k}$ is nonzero and all the other entries are 0, we only need the $k$-th row of this Jacobian, $\frac{\partial a_k}{\partial \mathbf z}$.
Finally, the derivative of $l$ with respect to $\mathbf z$ is as follows:
$$
\begin{equation}
\begin{split}
\frac{\partial l}{\partial \mathbf z}
& = \frac{\partial l}{\partial \mathbf a}\frac{\partial \mathbf a}{\partial \mathbf z} \\
& = -\frac{1}{a_k}[-a_ka_1, -a_ka_2, \cdots, a_k(1-a_k), \cdots, -a_ka_n] \\
& = [a_1, a_2, \cdots, a_k-1, \cdots, a_n] \\
& = \mathbf a - \hat{\mathbf y}
\end{split}
\end{equation}
$$
What a beautiful answer!
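As a final sanity check, the same finite-difference trick can verify this result end to end; a minimal sketch, assuming a one-hot target $\hat y$ and the softmax and cross-entropy definitions above.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(z, y_hat):
    # Cross entropy of softmax(z) against a one-hot target y_hat.
    return -np.sum(y_hat * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])
y_hat = np.array([0.0, 1.0, 0.0])

analytic = softmax(z) - y_hat  # the result derived above: a - y_hat

# Finite-difference approximation of d(loss)/dz, one coordinate at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.size):
    z_plus = z.copy()
    z_plus[j] += eps
    numeric[j] = (loss(z_plus, y_hat) - loss(z, y_hat)) / eps

print(np.max(np.abs(analytic - numeric)))  # tiny, on the order of 1e-6
```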
End.