Matrix Derivation for Neural Network Formula

I am learning the internals of neural networks, but I have some trouble with the matrix derivation for backpropagation. Suppose that the formula for computing one layer of a neural network, which has been vectorized, is $Z^{[i]} = W^{[i]}A^{[i - 1]} + B^{[i]}$.

For notation,

$Z^{[i]}$ is the score for layer $i$,

$W^{[i]}$ is the weight coefficient for layer $i$,

$A^{[i - 1]}$ is the activation (output) of layer $i - 1$,

$B^{[i]}$ is the bias coefficient for layer $i$.
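
To make the notation concrete, here is a minimal NumPy sketch of the forward step (the layer sizes `n_prev` and `n` and the batch size `m` are illustrative assumptions, not part of the original formula):

```python
import numpy as np

# Illustrative sizes (assumptions): batch size, units in layer i-1, units in layer i
m, n_prev, n = 4, 3, 2

A_prev = np.random.randn(n_prev, m)  # A^{[i-1]}, one column per training example
W = np.random.randn(n, n_prev)       # W^{[i]}
B = np.random.randn(n, 1)            # B^{[i]}, broadcast across the m columns

Z = W @ A_prev + B                   # Z^{[i]} = W^{[i]} A^{[i-1]} + B^{[i]}
print(Z.shape)                       # (2, 4), i.e. (n, m)
```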

For backpropagation, we need to compute $dZ$, $dW$, $dA$, and $dB$ with respect to a cost function $L = f(Z)$, a function of $Z$. In other words, we need to compute $\frac{dL}{dZ}$, $\frac{dL}{dW}$, $\frac{dL}{dA}$, and $\frac{dL}{dB}$ respectively.

Calculating $dZ$ and $dB$ poses no problem. But computing $dW$ and $dA$ involves the chain rule, which creates some confusion. In more detail, the solution is stated as

$dW = \frac{dL}{dW} = \frac{dL}{dZ} \frac{dZ}{dW} = dZ A^{[i - 1]T}$

$dA = \frac{dL}{dA} = \frac{dL}{dZ} \frac{dZ}{dA} = W^{[i]T}dZ$

The confusion is that I do not understand why, in computing $dW$, $\frac{dZ}{dW}$ equals $A^{[i - 1]T}$. Moreover, matrix multiplication is not commutative, i.e., $AB \neq BA$ in general. So why, in computing $dA$, does $dZ$, which comes from $\frac{dL}{dZ}$, stand to the right of $W^{[i]T}$?
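
To make the dimension bookkeeping concrete, here is a small NumPy check of the stated solution (the sizes are the same illustrative assumptions as above; `dZ` stands in for $\frac{dL}{dZ}$):

```python
import numpy as np

m, n_prev, n = 4, 3, 2               # illustrative sizes, as above

A_prev = np.random.randn(n_prev, m)  # A^{[i-1]}, shape (n_prev, m)
W = np.random.randn(n, n_prev)       # W^{[i]},   shape (n, n_prev)
dZ = np.random.randn(n, m)           # dL/dZ has the same shape as Z^{[i]}: (n, m)

dW = dZ @ A_prev.T                   # (n, m) @ (m, n_prev) -> (n, n_prev), same shape as W
dA = W.T @ dZ                        # (n_prev, n) @ (n, m) -> (n_prev, m), same shape as A_prev

# The reversed order is not even defined here:
# dZ @ W.T would be (n, m) @ (n_prev, n) -- inner dimensions m and n_prev do not match.
```

So the shapes do work out as stated, but I would still like to understand why the transpose appears and why the order is fixed the way it is.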


