Batch Normalization, Dropout and number of layers
I'm learning about batch normalization and dropout. I saw this: https://www.kaggle.com/ryanholbrook/dropout-and-batch-normalization.
The model (with the imports it needs):
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(1024, activation='relu', input_shape=[11]),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1024, activation='relu'),
    layers.Dropout(0.3),
    layers.BatchNormalization(),
    layers.Dense(1),
])
My question is: do we put Dropout before BatchNormalization (BN) or after? Do they give the same results?
My understanding is that dropout will "deactivate" a neuron's output going into the next layer (pardon my terminology). So if I put it before BN, wouldn't BN be normalizing incorrectly, since it doesn't see the full output of the previous layer?
So should we put dropout after BN? Does it matter?
Solution 1:[1]
If you take a closer look at the TensorFlow documentation for the Dropout layer, you will see it says:
The Dropout layer randomly sets input units to 0 with a frequency of rate at each step during training time, which helps prevent overfitting. Inputs not set to 0 are scaled up by 1/(1 - rate) such that the sum over all inputs is unchanged.
This means dropout does not deactivate any neurons in the next layer, but instead "deactivates" the inputs to those neurons (i.e. the neurons receive a 0 as input). Note that this is the usual way Dropout is implemented in practice; other frameworks such as PyTorch use the same approach.
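To make that behavior concrete, here is a minimal sketch (assuming TensorFlow 2.x) that applies a Dropout layer to a constant input. During training roughly rate of the entries become 0 and the survivors are scaled up by 1/(1 - rate); at inference the inputs pass through unchanged.

import tensorflow as tf

x = tf.ones((1, 10))                       # ten inputs, all equal to 1.0
dropout = tf.keras.layers.Dropout(rate=0.3)

print(dropout(x, training=False).numpy())  # inference: values unchanged
print(dropout(x, training=True).numpy())   # training: ~30% zeros, the rest scaled to 1/0.7 ≈ 1.43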
So BN after Dropout will not "normalize incorrectly"; it will do exactly what it is programmed for, namely perform normalization, except that some of its inputs are now 0 instead of their non-dropped values.
Whether you put Dropout before or after BN depends on your data and can yield different results. Most of the time both techniques are used for regularization, and in my personal experience one of them suffices; they do not perform better combined.
If you think of dropout as noise, putting Dropout before BN means you normalize a "noised" input, while putting it after means you add noise to the normalized data.
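As a minimal sketch (assuming TensorFlow 2.x, and using a hypothetical helper make_block to keep it short), here are the two orderings side by side; both compile and train, and which one works better is an empirical question for your data:

from tensorflow import keras
from tensorflow.keras import layers

def make_block(dropout_first: bool):
    if dropout_first:
        # Dropout -> BN: BN normalizes the "noised" activations
        return [layers.Dropout(0.3), layers.BatchNormalization()]
    # BN -> Dropout: noise is added to the already-normalized activations
    return [layers.BatchNormalization(), layers.Dropout(0.3)]

model = keras.Sequential(
    [layers.Dense(1024, activation='relu', input_shape=[11])]
    + make_block(dropout_first=True)   # flip to False to try the other ordering
    + [layers.Dense(1)]
)
model.compile(optimizer='adam', loss='mae')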
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source
---|---
Solution 1 | mortom123