Why does the pooler use tanh as its activation function in BERT, rather than GELU?
    import torch.nn as nn

    class BERTPooler(nn.Module):
        def __init__(self, config):
            super(BERTPooler, self).__init__()
            self.dense = nn.Linear(config.hidden_size, config.hidden_size)
            self.activation = nn.Tanh()

        def forward(self, hidden_states):
            # We "pool" the model by simply taking the hidden state corresponding
            # to the first token.
            first_token_tensor = hidden_states[:, 0]
            pooled_output = self.dense(first_token_tensor)
            pooled_output = self.activation(pooled_output)
            return pooled_output
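For reference, here is a minimal usage sketch (the ToyConfig class, batch size, and sequence length are illustrative assumptions, not part of the original snippet). It shows that the pooler reads only the first ([CLS]) position and returns one vector per sequence:

    import torch

    class ToyConfig:
        hidden_size = 768  # hypothetical config; BERT-base also uses 768

    pooler = BERTPooler(ToyConfig())

    # Fake encoder output of shape (batch_size, seq_len, hidden_size)
    hidden_states = torch.randn(2, 128, ToyConfig.hidden_size)

    pooled = pooler(hidden_states)  # internally uses only hidden_states[:, 0]
    print(pooled.shape)             # torch.Size([2, 768])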
Solution 1:[1]
The author of the original BERT paper answered this (at least in part) in a comment on GitHub: "The tanh() thing was done early to try to make it more interpretable but it probably doesn't matter either way."

I agree this doesn't fully answer why tanh is preferable, but it suggests the choice is not critical and that the pooler would probably work with any activation.
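To make that concrete, here is a hedged sketch (ConfigurablePooler is a name invented for this illustration, not part of the BERT code base) of the same pooling logic with a pluggable activation, so tanh can be swapped for GELU in one line:

    import torch.nn as nn

    class ConfigurablePooler(nn.Module):
        """Same pooling as BERTPooler, but the activation is a constructor argument."""
        def __init__(self, hidden_size, activation=None):
            super().__init__()
            self.dense = nn.Linear(hidden_size, hidden_size)
            self.activation = activation if activation is not None else nn.Tanh()

        def forward(self, hidden_states):
            # Pool by taking the hidden state of the first ([CLS]) token.
            first_token_tensor = hidden_states[:, 0]
            return self.activation(self.dense(first_token_tensor))

    # Original BERT behaviour vs. a GELU variant:
    tanh_pooler = ConfigurablePooler(768)                        # tanh, as in BERTPooler
    gelu_pooler = ConfigurablePooler(768, activation=nn.GELU())

One caveat: in a pretrained checkpoint the pooler's dense weights were trained together with tanh (via the next-sentence-prediction head), so swapping the activation afterwards changes the distribution the downstream head sees.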
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Daisuke Shimamoto |