How does the calculation in a GRU layer take place?

So I want to understand exactly how the outputs and hidden state of a GRU cell are calculated.

I obtained the pre-trained model from here, and its GRU layer is defined as nn.GRU(96, 96, bias=True).

I looked at the PyTorch documentation and confirmed the dimensions of the weights and biases as:

  • weight_ih_l0: (288, 96)
  • weight_hh_l0: (288, 96)
  • bias_ih_l0: (288)
  • bias_hh_l0: (288)

My input and output sizes are both (1000, 8, 96). I understand this as 1000 tensors, each of size (8, 96). The hidden state is (1, 8, 96), i.e. one tensor of size (8, 96).

I have also printed the variable batch_first and found it to be False. This means that:

  • Sequence length: L=1000
  • Batch size: B=8
  • Input size: Hin=96
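
For reference, this is roughly how I checked those shapes (a randomly initialized nn.GRU of the same size stands in for the pre-trained one):

import torch
from torch import nn

gru = nn.GRU(96, 96, bias=True)   # batch_first defaults to False
x = torch.rand(1000, 8, 96)       # (L, N, H_in) = (1000, 8, 96)
output, h_n = gru(x)
print(output.shape)               # torch.Size([1000, 8, 96])
print(h_n.shape)                  # torch.Size([1, 8, 96])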

Now, going by the equations from the documentation, for the reset gate I need to multiply the weight by the input x. But my weights are two-dimensional and my input has three dimensions.

Here is what I've tried: I took the first (8, 96) matrix from my input and multiplied it with the transpose of my weight matrix:

Input (8, 96) x Weight (96, 288) = (8, 288)

Then I add the bias by replicating the (288,) bias vector eight times to give (8, 288). This gives r(t) a size of (8, 288); similarly, z(t) would also be (8, 288).

This r(t) is used in n(t). Since a Hadamard product is used, both matrices being multiplied have to be the same size, i.e. (8, 288). This implies that n(t) is also (8, 288).

Finally, h(t) comes from a Hadamard product and a matrix addition, which would give h(t) a size of (8, 288), which is wrong.
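
Using the same gru and x as above, my attempt amounts to roughly this (ignoring the sigmoid and the hidden-state term, which do not change the shapes):

W_ih = gru.weight_ih_l0              # (288, 96)
b_ih = gru.bias_ih_l0                # (288,)
x0 = x[0]                            # first element of the sequence: (8, 96)
r_attempt = x0 @ W_ih.T + b_ih       # shape (8, 288), not (8, 96)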

Where am I going wrong in this process?



Solution 1:[1]

TL;DR: This confusion comes from the fact that the layer's weights are the concatenation of the three gates' weights, both for the input-hidden and the hidden-hidden transforms.


- nn.GRU layer weight/bias layout

You can take a closer look at what's inside the GRU layer implementation torch.nn.GRU by peeking at its weights and biases.

>>> gru = nn.GRU(input_size=96, hidden_size=96, num_layers=1)

First the parameters of the GRU layer:

>>> gru._all_weights
[['weight_ih_l0', 'weight_hh_l0', 'bias_ih_l0', 'bias_hh_l0']]

You can look at gru.state_dict() to get the dictionary of weights of the layer.

We have two weights and two biases; _ih stands for 'input-hidden' and _hh stands for 'hidden-hidden'.
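
For this layer, their shapes already hint at how the parameters are packed (3 × 96 = 288 rows):

>>> gru.weight_ih_l0.shape, gru.bias_ih_l0.shape
(torch.Size([288, 96]), torch.Size([288]))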

For more efficient computation, the parameters of the three gates have been concatenated together, as the documentation page explains (| means concatenation). In this particular example num_layers=1 and k=0:

  • ~GRU.weight_ih_l[k] – the learnable input-hidden weights of the layer (W_ir | W_iz | W_in), of shape (3*hidden_size, input_size).

  • ~GRU.weight_hh_l[k] – the learnable hidden-hidden weights of the layer (W_hr | W_hz | W_hn), of shape (3*hidden_size, hidden_size).

  • ~GRU.bias_ih_l[k] – the learnable input-hidden bias of the layer (b_ir | b_iz | b_in), of shape (3*hidden_size).

  • ~GRU.bias_hh_l[k] – the learnable hidden-hidden bias of the layer (b_hr | b_hz | b_hn), of shape (3*hidden_size).

For further inspection we can get those split up with the following code:

>>> H_out = gru.hidden_size   # the split size is hidden_size (here equal to input_size, 96)
>>> W_ih, W_hh, b_ih, b_hh = gru._flat_weights
>>> W_ir, W_iz, W_in = W_ih.split(H_out)
>>> W_hr, W_hz, W_hn = W_hh.split(H_out)
>>> b_ir, b_iz, b_in = b_ih.split(H_out)
>>> b_hr, b_hz, b_hn = b_hh.split(H_out)

Now we have the 12 tensor parameters sorted out.
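
As a quick sanity check, each slice holds one gate's parameters:

>>> W_ir.shape, b_ir.shape
(torch.Size([96, 96]), torch.Size([96]))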


- Expressions

The four expressions for a GRU layer, r_t, z_t, n_t, and h_t, are computed at each timestep.

The first operation is r_t = σ(W_ir@x_t + b_ir + W_hr@h + b_hr). I used the @ sign to designate the matrix multiplication operator (__matmul__). Remember W_ir is shaped (hidden_size, H_in=input_size), while x_t contains the element at step t of the x sequence: tensor x_t = x[t] is shaped (N=batch_size, H_in=input_size). At this point, it's simply a matrix multiplication between the input x[t] and the (transposed) weight matrix. The resulting tensor is shaped (N, H_out=hidden_size):

>>> (x[t]@W_ir.T).shape
torch.Size([8, 96])

The same is true for all other weight multiplication operations performed. As a result, you end up with an output tensor shaped (N, H_out=hidden_size).

In the following expressions, h is the tensor containing the hidden state of the previous step for each element in the batch, shaped (N, hidden_size=H_out). Since num_layers=1, there is a single hidden layer.
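
For the first step of this walkthrough, h can be taken as zero-initialized (an assumption here; in general it is whatever the previous step produced):

>>> h = torch.zeros(8, 96)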

>>> r_t = torch.sigmoid(x[t]@W_ir.T + b_ir + h@W_hr.T + b_hr)
>>> r_t.shape
torch.Size([8, 96])

>>> z_t = torch.sigmoid(x[t]@W_iz.T + b_iz + h@W_hz.T + b_hz)
>>> z_t.shape
torch.Size([8, 96])
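
The remaining two expressions follow the same pattern, using the gate outputs just computed:

>>> n_t = torch.tanh(x[t]@W_in.T + b_in + r_t*(h@W_hn.T + b_hn))
>>> h_t = (1 - z_t)*n_t + z_t*h
>>> h_t.shape
torch.Size([8, 96])

This h_t is both the output at step t and the hidden state carried into step t+1.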

The output of the layer is the concatenation of the computed h tensors at consecutive timesteps t (between 0 and L-1).


- Demonstration

Here is a minimal example of nn.GRU inference, computed manually:

Parameter   Description        Value
H_in        feature size       3
H_out       hidden size        2
L           sequence length    3
N           batch size         1
k           number of layers   1

Setup:

import torch
from torch import nn

H_in, H_out, L, N, k = 3, 2, 3, 1, 1   # values from the table above

gru = nn.GRU(input_size=H_in, hidden_size=H_out, num_layers=k)
W_ih, W_hh, b_ih, b_hh = gru._flat_weights
W_ir, W_iz, W_in = W_ih.split(H_out)
W_hr, W_hz, W_hn = W_hh.split(H_out)
b_ir, b_iz, b_in = b_ih.split(H_out)
b_hr, b_hz, b_hn = b_hh.split(H_out)

Random input:

x = torch.rand(L, N, H_in)

Inference loop:

output = []
h = torch.zeros(1, N, H_out)                                  # initial hidden state
for t in range(L):
    r = torch.sigmoid(x[t]@W_ir.T + b_ir + h@W_hr.T + b_hr)   # reset gate
    z = torch.sigmoid(x[t]@W_iz.T + b_iz + h@W_hz.T + b_hz)   # update gate
    n = torch.tanh(x[t]@W_in.T + b_in + r*(h@W_hn.T + b_hn))  # candidate hidden state
    h = (1-z)*n + z*h                                         # new hidden state
    output.append(h)                                          # collect per-step outputs

The final output is obtained by stacking the tensors h from consecutive timesteps:

>>> torch.vstack(output)
tensor([[[0.1086, 0.0362]],

        [[0.2150, 0.0108]],

        [[0.3020, 0.0352]]], grad_fn=<CatBackward>)

In this case the output shape is (L, N, H_out), i.e. (3, 1, 2).

You can compare this with the result of output, _ = gru(x).
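
For instance, one way to check that the two agree (reusing gru, x, and output from the loop above; this should return True up to floating-point tolerance):

>>> out_ref, _ = gru(x)
>>> torch.allclose(torch.vstack(output), out_ref)
True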

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1