While reading the code for Hugging Face’s T5, I felt uncomfortable with how padding is processed, so I examined it below.

Hugging Face’s attention layer implementation is here:

[round-off error utilization] How are the padding columns in the attention matrix converted to 0?

In the calculation of the new hidden states in the attention layer, the attention weights at padding positions must be 0, but I cannot find code that explicitly substitutes 0.

This is implemented by utilizing round-off error.
As you know, softmax assigns weight 0 to an entry of -∞.
The variable mask uses the minimum value of the dtype instead. For example, PyTorch’s float minimum is shown below.

>>> import torch
>>> torch.finfo(torch.float).min
-3.4028234663852886e+38

This is a very small number, but it is not -∞. Then, does some error remain in the calculation result?

The answer is no. The softmax calculation rounds this minimum to exactly zero, so there is no need to use exactly minus infinity.

>>> vec=torch.tensor([1,2,torch.finfo(torch.float).min])
>>> torch.softmax(vec,0)
tensor([0.2689, 0.7311, 0.0000])
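To confirm, here is a small sketch comparing a true -∞ mask with the dtype minimum; both yield the same weights, with the masked entry at exactly 0:

```python
import torch

# masked score as true negative infinity vs. the float dtype minimum
neg_inf = torch.tensor([1.0, 2.0, float("-inf")])
type_min = torch.tensor([1.0, 2.0, torch.finfo(torch.float).min])

# both softmax results round the masked entry down to exactly 0
w_inf = torch.softmax(neg_inf, 0)
w_min = torch.softmax(type_min, 0)
```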

The code is here:

[loss of trailing digits utilization] How is the padding mask applied to the original attention matrix?

As I mentioned above, the minimum value of the dtype is assigned to the padding positions of the variable mask. Then, the mask is added to position_bias, which is later added to the attention matrix.

NOTE: mask is a vector here. Adding a vector to a matrix broadcasts it, adding the vector to every row.

>>> torch.zeros(3,3)+torch.ones(3)
tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])
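Putting the two pieces together, here is a minimal sketch of how such an additive padding mask can be built and applied. This is not Hugging Face’s exact code; the variable names and the zero position_bias are assumptions for illustration:

```python
import torch

# hypothetical example, not HF's implementation: 1 = real token, 0 = padding
attention_mask = torch.tensor([[1, 1, 1, 0]])
min_value = torch.finfo(torch.float).min

# padding positions get the dtype minimum; real tokens get 0
additive_mask = (1.0 - attention_mask.float()) * min_value

scores = torch.randn(1, 4)         # stand-in for the raw attention scores
position_bias = torch.zeros(1, 4)  # assume zero bias for simplicity
weights = torch.softmax(scores + position_bias + additive_mask, dim=-1)
# the padding column of weights is rounded to exactly 0
```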

But is it safe to add the minimum value (not -∞) instead of assigning it?
The answer is yes. This is a utilization of loss of trailing digits.
When adding floating-point numbers of vastly different magnitudes, the smaller one is absorbed and the result is unchanged.

>>> torch.finfo(torch.float).min == torch.finfo(torch.float).min+1e3
True

position_bias is so small in absolute value compared to the minimum value of the dtype that adding them does not change the result: it is still effectively -∞ for softmax.
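For illustration, assuming a position_bias of a few units in magnitude (an assumption; the real values depend on the model), the addition is absorbed and the masked score still softmaxes to exactly 0:

```python
import torch

min_value = torch.finfo(torch.float).min
bias = 5.0  # assumed position_bias magnitude, for illustration only

# the bias is absorbed by loss of trailing digits...
masked_score = min_value + bias
scores = torch.tensor([1.0, 2.0, masked_score])
# ...so softmax still rounds the padding entry to exactly 0
weights = torch.softmax(scores, 0)
```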