Padding process in HuggingFace transformer models
While reading the code for Hugging Face’s T5, I was not sure how padding is handled, so I examined it below.
Hugging Face’s attention layer implementation is here:
https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L333
[round-off error utilization] How are the padding columns in the attention matrix converted to 0?
In the calculation of the new attention weights, the columns that correspond to padding must become 0 after the softmax. This is implemented by utilizing round-off error.
As you know, the usual way to achieve this is to add minus infinity to the masked positions before the softmax. However, the variable mask uses the minimum value of the type instead. For example, PyTorch’s float minimum is shown below.
import torch
torch.finfo(torch.float).min
# -3.4028234663852886e+38
This is a very small number, but it is not minus infinity. Is that a problem?
The answer is no. The softmax calculation rounds this minimum down to zero, so there is no need to use exactly minus infinity.
vec = torch.tensor([1, 2, torch.finfo(torch.float).min])
torch.softmax(vec, dim=0)
# tensor([0.2689, 0.7311, 0.0000])
The code is here:
- type’s minimum value for mask:
https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L529
- softmax round-off error utilization:
https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L539
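Putting the two links together, here is a minimal sketch of the mechanism, assuming a 2-D score matrix and a 1-D attention_mask (the real T5 code works on 4-D batched tensors, so the names and shapes below are simplified):

import torch

scores = torch.randn(3, 4)                       # raw attention scores (queries x keys), illustrative values
attention_mask = torch.tensor([1., 1., 1., 0.])  # 1 = real token, 0 = padding (last key is padding)

# padding positions get the type's minimum value, real tokens get 0
mask = (1.0 - attention_mask) * torch.finfo(scores.dtype).min

# broadcasting adds mask to every row; softmax then rounds the padding column to 0
weights = torch.softmax(scores + mask, dim=-1)
print(weights[:, -1])  # tensor([0., 0., 0.])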
[loss of trailing digits utilization] How is the padding mask applied to the original attention matrix?
As I mentioned above, the minimum value of the type is assigned to the padding positions of the variable mask. The mask is then added to position_bias, which is later added to the attention matrix.
NOTE: mask is a vector here. Adding a vector to a matrix adds the vector to every row (broadcasting).
torch.zeros(3, 3) + torch.ones(3)
# tensor([[1., 1., 1.],
#         [1., 1., 1.],
#         [1., 1., 1.]])
But is it safe to add the type’s minimum value (rather than minus infinity) to position_bias without corrupting the result?
The answer is yes. This is a utilization of loss of trailing digits.
When adding two floating-point numbers whose absolute values differ greatly, the smaller one is absorbed by rounding and effectively ignored.
torch.finfo(torch.float).min == torch.finfo(torch.float).min + 1e3
# True
position_bias is so small in absolute value compared to the minimum value of the type that adding them together does not change the result: the padding positions remain effectively minus infinity.
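To check that claim numerically, here is a small sketch (with made-up bias values, not T5’s actual tensors) showing that adding a realistic-sized position_bias to the mask leaves the padding column at exactly the type’s minimum:

import torch

position_bias = torch.randn(3, 4)                # relative position bias: small values
attention_mask = torch.tensor([1., 1., 1., 0.])  # last key position is padding
mask = (1.0 - attention_mask) * torch.finfo(torch.float).min

biased = position_bias + mask
# loss of trailing digits: the padding column is still exactly the type's minimum
print(torch.all(biased[:, -1] == torch.finfo(torch.float).min))  # tensor(True)

So the softmax still sees what is effectively minus infinity in the padding column and returns 0 there.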
