Padding processing in Hugging Face transformer models
While reading the code for Hugging Face's T5, the handling of padding felt unintuitive to me, so I examined it below.
Hugging Face's attention layer implementation is here:
https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L333
[round-off error utilization] How are the padding columns in the attention matrix converted to 0?
When the attention weights are computed, the columns corresponding to padding positions must end up as 0 so that padded tokens contribute nothing. This is implemented by utilizing round-off error.
The variable `mask` does not use minus infinity; it uses the minimum value of the type. For example, the minimum of PyTorch's float type is shown below.
```python
import torch
torch.finfo(torch.float).min  # -3.4028234663852886e+38
```
This is a very small number, but it is not minus infinity. Is that a problem? The answer is no: in the softmax, the exponential of this value underflows to zero, so there is no need to compute exactly minus infinity.
```python
vec = torch.tensor([1, 2, torch.finfo(torch.float).min])
torch.softmax(vec, dim=0)  # tensor([0.2689, 0.7311, 0.0000])
```
The code is here:
- type's minimum value for the mask: https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L529
- softmax round-off error utilization: https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L539
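For reference, the conversion from a 0/1 attention mask to this additive `mask` follows roughly the pattern below. This is a minimal sketch rather than the exact transformers code; `attention_mask` and `additive_mask` are illustrative names.

```python
import torch

# 1 = real token, 0 = padding (an illustrative batch of two sequences)
attention_mask = torch.tensor([[1., 1., 1., 0.],
                               [1., 1., 0., 0.]])

# Invert and scale: real tokens map to 0, padding maps to the type's minimum,
# so the mask can later simply be added to the attention scores.
additive_mask = (1.0 - attention_mask) * torch.finfo(torch.float).min
# padding positions are now -3.4028e+38, real positions stay 0.0
```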
[loss of trailing digits utilization] How is the padding mask applied to the original attention matrix?
As mentioned above, the minimum value of the type is assigned to the padding positions in the variable `mask`. Then, `mask` is added to `position_bias`, which is later added to the attention matrix.
NOTE: `mask` is a vector here. Adding a vector to a matrix adds the vector to every row (broadcasting).
```python
torch.zeros(3, 3) + torch.ones(3)
# tensor([[1., 1., 1.],
#         [1., 1., 1.],
#         [1., 1., 1.]])
```
But is it safe to add the minimum value (which is not minus infinity) to `position_bias`? The answer is yes. This works because of loss of trailing digits: when a floating-point number with a very large absolute value is added to one with a much smaller absolute value, the smaller one is effectively ignored.
```python
torch.finfo(torch.float).min == torch.finfo(torch.float).min + 1e3  # True
```
`position_bias` is so small in absolute value compared to the minimum value of the type that adding them together changes nothing: the masked entries still behave as minus infinity in the softmax.
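Putting the two tricks together, the whole flow can be sketched like this. It is a toy example with made-up shapes and names, not the actual T5 implementation:

```python
import torch

seq_len, d = 4, 8
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)

# The last position is padding (1 = real token, 0 = padding).
attention_mask = torch.tensor([1., 1., 1., 0.])
mask = (1.0 - attention_mask) * torch.finfo(torch.float).min

# A toy position bias; it is tiny compared to the type's minimum, so adding
# it to `mask` does not change the "effectively minus infinity" entries.
position_bias = torch.randn(seq_len, seq_len) + mask  # mask added to every row

scores = q @ k.T + position_bias         # padding column is ~ the type's minimum
weights = torch.softmax(scores, dim=-1)  # padding column rounds to exactly 0
print(weights[:, -1])                    # tensor([0., 0., 0., 0.])
```

Because the padding column of `weights` is exactly zero, the padded token contributes nothing when the weights are multiplied with the value vectors.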