# Padding process in HuggingFace transformer models

While reading the code for Hugging Face’s T5, I was not entirely comfortable with how padding is processed, so I examined it below.

HuggingFace’s attention layer implementation is here:

https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L333

## [round-off error utilization] How are the padding columns in the attention matrix converted to 0?

When the attention weights are calculated, the columns of the attention matrix that correspond to padding tokens must end up as 0.

This is implemented by utilizing round-off error.

As you know, the textbook way to zero out positions in a softmax is to add minus infinity to them. The variable `mask`, however, uses the minimum value of the type instead. For example, the minimum of PyTorch’s `float` is shown below.

```python
import torch
torch.finfo(torch.float).min  # -3.4028234663852886e+38
```

This is a very small (very negative) number, but it is not minus infinity. Is that a problem?

The answer is no. In the softmax, the exponential of this minimum underflows, so the corresponding weight is rounded to exactly zero. Thus, there is no need to compute exactly minus infinity.

```python
vec = torch.tensor([1.0, 2.0, torch.finfo(torch.float).min])
torch.softmax(vec, dim=-1)  # tensor([0.2689, 0.7311, 0.0000])
```

The code is here:

- type’s minimum value for mask: https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L529
- softmax round-off error utilization: https://github.com/huggingface/transformers/blob/df56c843be370bd73d9ce4e597d4c7f55e4e41d9/src/transformers/models/t5/modeling_t5.py#L539
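
For illustration, here is a minimal sketch of that mask construction, assuming a plain 0/1 padding mask; the variable names are mine and this is not the exact Hugging Face code:

```python
import torch

# 1 means "real token", 0 means "padding"
attention_mask = torch.tensor([[1.0, 1.0, 1.0, 0.0, 0.0]])

# real positions get 0, padded positions get the most negative finite float
additive_mask = (1.0 - attention_mask) * torch.finfo(torch.float).min
# -> [[0, 0, 0, -3.4028e+38, -3.4028e+38]]
```

Because this additive mask is later combined with the attention scores, the padded columns end up at (effectively) minus infinity, and the softmax turns them into 0.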

## [loss of trailing digits utilization] How is the padding mask applied to the original attention matrix?

As I mentioned above, the minimum value of the type is assigned to the padding positions in the variable `mask`. Then, the mask is added to the position bias, which is later added to the attention matrix.

NOTE: `mask` is a vector here. Adding a vector to a matrix broadcasts it, so the vector is added to every row:

```python
torch.zeros(3, 3) + torch.ones(3)  # every row becomes [1., 1., 1.]
```
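
Putting the pieces together, here is a simplified single-head sketch of this flow (names and shapes are illustrative, not the actual T5 code):

```python
import torch

seq_len = 4
scores = torch.randn(seq_len, seq_len)         # raw attention scores, e.g. q @ k.T
position_bias = torch.randn(seq_len, seq_len)  # relative position bias
mask = torch.tensor([0.0, 0.0, 0.0, torch.finfo(torch.float).min])  # last key is padding

position_bias = position_bias + mask           # broadcast: every row receives the mask
scores = scores + position_bias
attn_weights = torch.softmax(scores, dim=-1)
print(attn_weights[:, -1])                     # the padded column is all zeros
```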

But is it safe to **add** the minimum value (not assign it) on top of the existing position bias?

The answer is yes. This is a utilization of loss of trailing digits.

When adding two floating-point numbers whose absolute values differ enormously, the smaller one is effectively dropped:

```python
torch.finfo(torch.float).min == torch.finfo(torch.float).min + 1e3  # True
```

`position_bias` is so small in absolute value compared to the minimum value of the type that adding the two together does not change the fact that the result is still equivalent to minus infinity for the softmax.
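
As a quick sanity check with made-up numbers, adding an ordinary-sized bias on top of the masked score still leaves the padded position at exactly zero after the softmax:

```python
import torch

neg = torch.finfo(torch.float).min
scores = torch.tensor([1.0, 2.0, 0.5])   # made-up raw scores; the last position is padding
bias = torch.tensor([0.7, -1.3, 2.0])    # made-up position bias
mask = torch.tensor([0.0, 0.0, neg])

masked = scores + bias + mask
print(masked[-1] == neg)                 # True: the bias is lost as trailing digits
print(torch.softmax(masked, dim=-1))     # tensor([0.7311, 0.2689, 0.0000])
```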