This repository was archived by the owner on Oct 13, 2025. It is now read-only.

Description
In the paper, the authors write:

> Augmentations. Token augmentation consists of randomly inserting up to 4 typos per token up to 25% of the token length. This is consistent with an observed maximum human error frequency of around 20% [11]. We use 22 distinct typo augmentations, which can be grouped into four categories: deletion, insertion, substitution, and transposition. For each token, we randomly select a target augmentation percentage between 0-25%, and for each augmentation step we randomly apply an augmentation from one of the four typo categories. The full list of augmentations used is reported in Appendix D.
Is the code to apply these augmentations available anywhere? I'd like to use and adapt it for my specific use case.
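
To make the question concrete, here is a rough sketch of how I currently read the quoted description. It is not the authors' code. It implements only one generic operation per category (deletion, insertion, substitution, transposition) rather than the 22 augmentations listed in Appendix D, and every name in it (`augment_token`, `CATEGORIES`, the helper functions) is my own invention. Drawing replacement characters from lowercase ASCII and rounding the number of steps down are also assumptions on my part.

```python
import math
import random
import string

MAX_TYPOS_PER_TOKEN = 4   # "up to 4 typos per token"
MAX_AUG_FRACTION = 0.25   # "up to 25% of the token length"


def delete_char(token, rng):
    """Deletion: drop one character (tokens shorter than 2 are left alone)."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token))
    return token[:i] + token[i + 1:]


def insert_char(token, rng):
    """Insertion: add one random lowercase letter at a random position."""
    i = rng.randrange(len(token) + 1)
    return token[:i] + rng.choice(string.ascii_lowercase) + token[i:]


def substitute_char(token, rng):
    """Substitution: overwrite one character with a random lowercase letter."""
    if not token:
        return token
    i = rng.randrange(len(token))
    return token[:i] + rng.choice(string.ascii_lowercase) + token[i + 1:]


def transpose_chars(token, rng):
    """Transposition: swap two adjacent characters."""
    if len(token) < 2:
        return token
    i = rng.randrange(len(token) - 1)
    return token[:i] + token[i + 1] + token[i] + token[i + 2:]


# Hypothetical stand-ins for the paper's four typo categories.
CATEGORIES = [delete_char, insert_char, substitute_char, transpose_chars]


def augment_token(token, rng=None):
    """Apply a random number of typos to a single token.

    Assumption: the target percentage is drawn uniformly from [0, 25%],
    converted to a step count by rounding down, and capped at 4 steps.
    """
    if rng is None:
        rng = random.Random()
    fraction = rng.uniform(0.0, MAX_AUG_FRACTION)
    n_steps = min(MAX_TYPOS_PER_TOKEN, math.floor(fraction * len(token)))
    for _ in range(n_steps):
        token = rng.choice(CATEGORIES)(token, rng)
    return token


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the demo repeatable
    print([augment_token(t, rng) for t in "randomly inserting typos".split()])
```

With this rounding, tokens shorter than four characters are never modified, which may or may not match what the paper does. That is part of why I would rather adapt the original implementation than guess.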