Skip to content
This repository was archived by the owner on Oct 13, 2025. It is now read-only.
This repository was archived by the owner on Oct 13, 2025. It is now read-only.

Is the code to generate the augmented data available anywhere? #44

@delkind-dnsf

Description

@delkind-dnsf

In the paper, the authors write

Augmentations Token augmentation consists of randomly inserting up to 4 typos per token up to 25% of the token length. This is consistent with an observed maximum human error frequency of around 20% [11]. We use 22 distinct typo augmentations, which can be grouped into four categories: deletion, insertion, substitution, and transposition. For each token, we randomly select a target augmentation percentage between 0-25%, and for each augmentation step we randomly apply an augmentation from one of the four typo categories. The full list of augmentations used is reported in Appendix D.

Is the code to apply these augmentations available anywhere? I'd like to use & adapt it for my specific use-case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions