Collection of Tokenizer issues #17051

@patrickvonplaten

System Info

Transformers + Tokenizers

Who can help?

This issue summarizes several problems that we are currently encountering with Tokenizers. Solving them requires a more in-depth discussion of:

  • To what extent fast and slow tokenizers should be aligned
  • Whether all slow tokenizers should be kept
  • How to treat special tokens
  • Whether all internal methods of a tokenizer should be exposed
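
To make the first and third points concrete, here is a minimal sketch of how one might surface disagreements between two tokenizer implementations. It is only an illustration: `find_mismatches`, `toy_slow` and `toy_fast` are hypothetical names, and the two toy functions are stand-ins, not the behaviour of any real Transformers tokenizer. With the library installed, one could presumably pass the `tokenize` methods of `AutoTokenizer.from_pretrained(name, use_fast=False)` and `AutoTokenizer.from_pretrained(name, use_fast=True)` instead.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Any callable that maps a string to a list of tokens.
Tokenize = Callable[[str], List[str]]


def find_mismatches(
    slow: Tokenize, fast: Tokenize, texts: Iterable[str]
) -> List[Tuple[str, List[str], List[str]]]:
    """Return (text, slow_tokens, fast_tokens) for every text where the two disagree."""
    mismatches = []
    for text in texts:
        slow_tokens, fast_tokens = slow(text), fast(text)
        if slow_tokens != fast_tokens:
            mismatches.append((text, slow_tokens, fast_tokens))
    return mismatches


# Hypothetical stand-ins for a "slow" and a "fast" implementation.
def toy_slow(text: str) -> List[str]:
    # Splits on whitespace only, so "<s>" stays in one piece.
    return text.split()


def toy_fast(text: str) -> List[str]:
    # Splits words and punctuation apart, so "<s>" is broken up.
    return re.findall(r"\w+|[^\w\s]", text)


samples = ["hello world", "hello, world", "<s> hi </s>"]
mismatches = find_mismatches(toy_slow, toy_fast, samples)

for text, slow_tokens, fast_tokens in mismatches:
    print(repr(text), slow_tokens, fast_tokens)
```

In this toy setup the plain sentence tokenizes identically, while punctuation and the special-token-like strings `<s>` and `</s>` are where the two stand-ins diverge. That is the kind of case the alignment and special-token questions above are about.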

Relevant issues/PRs:
#15420
#16336
#16334
#16337
#15138
#16339
#15775

To community:
At the moment we unfortunately don't have the time to dive deeper into this, but we're working to set aside time to discuss the strategy soon.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

See issues above

Expected behavior

Don't know yet

Metadata

Labels: Discussion, WIP, bug
