Open
Labels
Discussion, WIP, bug
Description
System Info
Transformers + Tokenizers
Who can help?
This issue summarizes multiple problems that we are currently encountering with tokenizers. To solve them, we'll need a more in-depth discussion of:
- To what extent fast and slow tokenizers should be aligned
- Whether all slow tokenizers should be kept
- How to treat special tokens
- Whether all internal methods of the tokenizers should be exposed
Relevant issues/PRs:
#15420
#16336
#16334
#16337
#15138
#16339
#15775
To community:
At the moment we unfortunately don't have the time to dive deeper here, but we're trying hard to allocate time to discuss the strategy soon.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
See issues above
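For readers who want to check the fast/slow alignment question on their own inputs, here is a minimal sketch of one way to surface disagreements. It is not taken from the linked issues: `find_mismatches`, `toy_slow`, and `toy_fast` are hypothetical names, and the two toy encoders are stand-ins that only mimic the kind of whitespace-handling difference that can separate two implementations.

```python
from typing import Callable, Iterable, List, Tuple

# An "encoder" here is any callable mapping a string to a list of ids.
Encoder = Callable[[str], List[int]]


def find_mismatches(
    slow_encode: Encoder, fast_encode: Encoder, texts: Iterable[str]
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, slow_ids, fast_ids) for every text where the encoders disagree."""
    mismatches = []
    for text in texts:
        slow_ids = list(slow_encode(text))
        fast_ids = list(fast_encode(text))
        if slow_ids != fast_ids:
            mismatches.append((text, slow_ids, fast_ids))
    return mismatches


# Hypothetical toy encoders (NOT real tokenizers): each "id" is a word length.
def toy_slow(text: str) -> List[int]:
    # str.split() collapses runs of whitespace and drops leading/trailing spaces.
    return [len(word) for word in text.split()]


def toy_fast(text: str) -> List[int]:
    # str.split(" ") keeps empty strings for repeated or leading spaces.
    return [len(word) for word in text.split(" ")]


SAMPLES = ["hello world", "hello  world", " leading space"]
MISMATCHES = find_mismatches(toy_slow, toy_fast, SAMPLES)

for text, slow_ids, fast_ids in MISMATCHES:
    print(repr(text), slow_ids, fast_ids)
```

With real tokenizers, the two callables could be built from `AutoTokenizer.from_pretrained(name, use_fast=False)` and `AutoTokenizer.from_pretrained(name, use_fast=True)`, for example as `lambda t: tok(t)["input_ids"]`; which checkpoints and inputs actually diverge is left to the issues listed above.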
Expected behavior
Don't know yet
SaulLu and LysandreJik