This repository was archived by the owner on Sep 18, 2024. It is now read-only.

Conversation

@paw-lu paw-lu commented Jun 9, 2020

Summary

As outlined in #301, this PR makes keras.preprocessing.text.Tokenizer remove the characters in the filters argument if char_level=True.

Closes #301.

Behavior before

❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1, 'e': 2}  # "e" is tokenized

Behavior after

❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1}  # "e" is not tokenized
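The fixed behavior can be sketched as a standalone function. Note this is a minimal illustration of the intended semantics, not the actual PR diff; `char_word_index` is a hypothetical helper, not part of the Keras API.

```python
from collections import Counter

def char_word_index(texts, filters=""):
    """Sketch of char-level tokenization that drops filtered characters.

    Mimics the post-fix behavior of
    keras.preprocessing.text.Tokenizer(char_level=True, filters=...):
    any character appearing in `filters` is excluded before counting.
    """
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ch not in filters)
    # More frequent characters receive lower indices, starting at 1,
    # matching the ordering convention of Tokenizer.word_index.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

print(char_word_index(["ae"], filters="e"))  # {'a': 1}
```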


Related Issues

PR Overview

  • This PR requires new unit tests [y/n] (make sure tests are included)
  • This PR requires the documentation to be updated [y/n] (make sure the docs are up-to-date)
  • This PR is backwards compatible [y/n]
  • This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@paw-lu changed the title from "Ignore" to "Tokenizer respects filters when char_level is True" on Jun 9, 2020


Development

Successfully merging this pull request may close these issues.

Tokenizer ignores filters if char_level is True
