This repository was archived by the owner on Sep 18, 2024. It is now read-only.

Conversation

@paw-lu paw-lu commented Jun 9, 2020

Summary

As outlined in #301, this PR makes keras.preprocessing.text.Tokenizer remove the characters in the filters argument if char_level=True.

Closes #301.

Behavior before

❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1, 'e': 2}  # "e" is tokenized

Behavior after

❯ tokenizer = keras.preprocessing.text.Tokenizer(char_level=True, filters="e")
❯ tokenizer.fit_on_texts("ae")
❯ tokenizer.word_index
{'a': 1}  # "e" is not tokenized
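The fixed behavior can be sketched as a standalone function. Note this is a minimal illustration of the intended semantics, not the actual PR diff; `char_word_index` is a hypothetical helper, not part of the Keras API.

```python
from collections import Counter

def char_word_index(texts, filters=""):
    """Sketch of char-level tokenization that drops filtered characters.

    Mimics the post-fix behavior of
    keras.preprocessing.text.Tokenizer(char_level=True, filters=...):
    any character appearing in `filters` is excluded before counting.
    """
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ch not in filters)
    # More frequent characters receive lower indices, starting at 1,
    # matching the ordering convention of Tokenizer.word_index.
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

print(char_word_index(["ae"], filters="e"))  # {'a': 1}
```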


Related Issues

PR Overview

  • This PR requires new unit tests [y/n] (make sure tests are included)
  • This PR requires the documentation to be updated [y/n] (make sure the docs are up-to-date)
  • This PR is backwards compatible [y/n]
  • This PR changes the current API [y/n] (all API changes need to be approved by fchollet)

@paw-lu changed the title from "Ignore" to "Tokenizer respects filters when char_level is True" on Jun 9, 2020


Development

Successfully merging this pull request may close these issues.

Tokenizer ignores filters if char_level is True
