
Conversation

RobinPicard (Contributor) commented Aug 6, 2025

Expose two new keyword arguments for generation:

  • end_thinking_tag: a string giving the tag the reasoning model uses to signal that its thinking is finished (and thus that we should start constraining the generation)
  • thinking_max_tokens: an int giving the maximum number of tokens the model may spend thinking; once that limit is reached, we force the generation of the end-of-thinking token

Not supported:

  • Models for which the end of thinking does not correspond to a single token

If, in the future, we want to capture the content of the thinking once we return an object with various attributes instead of just the text output, we could add a start_thinking_tag argument for the models that use one.
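For illustration, here is a rough sketch of how the two keywords might be passed at generation time. This is a hedged example: the generator construction is elided, `generator` and the values are illustrative, and the exact call signature in this PR may differ.

```python
# Illustrative sketch only: assumes an outlines-style structured generator;
# the exact construction and signature in this PR may differ.
# `generator` is assumed to wrap a reasoning model (e.g. one trained to
# emit its chain of thought between <think> and </think> tags).
response = generator(
    "What is 6 x 7? Answer as JSON.",
    end_thinking_tag="</think>",  # tag that marks the end of thinking
    thinking_max_tokens=512,      # after 512 thinking tokens, force </think>
)
```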

    The tag the model uses to indicate the end of the thinking process.
    Only used when running a thinking model.
thinking_max_tokens: int | None
    The maximum number of tokens the model can think about. Only used when
Member

Suggested change
- The maximum number of tokens the model can think about. Only used when
+ The maximum number of tokens the model can think for. Only used when

Comment on lines +41 to +44
end_thinking_token_id: int | None
    The id of the end thinking token
thinking_max_tokens: int | None
    The maximum number of tokens the model can think about
Member

Isn't it possible to just build a specialized logits processor that the backends are unaware of? You should be able to skip the logits-biasing function as long as </think> has not been generated, and to limit the number of thinking tokens from within the logits processor.
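A minimal sketch of that idea for the single-sequence case, assuming a transformers-style (input_ids, scores) logits-processor interface; the class and attribute names are illustrative, not this PR's API:

```python
import torch

class ThinkingWrapperProcessor:
    """Illustrative sketch: delay an inner logits processor until
    thinking has ended. Single-sequence only; `inner` is any callable
    with the (input_ids, scores) -> scores signature."""

    def __init__(self, inner, end_thinking_token_id: int, thinking_max_tokens: int):
        self.inner = inner
        self.end_thinking_token_id = end_thinking_token_id
        self.thinking_max_tokens = thinking_max_tokens
        self.thinking_done = False
        self.num_thinking_tokens = 0

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        if self.thinking_done:
            # Thinking is over: apply the structured-generation biasing.
            return self.inner(input_ids, scores)
        if input_ids.shape[-1] > 0 and input_ids[0, -1].item() == self.end_thinking_token_id:
            # The end-of-thinking token was just generated.
            self.thinking_done = True
            return self.inner(input_ids, scores)
        self.num_thinking_tokens += 1
        if self.num_thinking_tokens >= self.thinking_max_tokens:
            # Budget exhausted: force the end-of-thinking token.
            forced = torch.full_like(scores, float("-inf"))
            forced[..., self.end_thinking_token_id] = 0.0
            return forced
        # Still thinking: leave the logits untouched.
        return scores
```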

Contributor Author

Initially I wanted to wrap the logits processor in another one that would not bias anything until we encounter the end-of-thinking token and would then call the logits processor it wraps. The problem is that this does not work for batching, as the different sequences may not all stop thinking at the same time.
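To make the batching difficulty concrete, here is a hedged sketch of a per-row variant (all names illustrative, not this PR's implementation). It tracks a done flag and a thinking-token count per sequence, but note the assumption in the docstring: it only works if the wrapped processor is effectively stateless per call, which an FSM-based structured processor generally is not, since its per-sequence state would advance while some rows are still thinking.

```python
import torch

class BatchedThinkingWrapper:
    """Illustrative sketch, not the PR's implementation.

    Tracks thinking state per batch row. Assumes the wrapped `inner`
    processor is stateless across calls; an FSM-based structured
    processor is not, which is exactly the batching problem above."""

    def __init__(self, inner, end_thinking_token_id: int, thinking_max_tokens: int):
        self.inner = inner
        self.end_thinking_token_id = end_thinking_token_id
        self.thinking_max_tokens = thinking_max_tokens
        self.done = None    # per-row flag: has the end-of-thinking token appeared?
        self.counts = None  # per-row count of thinking tokens so far

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        batch_size = scores.shape[0]
        if self.done is None:
            self.done = torch.zeros(batch_size, dtype=torch.bool, device=scores.device)
            self.counts = torch.zeros(batch_size, dtype=torch.long, device=scores.device)
        # Mark rows whose most recent token ended the thinking section.
        if input_ids.shape[-1] > 0:
            self.done |= input_ids[:, -1] == self.end_thinking_token_id
        self.counts += (~self.done).long()

        plain = scores.clone()                   # unbiased logits for thinking rows
        biased = self.inner(input_ids, scores)   # structured biasing (may act in place)
        out = torch.where(self.done.unsqueeze(-1), biased, plain)

        # Rows over budget: force the end-of-thinking token.
        over = (~self.done) & (self.counts >= self.thinking_max_tokens)
        if over.any():
            out[over] = float("-inf")
            out[over, self.end_thinking_token_id] = 0.0
        return out
```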
