Feature request
Problem
In the flash attention code path, the check

```python
and (max_length_q is not None or (query_length != 1 and not (torch.diff(position_ids, dim=-1) >= 0).all()))
```

causes the result computed from the device tensor position_ids to be synced to the host side.
During inference/training, this can cause serious performance degradation due to CPU blocking; see the following for an example:
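A minimal sketch of where the blocking happens (the helper name is hypothetical; the point is that the implicit `bool()` conversion of the 0-dim result tensor is the sync):

```python
import torch

def positions_are_monotonic(position_ids: torch.Tensor) -> bool:
    # torch.diff and .all() run on the tensor's device and return a
    # 0-dim tensor; converting that tensor to a Python bool is what
    # forces a device-to-host sync when position_ids lives on a GPU.
    return bool((torch.diff(position_ids, dim=-1) >= 0).all())

# Contiguous positions: all diffs are 1, so the check passes.
contiguous = torch.arange(8).unsqueeze(0)
# Packed sequences restart their position ids, so one diff is negative.
packed = torch.tensor([[0, 1, 2, 3, 0, 1, 2, 3]])
```

On CPU this conversion is cheap, but on CUDA it blocks the host thread until all queued kernels producing `position_ids` have finished, once per attention layer.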

Proposal
Precompute the result of `(torch.diff(position_ids, dim=-1) >= 0).all()` and store it in the FlashAttentionKwargs, so that we don't have to perform this device-to-host sync in every attention call in every layer.
The only question is whether there exists a model for which this cannot be precomputed, i.e., whether the position_ids sequence can change during the forward pass for the same batch. Given that we already cache cu_seqlens in FlashAttentionKwargs (which carries equivalent information to position_ids), it should be safe to assume it cannot.
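The proposal above can be sketched as follows. This is a hypothetical illustration, not the existing transformers API: the kwarg name `position_ids_monotonic` and both helper functions are invented here, and the real change would live alongside the existing FlashAttentionKwargs / cu_seqlens preparation.

```python
import torch

def prepare_fa_kwargs(position_ids: torch.Tensor) -> dict:
    # Hypothetical sketch: pay the device-to-host sync exactly once per
    # batch, next to where cu_seqlens is already computed and cached.
    monotonic = bool((torch.diff(position_ids, dim=-1) >= 0).all())
    return {"position_ids_monotonic": monotonic}

def needs_varlen_path(query_length: int, max_length_q, fa_kwargs: dict) -> bool:
    # The original per-layer condition, now reading the precomputed
    # Python bool instead of re-evaluating the tensor expression,
    # so no per-layer sync occurs.
    return max_length_q is not None or (
        query_length != 1 and not fa_kwargs["position_ids_monotonic"]
    )
```

Each layer's check then becomes a pure host-side branch on a cached bool, which is exactly the pattern FlashAttentionKwargs already uses for cu_seqlens.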
Motivation
As stated above, this can severely degrade the out-of-the-box performance of transformers, and it is usually hard for a normal user to notice. The fix would follow an existing mechanism, i.e., the FlashAttentionKwargs approach of avoiding recomputation of FA-required kwargs.
Your contribution
I can prepare a PR if the team thinks the proposed approach is OK.