### System Info

transformers `main` branch
### Who can help?
### Information

- The official example scripts
- My own modified scripts
### Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
### Reproduction

```python
from transformers import AutoProcessor
import numpy as np

audios = [np.random.randn(16000 * 5)]
processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

print(processor(
    [audios[0][: 160 * 5 - 1]], return_attention_mask=True, sampling_rate=16000, padding=False
)["attention_mask"].shape)
print(processor(
    [audios[0][: 160 * 5]], return_attention_mask=True, sampling_rate=16000, padding=False
)["attention_mask"].shape)
print(processor(
    [audios[0][: 160 * 5 + 1]], return_attention_mask=True, sampling_rate=16000, padding=False
)["attention_mask"].shape)
```
### Expected behavior

The `input_features` and `attention_mask` length should be `audio_length // hop_length`, but the code here:

transformers/src/transformers/models/whisper/feature_extraction_whisper.py, lines 327 to 329 (at e8e0c76)

makes the `attention_mask` length equal to `(audio_length + hop_length - 1) // hop_length`. It should be changed to:
```diff
diff --git a/src/transformers/models/whisper/feature_extraction_whisper.py b/src/transformers/models/whisper/feature_extraction_whisper.py
index 68c52c6eb3..b9f3b4cb35 100644
--- a/src/transformers/models/whisper/feature_extraction_whisper.py
+++ b/src/transformers/models/whisper/feature_extraction_whisper.py
@@ -326,7 +326,9 @@ class WhisperFeatureExtractor(SequenceFeatureExtractor):
         if return_attention_mask:
             # rescale from sample (48000) to feature (3000)
-            padded_inputs["attention_mask"] = padded_inputs["attention_mask"][:, :: self.hop_length]
+            padded_inputs["attention_mask"] = padded_inputs["attention_mask"][
+                :, self.hop_length - 1 :: self.hop_length
+            ]
```
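Independent of the Whisper processor, the off-by-one can be illustrated with plain NumPy strided slicing (a minimal sketch, assuming `hop_length = 160` as in Whisper's feature extractor): `mask[:, ::hop]` keeps indices `0, hop, 2*hop, ...` and so yields `ceil(n / hop)` frames, while the proposed `mask[:, hop - 1 :: hop]` keeps `hop-1, 2*hop-1, ...` and yields `floor(n / hop)` frames.

```python
import numpy as np

hop_length = 160  # Whisper's hop length

for n_samples in (160 * 5 - 1, 160 * 5, 160 * 5 + 1):
    mask = np.ones((1, n_samples), dtype=np.int64)

    # current slicing: starts at index 0 -> ceil(n_samples / hop_length) frames
    current = mask[:, ::hop_length]
    # proposed slicing: starts at hop_length - 1 -> floor(n_samples / hop_length) frames
    proposed = mask[:, hop_length - 1 :: hop_length]

    print(n_samples, current.shape[1], proposed.shape[1])
# 799 -> current 5, proposed 4
# 800 -> current 5, proposed 5
# 801 -> current 6, proposed 5
```

Only the proposed slice matches `audio_length // hop_length` for all three lengths, which is what the reproduction above checks through the full processor.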