fix: MTEB-NL prompts #3516
  year = {2025},
}
""",
        prompt={
Are we sure we want an English prompt for a Dutch dataset?
I think all our prompts in `TaskMetadata` are in English.
But I don't think there is any need to force that. (For SEB v2 I will probably rework them into their respective languages.)
Good question. However, it is usually model-dependent. As far as I remember, the e5 models are trained on English instructions. Also, I guess it should not be an issue for multilingual models, because they are usually trained on a large portion of English instructions.
@nikolay-banar Yeah, the model providers can always replace the prompt if, e.g., they only support English prompts. Ultimately I think it is a decision for you to make as the developer of the benchmark. We can definitely keep them in English if you prefer.
(Last time I checked, e5 actually performed slightly better on SEB if you used a Danish prompt.)
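To make the "model providers can always replace the prompt" point concrete, here is a minimal sketch of how a model-side override could take precedence over a task's default prompt. Everything in it is hypothetical: `TASK_PROMPTS`, `resolve_prompt`, the `"query"`/`"document"` keys, and both prompt strings are illustrative assumptions, not the mteb API or the prompts in this PR.

```python
from typing import Mapping, Optional

# Hypothetical task defaults, as a benchmark author might ship them.
# The Dutch wording is illustrative only, not taken from the PR.
TASK_PROMPTS: dict[str, dict[str, str]] = {
    "ArguAnaNLv2": {
        "query": "Gegeven een bewering, vind documenten die de bewering weerleggen",
    },
}


def resolve_prompt(
    task_name: str,
    prompt_type: str,
    overrides: Optional[Mapping[str, Mapping[str, str]]] = None,
) -> Optional[str]:
    """Return the model's override if present, else the task default, else None."""
    # Sources are checked in priority order: model overrides first, then task defaults.
    for source in (overrides or {}, TASK_PROMPTS):
        prompt = source.get(task_name, {}).get(prompt_type)
        if prompt is not None:
            return prompt
    return None


# A model that only supports English instructions supplies its own prompt (illustrative text).
english_only = {
    "ArguAnaNLv2": {"query": "Given a claim, find documents that refute the claim"},
}

default_prompt = resolve_prompt("ArguAnaNLv2", "query")
overridden_prompt = resolve_prompt("ArguAnaNLv2", "query", english_only)
```

With this layering, translating the benchmark's default prompts to Dutch would not stop an English-only model from using its own English prompt.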
@KennethEnevoldsen I will translate the prompts then.
@KennethEnevoldsen What about multilingual datasets? Would it make sense to create Dutch versions with the corresponding prompts?
> @KennethEnevoldsen What about multilingual datasets? Would it make sense to create Dutch versions with the corresponding prompts?

For multilingual datasets I would probably keep the prompts in English to avoid having two versions of the task (we could probably have different prompts for different subsets; feel free to create an issue on this).
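The "different prompts for different subsets" idea could look roughly like the sketch below: one task, an English default, and optional per-subset overrides. This is only an assumption about how such a feature might be shaped. `DEFAULT_PROMPT`, `SUBSET_PROMPTS`, `prompt_for_subset`, and the prompt strings are all hypothetical and do not exist in mteb.

```python
# Hypothetical English default used for every subset without its own prompt.
DEFAULT_PROMPT = "Retrieve relevant documents for the query"

# Hypothetical per-subset overrides, keyed by language code (illustrative Dutch wording).
SUBSET_PROMPTS: dict[str, str] = {
    "nl": "Zoek relevante documenten voor de zoekvraag",
}


def prompt_for_subset(subset: str) -> str:
    """Return the subset-specific prompt, falling back to the English default."""
    return SUBSET_PROMPTS.get(subset, DEFAULT_PROMPT)


dutch_prompt = prompt_for_subset("nl")
german_prompt = prompt_for_subset("de")  # no override defined, so the default applies
```

Keeping the fallback in one place means a multilingual task stays a single task, and a subset only diverges from the English prompt when someone adds an entry for it.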
class ArguAnaNLv2(AbsTaskRetrieval):
    ignore_identical_ids = True

    metadata = TaskMetadata(
We have to add to the description what has changed between the two versions.
Co-authored-by: Roman Solomatin <[email protected]>
Following the discussion #3339 (comment)