Fixing #8703: GEPA does not need to assert that the score returned by feedback func matches the rollout score (#8731)
Conversation
Dear @Lucas-Fernandes-Martins, thank you so much! Please have a look at this file from GEPA (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py). In line 92 (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L92), GEPA evaluates with "traces", i.e., it expects the metric function to provide feedback (this is when your LLM-as-judge would run). Next, it proposes an improved prompt, and then in line 138 (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L138), it evaluates the new prompt on the same data instances, but this time with traces=False. If the metric returns different scores between these two evaluation modes, the comparison of the new candidate against the old one mixes two different scoring regimes. A few thoughts:
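To make the two-phase flow concrete, here is a minimal sketch of the pattern described above; the names `metric_fn`, `propose_fn`, `capture_traces`, and the `.score` attribute are assumptions for illustration, not GEPA's actual internals.

```python
# Illustrative sketch only; names are assumptions, not GEPA's actual code.

def reflective_mutation_round(candidate, minibatch, metric_fn, propose_fn):
    # Phase 1 (around L92): evaluate with traces so the metric can return rich
    # feedback; this is where an LLM-as-judge would typically run.
    feedback = [metric_fn(ex, candidate, capture_traces=True) for ex in minibatch]

    # Propose an improved prompt from the collected feedback.
    new_candidate = propose_fn(candidate, feedback)

    # Phase 2 (around L138): re-evaluate the new prompt on the same instances,
    # but without traces. If the metric scores differently in this mode, the
    # comparison below mixes two different scoring regimes.
    new_scores = [metric_fn(ex, new_candidate, capture_traces=False).score
                  for ex in minibatch]
    old_scores = [fb.score for fb in feedback]
    return new_candidate if sum(new_scores) > sum(old_scores) else candidate
```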
Hi @LakshyAAAgrawal, thank you very much for the detailed breakdown! Now I do get why this inconsistency is an issue. For now, I've implemented the following:
Regarding 1), wouldn't adding …? I've added the …. Anyway, thank you very much, and let me know if this is closer to what you believe would be the right implementation.
True, it will increase the cost, but it could provide a much better estimate of whether the new candidate is better or not. Since the minibatch size is typically 3, this just means 3 additional LLM-as-judge calls per round, which could amount to some 100 additional calls over an optimization run; that could be acceptable for a large gain. In the end, the API design should be such that:
In summary, I think the proper fix to this issue will involve:
I am not sure what the cleanest way to achieve (1) is here.
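One possible shape for such a metric API, sketched under assumptions: the `ScoreWithFeedback` type, the `capture_traces` flag, and the `run_llm_judge` helper below are illustrative, not GEPA's actual interface. The key property is that the numeric score is computed identically in both modes, and only the feedback path pays for the judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreWithFeedback:
    score: float
    feedback: Optional[str] = None  # textual feedback, e.g. from an LLM-as-judge


def run_llm_judge(example, prediction) -> str:
    # Hypothetical stand-in for an LLM-as-judge call that produces textual feedback.
    return "The answer misses one of the constraints stated in the question."


def metric(example, prediction, capture_traces: bool = False) -> ScoreWithFeedback:
    # The numeric score is computed the same way in both modes, so the
    # traces=True and traces=False evaluations always agree on the number.
    score = float(prediction.answer == example.answer)

    if not capture_traces:
        return ScoreWithFeedback(score=score)

    # Only the feedback path incurs the extra judge call discussed above.
    return ScoreWithFeedback(score=score, feedback=run_llm_judge(example, prediction))
```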
Hi, thanks for the detailed breakdown! I gave it some thought, and now I agree that the default behavior of L138
should be …. Would it make sense to add a flag so the user can choose whether the new prompt is evaluated with …? To me it seems like a not-too-convoluted way to give the user this flexibility, and since the new parameter would be optional, I guess it would not break any existing code. If I understood correctly, this aligns with your idea on 2. However, I also see your initial point of always leaving …. Regarding 1), you mean it would be interesting to let the adapter have some custom logic on potentially sometimes using the LLM-as-judge and sometimes not? (e.g., possibly doing an EvaluationBatch with module-score logic and, if the scores are too low, then calling the LLM-as-judge to try to get more useful feedback; a totally hypothetical scenario I'm making up here.) I'm a bit confused, because the role of the assert in gepa_utils.py seemed to me to be to ensure the module-level feedback matched the one that had been calculated during reflection for the same prompt. But even without the assert, if …. Excited to hear what you think is the right path! Thanks again for the help.
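A rough sketch of what such an optional flag could look like; the parameter name `use_feedback_for_reeval` and the surrounding function are hypothetical, not part of GEPA's actual API:

```python
def evaluate_new_candidate(new_candidate, minibatch, metric_fn,
                           use_feedback_for_reeval: bool = False):
    # When the flag is on, re-evaluate with the same feedback path that was
    # used during reflection, so both sides of the comparison come from the
    # same metric mode. The default keeps the cheaper traces=False path and,
    # being optional, should not break existing callers.
    results = [
        metric_fn(ex, new_candidate, capture_traces=use_feedback_for_reeval)
        for ex in minibatch
    ]
    return [r.score for r in results]
```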
I am slightly busy with some parallel work, but I will get to this soon. This is a very necessary improvement to GEPA.
Got it @LakshyAAAgrawal, if there's something I can do to help in the meantime, let me know :)
@Lucas-Fernandes-Martins, here's a quick plan, since this is tripping up many users:
Can you create a separate PR, so that we don't lose our detailed discussion here? We can merge that PR immediately and come back to this later. Also, ensure that the warning is printed only once (the metric will be getting called thousands of times), so we may need to introduce a counter for this in the class.
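A minimal sketch of the warn-once behavior described above; the class and attribute names are illustrative, not the actual DSPy code:

```python
import logging

logger = logging.getLogger(__name__)


class FeedbackMetricWrapper:
    """Illustrative wrapper around the user-provided metric; names are assumptions."""

    def __init__(self, metric_fn):
        self.metric_fn = metric_fn
        self._score_mismatch_warned = False  # one-shot guard across thousands of calls

    def check_scores(self, feedback_score, rollout_score):
        # Warn once instead of asserting; the metric is called far too often
        # to emit this on every mismatch.
        if feedback_score != rollout_score and not self._score_mismatch_warned:
            logger.warning(
                "Feedback score (%s) differs from rollout score (%s); "
                "proceeding without asserting equality. This warning is "
                "shown only once per run.",
                feedback_score, rollout_score,
            )
            self._score_mismatch_warned = True
```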
Hi @LakshyAAAgrawal, thank you! I already got the quick fix into a separate PR: #8777. Unfortunately, I made a mistake when rebasing my fork and GitHub closed this PR automatically. But if that's okay, we can keep the discussion here, and I'll raise a new one once we settle on a course of action. Thank you for your help!
Hi,
Removing the assert in line 217 of teleprompt/gepa/gepa_utils.py:
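Roughly, the kind of check in question and the proposed relaxation; this is an illustrative reconstruction with assumed variable names, not a verbatim copy of the upstream file:

```python
import logging


def check_scores(feedback_score: float, rollout_score: float) -> None:
    # Previously (roughly): assert feedback_score == rollout_score
    # Now: log the discrepancy instead of crashing the optimization run.
    if feedback_score != rollout_score:
        logging.warning(
            "Feedback score %s differs from rollout score %s; proceeding "
            "without asserting equality.", feedback_score, rollout_score,
        )
```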
As far as I have tested, this should solve issue #8703.
Let me know if any further changes are requested; I'll be happy to update the PR if needed.
Thank you :)