
Conversation

@Lucas-Fernandes-Martins
Contributor
commented Aug 28, 2025

Hi,

This PR removes the assert on line 217 of teleprompt/gepa/gepa_utils.py:

 assert fb["score"] == module_score, f"Currently, GEPA only supports feedback functions that return the same score as the module's score. However, the module-level score is {module_score} and the feedback score is {fb.score}."

As far as I have tested, this should solve issue #8703.

Let me know if any further changes are requested; I'll be happy to update the PR if needed.

Thank you :)

@LakshyAAAgrawal
Collaborator

Dear @Lucas-Fernandes-Martins, thank you so much!

Please have a look at this file from GEPA (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py).

In line 92 (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L92), GEPA evaluates with "traces", i.e., it expects the metric function to provide feedback (this is when your LLM-as-judge would run). Next, it proposes an improved prompt, and then in line 138 (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L138), it evaluates the new prompt on the same data instances, but this time with capture_traces=False.

If the metric returns different scores between capture_traces=False and capture_traces=True, this leads to a discrepancy when comparing the old program and the new program.
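Concretely, the flow looks roughly like this (a simplified sketch, not the actual GEPA code; the function names and the acceptance check are abbreviated):

def propose_and_compare(adapter, current_candidate, minibatch, propose_new_texts):
    # L92: evaluate the current candidate WITH traces, so the metric runs in
    # "feedback" mode (this is where your LLM-as-judge would be called).
    eval_curr = adapter.evaluate(minibatch, current_candidate, capture_traces=True)

    # Reflect on the captured trajectories/feedback and propose improved prompts.
    new_candidate = propose_new_texts(current_candidate, eval_curr.trajectories)

    # L138: evaluate the new candidate WITHOUT traces, so the metric runs in
    # "plain" mode. If it scores on a different scale here, the comparison
    # below is apples to oranges.
    eval_new = adapter.evaluate(minibatch, new_candidate, capture_traces=False)

    # The new candidate is kept only if its minibatch score improves.
    if sum(eval_new.scores) > sum(eval_curr.scores):
        return new_candidate
    return current_candidate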

A few thoughts:

  1. Line 138 can be modified to also have capture_traces=True
  2. The assert (https://github.com/stanfordnlp/dspy/pull/8731/files#diff-913da3b3974223651ea6ae9a72a1bd9a61df5c8e689128d7e5c09c01b3fbf3e4L217) can instead be converted to an assignment: fb["score"] = module_score.
  3. Create a flag that lets the user decide what score should be considered to track prompt improvement: the module-level metric, or the special trace feedback function?

@Lucas-Fernandes-Martins
Contributor Author

Hi @LakshyAAAgrawal, thank you very much for the detailed breakdown! Now I see why this inconsistency is an issue.

For now, I've implemented the following:

# If the user opted in, track candidates using the module-level score;
# otherwise use the score returned by the feedback metric.
if self.keep_module_scores:
    d['score'] = module_score
else:
    d['score'] = fb['score']

Regarding 1), wouldn't adding capture_traces=True to line 138 potentially increase overhead (e.g., if I'm calling an LLM-as-judge function, it would be called even when the feedback traces are not used)? I'm not sure why we'd want to set it to true other than to remove the inconsistency. Even so, if capture_traces=True but the metric function is non-deterministic, I guess it wouldn't make much difference, as inconsistencies would still happen.

I've added the keep_module_scores boolean flag to GEPA's constructor; please let me know if you think of a better name.
Also, I just wanted to confirm that d['score'] is the only variable that needs to be updated with either the module score or the feedback score, in case something slipped through in my implementation.
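For reference, here is roughly how I picture the flag being used (only keep_module_scores is new in this PR; the other arguments and the toy metric are just illustrative):

import dspy

def feedback_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Toy feedback function: exact-match score plus a short textual hint.
    score = float(gold.answer == pred.answer)
    return dspy.Prediction(score=score, feedback=f"Expected {gold.answer!r}.")

# keep_module_scores=True would make GEPA track candidates by the
# module-level metric score instead of the feedback function's score.
optimizer = dspy.GEPA(
    metric=feedback_metric,
    auto="light",
    reflection_lm=dspy.LM("openai/gpt-4o-mini"),
    keep_module_scores=True,
)
# `program`, `trainset`, and `valset` stand in for the user's own module and data.
optimized = optimizer.compile(program, trainset=trainset, valset=valset)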

Anyway, thank you very much, and let me know if this is closer to what you believe would be the right implementation.

@LakshyAAAgrawal
Collaborator

Hi @Lucas-Fernandes-Martins

Regarding 1), wouldn't adding capture_traces=True to line 138 potentially increase overhead (e.g., if I'm calling an LLM-as-judge function, it would be called even when the feedback traces are not used)?

True, it will increase the cost, but it could provide a much better estimate of whether the new candidate is better or not. Since the minibatch size is typically 3, this just means 3 additional LLM-as-judge calls per round, which could amount to some 100 additional calls over an optimization run; that could be acceptable for a large gain.

In the end, the API design should be such that:

  1. It doesn't let users do something obviously wrong: in this case, we want to prevent the user from ever sending scores calculated in 2 different manners in the 2 eval calls, as that will lead to comparisons at different scales. Currently this is implemented as an assert statement, which is too restrictive.
  2. It still allows users the flexibility to choose what they want: for some tasks, it may make sense to run LLM-as-judge for both evals; for others, it may make sense to just use the module-level scores.

In summary, I think the proper fix to this issue will involve:

  1. Some change in the gepa-ai/gepa repo (at https://github.com/gepa-ai/gepa/blob/2cf10c79125533af345ffbd48497005dd1f68ee9/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L138 and https://github.com/gepa-ai/gepa/blob/2cf10c79125533af345ffbd48497005dd1f68ee9/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L92) to let the adapter know that GEPA is seeking an evaluation for candidate proposal. The adapter can then decide whether it wants to use the expensive (LLM-as-judge) or cheap (module-level) score for this task.
  2. A flag in dspy.GEPA, similar to the if self.keep_module_scores: check you have already implemented, that will let the end user choose.

I am not sure what the cleanest way to achieve (1) is here.
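Just to make the idea concrete though, one possible (not necessarily the cleanest) shape; the argument name for_candidate_selection and the helper attributes below are made up for this sketch:

class SketchAdapter:
    def __init__(self, run_program, feedback_metric, module_metric, use_expensive_score):
        self.run_program = run_program            # runs the candidate program on one example
        self.feedback_metric = feedback_metric    # e.g., LLM-as-judge; returns {"score": ..., "feedback": ...}
        self.module_metric = module_metric        # cheap module-level score
        self.use_expensive_score = use_expensive_score

    def evaluate(self, batch, candidate, capture_traces=False, for_candidate_selection=False):
        outputs, scores = [], []
        for example in batch:
            pred = self.run_program(candidate, example)
            if for_candidate_selection and not self.use_expensive_score:
                score = self.module_metric(example, pred)               # cheap path
            else:
                score = self.feedback_metric(example, pred)["score"]    # expensive path
            outputs.append(pred)
            scores.append(score)
        return {"outputs": outputs, "scores": scores, "trajectories": None}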

@Lucas-Fernandes-Martins
Contributor Author
commented Sep 2, 2025

Hi, thanks for the detailed breakdown!

I gave it some thought, and now I agree that the default behavior of L138 should be capture_traces=True:

eval_new = self.adapter.evaluate(minibatch, new_candidate, capture_traces=True)

Would it make sense to add a flag so the user can choose whether the new prompt should be evaluated with capture_traces=False or capture_traces=True? It would involve adding the flag to ReflectiveMutationProposer's constructor and a parameter to the optimize method in api.py.
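Roughly what I have in mind (the flag name is just a placeholder, and the real ReflectiveMutationProposer constructor takes more arguments than shown here):

class ReflectiveMutationProposer:
    def __init__(self, adapter, evaluate_new_candidate_with_traces: bool = True):
        self.adapter = adapter
        self.evaluate_new_candidate_with_traces = evaluate_new_candidate_with_traces

    def _evaluate_new_candidate(self, minibatch, new_candidate):
        # L138 would then respect the flag instead of hardcoding capture_traces.
        return self.adapter.evaluate(
            minibatch,
            new_candidate,
            capture_traces=self.evaluate_new_candidate_with_traces,
        )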

To me this seems like a reasonably uncomplicated way to give the user that flexibility. Since the new parameter would be optional, I don't think it would break any existing code. If I understood correctly, this aligns with your idea in 2.

However, I also see your initial point about always leaving capture_traces=True for the new prompt, as most users would probably want the exact same scoring method for the old and new prompt candidates.

Regarding 1), do you mean it would be interesting to let the adapter have some custom logic for deciding when to use LLM-as-judge and when not to? (e.g., possibly building an EvaluationBatch with the module-score logic and, if scores are too low, then calling LLM-as-judge to try to get more useful feedback; a totally hypothetical scenario I'm making up here).

I'm a bit confused, because the role of the assert in gepa_utils.py seemed to me to be to ensure the module-level score matched the feedback score that had been calculated during reflection for the same prompt. But even without the assert, if capture_traces=True in L138, it seems to me it would be impossible for the user to call different scoring functions during minibatch eval. The worst that could happen is the feedback score being very different from the module-level score, making GEPA inefficient by discarding potentially good candidates in the minibatch score-improvement check.

Excited to hear what you think is the right path! Thanks again for the help.

@LakshyAAAgrawal
Collaborator

Hi @Lucas-Fernandes-Martins,

I am slightly busy with some parallel work, but I will get to this soon. This is a very necessary improvement to GEPA.

@Lucas-Fernandes-Martins
Contributor Author

Got it @LakshyAAAgrawal, if there's something I can do to help in the meantime let me know :)

@LakshyAAAgrawal
Collaborator

@Lucas-Fernandes-Martins , here's a quick plan since this is tripping up many users:

  1. We just introduce a warning:
    if fb["score"] != module_score:
        logger.warning("The score returned by the metric with pred_name is different from the overall metric score. This can indicate 2 things: Either the metric is non-deterministic (e.g., LLM-as-judge, Semantic score, etc.) or the metric returned a score specific to pred_name that differs from the module level score. Currently, GEPA does not support predictor level scoring (support coming soon), and only requires a feedback text to be provided, which can be specific to the predictor or program level. GEPA will ignore the differing score returned, and instead use module level score.")
        fb["score"] = module_score
    

Can you create a separate PR, so that we don't lose our detailed discussion here? We can merge that PR immediately and come back to this later.

Also, ensure that the warning is printed only once (the metric will be called thousands of times), so we may need to introduce a counter for this in the class.
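Something along these lines should work (a sketch only; the class and attribute names are just suggestions):

import logging

logger = logging.getLogger(__name__)

class ScoreMismatchWarner:
    """Emit the score-mismatch warning at most once per optimization run."""

    def __init__(self):
        self.warned = False

    def reconcile(self, fb, module_score):
        if fb["score"] != module_score:
            if not self.warned:
                logger.warning(
                    "The score returned by the metric with pred_name differs from "
                    "the module-level score. GEPA will ignore the differing score "
                    "and use the module-level score instead. (Shown only once.)"
                )
                self.warned = True
            fb["score"] = module_score
        return fb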

@Lucas-Fernandes-Martins
Contributor Author

Hi @LakshyAAAgrawal, thank you!

I already have the quick fix in a separate PR: #8777

Unfortunately, I made a mistake when rebasing my fork and GitHub closed this PR automatically. But if that's OK, we can keep the discussion here, and I'll raise a new one once we settle on a course of action.

Thank you for your help!
