
Conversation

@Lucas-Fernandes-Martins
Contributor
commented Aug 28, 2025

Hi,

This PR removes the assert on line 217 of teleprompt/gepa/gepa_utils.py:

 assert fb["score"] == module_score, f"Currently, GEPA only supports feedback functions that return the same score as the module's score. However, the module-level score is {module_score} and the feedback score is {fb.score}."

As far as I have tested, this should solve issue #8703.

Let me know if any further changes are requested; I'll be happy to update the PR if needed.

Thank you :)

@LakshyAAAgrawal
Collaborator

Dear @Lucas-Fernandes-Martins, thank you so much!

Please have a look at this file from GEPA (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py).

In line 92 (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L92), GEPA evaluates with "traces", i.e., it expects the metric function to provide feedback (this is when your LLM-as-judge would run). Next, it proposes an improved prompt, and then in line 138 (https://github.com/gepa-ai/gepa/blob/main/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L138), it evaluates the new prompt on the same data instances, but this time with capture_traces=False.

If the metric returns different scores between capture_traces=False and capture_traces=True, this leads to a discrepancy when comparing the old program and the new program.
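Concretely, the flow looks roughly like this (a simplified sketch, not the actual GEPA code; the function names and the acceptance check are abbreviated):

def propose_and_compare(adapter, current_candidate, minibatch, propose_new_texts):
    # L92: evaluate the current candidate WITH traces, so the metric runs in
    # "feedback" mode (this is where your LLM-as-judge would be called).
    eval_curr = adapter.evaluate(minibatch, current_candidate, capture_traces=True)

    # Reflect on the captured trajectories/feedback and propose improved prompts.
    new_candidate = propose_new_texts(current_candidate, eval_curr.trajectories)

    # L138: evaluate the new candidate WITHOUT traces, so the metric runs in
    # "plain" mode. If it scores on a different scale here, the comparison
    # below is apples to oranges.
    eval_new = adapter.evaluate(minibatch, new_candidate, capture_traces=False)

    # The new candidate is kept only if its minibatch score improves.
    if sum(eval_new.scores) > sum(eval_curr.scores):
        return new_candidate
    return current_candidate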

A few thoughts:

  1. Line 138 can be modified to also have capture_traces=True
  2. The assert (https://github.com/stanfordnlp/dspy/pull/8731/files#diff-913da3b3974223651ea6ae9a72a1bd9a61df5c8e689128d7e5c09c01b3fbf3e4L217) can instead be converted to an assignment: fb["score"] = module_score.
  3. Create a flag that lets the user decide what score should be considered to track prompt improvement: the module-level metric, or the special trace feedback function?

@Lucas-Fernandes-Martins
Contributor Author

Hi @LakshyAAAgrawal, thank you very much for the detailed breakdown! Now I see why this inconsistency is an issue.

For now, I've implemented the following:

# If the user opted in, track candidates using the module-level score;
# otherwise use the score returned by the feedback metric.
if self.keep_module_scores:
    d['score'] = module_score
else:
    d['score'] = fb['score']

Regarding 1), wouldn't adding capture_traces=True to line 138 potentially increase overhead (e.g., if I'm calling an LLM-as-judge function, it would be called even when the feedback traces are not used)? I'm not sure why we'd want to set it to true other than to remove the inconsistency. Even so, if capture_traces=True but the metric function is non-deterministic, I guess it wouldn't make much difference, as inconsistencies would still happen.

I've added the keep_module_scores boolean flag to GEPA's constructor; please let me know if you think of a better name.
Also, I just wanted to confirm that d['score'] is the only variable that needs to be updated with either the module score or the feedback score, in case something slipped through in my implementation.
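For reference, here is roughly how I picture the flag being used (only keep_module_scores is new in this PR; the other arguments and the toy metric are just illustrative):

import dspy

def feedback_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Toy feedback function: exact-match score plus a short textual hint.
    score = float(gold.answer == pred.answer)
    return dspy.Prediction(score=score, feedback=f"Expected {gold.answer!r}.")

# keep_module_scores=True would make GEPA track candidates by the
# module-level metric score instead of the feedback function's score.
optimizer = dspy.GEPA(
    metric=feedback_metric,
    auto="light",
    reflection_lm=dspy.LM("openai/gpt-4o-mini"),
    keep_module_scores=True,
)
# `program`, `trainset`, and `valset` stand in for the user's own module and data.
optimized = optimizer.compile(program, trainset=trainset, valset=valset)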

Anyway, thank you very much, and let me know if this is closer to what you believe would be the right implementation.

@LakshyAAAgrawal
Collaborator

Hi @Lucas-Fernandes-Martins

Regarding 1), wouldn't adding capture_traces=True to line 138 potentially increase overhead (e.g., if I'm calling an LLM-as-judge function, it would be called even when the feedback traces are not used)?

True, it will increase the cost, but it could provide a much better estimate of whether the new candidate is better or not. Since the minibatch size is typically 3, this just means 3 additional LLM-as-judge calls per round, which could amount to some 100 additional calls over an optimization run; that could be acceptable for a large gain.

In the end, the API design should be such that:

  1. It doesn't let users do something obviously wrong: in this case, we want to prevent the user from ever sending scores calculated in 2 different manners in the 2 eval calls, as that will lead to comparisons at different scales. Currently this is implemented as an assert statement, which is too restrictive.
  2. It still allows users the flexibility to choose what they want: for some tasks, it may make sense to run LLM-as-judge for both evals; for others, it may make sense to just use the module-level scores.

In summary, I think the proper fix to this issue will involve:

  1. Some change in the gepa-ai/gepa repo (at https://github.com/gepa-ai/gepa/blob/2cf10c79125533af345ffbd48497005dd1f68ee9/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L138 and https://github.com/gepa-ai/gepa/blob/2cf10c79125533af345ffbd48497005dd1f68ee9/src/gepa/proposer/reflective_mutation/reflective_mutation.py#L92) to let the adapter know that GEPA is seeking an evaluation for candidate proposal. The adapter can then decide whether it wants to use the expensive (LLM-as-judge) or cheap (module-level) score for this task.
  2. A flag in dspy.GEPA, similar to the if self.keep_module_scores: check you have already implemented, that will let the end user choose.

I am not sure what the cleanest way to achieve (1) is here.
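Just to make the idea concrete though, one possible (not necessarily the cleanest) shape; the argument name for_candidate_selection and the helper attributes below are made up for this sketch:

class SketchAdapter:
    def __init__(self, run_program, feedback_metric, module_metric, use_expensive_score):
        self.run_program = run_program            # runs the candidate program on one example
        self.feedback_metric = feedback_metric    # e.g., LLM-as-judge; returns {"score": ..., "feedback": ...}
        self.module_metric = module_metric        # cheap module-level score
        self.use_expensive_score = use_expensive_score

    def evaluate(self, batch, candidate, capture_traces=False, for_candidate_selection=False):
        outputs, scores = [], []
        for example in batch:
            pred = self.run_program(candidate, example)
            if for_candidate_selection and not self.use_expensive_score:
                score = self.module_metric(example, pred)               # cheap path
            else:
                score = self.feedback_metric(example, pred)["score"]    # expensive path
            outputs.append(pred)
            scores.append(score)
        return {"outputs": outputs, "scores": scores, "trajectories": None}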

@Lucas-Fernandes-Martins
Contributor Author
commented Sep 2, 2025

Hi, thanks for the detailed breakdown!

I gave it some thought, and now I agree that the default behavior of L138 should be capture_traces=True:

eval_new = self.adapter.evaluate(minibatch, new_candidate, capture_traces=True)

Would it make sense to add a flag so the user can choose whether the new prompt should be evaluated with capture_traces=False or capture_traces=True? It would involve adding the flag to ReflectiveMutationProposer's constructor and a parameter to the optimize method in api.py.
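Roughly what I have in mind (the flag name is just a placeholder, and the real ReflectiveMutationProposer constructor takes more arguments than shown here):

class ReflectiveMutationProposer:
    def __init__(self, adapter, evaluate_new_candidate_with_traces: bool = True):
        self.adapter = adapter
        self.evaluate_new_candidate_with_traces = evaluate_new_candidate_with_traces

    def _evaluate_new_candidate(self, minibatch, new_candidate):
        # L138 would then respect the flag instead of hardcoding capture_traces.
        return self.adapter.evaluate(
            minibatch,
            new_candidate,
            capture_traces=self.evaluate_new_candidate_with_traces,
        )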

To me this seems like a reasonably uncomplicated way to give the user that flexibility. Since the new parameter would be optional, I don't think it would break any existing code. If I understood correctly, this aligns with your idea in 2.

However, I also see your initial point about always leaving capture_traces=True for the new prompt, as most users would probably want the exact same scoring method for the old and new prompt candidates.

Regarding 1), do you mean it would be interesting to let the adapter have some custom logic for deciding when to use LLM-as-judge and when not to? (e.g., possibly building an EvaluationBatch with the module-score logic and, if scores are too low, then calling LLM-as-judge to try to get more useful feedback; a totally hypothetical scenario I'm making up here).

I'm a bit confused, because the role of the assert in gepa_utils.py seemed to me to be to ensure the module-level score matched the feedback score that had been calculated during reflection for the same prompt. But even without the assert, if capture_traces=True in L138, it seems to me it would be impossible for the user to call different scoring functions during minibatch eval. The worst that could happen is the feedback score being very different from the module-level score, making GEPA inefficient by discarding potentially good candidates in the minibatch score-improvement check.

Excited to hear what you think is the right path! Thanks again for the help.

@LakshyAAAgrawal
Collaborator

Hi @Lucas-Fernandes-Martins,

I am slightly busy with some parallel work, but I will get to this soon. This is a very necessary improvement to GEPA.

@Lucas-Fernandes-Martins
Contributor Author

Got it @LakshyAAAgrawal, if there's something I can do to help in the meantime let me know :)

@LakshyAAAgrawal
Collaborator

@Lucas-Fernandes-Martins , here's a quick plan since this is tripping up many users:

  1. We just introduce a warning:
    if fb["score"] != module_score:
        logger.warning("The score returned by the metric with pred_name is different from the overall metric score. This can indicate 2 things: Either the metric is non-deterministic (e.g., LLM-as-judge, Semantic score, etc.) or the metric returned a score specific to pred_name that differs from the module level score. Currently, GEPA does not support predictor level scoring (support coming soon), and only requires a feedback text to be provided, which can be specific to the predictor or program level. GEPA will ignore the differing score returned, and instead use module level score.")
        fb["score"] = module_score
    

Can you create a separate PR, so that we don't lose our detailed discussion here? We can merge that PR immediately and come back to this later.

Also, ensure that the warning is printed only once (the metric will be called thousands of times), so we may need to introduce a counter for this in the class.
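Something along these lines should work (a sketch only; the class and attribute names are just suggestions):

import logging

logger = logging.getLogger(__name__)

class ScoreMismatchWarner:
    """Emit the score-mismatch warning at most once per optimization run."""

    def __init__(self):
        self.warned = False

    def reconcile(self, fb, module_score):
        if fb["score"] != module_score:
            if not self.warned:
                logger.warning(
                    "The score returned by the metric with pred_name differs from "
                    "the module-level score. GEPA will ignore the differing score "
                    "and use the module-level score instead. (Shown only once.)"
                )
                self.warned = True
            fb["score"] = module_score
        return fb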

@Lucas-Fernandes-Martins
Contributor Author

Hi @LakshyAAAgrawal, thank you!

I already have the quick fix in a separate PR: #8777

Unfortunately, I made a mistake when rebasing my fork and GitHub closed this PR automatically. But if that's OK, we can keep the discussion here, and I'll raise a new one once we settle on a course of action.

Thank you for your help!
