Commit 8355611
adding flag to switch between feedback score and module score
1 parent 4889c02 commit 8355611

dspy/teleprompt/gepa/gepa.py

Lines changed: 57 additions & 33 deletions
@@ -197,38 +197,60 @@ def metric(
     # pareto_frontier is a list of scores, one for each task in the batch.
     ```
-    Parameters:
-    - metric: The metric function to use for feedback and evaluation.
-
-    Budget configuration (exactly one of the following must be provided):
-    - auto: The auto budget to use for the run.
-    - max_full_evals: The maximum number of full evaluations to perform.
-    - max_metric_calls: The maximum number of metric calls to perform.
-
-    Reflection based configuration:
-    - reflection_minibatch_size: The number of examples to use for reflection in a single GEPA step.
-    - candidate_selection_strategy: The strategy to use for candidate selection. Default is "pareto", which stochastically selects candidates from the Pareto frontier of all validation scores.
-    - reflection_lm: [Required] The language model to use for reflection. GEPA benefits from a strong reflection model, and you can use `dspy.LM(model='gpt-5', temperature=1.0, max_tokens=32000)` to get a good reflection model.
-
-    Merge-based configuration:
-    - use_merge: Whether to use merge-based optimization. Default is True.
-    - max_merge_invocations: The maximum number of merge invocations to perform. Default is 5.
-
-    Evaluation configuration:
-    - num_threads: The number of threads to use for evaluation with `Evaluate`
-    - failure_score: The score to assign to failed examples. Default is 0.0.
-    - perfect_score: The maximum score achievable by the metric. Default is 1.0. Used by GEPA to determine if all examples in a minibatch are perfect.
-
-    Logging configuration:
-    - log_dir: The directory to save the logs. GEPA saves elaborate logs, along with all the candidate programs, in this directory. Running GEPA with the same `log_dir` will resume the run from the last checkpoint.
-    - track_stats: Whether to return detailed results and all proposed programs in the `detailed_results` attribute of the optimized program. Default is False.
-    - use_wandb: Whether to use wandb for logging. Default is False.
-    - wandb_api_key: The API key to use for wandb. If not provided, wandb will use the API key from the environment variable `WANDB_API_KEY`.
-    - wandb_init_kwargs: Additional keyword arguments to pass to `wandb.init`.
-    - track_best_outputs: Whether to track the best outputs on the validation set. track_stats must be True if track_best_outputs is True. `optimized_program.detailed_results.best_outputs_valset` will contain the best outputs for each task in the validation set.
-
-    Reproducibility:
-    - seed: The random seed to use for reproducibility. Default is 0.
+    Args:
+        metric: The metric function to use for feedback and evaluation.
+        auto: The auto budget to use for the run. Options: "light", "medium", "heavy".
+        max_full_evals: The maximum number of full evaluations to perform.
+        max_metric_calls: The maximum number of metric calls to perform.
+        reflection_minibatch_size: The number of examples to use for reflection in a single GEPA step. Default is 3.
+        candidate_selection_strategy: The strategy to use for candidate selection. Default is "pareto",
+            which stochastically selects candidates from the Pareto frontier of all validation scores.
+            Options: "pareto", "current_best".
+        reflection_lm: The language model to use for reflection. Required parameter. GEPA benefits from
+            a strong reflection model. Consider using `dspy.LM(model='gpt-5', temperature=1.0, max_tokens=32000)`
+            for optimal performance.
+        skip_perfect_score: Whether to skip examples with perfect scores during reflection. Default is True.
+        add_format_failure_as_feedback: Whether to add format failures as feedback. Default is False.
+        use_merge: Whether to use merge-based optimization. Default is True.
+        max_merge_invocations: The maximum number of merge invocations to perform. Default is 5.
+        num_threads: The number of threads to use for evaluation with `Evaluate`. Optional.
+        failure_score: The score to assign to failed examples. Default is 0.0.
+        perfect_score: The maximum score achievable by the metric. Default is 1.0. Used by GEPA
+            to determine if all examples in a minibatch are perfect.
+        log_dir: The directory to save the logs. GEPA saves elaborate logs, along with all candidate
+            programs, in this directory. Running GEPA with the same `log_dir` will resume the run
+            from the last checkpoint.
+        track_stats: Whether to return detailed results and all proposed programs in the `detailed_results`
+            attribute of the optimized program. Default is False.
+        use_wandb: Whether to use wandb for logging. Default is False.
+        wandb_api_key: The API key to use for wandb. If not provided, wandb will use the API key
+            from the environment variable `WANDB_API_KEY`.
+        wandb_init_kwargs: Additional keyword arguments to pass to `wandb.init`.
+        track_best_outputs: Whether to track the best outputs on the validation set. track_stats must
+            be True if track_best_outputs is True. The optimized program's `detailed_results.best_outputs_valset`
+            will contain the best outputs for each task in the validation set.
+        seed: The random seed to use for reproducibility. Default is 0.
+
+    Note:
+        Budget Configuration: Exactly one of `auto`, `max_full_evals`, or `max_metric_calls` must be provided.
+        The `auto` parameter provides preset configurations: "light" for quick experimentation, "medium" for
+        balanced optimization, and "heavy" for thorough optimization.
+
+        Reflection Configuration: The `reflection_lm` parameter is required and should be a strong language model.
+        GEPA performs best with models like `dspy.LM(model='gpt-5', temperature=1.0, max_tokens=32000)`.
+        The reflection process analyzes failed examples to generate feedback for program improvement.
+
+        Merge Configuration: GEPA can merge successful program variants using `use_merge=True`.
+        The `max_merge_invocations` parameter controls how many merge attempts are made during optimization.
+
+        Evaluation Configuration: Use `num_threads` to parallelize evaluation. The `failure_score` and
+        `perfect_score` parameters help GEPA understand your metric's range and optimize accordingly.
+
+        Logging Configuration: Set `log_dir` to save detailed logs and enable checkpoint resuming.
+        Use `track_stats=True` to access detailed optimization results via the `detailed_results` attribute.
+        Enable `use_wandb=True` for experiment tracking and visualization.
+
+        Reproducibility: Set `seed` to ensure consistent results across runs with the same configuration.
     """
     def __init__(
         self,
@@ -260,7 +282,7 @@ def __init__(
         track_best_outputs: bool = False,
         # Reproducibility
         seed: int | None = 0,
-        keep_module_scores: bool = False,
+        keep_module_scores: bool = False
     ):
         try:
             inspect.signature(metric).bind(None, None, None, None, None)
@@ -468,6 +490,8 @@ def feedback_fn(
             wandb_api_key=self.wandb_api_key,
             wandb_init_kwargs=self.wandb_init_kwargs,
             track_best_outputs=self.track_best_outputs,
+            display_progress_bar=True,
+            raise_on_exception=True,

             # Reproducibility
             seed=self.seed,
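The docstring above states that exactly one of `auto`, `max_full_evals`, or `max_metric_calls` may be provided. A minimal stand-alone sketch of that mutual-exclusion check, useful for understanding the rule (the function name and return shape here are illustrative, not GEPA's actual internals):

```python
def resolve_budget(auto=None, max_full_evals=None, max_metric_calls=None):
    """Return the single budget option that was set, or raise ValueError."""
    # Collect only the options the caller actually provided.
    provided = {
        name: value
        for name, value in {
            "auto": auto,
            "max_full_evals": max_full_evals,
            "max_metric_calls": max_metric_calls,
        }.items()
        if value is not None
    }
    # Exactly one budget option must be set.
    if len(provided) != 1:
        raise ValueError(
            "Exactly one of auto, max_full_evals, max_metric_calls must be "
            f"provided; got {sorted(provided) or 'none'}"
        )
    # The auto budget only accepts the three documented presets.
    if auto is not None and auto not in ("light", "medium", "heavy"):
        raise ValueError("auto must be 'light', 'medium', or 'heavy'")
    return next(iter(provided.items()))
```

For example, `resolve_budget(auto="light")` returns `("auto", "light")`, while calling it with no options, or with both `auto` and `max_full_evals`, raises `ValueError`.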

0 commit comments