
Conversation

@klopsahlong (Collaborator) commented Sep 4, 2025:

This PR makes the following updates:

  • Support for teacher and prompt models. The teacher model is used to generate 1 of the N trajectories for each example, so that rule generation still targets the task model's failure modes.
  • Don't append_demo if the demo score is at or below the 10th percentile. This helps us avoid adding poor demos.
  • Support for metric metadata. Allow additional metadata to be passed back in a dspy.Prediction object, alongside the score. One downside is that users must know the score field has to be named 'score'. To address this for now, we've added an error message telling users to add 'score' to their dspy.Prediction object.
  • Fix for Optional fields. Fixes a bug in the parse_value function that parsed strings like "152" as numbers when the annotation is Optional[str].

@klopsahlong klopsahlong changed the title simba updates + handling optional fields SIMBA Improvements Sep 4, 2025

def parse_value(value, annotation):
    annotation = _strip_optional(annotation)
klopsahlong (Collaborator, Author):

This fixes the following failure case:

Previously, parsing failed for fields annotated Union[str, None] with values of type str that could also be parsed as ints, e.g. "9812750".

The problem is in this sequence:
1. value = "9812750" (string)
2. annotation = typing.Optional[str]
3. candidate = json_repair.loads("9812750") → 9812750 (parsed as an integer, not a str)
4. TypeAdapter(typing.Optional[str]).validate_python(9812750) → fails with pydantic.ValidationError, since the value is neither a str nor None
5. The exception handler except pydantic.ValidationError as e: is triggered
6. We then hit issubclass(annotation, Type), which throws "issubclass() arg 1 must be a class", because typing.Optional[str] is a type annotation (a Union), not a class

The fix first strips the Optional annotation to get the expected non-None type and parses the value against that. Pydantic then handles the coercion correctly as str → str, rather than int → int, when the non-None annotation type is str.
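The fix described above can be sketched as follows. This is a simplified stand-in, not the actual implementation in dspy/adapters/utils.py: the strip_optional helper and the fallback branches are illustrative, and the pydantic/json_repair machinery is omitted.

```python
from typing import Optional, Union, get_args, get_origin

NoneType = type(None)

def strip_optional(ann):
    """If ann is Optional[T] (i.e. Union[T, None]), return T; otherwise return ann."""
    if get_origin(ann) is Union and NoneType in get_args(ann):
        return next(a for a in get_args(ann) if a is not NoneType)
    return ann

def parse_value(value, annotation):
    # Simplified stand-in for dspy.adapters.utils.parse_value: strip the
    # Optional wrapper first, so "9812750" is validated against str (and
    # stays a string) instead of being coerced to an int.
    annotation = strip_optional(annotation)
    if annotation is str:
        return str(value)
    return annotation(value)

print(parse_value("9812750", Optional[str]))  # prints 9812750, still a str
```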

Collaborator:

I think we just need to change the condition from issubclass(annotation, Type) to inspect.isclass(annotation) and issubclass(annotation, Type)
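For reference, the suggested guard works because inspect.isclass short-circuits before issubclass ever sees a non-class annotation. A toy check (str stands in for the Type class mentioned above; this is not the actual patch):

```python
import inspect
from typing import Optional

ann = Optional[str]  # a typing construct (a Union), not a class
# issubclass(ann, str) on its own would raise:
#   TypeError: issubclass() arg 1 must be a class
safe = inspect.isclass(ann) and issubclass(ann, str)
print(safe)  # prints False, without raising
```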

Collaborator:

The current approach doesn't handle str | None IIUC. Let me file a PR to fix this issue so that we can keep this PR focused on SIMBA.

Collaborator:

#8774, which handles the parse issue

@@ -31,6 +31,8 @@ def __init__(
num_candidates: int = 6,
max_steps: int = 8,
max_demos: int = 4,
prompt_model: Optional[Any] = None,
teacher_settings: Optional[Dict] = None,
klopsahlong (Collaborator, Author):

Adding support for prompt / teacher models



start_rollout_idx, models = 0, []
# If we have a teacher model, use this as the first model
if teacher_settings:
klopsahlong (Collaborator, Author):

This has been updated to add support for teacher model (used for 1 of the N trajectories)
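A rough sketch of the model-list construction this refers to. The function name and the dict layout are illustrative, not the actual SIMBA code; the point is that the teacher LM takes the first rollout slot and the task model fills the rest.

```python
def build_rollout_models(task_lm, teacher_lm, rollout_ids):
    # If a teacher LM is configured, it produces exactly 1 of the N
    # trajectories; the task model produces the remaining N - 1, so rule
    # generation can still target the task model's failure modes.
    models, start_rollout_idx = [], 0
    if teacher_lm is not None:
        models.append({"lm": teacher_lm, "rollout_id": rollout_ids[0]})
        start_rollout_idx = 1
    for rid in rollout_ids[start_rollout_idx:]:
        models.append({"lm": task_lm, "rollout_id": rid})
    return models

models = build_rollout_models("task_lm", "teacher_lm", [0, 1, 2, 3])
print([m["lm"] for m in models])  # ['teacher_lm', 'task_lm', 'task_lm', 'task_lm']
```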

elif isinstance(output, dspy.Prediction):
    if not hasattr(output, 'score'):
        raise ValueError("dspy.Prediction must contain a 'score' attribute")
    score = output.score
klopsahlong (Collaborator, Author):

Updated to handle additional metric metadata alongside the score. To do this, we check whether the output from the metric is a float or int (in which case we use it directly as the score) or a dspy.Prediction object, which contains a score plus potentially additional metadata.
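The branching described above could look roughly like this. Prediction here is a minimal stand-in for dspy.Prediction, and extract_score_and_metadata is a hypothetical helper, not the actual SIMBA code:

```python
class Prediction:
    """Minimal stand-in for dspy.Prediction (fields live in _store)."""
    def __init__(self, **fields):
        self._store = dict(fields)
    def __getattr__(self, name):
        try:
            return self._store[name]
        except KeyError:
            raise AttributeError(name)

def extract_score_and_metadata(output):
    # Plain numbers are used directly as the score; a Prediction must
    # carry a 'score' field, and any other fields become metadata.
    if isinstance(output, (int, float)):
        return float(output), {}
    if isinstance(output, Prediction):
        if "score" not in output._store:
            raise ValueError("dspy.Prediction must contain a 'score' attribute")
        metadata = {k: v for k, v in output._store.items() if k != "score"}
        return float(output.score), metadata
    raise TypeError(f"Unsupported metric output type: {type(output)!r}")

score, meta = extract_score_and_metadata(Prediction(score=0.8, feedback="missed one entity"))
print(score, meta)  # 0.8 {'feedback': 'missed one entity'}
```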

@TomeHirata (Collaborator) commented Sep 5, 2025:

We can also use float(output), which might be more intuitive?

name2demo = {}

if good["score"] <= batch_10p_score:
klopsahlong (Collaborator, Author):

Double checking that the demo we're appending is not below the 10th percentile of scores
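For illustration, the 10th/90th percentile cutoffs used in these checks could be computed along these lines. This is a hypothetical helper using the stdlib; SIMBA's actual percentile computation may differ.

```python
import statistics

def batch_percentile_cutoffs(scores):
    # quantiles(n=10) returns the 9 cut points at the 10th..90th percentiles.
    qs = statistics.quantiles(scores, n=10, method="inclusive")
    return qs[0], qs[-1]  # (10th percentile, 90th percentile)

scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
batch_10p_score, batch_90p_score = batch_percentile_cutoffs(scores)
# A demo with good["score"] <= batch_10p_score is skipped; here
# batch_10p_score ≈ 0.19 and batch_90p_score ≈ 0.91.
```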

@@ -117,12 +147,17 @@ def append_a_rule(bucket, system, **kwargs):
"worse_program_outputs": dict(bad["prediction"] or {}),
"worse_reward_value": bad["score"],
"better_reward_value": good["score"],
"worse_reward_info": bad["output_metadata"],
"better_reward_info": good["output_metadata"],
"module_names": module_names,
klopsahlong (Collaborator, Author):

Adding the metric metadata (e.g. feedback from a judge) to help generate a better set of rules.

@klopsahlong klopsahlong marked this pull request as ready for review September 4, 2025 15:27
@TomeHirata TomeHirata requested a review from Copilot September 5, 2025 08:02
@Copilot (Contributor) left a comment:

Pull Request Overview

This PR introduces several key improvements to the SIMBA (Sampling-based Iterative Multi-round Bootstrapping Algorithm) system for optimizing DSPy programs. The changes focus on enhancing model flexibility, improving demo quality control, and expanding metric capabilities.

  • Support for teacher and prompt models to generate more targeted trajectories
  • Quality filtering to prevent appending poor-performing demos below the 10th percentile
  • Enhanced metric system supporting metadata alongside scores

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Files reviewed:
  • dspy/teleprompt/simba_utils.py: Core logic updates for teacher model support, demo filtering, and metric metadata handling
  • dspy/teleprompt/simba.py: Constructor updates to accept new prompt_model and teacher_settings parameters
  • dspy/adapters/utils.py: Bug fix for Optional field parsing to handle string values correctly


Comment on lines +82 to +84
if good["score"] <= batch_10p_score:
logger.info(f"Skipping appending a demo as good score {good['score']} is at or below the 10th percentile.")
return False
Copilot AI commented Sep 5, 2025:

[nitpick] The condition check and logic for skipping demo appending is duplicated between append_a_demo_ and append_a_rule functions. Consider extracting this into a shared helper function to reduce code duplication.


Comment on lines +25 to +26
teacher_lm.kwargs["rollout_id"] = rollout_ids[start_rollout_idx]
models.append(teacher_lm)
Copilot AI commented Sep 5, 2025:

[nitpick] Direct mutation of teacher_lm.kwargs could cause side effects if the teacher model is reused elsewhere. Consider using teacher_lm.copy() and setting the rollout_id on the copy instead.

Suggested change
teacher_lm.kwargs["rollout_id"] = rollout_ids[start_rollout_idx]
models.append(teacher_lm)
models.append(teacher_lm.copy(rollout_id=rollout_ids[start_rollout_idx]))


Comment on lines +137 to +142
def _strip_optional(ann):
"""If ann is Union[..., NoneType] return the non‑None part, else ann."""
if get_origin(ann) is Union and NoneType in get_args(ann):
# keep the first non‑None member (there will be only one in Optional[T])
return next(a for a in get_args(ann) if a is not NoneType)
return ann
Copilot AI commented Sep 5, 2025:

The docstring uses an en-dash character (‑) instead of a regular hyphen (-) in 'non‑None'. This should be corrected for consistency and readability.


@@ -31,6 +31,8 @@ def __init__(
num_candidates: int = 6,
max_steps: int = 8,
max_demos: int = 4,
prompt_model: Any | None = None,
Collaborator:

Suggested change
prompt_model: Any | None = None,
prompt_model: dspy.LM | None = None,

Collaborator:

btw, can we add type annotations for other arguments too?

@@ -62,6 +64,8 @@ def __init__(
self.num_candidates = num_candidates
self.max_steps = max_steps
self.max_demos = max_demos
self.prompt_model = prompt_model if prompt_model else dspy.settings.lm
Collaborator:

Suggested change
self.prompt_model = prompt_model if prompt_model else dspy.settings.lm
self.prompt_model = prompt_model or dspy.settings.lm

score = output.score
# Just extract fields from _store, excluding 'score'
output_metadata = {
    k: v for k, v in output._store.items() if k != "score"
Collaborator:

can't we use output.items()?

@@ -77,14 +106,15 @@ def append_a_demo_(bucket, system, **kwargs):
def append_a_rule(bucket, system, **kwargs):
predictor2name = kwargs["predictor2name"]
batch_10p_score, batch_90p_score = kwargs["batch_10p_score"], kwargs["batch_90p_score"]
prompt_model = kwargs["prompt_model"] or dspy.settings.lm
Collaborator:

q: is it possible that prompt_model is not passed? Maybe kwargs.get("prompt_model") is safer

if good["score"] < batch_10p_score or bad["score"] > batch_90p_score:
logger.info(f"Skipping rule generation as good score {good['score']} is below the 10th percentile "
f"*or* bad score {bad['score']} is above the 90th percentile.")
if good["score"] <= batch_10p_score or bad["score"] >= batch_90p_score:
Collaborator:

Suggested change
if good["score"] <= batch_10p_score or bad["score"] >= batch_90p_score:
if good <= batch_10p_score or bad >= batch_90p_score:

@@ -1,3 +1,4 @@

Collaborator:

nit: blank line

name2demo = {}

if good["score"] <= batch_10p_score:
Collaborator:

ditto

@chenmoneygithub (Collaborator) left a comment:

Most of my comments overlap with Tome's. Basically the function looks good; we can merge after addressing the comments.

@@ -26,33 +38,51 @@ def wrapped_program(example):
try:
prediction = program(**example.inputs())
except Exception as e:
print(e)
logger.info(e)
Collaborator:

should this be logger.warning or logger.error?
