[WIP] React use dspy.ToolCalls
#8472
Conversation
@chenmoneygithub Not 100% sure, but the regression you see may be due to how the message history and tool call history/results are passed back to the model provider. It looks like the trajectory is currently still formatted as a string, but you may get better results if you build a full message stack and pass it back to the model provider with the right content types for tool calls and tool call results. Have you tried this?
@ryanh-ai Thanks for the suggestion! I assume you mean formatting the trajectory as multiturn messages instead of one big JSON? I have tried that, but it doesn't produce meaningful improvements. I have kind of spotted the problem: the LM does a worse job of understanding nested type requirements, like …
I mean passing tool results back as tool-result content blocks as part of the user turn, and the same for the assistant turn's text content, etc. Perhaps that is what your experiment was, but I wanted to be clear. Here is the page in LiteLLM: https://docs.litellm.ai/docs/completion/function_call I know some model providers use the structured message schema and content types in the way they format what the LLM sees.
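For concreteness, here is a minimal sketch of the message shape being described, following the OpenAI-style schema that LiteLLM documents at the link above; the tool name, call ID, and model here are made up for illustration:

```python
import json

import litellm

messages = [
    {"role": "user", "content": "What is the weather in Tokyo?"},
    # Assistant turn: the model's text plus its structured tool calls.
    {
        "role": "assistant",
        "content": "Let me look that up.",
        "tool_calls": [
            {
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": json.dumps({"city": "Tokyo"}),
                },
            }
        ],
    },
    # Tool result passed back as its own typed message, rather than as a
    # string pasted into the next user prompt.
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": json.dumps({"temp_c": 21}),
    },
]

response = litellm.completion(model="gpt-4o-mini", messages=messages)
```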
@ryanh-ai Thanks! So you mean native function calling. We have noticed that native function calling yields very poor quality. If you are interested in helping us improve DSPy, I would like to see your experimental results on how it works. Thank you!
Sounds good! I have not done it with DSPy but was trying to, hence coming across this thread. I have done it outside DSPy with a couple of providers. Let me see if I can implement it in DSPy when I find some time for a test.
Refactor dspy.ReAct to use dspy.ToolCalls for consistency.
We are keeping the behavior that when JSONAdapter is used with ToolCalls, we direct the request to ChatAdapter for good quality, because we have consistently noticed that models do poorly when combining structured output with `dict[str, Any]`; a rough sketch of this fallback is below.
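This is not the actual implementation, just a minimal sketch of that fallback under the assumption that it lives at the adapter boundary; `FallbackJSONAdapter` and its internals are illustrative names, not dspy code:

```python
import dspy


class FallbackJSONAdapter(dspy.JSONAdapter):
    """Illustrative sketch only: if the signature asks for a ToolCalls
    output, skip structured JSON output and use plain chat formatting."""

    def __call__(self, lm, lm_kwargs, signature, demos, inputs):
        needs_tool_calls = any(
            field.annotation is dspy.ToolCalls
            for field in signature.output_fields.values()
        )
        if needs_tool_calls:
            # Structured output + dict[str, Any] has been unreliable,
            # so route the call through ChatAdapter instead.
            return dspy.ChatAdapter()(lm, lm_kwargs, signature, demos, inputs)
        return super().__call__(lm, lm_kwargs, signature, demos, inputs)
```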
In our experiments, native tool calling can mitigate this issue, but it is not yet producing promising results, and we are still experimenting there.

Did a quick benchmark on the Hover dataset for this PR, and we see a pretty clear quality regression.
My theory is that all these LMs do a worse job of understanding deeply nested output types than flat ones. In detail, `dspy.ToolCalls` is a nested type holding a list of `dspy.ToolCalls.ToolCall` entries, each of which has two fields: a string for the tool name and a dict for the tool args. By comparison, the current ReAct uses `next_tool_name`, a single string, and `next_tool_args`, a dict. So this PR introduces too much nesting for the LM to maintain decent quality; the sketch below contrasts the two output shapes.
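Here is a simplified, illustrative pair of signatures for that comparison (the real ReAct also carries a thought field and constrains the tool name to the available tools):

```python
from typing import Any

import dspy


# This PR: one nested output field. dspy.ToolCalls wraps a list of
# ToolCall entries, each with `name: str` and `args: dict[str, Any]`,
# so the LM must produce two levels of nesting.
class ReActNested(dspy.Signature):
    trajectory: str = dspy.InputField()
    tool_calls: dspy.ToolCalls = dspy.OutputField()


# Current ReAct: flat output fields, each a simple string or dict.
class ReActFlat(dspy.Signature):
    trajectory: str = dspy.InputField()
    next_tool_name: str = dspy.OutputField()
    next_tool_args: dict[str, Any] = dspy.OutputField()
```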
All benchmarks were run on 50 examples from the Hover dataset, with ChatAdapter. Benchmark script:
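The script itself is not included in this excerpt; below is a hypothetical reconstruction of what such a run might look like, where the model choice, the `load_hover` loader, and the metric are all stand-ins rather than the actual benchmark code:

```python
import dspy

# Illustrative model choice; benchmarks above were run with ChatAdapter.
dspy.configure(
    lm=dspy.LM("openai/gpt-4o-mini"),
    adapter=dspy.ChatAdapter(),
)

colbert = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")


def search(query: str) -> list[str]:
    """Toy retrieval tool over a public ColBERTv2 wiki index."""
    return [r["text"] for r in colbert(query, k=3)]


react = dspy.ReAct("claim -> titles: list[str]", tools=[search])


def title_recall(example, pred, trace=None):
    """Fraction of gold supporting titles recovered by the program."""
    gold = set(example.titles)
    return len(gold & set(pred.titles)) / max(1, len(gold))


devset = load_hover(n=50)  # hypothetical loader for 50 HoVer examples

dspy.Evaluate(
    devset=devset,
    metric=title_recall,
    num_threads=8,
    display_progress=True,
)(react)
```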