
Conversation

@createthis commented Aug 23, 2025:

This PR enables DeepSeek V3.1 thinking mode as the default. Disable with --reasoning-budget 0.

Addresses #15496

openhands-agent and others added 14 commits August 22, 2025 13:31
- Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value
- Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation)
- Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format:
  - Extracts reasoning content before '</think>' tag into reasoning_content
  - Extracts regular content after '</think>' tag into content
  - No opening '<think>' tag in V3.1 format
- Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking'
- Added V3.1 case to parsing switch statement

This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag.
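
As a rough illustration of the V3.1 format, here is a minimal sketch of the split, assuming the whole completion is buffered (a hypothetical helper, not the PR's actual parser, which works incrementally and also handles tool calls):

```cpp
#include <string>
#include <utility>

// V3.1 emits "<reasoning></think><content>" with no opening <think> tag:
// everything before the first </think> is reasoning, everything after it
// is regular content.
static std::pair<std::string, std::string> split_v3_1_thinking(const std::string & output) {
    static const std::string end_tag = "</think>";
    const size_t pos = output.find(end_tag);
    if (pos == std::string::npos) {
        // No end tag (yet): with thinking forced open, treat it all as reasoning.
        return { output, "" };
    }
    return { output.substr(0, pos), output.substr(pos + end_tag.size()) };
}
```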
@github-actions bot added the "testing" label Aug 23, 2025
data.thinking_forced_open = true;
}
return data;
}
@CISC (Collaborator):
I realize you probably didn't want to tackle tool calls in this PR, but I'm pretty sure this will break when there are tool calls if you don't handle them.

@createthis (Author) commented Aug 24, 2025:

@CISC I had ChatGPT 5 generate a python script for a simple tool call test: https://github.com/createthis/llama_cpp_deepseek_v3_1_think_tags/blob/main/tool_test.py

This works for me with gpt-oss-120b and master. We can use this as our control test.

I recorded the successful conversations with tool_test.py and gpt-oss-120b twice, once without --verbose set in llama.cpp and once with. The files for the recordings are in the same repo:

  • full_traffic_master_tool_test_gpt_oss_120b.mitm
  • full_traffic_master_tool_test_verbose_gpt_oss_120b.mitm

When the test works, the output looks like this:

(base) jesse@Jesses-MacBook-Pro llama.cpp % python ../tool_test.py

ASSISTANT:
 The current time in Tokyo is **2025‑08‑24 22:41 JST**.
(base) jesse@Jesses-MacBook-Pro llama.cpp %

@createthis createthis requested a review from ngxson as a code owner August 25, 2025 01:50
@createthis createthis marked this pull request as draft August 25, 2025 03:34
@createthis createthis marked this pull request as ready for review August 25, 2025 05:42
Tool calls work in both thinking and non-thinking modes. However, I've introduced a regression in streaming mode where reasoning content initially comes through as regular content. I need to think about how to handle this long term.
@createthis createthis requested a review from CISC August 29, 2025 18:22
@CISC (Collaborator) commented Aug 29, 2025:

Also, address this CI failure: https://github.com/ggml-org/llama.cpp/actions/runs/17274346584/job/49214116367

@createthis (Author):

@CISC you got it 9056707

common/chat.cpp Outdated
* Takes a prefix regex that must have 1 group to capture the function name, a closing suffix, and expects json parameters in between.
* Aggregates the prefix, suffix and in-between text into the content.
*/
static void parse_json_tool_calls_deepseek_v3_1(
@CISC (Collaborator):

I'm not entirely sure why you made this function instead of changing parse_json_tool_calls?

@createthis (Author):

I didn't want to accidentally break anything by altering the current behavior. I don't think there are a lot of unit tests for this function. I'm happy to just replace it though if you want.

@createthis (Author):

Now I know if I replace it, it breaks Functionary v3.2 tests at a minimum.

@createthis (Author):

I merged them using an optional update_cursor argument.

@createthis (Author):

@CISC would you please re-review? I just finished running the Aider Polyglot Benchmark with Q2_K_XL in thinking mode. This branch seems to be performing pretty well.

[Screenshot 2025-08-31 8:54 AM: Aider Polyglot Benchmark results]

@sgoll:

@createthis When I tested this branch a couple of days ago with llama-server and the integrated web frontend, I wasn't able to get any response out of the model: it would apparently "think", but no response would be shown at all (only the busy indicator). Is this something that was to be expected and would now be fixed before the merge?

@createthis (Author):

@sgoll I don't use the built-in webui. I use Open WebUI, which works fine. I just tested the built-in webui and I'm seeing the same behavior. Thanks for the report. I'll try to figure out why that's happening.

@createthis (Author):

@sgoll It works in the builtin webui with --reasoning-budget 0. That's something at least. Still investigating.

@createthis (Author):

@sgoll The builtin webui uses reasoning_format: none, but I wasn't supporting it. This should be fixed by 7795594.

@createthis (Author):

[Screenshot 2025-08-31 10:44 PM: built-in webui rendering with reasoning_format: none]

Note that reasoning_format: none is pretty basic. Open WebUI formats it better because it uses reasoning_format: auto:

[Screenshot 2025-08-31 10:45 PM: Open WebUI rendering with reasoning_format: auto]
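
Concretely, for a hypothetical raw completion `I'm thinking</think>Hello`, the two settings differ in what the client receives:

- reasoning_format: none -> content is "I'm thinking</think>Hello" and the client must handle the tag itself
- reasoning_format: auto -> reasoning_content is "I'm thinking" and content is "Hello"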

@createthis createthis requested a review from CISC August 30, 2025 04:57
@@ -678,7 +679,8 @@ static void parse_json_tool_calls(
const common_regex & close_regex,
const std::optional<common_regex> & block_close,
bool allow_raw_python = false,
const std::function<std::string(const common_chat_msg_parser::find_regex_result & fres)> & get_function_name = nullptr) {
const std::function<std::string(const common_chat_msg_parser::find_regex_result & fres)> & get_function_name = nullptr,
bool update_cursor = false) {
@CISC (Collaborator):

I'm struggling to understand why the update_cursor changes are necessary, why exactly do you need to do this?

I see that the tests fail without it, but the tests are also altered compared to all the other formats...

@createthis (Author) commented Aug 31, 2025:

> I see that the tests fail without it

@CISC Correct. The regexes simply do not work without this change. It's good to see my unit tests are doing their job.

@CISC (Collaborator):

But why do they require this change?

@CISC (Collaborator) commented Aug 31, 2025:

What I'm getting at is that this looks like shaping the result to fit the (possibly incorrect) test.

@createthis (Author):

@CISC What's incorrect about the test?

@CISC (Collaborator):

Well, I understand that update_cursor allows you to process multiple tool calls (which BTW, should be applicable to several models, including R1), however the first few tests are single tool calls, and they fail when update_cursor is false because the end tag is left unconsumed, thus failing to parse as a tool call.
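
To make that concrete, here is a rough sketch of the cursor behavior under discussion (simplified, illustrative names; not the actual common_chat_msg_parser API):

```cpp
#include <string>

// Simplified model of the parser position ("cursor") over the raw output.
struct cursor {
    const std::string * input;
    size_t pos = 0;
};

// After matching one tool call, the parser must advance past the close tag.
// If it doesn't (update_cursor == false), the unconsumed tag is still
// sitting at the cursor, and the remainder fails to parse as a tool call.
static bool consume_tool_call(cursor & c, bool update_cursor) {
    const std::string close_tag = "<|tool▁call▁end|>";
    const size_t end = c.input->find(close_tag, c.pos);
    if (end == std::string::npos) {
        return false;
    }
    // ... extract the function name and JSON arguments from [c.pos, end) ...
    if (update_cursor) {
        c.pos = end + close_tag.size(); // consume the close tag too
    }
    return true;
}
```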

@createthis (Author):

@CISC I'm not super smart. Dumb it down for me. Are you asking for a change? Is a specific line incorrect? Give me something to go on here.

@CISC (Collaborator):

Sorry, I didn't mean to be obtuse. My worry is that something is not quite right when the tool call fails to parse if I force update_cursor to false; intuitively this should work for single tool calls, but it doesn't.

createthis and others added 9 commits August 31, 2025 16:12
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
Comment on lines +1368 to +1371
tool_rules.push_back(builder.add_rule(name + "-call",
"( \"<|tool▁call▁begin|>\" )? \"function<|tool▁sep|>" + name + "\\n"
"```json\\n\" " + builder.add_schema(name + "-args", parameters) + " "
"\"```<|tool▁call▁end|>\""));
@CISC (Collaborator):

This does not match function_regex.

@createthis (Author):

This code was cargo-culted from the R1 code. I see the tests fail with it removed. I'll investigate. I have to cook some swordfish for my daughter first.

@createthis (Author):

I logged this out:

std::string rule_name = name + "-call";
std::string rule_pattern = "( \"<|tool▁call▁begin|>\" )? \"function<|tool▁sep|>" + name + "\\n"
                           "```json\\n\" " + builder.add_schema(name + "-args", parameters) + " "
                           "\"```<|tool▁call▁end|>\"";
LOG_DBG("%s: add_rule: \nrule_name: %s\nrule_pattern: %s\n", __func__, rule_name.c_str(), rule_pattern.c_str());

This logs

operator(): add_rule:
rule_name: special_function-call
rule_pattern: ( "<|tool▁call▁begin|>" )? "function<|tool▁sep|>special_function\n```json\n" special-function-args "```<|tool▁call▁end|>"

function_regex

function_regex is (?:<|tool▁call▁begin|>)?function<|tool▁sep|>([^\n]+)\n```json\n

testing function_regex

Here's a link to a regex tester that implements this pattern: https://regex101.com/r/WUCUga/1
And a screenshot of the tester for good measure:
Screenshot 2025-08-31 at 7 53 33 PM

???

Admittedly, I don't fully understand llama.cpp's grammar subsystem, but this looks like it works to me. Also, if I remove this code the tests fail. Help me out. What are we talking about here?

@createthis (Author):

@CISC ^

@CISC (Collaborator) commented Sep 1, 2025:

It fails on this test (llama.cpp/tests/test-chat.cpp, lines 1796 to 1798 at 7795594):

simple_assist_msg("", "", "get_time", "{\"city\":\"Tokyo\"}"),
common_chat_parse(
"<|tool▁calls▁begin|><|tool▁call▁begin|>get_time<|tool▁sep|>{\"city\": \"Tokyo\"}<|tool▁call▁end|><|tool▁calls▁end|>",

Edit: To be clear, the regex works, but the rule pattern does not match the regex.
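
Juxtaposing the rule pattern logged earlier with this failing test's input shows one reading of the mismatch (shapes copied from the log and the test above):

grammar rule emits: ( "<|tool▁call▁begin|>" )? "function<|tool▁sep|>get_time\n```json\n" {args} "```<|tool▁call▁end|>"
test input:         <|tool▁call▁begin|>get_time<|tool▁sep|>{"city": "Tokyo"}<|tool▁call▁end|>

The rule (like the regex) expects the R1-style function<|tool▁sep|>name + fenced-JSON shape, while the test feeds the parser the bare name<|tool▁sep|>args shape.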

@createthis (Author) commented Sep 1, 2025:

@CISC No, it fails on the first test_templates call, which was cargo-culted from the R1 tests:

test_templates(tmpls.get(), end_tokens, message_assist, tools, "Hello, world!\nWhat's up?", /* expect_grammar_triggered= */ false);

changes to repro

diff --git a/common/chat.cpp b/common/chat.cpp
index 1236e766..113eec90 100644
--- a/common/chat.cpp
+++ b/common/chat.cpp
@@ -1365,10 +1365,6 @@ static common_chat_params common_chat_params_init_deepseek_v3_1(const common_cha
                 std::string name = function.at("name");
                 auto parameters = function.at("parameters");
                 builder.resolve_refs(parameters);
-                tool_rules.push_back(builder.add_rule(name + "-call",
-                    "( \"<|tool▁call▁begin|>\" )? \"function<|tool▁sep|>" + name + "\\n"
-                    "```json\\n\" " + builder.add_schema(name + "-args", parameters) + " "
-                    "\"```<|tool▁call▁end|>\""));
             });
             // Distill Qwen 7B & 32B models seem confused re/ syntax of their tool call opening tag,
             // so we accept common variants (then it's all constrained)
diff --git a/tests/test-chat.cpp b/tests/test-chat.cpp
index 4bcbf97a..42d79508 100644
--- a/tests/test-chat.cpp
+++ b/tests/test-chat.cpp
@@ -1755,6 +1755,7 @@ static void test_template_output_parsers() {
                 /* is_partial= */ false,
                 {COMMON_CHAT_FORMAT_SEED_OSS}));
     }
+    common_log_set_verbosity_thold(LOG_DEFAULT_DEBUG);
     {
         auto tmpls = read_templates("models/templates/deepseek-ai-DeepSeek-V3.1.jinja");
         std::vector<std::string>   end_tokens{ "<|end▁of▁sentence|>" };
@@ -1765,8 +1766,11 @@ static void test_template_output_parsers() {
             assert_equals(true, params.thinking_forced_open);
         }
 
+        LOG_DBG("%s: here0\n", __func__);
         test_templates(tmpls.get(), end_tokens, message_assist, tools, "</think>Hello, world!\nWhat's up?", /* expect_grammar_triggered= */ false);
+        LOG_DBG("%s: here0.1\n", __func__);
         test_templates(tmpls.get(), end_tokens, message_assist_thoughts, tools, "</think>Hello, world!\nWhat's up?", /* expect_grammar_triggered= */ false);
+        LOG_DBG("%s: here0.2\n", __func__);
         assert_msg_equals(
             simple_assist_msg("Hello, world!\nWhat's up?", "I'm\nthinking"),
             common_chat_parse(

result

Note the here0 but no here0.1:

# Reading: models/templates/deepseek-ai-DeepSeek-V3.1.jinja
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
test_template_output_parsers: here0
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
Template supports tool calls but does not natively describe tools. The fallback behaviour used may produce bad results, inspect prompt w/ --verbose & consider overriding the template.
unsupported grammar, left recursion detected for nonterminal at index 3
/home/jesse/llama.cpp/build-ci-debug/bin/libggml-base.so(+0x51511)[0x7a96c41c2511]

analysis

I don't know why tools is using python:

std::vector<common_chat_tool> tools { special_function_tool, python_tool };

I'm just trying to get V3.1 working, not fix the broken R1 code.
