
Commit 2e934bb

aarnphm and heheda12345 authored
[Docs] gpt-oss with function calling (#40)
Co-authored-by: Chen Zhang <[email protected]>
1 parent 15eeb14 commit 2e934bb

File tree

1 file changed (+16, -8 lines)


OpenAI/GPT-OSS.md

Lines changed: 16 additions & 8 deletions
@@ -142,7 +142,7 @@ vllm serve openai/gpt-oss-120b --compilation-config '{"compile_sizes": [1, 2, 4,
Once the `vllm serve` runs and `INFO: Application startup complete` has been displayed, you can send requests to the following endpoints using raw HTTP requests or the OpenAI SDK:

* `/v1/responses` endpoint can perform tool use (browsing, python, mcp) in between chain-of-thought and deliver a final response. This endpoint leverages the `openai-harmony` library for input rendering and output parsing. Stateful operation and the full streaming API are work in progress. The Responses API is recommended by OpenAI as the way to interact with this model.
- * `/v1/chat/completions` endpoint offers a familiar interface to this model. No tool will be invoked, but reasoning and final text output will be returned in a structured form. Function calling is work in progress. You can also set `include_reasoning: false` in the request parameters to exclude the CoT from the output.
+ * `/v1/chat/completions` endpoint offers a familiar interface to this model. No tool will be invoked, but reasoning and final text output will be returned in a structured form. You can also set `include_reasoning: false` in the request parameters to exclude the CoT from the output.
* `/v1/completions` endpoint provides a simple input-output interface without any sort of template rendering.

All endpoints accept `stream: true` to enable incremental token streaming. Please note that vLLM currently does not cover the full scope of the Responses API; for more detail, see the Limitations section below.
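For illustration, a minimal Chat Completions request exercising the `include_reasoning` parameter described above might look like the sketch below. The host/port and prompt are assumptions rather than part of the upstream docs; it presumes `vllm serve` is listening on its default `localhost:8000`.

```bash
# Sketch only: query /v1/chat/completions on a locally running vLLM server.
# localhost:8000 is vLLM's default bind address; adjust for your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
        "include_reasoning": false,
        "stream": false
      }'
```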
@@ -153,7 +153,7 @@ One premier feature of gpt-oss is the ability to call tools directly, called "bu

* By default, we integrate with the reference library's browser (with `ExaBackend`) and a demo Python interpreter run via a docker container. To use the search backend, you need to get access to [exa.ai](http://exa.ai) and set `EXA_API_KEY=` as an environment variable. For Python, either have docker available, or set `PYTHON_EXECUTION_BACKEND=dangerously_use_uv` to allow model-generated code snippets to be executed directly on the same machine. Please note that `PYTHON_EXECUTION_BACKEND=dangerously_use_uv` needs `gpt-oss>=0.0.5`.

- ```
+ ```bash
uv pip install gpt-oss

vllm serve ... --tool-server demo
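Putting the pieces above together, a demo launch with both built-in tool backends enabled might look like this sketch. The API key value is a placeholder, and `openai/gpt-oss-120b` is used only as an example model.

```bash
# Sketch only: enable the demo browser + Python tool backends described above.
# EXA_API_KEY is a placeholder; dangerously_use_uv executes model-generated
# code directly on this machine, so prefer the docker backend when available.
uv pip install "gpt-oss>=0.0.5"

EXA_API_KEY=<your-exa-api-key> \
PYTHON_EXECUTION_BACKEND=dangerously_use_uv \
vllm serve openai/gpt-oss-120b --tool-server demo
```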
@@ -162,22 +162,30 @@ vllm serve ... --tool-server demo
* Please note that the default options are simply for demo purposes. For production usage, vLLM itself can act as an MCP client to multiple services.
Here is an [example tool server](https://github.com/openai/gpt-oss/tree/main/gpt-oss-mcp-server) that vLLM can work with; it wraps the demo tools:

- ```
+ ```bash
mcp run -t sse browser_server.py:mcp
mcp run -t sse python_server.py:mcp

vllm serve ... --tool-server ip-1:port-1,ip-2:port-2
```

- The URLs are expected to be MCP SSE servers that implement `instructions` in server info and well-documented tools. The tools will be injected into the system prompt for the model to enable them.
+ The URLs are expected to be MCP SSE servers that implement `instructions` in server info and well-documented tools. The tools will be injected into the system prompt for the model to enable them.
+
+ ### Function calling
+
+ vLLM also supports calling user-defined functions. Make sure to serve your gpt-oss models with the following arguments:
+
+ ```bash
+ vllm serve ... --tool-call-parser openai --enable-auto-tool-choice
+ ```
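As a sketch of how a client might exercise the function-calling support enabled above, the request below defines one hypothetical `get_weather` function using the standard OpenAI tools schema. The function name, its parameters, and the host/port are illustrative assumptions, not part of the original docs.

```bash
# Sketch only: requires the server started with
#   --tool-call-parser openai --enable-auto-tool-choice
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "openai/gpt-oss-120b",
        "messages": [{"role": "user", "content": "What is the weather in Berlin right now?"}],
        "tools": [{
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
              "type": "object",
              "properties": {"city": {"type": "string"}},
              "required": ["city"]
            }
          }
        }],
        "tool_choice": "auto"
      }'
```

If the model decides to use the function, the response should carry a `tool_calls` entry with the generated arguments instead of final text, which the client can execute and feed back in a follow-up message.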

## Accuracy Evaluation Panels

OpenAI recommends using the gpt-oss reference library to perform evaluation.

First, deploy the model with vLLM:

- ```
+ ```bash
# Example deployment on 8xH100
vllm serve openai/gpt-oss-120b \
    --tensor_parallel_size 8 \
@@ -190,7 +198,7 @@ vllm serve openai/gpt-oss-120b \

Then, run the evaluation with gpt-oss. The following command will run all 3 reasoning effort levels.

- ```
+ ```bash
mkdir -p /tmp/gpqa_openai
OPENAI_API_KEY=empty python -m gpt_oss.evals --model openai/gpt-oss-120b --eval gpqa --n-threads 128
```
@@ -265,8 +273,8 @@ Meaning:
| Response API ||||||
| Response API with Background Mode ||||||
| Response API with Streaming ||||||
- | Chat Completion API ||||| |
- | Chat Completion API with Streaming ||||| |
+ | Chat Completion API ||||| |
+ | Chat Completion API with Streaming ||||| |

If you want to use offline inference, you can treat vLLM as a token-in-token-out service and pass in tokens that are already formatted with Harmony.
