Commit 92956a1

feat(api_server): Add OpenAI-compatible API server for MaxText models
This commit introduces a fully-featured, OpenAI-compatible RESTful API server for serving MaxText models. The server is built with FastAPI, supports multi-host inference on TPUs, and is designed for both interactive use and large-scale benchmarking.

Key features and additions:

1. **Core Server Implementation:**
   - Adds `maxtext_server.py`, a FastAPI application that serves `/v1/completions` and `/v1/chat/completions` endpoints.
   - Implements dynamic request batching to efficiently utilize underlying hardware.
   - Uses `maxtext_generator.py` to encapsulate the MaxText inference engine, handling model loading, tokenization, and the generation loop.
   - Includes Pydantic models in `server_models.py` for robust, OpenAI-compliant request and response validation.
2. **Deployment and Utilities:**
   - Provides `start_server.sh` to simplify launching the server from the project root.
   - Adds `port_forward_xpk.sh`, a utility script to automatically find and connect to a server running on a GKE cluster via `xpk`, supporting custom namespaces.
   - Isolates server-specific dependencies in `benchmarks/api_server/requirements.txt` (`uvicorn`, `fastapi`, `openai-harmony`).
3. **Comprehensive Documentation:**
   - A new `README.md` in the `api_server` directory offers a complete guide covering:
     - Installation and environment setup.
     - Launching the server in both single-pod and multi-pod GKE environments.
     - Detailed examples for interacting with the API using `curl` and the `openai` Python client.
     - Step-by-step instructions for running benchmarks with `lm-evaluation-harness` and `evalchemy` for both log-likelihood and generative tasks.
1 parent cbd599f commit 92956a1

12 files changed (+2134, -0 lines)

benchmarks/api_server/README.md

# MaxText API Server

This directory contains an OpenAI-compatible API server for serving MaxText models, enabling benchmarks with evaluation frameworks like lm-eval-harness and evalchemy. It uses [FastAPI](https://fastapi.tiangolo.com/) as the web framework and can be deployed on a single machine or a multi-host GKE cluster.

## Table of Contents
- [Installation](#installation)
- [Environment Variables](#environment-variables)
- [Launching the Server (Single-Host)](#launching-the-server-single-host)
- [Deploying on a GKE Cluster (Multi-Host)](#deploying-on-a-gke-cluster-multi-host)
- [Interacting with the Server](#interacting-with-the-server)
- [Benchmarking with Evaluation Frameworks](#benchmarking-with-evaluation-frameworks)

## Installation

The server has a few additional dependencies beyond the core MaxText requirements. Install them using the provided `requirements.txt` file:

```bash
pip install -r benchmarks/api_server/requirements.txt
```

## Environment Variables

Before launching the server, you may need to set the following environment variable:

- `HF_TOKEN`: Your Hugging Face access token. This is required if the model's tokenizer is hosted on the Hugging Face Hub and is not public.

```bash
export HF_TOKEN=<your_hugging_face_token>
```

## Launching the Server (Single-Host)

The primary way to launch the API server is by using the `start_server.sh` script. This script ensures that the server is run from the project's root directory, which is necessary for the Python interpreter to find all the required modules.

The script takes the path to a base configuration file (e.g., `MaxText/configs/base.yml`) followed by any number of model-specific configuration overrides.

### Benchmarking Configuration

To use this server for benchmarking with frameworks like `lm-eval-harness` or `evalchemy`, you **must** include the following two arguments in your launch command:

- `tokenizer_type="huggingface"`: Ensures the tokenizer is compatible with the evaluation harness.
- `return_log_prob=True`: Enables the log probability calculations required for many standard evaluation metrics.

### Command Structure

```bash
bash benchmarks/api_server/start_server.sh /path/to/base.yml [arg1=value1] [arg2=value2] ...
```

### Example

Here is an example of how to launch the server with a `qwen3-30b-a3b` model, configured for benchmarking. This example is configured for a TPU v5p-8, which has 4 chips.

```bash
# Make sure you are in the root directory of the maxtext project.

bash benchmarks/api_server/start_server.sh \
  MaxText/configs/base.yml \
  model_name="qwen3-30b-a3b" \
  tokenizer_path="Qwen/Qwen3-30B-A3B-Thinking-2507" \
  load_parameters_path="<path_to_your_checkpoint>" \
  per_device_batch_size=4 \
  ici_tensor_parallelism=4 \
  max_prefill_predict_length=1024 \
  max_target_length=2048 \
  async_checkpointing=false \
  scan_layers=false \
  attention="dot_product" \
  tokenizer_type="huggingface" \
  return_log_prob=True
```

Once the server starts successfully, you will see a confirmation message from Uvicorn:

<img src="./images/single-host-server-startup.png" alt="Single-Host Server Startup" width="894"/>

```
INFO: RANK 0: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

The server is now ready to accept requests on port 8000.

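Before pointing a benchmark at the server, it can be handy to confirm it is answering requests. The snippet below is a minimal sketch using only the Python standard library; it assumes the server from the example above is reachable at `localhost:8000` and returns OpenAI-style completion responses, and the `model` string is purely an identifier (see "Interacting with the Server" below).

```python
# Minimal readiness check: send one small request to /v1/completions.
# Assumes the server launched above is listening on localhost:8000.
import json
import urllib.request

payload = {
    "model": "qwen3-30b-a3b",  # identification only; does not select the served model
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0.7,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req, timeout=300) as resp:
    body = json.load(resp)

# OpenAI-compatible completion responses carry the generated text in choices[0].text.
print(body["choices"][0]["text"])
```
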
## Deploying on a GKE Cluster (Multi-Host)

For large models that require a multi-host TPU setup, you can deploy the server using the [xpk tool](https://github.com/AI-Hypercomputer/xpk). The recommended approach is to create a single submission script to configure and launch the workload.

### 1. Create a Job Submission Script

Create a new bash script (e.g., `launch_gke_server.sh`) to hold your configuration and `xpk` command. This makes launching jobs repeatable and easy to modify.

For your convenience, the script below is also available as a template file at `benchmarks/api_server/launch_gke_server.sh.template`.

Inside this script, you will define the server's startup command and your cluster configuration. Before running the script, fill in the placeholders at the top of the file; placeholders are enclosed in angle brackets (e.g., `<your_gcp_project>`).

```bash
#!/bin/bash
set -e

# ==============================================================================
# 1. User-Configurable Variables
# ==============================================================================

# -- GKE Cluster Configuration --
# (<your_gke_cluster>, <your_gcp_project>, <your_gcp_zone>)
export CLUSTER="<your-gke-cluster>"
export DEVICE_TYPE="v5p-16"
export PROJECT="<your-gcp-project>"
export ZONE="<your-gcp-zone>"

# -- XPK Workload Configuration --
# (<YYYY-MM-DD>, <your_hugging_face_token>)
export RUNNAME="my-server-$(date +%Y-%m-%d-%H-%M-%S)"
export DOCKER_IMAGE="gcr.io/tpu-prod-env-multipod/maxtext_jax_nightly:<YYYY-MM-DD>"
export HF_TOKEN="<your_hugging_face_token>" # Optional: if your tokenizer is private

# -- Model Configuration --
# IMPORTANT: Replace these with your model's details.
# (<your_model_name>, <path_or_name_to_your_tokenizer>, <path_to_your_checkpoint>)
export MODEL_NAME="qwen3-30b-a3b"
export TOKENIZER_PATH="Qwen/Qwen3-30B-A3B-Thinking-2507"
export LOAD_PARAMETERS_PATH="<path_to_your_checkpoint>"
export PER_DEVICE_BATCH_SIZE=4
# Parallelism settings should match the number of chips on your device.
# For a v5p-16 (8 chips), the product of parallelism values should be 8.
export ICI_TENSOR_PARALLELISM=4
export ICI_EXPERT_PARALLELISM=2

# ==============================================================================
# 2. Define the Command to Run on the Cluster
# ==============================================================================
# This command installs dependencies and then starts the server.
CMD="export HF_TOKEN=${HF_TOKEN} && \
pip install --upgrade pip && \
pip install -r benchmarks/api_server/requirements.txt && \
bash benchmarks/api_server/start_server.sh \
MaxText/configs/base.yml \
model_name=\"${MODEL_NAME}\" \
tokenizer_path=\"${TOKENIZER_PATH}\" \
load_parameters_path=\"${LOAD_PARAMETERS_PATH}\" \
per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
ici_expert_parallelism=${ICI_EXPERT_PARALLELISM} \
tokenizer_type=\"huggingface\" \
return_log_prob=True"

# ==============================================================================
# 3. Launch the Workload
# ==============================================================================
echo "Launching workload ${RUNNAME}..."
xpk workload create --workload "${RUNNAME}" \
  --base-docker-image "${DOCKER_IMAGE}" \
  --command "${CMD}" \
  --num-slices=1 \
  --cluster "${CLUSTER}" --device-type "${DEVICE_TYPE}" --project "${PROJECT}" --zone "${ZONE}"

echo "Workload ${RUNNAME} created."
echo "Use the following command to connect:"
echo "bash benchmarks/api_server/port_forward_xpk.sh job_name=${RUNNAME} project=${PROJECT} zone=${ZONE} cluster=${CLUSTER}"
```

### 2. Launch the Workload

Make the script executable and run it:

```bash
chmod +x launch_gke_server.sh
./launch_gke_server.sh
```

### 3. Connect to the Server

The API server only runs on the first host/worker (rank 0) of the workload. To connect to it, use the `port_forward_xpk.sh` script as instructed in the output of your launch script.

```bash
bash benchmarks/api_server/port_forward_xpk.sh \
  job_name=<your_job_name> \
  project=<your-gcp-project> \
  zone=<your-gcp-zone> \
  cluster=<your-gke-cluster>
```

The script will automatically find the correct pod and establish the port-forward connection. Your server is now accessible at `http://localhost:8000`.

## Interacting with the Server

Once the server is running (either locally or connected via port-forwarding), you can interact with it using any standard HTTP client. The `model` field in the request body can be set to any string; it is used for identification purposes but does not change which model is being served.

### Using `curl`

#### Completions API

The `/v1/completions` endpoint is suitable for simple prompt-response interactions.

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

#### Chat Completions API

The `/v1/chat/completions` endpoint is designed for multi-turn conversations.

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the largest planet in our solar system?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

Server logs will display the following information:

<img src="./images/server-request-logs.png" alt="Server Request Logs" width="894"/>

### Using the OpenAI Python Client

You can also use the official `openai` Python library to interact with the server.

**Installation:**
```bash
pip install openai
```

**Example Python Script:**
```python
from openai import OpenAI

# Point the client to the local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="<your-model-name>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest planet in our solar system?"}
    ]
)

print(completion.choices[0].message.content)
```

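The same client also covers the `/v1/completions` endpoint through `client.completions.create`. Below is a small sketch mirroring the earlier `curl` completions example; as before, the model name is an arbitrary identifier.

```python
from openai import OpenAI

# Point the client to the local server; the API key is not checked.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Plain (non-chat) completion, mirroring the /v1/completions curl example above.
completion = client.completions.create(
    model="<your-model-name>",
    prompt="The capital of France is",
    max_tokens=50,
    temperature=0.7,
)

print(completion.choices[0].text)
```
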
## Benchmarking with Evaluation Frameworks

You can evaluate models served by this API using standard frameworks like [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [evalchemy](https://github.com/mlfoundations/evalchemy).

### Setup

It is highly recommended to set up a new, separate Python virtual environment for the evaluation framework. This prevents any dependency conflicts with the MaxText environment.

```bash
# In a new terminal
python3 -m venv eval_env
source eval_env/bin/activate
```

Install the evaluation frameworks by following their official guides:
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [evalchemy](https://github.com/mlfoundations/evalchemy)

### Log-Likelihood / Multiple Choice Tasks (e.g., MMLU)

Tasks that compare the log-probabilities of different choices (`output_type: multiple_choice` or `loglikelihood`) use the `/v1/completions` endpoint.

To maximize throughput, set the `batch_size` in your evaluation command to match the total batch size of your running server (`per_device_batch_size` * `number of devices`).

**Example: Running MMLU**
```bash
python -m eval.eval \
  --model local-completions \
  --model_args "pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=<your_max_target_length>" \
  --tasks mmlu \
  --batch_size <per_device_batch_size * number of devices> \
  --output_path logs
```

Example benchmark output looks like this:

<img src="./images/mmlu_example.png" alt="MMLU Example" width="894"/>

### Generative Tasks (e.g., AIME)

Tasks that require generating text until a stop sequence is met (`output_type: generate_until`) use the `/v1/chat/completions` endpoint.

The chat API does not support batched requests directly; instead, the evaluation harness sends concurrent requests to simulate batching. To enable this, set `num_concurrent` to match your server's total batch size and set the evaluation `batch_size` to 1. You must also include the `--apply_chat_template` flag. All sampling parameters (such as `temperature` and `top_p`) should be passed via the `--gen_kwargs` argument. For example, if you are serving on a v5p-8 (4 chips) with `per_device_batch_size=4`, set `num_concurrent=16`, as worked out in the sketch below.

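As a quick reference, the total server batch size is simply `per_device_batch_size` multiplied by the number of chips; it is the value to pass as `--batch_size` for log-likelihood tasks and as `num_concurrent` for generative tasks. A small illustrative sketch, using the chip counts from the device examples earlier in this README:

```python
def total_batch_size(per_device_batch_size: int, num_chips: int) -> int:
    """Total server batch size: use as --batch_size (log-likelihood tasks)
    or num_concurrent (generative/chat tasks)."""
    return per_device_batch_size * num_chips

# v5p-8 has 4 chips, v5p-16 has 8 chips (see the launch examples above).
print(total_batch_size(per_device_batch_size=4, num_chips=4))  # 16 -> num_concurrent=16
print(total_batch_size(per_device_batch_size=4, num_chips=8))  # 32
```
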
**Example: Running AIME25**
```bash
python -m eval.eval \
  --model local-chat-completions \
  --model_args "num_concurrent=16,pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/chat/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=<your_max_target_length>" \
  --tasks AIME25 \
  --batch_size 1 \
  --output_path logs \
  --apply_chat_template \
  --gen_kwargs "temperature=0.6,top_p=0.95,top_k=20,max_tokens=<your_max_target_length>,max_gen_toks=<your_max_target_length>"
```
The valid arguments for `--gen_kwargs` are `temperature`, `top_p`, `top_k`, `stop`, `seed`, `max_tokens`, and `max_gen_toks`. The `max_gen_toks` argument is used by some tasks in the evaluation harness to control the maximum number of tokens to generate. We suggest passing `max_tokens` and `max_gen_toks` with the same value.

The evaluation results will be saved to the directory specified by the `--output_path` argument (in the examples above, a directory named `logs`).
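
If you want to inspect scores programmatically rather than reading the console output, a short sketch like the one below can walk the output directory and print any `results` sections it finds. This is only a sketch: the exact file names and JSON schema depend on the harness version, so treat the paths and keys as assumptions to adjust.

```python
# Sketch: summarize JSON reports written under the --output_path directory ("logs").
# File layout and keys vary across harness versions; adjust as needed.
import json
from pathlib import Path

for report in sorted(Path("logs").rglob("*.json")):
    try:
        data = json.loads(report.read_text())
    except (json.JSONDecodeError, UnicodeDecodeError):
        continue  # skip files that are not JSON reports
    results = data.get("results") if isinstance(data, dict) else None
    if isinstance(results, dict):
        print(f"== {report} ==")
        for task, metrics in results.items():
            print(f"  {task}: {metrics}")
```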
[Three image files added under `benchmarks/api_server/images/` (62.1 KB, 70.2 KB, 66.8 KB); binary content not shown.]
benchmarks/api_server/launch_gke_server.sh.template
#!/bin/bash
# Copyright 2023–2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e

# ==============================================================================
# 1. User-Configurable Variables
# ==============================================================================

# -- GKE Cluster Configuration --
# (<your_gke_cluster>, <your_gcp_project>, <your_gcp_zone>)
export CLUSTER="<your-gke-cluster>"
export DEVICE_TYPE="v5p-16"
export PROJECT="<your-gcp-project>"
export ZONE="<your-gcp-zone>"

# -- XPK Workload Configuration --
# (<YYYY-MM-DD>, <your_hugging_face_token>)
export RUNNAME="my-server-$(date +%Y-%m-%d-%H-%M-%S)"
export DOCKER_IMAGE="gcr.io/tpu-prod-env-multipod/maxtext_jax_nightly:<YYYY-MM-DD>"
export HF_TOKEN="<your_hugging_face_token>" # Optional: if your tokenizer is private

# -- Model Configuration --
# IMPORTANT: Replace these with your model's details.
# (<your_model_name>, <path_or_name_to_your_tokenizer>, <path_to_your_checkpoint>)
export MODEL_NAME="qwen3-30b-a3b"
export TOKENIZER_PATH="Qwen/Qwen3-30B-A3B-Thinking-2507"
export LOAD_PARAMETERS_PATH="<path_to_your_checkpoint>"
export PER_DEVICE_BATCH_SIZE=4
# Parallelism settings should match the number of chips on your device.
# For a v5p-16 (8 chips), the product of parallelism values should be 8.
export ICI_TENSOR_PARALLELISM=4
export ICI_EXPERT_PARALLELISM=2

# ==============================================================================
# 2. Define the Command to Run on the Cluster
# ==============================================================================
# This command installs dependencies and then starts the server.
CMD="export HF_TOKEN=${HF_TOKEN} && \
pip install --upgrade pip && \
pip install -r benchmarks/api_server/requirements.txt && \
bash benchmarks/api_server/start_server.sh \
MaxText/configs/base.yml \
model_name=\"${MODEL_NAME}\" \
tokenizer_path=\"${TOKENIZER_PATH}\" \
load_parameters_path=\"${LOAD_PARAMETERS_PATH}\" \
per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
ici_expert_parallelism=${ICI_EXPERT_PARALLELISM} \
tokenizer_type=\"huggingface\" \
return_log_prob=True"

# ==============================================================================
# 3. Launch the Workload
# ==============================================================================
echo "Launching workload ${RUNNAME}..."
xpk workload create --workload "${RUNNAME}" \
  --base-docker-image "${DOCKER_IMAGE}" \
  --command "${CMD}" \
  --num-slices=1 \
  --cluster "${CLUSTER}" --device-type "${DEVICE_TYPE}" --project "${PROJECT}" --zone "${ZONE}"

echo "Workload ${RUNNAME} created."
echo "Use the following command to connect:"
echo "bash benchmarks/api_server/port_forward_xpk.sh job_name=${RUNNAME} project=${PROJECT} zone=${ZONE} cluster=${CLUSTER}"
