Commit 400ee8c

feat(api_server): Add OpenAI-compatible API server for MaxText models
This commit introduces a fully-featured, OpenAI-compatible RESTful API server for serving MaxText models. The server is built with FastAPI, supports multi-host inference on TPUs, and is designed for both interactive use and large-scale benchmarking.

Key features and additions:

1. **Core Server Implementation:**
   - Adds `maxtext_server.py`, a FastAPI application that serves `/v1/completions` and `/v1/chat/completions` endpoints.
   - Implements dynamic request batching to efficiently utilize the underlying hardware.
   - Uses `maxtext_generator.py` to encapsulate the MaxText inference engine, handling model loading, tokenization, and the generation loop.
   - Includes Pydantic models in `server_models.py` for robust, OpenAI-compliant request and response validation.

2. **Deployment and Utilities:**
   - Provides `start_server.sh` to simplify launching the server from the project root.
   - Adds `port_forward_xpk.sh`, a utility script to automatically find and connect to a server running on a GKE cluster via `xpk`, supporting custom namespaces.
   - Isolates server-specific dependencies in `benchmarks/api_server/requirements.txt` (`uvicorn`, `fastapi`, `openai-harmony`).

3. **Comprehensive Documentation:**
   - A new `README.md` in the `api_server` directory offers a complete guide covering:
     - Installation and environment setup.
     - Launching the server in both single-pod and multi-pod GKE environments.
     - Detailed examples for interacting with the API using `curl` and the `openai` Python client.
     - Step-by-step instructions for running benchmarks with `lm-evaluation-harness` and `evalchemy` for both log-likelihood and generative tasks.
1 parent 3eb56ef commit 400ee8c

File tree: 10 files changed, +1639 -1 lines changed

10 files changed

+1639
-1
lines changed

benchmarks/api_server/README.md

Lines changed: 297 additions & 0 deletions
# MaxText API Server

This directory contains an OpenAI-compatible API server for serving MaxText models. It uses FastAPI as the web framework and can be deployed on a single machine or a multi-pod GKE cluster.

## Table of Contents

- [Installation](#installation)
- [Environment Variables](#environment-variables)
- [Launching the Server (Single Pod)](#launching-the-server-single-pod)
- [Deploying on a GKE Cluster (Multi-Pod)](#deploying-on-a-gke-cluster-multi-pod)
- [Interacting with the Server](#interacting-with-the-server)
- [Benchmarking with Evaluation Frameworks](#benchmarking-with-evaluation-frameworks)
## Installation

The server has a few additional dependencies beyond the core MaxText requirements. Install them using the provided `requirements.txt` file:

```bash
pip install -r benchmarks/api_server/requirements.txt
```
## Environment Variables

Before launching the server, you may need to set the following environment variable:

- `HF_TOKEN`: Your Hugging Face access token. This is required if the model's tokenizer is hosted on the Hugging Face Hub and is not public.

```bash
export HF_TOKEN=<your_hugging_face_token>
```
## Launching the Server (Single Pod)

The primary way to launch the API server is the `start_server.sh` script. This script ensures that the server is run from the project's root directory, which is necessary for the Python interpreter to find all the required modules.

The script takes the path to a base configuration file (e.g., `MaxText/configs/base.yml`) followed by any number of model-specific configuration overrides.

### Benchmarking Configuration

To use this server for benchmarking with frameworks like `lm-eval-harness` or `evalchemy`, you **must** include the following two arguments in your launch command:

- `tokenizer_type="huggingface"`: Ensures the tokenizer is compatible with the evaluation harness.
- `return_log_prob=True`: Enables the log probability calculations required for many standard evaluation metrics.

### Command Structure

```bash
bash benchmarks/api_server/start_server.sh /path/to/base.yml [arg1=value1] [arg2=value2] ...
```

### Example

Here is an example of how to launch the server with a `qwen3-30b-a3b` model, configured for benchmarking.

```bash
# Make sure you are in the root directory of the maxtext project.
bash benchmarks/api_server/start_server.sh \
  MaxText/configs/base.yml \
  model_name="qwen3-30b-a3b" \
  tokenizer_path="Qwen/Qwen3-30B-A3B-Thinking-2507" \
  load_parameters_path="<path_to_your_checkpoint>" \
  per_device_batch_size=4 \
  ici_tensor_parallelism=4 \
  max_prefill_predict_length=1024 \
  max_target_length=2048 \
  async_checkpointing=false \
  scan_layers=false \
  attention="dot_product" \
  tokenizer_type="huggingface" \
  return_log_prob=True
```
Once the server starts successfully, you will see a confirmation message from Uvicorn:

```
INFO: RANK 0: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

The server is now ready to accept requests on port 8000.
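To confirm the server is responding, you can send a quick test request. This is only a smoke test; the full API is described under [Interacting with the Server](#interacting-with-the-server), and the `model` string can be any identifier:

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "maxtext", "prompt": "Hello", "max_tokens": 8}'
```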
## Deploying on a GKE Cluster (Multi-Pod)

For large models that require a multi-pod TPU setup, you can deploy the server using the xpk (Accelerated Processing Kit) tool. The recommended approach is to create a single submission script to configure and launch the workload.

### 1. Create a Job Submission Script

Create a new bash script (e.g., `launch_gke_server.sh`) to hold your configuration and `xpk` command. This makes launching jobs repeatable and easy to modify.

Inside this script, you define the server's startup command and your cluster configuration. Before running the script, fill in the placeholder values at the top of the file. Placeholders are enclosed in angle brackets (e.g., `<your_gcp_project>`).
```bash
#!/bin/bash
set -e

# ==============================================================================
# 1. User-Configurable Variables
# ==============================================================================

# -- GKE Cluster Configuration --
# (<your_gke_cluster>, <your_gcp_project>, <your_gcp_zone>)
export CLUSTER="<your-gke-cluster>"
export DEVICE_TYPE="v5p-16"
export PROJECT="<your-gcp-project>"
export ZONE="<your-gcp-zone>"

# -- XPK Workload Configuration --
# (<YYYY-MM-DD>, <your_hugging_face_token>)
export RUNNAME="my-server-$(date +%Y-%m-%d-%H-%M-%S)"
export DOCKER_IMAGE="gcr.io/tpu-prod-env-multipod/maxtext_jax_nightly:<YYYY-MM-DD>"
export HF_TOKEN="<your_hugging_face_token>" # Optional: if your tokenizer is private

# -- Model Configuration --
# IMPORTANT: Replace these with your model's details.
# (<your_model_name>, <path_or_name_to_your_tokenizer>, <path_to_your_checkpoint>)
export MODEL_NAME="qwen3-30b-a3b"
export TOKENIZER_PATH="Qwen/Qwen3-30B-A3B-Thinking-2507"
export LOAD_PARAMETERS_PATH="<path_to_your_checkpoint>"
export PER_DEVICE_BATCH_SIZE=4
export ICI_TENSOR_PARALLELISM=4

# ==============================================================================
# 2. Define the Command to Run on the Cluster
# ==============================================================================
# This command installs dependencies and then starts the server.
CMD="export HF_TOKEN=${HF_TOKEN} && \
  pip install --upgrade pip && \
  pip install -r benchmarks/api_server/requirements.txt && \
  bash benchmarks/api_server/start_server.sh \
    MaxText/configs/base.yml \
    model_name="${MODEL_NAME}" \
    tokenizer_path="${TOKENIZER_PATH}" \
    load_parameters_path="${LOAD_PARAMETERS_PATH}" \
    per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
    ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
    tokenizer_type="huggingface" \
    return_log_prob=True"

# ==============================================================================
# 3. Launch the Workload
# ==============================================================================
echo "Launching workload ${RUNNAME}..."
xpk workload create --workload "${RUNNAME}" \
  --base-docker-image "${DOCKER_IMAGE}" \
  --command "${CMD}" \
  --num-slices=1 \
  --cluster "${CLUSTER}" --device-type "${DEVICE_TYPE}" --project "${PROJECT}" --zone "${ZONE}"

echo "Workload ${RUNNAME} created."
echo "Use the following command to connect:"
echo "bash benchmarks/api_server/port_forward_xpk.sh job_name=${RUNNAME} project=${PROJECT} zone=${ZONE} cluster=${CLUSTER}"
```
### 2. Launch the Workload

Make the script executable and run it:

```bash
chmod +x launch_gke_server.sh
./launch_gke_server.sh
```
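Before connecting, you can optionally check that the workload has been scheduled and is running. A sketch using xpk's workload listing; the exact flags can vary between xpk releases, so consult `xpk workload list --help` if this does not match your version:

```bash
# Look for the workload name printed by the launch script in the output
xpk workload list --cluster <your-gke-cluster> --project <your-gcp-project> --zone <your-gcp-zone>
```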
### 3. Connect to the Server

The API server only runs on the first pod (rank 0) of the workload. To connect to it, use the `port_forward_xpk.sh` script as instructed in the output of your launch script.

```bash
bash benchmarks/api_server/port_forward_xpk.sh \
  job_name=<your_job_name> \
  project=<your-gcp-project> \
  zone=<your-gcp-zone> \
  cluster=<your-gke-cluster>
```

The script will automatically find the correct pod and establish the port-forward connection. Your server is now accessible at `http://localhost:8000`.
## Interacting with the Server

Once the server is running (either locally or connected via port-forwarding), you can interact with it using any standard HTTP client. The `model` field in the request body can be set to any string; it is used for identification purposes but does not change which model is being served.

### Using `curl`

#### Completions API

The `/v1/completions` endpoint is suitable for simple prompt-response interactions.

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
#### Chat Completions API

The `/v1/chat/completions` endpoint is designed for multi-turn conversations.

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the largest planet in our solar system?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
### Using the OpenAI Python Client

You can also use the official `openai` Python library to interact with the server.

**Installation:**
```bash
pip install openai
```

**Example Python Script:**
```python
from openai import OpenAI

# Point the client to the local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="<your-model-name>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest planet in our solar system?"}
    ]
)

print(completion.choices[0].message.content)
```
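The same client can also call the `/v1/completions` endpoint. A minimal sketch mirroring the `curl` example above, assuming the server is reachable on the default port:

```python
from openai import OpenAI

# Point the client to the local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Plain-text completion via the /v1/completions endpoint
completion = client.completions.create(
    model="<your-model-name>",
    prompt="The capital of France is",
    max_tokens=50,
    temperature=0.7,
)

print(completion.choices[0].text)
```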
## Benchmarking with Evaluation Frameworks

You can evaluate models served by this API using standard frameworks like [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [evalchemy](https://github.com/mlfoundations/evalchemy).

### Setup

It is highly recommended to set up a new, separate Python virtual environment for the evaluation framework. This prevents any dependency conflicts with the MaxText environment.

```bash
# In a new terminal
python3 -m venv eval_env
source eval_env/bin/activate

# Install the evaluation frameworks by following their official guides:
# lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness
# evalchemy: https://github.com/mlfoundations/evalchemy
```
### Log-Likelihood / Multiple Choice Tasks (e.g., MMLU)

Tasks that compare the log-probabilities of different choices (`output_type: multiple_choice` or `loglikelihood`) use the `/v1/completions` endpoint.

To maximize throughput, set the `batch_size` in your evaluation command to match the total batch size of your running server (`per_device_batch_size` multiplied by the number of devices; for example, `per_device_batch_size=4` on 4 devices gives a total batch size of 16).

**Example: Running MMLU**
```bash
python -m eval.eval \
  --model local-completions \
  --model_args "pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=81920" \
  --tasks mmlu \
  --batch_size 16 \
  --output_path logs
```
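The commands above use evalchemy's `python -m eval.eval` entry point. If you run lm-evaluation-harness directly instead, the same `local-completions` backend and `--model_args` apply; a sketch, assuming a recent release that provides the `lm_eval` CLI:

```bash
lm_eval --model local-completions \
  --model_args "pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=81920" \
  --tasks mmlu \
  --batch_size 16 \
  --output_path logs
```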
### Generative Tasks (e.g., AIME)

Tasks that require generating text until a stop sequence is met (`output_type: generate_until`) use the `/v1/chat/completions` endpoint.

The chat API does not support batched requests directly. Instead, the evaluation harness sends concurrent requests to simulate batching. To enable this, set `num_concurrent` to match your server's total batch size and set the evaluation `batch_size` to 1. You must also include the `--apply_chat_template` flag. All sampling parameters (like temperature, top_p, etc.) should be passed via the `--gen_kwargs` argument.

**Example: Running AIME25**
```bash
python -m eval.eval \
  --model local-chat-completions \
  --model_args "pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/chat/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=81920,num_concurrent=30" \
  --tasks AIME25 \
  --batch_size 1 \
  --output_path logs \
  --apply_chat_template \
  --gen_kwargs "temperature=0.6,top_p=0.95,top_k=20,max_gen_toks=81920"
```
