# MaxText API Server

This directory contains an OpenAI-compatible API server for serving MaxText models, enabling benchmarks with evaluation frameworks like lm-eval-harness and evalchemy. It uses [FastAPI](https://fastapi.tiangolo.com/) as the web framework and can be deployed on a single machine or a multi-host GKE cluster.

## Table of Contents
- [Installation](#installation)
- [Environment Variables](#environment-variables)
- [Launching the Server (Single Pod)](#launching-the-server-single-pod)
- [Deploying on a GKE Cluster (Multi-Host)](#deploying-on-a-gke-cluster-multi-host)
- [Interacting with the Server](#interacting-with-the-server)
- [Benchmarking with Evaluation Frameworks](#benchmarking-with-evaluation-frameworks)


## Installation

The server has a few additional dependencies beyond the core MaxText requirements. Install them using the provided `requirements.txt` file:

```bash
pip install -r benchmarks/api_server/requirements.txt
```

## Environment Variables

Before launching the server, you may need to set the following environment variable:

- `HF_TOKEN`: Your Hugging Face access token. This is required if the model's tokenizer is hosted on the Hugging Face Hub and is not public.

```bash
export HF_TOKEN=<your_hugging_face_token>
```

## Launching the Server (Single Pod)

The primary way to launch the API server is by using the `start_server.sh` script. This script ensures that the server is run from the project's root directory, which is necessary for the Python interpreter to find all the required modules.

The script takes the path to a base configuration file (e.g., `MaxText/configs/base.yml`) followed by any number of model-specific configuration overrides.

### Benchmarking Configuration

To use this server for benchmarking with frameworks like `lm-eval-harness` or `evalchemy`, you **must** include the following two arguments in your launch command:

- `tokenizer_type="huggingface"`: Ensures the tokenizer is compatible with the evaluation harness.
- `return_log_prob=True`: Enables the log probability calculations required for many standard evaluation metrics.

### Command Structure

```bash
bash benchmarks/api_server/start_server.sh /path/to/base.yml [arg1=value1] [arg2=value2] ...
```

### Example

Here is an example of how to launch the server with a `qwen3-30b-a3b` model, configured for benchmarking. This example is configured for a TPU v5p-8, which has 4 chips.

```bash
# Make sure you are in the root directory of the maxtext project.

bash benchmarks/api_server/start_server.sh \
  MaxText/configs/base.yml \
  model_name="qwen3-30b-a3b" \
  tokenizer_path="Qwen/Qwen3-30B-A3B-Thinking-2507" \
  load_parameters_path="<path_to_your_checkpoint>" \
  per_device_batch_size=4 \
  ici_tensor_parallelism=4 \
  max_prefill_predict_length=1024 \
  max_target_length=2048 \
  async_checkpointing=false \
  scan_layers=false \
  attention="dot_product" \
  tokenizer_type="huggingface" \
  return_log_prob=True
```

Once the server starts successfully, you will see a confirmation message from Uvicorn:

```
INFO: RANK 0: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
```

The server is now ready to accept requests on port 8000.
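
As a quick smoke test, you can send a minimal request to the completions endpoint described in [Interacting with the Server](#interacting-with-the-server). This assumes the server is reachable at `localhost:8000`; the `model` string is arbitrary and used only for identification.

```bash
# Minimal request against the OpenAI-compatible completions endpoint.
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "smoke-test", "prompt": "Hello", "max_tokens": 8}'
```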

## Deploying on a GKE Cluster (Multi-Host)

For large models that require a multi-host TPU setup, you can deploy the server using the [xpk](https://github.com/AI-Hypercomputer/xpk) tool. The recommended approach is to create a single submission script to configure and launch the workload.


### 1. Create a Job Submission Script

Create a new bash script (e.g., `launch_gke_server.sh`) to hold your configuration and `xpk` command. This makes launching jobs repeatable and easy to modify.

For your convenience, the script below is also available as a template file at `benchmarks/api_server/launch_gke_server.sh.template`.

Inside this script, you will define the server's startup command and your cluster configuration. Before running the script, fill in the placeholders at the top of the file. Placeholders are enclosed in angle brackets (e.g., `<your-gcp-project>`).

```bash
#!/bin/bash
set -e

# ==============================================================================
# 1. User-Configurable Variables
# ==============================================================================

# -- GKE Cluster Configuration --
# (<your-gke-cluster>, <your-gcp-project>, <your-gcp-zone>)
export CLUSTER="<your-gke-cluster>"
export DEVICE_TYPE="v5p-16"
export PROJECT="<your-gcp-project>"
export ZONE="<your-gcp-zone>"

# -- XPK Workload Configuration --
# (<YYYY-MM-DD>, <your_hugging_face_token>)
export RUNNAME="my-server-$(date +%Y-%m-%d-%H-%M-%S)"
export DOCKER_IMAGE="gcr.io/tpu-prod-env-multipod/maxtext_jax_nightly:<YYYY-MM-DD>"
export HF_TOKEN="<your_hugging_face_token>" # Optional: if your tokenizer is private

# -- Model Configuration --
# IMPORTANT: Replace these with your model's details.
# (<your_model_name>, <path_or_name_to_your_tokenizer>, <path_to_your_checkpoint>)
export MODEL_NAME="qwen3-30b-a3b"
export TOKENIZER_PATH="Qwen/Qwen3-30B-A3B-Thinking-2507"
export LOAD_PARAMETERS_PATH="<path_to_your_checkpoint>"
export PER_DEVICE_BATCH_SIZE=4
# Parallelism settings should match the number of chips on your device.
# For a v5p-16 (8 chips), the product of parallelism values should be 8.
export ICI_TENSOR_PARALLELISM=4
export ICI_EXPERT_PARALLELISM=2

# ==============================================================================
# 2. Define the Command to Run on the Cluster
# ==============================================================================
# This command installs dependencies and then starts the server.
CMD="export HF_TOKEN=${HF_TOKEN} && \
    pip install --upgrade pip && \
    pip install -r benchmarks/api_server/requirements.txt && \
    bash benchmarks/api_server/start_server.sh \
    MaxText/configs/base.yml \
    model_name="${MODEL_NAME}" \
    tokenizer_path="${TOKENIZER_PATH}" \
    load_parameters_path="${LOAD_PARAMETERS_PATH}" \
    per_device_batch_size=${PER_DEVICE_BATCH_SIZE} \
    ici_tensor_parallelism=${ICI_TENSOR_PARALLELISM} \
    ici_expert_parallelism=${ICI_EXPERT_PARALLELISM} \
    tokenizer_type=\"huggingface\" \
    return_log_prob=True"


# ==============================================================================
# 3. Launch the Workload
# ==============================================================================
echo "Launching workload ${RUNNAME}..."
xpk workload create --workload "${RUNNAME}" \
  --base-docker-image "${DOCKER_IMAGE}" \
  --command "${CMD}" \
  --num-slices=1 \
  --cluster "${CLUSTER}" --device-type "${DEVICE_TYPE}" --project "${PROJECT}" --zone "${ZONE}"

echo "Workload ${RUNNAME} created."
echo "Use the following command to connect:"
echo "bash benchmarks/api_server/port_forward_xpk.sh job_name=${RUNNAME} project=${PROJECT} zone=${ZONE} cluster=${CLUSTER}"
```

### 2. Launch the Workload

Make the script executable and run it:

```bash
chmod +x launch_gke_server.sh
./launch_gke_server.sh
```
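
Before connecting, you can check that the workload came up. Below is a rough sketch, reusing the values from `launch_gke_server.sh` and assuming `kubectl` is already pointed at the cluster; xpk flags can vary between versions, so adjust as needed.

```bash
# List xpk workloads on the cluster and look for your run name.
xpk workload list --cluster "${CLUSTER}" --project "${PROJECT}" --zone "${ZONE}"

# Or inspect the pods for the run directly with kubectl.
kubectl get pods | grep "${RUNNAME}"
```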

### 3. Connect to the Server

The API server only runs on the first host/worker (rank 0) of the workload. To connect to it, use the `port_forward_xpk.sh` script as instructed in the output of your launch script.

```bash
bash benchmarks/api_server/port_forward_xpk.sh \
  job_name=<your_job_name> \
  project=<your-gcp-project> \
  zone=<your-gcp-zone> \
  cluster=<your-gke-cluster>
```

The script will automatically find the correct pod and establish the port-forward connection. Your server is now accessible at `http://localhost:8000`.

## Interacting with the Server

Once the server is running (either locally or connected via port-forwarding), you can interact with it using any standard HTTP client. The `model` field in the request body can be set to any string; it is used for identification purposes but does not change which model is being served.

### Using `curl`

#### Completions API

The `/v1/completions` endpoint is suitable for simple prompt-response interactions.

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "prompt": "The capital of France is",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

#### Chat Completions API

The `/v1/chat/completions` endpoint is designed for multi-turn conversations.

```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<your-model-name>",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the largest planet in our solar system?"}
    ],
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

### Using the OpenAI Python Client

You can also use the official `openai` Python library to interact with the server.

**Installation:**
```bash
pip install openai
```

**Example Python Script:**
```python
from openai import OpenAI

# Point the client to the local server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

completion = client.chat.completions.create(
    model="<your-model-name>",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the largest planet in our solar system?"}
    ]
)

print(completion.choices[0].message.content)
```

## Benchmarking with Evaluation Frameworks

You can evaluate models served by this API using standard frameworks like [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and [evalchemy](https://github.com/mlfoundations/evalchemy).

### Setup

It is highly recommended to set up a new, separate Python virtual environment for the evaluation framework. This prevents dependency conflicts with the MaxText environment.

```bash
# In a new terminal
python3 -m venv eval_env
source eval_env/bin/activate
```

Install the evaluation frameworks by following their official guides:
- [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- [evalchemy](https://github.com/mlfoundations/evalchemy)
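
For reference, lm-evaluation-harness is typically installed from source inside the `eval_env` environment; treat this as a sketch and defer to the official guide if its instructions have changed (evalchemy has its own setup steps):

```bash
# Install lm-evaluation-harness from source into the active eval_env.
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .
```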


### Log-Likelihood / Multiple Choice Tasks (e.g., MMLU)

Tasks that compare the log-probabilities of different choices (`output_type: multiple_choice` or `loglikelihood`) use the `/v1/completions` endpoint.

To maximize throughput, set the `batch_size` in your evaluation command to match the total batch size of your running server (`per_device_batch_size` * `number of devices`). For example, a v5p-8 (4 chips) running with `per_device_batch_size=4` supports a total batch size of 16.

**Example: Running MMLU**
```bash
python -m eval.eval \
  --model local-completions \
  --model_args "pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=<your_max_target_length>" \
  --tasks mmlu \
  --batch_size <per_device_batch_size * number of devices> \
  --output_path logs
```
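
The command above uses evalchemy's `eval.eval` entry point. If you are running lm-evaluation-harness directly, the equivalent invocation is sketched below with the same `--model_args` string; exact flag behavior can differ between releases, so check `lm_eval --help` for your version:

```bash
lm_eval --model local-completions \
  --model_args "pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=<your_max_target_length>" \
  --tasks mmlu \
  --batch_size <per_device_batch_size * number of devices> \
  --output_path logs
```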

### Generative Tasks (e.g., AIME)

Tasks that require generating text until a stop sequence is met (`output_type: generate_until`) use the `/v1/chat/completions` endpoint.

The chat API does not support batched requests directly. Instead, the evaluation harness sends concurrent requests to simulate batching. To enable this, set `num_concurrent` to match your server's total batch size and set the evaluation `batch_size` to 1. You must also include the `--apply_chat_template` flag. All sampling parameters (such as `temperature` and `top_p`) should be passed via the `--gen_kwargs` argument. For example, if you are running on a v5p-8 (4 chips) with `per_device_batch_size=4`, set `num_concurrent=16`.

**Example: Running AIME25**
```bash
python -m eval.eval \
  --model local-chat-completions \
  --model_args "num_concurrent=16,pretrained=<path_or_name_to_your_tokenizer>,base_url=http://localhost:8000/v1/chat/completions,tokenizer_backend=huggingface,tokenizer=<path_or_name_to_your_tokenizer>,model=<your_model_name>,max_length=<your_max_target_length>" \
  --tasks AIME25 \
  --batch_size 1 \
  --output_path logs \
  --apply_chat_template \
  --gen_kwargs "temperature=0.6,top_p=0.95,top_k=20,max_tokens=<your_max_target_length>,max_gen_toks=<your_max_target_length>"
```
The valid arguments for `--gen_kwargs` are `temperature`, `top_p`, `top_k`, `stop`, `seed`, `max_tokens`, and `max_gen_toks`. The `max_gen_toks` argument is used by some tasks in the evaluation harness to control the maximum number of tokens to generate. We suggest passing `max_tokens` and `max_gen_toks` with the same value.