
Commit 21209f9

Support ray (#6323)
1 parent eacc0a1 commit 21209f9


43 files changed: +1200 -765 lines

README.md

Lines changed: 3 additions & 1 deletion
@@ -75,8 +75,10 @@ You can contact us and communicate with us by adding our group:
 
 
 ## 🎉 News
+- 🎁 2025.10.28: Ray is now supported; see the documentation [here](docs/source_en/Instruction/Ray.md).
+- 🎁 2025.10.28: Support [using yaml](examples/yaml) to configure command-line parameters.
 - 🎁 2025.09.29: Support padding_free for embedding/reranker/seq_cls tasks, use `--padding_free true --task_type embedding/reranker/generative_reranker/seq_cls` to begin!
-- 🎁 2025.09.07: Added support for CHORD training algorithm. See the [documentation](./docs/source_en/Instruction/GRPO/AdvancedResearch/CHORD.md)
+- 🎁 2025.09.07: Added support for the CHORD training algorithm. See the [documentation](./docs/source_en/Instruction/GRPO/AdvancedResearch/CHORD.md).
 - 🎁 2025.09.06: Ulysses can now be used with ring-attention, allowing sequences to be sharded into any number of chunks (no longer limited by the number of heads). The argument remains `--sequence_parallel_size N`.
 - 🎁 2025.09.02: Megatron-SWIFT now supports multimodal model training. Documentation can be found [here](./docs/source_en/Megatron-SWIFT/Multimodal-Model.md).
 - 🎁 2025.08.12: Support [Dynamic Fine-Tuning](https://arxiv.org/abs/2508.05629) (DFT) in SFT training, use parameter `--enable_dft_loss true`. Training scripts can be found [here](https://github.com/modelscope/ms-swift/blob/main/examples/train/full/dft.sh).

README_CN.md

Lines changed: 2 additions & 0 deletions
@@ -71,6 +71,8 @@
 - **Model quantization**: supports quantized export with AWQ, GPTQ, FP8, and BNB; the exported models support accelerated inference with vLLM/SGLang/LmDeploy and further training.
 
 ## 🎉 News
+- 🎁 2025.10.28: Ray is [now supported](docs/source/Instruction/ray的支持.md).
+- 🎁 2025.10.28: Support [using yaml](examples/yaml) to configure command-line parameters.
 - 🎁 2025.09.29: Support the padding_free argument for embedding/reranker/seq_cls tasks; use `--padding_free true --task_type embedding/reranker/generative_reranker/seq_cls` to start training!
 - 🎁 2025.09.07: Support the CHORD training algorithm; see the [documentation](docs/source/Instruction/GRPO/AdvancedResearch/CHORD.md).
 - 🎁 2025.09.06: Ulysses can now be combined with ring-attention, allowing input sequences to be split into any number of chunks (no longer limited by num_heads); the argument remains `--sequence_parallel_size N`.
docs/source/Instruction/ray的支持.md

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
# Ray support

SWIFT supports using ray for multi-GPU and multi-node training. Ray support across existing features is as follows:
| Feature  | Ray supported | Example                                                                         | Assignable roles |
|----------|---------------|---------------------------------------------------------------------------------|------------------|
| pt/sft   | ✅            | https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node/ray | default          |
| dpo      | ❌            |                                                                                 |                  |
| grpo     | ❌            |                                                                                 |                  |
| ppo      | ❌            |                                                                                 |                  |
| megatron | ❌            |                                                                                 |                  |
| sampling | ✅            | https://github.com/modelscope/ms-swift/tree/main/examples/sampler/sample       | sampler/prm/orm  |
| distill  | ✅            | https://github.com/modelscope/ms-swift/tree/main/examples/sampler/distill      | sampler/prm/orm  |

## Technical details

Before describing the parameters, some technical details are worth explaining. Because SWIFT currently relies on a large amount of existing transformers and trl code, decomposing it into separate ray roles, as veRL or ROLL do, is impractical; such a decomposition would also make the codebase ray-centric and weaken support for non-ray scenarios.
SWIFT therefore takes a decorator-based approach that defines roles at the function level; how these roles are used is then specified in the parameters. Consider the following example:

```python
from torch.utils.data import DataLoader

from swift.ray import RayHelper


@RayHelper.worker(group=['model1', 'model2'])
class MyTrainer:

    def __init__(self, args):
        self._prepare_model1()
        self._prepare_model2()
        self._prepare_datasets()

    @RayHelper.function(group='model1')
    def _prepare_model1(self):
        ...

    @RayHelper.function(group='model2')
    def _prepare_model2(self):
        ...

    @RayHelper.function(group='model1')
    def rollout(self, inputs):
        return self.model1.generate(inputs)

    @RayHelper.function(group='model2')
    def forward_model2(self, inputs):
        loss = self.model2.forward(inputs)
        loss.backward()

    def _prepare_datasets(self):
        self.dataset = ...

    def train(self):
        for batch in DataLoader(self.dataset):
            generated = self.rollout(batch)
            self.forward_model2(generated)
            ...


if __name__ == '__main__':
    ...
    MyTrainer(args).train()
```

RayHelper assigns the decorated methods to different hardware clusters; local calls are transparently turned into remote calls on the ray cluster. The split can also be organized around classes:

```python
@RayHelper.worker(group=['model1'])
class Model1:
    ...

    @RayHelper.function(group='model1')
    def rollout(self):
        ...


@RayHelper.worker(group=['model2'])
class Model2:
    ...

    @RayHelper.function(group='model2')
    def forward_and_optimize(self):
        ...


class Trainer:
    ...
```

SWIFT's ray support is essentially the combined use of the @worker and @function decorators: worker specifies the roles in the ray cluster, and function specifies how data is dispatched.

The function decorator takes several additional parameters:

```python
@staticmethod
def function(group: str,
             dispatch: Union[Literal['slice', 'all'], Callable] = 'all',
             execute: Literal['first', 'all'] = 'all',
             collect: Union[Literal['none', 'flatten'], Callable] = 'none'):
```

- dispatch: how the input arguments of a call are distributed
  - slice: split the inputs across workers, i.e. load-balanced execution
  - all: every worker receives identical inputs
  - Callable: a custom split, of the form:
    ```python
    def my_custom_slice(n, i, data):
        # n is the number of workers, i is the current worker's index,
        # and data is the original input; return the input for worker i.
        ...
    ```
- execute: how the call is executed
  - first: only rank0 executes; slice and Callable dispatch have no effect in this case
  - all: all workers execute
- collect: how the returned data is collected
  - none: return as-is, a list of each worker's return value
  - flatten: flatten the workers' results; tuples are flattened as well
  - Callable: a custom collect, of the form:
    ```python
    def my_custom_collect(result):
        # result is the list of the workers' return values;
        # return it in whatever format you need.
        ...
    ```
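
Putting dispatch and collect together, here is a minimal sketch of a custom pair; the helper names (`shard_rows`, `concat_results`, `Generator`) and the slicing strategy are illustrative assumptions, not SWIFT API:

```python
def shard_rows(n, i, data):
    # n = number of workers in the group, i = this worker's index;
    # worker i receives every n-th element of the batch.
    return data[i::n]


def concat_results(result):
    # result is the list of per-worker return values; merge into one flat list.
    merged = []
    for part in result:
        merged.extend(part)
    return merged


@RayHelper.worker(group=['model1'])
class Generator:

    @RayHelper.function(group='model1', dispatch=shard_rows, collect=concat_results)
    def rollout(self, inputs):
        # Each worker generates on its shard; the shards are merged on return.
        return self.model1.generate(inputs)
```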
## Parameter settings

With the technical details covered, we can move on to parameter configuration. Developers can set up different hardware arrangements according to the list of roles in a given workflow. For example, the sampling feature has three roles, sampler, prm, and orm, which can be configured like this:

```yaml
device_groups:
  nproc_per_node: 4
  sample_group:
    device: GPU
    ranks: list(range(0, 2))
    workers:
      - sampler
  rm_group:
    device: GPU
    ranks: list(range(2, 4))
    workers:
      - prm
      - orm
```

- nproc_per_node: the minimum number of devices per node required in the ray cluster.
- xxx_group: the name of each ray group; any name may be used.
  - device: the device type; currently GPU/CPU etc. are supported.
  - ranks: which ranks the group is assigned to. For CPU, ranks can only be an integer indicating the total number of processes needed; for GPU, formats such as `[0,1,2,3]`, `4`, or `list(range(0, 4))` are accepted.
  - workers: which roles are assigned to this group.

All available roles are listed in the table at the top of this document.

When using the command line, device_groups can also be passed as `--device_groups xxx`, where xxx is a JSON string. For simpler configuration, we strongly recommend using the yaml approach together with ray.
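
A minimal sketch of building that JSON string, assuming it mirrors the yaml layout above one-to-one (with ranks written as explicit lists):

```python
import json

# Assumed to mirror the yaml layout above one-to-one.
device_groups = {
    'nproc_per_node': 4,
    'sample_group': {'device': 'GPU', 'ranks': [0, 1], 'workers': ['sampler']},
    'rm_group': {'device': 'GPU', 'ranks': [2, 3], 'workers': ['prm', 'orm']},
}
# Pass the printed string via --device_groups '...'
print(json.dumps(device_groups))
```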

docs/source/Instruction/命令行参数.md

Lines changed: 31 additions & 13 deletions
@@ -142,6 +142,34 @@
 - bnb_4bit_use_double_quant: whether to use double quantization; defaults to `True`.
 - bnb_4bit_quant_storage: the bnb quantization storage type; defaults to None.
 
+### RAY arguments
+
+- use_ray: boolean. Whether to use ray; defaults to `False`.
+- ray_exp_name: the ray experiment name; this field is used as the prefix for cluster and worker names and may be left empty.
+- device_groups: string (JSON string). This field must be configured when using ray; see the [ray documentation](ray的支持.md) for details.
+
+### yaml support
+
+- config: a config file can be used in place of command-line arguments, for example:
+
+```shell
+swift sft --config demo.yaml
+```
+
+The content of demo.yaml is the concrete command-line configuration:
+
+```yaml
+# Model args
+model: Qwen/Qwen2.5-7B-Instruct
+dataset: swift/self-cognition
+...
+
+# Train args
+output_dir: xxx/xxx
+gradient_checkpointing: true
+...
+```
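
For illustration, a minimal sketch of this equivalence, assuming each top-level yaml key maps to a single `--flag value` pair:

```python
import yaml  # requires pyyaml

# Read a config like demo.yaml (with the elisions filled in) and render it
# as the equivalent command line.
with open('demo.yaml') as f:
    cfg = yaml.safe_load(f)

flags = ' '.join(f'--{key} {value}' for key, value in cfg.items())
print(f'swift sft {flags}')
```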
 
 ## Atomic arguments
 

@@ -681,13 +709,13 @@ App arguments inherit from [deployment arguments](#部署参数) and [Web-UI arguments](#Web-UI参数)
 - prm_model: the type of the process reward model; either a model id (launched in pt mode) or a prm key defined in the plugin (for a custom inference process).
 - orm_model: the type of the outcome reward model, usually a wildcard or test cases, generally defined in the plugin.
-- sampler_type: the sampling type; currently supports sample and mcts, with dvts to be supported in the future.
+- sampler_type: the sampling type; currently supports sample and distill.
 - sampler_engine: the inference engine of the sampling model; supports `pt`, `lmdeploy`, `vllm`, `client`, `no`; defaults to `pt`.
 - output_dir: the output directory; defaults to `sample_output`.
 - output_file: the output file name; defaults to `None`, which uses a timestamp as the file name. Pass only a file name without a directory; only the jsonl format is supported.
 - override_exist_file: whether to overwrite `output_file` if it already exists.
-- num_sampling_per_gpu_batch_size: the batch size of each sampling pass.
-- num_sampling_per_gpu_batches: the total number of batches to sample.
+- num_sampling_batch_size: the batch size of each sampling pass.
+- num_sampling_batches: the total number of batches to sample.
 - n_best_to_keep: how many best sequences to return.
 - data_range: the shard of the dataset this sampling run processes. The format is `2 3`, meaning the dataset is split into 3 shards (which usually implies three `swift sample` processes running in parallel) and this instance is processing the 3rd shard.
 - temperature: defaults to 1.0 here.
@@ -698,16 +726,6 @@ App arguments inherit from [deployment arguments](#部署参数) and [Web-UI arguments](#Web-UI参数)
 - cache_files: to avoid OOM from loading the prm and the generator at the same time, sampling can be done in two steps. In the first step, set prm and orm to `None`; all results are then written to a file. In the second run, set sampler_engine to `no` and pass `--cache_files` with the output file of the previous run; the previous results are then used for prm and orm evaluation to produce the final output.
   - Note: when using cache_files, `--dataset` still needs to be passed, because the id in cache_files is an md5 computed from the original data and the two pieces of information must be combined.
 
-#### MCTS
-- rollout_depth: the maximum depth during rollout; defaults to `5`.
-- rollout_start_depth: the depth at which rollout starts; nodes below this depth only undergo expand operations; defaults to `3`.
-- max_iterations: the maximum number of mcts iterations; defaults to `100`.
-- process_reward_rate: the proportion of the process reward when computing value during select; defaults to `0.0`, i.e. PRM is not used.
-- exploration_rate: the exploration parameter in the UCT algorithm; larger values favor nodes with fewer visits; defaults to `0.5`.
-- api_key: required when using client as the inference engine; defaults to `EMPTY`.
-- base_url: required when using client as the inference engine; defaults to 'https://dashscope.aliyuncs.com/compatible-mode/v1'.
-
 ## Specific model arguments
 Besides the arguments above, some models support additional model-specific arguments, whose meanings can usually be found in the model's official repo or its inference code. **ms-swift introduces these arguments to ensure that the trained model aligns with the official inference code.**
 - Specific model arguments can be set via `--model_kwargs` or environment variables, e.g. `--model_kwargs '{"fps_max_frames": 12}'` or `FPS_MAX_FRAMES=12`.

docs/source/Instruction/强化微调.md

Lines changed: 1 addition & 11 deletions
@@ -66,11 +66,7 @@ DeepSeek-R1 used the GRPO algorithm to make a base model develop CoT capability from scratch; the method requires
 
 SWIFT provides the sample command, which is used for model sampling. The currently supported sampling methods are:
 
-- do_sample: samples the model via the sample method; supports sampling open-source models, with model distillation to follow
-  - the sample method will later support URL sampling for large-model distillation
-
-- mcts: Monte Carlo sampling; currently in a PR, to be supported later
-- dvts: under investigation
+- sample: samples the model in generate mode
 
 We currently provide a fairly general [RFT script](https://github.com/modelscope/ms-swift/tree/main/examples/train/rft/rft.py). The script targets self-improvement training, supports dynamically adjusting hyperparameters such as the sampling temperature and the PRM threshold, and allows flexible training modes (fine-tuning, DPO, etc.; each iteration can retrain the original model or continue from the previous iteration's model, or even load all training states of the previous iteration). Developers can add further data filtering to the script (in the generated dataset, rows with the same id come from the same query), such as diversity or language checks.
 
@@ -95,9 +91,3 @@ SWIFT provides the sample command, which is used for model sampling. The current
 | Qwen2.5_math_7b_instruct | 92.8 | 91.6 |
 
 As can be seen, the gsm8k metric changes little after RFT training; the score drop described above does not occur.
-
-## Future plans
-
-1. More sampling methods, such as MCTS
-2. Distillation training for very large models
-3. On-policy training centered on PPO

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 32 additions & 12 deletions
@@ -144,6 +144,35 @@ The following are parameters for quantizing models upon loading. See the [quanti
 - bnb_4bit_use_double_quant: Whether to use double quantization. Default is `True`.
 - bnb_4bit_quant_storage: Data type used to store quantized weights. Default is `None`.
 
+### RAY Arguments
+
+- use_ray: Boolean type. Whether to use ray; defaults to `False`.
+- ray_exp_name: Ray experiment name. This field is used as the prefix for cluster and worker names and may be left empty.
+- device_groups: String (JSON string) type. This field must be configured when using ray. For details, please refer to the [ray documentation](Ray.md).
+
+### YAML Arguments
+
+- config: You can use a config file instead of command-line arguments, for example:
+
+```shell
+swift sft --config demo.yaml
+```
+
+The content of demo.yaml is the concrete command-line configuration:
+
+```yaml
+# Model args
+model: Qwen/Qwen2.5-7B-Instruct
+dataset: swift/self-cognition
+...
+
+# Train args
+output_dir: xxx/xxx
+gradient_checkpointing: true
+...
+```
 
 ## Atomic Arguments
 
 ### Seq2SeqTrainer Arguments
@@ -698,13 +727,13 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
 - prm_model: The type of process reward model. It can be a model ID (triggered using `pt`) or a `prm` key defined in a plugin (for custom inference processes).
 - orm_model: The type of outcome reward model, typically a wildcard or test case, usually defined in a plugin.
-- sampler_type: The type of sampling. Currently supports `sample` (using `do_sample` method). Future support will include `mcts` and `dvts`.
+- sampler_type: The type of sampling. Currently supports `sample` and `distill`.
 - sampler_engine: Supports `pt`, `lmdeploy`, `vllm`, `no`. Defaults to `pt`. Specifies the inference engine for the sampling model.
 - output_dir: The output directory. Defaults to `sample_output`.
 - output_file: The name of the output file. Defaults to `None`, which uses a timestamp as the filename. When provided, only the filename should be passed, without a directory; only the JSONL format is supported.
 - override_exist_file: Whether to overwrite `output_file` if it already exists.
-- num_sampling_per_gpu_batch_size: The batch size for each sampling operation.
-- num_sampling_per_gpu_batches: The total number of batches to sample.
+- num_sampling_batch_size: The batch size for each sampling operation.
+- num_sampling_batches: The total number of batches to sample.
 - n_best_to_keep: The number of best sequences to return.
 - data_range: The partition of the dataset being processed by this sampling run. The format is `2 3`, meaning the dataset is divided into 3 partitions and this instance is processing the 3rd one (which typically implies three `swift sample` processes running in parallel).
 - temperature: Defaults to `1.0`.
@@ -715,15 +744,6 @@ Export Arguments include the [basic arguments](#base-arguments) and [merge argum
 - cache_files: To avoid loading both `prm` and `generator` simultaneously and causing GPU memory OOM, sampling can be done in two steps. In the first step, set `prm` and `orm` to `None`; all results are then written to a file. In the second run, set `sampler_engine` to `no` and pass `--cache_files` with the output file from the first run. The results of the first run are then used for `prm` and `orm` evaluation to produce the final output.
   - Note: When using `cache_files`, `--dataset` still needs to be provided, because the ID in `cache_files` is an MD5 computed from the original data; both pieces of information must be used together.
 
-#### MCTS
-- rollout_depth: The maximum depth during rollouts. Default is `5`.
-- rollout_start_depth: The depth at which rollouts begin; nodes below this depth only undergo expand operations. Default is `3`.
-- max_iterations: The maximum number of MCTS iterations. Default is `100`.
-- process_reward_rate: The proportion of the process reward used when computing value during selection. Default is `0.0`, meaning PRM is not used.
-- exploration_rate: A parameter in the UCT algorithm balancing exploration; higher values give more weight to less-visited nodes. Default is `0.5`.
-- api_key: Required when using the client as an inference engine. Default is `EMPTY`.
-- base_url: Required when using the client as an inference engine. Default is 'https://dashscope.aliyuncs.com/compatible-mode/v1'.
-
 ## Specific Model Arguments
 
 In addition to the parameters listed above, some models support additional model-specific arguments. The meanings of these parameters can usually be found in the corresponding model's official repository or its inference code. **MS-Swift includes these parameters to ensure that the trained model aligns with the behavior of the official inference implementation**.
