Merged
35 commits
abba497
update
Jintao-Huang Aug 22, 2025
0883b84
update
Jintao-Huang Aug 24, 2025
bdbaa9a
update
Jintao-Huang Aug 24, 2025
efd6f72
Merge branch 'main' into support_megatron_multimodal
Jintao-Huang Aug 26, 2025
c0b28b4
Merge branch 'main' into support_megatron_multimodal
Jintao-Huang Aug 29, 2025
0e60545
update
Jintao-Huang Aug 29, 2025
b349b56
update
Jintao-Huang Aug 31, 2025
6a77b0e
update
Jintao-Huang Aug 31, 2025
26d5f64
Merge branch 'main' into support_megatron_multimodal
Jintao-Huang Aug 31, 2025
a502bbb
update
Jintao-Huang Aug 31, 2025
2e92219
Merge branch 'main' into support_megatron_multimodal
Jintao-Huang Sep 1, 2025
6fad478
fix
Jintao-Huang Sep 1, 2025
2878d15
update
Jintao-Huang Sep 1, 2025
0cbada0
update
Jintao-Huang Sep 1, 2025
3151603
fix
Jintao-Huang Sep 1, 2025
93c4693
update
Jintao-Huang Sep 1, 2025
308d565
update
Jintao-Huang Sep 1, 2025
08320eb
update
Jintao-Huang Sep 1, 2025
44a95bd
fix
Jintao-Huang Sep 1, 2025
50b2eb1
lint pass
Jintao-Huang Sep 1, 2025
51f315a
fix cp
Jintao-Huang Sep 1, 2025
ef66901
update
Jintao-Huang Sep 1, 2025
631691c
lint pass
Jintao-Huang Sep 1, 2025
587298e
update
Jintao-Huang Sep 2, 2025
a1cb64b
update
Jintao-Huang Sep 2, 2025
798bdd4
fix
Jintao-Huang Sep 2, 2025
5953506
fix
Jintao-Huang Sep 2, 2025
0508aaf
lint pass
Jintao-Huang Sep 2, 2025
6a431aa
update
Jintao-Huang Sep 2, 2025
3f35a86
update
Jintao-Huang Sep 2, 2025
0156a0c
update
Jintao-Huang Sep 2, 2025
4c810cb
fix
Jintao-Huang Sep 2, 2025
d90d282
fix
Jintao-Huang Sep 2, 2025
f89a2bb
update
Jintao-Huang Sep 2, 2025
dc803a0
update
Jintao-Huang Sep 3, 2025
8 changes: 4 additions & 4 deletions docs/source/Instruction/命令行参数.md
@@ -159,7 +159,7 @@
- 🔥aligner_lr: When training a multimodal large model, this parameter specifies the learning rate of the aligner. Defaults to None, i.e. equal to learning_rate.
- lr_scheduler_type: Type of lr_scheduler. Defaults to 'cosine'.
- lr_scheduler_kwargs: Other arguments for the lr_scheduler. Defaults to None.
- 🔥gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example, set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None.
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example, set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None.
- Note: When using DDP without deepspeed/fsdp and gradient_checkpointing_kwargs is None, it is set to `'{"use_reentrant": false}'` by default.
- full_determinism: Ensures reproducible results during training. Note: this negatively impacts performance. Defaults to False.
- 🔥report_to: Defaults to `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
@@ -211,10 +211,10 @@
- hub_private_repo: Defaults to False.

### Tuner Arguments
- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training, but with different meanings. In full-parameter training, setting freeze_llm to True freezes the weights of the llm part; in LoRA training with `target_modules` set to 'all-linear', setting freeze_llm to True prevents LoRA modules from being added to the llm part. Defaults to False.
- 🔥freeze_vit: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training; see `freeze_llm` for its meaning. Defaults to True.
- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training, but with different effects. In full-parameter training, setting freeze_llm to True freezes the weights of the LLM part; in LoRA training with `target_modules` set to 'all-linear', setting freeze_llm to True prevents LoRA modules from being added to the LLM part. Defaults to False.
- 🔥freeze_vit: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training, but with different effects. In full-parameter training, setting freeze_vit to True freezes the weights of the vit part; in LoRA training with `target_modules` set to 'all-linear', setting freeze_vit to True prevents LoRA modules from being added to the vit part. Defaults to True.
- Note: "vit" here is not limited to the vision_tower; it also includes the audio_tower.
- 🔥freeze_aligner: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training; see `freeze_llm` for its meaning. Defaults to True.
- 🔥freeze_aligner: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training, but with different effects. In full-parameter training, setting freeze_aligner to True freezes the weights of the aligner (also known as the projector); in LoRA training with `target_modules` set to 'all-linear', setting freeze_aligner to True prevents LoRA modules from being added to the aligner. Defaults to True.
- 🔥target_modules: Specifies the LoRA modules. Defaults to `['all-linear']`. You can also pass module suffixes, e.g. `--target_modules q_proj k_proj v_proj`. This parameter is not limited to LoRA and can be used with other tuners.
- Note: 'all-linear' behaves differently for LLMs and multimodal LLMs. For an LLM, it automatically finds all linear layers except lm_head and attaches tuners; for a multimodal LLM, it attaches tuners only to the LLM part by default, and this behavior can be controlled with `freeze_llm`, `freeze_vit`, and `freeze_aligner`.
- 🔥target_regex: Specifies a regex for the LoRA modules. Defaults to `None`. If this value is provided, the target_modules parameter is ignored. This parameter is not limited to LoRA and can be used with other tuners.
6 changes: 6 additions & 0 deletions docs/source/Megatron-SWIFT/命令行参数.md
@@ -192,6 +192,10 @@

**Tuner Parameters**:
- train_type: Options are 'lora' and 'full'. Defaults to 'full'.
- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training, but with different effects. In full-parameter training, setting freeze_llm to True freezes the weights of the LLM part; in LoRA training with `target_modules` set to 'all-linear', setting freeze_llm to True prevents LoRA modules from being added to the LLM part. Defaults to False.
- 🔥freeze_vit: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training, but with different effects. In full-parameter training, setting freeze_vit to True freezes the weights of the vit part; in LoRA training with `target_modules` set to 'all-linear', setting freeze_vit to True prevents LoRA modules from being added to the vit part. Defaults to True.
- Note: "vit" here is not limited to the vision_tower; it also includes the audio_tower.
- 🔥freeze_aligner: This parameter only takes effect for multimodal models and can be used for both full-parameter and LoRA training, but with different effects. In full-parameter training, setting freeze_aligner to True freezes the weights of the aligner (also known as the projector); in LoRA training with `target_modules` set to 'all-linear', setting freeze_aligner to True prevents LoRA modules from being added to the aligner. Defaults to True.

Full-parameter training:
- freeze_parameters: Prefixes of parameters to be frozen. Defaults to `[]`.
@@ -234,6 +238,8 @@ Megatron training parameters inherit from Megatron parameters and basic parameters (shared with ms-swift da
- To customize the attention_mask, you can set `--padding_free false`.
- Note: Megatron-SWIFT training features prioritize support for the padding_free format; do not change this value unless there is a special reason.
- mlp_padding_free: Defaults to False. Used to apply the padding-free optimization to the MLP when padding_free is set to false. This improves training speed and reduces memory usage while still allowing a custom attention_mask.
- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the vit part during multimodal model training. Defaults to True.
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example, set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None.
- 🔥packing: Whether to use sequence packing. Defaults to False. Currently supports CPT/SFT/DPO.
- packing_length: The length used for packing. Defaults to None, in which case it is set to max_length.
- streaming: Stream reading and processing of the dataset. Defaults to False.
12 changes: 6 additions & 6 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -162,7 +162,7 @@ This parameter list inherits from transformers `Seq2SeqTrainingArguments`, with
- 🔥aligner_lr: When training a multimodal large model, this parameter specifies the learning rate for the aligner. By default, it is set to None, which means it equals `learning_rate`.
- lr_scheduler_type: Type of lr_scheduler, defaults to 'cosine'.
- lr_scheduler_kwargs: Other parameters for the lr_scheduler, defaults to None.
- 🔥gradient_checkpointing_kwargs: Parameters for `torch.utils.checkpoint`. For example, set as `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None.
- gradient_checkpointing_kwargs: Parameters for `torch.utils.checkpoint`. For example, set as `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Defaults to None.
- Note: When using DDP without DeepSpeed/FSDP, and `gradient_checkpointing_kwargs` is `None`, it will default to `'{"use_reentrant": false}'`.
- full_determinism: Ensures reproducible results during training. Note: This will negatively impact performance. Defaults to False.
- 🔥report_to: Default value is `tensorboard`. You can also specify `--report_to tensorboard wandb swanlab` or `--report_to all`.
@@ -215,11 +215,11 @@ Other important parameters:

### Tuner Arguments

- 🔥freeze_llm: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, but with different meanings. In full parameter training, setting freeze_llm to True will freeze some of the LLM weights. In LoRA training, if `target_modules` is set to 'all-linear', setting freeze_llm to True will prevent adding LoRA modules to the LLM part. The default is False.
- 🔥freeze_vit: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, with similar meanings as `freeze_llm`. The default is True.
- Note: Here, "vit" refers not only to the vision_tower but also includes the audio_tower.
- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used for full parameter training and LoRA, with similar meanings as `freeze_llm`. The default is True.
- 🔥 target_modules: Specifies the LoRA modules. The default is `['all-linear']`, but you can also pass layer-name suffixes, e.g. `--target_modules q_proj k_proj v_proj`. This argument is not restricted to LoRA and can be used with other tuners as well.
- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used in both full-parameter and LoRA training, but with different behaviors. In full-parameter training, setting `freeze_llm` to `True` will freeze the weights of the LLM component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_llm` to `True` will prevent LoRA modules from being added to the LLM component. The default value is `False`.
- 🔥freeze_vit: This parameter only applies to multimodal models and can be used in both full-parameter and LoRA training, though with different effects. In full-parameter training, setting `freeze_vit` to `True` will freeze the weights of the ViT component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_vit` to `True` will prevent LoRA modules from being added to the ViT component. The default value is `True`.
- Note: The term "ViT" here refers not only to the vision tower but also includes the audio tower.
- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used in both full-parameter and LoRA training, with differing outcomes. In full-parameter training, setting `freeze_aligner` to `True` will freeze the weights of the aligner (also known as the projector) component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_aligner` to `True` will prevent LoRA modules from being added to the aligner component. The default value is `True`.
- 🔥target_modules: Specifies the LoRA modules. The default is `['all-linear']`, but you can also pass layer-name suffixes, e.g. `--target_modules q_proj k_proj v_proj`. This argument is not restricted to LoRA and can be used with other tuners as well.
- Note: The behavior of the special value `'all-linear'` differs between plain LLMs and multimodal LLMs. For a standard LLM, it automatically locates every linear layer except `lm_head` and attaches a tuner. For a multimodal LLM, it attaches the tuner only to the LLM component by default. This default can be changed with the `freeze_llm`, `freeze_vit`, and `freeze_aligner` options.
- 🔥target_regex: Specifies a regex expression for LoRA modules, with a default of `None`. If this value is provided, the target_modules parameter becomes ineffective. This parameter is not limited to LoRA and can be used for other tuners.
- target_parameters: List of parameter names to be replaced with LoRA. This argument behaves similarly to target_modules, but you should pass parameter names instead. This feature requires "peft>=0.17.0". For example, in many Mixture-of-Experts (MoE) layers in Hugging Face Transformers, `nn.Linear` is not used; instead, `nn.Parameter` is used. In such cases, the `target_parameters` argument can be used to apply LoRA.
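
To make the interaction between the freeze_* switches above and `--target_modules all-linear` concrete, here is a minimal hedged sketch of a multimodal LoRA run. The `swift sft` entrypoint, model id, and dataset are borrowed from other examples in this repository and are illustrative assumptions, not part of this documentation change.

```shell
# Illustrative sketch, not a tuned recipe: LoRA on a multimodal model.
# With --target_modules all-linear, LoRA normally attaches only to the LLM;
# --freeze_vit false extends it to the ViT, while the frozen aligner gets no adapters.
swift sft \
    --model Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite' \
    --train_type lora \
    --target_modules all-linear \
    --freeze_llm false \
    --freeze_vit false \
    --freeze_aligner true \
    --max_length 2048
```

Under `--train_type full`, the same three flags change meaning: they freeze or unfreeze the corresponding weights instead of deciding where LoRA modules are attached.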
6 changes: 6 additions & 0 deletions docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -206,6 +206,10 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
**Tuner Parameters**:

- train_type: Options are `'lora'` and `'full'`. Default is `'full'`.
- 🔥freeze_llm: This parameter only takes effect for multimodal models and can be used in both full-parameter and LoRA training, but with different behaviors. In full-parameter training, setting `freeze_llm` to `True` will freeze the weights of the LLM component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_llm` to `True` will prevent LoRA modules from being added to the LLM component. The default value is `False`.
- 🔥freeze_vit: This parameter only applies to multimodal models and can be used in both full-parameter and LoRA training, though with different effects. In full-parameter training, setting `freeze_vit` to `True` will freeze the weights of the ViT component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_vit` to `True` will prevent LoRA modules from being added to the ViT component. The default value is `True`.
- Note: The term "ViT" here refers not only to the vision tower but also includes the audio tower.
- 🔥freeze_aligner: This parameter is only effective for multimodal models and can be used in both full-parameter and LoRA training, with differing outcomes. In full-parameter training, setting `freeze_aligner` to `True` will freeze the weights of the aligner (also known as the projector) component. In LoRA training with `target_modules` set to 'all-linear', setting `freeze_aligner` to `True` will prevent LoRA modules from being added to the aligner component. The default value is `True`.
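
By analogy with the `dense.sh` example added in this PR, a hedged LoRA variant of that script might look as follows; the exact flag combination is a sketch for illustration rather than a verified recipe.

```shell
# Illustrative sketch: Megatron-SWIFT LoRA on a multimodal mcore checkpoint.
# With freeze_vit/freeze_aligner left at their defaults (true), LoRA is attached only to the LLM part.
NPROC_PER_NODE=2 \
CUDA_VISIBLE_DEVICES=0,1 \
megatron sft \
    --load Qwen2.5-VL-7B-Instruct-mcore \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite' \
    --train_type lora \
    --target_modules all-linear \
    --freeze_llm false \
    --freeze_vit true \
    --freeze_aligner true \
    --tensor_model_parallel_size 2 \
    --micro_batch_size 1 \
    --global_batch_size 4 \
    --lr 1e-4 \
    --max_epochs 1 \
    --max_length 2048 \
    --save megatron_output/Qwen2.5-VL-7B-Instruct
```

Setting `--freeze_vit false` here would additionally attach LoRA modules to the ViT, mirroring the behavior described for the ms-swift trainer.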

Full-parameter Training:

@@ -249,6 +253,8 @@ Megatron training parameters are inherited from Megatron parameters and basic pa
- If you wish to customize the attention_mask, you can set `--padding_free false`.
- Note: The Megatron-SWIFT training feature prioritizes support for the padding-free format. Unless under special circumstances, please do not modify this value.
- mlp_padding_free: The default is False. This is used for applying padding-free optimization to the MLP when padding_free is set to false. It allows for improved training speed and reduced memory usage while customizing the attention_mask.
- vit_gradient_checkpointing: Whether to enable gradient checkpointing for the ViT part during multimodal model training. Default: True.
- gradient_checkpointing_kwargs: Arguments passed to `torch.utils.checkpoint`. For example: set `--gradient_checkpointing_kwargs '{"use_reentrant": false}'`. Default: None; a combined sketch with `vit_gradient_checkpointing` is shown below.
- 🔥packing: Whether to use sequence packing, defaults to False. Currently supports CPT/SFT/DPO.
- packing_length: the length to use for packing. Defaults to None, in which case it is set to max_length.
- streaming: Stream data loading and processing, default is False.
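
As referenced above, the following hedged sketch combines `--vit_gradient_checkpointing` with `--gradient_checkpointing_kwargs`; the surrounding flags are reused from the `dense.sh` example in this PR and are illustrative only.

```shell
# Illustrative sketch: full recompute for the language model plus ViT gradient checkpointing,
# requesting non-reentrant torch.utils.checkpoint behavior via JSON kwargs.
megatron sft \
    --load Qwen2.5-VL-7B-Instruct-mcore \
    --dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite' \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --vit_gradient_checkpointing true \
    --gradient_checkpointing_kwargs '{"use_reentrant": false}' \
    --micro_batch_size 1 \
    --global_batch_size 4 \
    --max_length 2048
```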
31 changes: 31 additions & 0 deletions examples/megatron/multimodal/dense.sh
@@ -0,0 +1,31 @@
# 4 * 56GiB; 2.3s/it
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=4 \
MAX_PIXELS=1003520 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
megatron sft \
--load Qwen2.5-VL-7B-Instruct-mcore \
--dataset 'AI-ModelScope/LaTeX_OCR:human_handwrite' \
--tensor_model_parallel_size 2 \
--sequence_parallel true \
--packing true \
--split_dataset_ratio 0.01 \
--micro_batch_size 1 \
--global_batch_size 4 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_fraction 0.05 \
--min_lr 1e-6 \
--max_epochs 1 \
--save megatron_output/Qwen2.5-VL-7B-Instruct \
--save_interval 200 \
--vit_gradient_checkpointing true \
--max_length 2048 \
--num_workers 4 \
--no_save_optim true \
--no_save_rng true \
--dataset_num_proc 8
16 changes: 15 additions & 1 deletion swift/megatron/argument/megatron_args.py
@@ -31,6 +31,9 @@ class RLHFMegatronArgumentsMixin:
@dataclass
class MegatronTunerMixin:
train_type: Literal['lora', 'full'] = 'full'
freeze_llm: bool = False
freeze_vit: bool = True
freeze_aligner: bool = True
# full
freeze_parameters: List[str] = field(default_factory=list)
freeze_parameters_regex: Optional[str] = None
@@ -71,6 +74,8 @@ def load_tuner_config(adapter_load: Optional[str]) -> Dict[str, Any]:
def __post_init__(self):
if self.freeze_parameters_ratio > 0 and self.pipeline_model_parallel_size > 1:
raise ValueError('`freeze_parameters_ratio` is not supported when `pipeline_model_parallel_size` > 1')
if self.target_regex:
self.target_modules = self.target_regex


@dataclass
@@ -94,6 +99,10 @@ class ExtraMegatronArguments(RLHFMegatronArgumentsMixin, MegatronTunerMixin):
partial_rotary_factor: Optional[float] = None
use_shared_expert_gate: Optional[bool] = None

# visual
vit_gradient_checkpointing: bool = True
gradient_checkpointing_kwargs: Optional[Union[dict, str]] = None


@dataclass
class MegatronArguments(ExtraMegatronArguments):
@@ -185,7 +194,8 @@ class MegatronArguments(ExtraMegatronArguments):
group_query_attention: Optional[bool] = None
num_query_groups: Optional[int] = None
max_position_embeddings: Optional[int] = None
position_embedding_type: Literal['learned_absolute', 'rope', 'mrope', 'relative', 'none'] = 'rope'
position_embedding_type: Optional[Literal['learned_absolute', 'rope', 'mrope', 'relative', 'none']] = None
mrope_section: Optional[List[int]] = None
rotary_base: Optional[int] = None
rotary_percent: float = 1.
rotary_interleaved: Optional[bool] = None
@@ -376,10 +386,14 @@ def __post_init__(self):
self.rope_scaling = json_parse_to_dict(self.rope_scaling)
if 'type' in self.rope_scaling and 'rope_type' not in self.rope_scaling:
self.rope_scaling['rope_type'] = self.rope_scaling['type']
if self.gradient_checkpointing_kwargs is not None:
self.gradient_checkpointing_kwargs = json_parse_to_dict(self.gradient_checkpointing_kwargs)
if self.eval_interval is None:
self.eval_interval = self.save_interval
if self.seq_length is None:
self.seq_length = self.max_position_embeddings
if self.position_embedding_type is None:
self.position_embedding_type = 'rope'
if self.tensorboard_dir is None and self.save is not None:
self.tensorboard_dir = f'{self.save}/runs'
self._init_moe()
5 changes: 4 additions & 1 deletion swift/megatron/argument/train_args.py
@@ -17,7 +17,6 @@ class MegatronTrainArguments(MegatronArguments, BaseArguments):
add_version: bool = True

def init_model_args(self, tokenizer, config):
self.megatron_model_meta = get_megatron_model_meta(self.model_type)
kwargs = self.megatron_model_meta.convert_hf_config(config)
if self.new_special_tokens and kwargs['padded_vocab_size'] < len(tokenizer):
kwargs['padded_vocab_size'] = math.ceil(len(tokenizer) / 128) * 128
@@ -28,6 +27,9 @@ def init_model_args(self, tokenizer, config):
setattr(self, k, v)
MegatronArguments.__post_init__(self)
self.extra_args = self.parse_to_megatron()
self.extra_args['model_info'] = self.model_info
self.extra_args['model_meta'] = self.model_meta
self.extra_args['megatron_model_meta'] = self.megatron_model_meta

def _init_save(self):
init_process_group(backend=self.ddp_backend, timeout=self.ddp_timeout)
@@ -46,6 +48,7 @@ def __post_init__(self):
self.padding_free = True
self.load = to_abspath(self.load, check_path_exist=True)
BaseArguments.__post_init__(self)
self.megatron_model_meta = get_megatron_model_meta(self.model_type)
if len(self.dataset) == 0 and len(self.cached_dataset) == 0:
raise ValueError(f'self.dataset: {self.dataset}, self.cached_dataset: {self.cached_dataset}. '
'Please input the training dataset.')
2 changes: 1 addition & 1 deletion swift/megatron/model/__init__.py
@@ -1,4 +1,4 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from . import gpt
from . import gpt, mm_gpt
from .constant import MegatronModelType
from .register import MegatronModelMeta, get_megatron_model_meta, register_megatron_model
1 change: 1 addition & 0 deletions swift/megatron/model/constant.py
@@ -1,3 +1,4 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
class MegatronModelType:
gpt = 'gpt'
qwen2_5_vl = 'qwen2_5_vl'