Commit 2361909

[model] support LLaVA-OneVision-1.5 (#6284)
* update faq
* Fixed the inconsistencies between the Chinese and English FAQ documentation.
* Update link to sequence parallel example
* support llava-onevision-1.5
* update model list
* update model list
* add test
* Update test_vision.py
1 parent f6a4e79 commit 2361909

File tree

8 files changed, +160 −0 lines changed:

docs/source/Instruction/支持的模型和数据集.md
docs/source_en/Instruction/Supported-models-and-datasets.md
swift/llm/model/constant.py
swift/llm/model/model/llava.py
swift/llm/model/model_arch.py
swift/llm/template/constant.py
swift/llm/template/template/llava.py
tests/test_align/test_template/test_vision.py

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 4 additions & 0 deletions
@@ -930,6 +930,10 @@
 |[AI-ModelScope/llava-next-72b](https://modelscope.cn/models/AI-ModelScope/llava-next-72b)|llava_next_qwen|llava_next_qwen|transformers>=4.42, av|✘|vision|[lmms-lab/llava-next-72b](https://huggingface.co/lmms-lab/llava-next-72b)|
 |[AI-ModelScope/llava-next-110b](https://modelscope.cn/models/AI-ModelScope/llava-next-110b)|llava_next_qwen|llava_next_qwen|transformers>=4.42, av|✘|vision|[lmms-lab/llava-next-110b](https://huggingface.co/lmms-lab/llava-next-110b)|
 |[AI-ModelScope/llama3-llava-next-8b](https://modelscope.cn/models/AI-ModelScope/llama3-llava-next-8b)|llama3_llava_next|llama3_llava_next|transformers>=4.42, av|✘|vision|[lmms-lab/llama3-llava-next-8b](https://huggingface.co/lmms-lab/llama3-llava-next-8b)|
+|[lmms-lab/LLaVA-OneVision-1.5-4B-Instruct](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[lmms-lab/LLaVA-OneVision-1.5-4B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct)|
+|[lmms-lab/LLaVA-OneVision-1.5-8B-Instruct](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[lmms-lab/LLaVA-OneVision-1.5-8B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct)|
+|[lmms-lab/LLaVA-OneVision-1.5-4B-Base](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-4B-Base)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[lmms-lab/LLaVA-OneVision-1.5-4B-Base](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Base)|
+|[lmms-lab/LLaVA-OneVision-1.5-8B-Base](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-8B-Base)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|✘|vision, video|[lmms-lab/LLaVA-OneVision-1.5-8B-Base](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Base)|
 |[deepseek-ai/deepseek-vl-1.3b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-1.3b-chat)|deepseek_vl|deepseek_vl|-|✘|vision|[deepseek-ai/deepseek-vl-1.3b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-1.3b-chat)|
 |[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat)|deepseek_vl|deepseek_vl|-|✘|vision|[deepseek-ai/deepseek-vl-7b-chat](https://huggingface.co/deepseek-ai/deepseek-vl-7b-chat)|
 |[deepseek-ai/deepseek-vl2-tiny](https://modelscope.cn/models/deepseek-ai/deepseek-vl2-tiny)|deepseek_vl2|deepseek_vl2|transformers<4.42|✘|vision|[deepseek-ai/deepseek-vl2-tiny](https://huggingface.co/deepseek-ai/deepseek-vl2-tiny)|

docs/source_en/Instruction/Supported-models-and-datasets.md

Lines changed: 4 additions & 0 deletions
@@ -912,6 +912,10 @@ The table below introduces the models integrated with ms-swift:
 |[llava-hf/llava-v1.6-vicuna-13b-hf](https://modelscope.cn/models/llava-hf/llava-v1.6-vicuna-13b-hf)|llava1_6_vicuna_hf|llava1_6_vicuna_hf|transformers>=4.39|&#x2718;|vision|[llava-hf/llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf)|
 |[llava-hf/llava-v1.6-34b-hf](https://modelscope.cn/models/llava-hf/llava-v1.6-34b-hf)|llava1_6_yi_hf|llava1_6_yi_hf|transformers>=4.39|&#x2718;|vision|[llava-hf/llava-v1.6-34b-hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf)|
 |[llava-hf/llama3-llava-next-8b-hf](https://modelscope.cn/models/llava-hf/llama3-llava-next-8b-hf)|llama3_llava_next_hf|llama3_llava_next_hf|transformers>=4.39|&#x2718;|vision|[llava-hf/llama3-llava-next-8b-hf](https://huggingface.co/llava-hf/llama3-llava-next-8b-hf)|
+|[lmms-lab/LLaVA-OneVision-1.5-4B-Instruct](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[lmms-lab/LLaVA-OneVision-1.5-4B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Instruct)|
+|[lmms-lab/LLaVA-OneVision-1.5-8B-Instruct](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[lmms-lab/LLaVA-OneVision-1.5-8B-Instruct](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Instruct)|
+|[lmms-lab/LLaVA-OneVision-1.5-4B-Base](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-4B-Base)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[lmms-lab/LLaVA-OneVision-1.5-4B-Base](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-4B-Base)|
+|[lmms-lab/LLaVA-OneVision-1.5-8B-Base](https://modelscope.cn/models/lmms-lab/LLaVA-OneVision-1.5-8B-Base)|llava_onevision1_5|llava_onevision1_5|transformers>=4.53, qwen_vl_utils>=0.0.6, decord|&#x2718;|vision, video|[lmms-lab/LLaVA-OneVision-1.5-8B-Base](https://huggingface.co/lmms-lab/LLaVA-OneVision-1.5-8B-Base)|
 |[llava-hf/llava-next-72b-hf](https://modelscope.cn/models/llava-hf/llava-next-72b-hf)|llava_next_qwen_hf|llava_next_qwen_hf|transformers>=4.39|&#x2718;|vision|[llava-hf/llava-next-72b-hf](https://huggingface.co/llava-hf/llava-next-72b-hf)|
 |[llava-hf/llava-next-110b-hf](https://modelscope.cn/models/llava-hf/llava-next-110b-hf)|llava_next_qwen_hf|llava_next_qwen_hf|transformers>=4.39|&#x2718;|vision|[llava-hf/llava-next-110b-hf](https://huggingface.co/llava-hf/llava-next-110b-hf)|
 |[llava-hf/LLaVA-NeXT-Video-7B-DPO-hf](https://modelscope.cn/models/llava-hf/LLaVA-NeXT-Video-7B-DPO-hf)|llava_next_video_hf|llava_next_video_hf|transformers>=4.42, av|&#x2718;|video|[llava-hf/LLaVA-NeXT-Video-7B-DPO-hf](https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-DPO-hf)|
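The new rows only record model IDs, model/template types, and requirements. As a quick sanity check, a minimal inference sketch against one of the Instruct checkpoints might look like the following; this assumes ms-swift's standard PtEngine Python API (the same engine used by the test added below), and the image URL and generation settings are purely illustrative. The requires column implies transformers>=4.53, qwen_vl_utils>=0.0.6 and decord are installed.

# Hedged sketch, not part of this commit: exercise one of the newly listed checkpoints.
from swift.llm import InferRequest, PtEngine, RequestConfig

engine = PtEngine('lmms-lab/LLaVA-OneVision-1.5-4B-Instruct')
request = InferRequest(
    messages=[{'role': 'user', 'content': '<image>Describe this image.'}],
    images=['https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg'])
response = engine.infer([request], RequestConfig(max_tokens=128, temperature=0))
print(response[0].choices[0].message.content)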

swift/llm/model/constant.py

Lines changed: 1 addition & 0 deletions
@@ -223,6 +223,7 @@ class MLLMModelType:
     llava1_6_yi = 'llava1_6_yi'
     llava_next_qwen = 'llava_next_qwen'
     llama3_llava_next = 'llama3_llava_next'
+    llava_onevision1_5 = 'llava_onevision1_5'
 
     deepseek_vl = 'deepseek_vl'
     deepseek_vl2 = 'deepseek_vl2'

swift/llm/model/model/llava.py

Lines changed: 30 additions & 0 deletions
@@ -5,6 +5,7 @@
 from typing import Any, Dict
 
 from transformers import AutoConfig
+from transformers.dynamic_module_utils import get_class_from_dynamic_module
 
 from swift.llm import TemplateType
 from ..constant import MLLMModelType
@@ -389,3 +390,32 @@ def _new_forward(*args, **kwargs):
         requires=['transformers>=4.42', 'av'],
         tags=['vision'],
         model_arch=None))
+
+
+def get_model_tokenizer_llava_onevision1_5(model_dir, *args, **kwargs):
+    model_cls = get_class_from_dynamic_module('modeling_llavaonevision1_5.LLaVAOneVision1_5_ForConditionalGeneration',
+                                               model_dir)
+    model_cls._no_split_modules = ['LLaVAOneVision1_5_DecoderLayer', 'RiceBlock']
+    model, processor = get_model_tokenizer_multimodal(model_dir, *args, **kwargs)
+    model.config.vision_start_token_id = 151652
+    return model, processor
+
+
+register_model(
+    ModelMeta(
+        MLLMModelType.llava_onevision1_5,
+        [
+            ModelGroup([
+                Model('lmms-lab/LLaVA-OneVision-1.5-4B-Instruct', 'lmms-lab/LLaVA-OneVision-1.5-4B-Instruct'),
+                Model('lmms-lab/LLaVA-OneVision-1.5-8B-Instruct', 'lmms-lab/LLaVA-OneVision-1.5-8B-Instruct'),
+                Model('lmms-lab/LLaVA-OneVision-1.5-4B-Base', 'lmms-lab/LLaVA-OneVision-1.5-4B-Base'),
+                Model('lmms-lab/LLaVA-OneVision-1.5-8B-Base', 'lmms-lab/LLaVA-OneVision-1.5-8B-Base'),
+            ], ),
+        ],
+        TemplateType.llava_onevision1_5,
+        get_model_tokenizer_llava_onevision1_5,
+        architectures=['LLaVAOneVision1_5_ForConditionalGeneration'],
+        model_arch=ModelArch.llava_onevision1_5,
+        requires=['transformers>=4.53.0', 'qwen_vl_utils'],
+        tags=['vision'],
+    ))
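The loader above resolves the remote-code class with get_class_from_dynamic_module, marks LLaVAOneVision1_5_DecoderLayer and RiceBlock as non-splittable for device_map sharding, and pins vision_start_token_id to 151652. A hedged sketch of loading the registered model type directly, assuming ms-swift's public get_model_tokenizer helper and that the checkpoint is reachable on ModelScope or locally:

# Hedged sketch, not part of this commit: load the model type registered above.
from swift.llm import get_model_tokenizer

model, processor = get_model_tokenizer(
    'lmms-lab/LLaVA-OneVision-1.5-4B-Instruct',
    model_type='llava_onevision1_5')  # normally inferred from the repo; passed explicitly for clarity
print(model.config.vision_start_token_id)  # 151652, set by get_model_tokenizer_llava_onevision1_5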

swift/llm/model/model_arch.py

Lines changed: 9 additions & 0 deletions
@@ -46,6 +46,7 @@ class MLLMModelArch:
     llava_hf = 'llava_hf'
     llava_hf_legacy = 'llava_hf_legacy'  # transformers<4.52
     llava_next_video_hf = 'llava_next_video_hf'
+    llava_onevision1_5 = 'llava_onevision1_5'
 
     llava_llama = 'llava_llama'
     llava_mistral = 'llava_mistral'
@@ -705,6 +706,14 @@ def register_model_arch(model_arch: ModelKeys, *, exist_ok: bool = False) -> None:
         language_model='model',
     ))
 
+register_model_arch(
+    MultiModelKeys(
+        MLLMModelArch.llava_onevision1_5,
+        language_model='model.language_model',
+        aligner='model.visual.merger',
+        vision_tower='model.visual',
+    ))
+
 
 def get_model_arch(arch_name: Optional[str]) -> Optional[MultiModelKeys]:
     return MODEL_ARCH_MAPPING.get(arch_name)
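The MultiModelKeys entry records where the language model ('model.language_model'), the aligner ('model.visual.merger'), and the vision tower ('model.visual') live inside the wrapped module tree; ms-swift's freeze and tuner-targeting logic keys off these prefixes. The sketch below is a hypothetical illustration (not part of this commit) of freezing the vision tower by hand while keeping the merger trainable, using only the registered prefixes and plain PyTorch:

import torch


def freeze_vision_tower(model: torch.nn.Module) -> None:
    # Freeze everything under the registered vision_tower prefix ('model.visual'),
    # but leave the aligner ('model.visual.merger') trainable.
    for name, param in model.named_parameters():
        if name.startswith('model.visual.') and not name.startswith('model.visual.merger.'):
            param.requires_grad_(False)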

swift/llm/template/constant.py

Lines changed: 1 addition & 0 deletions
@@ -153,6 +153,7 @@ class MLLMTemplateType:
     llava1_6_yi = 'llava1_6_yi'
     llava_next_qwen = 'llava_next_qwen'
     llama3_llava_next = 'llama3_llava_next'
+    llava_onevision1_5 = 'llava_onevision1_5'
 
     yi_vl = 'yi_vl'

swift/llm/template/template/llava.py

Lines changed: 99 additions & 0 deletions
@@ -6,6 +6,7 @@
 import transformers
 from packaging import version
 
+from swift.utils import get_env_args
 from ..base import Template
 from ..constant import MLLMTemplateType
 from ..register import TemplateMeta, register_template
@@ -307,3 +308,101 @@ def _data_collator(self, batch: List[Dict[str, Any]], *, padding_to: Optional[int]
     ))
 
 register_template(QwenTemplateMeta(MLLMTemplateType.llava_next_qwen, template_cls=LLavaTemplate))
+
+
+class LLavaOneVision1_5Template(Template):
+    image_token_id = 151655
+    video_token_id = 151656
+    placeholder_tokens = ['<|image_pad|>', '<|video_pad|>']
+    use_model = True
+    support_padding_free = True
+
+    def init_env_args(self):
+        super().init_env_args()
+        self.bbox_format = get_env_args('QWENVL_BBOX_FORMAT', str, 'legacy')
+
+    def replace_tag(self, media_type: Literal['image', 'video', 'audio'], index: int,
+                    inputs: StdTemplateInputs) -> List[Context]:
+        from qwen_vl_utils import fetch_image, fetch_video
+        assert media_type in {'image', 'video'}
+        if media_type == 'image':
+            inputs.images[index] = fetch_image({'image': inputs.images[index]})
+            if self.mode == 'lmdeploy':
+                return ['<|vision_start|>', [-100], '<|vision_end|>']
+            else:
+                return ['<|vision_start|><|image_pad|><|vision_end|>']
+        else:
+            video = inputs.videos[index]
+            video, video_kwargs = fetch_video({'video': video}, return_video_sample_fps=True)
+            inputs.mm_processor_kwargs.setdefault('fps', []).append(video_kwargs)
+            tokens = ['<|vision_start|><|video_pad|><|vision_end|>']
+            if isinstance(video, torch.Tensor):
+                video = video.to(torch.uint8)
+            inputs.videos[index] = video
+            return tokens
+
+    def replace_ref(self, ref: str, index: int, inputs: StdTemplateInputs) -> List[Context]:
+        if self.bbox_format == 'legacy':
+            return [f'<|object_ref_start|>{ref}<|object_ref_end|>']
+        else:
+            return [ref]
+
+    def replace_bbox(self, bbox: List[int], index: int, inputs: StdTemplateInputs) -> List[Context]:
+        if self.bbox_format == 'legacy':
+            return [f'<|box_start|>{self._get_bbox_str(bbox)}<|box_end|>']
+        else:
+            return [str(bbox)]
+
+    def _encode(self, inputs: StdTemplateInputs) -> Dict[str, Any]:
+        encoded = super()._encode(inputs)
+        processor = self.processor
+        input_ids = encoded['input_ids']
+        labels = encoded['labels']
+        loss_scale = encoded.get('loss_scale', None)
+        for media_type in ['images', 'videos']:
+            mm_data = getattr(inputs, media_type)
+            if mm_data:
+                if media_type == 'images':
+                    media_token = self.image_token_id
+                    media_inputs = processor.image_processor(images=mm_data, return_tensors='pt', do_resize=False)
+                    media_grid_thw = media_inputs['image_grid_thw']
+                else:
+                    kwargs = {}
+                    if hasattr(processor, 'video_processor'):
+                        processor_func = processor.video_processor
+                    else:
+                        processor_func = processor.image_processor
+                        kwargs['images'] = None
+                    media_inputs = processor_func(videos=mm_data, return_tensors='pt', do_resize=False, **kwargs)
+                    media_grid_thw = media_inputs['video_grid_thw']
+                    media_token = self.video_token_id
+                idx_list = findall(input_ids, media_token)
+                merge_length = processor.image_processor.merge_size**2
+
+                def _get_new_tokens(i):
+                    token_len = (media_grid_thw[i].prod() // merge_length)
+                    return [media_token] * token_len
+
+                input_ids, labels, loss_scale = self._extend_tokens(input_ids, labels, loss_scale, idx_list,
+                                                                    _get_new_tokens)
+                encoded.update(media_inputs)
+
+        encoded['input_ids'] = input_ids
+        encoded['labels'] = labels
+        encoded['loss_scale'] = loss_scale
+        return encoded
+
+    def _post_encode(self, model, inputs: Dict[str, Any]) -> Dict[str, Any]:
+        if not self.is_training:
+            return inputs
+        input_ids = inputs['input_ids']
+        base_model = self.get_base_model(model)
+        if hasattr(base_model.model, 'embed_tokens'):
+            inputs_embeds = base_model.model.embed_tokens(input_ids)
+        else:
+            inputs_embeds = base_model.model.language_model.embed_tokens(input_ids)
+        inputs_embeds = self._get_inputs_embeds_hf(inputs_embeds, inputs, model.visual, self.processor, model.config)
+        return {'inputs_embeds': inputs_embeds}
+
+
+register_template(QwenTemplateMeta(MLLMTemplateType.llava_onevision1_5, template_cls=LLavaOneVision1_5Template))
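In _encode, each <|image_pad|> or <|video_pad|> placeholder is expanded to grid_thw.prod() // merge_size**2 copies of the media token, the same bookkeeping used by Qwen2-VL-style processors. A small worked example with assumed grid values:

# Illustrative numbers only: how many <|image_pad|> tokens _get_new_tokens would insert for one image.
import torch

image_grid_thw = torch.tensor([[1, 34, 46]])  # assumed (temporal, height, width) patch grid
merge_size = 2                                # assumed spatial merge factor of the image processor
merge_length = merge_size**2
token_len = int(image_grid_thw[0].prod()) // merge_length
print(token_len)  # 1 * 34 * 46 // 4 == 391 placeholder tokens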

tests/test_align/test_template/test_vision.py

Lines changed: 12 additions & 0 deletions
@@ -980,6 +980,17 @@ def test_deepseek_ocr():
                 '创空间 中体验SWIFT web-ui功能了。')
 
 
+def test_llava_onevision1_5():
+    pt_engine = PtEngine('lmms-lab/LLaVA-OneVision-1.5-4B-Instruct')
+    query = 'Describe this image.'
+    messages = [{'role': 'user', 'content': query}]
+    images = ['https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg']
+    response = _infer_model(pt_engine, messages=messages, images=images)
+    pt_engine.default_template.template_backend = 'jinja'
+    response2 = _infer_model(pt_engine, messages=messages, images=images)
+    assert response == response2
+
+
 def test_paddle_ocr():
     os.environ['CUDA_VISIBLE_DEVICES'] = '0'
     pt_engine = PtEngine('PaddlePaddle/PaddleOCR-VL')
@@ -1069,4 +1080,5 @@ def test_paddle_ocr():
     # test_internvl_gpt_hf()
     # test_sailvl2()
     # test_deepseek_ocr()
+    # test_llava_onevision1_5()
     test_paddle_ocr()
