@ChinmayBansal
Contributor

Related Issues

Proposed Changes:

This PR adds multimodal (image + text) support to LlamaCppChatGenerator, enabling the component to process
both text and images in chat messages. The implementation follows established patterns from the
AnthropicChatGenerator multimodal support (PR #2186).

Key Features Added:

  • Image format validation for supported formats (JPEG, PNG, GIF, WebP)
  • Proper message conversion to LlamaCpp OpenAI-compatible format with base64 data URIs
  • Support for multimodal models through chat handlers and CLIP models
  • Enhanced component initialization with chat_handler and clip_model_path parameters
  • Role-based image restrictions (images only allowed in user messages)
  • Comprehensive error handling for unsupported formats and edge cases
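As an illustration of the message format described above, here is a minimal sketch of building a LlamaCpp/OpenAI-compatible user message with a base64 data URI (the function and constant names are hypothetical, not the PR's actual code; the `image_url` content structure follows the OpenAI-compatible format the PR targets):

```python
import base64

# Hypothetical names for illustration; the PR's real identifiers may differ.
SUPPORTED_IMAGE_FORMATS = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def to_llamacpp_user_message(text: str, image_bytes: bytes, mime_type: str) -> dict:
    """Build an OpenAI-compatible user message mixing text and one image."""
    if mime_type not in SUPPORTED_IMAGE_FORMATS:
        raise ValueError(
            f"Unsupported image format: {mime_type}. "
            f"Supported formats: {sorted(SUPPORTED_IMAGE_FORMATS)}"
        )
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": text},
            # Images are embedded as base64 data URIs, preserving content order.
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{b64}"}},
        ],
    }

msg = to_llamacpp_user_message("What is in this image?", b"\x89PNG", "image/png")
```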

Implementation Details:

  • Updated _convert_message_to_llamacpp_format() function to handle multimodal content while preserving
    order
  • Added multimodal model initialization in warm_up() method with Llava15ChatHandler support
  • Enhanced component docstring with detailed usage examples for multimodal scenarios
  • Added proper serialization/deserialization support for new parameters
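A rough sketch of how the `warm_up()` wiring might look (`Llama`, `Llava15ChatHandler`, `model_path`, `clip_model_path`, and `chat_handler` are real llama-cpp-python names; the surrounding class and its validation are illustrative, not the actual component):

```python
from typing import Optional

class LlamaCppChatGeneratorSketch:
    """Illustrative sketch only, not the actual component implementation."""

    def __init__(
        self,
        model: str,
        chat_handler_name: Optional[str] = None,
        clip_model_path: Optional[str] = None,
    ):
        # The PR validates that multimodal parameters come as a pair.
        if chat_handler_name and not clip_model_path:
            raise ValueError("clip_model_path is required when chat_handler_name is set")
        self.model = model
        self.chat_handler_name = chat_handler_name
        self.clip_model_path = clip_model_path
        self._llm = None

    def warm_up(self):
        # Imports are deferred so the component can be constructed (and
        # serialized via to_dict()) without llama-cpp-python installed.
        from llama_cpp import Llama
        from llama_cpp.llama_chat_format import Llava15ChatHandler

        handler = None
        if self.chat_handler_name == "Llava15ChatHandler":
            handler = Llava15ChatHandler(clip_model_path=self.clip_model_path)
        self._llm = Llama(model_path=self.model, chat_handler=handler)
```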

How did you test it?

Unit Tests:

  • test_convert_message_to_llamacpp_format_with_image() - Tests proper multimodal message conversion
  • test_convert_message_to_llamacpp_format_with_unsupported_mime_type() - Tests error handling for
    unsupported formats
  • test_convert_message_to_llamacpp_format_with_none_mime_type() - Tests edge case with None mime type
  • test_convert_message_to_llamacpp_format_image_in_non_user_message() - Tests role-based restrictions
  • test_multimodal_message_processing() - Tests end-to-end multimodal processing with mocked model

Code Quality Verification:

  • ✅ All linting checks pass: hatch run fmt
  • ✅ All type checking passes: hatch run test:types
  • ✅ All unit tests pass: hatch run test:unit

Manual Verification:

  • Tested multimodal message creation and conversion
  • Verified proper error messages for validation failures
  • Confirmed component initialization with multimodal parameters

Notes for the reviewer

  • The implementation closely follows the patterns established in AnthropicChatGenerator (lines 137-167 in
    anthropic/chat_generator.py)
  • Image validation uses the same error message format as Anthropic for consistency
  • The OpenAI-compatible format with image_url structure is required by LlamaCpp for multimodal processing
  • Added comprehensive test coverage that matches and exceeds the patterns used in Anthropic tests
  • All edge cases are properly handled including None mime types and role restrictions

Checklist

@ChinmayBansal ChinmayBansal requested a review from a team as a code owner August 19, 2025 01:48
@ChinmayBansal ChinmayBansal requested review from davidsbatista and removed request for a team August 19, 2025 01:48
@github-actions github-actions bot added integration:llama_cpp type:documentation Improvements or additions to documentation labels Aug 19, 2025
@davidsbatista davidsbatista requested a review from anakin87 August 19, 2025 05:44
@ChinmayBansal
Contributor Author

I think the Linux CI failure is due to a package download issue rather than an implementation issue, since all checks were successful when I opened the PR.

@anakin87
Member

Hello @ChinmayBansal, thanks for your work!

In general, I have mixed feelings when I work on this integration.
The main point is that we use llama-cpp-python bindings, a project that is not constantly maintained and up to date with llama.cpp features (for example, tool calling lags significantly behind current llama.cpp capabilities). In the long run, I am not sure that depending on this project is the best way to support Haystack users who want to use llama.cpp.

Diving into this PR:

  • a core requirement is that the dictionary produced by to_dict() is JSON serializable, so we can't directly use the chat_handler. I would propose using chat_handler_name instead, pointing users to handlers in https://github.com/abetlen/llama-cpp-python/blob/main/llama_cpp/llama_chat_format.py. I think this can then be imported using importlib.
  • if users specify a clip_model_path, they must also provide a chat_handler_name
  • let's choose a small model (maybe moondream2) and add one integration test
  • let's try to remove type:ignore wherever possible (sometimes LLMs and coding assistants are not bad at that if you force them for a while :-))
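The importlib-based lookup suggested in the first bullet could be sketched like this (the helper name is hypothetical; only the `llama_cpp.llama_chat_format` module path is real):

```python
import importlib

def resolve_class_by_name(module_path: str, class_name: str):
    """Look up a class by name in a module at runtime, so only the string
    name needs to be stored in the JSON-serializable to_dict() output."""
    module = importlib.import_module(module_path)
    try:
        return getattr(module, class_name)
    except AttributeError as exc:
        raise ValueError(f"{class_name!r} not found in {module_path!r}") from exc

# With llama-cpp-python installed, this would resolve a chat handler class:
# handler_cls = resolve_class_by_name("llama_cpp.llama_chat_format", "Llava15ChatHandler")
```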

I'm writing this based on reading the llama-cpp-python docs, so I might have missed some details. If that's the case, please let me know.

@ChinmayBansal
Contributor Author

Hi @anakin87,

I believe I have addressed your feedback in my latest commit. For multimodal support, the chat handlers are established and stable in the current version.

To address your feedback:

  1. I am now using chat_handler_name instead of the handler object.
  2. I added validation that if users specify chat_handler_name, they must also provide a clip_model_path.
  3. I added an integration test with moondream2, skipped when the model file isn't available.
  4. I reduced the instances of type: ignore from 3 to 1. I used cast() for most cases and kept one for the nested image URL structure; this is needed because llama-cpp-python's ChatCompletionMessage type system does not properly model the nested dictionary structure required for multimodal content.
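For context, the cast() approach mentioned in point 4 generally looks like this (the function and message shape here are illustrative, not the PR's actual code):

```python
from typing import Any, Dict, List, cast

def extract_image_urls(message: Dict[str, Any]) -> List[str]:
    # llama-cpp-python's typed dicts don't model the nested multimodal
    # content structure, so we cast to a concrete type instead of
    # scattering type: ignore comments.
    content = cast(List[Dict[str, Any]], message["content"])
    return [
        part["image_url"]["url"]
        for part in content
        if part.get("type") == "image_url"
    ]
```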

I think your suggestions were valid and I have implemented them. I did encounter one detail you might have missed, which is the reason for the remaining type: ignore.

@anakin87
Member
I think we are going in a good direction. I left some comments.

@anakin87 anakin87 changed the title feat: add multimodal support to LlamaCppChatGenerator feat: add image support to LlamaCppChatGenerator Aug 22, 2025
@anakin87
Member
Hey, @ChinmayBansal, thank you again!

I took the liberty of simplifying some aspects.
I also took the opportunity to use smaller models in the tests, hoping that CI gets faster, and made a few other minor adjustments.

I'll merge this PR as soon as tests pass.

@anakin87 anakin87 merged commit be68edd into deepset-ai:main Aug 22, 2025
11 checks passed
@ChinmayBansal
Contributor Author

Hi @anakin87,

I see you removed some of the complex logic; are we assuming that the user passes in the exact class name?

I wanted to confirm the experience from a user's point of view, since these changes are a little less user-friendly (they require exact class-name knowledge). Both approaches are valid; I just wanted to confirm that this is the right direction.

@anakin87
Member

Yes, I think that the user should pass the exact class name. This reduces complexity and maintenance effort.

Plus, I found this docs page with these names: https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models. I linked it in the docstrings, so I hope that this will be clear for users.
(There is a small typo in that page but I opened a PR to fix it).

Thank you!

@ChinmayBansal ChinmayBansal deleted the feat/llama-cpp-multimodal-support branch August 22, 2025 17:08
Labels
integration:llama_cpp topic:CI type:documentation Improvements or additions to documentation
Development

Successfully merging this pull request may close these issues.

Image support in LlamaCppChatGenerator
3 participants