[Contribution - Hackaton_2025] Add VLM model and UI support to Virtual AI Assistant Demo #196 #225
Summary
This PR enhances the existing Virtual AI Assistant demo so it can understand visual inputs of up to 1,003,520 pixels at any resolution.
Detailed Description
This PR integrates a Vision-Language Model (VLM) into the Virtual AI Assistant demo, allowing it to describe or answer questions about any image. It also updates the User Interface to support visual inputs, so users can upload their own images.
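As a rough illustration of the UI change, the sketch below adds an image component alongside the usual text box. It assumes the interface is built with Gradio; the component names, layout, and the placeholder `respond` function are assumptions for illustration, not the demo's actual code.

```python
import gradio as gr

# Hypothetical sketch of the updated UI: an image input next to the text box.
def respond(message, image_path, history):
    note = "(image attached)" if image_path else "(text only)"
    reply = f"Placeholder reply {note}"  # the real demo would call the VLM or the personality LLM here
    return history + [(f"{message} {note}", reply)], "", None

with gr.Blocks() as demo:
    chatbot = gr.Chatbot(label="Virtual AI Assistant")
    with gr.Row():
        image_input = gr.Image(type="filepath",
                               sources=["upload", "webcam", "clipboard"],
                               label="Optional image")
        text_input = gr.Textbox(label="Your message")
    send = gr.Button("Send")
    send.click(respond, [text_input, image_input, chatbot],
               [chatbot, text_input, image_input])

demo.launch()
```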
The Vision-Language model chosen is Qwen2.5-VL, because its efficient vision encoder is capable of analyzing text, charts, icons, graphics, and layouts within images of up to 1,003,520 pixels at any resolution. The model can be integrated with any of the 5 virtual personalities provided, describing every detail of the image and helping us understand more about our queries. These changes have been tested with the agrobot personality.
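For reference, below is a minimal sketch of querying Qwen2.5-VL with an image using the Hugging Face transformers API. The checkpoint size, image path, and prompt are assumptions, and the PR's actual loading or optimization path (for example, an OpenVINO-converted model) may differ.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"  # assumed checkpoint; the PR may use a different size
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/uploaded_image.jpg"},  # hypothetical path
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]

# Build the chat prompt and pack the image the way the Qwen processor expects.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```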
Example of the updated user interface
The updates were:
Deploying
To run this demo, follow all the steps in the README.
Testing
The inputs considered for integrating the model into the demo were an HD-resolution image and a text input containing the user query.
The chatbot has been tested on a local instance, ensuring that when an image is sent, the model detects it and temporarily saves it. The VLM runs separately from the principal LLM models with personalities, but the history is then shared with the principal LLM model, ensuring context is not lost after the VLM interaction (see the sketch below).
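A minimal sketch of how the VLM turn could be folded back into the shared history so the personality LLM keeps context. Every name here (`run_vlm`, `run_llm`, `chat_history`) is hypothetical and stands in for the demo's actual functions.

```python
# Hypothetical sketch of the history hand-off between the VLM and the personality LLM.
# The real demo may structure its chat history differently.
chat_history = []  # shared list of {"role": ..., "content": ...} messages

def run_vlm(text, image_path):
    return f"(VLM answer about {image_path})"   # stand-in for the Qwen2.5-VL call

def run_llm(messages):
    return "(LLM answer with personality)"      # stand-in for the personality LLM call

def handle_turn(user_text, image_path=None):
    if image_path is not None:
        # Image present: answer with the VLM, but record the exchange in the shared
        # history so the personality LLM keeps the context in later turns.
        vlm_answer = run_vlm(user_text, image_path)
        chat_history.append({"role": "user", "content": f"{user_text} (image attached)"})
        chat_history.append({"role": "assistant", "content": vlm_answer})
        return vlm_answer
    # No image: the personality LLM answers, conditioned on the full shared history.
    llm_answer = run_llm(chat_history + [{"role": "user", "content": user_text}])
    chat_history.append({"role": "user", "content": user_text})
    chat_history.append({"role": "assistant", "content": llm_answer})
    return llm_answer
```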
Challenges
Issues related to model optimization and reduced-bit (quantized) inputs were not tackled in this PR due to time constraints.
Continuous development and integration for this issue were slow because of all the chat models' dependencies and initializations; there is a great area of opportunity for future development of this demo.
To run the original model, we used a server with an Intel Atom GRR in a containerized environment, due to dependency mismatches and system capacity. Testing on a local PC is still pending.
Modifications needed in the future
A webcam UI interface was added; however, we found some challenges when accessing the webcam from the client desktop. A future enhancement would be to modify the demo to fix this issue.
The clipboard input option next to the webcam input is also out of service.