Conversation

@fractalito

Summary

Enhancement of the existing Virtual AI Assistant demo to understand visual inputs of up to 1,003,520 pixels at any resolution.

Detailed Description

This PR integrates a Vision-Language Model (VLM) into the Virtual AI Assistant demo, allowing it to describe or answer questions about any image. It also updates the user interface to support visual inputs, so users can upload their own images.

The Vision-Language model chosen is Qwen2.5-VL because its efficient vision encoder can analyze text, charts, icons, graphics, and layouts within images of up to 1,003,520 pixels at any resolution. The model can be combined with any of the five virtual personalities provided, describing each detail of an image and adding context to the user's queries.
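For reference, here is a minimal sketch of what loading the model could look like; the checkpoint name and the `load_vlm_model` signature are illustrative, and it assumes optimum-intel's `OVModelForVisualCausalLM` supports the Qwen2.5-VL architecture (which is why this PR installs optimum-intel from GitHub):

```python
# Illustrative sketch only; checkpoint choice and function signature are assumptions.
from optimum.intel import OVModelForVisualCausalLM
from transformers import AutoProcessor

VLM_MODEL_ID = "Qwen/Qwen2.5-VL-3B-Instruct"  # hypothetical checkpoint choice

def load_vlm_model(model_id: str = VLM_MODEL_ID, device: str = "CPU"):
    """Export the checkpoint to OpenVINO IR on first run, then load it."""
    processor = AutoProcessor.from_pretrained(model_id)
    model = OVModelForVisualCausalLM.from_pretrained(model_id, export=True, device=device)
    return model, processor
```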

These changes have been tested with the agrobot personality.
(Screenshot: example of the updated user interface)
The updates were:

  • A new option to load an image in the left section, implemented with Gradio (see the sketch after this list).
  • When an image is loaded, it is displayed together with the question the user sends, so it is clear what the question refers to.
  • The output shows a detailed description of the image:
    (Screenshot: Screenshot 2025-03-07 110643)
  • After an image has been used, it is converted to grayscale:
    (Screenshots: dog_before, dog_after)
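A minimal sketch of these two UI pieces, assuming a Gradio Blocks layout; `image_to_grayscale` matches the name in the commit log, but the bodies here are illustrative rather than the PR's exact code:

```python
import gradio as gr
from PIL import Image, ImageOps

def image_to_grayscale(image: Image.Image) -> Image.Image:
    # Mark an already-used image by converting it to grayscale.
    return ImageOps.grayscale(image)

with gr.Blocks() as demo:
    with gr.Row():
        image_input = gr.Image(label="Upload an image", type="pil")  # left section
        chatbot = gr.Chatbot(label="Virtual AI Assistant")
    question = gr.Textbox(label="Ask a question about the image")
```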

Deploying

To run this demo, follow all the steps in the README.

Testing

The inputs considered for the integration of the model into the demo were an HD-resolution image and a text input containing the user's query.

The chatbot has been tested in a local instance, ensuring that when an image is sent, the model detects it and temporarily saves it. The VLM works separately from the principal personality LLMs, but the chat history is then shared with the principal LLM, ensuring context is not lost after the VLM interaction.
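The commit log introduces a `current_model` variable that chooses between `llm_chat` and `vlm_chat`; a rough, hypothetical sketch of that routing with a shared history could look like this (the function bodies and the shape of `history` are assumptions):

```python
# Hypothetical routing sketch; llm_chat/vlm_chat names follow the commit log,
# everything else is illustrative.
def chat(user_message, image, history, current_model="llm"):
    if image is not None:
        current_model = "vlm"
    if current_model == "vlm":
        reply = vlm_chat(user_message, image, history)  # answer about the uploaded image
    else:
        reply = llm_chat(user_message, history)         # personality LLM
    history.append((user_message, reply))  # shared history preserves context across models
    return reply, history, current_model
```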

Challenges

Issues related to model optimization and input bit precision were not tackled in this PR due to time constraints.

Continuous development and integration for this issue were slow because of all the chat models' dependencies and initializations; there is a great area of opportunity for future development of this demo.

To run the original model, we used a server with an Intel Atom GRR in a containerized environment, due to dependency mismatches and system capacity. Testing on a local PC is still pending.

Modifications needed in the future

A webcam UI was added; however, we found some challenges when accessing the webcam from the client desktop. A future enhancement would be to modify the demo to fix this issue.

The clipboard input option next to the webcam input is also out of service.
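For whoever picks this up: in recent Gradio versions both of these inputs are enabled through the `sources` argument of `gr.Image`, so the configuration is presumably something like the line below (an assumption, not verified against this PR's code):

```python
# Assumed configuration; the upload path works, while the webcam and
# clipboard sources currently fail on the client side.
image_input = gr.Image(sources=["upload", "webcam", "clipboard"], type="pil")
```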

…l AI Assistant Demo openvinotoolkit#196

Enhancement of the existing Virtual AI Assistant demo to understand visual inputs of up to 1,003,520 pixels at any resolution.

Fixes openvinotoolkit#196

* [feat] virtual_ai_assistant/Dockerfile: Personality as build argument and mount model weights as a volume

* [fix] virtual_ai_assistant/Dockerfile: Personality after requirements are installed

* [feat] virtual_ai_assistant: Added QWEN2.5 VLM dependencies to requirements.txt

* [feat] virtual_ai_assistant/Dockerfile: install git in container to install latest optimum-intel version from github

* [feat] virtual_ai_assistant/main: Added new vlm_model parameter to script and new load_vlm_model function.

Co-authored-by: fjescala <[email protected]>

* [fix] virtual_ai_assistant/requirements: transformer dependencies aligned with jupyter notebook

* [feat] virtual_ai_assistant: Image input UI

* [feat] virtual_ai_assistant: llm_chat & vlm_chat

* [feat] virtual_ai_assistant: image_to_grayscale

* [feat] virtual_ai_assistant: new variable current_model to choose between llm_chat & vlm_chat

* [feat] virtual_ai_assistant/readme: instructions to run demo in docker container

---------

Co-authored-by: Armando Cruz <[email protected]>
Co-authored-by: Fabian Escalante <[email protected]>
Co-authored-by: Katherine Salcido <[email protected]>
Signed-off-by: Alondra Parra <[email protected]>
Signed-off-by: Eliana Pacheco <[email protected]>
@adrianboguszewski
Contributor

I like the Docker approach here. Let's split it into two PRs: one provides a Dockerfile for this demo and the second has all the VLM changes.

@fractalito
Author

Glad to hear you see value in the Docker containerization of this demo; it was very helpful for us while continuously developing and testing the enhancements for the Virtual AI Assistant demo.

We will be working on splitting this PR and will let you know in the comments when it's finished and the new PR is created.

@fractalito
Author

Here is the new PR with the Dockerfile and build instructions: #244. I suggest the new PR be merged first, before the one from this conversation.
