Add google embedder support #192
…t, integration, and API tests
…e backward compatibility
Hello @kuarcis, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello team,
Gemini here, providing a summary for this pull request titled "Add google embedder support".
The primary goal of this PR is to integrate support for Google AI's embedding models into the DeepWiki project. This allows users to utilize Google's embedding capabilities, particularly the text-embedding-004 model, and simplifies setup for users already leveraging Google Gemini models for text generation, as it uses the same GOOGLE_API_KEY.
The changes involve adding a new client for Google embeddings, updating the configuration system to allow selecting the embedder type via the DEEPWIKI_EMBEDDER_TYPE environment variable, modifying the data processing pipeline and RAG components to be aware of the different embedder types, enhancing documentation, and adding comprehensive tests for the new functionality and the overall embedder system.
This enhancement provides users with more flexibility in choosing their embedding provider based on their existing infrastructure and preferences.
Highlights
- Google Embedder Support: Adds a new GoogleEmbedderClient to integrate with Google AI's embedding API, specifically supporting the text-embedding-004 model for tasks like semantic similarity.
- Flexible Embedder Selection: Introduces the DEEPWIKI_EMBEDDER_TYPE environment variable (openai, google, ollama) to easily switch between different embedding providers without code changes. Defaults to openai.
- Updated Data Pipeline & RAG: Modifies core components like token counting, document reading, data pipeline preparation, and the RAG class to correctly handle and utilize the selected embedder type, including backward compatibility for the old is_ollama_embedder parameter.
- Enhanced Documentation: Updates the README.md to clearly explain how to enable and use Google AI embeddings, including configuration options for .env, Docker, and Docker Compose, and provides a table of available embedder types.
- Comprehensive Testing: Adds new unit and integration tests (test_google_embedder.py, test_full_integration.py, test_all_embedders.py) to verify the Google embedder client, AdalFlow integration, embedder selection logic, environment variable handling, and data pipeline compatibility across different embedder types.
- Minor API & Main Updates: Includes a small fix in the API's cache deletion endpoint for authorization code validation and refines the uvicorn reload configuration in main.py for development mode using watchfiles.
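The selection rule described under Flexible Embedder Selection can be sketched roughly as follows. This is an illustration only, not the PR's code: resolve_embedder_type is a hypothetical helper, and both the case-insensitive matching and the fallback to openai for unrecognized values are assumptions of this sketch. Only the variable name, the three accepted values, and the default come from the summary above.

```python
import os
from typing import Mapping, Optional

SUPPORTED_EMBEDDER_TYPES = ("openai", "google", "ollama")
DEFAULT_EMBEDDER_TYPE = "openai"


def resolve_embedder_type(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the embedder type named by DEEPWIKI_EMBEDDER_TYPE.

    Hypothetical helper: falls back to the default when the variable is
    unset and (an assumption of this sketch) when its value is unknown.
    """
    env = os.environ if env is None else env
    raw = env.get("DEEPWIKI_EMBEDDER_TYPE", DEFAULT_EMBEDDER_TYPE)
    # Normalizing case and whitespace is an assumption, not confirmed by the PR.
    value = raw.strip().lower()
    return value if value in SUPPORTED_EMBEDDER_TYPES else DEFAULT_EMBEDDER_TYPE
```

Passing the environment in as a mapping keeps the helper easy to test without mutating os.environ.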
Changelog
- README.md
  - Added 'Flexible Embeddings' feature to the main list.
  - Added DEEPWIKI_EMBEDDER_TYPE=google to the Quick Start .env example.
  - Added DEEPWIKI_EMBEDDER_TYPE=google to the Manual Setup .env example.
  - Added a new section '🧠 Using Google AI Embeddings' detailing features, how to enable (env var, Docker, Docker Compose), available types table, reasons to use it, and switching instructions.
  - Updated the Environment Variables table to include DEEPWIKI_EMBEDDER_TYPE and clarify API key requirements based on embedder type.
- api/api.py
  - Modified delete_wiki_cache endpoint to check if authorization_code is not empty before comparing it to WIKI_AUTH_CODE.
- api/config.py
  - Imported GoogleEmbedderClient.
  - Added EMBEDDER_TYPE environment variable (DEEPWIKI_EMBEDDER_TYPE, default 'openai').
  - Added GoogleEmbedderClient to the CLIENT_CLASSES mapping.
  - Updated load_embedder_config to include the embedder_google key when processing client classes.
  - Modified get_embedder_config to return the configuration based on the EMBEDDER_TYPE environment variable ('google', 'ollama', or default 'embedder').
  - Added is_google_embedder function to check if the current embedder is Google.
  - Added get_embedder_type function to return the current embedder type string ('ollama', 'google', 'openai').
  - Updated the loop in the main config loading section to include embedder_google when updating configs.
- api/config/embedder.json
  - Added a new configuration section embedder_google specifying GoogleEmbedderClient, batch_size, and model_kwargs (text-embedding-004, SEMANTIC_SIMILARITY).
- api/data_pipeline.py
  - Modified count_tokens to accept optional embedder_type parameter (with backward compatibility for is_ollama_embedder) and use it to determine the encoding.
  - Modified read_all_documents to accept optional embedder_type parameter (with backward compatibility for is_ollama_embedder) and pass it to count_tokens.
  - Modified prepare_data_pipeline to accept optional embedder_type parameter (with backward compatibility for is_ollama_embedder) and use it to select the embedder and the appropriate document processor (OllamaDocumentProcessor for ollama, ToEmbeddings for others including google).
- api/google_embedder_client.py
  - Added a new file implementing GoogleEmbedderClient inheriting from adalflow.core.model_client.ModelClient.
  - Includes methods for initializing the client with GOOGLE_API_KEY, parsing Google AI embedding responses, converting inputs to API kwargs (handling single and batch), and calling the Google AI embedding API (genai.embed_content).
  - Adds backoff for API calls.
  - Notes the lack of async support in the current Google AI Python client.
- api/main.py
  - Removed unused uvicorn import at the top.
  - Added configuration for watchfiles logger to show file paths in development.
  - Implemented watchfiles monkey patch to specifically watch api subdirectories (excluding logs) and .py files in the api root during development reload.
  - Updated uvicorn.run call to include reload_excludes for logs, __pycache__, and .pyc files when reload is enabled.
- api/rag.py
  - Modified RAG class initialization to use api.config.get_embedder_type() to determine the embedder type and pass it to get_embedder.
  - Updated prepare_retriever method to pass the detected embedder_type to prepare_database.
- api/tools/embedder.py
  - Modified get_embedder function to accept embedder_type and use_google_embedder (legacy) parameters.
  - Updated logic to select the embedder configuration based on embedder_type, legacy parameters, or auto-detection via api.config.get_embedder_type().
  - Added logic to set the batch_size attribute on the returned adal.Embedder instance if it's present in the configuration.
- tests/README.md
  - Added mention of Google AI embedder tests.
  - Updated Environment Variables section to include DEEPWIKI_EMBEDDER_TYPE requirement for Google tests.
  - Added test_google_embedder.py and test_google_embedder_fix.py to the Unit Tests section.
  - Added test_full_integration.py to the Integration Tests section.
  - Updated descriptions for test categories.
  - Added troubleshooting tips for API Key Issues and Server Dependencies.
- tests/__init__.py
  - Added a comment.
- tests/api/__init__.py
  - Added a comment.
- tests/integration/__init__.py
  - Added a comment.
- tests/integration/test_full_integration.py
  - Added a new file with integration tests specifically for Google AI embeddings, including tests for configuration loading, embedder selection, and environment variable handling.
- tests/run_tests.py
  - Updated the test runner script to include the new test directories (unit, integration, api).
  - Added check_environment function to verify required API keys (GOOGLE_API_KEY, OPENAI_API_KEY) and dependencies (adalflow, google-generativeai, requests).
  - Modified run_tests to iterate through specified directories and run test_*.py files.
  - Added argument parsing for --unit, --integration, --api, and --check-env.
- tests/unit/__init__.py
  - Added a comment.
- tests/unit/test_all_embedders.py
  - Added a new file with comprehensive unit tests for the embedder system.
  - Includes tests for configuration loading, embedder type detection (is_ollama_embedder, is_google_embedder, get_embedder_type), get_embedder_config, get_embedder factory function (with explicit types, legacy params, and auto-detection), direct client tests (GoogleEmbedderClient, OpenAIClient via AdalFlow), data pipeline functions (count_tokens, prepare_data_pipeline), RAG integration, and environment variable handling.
  - Uses a simple custom test runner.
- tests/unit/test_google_embedder.py
  - Added a new file with unit tests specifically for the GoogleEmbedderClient.
  - Includes tests for the client's call and parse_embedding_response methods for both single and batch embeddings, and tests its integration with the AdalFlow Embedder and ToEmbeddings components.
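Several changelog entries mention accepting a new embedder_type argument while still honoring the legacy is_ollama_embedder flag. One way such a resolution could look is sketched below. The helper name resolve_legacy_embedder_type, the detected_type argument (standing in for api.config.get_embedder_type()), and the exact precedence are all assumptions of this sketch, since the summary does not spell them out.

```python
from typing import Optional


def resolve_legacy_embedder_type(
    embedder_type: Optional[str] = None,
    is_ollama_embedder: Optional[bool] = None,
    detected_type: str = "openai",
) -> str:
    """Pick an embedder type from new and legacy arguments.

    Precedence assumed here: an explicit embedder_type wins, then the legacy
    boolean, then the auto-detected value.
    """
    if embedder_type is not None:
        return embedder_type
    if is_ollama_embedder:
        # Legacy callers that passed True clearly meant Ollama.
        return "ollama"
    # The summary does not say what a legacy False should map to, so this
    # sketch treats False like "not specified" and defers to detection.
    return detected_type
```

Keeping the legacy flag optional means old call sites continue to work unchanged while new call sites pass the string directly.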
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Google's vectors now arrive,
Making DeepWiki feel alive.
With keys in place,
We embed with grace,
And watch our RAG thrive.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request effectively introduces support for Google Embedders, which is a valuable addition to the DeepWiki Open project. The refactoring from boolean flags to a string-based embedder_type for selecting embedders is a significant improvement for maintainability and future extensibility.
The new GoogleEmbedderClient is well-implemented, and the configuration changes are clear and appropriate. The updates to the README.md are particularly commendable, providing comprehensive documentation for the new feature, including setup, benefits, and switching between embedders. The backward compatibility for existing embedder configurations has also been thoughtfully handled.
Overall, the changes are well-executed and enhance the flexibility of the embedding system. The inclusion of new tests and the statement that they pass is also a positive indicator of quality.
Summary of Findings
- Refactoring for Embedder Type: The refactoring across multiple files (e.g., api/config.py, api/data_pipeline.py, api/rag.py, api/tools/embedder.py) to use a string-based embedder_type instead of boolean flags (like is_ollama_embedder) is a well-executed improvement. This makes the system more extensible for adding new embedder types in the future and improves code clarity.
- GoogleEmbedderClient Implementation: The new api/google_embedder_client.py provides a solid integration with the Google AI Embeddings API, correctly handling API key configuration, input conversion for single and batch embeddings, and response parsing. The use of backoff for API calls is also a good practice.
- Documentation (README.md): The updates to README.md are comprehensive and very helpful for users. The new section on 'Using Google AI Embeddings' clearly explains its features, setup, and benefits. The updated environment variable table and API key requirements are also much clearer.
- Configuration Management: The changes in api/config.py and api/config/embedder.json correctly incorporate the Google embedder, allowing for easy configuration and selection via the DEEPWIKI_EMBEDDER_TYPE environment variable.
- Backward Compatibility: The handling of backward compatibility for functions previously using is_ollama_embedder by introducing logic to interpret it in the context of the new embedder_type system is well done, ensuring a smoother transition.
- Minor: Unused Parameter in get_embedder: In api/tools/embedder.py, the get_embedder function signature was updated to include use_google_embedder: bool = False. While this doesn't cause issues, this specific parameter is not utilized by any call sites within the changes of this PR. This was noted but not commented on due to review settings (low severity).
- Minor: Potentially Unreachable Code in GoogleEmbedderClient: In api/google_embedder_client.py, the parse_embedding_response method includes a condition elif hasattr(response, 'embeddings'):. Given the standard responses from the Google AI SDK's embed_content method, this specific branch might be unreachable. This was noted but not commented on due to review settings (low severity).
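To make the reachability question in the last finding concrete, here is a hedged sketch of response normalization. It is not the PR's parse_embedding_response: normalize_embeddings is a hypothetical function, and it assumes the response is a dict whose 'embedding' key holds either one flat vector (single input) or a list of vectors (batch input). Under that assumption no separate 'embeddings' branch is needed.

```python
from typing import Any, Dict, List


def normalize_embeddings(response: Dict[str, Any]) -> List[List[float]]:
    """Return a list of vectors from a dict-shaped embedding response.

    Assumes the payload sits under an 'embedding' key and is either one flat
    vector (single input) or a list of vectors (batch input).
    """
    if "embedding" not in response:
        raise ValueError("response has no 'embedding' key")
    payload = response["embedding"]
    if not payload:
        return []
    if isinstance(payload[0], (int, float)):
        # Single input: wrap the flat vector so callers always get a batch.
        return [list(payload)]
    # Batch input: copy each vector into a plain list.
    return [list(vector) for vector in payload]
```

Always returning a list of vectors lets downstream code treat single and batch calls uniformly.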
Merge Readiness
The pull request is in excellent shape and significantly enhances the project's embedding capabilities. The code is well-structured, the new feature is thoroughly documented, and backward compatibility has been maintained. I believe these changes are ready for merging after any standard final checks by the maintainers. As an AI assistant, I am not authorized to approve pull requests, so please ensure it undergoes the necessary human review and approval process.
Fix the conflicts and simplify it.

Thank you for your feedback. I’ve resolved the conflicts as requested. As for simplifying the PR: supporting multiple embedders while keeping backward compatibility is inherently complex, so the changes are already as streamlined as possible without sacrificing reliability or flexibility.

Is this going to be merged? I'm keen to test this, and this PR opens the possibility of adding Vertex integration, which would make this project better suited for corporate users who need to go through Vertex for their models due to security/privacy requirements.

Quick update: I ran the tests locally on this branch and the changes work as expected.
Looks good from my side.

I also tested this PR and it is really good. Are you willing to merge this PR?

Hi all. Since there's been interest in getting this PR merged, I've created a community-maintained fork where this change is merged, and I'll try to keep it up to date with the upstream repo. 🔗 Community Fork: https://github.com/kuarcis/deepwiki-open-community
Summary
This PR adds support for the Google Embedder to the DeepWiki Open project.
Changes Introduced
Motivation
Adding Google Embedder support allows users to leverage Google’s embedding capabilities and to run the backend with only a Google API key.
Testing
Related Issues
Checklist