
[WIP]: Refactor db layer #165


Draft
rieger-jared wants to merge 3 commits into main

Conversation

rieger-jared

No description provided.

@rieger-jared
Author

Hey @Davidyz, this is still a work in progress and I still have a bunch to do. Right now I'm working on abstracting the chroma layer. I'll fill in the PR description and other chores a little later when it's ready. Feel free to peruse what I've done so far.

@Davidyz
Owner

Davidyz commented May 31, 2025

Before you've gone too far on this: have you considered using something like langchain? They seem to provide wrappers for embeddings and vectorstore adapters, too. I had reservations about it because I was only working with chroma and I wanted to KISS. But as we try to add compatibility with more databases, it seems like this PR might be re-inventing the wheel. I personally don't have much experience with langchain and the like, so I'd love to know your thoughts on this.

This is something that I should've thought about during our earlier discussion. To be clear, I have no problem with having our own custom DB connectors, since we'd be able to tailor them to better fit our specific needs. I just hope we think through other possibilities before investing too much in this.

EDIT: also, it'll help me keep things organised if you convert this to a draft PR until it's ready.

@rieger-jared rieger-jared marked this pull request as draft May 31, 2025 10:20
@rieger-jared
Author

@Davidyz good point, I'm not that familiar with langchain. Would you mind sharing the embeddings and connectors that you found? It would be great not to have to reinvent the wheel.

@Davidyz
Owner

Davidyz commented May 31, 2025

This is what I found from langchain: https://python.langchain.com/docs/integrations/vectorstores/
And this from llamaindex: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/

The db support requires extra packages, though, and both of them require some direct manipulation of the database libraries.

@rieger-jared
Author

@Davidyz sorry for the belated response. Thanks for sharing those. Do you have a preference between the two? From what I've heard, langchain seems to have the most adoption.

@Davidyz
Owner

Davidyz commented Jun 19, 2025

Honestly, I don't have a preference. I haven't looked at their implementations very closely (if I had, I'd probably have used them from the beginning 😆). But one thing we'd need is metadata support. Since this PR was opened, VectorCode has made more use of the metadata (to store the line ranges of chunks and to skip vectorising files that are up-to-date). Since langchain and llama-index are both MIT-licensed, we can probably make an in-house connector based on theirs (that is, if their implementation doesn't work well with the metadata)?

@rieger-jared
Author

> Honestly, I don't have a preference. I haven't looked at their implementations very closely (if I had, I'd probably have used them from the beginning 😆). But one thing we'd need is metadata support. Since this PR was opened, VectorCode has made more use of the metadata (to store the line ranges of chunks and to skip vectorising files that are up-to-date). Since langchain and llama-index are both MIT-licensed, we can probably make an in-house connector based on theirs (that is, if their implementation doesn't work well with the metadata)?

Ok nice. I'll investigate and play around with both and see how they fit. I've briefly looked at langchain, and I'd imagine that using their connectors would require using their Document models. That would tie VectorCode a little to their library, but the gains would probably be worth it.

Good to know that the metadata is important to the integration with these libraries. From looking at langchain, the metadata seems to be associated with the Document models, which are independent of the db connectors. That would be great news as, from what I understand, we wouldn't need to worry about the different types of dbs supporting the metadata, right?
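
A minimal sketch of the idea above, with metadata attached to the document model rather than to any particular db connector. It uses only the standard library, and every name in it (`Chunk`, `VectorStoreConnector`, `InMemoryConnector`, `store_file_chunks`, and the `path`/`start`/`end` metadata keys) is a hypothetical stand-in, not langchain's or VectorCode's actual API:

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Chunk:
    """Hypothetical stand-in for a library's document model."""
    text: str
    # Metadata travels with the chunk, independent of the backend.
    metadata: dict[str, Any] = field(default_factory=dict)


class VectorStoreConnector(Protocol):
    """Hypothetical interface that each database backend would implement."""

    def add(self, chunks: list[Chunk]) -> None: ...

    def all_chunks(self) -> list[Chunk]: ...


class InMemoryConnector:
    """Toy backend used only to illustrate the interface (no embeddings)."""

    def __init__(self) -> None:
        self._chunks: list[Chunk] = []

    def add(self, chunks: list[Chunk]) -> None:
        self._chunks.extend(chunks)

    def all_chunks(self) -> list[Chunk]:
        return list(self._chunks)


def store_file_chunks(
    store: VectorStoreConnector,
    path: str,
    pieces: list[tuple[int, int, str]],
) -> None:
    """Store (start_line, end_line, text) pieces with their line ranges as metadata."""
    store.add(
        [
            Chunk(text=text, metadata={"path": path, "start": start, "end": end})
            for start, end, text in pieces
        ]
    )
```

The calling code only ever touches `Chunk` and the connector interface, so swapping the backend would not change how metadata is attached. Whether a given real backend can also filter on that metadata is a separate question.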

@Davidyz
Owner

Davidyz commented Jun 22, 2025

> That would be great news as, from what I understand, we wouldn't need to worry about the different types of dbs supporting the metadata, right?

Yes, that would be very handy, as long as the metadata works for filtering. There's a feature in the current implementation that allows users to exclude some files from the query. I use this for de-duplication in the codecompanion query tool. Another recently introduced feature uses the metadata to store a hash of the file, so that vectorise and update can skip files that haven't changed since the last vectorisation. I might introduce extra metadata fields in the future, but I imagine the way they'd be used would be similar to the existing two fields.
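
The two metadata uses described above can be sketched roughly as follows. This is only an illustration under assumptions: the function names, the `sha256` and `path` metadata keys, and the choice of SHA-256 are all hypothetical, not VectorCode's actual implementation. A real backend would ideally apply the exclusion inside the database query rather than filtering afterwards.

```python
import hashlib
from typing import Any, Optional


def file_hash(content: bytes) -> str:
    """Hash of the file content (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(content).hexdigest()


def needs_vectorising(content: bytes, stored_metadata: Optional[dict[str, Any]]) -> bool:
    """Return True if the file is new or has changed since it was last stored."""
    if stored_metadata is None:
        # The file has never been vectorised.
        return True
    return stored_metadata.get("sha256") != file_hash(content)


def exclude_paths(results: list[dict[str, Any]], excluded: set[str]) -> list[dict[str, Any]]:
    """Drop query results whose 'path' metadata is in the excluded set."""
    return [meta for meta in results if meta.get("path") not in excluded]
```

For example, a file whose stored hash matches its current content would be skipped by `needs_vectorising`, and passing the paths already shown to the user as `excluded` gives the de-duplication behaviour mentioned above.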


2 participants