[WIP]: Refactor db layer #165
Conversation
Hey @Davidyz, this is still a work in progress and I have a bunch left to do. Right now I'm working on abstracting the chroma layer. I'll fill in the PR description and other chores a little later when it's ready. Feel free to peruse what I've done so far.
Before you've gone too far on this: have you considered using something like langchain? They seem to provide wrappers for embeddings and vectorstore adapters, too. I had reservations about it because I was only working with chroma and I wanted to KISS. But as we try to introduce compatibility with more databases, it seems like this PR might be re-inventing the wheel. I personally don't have much experience with langchain and the like, so I'd love to know your thoughts on this. This is something that I should've thought about when we were in the discussion. To be clear, I have no problem having our custom DB connectors, since we'd be able to tailor them to better fit our specific needs. I just hope we think through other possibilities before investing too much in this. EDIT: also, it'll help me keep things organised if you convert this to a draft PR until it's ready.
@Davidyz good point, I’m not that familiar with langchain. Would you mind sharing the embeddings and connectors that you found with me? Would be great to not have to reinvent the wheel.
This is what I found from langchain: https://python.langchain.com/docs/integrations/vectorstores/ The db support requires extra packages, though, and both of them require some direct manipulation of the database libraries.
@Davidyz sorry for the belated response. Thanks for sharing those. Do you have a preference between the two? From what I've heard, langchain seems to have the most adoption.
Honestly, I don't have a preference. I haven't looked at their implementations very closely (if I had, I'd probably have used them at the beginning 😆). But one thing we'd need is support for metadata. Since the opening of the PR, VectorCode has made more use of the metadata (to store the line ranges of chunks and to skip vectorising files that are up-to-date). Since langchain and llama-index are both MIT-licensed, we can probably make an in-house connector based on theirs (that is, if their implementation doesn't work well with the metadata)?
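To make the two metadata uses mentioned above concrete, here is a minimal sketch of per-chunk metadata that records a line range and lets an indexer skip unchanged files. It is not VectorCode's actual code: the key names (`path`, `sha256`, `start`, `end`) and the helper functions are hypothetical, and using a content hash for the up-to-date check is an assumption for illustration.

```python
import hashlib
from typing import Optional


def file_sha256(content: bytes) -> str:
    """Return a hex digest that identifies this exact file content."""
    return hashlib.sha256(content).hexdigest()


def chunk_metadata(path: str, content: bytes, start: int, end: int) -> dict:
    """Build metadata for one chunk (hypothetical key names).

    `start` and `end` are the chunk's line range within the file;
    `sha256` is the hash of the whole file the chunk came from.
    """
    return {
        "path": path,
        "sha256": file_sha256(content),
        "start": start,
        "end": end,
    }


def needs_reindex(content: bytes, stored: Optional[dict]) -> bool:
    """True if the file was never indexed or its content has changed."""
    return stored is None or stored.get("sha256") != file_sha256(content)
```

Whatever connector ends up being used would need to store and return these values unchanged for the skip check to work, which is the requirement raised above.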
Ok nice. I'll investigate and play around with both and see how they fit. I've briefly looked at langchain, and I imagine using their connectors would require using their Document models. That ties VectorCode a little to their library, but the gains would probably be worth it. Good to know that the metadata is important to the integration with these libraries. From looking at langchain, the metadata seems to be associated with the Document models, which are independent of the db connectors. That would be great news, since from what I understand we wouldn't need to worry about whether the different types of dbs support the metadata, right?
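The idea of metadata living on the document model rather than on the connector can be sketched roughly as follows. `Doc` is a plain stand-in written for this example, not langchain's `Document` class, and `Connector` and `InMemoryConnector` are hypothetical names; whether a given real backend round-trips metadata this cleanly would still need to be checked.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class Doc:
    """Stand-in document model: text plus free-form metadata."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class Connector(Protocol):
    """Hypothetical minimal interface a db connector would implement."""

    def add(self, docs: list[Doc]) -> None: ...

    def all_docs(self) -> list[Doc]: ...


class InMemoryConnector:
    """Toy backend used only to exercise the interface."""

    def __init__(self) -> None:
        self._docs: list[Doc] = []

    def add(self, docs: list[Doc]) -> None:
        self._docs.extend(docs)

    def all_docs(self) -> list[Doc]:
        return list(self._docs)


def indexed_paths(store: Connector) -> set[str]:
    """Read metadata back through the interface, whatever the backend is."""
    return {d.metadata["path"] for d in store.all_docs()}
```

Code written against `Connector` only ever touches `Doc.metadata`, so swapping the backend would not change how metadata is produced or read, provided the backend preserves it.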
Yes, that would be very handy, as long as that metadata works for filtering. There's a feature in the current implementation that allows users to exclude some files from the query. I use this for de-duplication in the codecompanion query tool. Another recently introduced feature uses the metadata to store a hash of the file, so that
No description provided.