[WIP]: Refactor db layer #165

rieger-jared · 2025-05-31T02:39:10Z

No description provided.

rieger-jared · 2025-05-31T02:41:29Z

Hey @Davidyz, this is still a work in progress and I have a bunch left to do. Right now I'm working on abstracting the chroma layer. I'll fill in the PR description and handle the other chores a little later, when it's ready. Feel free to look through what I've done so far.

Davidyz · 2025-05-31T08:45:10Z

Before you go too far on this: have you considered using something like langchain? They seem to provide wrappers for embeddings and vectorstore adapters, too. I had reservations about it because I was only working with chroma and I wanted to KISS. But as we try to add compatibility with more databases, it seems like this PR might be re-inventing the wheel. I personally don't have much experience with langchain and the like, so I'd love to know your thoughts on this.

This is something that I should've thought about when we were in the discussion. To be clear, I have no problem having our custom DB connectors since we'd be able to tailor them to better fit our specific needs. I just hope we think through other possibilities before investing too much in this.

EDIT: also, it'll help me keep things organised if you convert this to a draft PR until it's ready.

rieger-jared · 2025-05-31T10:21:53Z

@Davidyz good point, I’m not that familiar with langchain. Would you mind sharing the embeddings and connectors that you found? It would be great to not have to reinvent the wheel.

Davidyz · 2025-05-31T13:50:49Z

This is what I found from langchain: https://python.langchain.com/docs/integrations/vectorstores/
And this from llamaindex: https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/

The db support requires extra packages, though, and both of them require some direct manipulation of the database libraries.

rieger-jared · 2025-06-19T09:55:41Z

@Davidyz sorry for the belated response. Thanks for sharing those. Do you have a preference between the two? From what I've heard, langchain seems to have the most adoption.

Davidyz · 2025-06-19T10:23:26Z

Honestly, I don't have a preference. I haven't looked at their implementations very closely (if I had, I'd probably have used them from the beginning 😆). But one thing we'd need is metadata support. Since this PR was opened, VectorCode has made more use of the metadata (to store the line ranges of chunks and to skip vectorising files that are up-to-date). Since langchain and llama-index are both MIT-licensed, we can probably make an in-house connector based on theirs (that is, if their implementation doesn't work well with the metadata)?

rieger-jared · 2025-06-22T06:55:11Z

> Honestly, I don't have a preference. I haven't looked at their implementations very closely (if I had, I'd probably have used them from the beginning 😆). But one thing we'd need is metadata support. Since this PR was opened, VectorCode has made more use of the metadata (to store the line ranges of chunks and to skip vectorising files that are up-to-date). Since langchain and llama-index are both MIT-licensed, we can probably make an in-house connector based on theirs (that is, if their implementation doesn't work well with the metadata)?

Ok nice. I'll investigate and play around with both and see how they fit. I've briefly looked at langchain, and I'd imagine using their connectors would require using their Document models. That would tie VectorCode a little to their library, but the gains would probably be worth it.

Good to know that the metadata is important to the integration with these libraries. From looking at langchain, the metadata seems to be associated with the Document models, which are independent of the db connectors. That would be great news because, from what I understand, we wouldn't need to worry about whether the different types of dbs support the metadata, right?
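
As a rough illustration of that separation, here is a minimal standard-library sketch. It does not use langchain's actual classes: `ChunkDocument`, `VectorStore`, `InMemoryStore`, and the metadata keys `path`, `start`, and `end` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Protocol


@dataclass
class ChunkDocument:
    """Hypothetical stand-in for a library-provided document model."""

    text: str
    # Metadata lives on the document itself, not on any particular backend.
    metadata: Dict[str, Any] = field(default_factory=dict)


class VectorStore(Protocol):
    """Hypothetical connector interface that each backend would implement."""

    def add(self, docs: List[ChunkDocument]) -> None: ...

    def all(self) -> List[ChunkDocument]: ...


class InMemoryStore:
    """Toy backend, used only to show that metadata travels with the document."""

    def __init__(self) -> None:
        self._docs: List[ChunkDocument] = []

    def add(self, docs: List[ChunkDocument]) -> None:
        self._docs.extend(docs)

    def all(self) -> List[ChunkDocument]:
        return list(self._docs)


store: VectorStore = InMemoryStore()
store.add(
    [ChunkDocument("def main(): ...", {"path": "src/main.py", "start": 1, "end": 1})]
)
```

Whether a real backend can also filter on those fields is a separate question that would need checking per connector.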

Davidyz · 2025-06-22T07:49:10Z

> That would be great news because, from what I understand, we wouldn't need to worry about whether the different types of dbs support the metadata, right?

Yes, that would be very handy, as long as the metadata works for filtering. There's a feature in the current implementation that allows users to exclude some files from the query. I use this for de-duplication in the codecompanion query tool. Another recently introduced feature uses the metadata to store a hash of the file, so that vectorise and update can skip files that haven't changed since the last vectorisation. I might introduce extra metadata fields in the future, but I imagine the way they'd be used would be similar to the existing 2 fields.
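
The two uses described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual VectorCode code: the function names, the `path` and `sha256` metadata keys, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Set


def file_hash(content: bytes) -> str:
    """Hash file content. SHA-256 is an assumption made for this sketch."""
    return hashlib.sha256(content).hexdigest()


def needs_vectorising(content: bytes, stored_hash: Optional[str]) -> bool:
    """Return True when the file is new or has changed since it was last indexed."""
    return stored_hash != file_hash(content)


def exclude_paths(chunks: List[Dict], excluded: Set[str]) -> List[Dict]:
    """Drop chunks whose (hypothetical) 'path' metadata is in the excluded set."""
    return [c for c in chunks if c["metadata"]["path"] not in excluded]


chunks = [
    {"text": "a", "metadata": {"path": "a.py", "sha256": file_hash(b"a")}},
    {"text": "b", "metadata": {"path": "b.py", "sha256": file_hash(b"b")}},
]

# Exclude a file the caller has already seen (the de-duplication case).
remaining = exclude_paths(chunks, {"a.py"})
```

Both operations only need equality or set-membership tests on metadata values, so a connector that supports that kind of filter should cover the existing two fields.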

rieger-jared added 2 commits May 31, 2025 12:36

add db type to config

d3ebe52

WIP: base implementation for chroma

74a6726

fix connection to local

88226f4

rieger-jared marked this pull request as draft May 31, 2025 10:20

Davidyz mentioned this pull request Jun 25, 2025

Implement filelock for db_path when using the chroma server #217

Merged

Davidyz mentioned this pull request Jul 17, 2025

[FEAT] Support for newer ChromaDB version #247

Open
