-
Notifications
You must be signed in to change notification settings - Fork 97
feat: add code chunking functionality #398
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
✅ DCO Check Passed Thanks @bridgetmcg, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🔴 Require two reviewer for test updatesThis rule is failing.When test data is updated, we require two reviewers
🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
Related Documentation Checked 2 published document(s). No updates required. You have 5 draft document(s). Publish docs to keep them always up-to-date |
5c6fc1e
to
32b120d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:
I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:
- to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
- to ensure composability with existing chunkers, instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the
HierarchicalChunker
(good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) is itself composable into theHybridChunker
). Let us still define the interface specifics, but it could look like an optional kwarg inHierarchicalChunker.chunk()
, e.g.code_chunking_strategy
, adhering to a matching interface.
A more minor comment is regarding CodeDocMeta
, which I see inherits from BaseMeta
. Some application code may expect to interact with DocMeta
(also a BaseMeta
child), so if possible we should best extend that.
This point goes hand in hand with the interface specifics TBD above.
(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)
hi @vagenas, I added some logic to address your suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @bridgetmcg, many thanks for the new iteration, incorporating feedback from above!
I only have a couple last points from my side (@dolfim-ibm I don't know if you want to add anything):
- I see the actual integration meanwhile occurs via
HierarchicalChunker
->DefaultCodeChunkingStrategy
->CodeChunkingStrategyFactory
which returns aCodeChunker
(subclass ofBaseChunker
). Since the chunker primitive should be runnable on any document, it could be confusing to expose a component that operates only on e.g. Python, as a "chunker". Ideally one could just satisfy the newly introduced interface/protocol (and notBaseChunker
which may require additional points), but perhaps the fastest way to address this point is just to mark theCodeChunker
class hierarchy "internal" by prepending with_
. Then it's clear all these classes are implementation internals, and users only need to care about the strategy they can optionally pass toHierarchicalChunker
. - the defined
CodeChunkingStrategy(Protocol)
does not seem to be used anywhere — shouldn't this be somewhere in the typing of fieldcode_chunking_strategy
withinHierarchicalChunker
? - to allow for extensibility we support the notion of customizable serializers; while this perhaps makes little sense in
CodeItem
s, for consistency, inHierarchicalChunker
we should still best use the result fromdoc_serializer
(instead of justitem.text
)
Hope that makes sense. Otherwise we can also have a quick call to clarify & finalize.
meta: CodeDocMeta | ||
|
||
|
||
class ChunkType(str, Enum): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to avoid users thinking this is a generic ChunkType
, we could rename the class to CodeChunkType
.
Looking at the discussion and the proposed |
What about the obvious way to expand this, i.e. adding an optional |
I was just looking up what we do in the serializers, there we are indeed using this approach. So let's go with it. |
@vagenas I believe I addressed your comments! Let me know if not. Many thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bridgetmcg I made some in-line comments incl. code suggestions.
Please also install the pre-commit hooks, so all checks are verified locally before pushing.
(E.g. I think some tests are still not up-to-date. FYI to generate the data, set env var DOCLING_GEN_TEST_DATA=1
, e.g. DOCLING_GEN_TEST_DATA=1 uv run pytest
)
@bridgetmcg sounds good, now we still need to address the conflicts on uv.lock (not possible manually), which means:
Can you take care of these? Otherwise let me know and I could try to look into it. |
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 334811a Signed-off-by: Bridget McGinn <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]> Signed-off-by: Bridget <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]> Signed-off-by: Bridget <[email protected]>
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 46bb88a I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 10e9ed8 I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: d9827c7 I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 814dc61 Signed-off-by: Bridget McGinn <[email protected]>
5b8e4cc
to
377d5ce
Compare
Signed-off-by: Bridget McGinn <[email protected]>
Signed-off-by: Bridget McGinn <[email protected]>
68bbf44
to
b417cae
Compare
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: a4a21e9 I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 0266c63 I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 336dd6a I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 68890e9 I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 3c65eef Signed-off-by: Bridget McGinn <[email protected]>
@vagenas I had to restrict the tree-sitter versioning due to python compatibility. treesitter > 0.24 requires python 3.10+ and 0.23 requires all the treesitter language libraries to be <0.24 as well. |
This PR introduces code chunking functionality to
docling-core
, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.Features
Core
Language Support
Testing