@bridgetmcg

This PR introduces code chunking functionality to docling-core, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.

Features

Core

  • CodeChunker - Base abstract class for code chunking with Tree-sitter integration
  • Language-specific chunkers - Specialized implementations for 5 major programming languages
  • Smart chunk splitting - Automatic splitting of large functions while preserving context

Language Support

  • Python (PythonFunctionChunker) - Functions, classes, imports, module variables
  • TypeScript (TypeScriptFunctionChunker) - Functions, classes, interfaces, imports
  • JavaScript (JavaScriptFunctionChunker) - Inherits from TypeScript chunker
  • Java (JavaFunctionChunker) - Methods, constructors, classes, enums, interfaces
  • C (CFunctionChunker) - Functions, structs, macros, preprocessor definitions
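
As a rough illustration of what function- and class-level chunking means, the sketch below splits a Python file into one chunk per top-level definition. It deliberately uses the standard-library `ast` module instead of tree-sitter, so it only handles Python and is not the PR's implementation; the function name `chunk_python_source` and the dictionary layout are assumptions made for this example.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str) -> List[Dict[str, str]]:
    """Return one chunk per top-level function or class (stdlib stand-in for tree-sitter)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[Dict[str, str]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (available since Python 3.8)
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            chunks.append({"kind": kind, "name": node.name, "text": text})
    return chunks


SAMPLE = '''import os

def greet(name):
    return f"Hello, {name}"

class Greeter:
    def hello(self):
        return greet("world")
'''

chunks = chunk_python_source(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"])
```

A tree-sitter based chunker follows the same idea (walk the syntax tree, cut at definition boundaries) but works across languages because each grammar exposes comparable node types.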

Testing

  • test_code_chunker.py - multi-language, real code samples

@github-actions
Contributor

github-actions bot commented Oct 3, 2025

DCO Check Passed

Thanks @bridgetmcg, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Oct 3, 2025

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
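
The title rule can be tried out locally. The sketch below applies the same regular expression with Python's `re` module; the helper name `is_conventional` is an assumption for illustration, and Mergify's own matching engine may differ in details.

```python
import re

# Same pattern as in the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


for title in ("Add Code Chunking Functionality", "feat: add code chunking functionality"):
    print(title, "->", is_conventional(title))
```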

@dosubot

dosubot bot commented Oct 3, 2025

Related Documentation

Checked 2 published document(s). No updates required.

You have 5 draft document(s). Publish docs to keep them always up-to-date

@bridgetmcg bridgetmcg changed the title Add Code Chunking Functionality feat: add code chunking functionality Oct 3, 2025
Collaborator

@vagenas vagenas left a comment

Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:

I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:

  • to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
  • to ensure composability with existing chunkers, instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the HierarchicalChunker (good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) is itself composable into the HybridChunker). Let us still define the interface specifics, but it could look like an optional kwarg in HierarchicalChunker.chunk(), e.g. code_chunking_strategy, adhering to a matching interface.
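
To make the two recommendations concrete, here is a minimal sketch of a single strategy object that detects the language internally and is passed to a chunking function as an optional `code_chunking_strategy` argument. Everything in it is a stand-in: `SketchCodeStrategy`, `chunk_items`, the keyword heuristics, and the `(kind, text)` item tuples are assumptions for illustration and not docling-core APIs.

```python
from typing import Callable, Dict, List, Optional, Tuple


def split_on_prefix(code: str, prefix: str) -> List[str]:
    """Naive splitter: start a new chunk at every line beginning with `prefix`."""
    chunks: List[str] = []
    current: List[str] = []
    for line in code.splitlines():
        if line.startswith(prefix) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


class SketchCodeStrategy:
    """One component that hides language detection and per-language splitting."""

    _splitters: Dict[str, Callable[[str], List[str]]] = {
        "python": lambda code: split_on_prefix(code, "def "),
        "javascript": lambda code: split_on_prefix(code, "function "),
    }

    def detect_language(self, code: str) -> Optional[str]:
        # Toy heuristic, purely for the sketch.
        if "def " in code:
            return "python"
        if "function " in code:
            return "javascript"
        return None

    def chunk_code(self, code: str) -> List[str]:
        splitter = self._splitters.get(self.detect_language(code))
        return splitter(code) if splitter else [code]


def chunk_items(
    items: List[Tuple[str, str]],
    code_chunking_strategy: Optional[SketchCodeStrategy] = None,
) -> List[str]:
    """Item-by-item chunking; code items are delegated only if a strategy is given."""
    out: List[str] = []
    for kind, text in items:
        if kind == "code" and code_chunking_strategy is not None:
            out.extend(code_chunking_strategy.chunk_code(text))
        else:
            out.append(text)
    return out


items = [("text", "Intro"), ("code", "def a():\n    return 1\ndef b():\n    return 2")]
print(chunk_items(items))
print(chunk_items(items, code_chunking_strategy=SketchCodeStrategy()))
```

Without the strategy the code item stays a single chunk, so existing callers see no change in behaviour; that is the composability argument made above.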

A more minor comment is regarding CodeDocMeta, which I see inherits from BaseMeta. Some application code may expect to interact with DocMeta (also a BaseMeta child), so if possible we should best extend that.
This point goes hand in hand with the interface specifics TBD above.

(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)

@bridgetmcg
Author

bridgetmcg commented Oct 10, 2025

hi @vagenas, I added some logic to address your suggestions.

Collaborator

@vagenas vagenas left a comment

Hi @bridgetmcg, many thanks for the new iteration, incorporating feedback from above!

I only have a couple last points from my side (@dolfim-ibm I don't know if you want to add anything):

  1. I see the actual integration now occurs via HierarchicalChunker -> DefaultCodeChunkingStrategy -> CodeChunkingStrategyFactory, which returns a CodeChunker (a subclass of BaseChunker). Since the chunker primitive should be runnable on any document, it could be confusing to expose a component that operates only on, e.g., Python as a "chunker". Ideally one could just satisfy the newly introduced interface/protocol (and not BaseChunker, which may bring additional requirements), but perhaps the fastest way to address this point is to mark the CodeChunker class hierarchy as "internal" by prefixing the class names with _. Then it's clear all these classes are implementation internals, and users only need to care about the strategy they can optionally pass to HierarchicalChunker.
  2. the defined CodeChunkingStrategy(Protocol) does not seem to be used anywhere — shouldn't this be somewhere in the typing of field code_chunking_strategy within HierarchicalChunker?
  3. to allow for extensibility we support the notion of customizable serializers; while this perhaps makes little sense in CodeItems, for consistency, in HierarchicalChunker we should still best use the result from doc_serializer (instead of just item.text)
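
Points 2 and 3 could look roughly like the sketch below: the protocol appears in the type of the `code_chunking_strategy` field, and the chunker hands the serializer's output (not the raw `item.text`) to the strategy. The method name `chunk_code_item`, the `SketchChunker` dataclass, and `LineStrategy` are assumptions for illustration; the real HierarchicalChunker is not a plain dataclass and its interface may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol, runtime_checkable


@runtime_checkable
class CodeChunkingStrategy(Protocol):
    """Structural interface: any object with this method qualifies."""

    def chunk_code_item(self, code_text: str) -> List[str]: ...


class LineStrategy:
    """Satisfies the protocol without inheriting from it or from a chunker base class."""

    def chunk_code_item(self, code_text: str) -> List[str]:
        return [line for line in code_text.splitlines() if line.strip()]


@dataclass
class SketchChunker:
    # The protocol is used in the field's type, so type checkers can verify callers.
    code_chunking_strategy: Optional[CodeChunkingStrategy] = None

    def chunk_code(self, serialized_text: str) -> List[str]:
        # `serialized_text` stands for the serializer's result for a code item.
        if self.code_chunking_strategy is None:
            return [serialized_text]
        return self.code_chunking_strategy.chunk_code_item(serialized_text)


print(isinstance(LineStrategy(), CodeChunkingStrategy))
print(SketchChunker(LineStrategy()).chunk_code("x = 1\n\ny = 2"))
```

Because the protocol is structural, a strategy does not have to be a BaseChunker subclass, which is the alternative to the underscore-prefix approach mentioned in point 1.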

Hope that makes sense. Otherwise we can also have a quick call to clarify & finalize.

meta: CodeDocMeta


class ChunkType(str, Enum):
Contributor

to avoid users thinking this is a generic ChunkType, we could rename the class to CodeChunkType.

@dolfim-ibm
Contributor

Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

@vagenas
Collaborator

vagenas commented Oct 22, 2025

> Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

@dolfim-ibm
Contributor

> > Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.
>
> What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

I was just looking up what we do in the serializers, there we are indeed using this approach. So let's go with it.

@bridgetmcg
Author

@vagenas I believe I addressed your comments! Let me know if not. Many thanks!

Collaborator

@vagenas vagenas left a comment

@bridgetmcg I made some in-line comments incl. code suggestions.
Please also install the pre-commit hooks, so all checks are verified locally before pushing.
(E.g. I think some tests are still not up-to-date. FYI to generate the data, set env var DOCLING_GEN_TEST_DATA=1, e.g. DOCLING_GEN_TEST_DATA=1 uv run pytest)
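
For readers unfamiliar with this pattern, the sketch below shows how a test helper can either regenerate or verify reference ("golden") data depending on an environment variable. Only the variable name DOCLING_GEN_TEST_DATA comes from the comment above; the helper names, the JSON format, and the exact way the flag is read are assumptions and may not match the repository's test utilities.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Mapping


def generation_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Assumed convention: the flag is on when the variable equals "1"."""
    return env.get("DOCLING_GEN_TEST_DATA", "") == "1"


def check_against_golden(actual: Any, golden_path: Path, generate: bool) -> None:
    """Rewrite the golden file in generation mode, otherwise compare against it."""
    if generate:
        golden_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return
    expected = json.loads(golden_path.read_text())
    assert actual == expected, f"{golden_path.name} is out of date"


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "chunks.json"
    check_against_golden({"n_chunks": 3}, golden, generate=True)   # writes the file
    check_against_golden({"n_chunks": 3}, golden, generate=False)  # passes
```

In a real test the `generate` argument would come from `generation_enabled()`, so a plain pytest run verifies the data and a run with the variable set rewrites it.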

@bridgetmcg
Author

@vagenas I ran all the pre-commit checks which showed that now with the language identification we can correctly label some code snippets from other tests. Those were updated in 814dc61

@vagenas
Collaborator

vagenas commented Oct 24, 2025

@bridgetmcg sounds good, now we still need to address the conflicts on uv.lock (not possible manually), which means:

  1. branch needs to get up-to-date with latest main
  2. uv.lock needs to be regenerated
  3. (if 1. is done by rebasing, a force-push would be needed)

Can you take care of these? Otherwise let me know and I could try to look into it.

@bridgetmcg bridgetmcg requested a review from vagenas October 24, 2025 13:17
bridgetmcg and others added 9 commits October 24, 2025 10:10
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 334811a

Signed-off-by: Bridget McGinn <[email protected]>
Co-authored-by: Panos Vagenas <[email protected]>
Signed-off-by: Bridget <[email protected]>
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 46bb88a
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 10e9ed8
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: d9827c7
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 814dc61

Signed-off-by: Bridget McGinn <[email protected]>
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: a4a21e9
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 0266c63
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 336dd6a
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 68890e9
I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 3c65eef

Signed-off-by: Bridget McGinn <[email protected]>
@bridgetmcg
Author

@vagenas I had to restrict the tree-sitter version for Python compatibility: tree-sitter > 0.24 requires Python 3.10+, and 0.23 requires all the tree-sitter language libraries to be < 0.24 as well.

3 participants