feat: add code chunking functionality #398

bridgetmcg · 2025-10-03T13:24:52Z

This PR introduces code chunking functionality to docling-core, enabling intelligent parsing and chunking of source code files across multiple programming languages. The implementation leverages tree-sitter for accurate parsing and provides language-specific chunkers for Python, TypeScript, JavaScript, Java, and C.

Features

Core

CodeChunker - Base abstract class for code chunking with Tree-sitter integration
Language-specific chunkers - Specialized implementations for 5 major programming languages
Smart chunk splitting - Automatic splitting of large functions while preserving context
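
To illustrate the idea of function-level chunking with context-preserving splits, here is a minimal sketch. It is not the PR's implementation: it uses Python's standard-library `ast` module in place of tree-sitter, handles Python source only, and the names `Chunk`, `chunk_python_source`, and `max_lines` are hypothetical.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str  # name of the function or class the chunk came from
    kind: str  # "function" or "class"
    text: str  # source text of the chunk


def chunk_python_source(source: str, max_lines: int = 20) -> list[Chunk]:
    """Emit one chunk per top-level function/class; split oversized ones by lines."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[Chunk] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            kind = "class"
        else:
            continue  # imports and module variables are ignored in this sketch
        body = lines[node.lineno - 1 : node.end_lineno]
        if len(body) <= max_lines:
            chunks.append(Chunk(node.name, kind, "\n".join(body)))
            continue
        # Oversized definition: repeat its first line in every piece as context.
        header, rest = body[0], body[1:]
        step = max(1, max_lines - 1)
        for start in range(0, len(rest), step):
            piece = [header] + rest[start : start + step]
            chunks.append(Chunk(node.name, kind, "\n".join(piece)))
    return chunks
```

Repeating the first line of the definition is only one possible way to preserve context; a real splitter would likely also carry the enclosing class, docstring, or imports.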

Language Support

Python (PythonFunctionChunker) - Functions, classes, imports, module variables
TypeScript (TypeScriptFunctionChunker) - Functions, classes, interfaces, imports
JavaScript (JavaScriptFunctionChunker) - Inherits from TypeScript chunker
Java (JavaFunctionChunker) - Methods, constructors, classes, enums, interfaces
C (CFunctionChunker) - Functions, structs, macros, preprocessor definitions

Testing

test_code_chunker.py - multi-language, real code samples

github-actions · 2025-10-03T13:25:03Z

✅ DCO Check Passed

Thanks @bridgetmcg, all your commits are properly signed off. 🎉

mergify · 2025-10-03T13:25:29Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
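
The title rule above can be checked locally before pushing. The sketch below applies the same regular expression with Python's `re` module; the helper name `is_conventional` is hypothetical, and it assumes Mergify's `~=` behaves like a regex search on the title.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the PR title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None
```

With this check, `feat: add code chunking functionality` passes, while the earlier title `Add Code Chunking Functionality` does not.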

dosubot · 2025-10-03T13:26:32Z

Related Documentation

Checked 2 published document(s). No updates required.

vagenas

Nice! Let me share some first thoughts, mostly on how these capabilities can be packaged and exposed:

I see the PR contains various language-specific chunkers, e.g. Java, Python etc.
My recommendation would be:

to encapsulate these capabilities under a single component, which would include the language detection inside it, and,
to ensure composability with existing chunkers, instead of introducing a new chunker, I'd rather provide this as a capability pluggable into the HierarchicalChunker (good fit because (1) it follows Item boundaries, so CodeItems can be nicely delegated, and (2) is itself composable into the HybridChunker). Let us still define the interface specifics, but it could look like an optional kwarg in HierarchicalChunker.chunk(), e.g. code_chunking_strategy, adhering to a matching interface.

A more minor comment is regarding CodeDocMeta, which I see inherits from BaseMeta. Some application code may expect to interact with DocMeta (also a BaseMeta child), so if possible we should best extend that.
This point goes hand in hand with the interface specifics TBD above.

(For now I would focus on the present PR & the points above — the idea of introducing a code backend as per the docling repo PR can be discussed at a second step.)
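
The recommendation above (one component that detects the language internally, plugged in through an optional kwarg) could look roughly like the following sketch. Everything here is an assumption for illustration: the interface was still to be defined at this point in the review, `detect_language` is a toy heuristic, and `chunk` is a stand-in for `HierarchicalChunker.chunk()` operating on plain `(label, text)` pairs.

```python
from typing import Callable, Dict, Iterable, Iterator, Optional, Protocol, Tuple


class CodeChunkingStrategy(Protocol):
    """Hypothetical interface; the actual signature was still to be defined."""

    def chunk_code(self, text: str) -> Iterator[str]: ...


def detect_language(text: str) -> str:
    """Toy heuristic standing in for real language detection."""
    if "#include" in text:
        return "c"
    if "def " in text:
        return "python"
    return "unknown"


class DispatchingCodeStrategy:
    """Single public component; per-language splitters stay behind it."""

    def __init__(self, splitters: Dict[str, Callable[[str], Iterable[str]]]):
        self._splitters = splitters

    def chunk_code(self, text: str) -> Iterator[str]:
        splitter = self._splitters.get(detect_language(text))
        if splitter is None:
            yield text  # unknown language: keep the item whole
        else:
            yield from splitter(text)


def chunk(
    items: Iterable[Tuple[str, str]],
    code_chunking_strategy: Optional[CodeChunkingStrategy] = None,
) -> Iterator[str]:
    """Stand-in for HierarchicalChunker.chunk(); items are (label, text) pairs."""
    for label, text in items:
        if label == "code" and code_chunking_strategy is not None:
            yield from code_chunking_strategy.chunk_code(text)
        else:
            yield text
```

Without a strategy, code items pass through unchanged, so existing callers would see no difference in behavior.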

bridgetmcg · 2025-10-10T16:38:42Z

hi @vagenas, I added some logic to address your suggestions.

vagenas

Hi @bridgetmcg, many thanks for the new iteration, incorporating feedback from above!

I only have a couple last points from my side (@dolfim-ibm I don't know if you want to add anything):

I see the actual integration meanwhile occurs via HierarchicalChunker -> DefaultCodeChunkingStrategy -> CodeChunkingStrategyFactory which returns a CodeChunker (subclass of BaseChunker). Since the chunker primitive should be runnable on any document, it could be confusing to expose a component that operates only on e.g. Python, as a "chunker". Ideally one could just satisfy the newly introduced interface/protocol (and not BaseChunker which may require additional points), but perhaps the fastest way to address this point is just to mark the CodeChunker class hierarchy "internal" by prepending with _. Then it's clear all these classes are implementation internals, and users only need to care about the strategy they can optionally pass to HierarchicalChunker.
the defined CodeChunkingStrategy(Protocol) does not seem to be used anywhere — shouldn't this be somewhere in the typing of field code_chunking_strategy within HierarchicalChunker?
to allow for extensibility we support the notion of customizable serializers; while this perhaps makes little sense in CodeItems, for consistency, in HierarchicalChunker we should still best use the result from doc_serializer (instead of just item.text)

Hope that makes sense. Otherwise we can also have a quick call to clarify & finalize.

dolfim-ibm · 2025-10-22T09:38:37Z

docling_core/transforms/chunker/hierarchical_chunker.py

+    meta: CodeDocMeta
+
+
+class ChunkType(str, Enum):


to avoid users thinking this is a generic ChunkType, we could rename the class to CodeChunkType.
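
For illustration, the rename could look like the sketch below. The member names are assumptions, since the diff excerpt above does not show the enum's values.

```python
from enum import Enum


class CodeChunkType(str, Enum):
    """Chunk categories specific to code; member names here are hypothetical."""

    FUNCTION = "function"
    CLASS = "class"
    PREAMBLE = "preamble"
```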

dolfim-ibm · 2025-10-22T09:54:26Z

Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

vagenas · 2025-10-22T11:20:59Z

> Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.

What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

dolfim-ibm · 2025-10-22T11:32:56Z

> > Looking at the discussion and the proposed code_chunking_strategy, I think we want to consider how this will scale to other types. For example, multiple chunking strategies for tables are a recurrent topic.
>
> What about the obvious way to expand this, i.e. adding an optional table_chunking_strategy to the HierarchicalChunker?

I was just looking up what we do in the serializers, there we are indeed using this approach. So let's go with it.

bridgetmcg · 2025-10-22T20:42:22Z

@vagenas I believe I addressed your comments! Let me know if not. Many thanks!

vagenas

@bridgetmcg I made some in-line comments incl. code suggestions.
Please also install the pre-commit hooks, so all checks are verified locally before pushing.
(E.g. I think some tests are still not up-to-date. FYI to generate the data, set env var DOCLING_GEN_TEST_DATA=1, e.g. DOCLING_GEN_TEST_DATA=1 uv run pytest)
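
The environment-variable switch mentioned above typically follows a pattern like the sketch below: when `DOCLING_GEN_TEST_DATA` is set, the test writes its output as the new expected data; otherwise it compares against the stored file. The helper name `check_against_expected` is hypothetical and the repository's own test utilities may differ.

```python
import os
from pathlib import Path


def check_against_expected(actual: str, expected_path: Path) -> None:
    """Regenerate the expected file if DOCLING_GEN_TEST_DATA is set, else compare."""
    if os.getenv("DOCLING_GEN_TEST_DATA"):
        expected_path.parent.mkdir(parents=True, exist_ok=True)
        expected_path.write_text(actual, encoding="utf-8")
        return
    expected = expected_path.read_text(encoding="utf-8")
    assert actual == expected, f"{expected_path} is out of date"
```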

pyproject.toml

docling_core/transforms/chunker/hierarchical_chunker.py

bridgetmcg · 2025-10-24T01:33:26Z

@vagenas I ran all the pre-commit checks, which showed that, with language identification in place, we can now correctly label some code snippets from other tests. Those were updated in 814dc61.

vagenas · 2025-10-24T06:33:57Z

@bridgetmcg sounds good, now we still need to address the conflicts on uv.lock (not possible manually), which means:

1. branch needs to get up-to-date with latest main
2. uv.lock needs to be regenerated

(if 1. is done by rebasing, a force-push would be needed)

Can you take care of these? Otherwise let me know and I could try to look into it.

bridgetmcg · 2025-10-24T18:09:03Z

@vagenas I had to restrict the tree-sitter versioning due to Python compatibility: tree-sitter > 0.24 requires Python 3.10+, and 0.23 requires all the tree-sitter language libraries to be < 0.24 as well.

bridgetmcg changed the title ~~Add Code Chunking Functionality~~ feat: add code chunking functionality Oct 3, 2025

bridgetmcg force-pushed the feat/code-chunking branch from 5c6fc1e to 32b120d Compare October 3, 2025 13:51

bridgetmcg mentioned this pull request Oct 3, 2025

feat: code chunking backend for docling docling-project/docling#2378

Open

3 tasks

vagenas reviewed Oct 7, 2025

vagenas reviewed Oct 22, 2025

dolfim-ibm reviewed Oct 22, 2025

vagenas reviewed Oct 23, 2025

bridgetmcg requested a review from vagenas October 24, 2025 13:17

bridgetmcg and others added 9 commits October 24, 2025 10:10

initial code chunking for docling-core

a4a21e9

DCO Remediation Commit for Bridget McGinn <[email protected]>

38ed69a

I, Bridget McGinn <[email protected]>, hereby add my Signed-off-by to this commit: 334811a Signed-off-by: Bridget McGinn <[email protected]>

include language detections, add code chunking into hierarchical chunker

0266c63

add serializer, internal marking of chunkers, typing

336dd6a

Update pyproject.toml

9344d8e

Co-authored-by: Panos Vagenas <[email protected]> Signed-off-by: Bridget <[email protected]>

Update docling_core/transforms/chunker/hierarchical_chunker.py

bed80ff

Co-authored-by: Panos Vagenas <[email protected]> Signed-off-by: Bridget <[email protected]>

run all pre-commit less pytest

68890e9

update test files for code ID

3c65eef

bridgetmcg force-pushed the feat/code-chunking branch from 5b8e4cc to 377d5ce Compare October 24, 2025 15:21

bridgetmcg added 2 commits October 24, 2025 11:23

update uv.lock

d12fbc6

Signed-off-by: Bridget McGinn <[email protected]>

revert to stricter treesitter versioning due to compatibility

b417cae

Signed-off-by: Bridget McGinn <[email protected]>

bridgetmcg force-pushed the feat/code-chunking branch from 68bbf44 to b417cae Compare October 24, 2025 18:03
