feat: 20882: redesign the validator for validate billion-entry states #22215

thenswan · 2025-11-18T07:12:05Z

WIP

Fixes #20882

Signed-off-by: Nikita Lebedev <[email protected]>

lfdt-bot · 2025-11-18T07:12:22Z

✅ Snyk checks have passed. No issues have been found so far.

| Status | Scanner | Critical | High | Medium | Low | Total (0) |
|---|---|---|---|---|---|---|
| ✅ | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |

artemananiev · 2025-11-19T00:02:06Z

hedera-state-validator/src/main/java/com/hedera/statevalidation/poc/ChunkedFileIterator.java

+    private void openStreams() {
+        var channelStream = Channels.newInputStream(channel);
+        // 16777216 took from MerkleDbConfig.iteratorInputBufferBytes
+        this.bufferedInputStream = new BufferedInputStream(channelStream, 16777216);


A 16 MB buffer looks excessive. Most of our data items don't exceed a few kilobytes, so a 128 KB or 256 KB buffer should be enough.

artemananiev · 2025-11-19T00:09:12Z

hedera-state-validator/src/main/java/com/hedera/statevalidation/poc/ChunkedFileIterator.java

+    private long findBoundaryOffset() throws IOException {
+        // Use buffer to minimize disk I/O and channel repositioning
+        // It should account for boundary + full data item to validate its proto schema
+        // 16777216 took from MerkleDbConfig.iteratorInputBufferBytes


artemananiev · 2025-11-19T00:10:10Z

hedera-state-validator/src/main/java/com/hedera/statevalidation/poc/ChunkedFileIterator.java

+                int tag = bufferData.readVarInt(false);
+                int fieldNum = tag >> TAG_FIELD_OFFSET;
+
+                if (fieldNum == FIELD_DATAFILE_ITEMS.number()) {


The field's wire type should be checked, too. For all data items we store in MerkleDb, the wire type is `ProtoConstants.WIRE_TYPE_DELIMITED`.

Agree, addressed
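
For illustration, a minimal sketch of the combined check, not the code from this PR. It relies only on the standard protobuf tag layout, `(fieldNumber << 3) | wireType`, where wire type 2 is length-delimited. The class name, the local constants (stand-ins for `ProtoConstants.WIRE_TYPE_DELIMITED` and `FIELD_DATAFILE_ITEMS.number()`), and the field number 11 are hypothetical.

```java
final class TagCheckSketch {
    // A protobuf tag packs the field number above the low three wire-type bits.
    static final int TAG_FIELD_OFFSET = 3;
    static final int WIRE_TYPE_MASK = 0b111;
    // Local stand-in: wire type 2 means length-delimited in the protobuf encoding.
    static final int WIRE_TYPE_DELIMITED = 2;

    /** Accepts a tag only when both the field number and the wire type match. */
    static boolean isDelimitedField(int tag, int expectedFieldNum) {
        int fieldNum = tag >>> TAG_FIELD_OFFSET;
        int wireType = tag & WIRE_TYPE_MASK;
        return fieldNum == expectedFieldNum && wireType == WIRE_TYPE_DELIMITED;
    }

    public static void main(String[] args) {
        int itemsField = 11; // hypothetical field number, for illustration only
        System.out.println(isDelimitedField((itemsField << 3) | 2, itemsField)); // true
        System.out.println(isDelimitedField((itemsField << 3) | 0, itemsField)); // false
    }
}
```

Checking the wire type matters when scanning for a boundary at an arbitrary file offset: random bytes can decode to the right field number while carrying a wire type that a real data item would never have.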

artemananiev · 2025-11-19T00:10:55Z

...ate-validator/src/main/java/com/hedera/statevalidation/poc/pipeline/ChunkedFileIterator.java

+                    long dataStartPosition = bufferData.position();
+
+                    if (dataItemSize > 0) {
+                        bufferData.limit(dataStartPosition + dataItemSize);


Data item size should be checked against buffer limits

Agree, addressed
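
One way such a check could look, sketched with `java.nio.ByteBuffer` as a stand-in for the `BufferedData` type used in the PR. The class and method names are hypothetical; the point is only that the decoded size is compared with the bytes actually remaining before the limit is moved.

```java
import java.nio.ByteBuffer;

final class ItemSizeCheckSketch {
    /** True only if an item of this size fits between the buffer's position and its limit. */
    static boolean fitsInBuffer(ByteBuffer buffer, long dataItemSize) {
        return dataItemSize > 0 && dataItemSize <= buffer.remaining();
    }

    /** Returns a view over the next item, refusing sizes that would run past the buffer. */
    static ByteBuffer sliceItem(ByteBuffer buffer, int dataItemSize) {
        if (!fitsInBuffer(buffer, dataItemSize)) {
            throw new IllegalArgumentException("Data item size " + dataItemSize
                    + " does not fit in the remaining " + buffer.remaining() + " bytes");
        }
        ByteBuffer item = buffer.slice(); // view starting at the current position
        item.limit(dataItemSize);         // safe: size was validated above
        return item;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.position(4); // 12 bytes remain
        System.out.println(fitsInBuffer(buffer, 12)); // true
        System.out.println(fitsInBuffer(buffer, 13)); // false
    }
}
```

Without the check, a size decoded from a false boundary match could push the limit past the end of the buffer and fail with a less informative exception.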

artemananiev · 2025-11-19T00:16:08Z

hedera-state-validator/src/main/java/com/hedera/statevalidation/poc/ChunkedFileIterator.java

+    private long currentDataItemFilePosition;
+    private boolean closed = false;
+
+    private long boundaryOffset = 0L;


This looks redundant. `startByte` can be reused for this purpose once the field boundary is identified.

Agree, addressed

artemananiev · 2025-11-19T00:21:27Z

hedera-state-validator/src/main/java/com/hedera/statevalidation/poc/ChunkedFileIterator.java

+        }
+
+        while (in.hasRemaining()) {
+            currentDataItemFilePosition = startByte + boundaryOffset + in.position();


This doesn't look correct. `in` is a `BufferedData` on top of `bufferedInputStream`, which is a stream on top of the file channel. That means `in.position()` is the current position in the file, so there is no need to add `startByte` or `boundaryOffset`.

`Channels.newInputStream(channel)` creates a stream that reads starting from the channel's current position (which we set to `startByte`). The `in` wrapper (via `BufferedInputStream`) tracks the position relative to the start of that stream (starting at 0), not the absolute file position. Therefore, adding `startByte` is necessary to calculate the correct offset in the file.
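
The JDK behavior described in the reply can be seen in a small self-contained sketch. It uses only `java.nio` and `java.io`, not `BufferedData`, so it illustrates the stream semantics rather than the PR's code. The class name, the temporary file, and the 256 KB buffer size are assumptions made for the demo.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class StreamOffsetSketch {
    /** Returns {bytes consumed from the stream, value of the next byte read}. */
    static int[] demo() {
        try {
            Path file = Files.createTempFile("offset-sketch", ".bin");
            try {
                // Each byte's value equals its absolute offset in the file.
                byte[] content = new byte[100];
                for (int i = 0; i < content.length; i++) {
                    content[i] = (byte) i;
                }
                Files.write(file, content);

                long startByte = 40;
                try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
                    channel.position(startByte);
                    // The stream starts reading at the channel's current position.
                    InputStream in = new BufferedInputStream(Channels.newInputStream(channel), 256 * 1024);
                    int streamRelative = in.readNBytes(5).length; // consumed since the stream start
                    int value = in.read(); // byte at absolute offset startByte + streamRelative
                    return new int[] {streamRelative, value};
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] result = demo();
        System.out.println("stream-relative position: " + result[0]); // 5
        System.out.println("byte value (absolute offset): " + result[1]); // 45
    }
}
```

The stream has consumed 5 bytes, yet the next byte it returns is the one at file offset 45, which is `startByte` plus the stream-relative count.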

artemananiev · 2025-11-19T18:27:07Z

platform-sdk/swirlds-merkledb/src/main/java/com/swirlds/merkledb/config/MerkleDbConfig.java

        @Positive @ConfigProperty(defaultValue = "1000000000") long initialCapacity,
        @Positive @ConfigProperty(defaultValue = "4000000000") long maxNumOfKeys,
-        @Min(0) @ConfigProperty(defaultValue = "8388608") long hashesRamToDiskThreshold,
+        @Min(0) @ConfigProperty(defaultValue = "0") long hashesRamToDiskThreshold,


Do not forget to revert this change after you're done with testing/debugging

Signed-off-by: Nikita Lebedev <[email protected]>

codacy-production · 2025-12-02T09:42:34Z

Coverage summary from Codacy

| Coverage variation | Diff coverage |
|---|---|
| ✅ +0.00% (target: -1.00%) | ✅ 100.00% |

Coverage variation details

| | Coverable lines | Covered lines | Coverage |
|---|---|---|---|
| Common ancestor commit (`3adca96`) | 104089 | 77776 | 74.72% |
| Head commit (`a5be4d1`) | 104089 (+0) | 77772 (-4) | 74.72% (+0.00%) |

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

| | Coverable lines | Covered lines | Diff coverage |
|---|---|---|---|
| Pull request (#22215) | 2 | 2 | 100.00% |

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%

codecov · 2025-12-02T09:42:47Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main   #22215      +/-   ##
============================================
- Coverage     70.82%   70.80%   -0.02%     
  Complexity    24384    24384              
============================================
  Files          2667     2667              
  Lines        104184   104184              
  Branches      10941    10941              
============================================
- Hits          73785    73772      -13     
- Misses        26363    26367       +4     
- Partials       4036     4045       +9

| Files with missing lines | Coverage Δ | Complexity Δ |
|---|---|---|
| ...va/com/swirlds/merkledb/config/MerkleDbConfig.java | `92.30% <ø> (ø)` | `5.00 <0.00> (ø)` |
| ...ava/com/swirlds/merkledb/files/DataFileCommon.java | `65.60% <100.00%> (ø)` | `28.00 <0.00> (ø)` |

... and 17 files with indirect coverage changes

poc

ff4b126

Signed-off-by: Nikita Lebedev <[email protected]>

thenswan added this to the v0.69 milestone Nov 18, 2025

thenswan requested review from artemananiev and imalygin November 18, 2025 07:12

thenswan self-assigned this Nov 18, 2025

thenswan added the Hedera State Operator label Nov 18, 2025

thenswan added this to Foundation Team Nov 18, 2025

artemananiev reviewed Nov 19, 2025

thenswan added 2 commits December 2, 2025 10:52

chunked parallel reading

3cb224d

Signed-off-by: Nikita Lebedev <[email protected]>

add validators which can run in parallel

a5be4d1

Signed-off-by: Nikita Lebedev <[email protected]>

thenswan force-pushed the 20882-poc-2 branch from 40cc6dd to a5be4d1 Compare December 2, 2025 08:52

thenswan changed the title ~~feat: 20882: poc 2~~ feat: 20882: redesign the validator for validate billion-entry states Dec 2, 2025


5 participants
