Conversation

@thenswan (Contributor) commented Nov 18, 2025

WIP

Fixes #20882

Signed-off-by: Nikita Lebedev <[email protected]>
@thenswan thenswan added this to the v0.69 milestone Nov 18, 2025
@thenswan thenswan self-assigned this Nov 18, 2025
@thenswan thenswan added the Hedera State Operator Issues related to the hedera state operator label Nov 18, 2025
@lfdt-bot commented Nov 18, 2025

Snyk checks have passed. No issues have been found so far.

Status  Scanner               Critical  High  Medium  Low  Total
passed  Open Source Security  0         0     0       0    0 issues


private void openStreams() {
var channelStream = Channels.newInputStream(channel);
// 16777216 taken from MerkleDbConfig.iteratorInputBufferBytes
this.bufferedInputStream = new BufferedInputStream(channelStream, 16777216);
Contributor:
The 16 MB buffer looks redundant. Most of our data items don't exceed a few kilobytes; a 128 KB or 256 KB buffer should be enough.

Contributor (Author):

Addressed
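The suggested change can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: the class name, the temp-file setup, and the exact buffer constant are assumptions; only the idea of passing a smaller buffer size to `BufferedInputStream` over `Channels.newInputStream(channel)` comes from the thread.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BufferSizeSketch {
    // 256 KiB is plenty when individual data items are a few KiB,
    // versus the 16_777_216 bytes copied from MerkleDbConfig.iteratorInputBufferBytes.
    static final int BUFFER_SIZE = 256 * 1024;

    public static void main(String[] args) throws IOException {
        // Stand-in for the real data file channel in the PR.
        Path tmp = Files.createTempFile("items", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3, 4});
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            InputStream channelStream = Channels.newInputStream(channel);
            BufferedInputStream in = new BufferedInputStream(channelStream, BUFFER_SIZE);
            System.out.println(in.read()); // first byte of the file: 1
        } finally {
            Files.delete(tmp);
        }
    }
}
```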

private long findBoundaryOffset() throws IOException {
// Use buffer to minimize disk I/O and channel repositioning
// It should account for boundary + full data item to validate its proto schema
// 16777216 taken from MerkleDbConfig.iteratorInputBufferBytes
Contributor:

Same here

Contributor (Author):

Addressed

int tag = bufferData.readVarInt(false);
int fieldNum = tag >> TAG_FIELD_OFFSET;

if (fieldNum == FIELD_DATAFILE_ITEMS.number()) {
Contributor:

The field wire type should be checked, too. For all data items we store in MerkleDb, the wire type is ProtoConstants.WIRE_TYPE_DELIMITED.

Contributor (Author):

Agree, addressed
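The point about checking the wire type can be sketched like this. In the standard protobuf wire format the low 3 bits of a tag carry the wire type and the rest carry the field number, which matches the `tag >> TAG_FIELD_OFFSET` in the snippet above; the class name, field number 11, and helper method here are hypothetical.

```java
public class TagCheckSketch {
    static final int TAG_FIELD_OFFSET = 3;    // low 3 bits of a protobuf tag are the wire type
    static final int WIRE_TYPE_DELIMITED = 2; // length-delimited, as ProtoConstants.WIRE_TYPE_DELIMITED

    // Accept a tag only when both the field number and the wire type match.
    static boolean isExpectedTag(int tag, int expectedFieldNum) {
        int fieldNum = tag >> TAG_FIELD_OFFSET;
        int wireType = tag & ((1 << TAG_FIELD_OFFSET) - 1);
        return fieldNum == expectedFieldNum && wireType == WIRE_TYPE_DELIMITED;
    }

    public static void main(String[] args) {
        int delimitedTag = (11 << 3) | 2; // field 11, wire type 2 (delimited)
        int varintTag = (11 << 3);        // field 11, wire type 0 (varint) -- must be rejected
        System.out.println(isExpectedTag(delimitedTag, 11)); // true
        System.out.println(isExpectedTag(varintTag, 11));    // false
    }
}
```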

long dataStartPosition = bufferData.position();

if (dataItemSize > 0) {
bufferData.limit(dataStartPosition + dataItemSize);
Contributor:

Data item size should be checked against buffer limits

Contributor (Author):

Agree, addressed
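A minimal sketch of the size check being discussed, so a corrupt varint cannot push the read window past the buffer. The guard shape is an assumption (the PR's actual validation is not shown in the thread); it only illustrates bounding `dataStartPosition + dataItemSize` by the buffer's limit before calling `bufferData.limit(...)`.

```java
public class SizeCheckSketch {
    // Hypothetical guard: reject sizes that are non-positive or that would
    // extend the read window past the buffer's limit.
    static boolean isValidItemSize(long dataStartPosition, long dataItemSize, long bufferLimit) {
        return dataItemSize > 0 && dataStartPosition + dataItemSize <= bufferLimit;
    }

    public static void main(String[] args) {
        System.out.println(isValidItemSize(100, 512, 16_384));       // true: item fits in the buffer
        System.out.println(isValidItemSize(100, 1_000_000, 16_384)); // false: runs past the limit
        System.out.println(isValidItemSize(100, -5, 16_384));        // false: corrupt negative size
    }
}
```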

private long currentDataItemFilePosition;
private boolean closed = false;

private long boundaryOffset = 0L;
Contributor:

This looks redundant. startByte can be reused for this purpose, after the field boundary is identified

Contributor (Author):

Agree, addressed
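The suggested refactor can be sketched as folding the boundary into startByte once it is found, instead of carrying a separate boundaryOffset field. The class and method names here are invented for illustration; only the idea of reusing startByte comes from the comment.

```java
public class StartByteReuseSketch {
    private long startByte;

    StartByteReuseSketch(long startByte) {
        this.startByte = startByte;
    }

    // Once the field boundary is identified, advance startByte past it,
    // so later offset math needs only one field.
    void onBoundaryFound(long boundaryOffset) {
        startByte += boundaryOffset;
    }

    long absolutePosition(long relativePosition) {
        return startByte + relativePosition;
    }

    public static void main(String[] args) {
        StartByteReuseSketch v = new StartByteReuseSketch(1_000);
        v.onBoundaryFound(24);
        System.out.println(v.absolutePosition(8)); // 1000 + 24 + 8 = 1032
    }
}
```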

}

while (in.hasRemaining()) {
currentDataItemFilePosition = startByte + boundaryOffset + in.position();
Contributor:

This doesn't look correct. in is a BufferedData on top of bufferedInputStream, which is a stream on top of the file channel. It means, in.position() is the current position in the file, no need to add startByte or boundaryOffset

Contributor (Author):

Channels.newInputStream(channel) creates a stream that reads starting from the channel's current position (which we set to startByte). The in wrapper (via BufferedInputStream) tracks the position relative to the start of that stream (starting at 0), not the absolute file position. Therefore, adding startByte is necessary to calculate the correct offset in the file.
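The author's point can be demonstrated in isolation: a stream created by Channels.newInputStream over a repositioned channel starts at offset 0 relative to startByte, so the absolute file offset is startByte plus the stream-relative position. The class name, file contents, and startByte value below are made up for the demo.

```java
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class AbsoluteOffsetSketch {
    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("data", ".bin");
        Files.write(tmp, new byte[] {10, 11, 12, 13, 14, 15});
        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            long startByte = 4;
            channel.position(startByte); // reposition before wrapping, as in the PR
            InputStream in = Channels.newInputStream(channel);
            // The first byte the stream sees is the file byte at startByte.
            int b = in.read();
            long relativePosition = 1;   // stream-relative position after one read
            System.out.println(b);                            // 14, i.e. file[startByte]
            System.out.println(startByte + relativePosition); // absolute file offset: 5
        } finally {
            Files.delete(tmp);
        }
    }
}
```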

@Positive @ConfigProperty(defaultValue = "1000000000") long initialCapacity,
@Positive @ConfigProperty(defaultValue = "4000000000") long maxNumOfKeys,
@Min(0) @ConfigProperty(defaultValue = "8388608") long hashesRamToDiskThreshold,
@Min(0) @ConfigProperty(defaultValue = "0") long hashesRamToDiskThreshold,
Contributor:

Do not forget to revert this change after you're done with testing/debugging

@thenswan thenswan changed the title feat: 20882: poc 2 feat: 20882: redesign the validator for validate billion-entry states Dec 2, 2025
@codacy-production

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: +0.00% (target: -1.00%)
Diff coverage: 100.00%

Coverage variation details:
                                   Coverable lines  Covered lines  Coverage
Common ancestor commit (3adca96)   104089           77776          74.72%
Head commit (a5be4d1)              104089 (+0)      77772 (-4)     74.72% (+0.00%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details:
                        Coverable lines  Covered lines  Diff coverage
Pull request (#22215)   2                2              100.00%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


@codecov

codecov bot commented Dec 2, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.

Impacted file tree graph

@@             Coverage Diff              @@
##               main   #22215      +/-   ##
============================================
- Coverage     70.82%   70.80%   -0.02%     
  Complexity    24384    24384              
============================================
  Files          2667     2667              
  Lines        104184   104184              
  Branches      10941    10941              
============================================
- Hits          73785    73772      -13     
- Misses        26363    26367       +4     
- Partials       4036     4045       +9     
Files with missing lines Coverage Δ Complexity Δ
...va/com/swirlds/merkledb/config/MerkleDbConfig.java 92.30% <ø> (ø) 5.00 <0.00> (ø)
...ava/com/swirlds/merkledb/files/DataFileCommon.java 65.60% <100.00%> (ø) 28.00 <0.00> (ø)

... and 17 files with indirect coverage changes




Labels

Hedera State Operator Issues related to the hedera state operator


Development

Successfully merging this pull request may close these issues.

State Operator: redesign the validator for validate billion-entry states

5 participants