-
Notifications
You must be signed in to change notification settings - Fork 184
feat: 20882: redesign the validator for validate billion-entry states #22215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Nikita Lebedev <[email protected]>
✅ Snyk checks have passed. No issues have been found so far.
💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse. |
| private void openStreams() { | ||
| var channelStream = Channels.newInputStream(channel); | ||
| // 16777216 took from MerkleDbConfig.iteratorInputBufferBytes | ||
| this.bufferedInputStream = new BufferedInputStream(channelStream, 16777216); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
16M buffer looks redundant. Most our data items don't exceed a few kilobytes, 128K or 256K buffer should be enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
| private long findBoundaryOffset() throws IOException { | ||
| // Use buffer to minimize disk I/O and channel repositioning | ||
| // It should account for boundary + full data item to validate its proto schema | ||
| // 16777216 took from MerkleDbConfig.iteratorInputBufferBytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed
| int tag = bufferData.readVarInt(false); | ||
| int fieldNum = tag >> TAG_FIELD_OFFSET; | ||
|
|
||
| if (fieldNum == FIELD_DATAFILE_ITEMS.number()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Field wire type should be checked, too. For all data items we store in MerkleDb, wire type is ProtoConstants.WIRE_TYPE_DELIMITED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, addressed
| long dataStartPosition = bufferData.position(); | ||
|
|
||
| if (dataItemSize > 0) { | ||
| bufferData.limit(dataStartPosition + dataItemSize); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Data item size should be checked against buffer limits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, addressed
| private long currentDataItemFilePosition; | ||
| private boolean closed = false; | ||
|
|
||
| private long boundaryOffset = 0L; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks redundant. startByte can be reused for this purpose, after the field boundary is identified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, addressed
| } | ||
|
|
||
| while (in.hasRemaining()) { | ||
| currentDataItemFilePosition = startByte + boundaryOffset + in.position(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look correct. in is a BufferedData on top of bufferedInputStream, which is a stream on top of the file channel. It means, in.position() is the current position in the file, no need to add startByte or boundaryOffset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Channels.newInputStream(channel) creates a stream that reads starting from the channel's current position (which we set to startByte). The in wrapper (via BufferedInputStream) tracks the position relative to the start of that stream (starting at 0), not the absolute file position. Therefore, adding startByte is necessary to calculate the correct offset in the file.
| @Positive @ConfigProperty(defaultValue = "1000000000") long initialCapacity, | ||
| @Positive @ConfigProperty(defaultValue = "4000000000") long maxNumOfKeys, | ||
| @Min(0) @ConfigProperty(defaultValue = "8388608") long hashesRamToDiskThreshold, | ||
| @Min(0) @ConfigProperty(defaultValue = "0") long hashesRamToDiskThreshold, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not forget to revert this change after you're done with testing/debugging
Signed-off-by: Nikita Lebedev <[email protected]>
Signed-off-by: Nikita Lebedev <[email protected]>
Coverage summary from CodacySee diff coverage on Codacy
Coverage variation details
Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: Diff coverage details
Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: See your quality gate settings Change summary preferences |
Codecov Report✅ All modified and coverable lines are covered by tests. @@ Coverage Diff @@
## main #22215 +/- ##
============================================
- Coverage 70.82% 70.80% -0.02%
Complexity 24384 24384
============================================
Files 2667 2667
Lines 104184 104184
Branches 10941 10941
============================================
- Hits 73785 73772 -13
- Misses 26363 26367 +4
- Partials 4036 4045 +9
... and 17 files with indirect coverage changes 🚀 New features to boost your workflow:
|
WIP
Fixes #20882