Data, Flink, Spark: Use TestHelpers for FormatVersion #13880
Conversation
```diff
  List<Object> parameters = Lists.newArrayList();
  for (Boolean isStreamingMode : new Boolean[] {true, false}) {
-   for (int formatVersion : new int[] {1, 2}) {
+   for (int formatVersion : TestHelpers.ALL_VERSIONS) {
```
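For reference, a minimal sketch of what such centralized constants could look like; the actual fields in Iceberg's TestHelpers may well be typed and populated differently:

```java
import java.util.stream.IntStream;

// Hypothetical sketch only -- the real org.apache.iceberg.TestHelpers may define
// these differently (type, values, and the maximum version are assumptions here).
public class TestHelpers {
  public static final int MAX_FORMAT_VERSION = 4;

  // every supported table format version
  public static final int[] ALL_VERSIONS = IntStream.rangeClosed(1, MAX_FORMAT_VERSION).toArray();

  // versions from 2 upward, for tests that need features V1 lacks
  public static final int[] V2_AND_ABOVE = IntStream.rangeClosed(2, MAX_FORMAT_VERSION).toArray();
}
```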
Someone who is a Flink expert (maybe @stevenzwu or @pvary), let me know whether it was intentional that this test doesn't check all versions.
Here it is not intentional.
```diff
      new FileFormat[] {FileFormat.AVRO, FileFormat.ORC, FileFormat.PARQUET}) {
    for (Object[] catalogParams : CatalogTestBase.parameters()) {
-     for (int version : Arrays.asList(2, 3)) {
+     for (int version : TestHelpers.V2_AND_ABOVE) {
```
Same here
Here we have tests like testRewriteNoConflictWithEqualityDeletes which will not work with V1.
I think @nastra intentionally only added V2 and V3 - maybe adding V4 would be nice.
```diff
  List<Object> parameters = Lists.newArrayList();
  for (Boolean isStreamingMode : new Boolean[] {true, false}) {
-   for (int formatVersion : new int[] {1, 2}) {
+   for (int formatVersion : TestHelpers.ALL_VERSIONS) {
```
and here
```java
      new Object[] {FileFormat.PARQUET, 3, false, PlanningMode.DISTRIBUTED},
      new Object[] {FileFormat.PARQUET, 3, true, PlanningMode.LOCAL},
    };
    List<Object[]> parameters = Lists.newArrayList();
```
We had a really weird pattern of what we tested here before; I mimicked it below, only testing V4 with Parquet.
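A rough sketch of the parameter-building pattern being described; the combinations and names below are illustrative, not the actual test code:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.iceberg.FileFormat;

// Illustrative only: older versions run against every file format, while V4
// (assumed to be the newest version) is exercised with Parquet alone,
// mirroring the pre-existing "weird" pattern in this test.
static Object[][] parameters() {
  List<Object[]> parameters = new ArrayList<>();
  for (int version : TestHelpers.V2_AND_ABOVE) {
    FileFormat[] formats =
        version >= 4
            ? new FileFormat[] {FileFormat.PARQUET}
            : new FileFormat[] {FileFormat.AVRO, FileFormat.ORC, FileFormat.PARQUET};
    for (FileFormat format : formats) {
      parameters.add(new Object[] {format, version});
    }
  }
  return parameters.toArray(new Object[0][]);
}
```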
stevenzwu left a comment
This looks good to me once the CI failure is fixed. Thanks for doing this. With these examples, hopefully new tests will be able to follow the same pattern.
It is hard to be 100% accurate because they are just plain numbers (not an enum).
Yep, let me keep fixing the CI issues; there are so many configs that I've been leaning on the automated CI to find any remaining bugs.
```java
      parameters.add(new Object[] {FileFormat.ORC, false, version});
      parameters.add(new Object[] {FileFormat.ORC, true, version});
    }
    return parameters.toArray(new Object[0][]);
```
Why are we only updating v3.4? What about v4.0 and v3.5?
Ah, I missed that. Sorry, I've been going back and forth between build configs!
I couldn't figure out exactly what the memory leak in our test suite is that's causing an issue, but it seems to be related to task statuses never getting cleared from the Spark context during the TestRewriteDataFilesAction test suite. Because the suite now runs with an additional config, the number of tasks increased dramatically, and I believe this was the root cause of the OOM. I tried disabling the UI, but that didn't seem to help; the statuses still stuck around.

So I decided to take a different tack and optimize the test suite instead. The main thing I did was go through all of the "Spark sorts" and switch them to normal Java collection sorts. This has two outcomes. First, the test suite runs much faster, since adaptive shuffle is disabled for this suite, each sort had to do 200 tasks, and a local sort is much faster than the Spark mechanism. Second, the number of tasks is reduced dramatically, which decreases the number of "Task Status" objects that hang around. If this still ends up being an issue in the future, we can either track down the status issue or move some of these tests into a different test suite.
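A hedged sketch of the kind of sort conversion described above; it assumes the suite's existing SparkSession and table name, and the sort columns are illustrative:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Illustrative only: rather than Dataset.sort(...), which launches a shuffle
// (roughly 200 tasks per sort with adaptive shuffle disabled here), collect
// the rows once and sort them locally in the driver JVM.
static List<Row> readSortedLocally(SparkSession spark, String tableName) {
  List<Row> rows =
      new ArrayList<>(spark.read().format("iceberg").load(tableName).collectAsList());
  rows.sort(
      Comparator.comparing((Row row) -> row.getInt(0)) // first sort key (assumed int column)
          .thenComparing(row -> row.getString(1))); // second sort key (assumed string column)
  return rows;
}
```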
OK, apparently that wasn't enough. I fixed it locally, but on Gradle I'm still getting an "Exited with Return Code 52" error from the workflow.
OK, the issue is with direct memory, so now I have to track down memory usage in our Arrow readers.
My current understanding is that, at least within our compaction code (and the rewrite delete files action code), we are leaking memory. The biggest contributor is TestCombineMixedFiles (probably because of the amount of data?). When I added additional test versions the leak just got worse. When I removed the Spark shuffles I reduced the leakage somewhat, but that just pushed out the OOM, since the memory is never released. My current targets are: … My guess is that these are the culprits because we see this happen a ton during compaction but not really anywhere else in the test suite. Tomorrow I'm going to disable our vectorized read path; if that fixes the issue then I'll have narrowed it down a bit more.
Test for the memory leak: still working on nailing down where this is happening, but it's unrelated to the test parameterization.

Less complicated repro: this test will OOM after 40 iterations (default off-heap is ~4 GB).
I was wrong; it's more like the last column in the last reader of the last task allocates more memory than it frees.
I figured it out. In VectorizedArrowReader, allocateFieldVector assumes that all pages will have the same encoding:

```java
private void allocateFieldVector(boolean dictionaryEncodedVector) {
  if (dictionaryEncodedVector) {
    allocateDictEncodedVector();
  } else {
    Field arrowField = ArrowSchemaUtil.convert(getPhysicalType(columnDescriptor, icebergField));
    if (columnDescriptor.getPrimitiveType().getOriginalType() != null) {
      allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField);
    } else {
      allocateVectorBasedOnTypeName(columnDescriptor.getPrimitiveType(), arrowField);
    }
  }
}
```

This is a big problem if the first page is dictionary encoded and the following ones are not. The first pass through this function calls allocateDictEncodedVector(), which does this:

```java
this.vec = field.createVector(rootAlloc);
((IntVector) vec).allocateNew(batchSize);
```

But what happens if we then read a non-dictionary-encoded page? We go down the other path, allocateVectorBasedOnOriginalType, and hit this:

```java
switch (primitive.getOriginalType()) {
  case ENUM:
  case JSON:
  case UTF8:
  case BSON:
    this.vec = arrowField.createVector(rootAlloc);
    // TODO: Possibly use the uncompressed page size info to set the initial capacity
    vec.setInitialCapacity(batchSize * AVERAGE_VARIABLE_WIDTH_RECORD_SIZE);
    vec.allocateNewSafe();
    this.readType = ReadType.VARCHAR;
    this.typeWidth = UNKNOWN_WIDTH;
    break;
```

This creates a new vector for this.vec without releasing the old one. It is easy enough to fix: in both of these functions we just need to clear out this.vec if it is already set.

So in the case above we have dictionary-encoded pages which allocate IntVectors that are then replaced with BaseVarWidthVectors for non-encoded pages. We drop the previous vector and its allocation with each swap, and the more pages there are, the worse it gets.
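For illustration, a minimal sketch of that fix; the field and helper names mirror the snippet above, and this is not the final patch:

```java
// Sketch only: release any vector carried over from a page with a different
// encoding before allocating a new one, so its buffers return to the allocator.
private void allocateFieldVector(boolean dictionaryEncodedVector) {
  if (this.vec != null) {
    this.vec.close(); // frees the Arrow buffers held by the previously allocated vector
    this.vec = null;
  }
  if (dictionaryEncodedVector) {
    allocateDictEncodedVector();
  } else {
    Field arrowField = ArrowSchemaUtil.convert(getPhysicalType(columnDescriptor, icebergField));
    if (columnDescriptor.getPrimitiveType().getOriginalType() != null) {
      allocateVectorBasedOnOriginalType(columnDescriptor.getPrimitiveType(), arrowField);
    } else {
      allocateVectorBasedOnTypeName(columnDescriptor.getPrimitiveType(), arrowField);
    }
  }
}
```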
CC: @nandorKollar |
Why did the type of the vector change from IntVectors to BaseVarWidthVectors? If we clear out this.vec when it is set, wouldn't this type change in the vector cause problems? Shouldn't we explicitly close the previous vector?
The vector changes because dictionary-encoded pages are a sequence of ints, {1, 2, 3, 4}, that refer to entries in the dictionary, which maps each int to the actual column value: {1: "foo", 2: "bar", ...}. Other pages store literal representations of the values as binary: {foo, bar, bazz}. So you have to switch vector types when you alternate.
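A small, self-contained illustration of that distinction using Arrow directly (the vector names and values here are made up):

```java
import java.nio.charset.StandardCharsets;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VarCharVector;

public class PageEncodingExample {
  public static void main(String[] args) {
    try (BufferAllocator alloc = new RootAllocator();
        IntVector dictIds = new IntVector("dict-ids", alloc);
        VarCharVector literals = new VarCharVector("literals", alloc)) {
      // A dictionary-encoded page holds small ints that index into the dictionary
      // (e.g. 1 -> "foo", 2 -> "bar"), so it is read into an int-typed vector.
      dictIds.allocateNew(4);
      dictIds.set(0, 1);
      dictIds.set(1, 2);
      dictIds.set(2, 1);
      dictIds.set(3, 2);
      dictIds.setValueCount(4);

      // A plain page stores the literal bytes of each value, so it needs a
      // variable-width vector instead.
      literals.allocateNew();
      literals.setSafe(0, "foo".getBytes(StandardCharsets.UTF_8));
      literals.setSafe(1, "bar".getBytes(StandardCharsets.UTF_8));
      literals.setValueCount(2);
    }
  }
}
```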
No. To be clear, the code has always cleared out this.vec, and we don't have correctness issues, because essentially what is happening is that the old vector is simply dropped and a new vector of the right type is allocated in its place. What is missing here is releasing the old vector's memory before it is dropped.
Thanks for clarifying why the type change happens, makes sense. We can reuse the vector except when there's a switch from/to dictionary-encoded pages, right? When you mention that it is always cleared, you mean the value count being set to 0 in this block: …
Previously it would just completely drop the reference to the previous vec, so it didn't matter. (That's the leak.) I'll raise a new PR with a fix. Also, fun fact: we already have a test THAT WOULD FAIL if we were actually tracking memory usage. See TestParquetDictionaryEncodedVectorizedReads.testMixedDictionaryNonDictionaryReads. One sec and I'll have a new PR up to show this.
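As an aside on what "actually tracking memory usage" can look like: Arrow's allocator already does the accounting, so a test can assert that nothing is left allocated once readers are closed. A tiny hypothetical example, not the Iceberg test mentioned above:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class AllocatorAccountingExample {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator()) {
      IntVector vec = new IntVector("ids", allocator);
      vec.allocateNew(1024);
      // ... a reader would fill and consume the vector here ...
      vec.close(); // comment this out and allocator.close() fails with a leak error

      // the allocator tracks outstanding buffers; 0 means nothing leaked
      if (allocator.getAllocatedMemory() != 0) {
        throw new IllegalStateException(
            "Leaked " + allocator.getAllocatedMemory() + " bytes of direct memory");
      }
    }
  }
}
```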
Rebased now that the leak is gone; removed the test speedups since they are also being done in a separate PR, #13947.
amogh-jahagirdar left a comment
Thank you @RussellSpitzer
Finally merged! Thanks to everyone who helped with this and the associated memory leak issue.


While I was working on V4 Parquet Manifests I found a bunch of test classes that do not properly parameterize formatVersion and don't run on V4. I've fixed all the ones I could find.