Description
Bug Report
Affected tool(s)
MarkDuplicates OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TAGGING_POLICY=All
Affected version(s)
Latest test on 2.21.4-2-ga3021a7-SNAPSHOT. Older versions display the same problem.
Description
For HiSeq X runs, the coordinate part of the query name can be greater than 32K. The coordinates are stored using PhysicalLocationShort, which converts the integer that was read in to a short. As you would expect, integers over 32767 become negative. As a result, duplicates that should be treated as physically close together are instead calculated as being far apart.
From my example file the numbers should look like this in the distance calculation:
abs (33023 - 32760) = 263
but instead they become:
abs (-32513 - 32760) = 65273
Which is a bit bigger than the optical duplicate distance of 2500.
(This is from the calculations in closeEnough in OpticalDuplicateFinder)
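To make the arithmetic above easy to check, here is a minimal standalone sketch of the wraparound. It is not Picard's code: the class `ShortOverflowDemo` and the helpers `storedAsShort` and `distance` are hypothetical names for illustration, and the sketch assumes only that the coordinate is narrowed from int to short before the absolute difference is taken.

```java
public class ShortOverflowDemo {
    // Hypothetical helper: mimics storing an int coordinate in a 16-bit field.
    // Values above 32767 wrap around to negative numbers.
    static short storedAsShort(int coordinate) {
        return (short) coordinate;
    }

    // Absolute difference between two coordinates (computed in int).
    static int distance(int a, int b) {
        return Math.abs(a - b);
    }

    public static void main(String[] args) {
        int first = 33023;      // coordinate above the short range
        int second = 32760;     // coordinate just inside the short range
        int maxDistance = 2500; // OPTICAL_DUPLICATE_PIXEL_DISTANCE from the report

        int withInts = distance(first, second);
        int withShorts = distance(storedAsShort(first), storedAsShort(second));

        // prints "int coordinates:   263 withinDistance=true"
        System.out.println("int coordinates:   " + withInts
                + " withinDistance=" + (withInts <= maxDistance));
        // prints "short coordinates: 65273 withinDistance=false"
        System.out.println("short coordinates: " + withShorts
                + " withinDistance=" + (withShorts <= maxDistance));
    }
}
```

With int coordinates the pair falls inside the 2500 pixel distance; after narrowing to short, the same pair appears to be 65273 apart and would not be counted as an optical duplicate.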
Any duplicates with X/Y coordinates around the short size boundary stand a chance of not being identified as optical duplicates. In the production file I was testing on, that amounted to about 22000 missed optical duplicates. The included example file is based on the real data.
Steps to reproduce
MarkDuplicates I=short_test.sam O=short_test_out.sam OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 M=short_test_stat.txt TAGGING_POLICY=All TAG_DUPLICATE_SET_MEMBERS=true
Input file (added .txt suffix to allow upload to GitHub):
short_test.sam.txt
Output file:
short_test_out.sam.txt
Expected behavior
The duplicates should be tagged DT:Z:SQ.
Actual behavior
The duplicates are tagged DT:Z:LB.