MarkDuplicates Missing Optical Duplicates #1441

@whitwham

Bug Report

Affected tool(s)

MarkDuplicates OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TAGGING_POLICY=All

Affected version(s)

Latest test on 2.21.4-2-ga3021a7-SNAPSHOT. Older versions display the same problem.

Description

For HiSeq X runs the coordinate part of the query name can be greater than 32K. The coordinates are stored using PhysicalLocationShort, which converts the integer that was read in to a short. As you would expect, integers over 32767 become negative. As a result, duplicates that should be treated as physically close together are instead calculated as far apart.

From my example file the numbers should look like this in the distance calculation:
abs (33023 - 32760) = 263
instead become:
abs (-32513 - 32760) = 65273

Which is a bit bigger than the optical duplicate distance of 2500.
(This is from the calculations in closeEnough in OpticalDuplicateFinder)
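
A minimal Java sketch of the wraparound, for illustration only. It is not Picard's code: the class and method names are hypothetical, and it only mimics the narrowing cast to short followed by an absolute-difference comparison, using the two coordinates from the example above.

```java
public class ShortOverflowDemo {
    // Hypothetical helper: distance after each coordinate has been narrowed to a short,
    // which is an assumption about how the stored values end up being compared.
    static int distanceAsShort(int a, int b) {
        short sa = (short) a; // values above 32767 wrap around to negative numbers
        short sb = (short) b;
        return Math.abs(sa - sb); // operands are promoted back to int for the subtraction
    }

    // Hypothetical helper: the same distance with the coordinates kept as ints.
    static int distanceAsInt(int a, int b) {
        return Math.abs(a - b);
    }

    public static void main(String[] args) {
        int first = 33023;  // coordinate just above the short limit
        int second = 32760; // coordinate just below the short limit
        int maxDistance = 2500; // OPTICAL_DUPLICATE_PIXEL_DISTANCE from the report

        System.out.println("stored as short: " + (short) first);            // -32513
        System.out.println("int distance:    " + distanceAsInt(first, second));   // 263
        System.out.println("short distance:  " + distanceAsShort(first, second)); // 65273
        System.out.println("within limit (int):   " + (distanceAsInt(first, second) <= maxDistance));
        System.out.println("within limit (short): " + (distanceAsShort(first, second) <= maxDistance));
    }
}
```

With int coordinates the pair falls well inside the 2500 limit; once both values pass through a short, the same pair appears to be more than 65000 apart.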

Any duplicates with X/Y coordinates on either side of the short size boundary (32767) stand a chance of not being identified as optical duplicates. In the production file I was testing on, that amounted to about 22000 missed optical duplicates. The included example file is based on the real data.

Steps to reproduce

MarkDuplicates I=short_test.sam O=short_test_out.sam OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 M=short_test_stat.txt TAGGING_POLICY=All TAG_DUPLICATE_SET_MEMBERS=true

Input file (added .txt suffix to allow upload to GitHub):
short_test.sam.txt
Output file:
short_test_out.sam.txt

Expected behavior

The duplicates should be tagged DT:Z:SQ.

Actual behavior

The duplicates are tagged DT:Z:LB.
