Support degenerate / gap characters #12

@fedarko

Currently, if a sequence contains Ns, matrix construction fails with the following error: "Input sequence contains character N; only DNA nucleotides (A, C, G, T) are currently allowed."

This is a very "safe" way of handling this situation, but it's a bit over-cautious. It would be better to allow these characters, and to treat any k-mer containing one as having no matches anywhere.
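To make the proposed behavior concrete, here is a minimal sketch of enumerating only the k-mers that are free of degenerate / gap characters. It is not this project's implementation: `valid_kmers` is a hypothetical helper, and it assumes the sequence is an uppercase string.

```python
def valid_kmers(seq, k):
    """Yield (start, kmer) for each k-mer of seq made up only of A, C, G, T.

    Hypothetical helper for illustration. Assumes seq is uppercase.
    """
    allowed = set("ACGT")
    # Index of the most recent disallowed character seen so far (-1 if none).
    last_bad = -1
    for i, ch in enumerate(seq):
        if ch not in allowed:
            last_bad = i
        # The k-mer ending at position i starts here.
        start = i - k + 1
        # Keep it only if it lies fully to the right of the last bad character.
        # (Since last_bad >= -1, this also guarantees start >= 0.)
        if start > last_bad:
            yield start, seq[start : i + 1]
```

Any k-mer overlapping an N (or a gap character such as "-") is simply never yielded, so it can never be matched against anything; the start positions of the remaining k-mers still refer to the original sequence.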

Some workaround options, in the meantime:

  • Remove these characters from your sequence before creating a dot plot. (If you keep track of where the "breaks" are, you can then label them on the dot plot to explain the situation.)

    • The downside, of course, is that this will create "spurious" k-mers that span the "breaks".

  • Split up your sequence into "islands" of non-degenerate/gap characters, and analyze these independently. I guess you could also concatenate the resulting dot plot matrices together, although that would require some extra programming work.

  • Replace these characters with random (?) DNA nucleotides (as is done, for example, in section 2.7.1 of the BWA paper).
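The three workarounds above could be sketched roughly as follows. This is plain Python preprocessing only; all function names are hypothetical, none of them are part of this project's API, and the sketch assumes uppercase input in which anything other than A, C, G, T counts as a degenerate / gap character.

```python
import random
import re

DNA = "ACGT"


def strip_with_breaks(seq):
    """Workaround 1: drop non-ACGT characters and remember where they were.

    Returns (cleaned, breaks), where each break is an index into `cleaned`
    at which one or more characters were removed.
    """
    cleaned = []
    breaks = []
    for ch in seq:
        if ch in DNA:
            cleaned.append(ch)
        elif not breaks or breaks[-1] != len(cleaned):
            # Record each run of removed characters only once.
            breaks.append(len(cleaned))
    return "".join(cleaned), breaks


def split_islands(seq):
    """Workaround 2: return (start, island) for each maximal run of ACGT."""
    return [(m.start(), m.group()) for m in re.finditer(r"[ACGT]+", seq)]


def randomize_degenerate(seq, seed=None):
    """Workaround 3: replace each non-ACGT character with a random nucleotide.

    Picking uniformly from A, C, G, T is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return "".join(ch if ch in DNA else rng.choice(DNA) for ch in seq)
```

For example, `strip_with_breaks("ACNNGT-A")` gives the cleaned sequence `"ACGTA"` with breaks at positions 2 and 4, which are the spots you would label on the dot plot (and where the "spurious" k-mers arise). `split_islands` keeps each island's start position in the original sequence, which should help if you later want to stitch the per-island matrices back together. Passing a `seed` to `randomize_degenerate` keeps the replacement reproducible between runs.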

Labels: enhancement (New feature or request)
