Support degenerate / gap characters #12

Currently, the presence of `N`s in a sequence will make matrix construction fail with the following error: `Input sequence contains character N; only DNA nucleotides (A, C, G, T) are currently allowed.`

This is a "safe" way of handling the situation, but it's over-cautious. It would be better to allow these characters and simply assume that any _k_-mer containing one of them has no matches anywhere.
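To make the proposed behavior concrete, here is a minimal sketch (not the library's actual code) of enumerating only the _k_-mers that could produce matches. The function name `valid_kmer_positions` and the constant `DNA` are hypothetical, and the sketch assumes the sequence is already uppercase.

```python
DNA = frozenset("ACGT")  # hypothetical constant: the characters currently allowed


def valid_kmer_positions(seq, k):
    """Yield (start, kmer) for every k-mer made up only of A, C, G, T.

    k-mers that overlap any other character (N, other degenerate
    codes, gaps) are skipped, so they can never match anything.
    """
    last_bad = -1  # index of the most recent non-ACGT character seen so far
    for i, ch in enumerate(seq):
        if ch not in DNA:
            last_bad = i
        start = i - k + 1
        # The window seq[start:i+1] is clean only if it begins after last_bad
        if start >= 0 and start > last_bad:
            yield start, seq[start:i + 1]
```

For example, with `k = 3` the sequence `ACGNTTA` would contribute only `ACG` (position 0) and `TTA` (position 4); the three _k_-mers overlapping the `N` are dropped rather than raising an error.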

Some workaround options, in the meantime:

- Remove these characters from your sequence before creating a dot plot (if you keep track of where the "breaks" are, you can then label these on the dot plot to explain the situation)
  - The downside, of course, is that this creates "spurious" _k_-mers that span the "break".

- Split up your sequence into "islands" of non-degenerate, non-gap characters, and analyze these independently. I guess you could also stitch the resulting dot plot matrices together, although that would require some extra programming work.

- Replace these characters with random (?) DNA nucleotides (as is done, for example, in [section 2.7.1 of the BWA paper](https://academic.oup.com/bioinformatics/article/25/14/1754/225615#394204959)).
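The three workarounds above could be sketched roughly as follows, as preprocessing steps applied before creating a dot plot. All function names here (`strip_with_breaks`, `split_islands`, `randomize_non_dna`) are hypothetical helpers, not part of any existing API, and the sketch assumes uppercase input where anything other than A, C, G, T counts as a degenerate or gap character.

```python
import random
import re

NON_DNA_RUN = re.compile(r"[^ACGT]+")  # one or more non-ACGT characters
DNA_RUN = re.compile(r"[ACGT]+")       # one "island" of plain nucleotides


def strip_with_breaks(seq):
    """Workaround 1: remove non-ACGT runs, remembering where they were.

    Returns (cleaned, breaks), where each break is the index in the
    cleaned sequence at which a run was removed (useful for labelling).
    """
    pieces, breaks, prev_end, cleaned_len = [], [], 0, 0
    for m in NON_DNA_RUN.finditer(seq):
        pieces.append(seq[prev_end:m.start()])
        cleaned_len += m.start() - prev_end
        breaks.append(cleaned_len)
        prev_end = m.end()
    pieces.append(seq[prev_end:])
    return "".join(pieces), breaks


def split_islands(seq, min_len=1):
    """Workaround 2: return (original_start, island) for each ACGT run.

    min_len lets you drop islands shorter than k, which contain no k-mers.
    """
    return [
        (m.start(), m.group())
        for m in DNA_RUN.finditer(seq)
        if len(m.group()) >= min_len
    ]


def randomize_non_dna(seq, seed=None):
    """Workaround 3: replace each non-ACGT character with a random nucleotide.

    Passing a seed makes the replacement reproducible.
    """
    rng = random.Random(seed)
    return "".join(ch if ch in "ACGT" else rng.choice("ACGT") for ch in seq)
```

For instance, `strip_with_breaks("ACNNGT-A")` would return `("ACGTA", [2, 4])`, and `split_islands("ACNNGT-A", min_len=2)` would return `[(0, "AC"), (4, "GT")]`. The tradeoffs are as described above: stripping introduces spurious _k_-mers at each break, splitting avoids them at the cost of multiple plots, and random replacement keeps coordinates intact but may create chance matches.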
