Description
Currently, the presence of Ns in a sequence will make matrix construction fail with the following error: `Input sequence contains character N; only DNA nucleotides (A, C, G, T) are currently allowed.`
This is a very "safe" way of handling this situation, but it's a bit over-cautious. It would be better to allow these characters and simply treat any k-mer containing one of them as having no matches anywhere.
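As a rough illustration of that idea (not the library's actual code), here is a minimal sketch of k-mer enumeration that silently skips any k-mer containing a character outside A, C, G, T. The function name `valid_kmers` and its parameters are hypothetical, and the sketch assumes the sequence is an uppercase string.

```python
def valid_kmers(seq, k, alphabet="ACGT"):
    """Yield (start, kmer) for each k-mer made up only of allowed characters.

    Hypothetical helper for illustration; assumes an uppercase sequence.
    """
    allowed = set(alphabet)
    # Index of the most recent disallowed character seen so far
    last_bad = -1
    for i, ch in enumerate(seq):
        if ch not in allowed:
            last_bad = i
        start = i - k + 1
        # The window [start, i] is clean only if it begins after last_bad
        if start >= 0 and start > last_bad:
            yield start, seq[start:i + 1]
```

K-mers that overlap an N never get yielded, so they never enter whatever index is used for matching, and the corresponding rows and columns of the dot plot just stay empty.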
Some workaround options, in the meantime:
- Remove these characters from your sequence before creating a dot plot. (If you keep track of where the "breaks" are, you can then label them on the dot plot to explain the situation.)
  - The downside of this, of course, is that it will create "spurious" k-mers that span each "break".
- Split up your sequence into "islands" that contain no degenerate or gap characters, and analyze these independently. I guess you could also concatenate the resulting dot plot matrices together, although that would require some extra programming work.
- Replace these characters with random (?) DNA nucleotides (as is done, for example, in section 2.7.1 of the BWA paper).
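For reference, here is a minimal sketch of the three workarounds in plain Python. All of the function names are hypothetical, none of this is part of the library, and it assumes uppercase sequences in which anything other than A, C, G, T should be treated as a problem character.

```python
import random
import re

# One or more consecutive characters that are not plain DNA nucleotides
NON_ACGT = re.compile(r"[^ACGT]+")


def strip_and_track_breaks(seq):
    """Workaround 1: drop non-ACGT runs.

    Returns the cleaned sequence plus the positions (in cleaned-sequence
    coordinates) where a run was removed, so they can be labeled on the plot.
    """
    pieces, breaks = [], []
    cleaned_len = 0
    prev_end = 0
    for m in NON_ACGT.finditer(seq):
        pieces.append(seq[prev_end:m.start()])
        cleaned_len += m.start() - prev_end
        breaks.append(cleaned_len)
        prev_end = m.end()
    pieces.append(seq[prev_end:])
    return "".join(pieces), breaks


def split_into_islands(seq):
    """Workaround 2: return (original_start, island) for each maximal ACGT run."""
    return [(m.start(), m.group()) for m in re.finditer(r"[ACGT]+", seq)]


def randomize_non_acgt(seq, seed=None):
    """Workaround 3: replace each non-ACGT character with a random nucleotide."""
    rng = random.Random(seed)
    return "".join(c if c in "ACGT" else rng.choice("ACGT") for c in seq)
```

For example, `strip_and_track_breaks("ACNNGT")` gives `("ACGT", [2])`, and `split_into_islands("ACNNGT")` gives `[(0, "AC"), (4, "GT")]`. Passing a fixed `seed` to `randomize_non_acgt` keeps the replacement reproducible between runs, which matters if you want the same dot plot each time.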