## Wednesday, September 12, 2012

### Palindromes with errors or gaps

Sometimes I not only want to find perfect palindromes, but also palindromes that contain a limited number of errors. A palindrome with a limited number of errors is often called an approximate palindrome. For example, in the book Judges in the King James Bible, verse 19:9 reads:

  And when the man rose up to depart, he, and his
concubine, and his servant, his father in law, the
damsel's father, said unto him, Behold, now the day
draweth toward evening, I pray you tarry all night:
behold, the day groweth to an end, lodge here, that
thine heart may be merry; and to morrow get you early
on your way, that thou mayest go home.

The substring "draweth toward" is a text palindrome with one error: the e' and the o' don't match. To be precise, a string s is an approximate palindrome with k errors, if it satisfies the predicate approximatePalindrome k s. I define approximatePalindrome as a program by taking the first half of both the input string and its reverse, and comparing each of the elements of these strings pairwise. I don't have to compare the second halves of these strings, since that would give the same result as the comparison of the first halves, and mismatches would be counted twice.

> approximatePalindrome k s =
>   let half = div (length s) 2
>       approximatePalindrome' k []     []     = k>=0
>       approximatePalindrome' k (x:xs) (y:ys)
>         | x == y    = approximatePalindrome' k     xs ys
>         | k >= 0    = approximatePalindrome' (k-1) xs ys
>         | otherwise = False
>   in  approximatePalindrome' k
>         (take half s)
>         (take half (reverse s))



A palindrome with a gap is a palindrome in which a gap of a particular size in the middle is ignored. An example of a palindrome with a gap is found in Revelations, where verses 20:7-8 read:

  And when the thousand years are expired, Satan shall
be loosed out of his prison, And shall go out to
deceive the nations which are in the four quarters of
the earth, Gog, and Magog, to gather them together to
battle: the number of whom is as the sand of the sea.

Here "Gog, and Magog" is a text palindrome with a gap of length three in the middle: the n' and the M' around the central d' don't match. Since the gap appears in the middle of the string, the length of the gap is odd if the length of the palindrome is odd, and even if the length of the palindrome is even. To be precise, a string s is a palindrome with a gap of length g in the middle, if it satisfies the predicate gappedPalindrome g s:

> gappedPalindrome g s =
>   let ls            = length s
>       removeGap g s =
>           let armLength     = div (ls - g) 2
>               (before,rest) = splitAt armLength s
>               (gap,after)   = splitAt g rest
>           in  before ++ after
>   in  if    g < ls && odd g == odd ls
>       then  palindrome (removeGap g s)
>       else  error "gappedPalindrome: incorrect gap length"


If the gap is shorter than the length of the input string, and both the gap and input string have even or odd length, I remove the gap from the middle of the string, and check that the remaining string is a palindrome. I remove the gap by taking half of the elements of the input string, minus half of the gap elements, then dropping the gap, and then taking the rest of the string.

To find palindromes with errors or gaps, I adapt my software for finding palindromes.

It is quite easy to adapt the quadratic-time algorithm for finding palindromes to also find approximate palindromes, and palindromes with gaps. In this blog post I show how to adapt the quadratic-time algorithm that finds palindromes in DNA. Adapting the quadratic-time algorithms for finding exact or text palindromes is done in exactly the same way. To find approximate palindromes in DNA, I pass an extra argument k, representing the maximum number of errors allowed, to the function qMaximalPalindromesDNA, which I now call approximatePalindromesDNA. The only function I have to adapt is the function lengthPalindromeAroundDNA:


> lengthApproximatePalindromeAroundDNA
>   input lastPos k start end
>   |  start < 0 || end > lastPos                =  end-start-1
>   |  B.index input start =:= B.index input end =
>        lengthApproximatePalindromeAroundDNA
>          input
>          lastPos
>          k
>          (start-1)
>          (end+1)
>   |  k > 0                                     =
>        lengthApproximatePalindromeAroundDNA
>          input
>          lastPos
>          (k-1)
>          (start-1)
>          (end+1)
>   |  otherwise                                 =  end-start-1


The time complexity of approximatePalindromesDNA is related to the sum of the lengths of the approximate palindromes found in the input. As I described in the blog post on finding palindromes in DNA, this is not a problem for perfect palindromes in DNA. In the case of approximate palindromes, the number of allowed errors has a big influence on the time complexity. If I increase the number of allowed errors by one, the length of all palindromes increases with at least two, since a single error represents two symbols, and since after the error there might be matching symbols. For example, if I allow for 500 errors to find a palindrome consisting of almost $3$ million symbols, each approximate palindrome in the input has at least length $1000$, except for the palindromes at the start and the end of the string. The average approximate palindrome is quite a bit longer, since the chance for an accidental match is one in four. Since each palindrome has length on average more than $1250$, applying the function approximatePalindromesDNA to a DNA string of length $25$ million requires more than $30$ billion steps only to compute the lengths of the approximate palindromes. This number is very challenging for my laptop, and until now I haven't been able to convince my machine to calculate these values for me, despite letting it run for several nights.

To find palindromes with gaps in DNA, I use an extra argument g, which represents the length of the gap in the middle of the palindrome which I can ignore. A gap may be odd or even, and in the former case I get a palindrome around a center on a symbol in a DNA string, instead of in between two symbols. So where an exact palindromes in a DNA string has its center in between two symbols, a gapped palindrome may have a center in between two symbols or on a symbol, depending on whether or not the gap is even. The function qMaximalPalindromesDNA, which is now called gappedMaximalPalindromes, is the only function I have to adapt. For each center, I have to subtract half of the gap of the center for the start position, and add half of the gap to the end position. When doing so, I have to check that these positions do not point before the beginning of the string or after the end of the string. If they do, I have to replace half of the gap by the appropriate smaller number.


> gappedMaximalPalindromesDNA input g  =
>   let halfg       =  div g 2
>       lengthInput =  B.length input
>       centers     =  if   even g
>                      then [0 .. lengthInput]
>                      else [0 .. lengthInput-1]
>       halfg' c    |  c < halfg             = c
>                   |  c+halfg > lengthInput = lengthInput-c
>                   |  otherwise             = halfg
>       left c      =  c-1-halfg' c
>       right c     =  if   even g
>                      then c+halfg' c
>                      else c+1+halfg' c
>   in  map
>         (\c -> lengthPalindromeAroundDNA
>                  input
>                  (lengthInput-1)
>                  (left c)
>                  (right c)
>         )
>         centers



By combining the functions lengthApproximatePalindromeAroundDNA and gappedMaximalPalindromesDNA` I obtain a function for determining palindromes with both gaps in the middle, and errors in the palindromic arms.

Adapting the quadratic-time algorithm for finding palindromes to also allow for errors and gaps in palindromes is not very hard. How about adapting the linear-time algorithm for finding palindromes?

Adapting the linear-time algorithm to account for errors and gaps is very hard. I spent a long time studying the problem of finding palindromes with a given maximum number of errors in time linear in some combination of the number of errors and the length of the input, but failed to find a satisfying solution. During a couple of long trips I drew many ideas in my notebook to adapt the main components of the linear-time algorithm, the longest tail palindrome and the list of maximal palindromes with centers before the longest tail palindrome, to allow for errors and gaps. In vain.

I tried to adapt the linear-time algorithm for finding palindromes to also find approximate palindromes before I had realized that the quadratic-time algorithm returns its results much faster on DNA strings than the linear-time algorithm. My frustrations about not being able to find a solution have almost disappeared now.

In the scientific literature I found two solutions to the problem of finding approximate palindromes in time linear in some combination of the input and the number of allowed errors. The first solution uses so-called suffix trees. A suffix tree is an advanced data structure that can be used to find all kinds of structures in strings, among which palindromes. I compared the efficiency of my linear-time algorithm for finding exact palindromes with an algorithm using suffix trees, and my linear-time algorithm was far superior. The solutions for finding approximate palindromes using suffix trees use even more arguments, and they will run even slower than the algorithm for finding exact palindromes in strings using suffix trees. The quadratic-time solution for finding approximate palindromes with gaps is no more complicated than the above programs, so although I did not perform any tests, I am pretty confident that the above quadratic-time algorithm for finding approximate palindromes runs much faster than the solutions using suffix trees when applied to DNA strings. The second solution uses advanced algorithms on graphs to determine edit actions that turn one string into another string. I'm afraid I don't understand the details sufficiently well to discuss the algorithm here. The time-complexity of this algorithm is related to $k^2 \times n$, where $k$ is the number of allowed errors, and $n$ the length of the input string. For the DNA string of length $25$ million and the $500$ errors used as examples above, this would amount to more than $6$ trillion steps. My machine would just laugh at it.