Tuesday, September 25, 2012

Finding palindromes in the Y chromosome

To test the program I obtain by combining palindrome finding with errors and gaps, I downloaded a copy of the Y chromosome. The first paper that described the occurrences of huge palindromes in the male chromosome referred to in my blog post on palindromes in DNA is `The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes'. This paper appeared in Nature in 2003, and the Nature web pages provide a link to the Y chromosome studied in the paper. The paper caused quite a stir. It has been cited more than a thousand times in other scientific papers. A thousand citations is a lot: none of my direct colleagues in computer science ever wrote an article with so many citations.

Since it is computationally infeasible to test the program for finding palindromes with errors and gaps with large numbers of allowed errors, I looked up the information about palindromes in the paper. The paper contains a table that gives information about the eight palindromes the authors found in chromosome Y.

Palindrome Arm length (kb) Arm-to-arm identity (%) Spacer length (kb) Palindrome span (kb)
P1 1,450 99.97 2.1 2,902
P1.1 9,9 99.95 3.9 24
P1.2 9,9 99.95 3.9 24
P2 122 99.97 2.1 246
P3 283 99.94 170 736
P4 190 99.98 40 419
P5 496 99.98 3.5 996
P6 110 99.97 46 266
P7 8.7 99.97 12.6 30
P8 36 99.997 3.4 75

Palindrome P1 consists of almost 3 million symbols, and the arm-to-arm identity is 99.97 percent. An arm has length almost one and a half million, so the number of errors is around 450. I assume that the errors are more or less evenly distributed in the palindrome, which implies that in the central 10,000 symbols of an arm of a palindrome, after the gap, about 3 errors would occur. So instead of trying to find the long palindrome with possibly 450 errors and a gap of size around 2100 symbols, I instead try to find palindromes with a gap of size 2200 (to be on the safe side) of length at least 5000 (to be on the safe side) with at most 5 errors (to be on the safe side). To my surprise, I do not find a single palindromes that satisfies these requirements. This long palindrome with a gap of size 2100 does not seem to appear in the Y chromosome of the authors. Since P2 has the same gap size, it does not appear either. Hmm, what's wrong here? Have I made a mistake?

Lets have a look at the other palindromes reported in the table. The next smallest gap size reported in the above table is 3.4 kb. If I try to find palindromes with a gap of size 3400, I get two hits. Around position 18905096 I find a palindrome of length 32056 (including the gap) if I allow for 8 errors. If I try to reduce the gap length, I find that I can take a gap length of 2320, and still find the same palindrome. So maybe this is one of the palindromes with a gap of around 2.1 kb? The arm length is (32056-2320)/2 = 14868, which doesn't correspond at all with the reported arm lengths. I also find a palindrome around position 9806955. This palindrome has length 17304, and 5 errors. This palindrome indeed has a gap of size 3400: reducing the gap size leads to much shorter palindromes. But the palindrome around this position is much shorter than the reported length in Nature. I experimented with different gap sizes and numbers of allowed errors, and found the following palindromes in the Y chromosome:

Center Length Errors Gap length
9806955 13904 5 3400
12078263 43356 1 151498
13738581 34452 0 39262
14445793 64738 5 132950
17989037 14934 5 297114
18899672 3364 7 39540
18905096 29736 8 2320
20530881 24374 2 157760

Since my findings are very different from the results reported in Nature, I contacted the group responsible for the paper, the Whitehead Institute at the Department of Biology at the MIT. The research scientist that answered my questions was kind enough to provide me with the positions at which the palindromes occur in the Y chromosome. Given the positions, I tried to compare the arms by hand (or eye, actually). I got nowhere: the start and end of the arms did not resemble each other at all.

The final clue I needed was that it is not enough to discard errors constituted by two symbols that do not match in a palindrome. Sometimes I also have to delete symbols in one arm of a palindrome, or, equivalently, add symbols to the other arm. With this last clue, I selected the reported palindromic arms in the Y chromosome, and compared them. It turns out that in each case I only have to delete or insert a couple of DNA symbols to obtain a palindrome. The table of palindromes I find are the palindromes P8 to P1 reported in Nature, with P7 missing. My seventh palindrome does not appear in the list of positions I received, but this is probably one of P1.1 or P1.2, of which I didn't receive the positions.

After months of playing with the Y chromosome in some of my spare time, I now finally find the palindromes I was after. If I would have to do the same thing again for another DNA string, I would follow the same approach:

  • use the program that finds palindromes with gaps and errors with various gap sizes and numbers of allowed errors to find centers of palindrome candidates
  • hand align the sequences around the centers of palindrome candidates to determine the length of the palindromes around these centers, given some number of allowed deletions or insertions
I don't think another approach is feasible given the current speed of computers.

Wednesday, September 12, 2012

Palindromes with errors or gaps

Sometimes I not only want to find perfect palindromes, but also palindromes that contain a limited number of errors. A palindrome with a limited number of errors is often called an approximate palindrome. For example, in the book Judges in the King James Bible, verse 19:9 reads:

  And when the man rose up to depart, he, and his 
  concubine, and his servant, his father in law, the 
  damsel's father, said unto him, Behold, now the day 
  draweth toward evening, I pray you tarry all night: 
  behold, the day groweth to an end, lodge here, that 
  thine heart may be merry; and to morrow get you early 
  on your way, that thou mayest go home.
The substring "draweth toward" is a text palindrome with one error: the `e' and the `o' don't match. To be precise, a string s is an approximate palindrome with k errors, if it satisfies the predicate approximatePalindrome k s. I define approximatePalindrome as a program by taking the first half of both the input string and its reverse, and comparing each of the elements of these strings pairwise. I don't have to compare the second halves of these strings, since that would give the same result as the comparison of the first halves, and mismatches would be counted twice.

> approximatePalindrome k s = 
>   let half = div (length s) 2
>       approximatePalindrome' k []     []     = k>=0
>       approximatePalindrome' k (x:xs) (y:ys)  
>         | x == y    = approximatePalindrome' k     xs ys
>         | k >= 0    = approximatePalindrome' (k-1) xs ys
>         | otherwise = False
>   in  approximatePalindrome' k 
>         (take half s) 
>         (take half (reverse s)) 

A palindrome with a gap is a palindrome in which a gap of a particular size in the middle is ignored. An example of a palindrome with a gap is found in Revelations, where verses 20:7-8 read:

  And when the thousand years are expired, Satan shall 
  be loosed out of his prison, And shall go out to 
  deceive the nations which are in the four quarters of 
  the earth, Gog, and Magog, to gather them together to 
  battle: the number of whom is as the sand of the sea.
Here "Gog, and Magog" is a text palindrome with a gap of length three in the middle: the `n' and the `M' around the central `d' don't match. Since the gap appears in the middle of the string, the length of the gap is odd if the length of the palindrome is odd, and even if the length of the palindrome is even. To be precise, a string s is a palindrome with a gap of length g in the middle, if it satisfies the predicate gappedPalindrome g s:

> gappedPalindrome g s = 
>   let ls            = length s
>       removeGap g s = 
>           let armLength     = div (ls - g) 2
>               (before,rest) = splitAt armLength s
>               (gap,after)   = splitAt g rest
>           in  before ++ after 
>   in  if    g < ls && odd g == odd ls 
>       then  palindrome (removeGap g s)
>       else  error "gappedPalindrome: incorrect gap length"
 
If the gap is shorter than the length of the input string, and both the gap and input string have even or odd length, I remove the gap from the middle of the string, and check that the remaining string is a palindrome. I remove the gap by taking half of the elements of the input string, minus half of the gap elements, then dropping the gap, and then taking the rest of the string.

To find palindromes with errors or gaps, I adapt my software for finding palindromes.

It is quite easy to adapt the quadratic-time algorithm for finding palindromes to also find approximate palindromes, and palindromes with gaps. In this blog post I show how to adapt the quadratic-time algorithm that finds palindromes in DNA. Adapting the quadratic-time algorithms for finding exact or text palindromes is done in exactly the same way. To find approximate palindromes in DNA, I pass an extra argument k, representing the maximum number of errors allowed, to the function qMaximalPalindromesDNA, which I now call approximatePalindromesDNA. The only function I have to adapt is the function lengthPalindromeAroundDNA:


> lengthApproximatePalindromeAroundDNA 
>   input lastPos k start end  
>   |  start < 0 || end > lastPos                =  end-start-1
>   |  B.index input start =:= B.index input end =  
>        lengthApproximatePalindromeAroundDNA 
>          input 
>          lastPos 
>          k 
>          (start-1) 
>          (end+1) 
>   |  k > 0                                     =  
>        lengthApproximatePalindromeAroundDNA 
>          input 
>          lastPos 
>          (k-1) 
>          (start-1) 
>          (end+1) 
>   |  otherwise                                 =  end-start-1

The time complexity of approximatePalindromesDNA is related to the sum of the lengths of the approximate palindromes found in the input. As I described in the blog post on finding palindromes in DNA, this is not a problem for perfect palindromes in DNA. In the case of approximate palindromes, the number of allowed errors has a big influence on the time complexity. If I increase the number of allowed errors by one, the length of all palindromes increases with at least two, since a single error represents two symbols, and since after the error there might be matching symbols. For example, if I allow for 500 errors to find a palindrome consisting of almost $3$ million symbols, each approximate palindrome in the input has at least length $1000$, except for the palindromes at the start and the end of the string. The average approximate palindrome is quite a bit longer, since the chance for an accidental match is one in four. Since each palindrome has length on average more than $1250$, applying the function approximatePalindromesDNA to a DNA string of length $25$ million requires more than $30$ billion steps only to compute the lengths of the approximate palindromes. This number is very challenging for my laptop, and until now I haven't been able to convince my machine to calculate these values for me, despite letting it run for several nights.

To find palindromes with gaps in DNA, I use an extra argument g, which represents the length of the gap in the middle of the palindrome which I can ignore. A gap may be odd or even, and in the former case I get a palindrome around a center on a symbol in a DNA string, instead of in between two symbols. So where an exact palindromes in a DNA string has its center in between two symbols, a gapped palindrome may have a center in between two symbols or on a symbol, depending on whether or not the gap is even. The function qMaximalPalindromesDNA, which is now called gappedMaximalPalindromes, is the only function I have to adapt. For each center, I have to subtract half of the gap of the center for the start position, and add half of the gap to the end position. When doing so, I have to check that these positions do not point before the beginning of the string or after the end of the string. If they do, I have to replace half of the gap by the appropriate smaller number.


> gappedMaximalPalindromesDNA input g  =
>   let halfg       =  div g 2
>       lengthInput =  B.length input
>       centers     =  if   even g 
>                      then [0 .. lengthInput]
>                      else [0 .. lengthInput-1]
>       halfg' c    |  c < halfg             = c
>                   |  c+halfg > lengthInput = lengthInput-c
>                   |  otherwise             = halfg
>       left c      =  c-1-halfg' c
>       right c     =  if   even g
>                      then c+halfg' c
>                      else c+1+halfg' c
>   in  map 
>         (\c -> lengthPalindromeAroundDNA 
>                  input 
>                  (lengthInput-1)
>                  (left c)
>                  (right c)
>         ) 
>         centers

By combining the functions lengthApproximatePalindromeAroundDNA and gappedMaximalPalindromesDNA I obtain a function for determining palindromes with both gaps in the middle, and errors in the palindromic arms.

Adapting the quadratic-time algorithm for finding palindromes to also allow for errors and gaps in palindromes is not very hard. How about adapting the linear-time algorithm for finding palindromes?

Adapting the linear-time algorithm to account for errors and gaps is very hard. I spent a long time studying the problem of finding palindromes with a given maximum number of errors in time linear in some combination of the number of errors and the length of the input, but failed to find a satisfying solution. During a couple of long trips I drew many ideas in my notebook to adapt the main components of the linear-time algorithm, the longest tail palindrome and the list of maximal palindromes with centers before the longest tail palindrome, to allow for errors and gaps. In vain.

I tried to adapt the linear-time algorithm for finding palindromes to also find approximate palindromes before I had realized that the quadratic-time algorithm returns its results much faster on DNA strings than the linear-time algorithm. My frustrations about not being able to find a solution have almost disappeared now.

In the scientific literature I found two solutions to the problem of finding approximate palindromes in time linear in some combination of the input and the number of allowed errors. The first solution uses so-called suffix trees. A suffix tree is an advanced data structure that can be used to find all kinds of structures in strings, among which palindromes. I compared the efficiency of my linear-time algorithm for finding exact palindromes with an algorithm using suffix trees, and my linear-time algorithm was far superior. The solutions for finding approximate palindromes using suffix trees use even more arguments, and they will run even slower than the algorithm for finding exact palindromes in strings using suffix trees. The quadratic-time solution for finding approximate palindromes with gaps is no more complicated than the above programs, so although I did not perform any tests, I am pretty confident that the above quadratic-time algorithm for finding approximate palindromes runs much faster than the solutions using suffix trees when applied to DNA strings. The second solution uses advanced algorithms on graphs to determine edit actions that turn one string into another string. I'm afraid I don't understand the details sufficiently well to discuss the algorithm here. The time-complexity of this algorithm is related to $k^2 \times n$, where $k$ is the number of allowed errors, and $n$ the length of the input string. For the DNA string of length $25$ million and the $500$ errors used as examples above, this would amount to more than $6$ trillion steps. My machine would just laugh at it.