Thursday, August 16, 2012

Finding palindromes in DNA

Palindromes play an important role in DNA, and many geneticists want to find palindromes in the DNA they have. To find palindromes in DNA, I cannot directly apply the algorithm for finding palindromes described in a previous blog post, since the equality used for palindromes in DNA differs from the normal equality. A string is a palindrome in DNA if each symbol in the reversed string is the negation of the symbol in the original string, where `A' is the negation of `T' and vice versa, and `C' is the negation of `G', and vice versa. The operator =:= introduced in my blog post on What is a palindrome II? A specification in Haskell implements this equality. By negating the result of =:= I get the inequality operator =/= on DNA symbols.

A first attempt to find palindromes in DNA is to adapt the various components to deal with DNA in the functions extendPalindrome, moveCenter, and finalPalindromes introduced in the blog post on finding palindromes efficiently. In the function extendPalindrome the expression a!... /= a!..., where the dots should be replaced by the appropriate indices in the input, is replaced by the expression a!... =/= a!... using the DNA inequality operator. This is sufficient to turn the function extendPalindrome into a function that finds palindromes in DNA. The functions moveCenter and finalPalindromes do not use the (in)equality operator, so you might think that this is all that is needed to obtain a program for finding palindromes in DNA.

There is one more location in the algorithm for finding palindromes where I use a property of palindromes that does not hold for DNA. In the function moveCenter I call function extendPalindrome with a new "current palindrome" of length one. But a DNA string of length one is not a palindrome, since if I reverse it, I get the same string, and to be a palindrome, the string has to contain the complement symbol. Every DNA string of odd length is not a palindrome, since when such a string is reversed, the symbol in the middle is compared against itself, and DNA equality fails. To find palindromes in a DNA string, I only have to look at the even positions in the string, which correspond to the positions in between symbols. The palindromes around odd positions on the symbols always have length zero.

I adapt the algorithm for efficiently finding palindromes to only find palindromes around even positions. Besides adapting the function extendPalindrome with a different inequality operator, I adapt the function moveCenter by calling extendPalindrome with a new longest current palindrome of length zero, and by counting the number of centers down with steps of size two instead of one, since I am not interested in the odd center positions anymore. These are all changes needed, and I will not give the resulting algorithm. The interested reader can find them in my software package for finding palindromes.

In spring 2012 I was contacted by an Indian scientist who was developing an algorithm for finding palindromes in DNA. I soon found out that her algorithm was similar to my naive, quadratic-time, algorithm for finding maximal palindromes. What worried me was that her algorithm seemed to run faster than my efficient algorithm for finding palindromes, adapted to finding palindromes in DNA. By thinking that my efficient algorithm would run faster, I had made the classical mistake of confusing worst-case time complexity with practice.

The worst-case time complexity of my efficient algorithm for finding palindromes is linear. The number of steps the algorithm takes is linear in the length of the input string. The worst case time-complexity of the naive algorithm is quadratic. A string in which all substrings are palindromes leads to "worst case" time behaviour. An example of a DNA string satisfying this property is a string looking like "ATATATAT...". For a string consisting of three million copies of "AT", the efficient algorithm returns the length of all maximal palindromes about equally fast as for the the Amycolatopsis marina, which is about equally long, but the quadratic time algorithm never finishes on the repetitions of ATs. Strings like "ATATATAT..." appear in DNA, but they are short. For example, the longest maximal palindrome in chromosome 18 consisting of just "ATAT..." has length 66. The longest overall maximal palindrome has length 76. The majority of the maximal palindromes is much shorter. The time-complexity of the naive algorithm for finding palindromes is the sum of the length of the palindromes found in the argument string. If the maximum length of a palindrome is 76, then the number of steps performed by the naive algorithm is at most 76 times the length of the input string. Since chromosome 18 consists of almost 75 million symbols, this is more than five and a half billion, which is quite a lot, but doable for a modern laptop. Indeed, if I measure the running time of the quadratic and linear-time solutions on chromosome 18, I find that the quadratic solution is about four to five times faster than the linear solution. To be precise, on my MacBook Pro with 2.4 GHz Intel Core i5 with 4GB memory, with very few other processes running, the linear-time solution requires slightly more than 30 minutes to analyse the almost 75 megabytes of chromosome 18, whereas the quadratic solution uses around 6 minutes. Why is it faster?

It is hard to exactly determine what one program makes faster than another. The linear time solution uses quite a few arguments, in particular the list of maximal palindromes found until now. This is the central idea to obtain linear-time behaviour, but it also requires constructing, passing on, inspecting, and deconstructing this list to determine maximal palindromes with centers to the right of the current longest tail palindrome. I have experimented with several kinds of representations for the list of maximal palindromes used as argument to the functions extendPalindrome and moveCenter, but none of these solutions can compare to the quadratic solution when looking at running times on large DNA strings. An important motivation for developing a linear-time algorithm for finding palindromes turns out not to be valid at all...

The quadratic-time solution for finding palindromes in DNA is very similar to the naive algorithm for finding maximal palindromes. There are three differences. First, the input to the function maximalPalindromes, which I now call qMaximalPalindromesDNA, is of type ByteString instead of Array Int Char. The type ByteString is a data structure specifically designed to efficiently deal with strings of which the length is known. It is very similar to the type String, but functions like read for reading strings, map for mapping a function over a string, length for determining the length of a stringdefined in the module ByteString, all are usually much faster than their equivalents on String, or even on Array Int Char. I write B.ByteString, B.length, and so on, to make explicit that these components come from the module ByteString:


> import qualified Data.ByteString as B

Second, in the function qMaximalPalindromesDNA I only consider positions in between symbols around which to find maximal palindromes, since palindromes in DNA have even length. The length of the list of centers has length one plus the length of the input, to account for the center at the end of the input.

> qMaximalPalindromesDNA :: B.ByteString -> [Int]
> qMaximalPalindromesDNA input = 
>   let lengthInput =  B.length input
>       centers     =  [0 .. lengthInput]
>       lastPos     =  lengthInput - 1
>   in map 
>        (\c -> lengthPalindromeAroundDNA 
>                 input 
>                 lastPos 
>                 (c-1) 
>                 c
>        ) 
>        centers

The third difference is that function lengthPalindromeAroundDNA doesn't distinguish between even-length and odd-length palindromes anymore, since all palindromes have even length, and that it needs to use the equality operator for DNA.

> lengthPalindromeAroundDNA input lastPos start end  
>   | start < 0 || end > lastPos                 =  end-start-1
>   | B.index input start =:= B.index input end  =  
>       lengthPalindromeAroundDNA 
>         input 
>         lastPos 
>         (start-1) 
>         (end+1) 
>   | otherwise                                  =  end-start-1

This solves the problem of finding palindromes in DNA. But I cannot yet use this algorithm to find the huge palindromes in the Y chromosome, since those palindromes have some noise in the middle, and the palindromic arms contain a few errors. I will look into this in a following blog post.

2 comments:

  1. Hi Johan,

    I created an implementation in C++ of your linear-time algorithm to find palindromes. I would like to test it with real DNA data.

    Could you provide me the chromosome 18's DNA string (75 million symbols)?
    I would like to test the efficiency of my linear-time implementation against quadratic version.

    My email is fpelliccioni at gmail dot com.

    Thanks and regards,
    Fernando Pelliccioni

    ReplyDelete
  2. I downloaded it via this link:

    http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr18.fa.gz

    Good luck!

    ReplyDelete