Prof. Cenk Sahinalp
Department of EECS, Case School of Engineering
We have developed a new compression algorithm for DNA/RNA sequences that improves upon the standard biocompress utility by UCSB. Our goal is to beat biocompress in both compression ratio and running time. Known techniques limit the number of mismatches allowed in a match to one or two in order to keep running times reasonable. We improve on the available search routines with a new algorithm that finds approximate matches for the longest prefix of the not-yet-compressed part of the input sequence. Our immediate goal is to produce the best possible implementation of this algorithm and to develop generalizations of it for bioXML, which incorporates annotations and other forms of text.
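To make the search problem concrete, the following is a minimal sketch of a naive baseline, not the algorithm described above. It scans every earlier start position and returns the longest prefix of the uncompressed remainder that matches already-seen sequence with a bounded number of substitutions. The function name, the parameter max_mismatches, the restriction to substitutions only, and the rule that a match may not overlap the uncompressed part are all assumptions made for illustration.

```python
def longest_approx_match(seq, pos, max_mismatches=1):
    """Naive quadratic scan (illustrative baseline only).

    Finds the longest prefix of seq[pos:] that matches a substring lying
    entirely inside seq[:pos] with at most max_mismatches substitutions.
    Returns (start, length, mismatch_offsets); start is -1 if no match.
    """
    best = (-1, 0, [])
    n = len(seq)
    for start in range(pos):
        mismatches = []
        length = 0
        # Extend while input remains and the source stays inside seq[:pos].
        while pos + length < n and start + length < pos:
            if seq[start + length] != seq[pos + length]:
                if len(mismatches) == max_mismatches:
                    break
                mismatches.append(length)
            length += 1
        # A match that ends on a mismatch gains nothing; trim it.
        while mismatches and mismatches[-1] == length - 1:
            mismatches.pop()
            length -= 1
        if length > best[1]:
            best = (start, length, mismatches)
    return best


if __name__ == "__main__":
    seq = "ACGTTGCAACGATGCA"
    # Exact matching versus one allowed substitution, from position 8.
    print(longest_approx_match(seq, 8, max_mismatches=0))
    print(longest_approx_match(seq, 8, max_mismatches=1))
```

In this toy input, allowing one substitution extends the match from 3 to 8 bases, which is why approximate matching is attractive for DNA/RNA data. The scan costs time proportional to the product of the prefix length and the match length for each position, which illustrates why practical tools cap the mismatch count and why a faster search routine matters.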