Date of Award

Winter 2018

Rights

Access is available to all users

Document Type

Thesis

Degree Name

Master of Science (MS) in Computer Science

Department

Computer Science

Abstract

This thesis revisits the k-mismatch shortest unique substring (SUS) finding problem and demonstrates that a technique recently presented in the context of solving the k-mismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n logk n), while maintaining a practical space complexity at O(kn), where n is the string length. When k > 0, which is the hard case, the new proposal significantly improves the any-case O(n2) time complexity of the prior best method for k-mismatch SUS finding. Experimental study shows that the new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, the proposed method processes a 200KB sample DNA sequence with k = 1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel resulting in further significant practical performance improvement. As an example, when using 8 cores to process a 10MB sample DNA sequence with k = 2, two parallel implementations each achieved processing times less than 1=4 that of the serial implementation. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that trading additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may require years to process a 200MB DNA sample for any k > 0, while this new proposal, using 24 cores, finished processing a sample of this size with k = 1 in 206:376 seconds with a peak memory usage of 46GB, which is easily available and affordable for many users. It is expected that this new practical and efficient algorithm for k-mismatch SUS finding will prove useful to those using the measure on long sequences in fields such as computational biology.

Share

COinS