The Role of The Prefix Array in Sequence Analysis: A Survey

Frantisek Franek; W. F. Smyth; Xinfang Wang; Frantisek Franek; W. F. Smyth; Xinfang Wang

doi:10.3934/medsci.2017.3.261

AIMS Medical Science

2017, Volume 4, Issue 3: 261-273. doi: 10.3934/medsci.2017.3.261

Previous Article Next Article

Research article Special Issues

The Role of The Prefix Array in Sequence Analysis: A Survey

1.
Department of Computing and Software, McMaster University, Hamilton, Canada
2.
School of Computational Science and Engineering, McMaster University, Hamilton, Canada
3.
School of Engineering and Information Technology, Murdoch University, Perth, Australia

Received: 25 April 2017 Accepted: 22 June 2017 Published: 13 July 2017

The prefix array was apparently first computed and used algorithmically in 1984, playing a pivotal role in an optimal algorithm to determine all the tandem repeats in a given (DNA or protein) sequence. However, it is especially since the turn of the 21^st century that applications of the prefix array to fundamental sequencing problems have been recognized. An important aspect of this expanding role has been the recognition that the prefix table and the border array are “equivalent” data structures 一 that is, one can be computed from the other in linear time. Since the border array in turn specifies all the periods of every prefix of the sequence, the prefix array thus turns out to be a structure of central importance. In this paper we survey important applications of the prefix array 一 in particular to approximate string matching under Hamming distance, as well as to the computation of covers and enhanced covers 一 and show how, unlike border array algorithms, these are extendible to sequences containing “don’t-care” or indeterminate letters such as {a, c} or {g, t}. This extension leads to a surprising correspondence between prefix arrays and undirected graphs that seems likely to be a fertile source of new insights in future. We conclude with an overview of sequencing problems that the authors believe can be handled using prefix array technology.
- string,
- sequence,
- indeterminate string,
- prefix array,
- sequence analysis,
- algorithm
Citation: Frantisek Franek, W. F. Smyth, Xinfang Wang. The Role of The Prefix Array in Sequence Analysis: A Survey[J]. AIMS Medical Science, 2017, 4(3): 261-273. doi: 10.3934/medsci.2017.3.261

Related Papers:

Abstract

The prefix array was apparently first computed and used algorithmically in 1984, playing a pivotal role in an optimal algorithm to determine all the tandem repeats in a given (DNA or protein) sequence. However, it is especially since the turn of the 21^st century that applications of the prefix array to fundamental sequencing problems have been recognized. An important aspect of this expanding role has been the recognition that the prefix table and the border array are “equivalent” data structures 一 that is, one can be computed from the other in linear time. Since the border array in turn specifies all the periods of every prefix of the sequence, the prefix array thus turns out to be a structure of central importance. In this paper we survey important applications of the prefix array 一 in particular to approximate string matching under Hamming distance, as well as to the computation of covers and enhanced covers 一 and show how, unlike border array algorithms, these are extendible to sequences containing “don’t-care” or indeterminate letters such as {a, c} or {g, t}. This extension leads to a surprising correspondence between prefix arrays and undirected graphs that seems likely to be a fertile source of new insights in future. We conclude with an overview of sequencing problems that the authors believe can be handled using prefix array technology.

References

[1]	Iliopoulos CS, Mouchard L (1999) Quasiperiodicity and string covering. Theoret Comput Sci 218: 205-216. doi: 10.1016/S0304-3975(98)00260-6
[2]	Smyth WF (2013) Computing regularities in strings: a survey. Europ J Combinatorics 34: 3-14. doi: 10.1016/j.ejc.2012.07.010
[3]	Aho AV, Hopcroft JE (1974) The design and analysis of computer algorithms. Pearson Education India.
[4]	Knuth DE, Morris, Jr JH, Pratt VR (1977) Fast pattern matching in strings. SIAM J Computing 6: 323-350. doi: 10.1137/0206024
[5]	Fredkin E (1960) Trie memory. Commun Assoc Comput Mach 3: 490-499.
[6]	Weiner P (1973) Linear pattern matching algorithms. In: Switching and Automata Theory, 1973, SWAT'08. IEEE Conference Record of 14th Annual Symposium on. IEEE: 1-11.
[7]	McCreight EM (1976) A space-economical suffix tree construction algorithm. JACM 23: 262-272. doi: 10.1145/321941.321946
[8]	Farach M (1997) Optimal su x tree construction with large alphabets. In: Proceedings of the 38th Annual Symposium on the Foundations of Computer Science, FOCS: 97.
[9]	Apostolico A (1985) The myriad virtues of subword trees. In: Combinatorial algorithms on words. Springer Berlin Heidelberg: 85-96.
[10]	Manber U, Myers GW (1990) Suﬃx arrays: a new method for on-line string searches, Proc. First Annual ACM-SIAM Symp. Discrete Algs: 319-327.
[11]	Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22: 935-948. doi: 10.1137/0222058
[12]	Kärkkäinen J, Sanders P (2003) Simple linear work suffix array construction. In: International Colloquium on Automata, Languages, and Programming. Springer Berlin Heidelberg: 943-955.
[13]	Ko P, Aluru S (2003) Space efficient linear time construction of suffix arrays. In: Annual Symposium on Combinatorial Pattern Matching. Springer Berlin Heidelberg: 200-210.
[14]	Kim DK, Sim JS, Park H, et al. (2003) Linear-time construction of suffix arrays. In: Annual Symposium on Combinatorial Pattern Matching. Springer Berlin Heidelberg: 186-199.
[15]	Puglisi SJ, Smyth WF, Turpin AH (2007) A taxonomy of suffix array construction algorithms. acm Computing Surveys (CSUR) 39: 4. doi: 10.1145/1242471.1242472
[16]	Nong G, Zhang S, Chan WH (2009) Linear time suffix array construction using D-critical substrings. In: Annual Symposium on Combinatorial Pattern Matching. Springer Berlin Heidelberg: 54-67.
[17]	Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2: 53-86. doi: 10.1016/S1570-8667(03)00065-0
[18]	Smyth B (2003) Computing patterns in strings. Pearson Education.
[19]	Fischer MJ, Paterson MS (1974) String-matching and other products (No. MAC-TM-41). Massachusetts Inst. of Technology.
[20]	Blanchet-Sadri F (2007) Algorithmic combinatorics on partial words. CRC Press.
[21]	Abrahamson K (1987) Generalized string matching. SIAM J Comput 16:1039-1051.
[22]	Holub J, Smyth WF (2003) Algorithms on indeterminate strings. 36-45.
[23]	Holub J, Smyth W F, Wang S (2008) Fast pattern-matching on indeterminate strings. J Discrete Algorithms 6: 37-50. doi: 10.1016/j.jda.2006.10.003
[24]	Smyth WF, Wang S (2009) A new approach to the periodicity lemma on strings with holes. Theoret Comput Sci 410: 4295-4302.
[25]	Main MG, Lorentz RJ (1984) An O (n log n) algorithm for finding all repetitions in a string. J Algorithms 5: 422-432.
[26]	Crochemore M, Hancart C, Lecroq T (2001) Algorithmique du texte. Paris: Vuiber: 347.
[27]	Crochemore M, Hancart C, Lecroq T (2007) Algorithms on strings. Cambridge University Press: 383.
[28]	Smyth WF, Wang S (2008) New perspectives on the prefix array. In: International Symposium on String Processing and Information Retrieval. Springer Berlin Heidelberg: 133-143.
[29]	Bland W, Kucherov G, Smyth WF (2013) Prefix table construction and conversion. In: International Workshop on Combinatorial Algorithms. Springer Berlin Heidelberg: 41-53.
[30]	Gusfield D (1997) Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge university press.
[31]	Clément J, Crochemore M, Rindone G (2009) Reverse engineering prefix tables. In: 26th International Symposium on Theoretical Aspects of Computer Science STACS 2009. IBFI Schloss Dagstuhl: 289-300.
[32]	Alatabbi A, Rahman MS, Smyth WF (2015) Inferring an indeterminate string from a prefix graph. J Discrete Algorithms 32: 6-13.
[33]	Christodoulakis M, Ryan PJ, Smyth WF, et al. (2015) Indeterminate strings, prefix arrays & undirected graphs. Theoret Comput Sci 600: 34-48.
[34]	Blanchet-Sadri F, Bodnar M, De Winkle B (2017) New bounds and extended relations between prefix arrays, border arrays, undirected graphs, and indeterminate strings. Theory of Computing Systems 60: 473-497. doi: 10.1007/s00224-016-9668-2
[35]	Helling J, Ryan PJ, Smyth WF, et al. (2017) Constructing an indeterminate string from its associated graph. Theoret Comput Sci. Available from: http://www.sciencedirect.com/science/article/pii/S0304397517301494
[36]	Beal R, Adjeroh DA, Smyth WF (2017) A prefix array for parameterized strings. J Discrete Algorithms 42: 23-34. doi: 10.1016/j.jda.2016.11.002
[37]	Barton C, Iliopoulos CS, Pissis SP, et al. (2014) Fast and simple computations using prefix tables under hamming and edit distance. In: International Workshop on Combinatorial Algorithms. Springer International Publishing: 49-61.
[38]	Apostolico A, Ehrenfeucht A (1993) Efficient detection of quasiperiodicities in strings. Theoret Comput Sci 119: 247-265.
[39]	Apostolico A, Farach M, Iliopoulos CS (1991) Optimal superprimitivity testing for strings. Inform Process Lett 39: 17-20.
[40]	Breslauer D (1992) An on-line string superprimitivity test. Inform Proces. Lett 44: 345-347.
[41]	Moore D, Smyth WF (1994) An optimal algorithm to compute all the covers of a string. Inform Process Lett 50: 239-246. doi: 10.1016/0020-0190(94)00045-X
[42]	Moore D, Smyth WF (1995) A correction to "An optimal algorithm to compute all the covers of a string". Inform Process Lett 54: 101-103. doi: 10.1016/0020-0190(94)00235-Q
[43]	Li Y, Smyth WF (2002) Computing the cover array in linear time. Algorithmica 32: 95-106. doi: 10.1007/s00453-001-0062-2
[44]	Iliopoulos CS, Moore DWG, Park K (1996) Covering a string. Algorithmica 16: 288-297. doi: 10.1007/BF01955677
[45]	Christodoulakis M, Iliopoulos CS, Park K, et al. (2005) Approximate Seeds of Strings. J Automata Languages & Combinatorics 10: 609-626.
[46]	Iliopoulos CS, Smyth WF (1998) On-line algorithms for k-covering. Proc. 9th Australasian Workshop on Combinatorial Algs: 107-116.
[47]	Cole R, Ilopoulos CS, Mohamed M, et al. (2005) The complexity of the minimum k-cover problem. J Automata Languages & Combinatorics 10: 641-653.
[48]	Iliopoulos CS, Mohamed M, Smyth WF (2011) New complexity results for the k-covers problem. Inform Sci 181: 2571-2575.
[49]	Kociumaka T, Pissis SP, Radoszewski J, et al. (2015) Fast algorithm for partial covers in words. Algorithmica 73: 217-233. doi: 10.1007/s00453-014-9915-3
[50]	Kociumaka T, Pissis S P, Radoszewski J, et al. (2016) Efficient algorithms for shortest partial seeds in words. Theoret Comput Sci Available from: http://www.sciencedirect.com/science/article/pii/S0304397516307034
[51]	Flouri T, Iliopoulos CS, Kociumaka T, et al. (2013) Enhanced string covering. Theoret Comput Sci 506: 102-114.
[52]	Bari MF, Rahman MS, Shahriyar R (2009) Finding All Covers of an Indeterminate String in O (n) Time on Average. In: Stringology: 263-271.
[53]	Aho AV, Corasick MJ (1975) Efficient string matching: an aid to bibliographic search. Commun Assoc Comput Mach 18: 333-340.
[54]	Alatabbi A, Rahman MS, Smyth WF (2016) Computing covers using prefix tables. Discrete Appl Math 212: 2-9.
[55]	Alatabbi A, Islam AS, Rahman MS, et al. (2016) Enhanced covers of regular & indeterminate strings using prefix tables. J Automata Languages & Combinatorics: 41-46.

Reader Comments

Your name:*

Email:*
© 2017 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)