Export file:

Format

  • RIS(for EndNote,Reference Manager,ProCite)
  • BibTex
  • Text

Content

  • Citation Only
  • Citation and Abstract

Space Efficient Data Structures for N-gram Retrieval

Department of Computer Engineering and Informatics, University of Patras, 26500 Greece

Special Issues: The Future of Informatics in Biomedicine

A significant problem in computer science is the management of large data strings and a great number of works dealing with the specific problem has been published in the scientific literature. In this article, we use a technique to store efficiently biological sequences, making use of data structures like suffix trees and inverted files and also employing techniques like n-grams, in order to improve previous constructions. In our attempt, we drastically reduce the space needed to store the inverted indexes, by representing the substrings that appear more frequently in a more compact inverted index. Our technique is based on n-gram indexing, providing us the extra advantage of indexing sequences that cannot be separated in words. Moreover, our technique combines classical one level with two-level n-gram inverted file indexing. Our results suggest that the new proposed algorithm can compress the data more efficiently than previous attempts.
  Figure/Table
  Supplementary
  Article Metrics

References

1. Blandford D, Blelloch G (2002) Index compression through document reordering. In: Storer JA and Cohn M, Eds., Proc. 2002 IEEE Data Compression Conference, April, IEEE Computer Society Press, Los Alamitos, CA. pp. 342-351.

2. Scholer F, Williams HE, Yiannis J, et al. (2002) Compression of inverted indexes for fast query evaluation. In: Beaulieu M, Baeza-Yates R, Myaeng SH and Jarvelin K, Eds., Proc. 25th Annual International ACMSIGIR Conference on Research and Development in Information Retrieval, August, Tampere, Finland, ACM Press, New York, pp. 222-229.

3. Brandon MC, Wallace DC, Baldi P (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics 25: 1731-1738.    

4. Gonzalo N, Veli M (2007) Compressed full-text indexes. ACM Computing Surveys 39: 2.    

5. Ricardo AB, Berthier AR (2011) Modern Information Retrieval-the concepts and technology behind search, Second edition. Pearson Education Ltd., Harlow, England ISBN 978-0-321-41691-9.

6. Ian HW, Alistair M, Timothy CB (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann ISBN 1-55860-570-3.

7. Weiner P (1973) Linear Pattern Matching Algorithms In 14th Annual IEEE Symposium on Switching and Automata Theory.

8. Ukkonen E (1995) On-Line Construction of Suffix Trees. Algorithmica: 249-260.

9. Iliopoulos C, Makris C, Panagis Y, et al. (2006) The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications. Fundament Informa 71: 259-277.

10. Diamanti K, Kanavos A, Makris C, et al. (2014) Handling Weighted Sequences Employing Inverted Files and Suffix Trees. In Web Information Systems and Technologies, pp. 231-238, extended version, private communication with the authors.

11. Kim MS, Whang KY, Lee JG, et al. (2005) N-gram/2l: A space and time efficient two level n-gram inverted index structure. Int Conference Very Large Databases: 325-336.

12. Christodoulakis M, Iliopoulos CS, Mouchard L, et al. (2006) Computation of repetitions and regularities of biologically weighted sequences. J Comput Biol 13: 1214-1231.    

13. Zhu H, Kollios G, Athitsos V (2012) A Generic Framework for Efficient and Effective Subsequence Retrieval. Proceed Very Large Data Bases: 1579-1590.

14. Amir A, Iliopoulos CS, Kapah O, et al. (2006). Approximate matching in weighted sequences. Combinator Pattern Match: 365376

15. Alatabbi A, Crochemore M, Iliopoulos CS, et al. (2012) Overlapping repetitions in weighted sequence. Int Informat Technol Conference: 435-440.

16. Zhang H, Guo Q, Iliopoulos CS (2010) An algorithmic framework for motif discovery problems in weighted sequences. Int Conference Algorithms Complex: 335-346.

17. Zhang H, Guo Q, Iliopoulos CS (2010) Varieties of regularities in weighted sequences. Algorithmic Aspect Informa Manage: 271-280.

18. Li G, Deng D, Feng J (2011) Faerie: Efficient Filtering Algorithm for Approximate Dictionary-based Entity Extraction. Special Interest Group Manage Data: 529-540.

19. Mouza C, Litwin W, Rigaux P, et al. (2009) AS-Index: A Structure for String Search Using n-grams and Algebraic Signatures. Confer Informat Knowledge: 295-304.

20. Marsan L, Sagot MF (2000) Extracting structured motifs using a suffix tree-algorithms and application to promoter consensus identification. Int Conference Res Comput Mol Biol: 210-219.

21. Sun Z, Yang J, Deogun JS (2004) Misae: A new approach for regulatory motif extraction. Computa System Bioinforma Conference: 173-181.

22. Holub J, Smyth WF (2003) Algorithms on indeterminate strings. In Australasian Workshop on Combinatorial Algorithms.

23. Holub J, Smyth WF, Wang S (2008) Fast pattern matching on indeterminate strings. J Discrete Algorithms 6: 37-50.    

24. Manning CD, Raghavan P, Schutze H (2008) Introduction to Information Retrieval. Cambridge University Press.

25. Kim MS, Whang KY, Lee JG (2007) N-gram/2l-approximation: a two-level n-gram inverted index structure for approximate string matching. Compu System Sci Engineer 22: 6.

Copyright Info: © 2017, Christos Makris, et al., licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution Licese (http://creativecommons.org/licenses/by/4.0)

Download full text in PDF

Export Citation

Article outline

Show full outline
Copyright © AIMS Press All Rights Reserved