Research article Special Issues

Space Efficient Data Structures for N-gram Retrieval

  • Received: 29 April 2017 Accepted: 18 October 2017 Published: 24 October 2017
  • A significant problem in computer science is the management of large data strings and a great number of works dealing with the specific problem has been published in the scientific literature. In this article, we use a technique to store efficiently biological sequences, making use of data structures like suffix trees and inverted files and also employing techniques like n-grams, in order to improve previous constructions. In our attempt, we drastically reduce the space needed to store the inverted indexes, by representing the substrings that appear more frequently in a more compact inverted index. Our technique is based on n-gram indexing, providing us the extra advantage of indexing sequences that cannot be separated in words. Moreover, our technique combines classical one level with two-level n-gram inverted file indexing. Our results suggest that the new proposed algorithm can compress the data more efficiently than previous attempts.

    Citation: Fotios Kounelis, Christos Makris. Space Efficient Data Structures for N-gram Retrieval[J]. AIMS Medical Science, 2017, 4(4): 426-440. doi: 10.3934/medsci.2017.4.426

    Related Papers:

  • A significant problem in computer science is the management of large data strings and a great number of works dealing with the specific problem has been published in the scientific literature. In this article, we use a technique to store efficiently biological sequences, making use of data structures like suffix trees and inverted files and also employing techniques like n-grams, in order to improve previous constructions. In our attempt, we drastically reduce the space needed to store the inverted indexes, by representing the substrings that appear more frequently in a more compact inverted index. Our technique is based on n-gram indexing, providing us the extra advantage of indexing sequences that cannot be separated in words. Moreover, our technique combines classical one level with two-level n-gram inverted file indexing. Our results suggest that the new proposed algorithm can compress the data more efficiently than previous attempts.


    加载中
    [1] Blandford D, Blelloch G (2002) Index compression through document reordering. In: Storer JA and Cohn M, Eds., Proc. 2002 IEEE Data Compression Conference, April, IEEE Computer Society Press, Los Alamitos, CA. pp. 342-351.
    [2] Scholer F, Williams HE, Yiannis J, et al. (2002) Compression of inverted indexes for fast query evaluation. In: Beaulieu M, Baeza-Yates R, Myaeng SH and Jarvelin K, Eds., Proc. 25th Annual International ACMSIGIR Conference on Research and Development in Information Retrieval, August, Tampere, Finland, ACM Press, New York, pp. 222-229.
    [3] Brandon MC, Wallace DC, Baldi P (2009) Data structures and compression algorithms for genomic sequence data. Bioinformatics 25: 1731-1738. doi: 10.1093/bioinformatics/btp319
    [4] Gonzalo N, Veli M (2007) Compressed full-text indexes. ACM Computing Surveys 39: 2. doi: 10.1145/1216370.1216372
    [5] Ricardo AB, Berthier AR (2011) Modern Information Retrieval-the concepts and technology behind search, Second edition. Pearson Education Ltd., Harlow, England ISBN 978-0-321-41691-9.
    [6] Ian HW, Alistair M, Timothy CB (1999) Managing Gigabytes: Compressing and Indexing Documents and Images, Second Edition. Morgan Kaufmann ISBN 1-55860-570-3.
    [7] Weiner P (1973) Linear Pattern Matching Algorithms In 14th Annual IEEE Symposium on Switching and Automata Theory.
    [8] Ukkonen E (1995) On-Line Construction of Suffix Trees. Algorithmica: 249-260.
    [9] Iliopoulos C, Makris C, Panagis Y, et al. (2006) The Weighted Suffix Tree: An Efficient Data Structure for Handling Molecular Weighted Sequences and its Applications. Fundament Informa 71: 259-277.
    [10] Diamanti K, Kanavos A, Makris C, et al. (2014) Handling Weighted Sequences Employing Inverted Files and Suffix Trees. In Web Information Systems and Technologies, pp. 231-238, extended version, private communication with the authors.
    [11] Kim MS, Whang KY, Lee JG, et al. (2005) N-gram/2l: A space and time efficient two level n-gram inverted index structure. Int Conference Very Large Databases: 325-336.
    [12] Christodoulakis M, Iliopoulos CS, Mouchard L, et al. (2006) Computation of repetitions and regularities of biologically weighted sequences. J Comput Biol 13: 1214-1231. doi: 10.1089/cmb.2006.13.1214
    [13] Zhu H, Kollios G, Athitsos V (2012) A Generic Framework for Efficient and Effective Subsequence Retrieval. Proceed Very Large Data Bases: 1579-1590.
    [14] Amir A, Iliopoulos CS, Kapah O, et al. (2006). Approximate matching in weighted sequences. Combinator Pattern Match: 365376
    [15] Alatabbi A, Crochemore M, Iliopoulos CS, et al. (2012) Overlapping repetitions in weighted sequence. Int Informat Technol Conference: 435-440.
    [16] Zhang H, Guo Q, Iliopoulos CS (2010) An algorithmic framework for motif discovery problems in weighted sequences. Int Conference Algorithms Complex: 335-346.
    [17] Zhang H, Guo Q, Iliopoulos CS (2010) Varieties of regularities in weighted sequences. Algorithmic Aspect Informa Manage: 271-280.
    [18] Li G, Deng D, Feng J (2011) Faerie: Efficient Filtering Algorithm for Approximate Dictionary-based Entity Extraction. Special Interest Group Manage Data: 529-540.
    [19] Mouza C, Litwin W, Rigaux P, et al. (2009) AS-Index: A Structure for String Search Using n-grams and Algebraic Signatures. Confer Informat Knowledge: 295-304.
    [20] Marsan L, Sagot MF (2000) Extracting structured motifs using a suffix tree-algorithms and application to promoter consensus identification. Int Conference Res Comput Mol Biol: 210-219.
    [21] Sun Z, Yang J, Deogun JS (2004) Misae: A new approach for regulatory motif extraction. Computa System Bioinforma Conference: 173-181.
    [22] Holub J, Smyth WF (2003) Algorithms on indeterminate strings. In Australasian Workshop on Combinatorial Algorithms.
    [23] Holub J, Smyth WF, Wang S (2008) Fast pattern matching on indeterminate strings. J Discrete Algorithms 6: 37-50. doi: 10.1016/j.jda.2006.10.003
    [24] Manning CD, Raghavan P, Schutze H (2008) Introduction to Information Retrieval. Cambridge University Press.
    [25] Kim MS, Whang KY, Lee JG (2007) N-gram/2l-approximation: a two-level n-gram inverted index structure for approximate string matching. Compu System Sci Engineer 22: 6.
  • Reader Comments
  • © 2017 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(3633) PDF downloads(912) Cited by(2)

Article outline

Figures and Tables

Figures(3)

Other Articles By Authors

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog