Citation: Xiaoying Chen, Chong Zhang, Zonglin Shi, Weidong Xiao. Spatio-temporal Keywords Queries in HBase[J]. Big Data and Information Analytics, 2016, 1(1): 81-91. doi: 10.3934/bdia.2016.1.81
With the development of wireless communication and positioning technology, more and more location data are collected and used in many applications, such as location-based services. In particular, the format of the data can be generalized to (x, y, t, W), where (x, y) is a geo-location, t is a timestamp, and W is a set of keywords describing the record.
However, tens of billions of spatio-temporal records pose a serious problem for applications. HBase [1] is a distributed NoSQL key-value database that is capable of storing such a huge amount of data. However, due to its NoSQL characteristics, it is inefficient to process spatio-temporal keyword queries using the built-in functions of HBase. Consequently, it is highly necessary to design efficient algorithms for processing spatio-temporal keyword queries in HBase. A previous work [9] on processing spatial or multi-attribute queries in HBase differs from ours, because it considers only the spatial dimension in query processing. To the best of our knowledge, this is the first work on processing spatio-temporal keyword queries in HBase.
Our motivation is to adopt HBase to process STK queries efficiently. First, a space filling curve, the Hilbert curve [8], is used to encode the spatial dimension into one-dimensional codes, i.e., a location given by geo-coordinates is represented by a number denoting its relative position in the original space. Based on this, we design a suitable access model for HBase, containing row keys for indexing the spatio-temporal dimensions and Bloom filters [3] for quickly detecting the existence of query keywords. After that, we develop two algorithms: one is suitable for queries with ordinary selectivity, and the other is a parallel algorithm for large range queries. We evaluate our algorithms on a real dataset, and the results show that they are capable of handling spatio-temporal keyword queries efficiently. In summary, we make the following contributions:
● We propose the spatio-temporal keyword query processing problem in HBase, which is a new challenge for the HBase platform.
● We design a novel access model for HBase to store spatial, temporal and textual information.
● We propose efficient algorithms for processing spatio-temporal keyword queries in HBase.
The rest of this paper is organized as follows. Section 2 reviews related work on spatio-temporal queries and HBase queries. Section 3 formally defines the spatio-temporal keyword query problem. Section 4 presents an access model for HBase. Algorithms for spatio-temporal keyword queries are presented in Section 5, and we experimentally evaluate them in Section 6. Finally, Section 7 concludes the paper with directions for future work.
To our knowledge, spatio-temporal keyword queries in HBase have been less well studied. However, some research on distributed indexes can be referenced. As an attractive choice for large-scale data processing, cloud storage systems currently adopt a hash-like approach to retrieve data, which only supports simple keyword-based queries and lacks various forms of information search. To overcome this disadvantage, Zhou et al. [10] propose SNB-index, a novel SkipNet and B+-tree based auxiliary cloud index.
To support similarity search in cloud systems, Cheng et al. [4] propose VF-CAN, a novel indexing scheme that integrates a content addressable network (CAN) based routing protocol and an improved vector approximation file (VA-file) index. There are two index levels in this scheme: a global index and a local index. In the local index, VA-file approximation vectors are clustered by k-means according to their degree of proximity. In the global index, storage nodes are organized into a CAN overlay network, and only the clustering information of the local index is published to the entire overlay network through the CAN interface. The experimental results show that VF-CAN reduces index storage space and improves query performance effectively.
For multi-dimensional data applications, Nishimura et al. [9] propose MD-HBase, a scalable multi-dimensional data store supporting efficient multi-dimensional range and nearest neighbor queries. MD-HBase layers a multi-dimensional index structure over a range-partitioned key-value store. Using a design based on linearization, its implementation layers standard index structures such as K-d trees and Quad trees.
As mentioned above, many applications require not only finding the objects closest to a specified location, but also objects containing a given set of keywords. Felipe et al. [6] present an efficient method to answer top-k spatial keyword queries. It adopts a structure called the IR²-tree, which combines an R-tree with superimposed text signatures.
Cong et al. [5] propose a novel indexing framework for location top-k text retrieval. It integrates the inverted file for text retrieval and the R-tree for spatial proximity query. Several hybrid indexing approaches are explored within the framework. Algorithms utilize the proposed indexes to compute the top-k query by taking into account both text relevancy and location proximity to prune the search space.
In practical applications, the movement of objects needs to be considered. Traditional indexes have good query performance but cannot handle this problem, because they are inefficient for updates, which are crucial for an index over moving objects, as the objects change their positions frequently. Jensen et al. [7] represent moving-object locations as vectors that are time-stamped based on their update time; a B+-tree is then used to index the linearized one-dimensional representations of these vectors.
In this section, we first present the definitions of spatio-temporal keyword data and queries, and then describe the HBase logical table. For simplicity, only two-dimensional space is considered in this paper; however, our method can be directly extended to higher dimensional spaces.
Spatio-temporal data. A record r is modeled as a tuple (x, y, t, W), where (x, y) is the geo-location of r, t is the timestamp at which r was recorded, and W is the set of keywords associated with r.
Spatio-temporal keyword query. Given a set R of spatio-temporal records, a spatio-temporal keyword (STK) query q = (Rq, [ts, te], Wq) consists of a spatial range Rq, a temporal range [ts, te] and a set of query keywords Wq. It retrieves every record rs ∈ R satisfying all of the following conditions:

rs.(x, y) ∈ Rq

rs.t ∈ [ts, te]

rs.W ∩ Wq ≠ ∅
Logical view of an HBase table. Without loss of generality, we give a description of an HBase table. HBase is a distributed key-value database consisting of a number of computing nodes that cooperatively process large-scale data. A physical table in HBase is partitioned into several regions, each of which is maintained by a node. From the logical view, a table is similar to a grid, where a cell can be located by a given row identifier and column identifier. Row identifiers are implemented by row keys, by which the rows of a table are sorted and by which range scans are served.
In this section, we describe a spatio-temporal keyword access (STKA) model for the logical view of an HBase table. The model is built on the Hilbert curve and Bloom filters. We first introduce the Hilbert curve, which maps high-dimensional data into a one-dimensional space, and then describe the access model.
The Hilbert curve is a kind of space filling curve, which maps a multi-dimensional space into a one-dimensional space. In particular, the whole space is partitioned into equal-size cells, and a curve passes through each cell exactly once in some sequence, so that every cell is assigned a sequence number. Different space filling curves are distinguished by different sequencing methods. Due to information loss in the transformation, space filling curves are evaluated by a criterion called locality preservation, i.e., how much the proximities change from the original space to the one-dimensional space. The Hilbert curve is proved to be the best locality-preserving space filling curve [8]. With a Hilbert curve of order n over a two-dimensional space, any object in the original space is transformed into an integer in [0, 2^(2n) - 1].
We describe two functions for the Hilbert curve: one maps a point in the original space to a value in the one-dimensional space, and the other maps a range window to a series of intervals. Specifically, for a Hilbert curve of order n:

● The point-mapping function takes a point (x, y) in the original space and returns its Hilbert value, i.e., the sequence number of the cell containing the point.

● The range-mapping function takes a rectangular window and returns the set of one-dimensional intervals that exactly cover the Hilbert values of the cells intersecting the window.
For instance, Figure 1(b) illustrates how a query window is mapped to intervals of Hilbert values.
For each record r = (x, y, t, W), the spatial coordinates (x, y) are first encoded into a Hilbert value h. The STKA model then uses the concatenation of h and the timestamp t as the row key, so that rows are clustered first by spatial proximity and then by time, and it stores the keywords W of the record together with a Bloom filter built over W for fast keyword detection.
The STKA model is suitable for the logical view of an HBase table. Table 2 shows an example of the model: for each stored record, the row key encodes the Hilbert value and the timestamp, and the columns hold the Bloom filter and the keywords.
Table 2. An example of the STKA model.
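As one possible realization of this layout (the padding widths, the helper name `make_row_key`, and the filter parameters are illustrative assumptions, not the paper's exact encoding): the row key concatenates a zero-padded Hilbert value with a zero-padded timestamp, so that HBase's lexicographic row-key order sorts rows first by spatial code and then by time, and each row carries a small Bloom filter over its keywords.

```python
import hashlib

def make_row_key(hilbert_value, timestamp, h_digits=6, t_digits=10):
    # Zero-padding makes lexicographic row-key order agree with
    # numeric order on (hilbert_value, timestamp).
    return f"{hilbert_value:0{h_digits}d}{timestamp:0{t_digits}d}"

class BloomFilter:
    """A tiny Bloom filter over keyword strings: k hashed probes into m bits.
    It may report false positives but never false negatives, so a hit must
    still be verified against the stored keywords."""
    def __init__(self, m=256, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _probes(self, word):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{word}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.m

    def add(self, word):
        for p in self._probes(word):
            self.bits |= 1 << p

    def might_contain(self, word):
        return all(self.bits >> p & 1 for p in self._probes(word))
```

At query time, each query keyword is first tested with `might_contain`; only rows passing the filter need to be verified against the actual keyword column, which avoids reading keyword data for most non-matching rows.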
In this section, we first describe the processing of STK queries under the original model. Then, considering massive data processing, we use the MapReduce [2] framework to design a parallel algorithm for STK queries.
For an STK query q = (Rq, [ts, te], Wq), Algorithm 1 outlines the processing: the query window Rq is mapped to Hilbert intervals, the corresponding row key ranges restricted to [ts, te] are scanned, and each retrieved row is checked against the query keywords Wq, using the Bloom filter first and exact verification afterwards.
Algorithm 1 Spatio-temporal Keyword Query
Input: an STK query q = (Rq, [ts, te], Wq)
Output: the result set RS
The functions and variables of the algorithm follow the definitions in Section 4: the range-mapping function of the Hilbert curve produces the intervals, and hence the row key ranges, to scan; the Bloom filter of each row is probed to test the existence of the query keywords before exact verification. In line 1, the result set is initialized to be empty, and it is filled as qualified rows are found.
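The flow can be sketched in Python over an in-memory stand-in for the HBase table. The helper name `stk_query` is ours; row keys are assumed to be zero-padded `<hilbert><timestamp>` strings, and a plain set intersection stands in for the Bloom-filter probe plus exact verification.

```python
def stk_query(table, intervals, ts, te, query_keywords, h_digits=6, t_digits=10):
    """table: list of (row_key, record) pairs sorted by row_key, where each
    record is a dict with a 'keywords' set. For every Hilbert interval
    (lo, hi), the row-key range [lo|ts, hi|te] is scanned; since that
    lexicographic range also contains rows of intermediate Hilbert cells whose
    timestamps fall outside [ts, te], the timestamp is re-checked explicitly
    before keyword matching."""
    wq = set(query_keywords)
    results = []
    for lo, hi in intervals:
        start = f"{lo:0{h_digits}d}{ts:0{t_digits}d}"
        stop = f"{hi:0{h_digits}d}{te:0{t_digits}d}"
        for key, rec in table:            # stands in for one HBase range scan
            if start <= key <= stop and ts <= int(key[h_digits:]) <= te:
                if rec["keywords"] & wq:  # Bloom probe + exact verification
                    results.append(rec)
    return results
```

The explicit timestamp check is the interesting part: one lexicographic scan per interval is enough, but rows of in-between Hilbert cells with out-of-range times must still be filtered out.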
As mentioned above, with the development of wireless communication and positioning technology, more and more location data are collected, so massive data processing must be considered. For an STK query with large range predicates, a great number of rows are retrieved, which becomes a performance bottleneck. Consequently, it is necessary to design a parallel algorithm in which multiple computing nodes process the query simultaneously, so that throughput and efficiency increase. Based on the MapReduce framework, we design a parallel algorithm for STK queries.
The basic workflow is as follows. First, an intermediate file containing all the row key ranges involved in the query is generated, in which each record corresponds to one row key range in HBase. The Map procedure then reads the records of the file and emits them to the Reduce procedure. In the Reduce procedure, the corresponding nodes retrieve and examine the rows simultaneously to obtain the results. Algorithm 2 describes the parallel algorithm for STK queries based on the MapReduce framework.
Algorithm 2 Parallel Spatio-temporal Keyword Query
Input: an STK query q = (Rq, [ts, te], Wq)
Output: the result set RS
/* generating the intermediate file of row key ranges */
/* Map input: the intermediate file; output: its records, emitted to Reduce */
/* Reduce input: output of Map; the corresponding rows are scanned and verified in parallel */
The first part of the algorithm generates the intermediate file of row key ranges. The Map procedure simply reads the records of the file and emits them, and the Reduce procedure performs, for each received range, the same scanning and verification steps as the serial algorithm, but on multiple nodes simultaneously.
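The three phases can be emulated in plain Python (the function names and the thread pool are our illustrative stand-ins; a real deployment would implement Map and Reduce with the Hadoop API and HBase scans): phase one materializes the row key ranges, the map phase emits one record per range, and the reduce phase scans and verifies the ranges concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_ranges(intervals, ts, te, h_digits=6, t_digits=10):
    # Phase 1: the intermediate file -- one (start, stop) row-key range
    # per Hilbert interval, restricted to the query's temporal range.
    return [(f"{lo:0{h_digits}d}{ts:0{t_digits}d}",
             f"{hi:0{h_digits}d}{te:0{t_digits}d}")
            for lo, hi in intervals]

def map_phase(ranges):
    # Map: read the intermediate records and emit them, keyed for the reducers.
    return list(enumerate(ranges))

def reduce_phase(table, mapped, ts, te, query_keywords, h_digits=6):
    # Reduce: each task scans and verifies one row-key range; tasks run
    # concurrently, standing in for region servers working in parallel.
    wq = set(query_keywords)

    def scan(rng):
        start, stop = rng
        return [rec for key, rec in table
                if start <= key <= stop
                and ts <= int(key[h_digits:]) <= te
                and rec["keywords"] & wq]

    with ThreadPoolExecutor() as pool:
        parts = list(pool.map(scan, [rng for _, rng in mapped]))
    return [rec for part in parts for rec in part]
```

Splitting work by row-key range means each reducer touches a disjoint slice of the table, so no coordination is needed beyond the final concatenation of partial results.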
We evaluate our algorithms on a real dataset, which contains trajectories of taxis in Beijing. In particular, each record in the dataset contains a vehicle ID, a geo-location, a recording timestamp, etc. For each record, we randomly assign a list of keywords whose size varies from 7 to 15 words. For comparison, we extract 5 datasets of different sizes from the original one in terms of temporal range. Table 3 shows the datasets in detail.
Dataset | Temporal Range | Size (records) |
dataset1 | 0:00-0:48, Nov. 1st | 1 million |
dataset2 | 0:00-8:25, Nov. 1st | 10 million |
dataset3 | 0:00-23:59, Nov. 1st | 30 million |
dataset4 | Nov. 1st & Nov. 2nd | 60 million |
dataset5 | Nov. 1st, Nov. 2nd & Nov. 3rd | 100 million |
Our algorithms are implemented in Java and run on a three-node cluster with Hadoop 2.5.1 and HBase 0.98.6, in which each node is equipped with an Intel(R) Core(TM) i3 CPU @ 3.40GHz, 4GB main memory, and 500GB storage; the operating system is CentOS release 6.5 (64-bit), and the network bandwidth is 10Mbps.
First, we evaluate the algorithm for STK queries. We introduce two parameters to test the algorithm under various conditions: one is the selectivity θ, and the other is the number of query keywords. The selectivity is defined as

θ = (L(ts, te) / Lt) ⋅ (A(Rq) / A(S))

where L(ts, te) is the length of the query temporal range, Lt is the temporal span of the whole dataset, A(Rq) is the area of the query window Rq, and A(S) is the area of the whole space.
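As a worked example of the selectivity (the numbers are illustrative, not from the experiments): a one-hour query over a one-day dataset, with a 1 km × 1 km window in a 40 km × 40 km space, gives θ = (1/24) × (1/1600) ≈ 2.6 × 10⁻⁵. In code:

```python
def selectivity(ts, te, t_total, query_area, space_area):
    """theta = temporal fraction of the query times its spatial fraction."""
    return ((te - ts) / t_total) * (query_area / space_area)

# 1 hour out of 24 hours, a 1 km^2 window in a 1600 km^2 space
theta = selectivity(0, 3600, 86400, 1.0, 40.0 * 40.0)
```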
Firstly, we fix the number of query keywords and vary the selectivity θ to observe the response time.
We can see that the response time increases with θ, since a larger selectivity means more row key ranges to scan and more rows to verify.
Next, we fix the selectivity θ and vary the number of query keywords.
Similarly, the response time increases with the number of keywords, since more keywords incur more Bloom filter probes and more verification work per row.
After studying the performance of STK queries, we compare the parallel STK query with the original one to observe the improvement. For both, we issue 5 different queries on dataset1, and Table 4 shows the conditions in detail.
Conditions | Temporal Range (min) | Spatial Range (km) | Number of Keywords | Number of Scanned Records
c1 | 10 | 1 | 3 | 3987 |
c2 | 20 | 2 | 5 | 18786 |
c3 | 30 | 3 | 5 | 42457 |
c4 | 40 | 4 | 7 | 99789 |
c5 | 50 | 5 | 10 | 108040 |
Figure 4 shows the results. We can see that at the beginning, i.e., when the number of scanned records is small, the STK query is faster than the parallel one, because the cost of writing and parsing the intermediate file in the MapReduce procedure impacts the performance. However, as the query range is enlarged, the parallel STK query becomes more efficient than the original one. This can be explained by the parallelism of query processing, which allows accessing and comparing to operate simultaneously. Therefore, parallel STK queries are applicable to conditions with large selectivity.
In this paper, we propose spatio-temporal keyword queries in HBase, which is a new problem for the HBase platform. We devise a proper access model for HBase, utilizing the Hilbert curve and Bloom filters. Two algorithms are developed, suitable for ordinary and large query selectivities, respectively. We conduct experiments on a real dataset, and the results show that our methods are capable of handling large-scale spatio-temporal keyword queries.
In the future, we plan to extend our work to the inner structure of the HBase index table, in order to improve the efficiency of spatio-temporal keyword queries.
This work is supported by NSF of China grant 61303062. We would like to thank Peijun He for helping with the implementation.
[1] | HBase, 2015. Available from: http://hbase.apache.org. |
[2] | Hadoop, 2015. Available from: http://hadoop.apache.org. |
[3] | J. Blustein and A. El-Maazawi, Bloom filters. A tutorial, analysis, and survey, Halifax, NS: Dalhousie University, (2002), 1-31. |
[4] | C. Cheng, C. Sun, X. Xu and D. Zhang, A multi-dimensional index structure based on improved VA-file and CAN in the cloud, International Journal of Automation and Computing, 11 (2014), 109-117. |
[5] | G. Cong, C. S. Jensen and D. Wu, Efficient retrieval of the top-k most relevant spatial web objects, VLDB Endowment, 2 (2009), 337-348. |
[6] | I. D. Felipe, V. Hristidis and N. Rishe, Keyword search on spatial databases, In ICDE, (2008), 656-665. |
[7] | C. S. Jensen, D. Lin and B. C. Ooi, Query and update efficient B+-tree based indexing of moving objects, VLDB Endowment, 30 (2004), 768-779. |
[8] | B. Moon, H. V. Jagadish, C. Faloutsos and J. H. Saltz, Analysis of the clustering properties of the Hilbert space-filling curve, IEEE Transactions on Knowledge and Data Engineering, 13 (2001), 124-141. |
[9] | S. Nishimura, S. Das, D. Agrawal and A. E. Abbadi, MD-HBase: A scalable multidimensional data infrastructure for location aware services, In MDM, 1 (2011), 7-16. |
[10] | W. Zhou, J. Lu, Z. Luan, S. Wang, G. Xue and S. Yao, SNB-index: A SkipNet and B+ tree based auxiliary cloud index, Cluster Computing, 17 (2014), 453-462. |