Research article

A nonparametric copula distribution framework for bivariate joint distribution analysis of flood characteristics for the Kelantan River basin in Malaysia

  • Received: 23 March 2020 Accepted: 11 May 2020 Published: 19 May 2020
  • The joint distribution analysis of multidimensional flood characteristics, i.e., flood peak flow, volume and duration, often facilitates a comprehensive understanding in hydrologic risk assessments. Copula-based methodologies are frequently incorporated via a parametric approach to model the dependence structure of parametric univariate marginal distributions. However, assuming that the targeted copulas and univariate marginal distributions belong to specific parametric families can be problematic if the underlying assumptions are violated. Also, no universal rule in the literature requires hydrologic vectors and their joint dependence structure to be modelled through any fixed or pre-defined distributions. In this study, a nonparametric copula simulation is applied as a case study to 50 years of annual maximum flood samples of the Kelantan River basin at the Gulliemard bridge station in Malaysia. Parametric and nonparametric marginal distributions are separately conjoined by a nonparametric copula framework based on the Beta kernel function. The Beta kernel copula function is used to estimate the bivariate copula density, which is further used to derive the joint cumulative density of flood peak-volume, volume-duration and peak-duration pairs and their associated joint as well as conditional return periods.

    Citation: Shahid Latif, Firuza Mustafa. A nonparametric copula distribution framework for bivariate joint distribution analysis of flood characteristics for the Kelantan River basin in Malaysia[J]. AIMS Geosciences, 2020, 6(2): 171-198. doi: 10.3934/geosci.2020012

    Related Papers:

    [1] Guojun Gan, Qiujun Lan, Shiyang Sima . Scalable Clustering by Truncated Fuzzy c-means. Big Data and Information Analytics, 2016, 1(2): 247-259. doi: 10.3934/bdia.2016007
    [2] Marco Tosato, Jianhong Wu . An application of PART to the Football Manager data for players clusters analyses to inform club team formation. Big Data and Information Analytics, 2018, 3(1): 43-54. doi: 10.3934/bdia.2018002
    [3] Jinyuan Zhang, Aimin Zhou, Guixu Zhang, Hu Zhang . A clustering based mate selection for evolutionary optimization. Big Data and Information Analytics, 2017, 2(1): 77-85. doi: 10.3934/bdia.2017010
    [4] Zhouchen Lin . A Review on Low-Rank Models in Data Analysis. Big Data and Information Analytics, 2016, 1(2): 139-161. doi: 10.3934/bdia.2016001
    [5] Pawan Lingras, Farhana Haider, Matt Triff . Fuzzy temporal meta-clustering of financial trading volatility patterns. Big Data and Information Analytics, 2017, 2(3): 219-238. doi: 10.3934/bdia.2017018
    [6] Yaguang Huangfu, Guanqing Liang, Jiannong Cao . MatrixMap: Programming abstraction and implementation of matrix computation for big data analytics. Big Data and Information Analytics, 2016, 1(4): 349-376. doi: 10.3934/bdia.2016015
    [7] Ming Yang, Dunren Che, Wen Liu, Zhao Kang, Chong Peng, Mingqing Xiao, Qiang Cheng . On identifiability of 3-tensors of multilinear rank (1; Lr; Lr). Big Data and Information Analytics, 2016, 1(4): 391-401. doi: 10.3934/bdia.2016017
    [8] Subrata Dasgupta . Disentangling data, information and knowledge. Big Data and Information Analytics, 2016, 1(4): 377-390. doi: 10.3934/bdia.2016016
    [9] Robin Cohen, Alan Tsang, Krishna Vaidyanathan, Haotian Zhang . Analyzing opinion dynamics in online social networks. Big Data and Information Analytics, 2016, 1(4): 279-298. doi: 10.3934/bdia.2016011
    [10] Ugo Avila-Ponce de León, Ángel G. C. Pérez, Eric Avila-Vales . A data driven analysis and forecast of an SEIARD epidemic model for COVID-19 in Mexico. Big Data and Information Analytics, 2020, 5(1): 14-28. doi: 10.3934/bdia.2020002


    1. Introduction

    In data clustering or cluster analysis, the goal is to divide a set of objects into homogeneous groups called clusters [10,18,20,26,12,1]. For high-dimensional data, clusters are usually formed in subspaces of the original data space and different clusters may relate to different subspaces. To recover clusters embedded in subspaces, subspace clustering algorithms have been developed, see for example [2,15,19,17,9,21,16,22,3,25,7,11,13]. Subspace clustering algorithms can be classified into two categories: hard subspace clustering algorithms and soft subspace clustering algorithms.

    In hard subspace clustering algorithms, the subspaces in which clusters are embedded are determined exactly. In other words, each attribute of the data is either associated with a cluster or not associated with the cluster. For example, the subspace clustering algorithms developed in [2] and [15] are hard subspace clustering algorithms. In soft subspace clustering algorithms, the subspaces of clusters are not determined exactly. Each attribute is associated with a cluster with some probability. If an attribute is important to the formation of a cluster, then the attribute is associated with the cluster with high probability. Examples of soft subspace clustering algorithms include [19], [9], [21], [16], and [13].

    In soft subspace clustering algorithms, the attribute weights associated with clusters are automatically determined. In general, the weight of an attribute for a cluster is inversely proportional to the dispersion of the attribute in the cluster. If the values of an attribute in a cluster are relatively compact, then the attribute will be assigned a relatively high weight. In the FSC algorithm [16], for example, the attribute weights are calculated as

    $$w_{lj} = \frac{1}{\sum_{h=1}^{d} \left( \frac{V_{lj}+\epsilon}{V_{lh}+\epsilon} \right)^{\frac{1}{\alpha-1}}}, \quad l=1,2,\ldots,k, \; j=1,2,\ldots,d, \tag{1}$$

    where $\epsilon$ is a small positive number used to prevent dividing by zero, $\alpha>1$ is a parameter used to control the smoothness of the attribute weights, and

    $$V_{lj} = \sum_{x \in C_l} (x_j - z_{lj})^2. \tag{2}$$

    Here $k$ is the number of clusters, $d$ is the number of attributes, and $z_l$ is the center of the $l$th cluster $C_l$. In the EWKM algorithm [21], the attribute weights are calculated as

    $$w_{lj} = \frac{\exp\left(-\frac{V_{lj}}{\gamma}\right)}{\sum_{s=1}^{d} \exp\left(-\frac{V_{ls}}{\gamma}\right)}, \quad l=1,2,\ldots,k, \; j=1,2,\ldots,d, \tag{3}$$

    where $\gamma>0$ is a parameter used to control the smoothness of the attribute weights.
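
    As an illustration of Equation (1), the following minimal sketch computes FSC-style weights for a single cluster from its attribute dispersions. The function name `fsc_weights` and the default values of `alpha` and `eps` are assumptions made for this example and are not taken from [16].

```python
def fsc_weights(dispersions, alpha=2.0, eps=1e-4):
    """FSC-style attribute weights of one cluster, following Equation (1).

    dispersions: the values V_l1, ..., V_ld of one cluster.
    alpha, eps:  illustrative defaults (assumptions, not from the paper).
    """
    exponent = 1.0 / (alpha - 1.0)
    weights = []
    for v_j in dispersions:
        # Denominator of Equation (1): sum over all attributes h
        denom = sum(((v_j + eps) / (v_h + eps)) ** exponent for v_h in dispersions)
        weights.append(1.0 / denom)
    return weights


w = fsc_weights([10.0, 30.0])           # the more compact attribute gets the larger weight
w_zero = fsc_weights([0.0, 30.0])       # works because eps > 0
try:
    fsc_weights([0.0, 30.0], eps=0.0)   # a zero dispersion with eps = 0 divides by zero
    failed_without_eps = False
except ZeroDivisionError:
    failed_without_eps = True
```

    The last call shows why a positive ϵ is needed as soon as an attribute has identical values within a cluster.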

    One drawback of the FSC algorithm is that a positive value of $\epsilon$ is required in order to prevent dividing by zero when an attribute has identical values in a cluster. Using the entropy weighting, the EWKM algorithm does not suffer from the problem of dividing by zero. However, the attribute weights calculated in the EWKM algorithm are sensitive to the parameter $\gamma$ when the range of the attribute dispersions (e.g., $V_{lj}$) in a cluster is large. For example, suppose that a dataset has two attributes, whose dispersions in a cluster are 10 and 30, respectively. If we use a small value of $\gamma$ such as $\gamma=1$, the attribute weights will be

    $$w_1 = \frac{e^{-10}}{e^{-10}+e^{-30}} = \frac{1}{1+e^{-20}} \approx 1, \quad w_2 = \frac{e^{-30}}{e^{-10}+e^{-30}} = \frac{1}{1+e^{20}} \approx 0.$$

    If we use $\gamma=10$, the attribute weights will be

    $$w_1 = \frac{e^{-1}}{e^{-1}+e^{-3}} = \frac{1}{1+e^{-2}} \approx 0.88, \quad w_2 = \frac{e^{-3}}{e^{-1}+e^{-3}} = \frac{1}{1+e^{2}} \approx 0.12.$$

    From the above example we see that choosing an appropriate value for the parameter $\gamma$ is a difficult task when the range of the attribute dispersions in a cluster is large. Feature group weighting has been introduced to address the issue [7,14].
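
    The two-attribute example above can be reproduced with a few lines of code. This is a minimal sketch; the function name `entropy_weights` is an assumption, and the shift by the smallest dispersion is added only for numerical stability (it does not change the resulting weights).

```python
import math


def entropy_weights(dispersions, gamma):
    """Exponential normalization of -V/gamma, as used for the EWKM weights."""
    shift = min(dispersions)  # cancels in the ratio, avoids underflow of exp()
    terms = [math.exp(-(v - shift) / gamma) for v in dispersions]
    total = sum(terms)
    return [t / total for t in terms]


w_small_gamma = entropy_weights([10.0, 30.0], gamma=1.0)   # one attribute dominates
w_large_gamma = entropy_weights([10.0, 30.0], gamma=10.0)  # roughly 0.88 and 0.12
```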

    In this paper, we address the issue from a different perspective. Unlike the group feature weighting approach, the approach we employ in this paper involves using the log transformation to transform the distances so that the attribute weights are not dominated by a single attribute with the smallest dispersion. In particular, we present a soft subspace clustering algorithm called the LEKM algorithm (log-transformed entropy weighting k-means) to address the aforementioned problem. The LEKM algorithm extends the EWKM algorithm by using log-transformed distances in its objective function. The resulting attribute dispersions in a cluster are more compact than those from the EWKM algorithm. Due to the small difference of the attribute dispersions, the LEKM algorithm is less sensitive to the parameter than other soft subspace clustering algorithms are.

    The remaining part of this paper is structured as follows. In Section 2, we give a brief review of the LAC algorithm [9] and the EWKM algorithm [21]. In Section 3, we present the LEKM algorithm in detail. In Section 4, we present numerical experiments to demonstrate the performance of the LEKM algorithm. Section 5 concludes the paper with some remarks.


    2. Related work

    In this section, we introduce the EWKM algorithm [21] and the LAC algorithm [9], which are soft subspace clustering algorithms using the entropy weighting.


    2.1. The EWKM algorithm

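    Let $x_1, x_2, \ldots, x_n$ be $n$ data points, each of which is described by $d$ attributes. Let $k$ be the desired number of clusters. Then the objective function of the EWKM algorithm is defined as follows [21]:

    $$F(U,W,Z) = \sum_{l=1}^{k} \left[ \sum_{i=1}^{n} \sum_{j=1}^{d} u_{il} w_{lj} (x_{ij}-z_{lj})^2 + \gamma \sum_{j=1}^{d} w_{lj} \ln w_{lj} \right], \tag{4}$$

    where $\gamma>0$ is a parameter, $U=(u_{il})_{n\times k}$ is an $n\times k$ partition matrix, and $W=(w_{lj})_{k\times d}$ is a $k\times d$ weight matrix. In addition, the partition matrix $U$ and the weight matrix $W$ satisfy the following conditions:

    $$\sum_{l=1}^{k} u_{il} = 1, \quad i=1,2,\ldots,n, \tag{5a}$$
    $$u_{il} \in \{0,1\}, \quad i=1,2,\ldots,n, \; l=1,2,\ldots,k, \tag{5b}$$
    $$\sum_{j=1}^{d} w_{lj} = 1, \quad l=1,2,\ldots,k, \tag{5c}$$

    and

    $$w_{lj} > 0, \quad l=1,2,\ldots,k, \; j=1,2,\ldots,d. \tag{5d}$$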

    Like the k-means algorithm [23,4], the EWKM algorithm tries to minimize the objective function using an iterative process. At the beginning, the EWKM algorithm initializes the cluster centers by selecting k points from the dataset randomly and initializes the attribute weights with equal values. Then the EWKM algorithm keeps updating U, W, and Z one at a time by fixing the other two. Given W and Z, the partition matrix U is updated as

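    $$u_{il} = \begin{cases} 1, & \text{if } \sum_{j=1}^{d} w_{lj}(x_{ij}-z_{lj})^2 \le \sum_{j=1}^{d} w_{sj}(x_{ij}-z_{sj})^2 \text{ for } 1 \le s \le k, \\ 0, & \text{otherwise}, \end{cases}$$

    for $i=1,2,\ldots,n$ and $l=1,2,\ldots,k$. Given $U$ and $Z$, the weight matrix $W$ is updated as

    $$w_{lj} = \frac{\exp\left(-\frac{V_{lj}}{\gamma}\right)}{\sum_{s=1}^{d} \exp\left(-\frac{V_{ls}}{\gamma}\right)}$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$, where

    $$V_{lj} = \sum_{i=1}^{n} u_{il} (x_{ij}-z_{lj})^2.$$

    Given $U$ and $W$, the cluster centers are updated as

    $$z_{lj} = \frac{\sum_{i=1}^{n} u_{il} x_{ij}}{\sum_{i=1}^{n} u_{il}}$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$. The runtime complexity of one iteration of the EWKM algorithm is $O(nkd)$.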
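
    The three update steps described above can be sketched as a single function. This is an illustrative sketch rather than the implementation of [21]: the function name `ewkm_iteration`, the tie-breaking rule (the first closest cluster wins), and the handling of empty clusters (the old center and weights are kept) are assumptions.

```python
import math


def ewkm_iteration(X, Z, W, gamma):
    """One EWKM-style pass: update U (as labels), then W, then Z."""
    n, d, k = len(X), len(X[0]), len(Z)

    # Update U: assign each point to the cluster with the smallest weighted distance
    labels = []
    for x in X:
        dists = [sum(W[l][j] * (x[j] - Z[l][j]) ** 2 for j in range(d)) for l in range(k)]
        labels.append(dists.index(min(dists)))

    new_W, new_Z = [], []
    for l in range(k):
        members = [X[i] for i in range(n) if labels[i] == l]
        if not members:                      # assumption: keep the old values
            new_W.append(list(W[l]))
            new_Z.append(list(Z[l]))
            continue
        # Update W: exponential normalization of the dispersions V_lj
        V = [sum((p[j] - Z[l][j]) ** 2 for p in members) for j in range(d)]
        shift = min(V)                       # for numerical stability only
        terms = [math.exp(-(v - shift) / gamma) for v in V]
        total = sum(terms)
        new_W.append([t / total for t in terms])
        # Update Z: plain average of the points in the cluster
        new_Z.append([sum(p[j] for p in members) / len(members) for j in range(d)])
    return labels, new_W, new_Z


X = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, W, Z = ewkm_iteration(X, Z=[[0.0, 0.0], [10.0, 10.0]],
                              W=[[0.5, 0.5], [0.5, 0.5]], gamma=1.0)
```

    Each of the three steps loops over points, clusters, and attributes at most once, which matches the O(nkd) cost per iteration stated above.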

    The parameter $\gamma$ in the EWKM algorithm is used to control the smoothness of the attribute weights. If $\gamma$ approaches infinity, then all attributes have the same weight. In such cases, the EWKM algorithm becomes the standard k-means algorithm. Since the attribute weights are based on exponential normalization, the weights are sensitive to the parameter $\gamma$ when the attribute dispersions (e.g., $V_{lj}$) have a wide range.


    2.2. The LAC algorithm

    The LAC algorithm (Locally Adaptive Clustering) [9] and the EWKM algorithm are similar soft subspace clustering algorithms in that both algorithms discover subspace clusters via exponential weighting of attributes. However, the LAC algorithm differs from the EWKM algorithm in the definition of objective function. Clusters found by the LAC algorithm are referred to as weighted clusters. The objective function of the LAC algorithm is defined as

    $$E(C,Z,W) = \sum_{l=1}^{k} \sum_{j=1}^{d} \left( w_{lj} \frac{1}{|C_l|} \sum_{x \in C_l} (x_j - z_{lj})^2 + h\, w_{lj} \log w_{lj} \right), \tag{6}$$

    where $k$ is the number of clusters, $d$ is the number of attributes, $Z=\{z_1,z_2,\ldots,z_k\}$ is a set of cluster centers, $W=(w_{lj})_{k\times d}$ is a weight matrix, $C=\{C_1,C_2,\ldots,C_k\}$ is a set of clusters, and $h>0$ is a parameter. The weight matrix also satisfies the conditions given in Equations (5c) and (5d).

    Like the k-means algorithm and the EWKM algorithm, the LAC algorithm also employs an iterative process to optimize the objective function. Similar to the EWKM algorithm, the LAC algorithm initializes the cluster centers by selecting k points from the dataset randomly and initializes the attribute weights with equal values. Given the set of cluster centers Z and the set of weight vectors W, the clusters are determined as follows:

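    $$S_l = \left\{ x : \sum_{j=1}^{d} w_{lj}(x_j - z_{lj})^2 < \sum_{j=1}^{d} w_{sj}(x_j - z_{sj})^2, \; \forall s \neq l \right\} \tag{7}$$

    for $l=1,2,\ldots,k$. Given the set of cluster centers $Z$ and the set of clusters $\{S_1,S_2,\ldots,S_k\}$, the set of weight vectors is determined as follows:

    $$w_{lj} = \frac{\exp(-V_{lj}/h)}{\sum_{s=1}^{d} \exp(-V_{ls}/h)} \tag{8}$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$, where

    $$V_{lj} = \frac{1}{|S_l|} \sum_{x \in S_l} (x_j - z_{lj})^2.$$

    Given the set of clusters $\{S_1,S_2,\ldots,S_k\}$, the cluster centers are updated as follows:

    $$z_{lj} = \frac{1}{|S_l|} \sum_{x \in S_l} x_j \tag{9}$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$. The runtime complexity of one iteration of the LAC algorithm is $O(nkd)$.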

    Comparing Equation (6) with Equation (4), we see that the distances in the objective function of the LAC algorithm are normalized by the sizes of the corresponding clusters. As a result, the dispersions (i.e., $V_{lj}$) calculated in the LAC algorithm are smaller than those calculated in the EWKM algorithm. However, the dispersions calculated in the LAC algorithm can still have a wide range for small-sample high-dimensional data such as gene expression data [8].
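
    The effect of the cluster-size normalization can be seen on a toy cluster. The sketch below is only an illustration: the helper names `dispersions` and `exponential_weights` and the four sample points are assumptions chosen for this example.

```python
import math


def dispersions(points, center, normalize_by_size):
    """Per-attribute sums of squared deviations, optionally divided by the cluster size."""
    d = len(center)
    sums = [sum((p[j] - center[j]) ** 2 for p in points) for j in range(d)]
    if normalize_by_size:
        return [s / len(points) for s in sums]   # LAC-style dispersions
    return sums                                  # EWKM-style dispersions


def exponential_weights(values, scale):
    """Exponential normalization of the negated dispersions divided by `scale`."""
    shift = min(values)  # cancels in the ratio; used for numerical stability
    terms = [math.exp(-(v - shift) / scale) for v in values]
    total = sum(terms)
    return [t / total for t in terms]


cluster = [[0.0, 0.0], [1.0, 3.0], [-1.0, -3.0], [0.0, 0.0]]
center = [0.0, 0.0]
v_ewkm = dispersions(cluster, center, normalize_by_size=False)  # [2.0, 18.0]
v_lac = dispersions(cluster, center, normalize_by_size=True)    # [0.5, 4.5]
w_ewkm = exponential_weights(v_ewkm, scale=2.0)
w_lac = exponential_weights(v_lac, scale=2.0)
```

    With the same scale parameter, the normalized dispersions differ by 4 instead of 16, so the LAC-style weights are less extreme than the EWKM-style ones, although the gap still grows with the spread of the data.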


    3. The LEKM algorithm

    In this section, we present the LEKM algorithm. The LEKM algorithm is similar to the EWKM algorithm [21] and the LAC algorithm [9] in that the entropy weighting is used to determine the attribute weights.

    Let $X=\{x_1,x_2,\ldots,x_n\}$ be a dataset containing $n$ points, each of which is described by $d$ numerical features or attributes. Let $Z=\{z_1,z_2,\ldots,z_k\}$ be a set of cluster centers, where $k$ is the number of clusters. Then the objective function of the LEKM algorithm is defined as

    $$\begin{aligned} P(U,W,Z) &= \sum_{l=1}^{k}\sum_{i=1}^{n} u_{il} \sum_{j=1}^{d} w_{lj} \ln\left[1+(x_{ij}-z_{lj})^2\right] + \lambda \sum_{l=1}^{k}\sum_{i=1}^{n} u_{il} \sum_{j=1}^{d} w_{lj} \ln w_{lj} \\ &= \sum_{l=1}^{k}\sum_{i=1}^{n} u_{il} \left[ \sum_{j=1}^{d} w_{lj} \ln\left[1+(x_{ij}-z_{lj})^2\right] + \lambda \sum_{j=1}^{d} w_{lj} \ln w_{lj} \right], \end{aligned} \tag{10}$$

    where $U=(u_{il})_{n\times k}$ is an $n\times k$ binary matrix satisfying Equations (5a) and (5b), $W=(w_{lj})_{k\times d}$ is a $k\times d$ matrix satisfying Equations (5c) and (5d), and $\lambda>0$ is a parameter. In the above equation, $x_{ij}$ and $z_{lj}$ denote the values of $x_i$ and $z_l$ in the $j$th attribute, respectively. The matrix $U$ is the partition matrix in the following sense: if $u_{il}=1$, then the point $x_i$ belongs to the $l$th cluster. The matrix $W$ is the weight matrix containing the attribute weights. If $w_{lj}$ is relatively large, then the $j$th attribute is important for the formation of the $l$th cluster.

    Similar to the EWKM algorithm, the LEKM algorithm tries to minimize the objective function given in Equation (10) iteratively by finding the optimal value of U, W, and Z according to the following theorems.

    Theorem 3.1. Let $W$ and $Z$ be fixed. Then the partition matrix $U$ that minimizes the objective function $P(U,W,Z)$ is given by

    $$u_{il} = \begin{cases} 1, & \text{if } D(x_i,z_l) \le D(x_i,z_s) \text{ for all } s=1,2,\ldots,k; \\ 0, & \text{otherwise}, \end{cases} \tag{11}$$

    for $i=1,2,\ldots,n$ and $l=1,2,\ldots,k$, where

    $$D(x_i,z_s) = \sum_{j=1}^{d} w_{sj} \ln\left[1+(x_{ij}-z_{sj})^2\right] + \lambda \sum_{j=1}^{d} w_{sj} \ln w_{sj}.$$

    Proof. Since $W$ and $Z$ are fixed and the rows of the partition matrix $U$ are independent of each other, the objective function is minimized if for each $i=1,2,\ldots,n$, the following function

    $$f(u_{i1},u_{i2},\ldots,u_{ik}) = \sum_{l=1}^{k} u_{il} D(x_i,z_l) \tag{12}$$

    is minimized. Note that $u_{il}\in\{0,1\}$ and

    $$\sum_{l=1}^{k} u_{il} = 1.$$

    The function defined in Equation (12) is minimized if Equation (11) holds. This completes the proof.
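
    The assignment rule of Theorem 3.1 can be sketched as follows. The function names `lekm_distance` and `assign` are assumptions made for this example, and ties are broken in favor of the first closest cluster, which the theorem leaves open.

```python
import math


def lekm_distance(x, z, w, lam):
    """D(x, z) of Theorem 3.1 for one cluster with center z and weights w."""
    log_term = sum(w_j * math.log(1.0 + (x_j - z_j) ** 2)
                   for x_j, z_j, w_j in zip(x, z, w))
    entropy_term = lam * sum(w_j * math.log(w_j) for w_j in w)  # requires w_j > 0
    return log_term + entropy_term


def assign(x, Z, W, lam):
    """Index l of the cluster minimizing D(x, z_l), i.e. the l with u_il = 1."""
    dists = [lekm_distance(x, Z[l], W[l], lam) for l in range(len(Z))]
    return dists.index(min(dists))


Z = [[0.0, 0.0], [5.0, 5.0]]
W = [[0.5, 0.5], [0.5, 0.5]]
label_near_first = assign([0.1, 0.0], Z, W, lam=1.0)
label_near_second = assign([4.8, 5.3], Z, W, lam=1.0)
```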

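    Theorem 3.2. Let $U$ and $Z$ be fixed. Then the weight matrix $W$ that minimizes the objective function $P(U,W,Z)$ is given by

    $$w_{lj} = \frac{\exp\left(-\frac{V_{lj}}{\lambda}\right)}{\sum_{s=1}^{d} \exp\left(-\frac{V_{ls}}{\lambda}\right)} \tag{13}$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$, where

    $$V_{lj} = \frac{\sum_{i=1}^{n} u_{il} \ln\left[1+(x_{ij}-z_{lj})^2\right]}{\sum_{i=1}^{n} u_{il}}.$$

    Proof. The weight matrix $W$ that minimizes the objective function $P(U,W,Z)$ subject to

    $$\sum_{j=1}^{d} w_{lj} = 1, \quad l=1,2,\ldots,k,$$

    is the matrix $W$ that minimizes the following function

    $$\begin{aligned} f(W) &= P(U,W,Z) + \sum_{l=1}^{k} \beta_l \left( \sum_{j=1}^{d} w_{lj} - 1 \right) \\ &= \sum_{l=1}^{k}\sum_{i=1}^{n} u_{il} \left[ \sum_{j=1}^{d} w_{lj} \ln\left[1+(x_{ij}-z_{lj})^2\right] + \lambda \sum_{j=1}^{d} w_{lj} \ln w_{lj} \right] + \sum_{l=1}^{k} \beta_l \left( \sum_{j=1}^{d} w_{lj} - 1 \right). \end{aligned} \tag{14}$$

    The weight matrix $W$ that minimizes Equation (14) satisfies the following equations

    $$\frac{\partial f(W)}{\partial w_{lj}} = \sum_{i=1}^{n} u_{il} \left( \ln\left[1+(x_{ij}-z_{lj})^2\right] + \lambda \ln w_{lj} + \lambda \right) + \beta_l = 0$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$, and

    $$\frac{\partial f(W)}{\partial \beta_l} = \sum_{j=1}^{d} w_{lj} - 1 = 0$$

    for $l=1,2,\ldots,k$. Dividing the first equation by $\lambda\sum_{i=1}^{n}u_{il}$ gives $\ln w_{lj} = -V_{lj}/\lambda - 1 - \beta_l/(\lambda\sum_{i=1}^{n}u_{il})$, so $w_{lj}$ is proportional to $\exp(-V_{lj}/\lambda)$, and the second equation fixes the normalizing constant. Solving the above equations therefore leads to Equation (13).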

    From Equation (13) we see that the attribute weights of the $l$th cluster are the exponential normalizations of $V_{l1}, V_{l2}, \ldots, V_{ld}$. Since $V_{lj}$ is an average of log-transformed distances, the range of the magnitudes of $V_{l1}, V_{l2}, \ldots, V_{ld}$ is small. Hence the weights are less sensitive to the parameter $\lambda$.
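
    To illustrate how the log transformation compresses the dispersions, the sketch below evaluates Equation (13) for a contrived single-point cluster whose squared deviations from the center are 10 and 30, echoing the two-attribute example in the introduction. The function name `lekm_weights` and the test point are assumptions made for this example.

```python
import math


def lekm_weights(points, center, lam):
    """Attribute weights of one cluster following Equation (13)."""
    d = len(center)
    # V_lj: average of ln[1 + (x_ij - z_lj)^2] over the points of the cluster
    v = [sum(math.log(1.0 + (p[j] - center[j]) ** 2) for p in points) / len(points)
         for j in range(d)]
    shift = min(v)  # cancels in the ratio; used for numerical stability
    terms = [math.exp(-(v_j - shift) / lam) for v_j in v]
    total = sum(terms)
    return [t / total for t in terms]


# Squared deviations of 10 and 30 become ln(11) and ln(31) after the transformation
point = [math.sqrt(10.0), math.sqrt(30.0)]
w = lekm_weights([point], [0.0, 0.0], lam=1.0)
```

    In this setup the weights are 31/42 and 11/42 (about 0.74 and 0.26) for λ=1, whereas the untransformed dispersions 10 and 30 with γ=1 put essentially all the weight on one attribute.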

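    Theorem 3.3. Let $U$ and $W$ be fixed. Then the set of cluster centers $Z$ that minimizes the objective function $P(U,W,Z)$ satisfies the following nonlinear equations

    $$z_{lj} = \frac{\sum_{i=1}^{n} u_{il} \left[1+(x_{ij}-z_{lj})^2\right]^{-1} x_{ij}}{\sum_{i=1}^{n} u_{il} \left[1+(x_{ij}-z_{lj})^2\right]^{-1}} \tag{15}$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$.

    Proof. If the set of cluster centers $Z$ minimizes the objective function $P(U,W,Z)$, then for all $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$, the derivative of $P(U,W,Z)$ with respect to $z_{lj}$ is equal to zero. In other words, we have

    $$\frac{\partial P}{\partial z_{lj}} = w_{lj} \sum_{i=1}^{n} u_{il} \left[1+(x_{ij}-z_{lj})^2\right]^{-1} \left[-2(x_{ij}-z_{lj})\right] = 0.$$

    Since $w_{lj}>0$, we have

    $$\sum_{i=1}^{n} u_{il} \left[1+(x_{ij}-z_{lj})^2\right]^{-1} \left[-2(x_{ij}-z_{lj})\right] = 0,$$

    from which Equation (15) follows.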

    In the standard k-means algorithm, the EWKM algorithm, and the LAC algorithm, the center of a cluster is calculated as the average of the points in the cluster. In the LEKM algorithm, however, the center of a cluster is governed by a nonlinear equation in such a way that the center is a weighted average of the points in the cluster. In addition, if a point is far away from its center, then the point is given a low weight in the center calculation. As a result, the LEKM algorithm is less sensitive to outliers than the EWKM algorithm and the LAC algorithm. Since the LEKM algorithm is an iterative algorithm, we can in practice update the cluster centers as follows:

    $$z_{lj} = \frac{\sum_{i=1}^{n} u_{il} \left[1+(x_{ij}-z^{*}_{lj})^2\right]^{-1} x_{ij}}{\sum_{i=1}^{n} u_{il} \left[1+(x_{ij}-z^{*}_{lj})^2\right]^{-1}} \tag{16}$$

    for $l=1,2,\ldots,k$ and $j=1,2,\ldots,d$, where $Z^{*}=\{z^{*}_1,z^{*}_2,\ldots,z^{*}_k\}$ is the set of cluster centers from the previous iteration. When the algorithm converges, the cluster centers in the current iteration are the same as those from the previous iteration and Equation (16) is the same as Equation (15).
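
    The robustness of the center update to outliers can be illustrated in one dimension. The sketch below repeatedly applies the update of Equation (16) to a single attribute of a single cluster; the function name `lekm_center_1d`, the number of iterations, and the sample values are assumptions made for this example.

```python
def lekm_center_1d(values, z_start, iterations=50):
    """Repeated application of the center update of Equation (16) in one dimension."""
    z = z_start
    for _ in range(iterations):
        # Points far from the previous center receive small weights
        r = [1.0 / (1.0 + (x - z) ** 2) for x in values]
        z = sum(r_i * x for r_i, x in zip(r, values)) / sum(r)
    return z


values = [0.0, 0.1, -0.1, 100.0]       # three points near 0 and one outlier
mean = sum(values) / len(values)       # 25.0, pulled far away by the outlier
center = lekm_center_1d(values, z_start=mean)
```

    Starting from the plain average, the iteration moves the center back to the three points near zero, because the outlier at 100 receives a weight of roughly 1/10000 there.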

    To find the optimal values of $U$, $W$, and $Z$ that minimize the objective function given in Equation (10), the LEKM algorithm proceeds iteratively by updating one of $U$, $W$, and $Z$ at a time with the other two fixed. The pseudo-code of the LEKM algorithm is shown in Algorithm 1. The computational complexity of one iteration of the LEKM algorithm is $O(nkd)$. Although the runtime complexity of the LEKM algorithm is the same as those of the EWKM algorithm and the LAC algorithm, we expect the LEKM algorithm to be slower than the EWKM algorithm and the LAC algorithm as more operations are involved in the LEKM algorithm.
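
    The iterative procedure can be sketched as follows, combining Equations (11), (13), and (16). This is a hedged sketch, not a reproduction of Algorithm 1: the function name `lekm`, the optional `init` argument, the stopping rule (change of the objective below `delta`, or `n_max` iterations), and the handling of empty clusters are assumptions.

```python
import math
import random


def lekm(X, k, lam=1.0, delta=1e-6, n_max=100, seed=0, init=None):
    """LEKM-style clustering sketch; returns (labels, W, Z)."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    if init is None:
        Z = [list(X[i]) for i in rng.sample(range(n), k)]  # random initial centers
    else:
        Z = [list(z) for z in init]                        # hypothetical convenience
    W = [[1.0 / d] * d for _ in range(k)]                  # equal initial weights
    labels, prev_obj = [0] * n, float("inf")

    def dist(x, l):
        # D(x, z_l): weighted log-transformed distance plus the entropy term
        return sum(W[l][j] * (math.log(1.0 + (x[j] - Z[l][j]) ** 2)
                              + lam * math.log(W[l][j])) for j in range(d))

    for _ in range(n_max):
        # Update U (Equation (11)); ties go to the first closest cluster
        labels = [min(range(k), key=lambda l: dist(x, l)) for x in X]
        for l in range(k):
            members = [X[i] for i in range(n) if labels[i] == l]
            if not members:          # assumption: keep old center and weights
                continue
            # Update W (Equation (13)) using the centers of the previous iteration
            V = [sum(math.log(1.0 + (p[j] - Z[l][j]) ** 2) for p in members) / len(members)
                 for j in range(d)]
            shift = min(V)
            terms = [math.exp(-(v - shift) / lam) for v in V]
            total = sum(terms)
            W[l] = [t / total for t in terms]
            # Update Z (Equation (16)); the weights w_lj cancel in this step
            new_z = []
            for j in range(d):
                r = [1.0 / (1.0 + (p[j] - Z[l][j]) ** 2) for p in members]
                new_z.append(sum(r_i * p[j] for r_i, p in zip(r, members)) / sum(r))
            Z[l] = new_z
        obj = sum(dist(X[i], labels[i]) for i in range(n))
        if abs(prev_obj - obj) < delta:
            break
        prev_obj = obj
    return labels, W, Z


X = [[0.0, 0.0], [0.2, 0.1], [-0.1, 0.3], [10.0, 10.0], [10.2, 9.9], [9.8, 10.1]]
labels, W, Z = lekm(X, k=2, init=[X[0], X[3]])
```

    A very small λ could drive a weight to zero in floating point and make the logarithm of the weight fail; a production implementation would need to guard against that case.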

    The LEKM algorithm requires four parameters: $k$, $\lambda$, $\delta$, and $N_{\max}$. The parameter $k$ is the desired number of clusters. The parameter $\lambda$ controls the smoothness of the attribute weights. The larger the value of $\lambda$, the more uniform the attribute weights. The last two parameters are used to terminate the algorithm. Table 1 gives default values of some parameters.

    Table 1. Default parameter values of the LEKM algorithm.
    Parameter    Default value
    $\lambda$    1
    $\delta$     $10^{-6}$
    $N_{\max}$   100

    4. Numerical experiments

    In this section, we present numerical experiments based on both synthetic data and real data to demonstrate the performance of the LEKM algorithm. We also compare the LEKM algorithm with the EWKM algorithm and the LAC algorithm in terms of accuracy and runtime. We implemented all three algorithms in Java and used the same convergence criterion as shown in Algorithm 1.

    In our experiments, we use the corrected Rand index [8,13] to measure the accuracy of clustering results. The corrected Rand index is calculated from two partitions of the same dataset and its value ranges from -1 to 1, with 1 indicating perfect agreement between the two partitions and 0 indicating agreement by chance. In general, the higher the corrected Rand index, the better the clustering result.
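
    For reference, the corrected Rand index can be computed from two label vectors with the standard pair-counting formula. The sketch below is not the authors' Java code; the function name `corrected_rand_index` and the convention of returning 1 in the degenerate case are assumptions.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_a, labels_b):
    """Corrected (adjusted) Rand index of two partitions of the same points."""
    n = len(labels_a)
    cells = Counter(zip(labels_a, labels_b))          # contingency table entries
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)       # agreement expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)
    if maximum == expected:                           # degenerate case (assumption)
        return 1.0
    return (sum_cells - expected) / (maximum - expected)


same_partition = corrected_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # labels renamed only
crossed = corrected_rand_index([0, 0, 1, 1], [0, 1, 0, 1])         # worse than chance
```

    The index does not depend on how the clusters are named, so a clustering result can be compared with the given labels directly.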

    Since all three algorithms are k-means-type algorithms, they are sensitive to initial cluster centers [6,13]. To compare the performance of these three algorithms on the first synthetic dataset, we run these algorithms 100 times and calculate the average accuracy and runtime. In each run, we use a different seed to select random initial cluster centers. To compare the three algorithms in a consistent way, we use the same 100 seeds for all three algorithms. To test the impact of the parameters (i.e., γ in EWKM, h in LAC, and λ in LEKM), we use five different values for the parameter: 1, 2, 4, 8, and 16.


    4.1. Experiments on synthetic data

    To test the performance of the LEKM algorithm, we generated two synthetic datasets. The first synthetic dataset is a 2-dimensional dataset with two clusters and is shown in Figure 1. From the figure we see that the cluster at the top is compact but the cluster at the bottom contains several points that are far away from the cluster center. We can consider this dataset as a dataset containing noise.

    Figure 1. A 2-dimensional dataset with two clusters.

    Table 2 shows the average corrected Rand index of 100 runs of the three algorithms on the first synthetic dataset. From the table we see that the LEKM algorithm produced more accurate results than the LAC algorithm and the EWKM algorithm. The EWKM produced the least accurate results. Since the dispersion of an attribute in a cluster is normalized by the size of the cluster in the LAC and LEKM algorithms, the LAC and LEKM algorithms are less sensitive to the parameter.

    Table 2. The average accuracy of 100 runs of the three algorithms on the first synthetic dataset. The numbers in parentheses are the corresponding standard deviations over the 100 runs. The parameter refers to γ, h, and λ in EWKM, LAC, and LEKM, respectively.
    Parameter   EWKM                LAC                 LEKM
    1           0.0351 (0.0582)     0.0024 (0.0158)     0.9154 (0.2704)
    2           0.0378 (0.0556)     0.9054 (0.2322)     0.9063 (0.2827)
    4           0.012 (0.031)       0.8019 (0.2422)     0.9067 (0.2815)
    8           -0.0135 (0.0125)    0.7604 (0.2406)     0.9072 (0.2799)
    16          -0.013 (0.0134)     0.7527 (0.2501)     0.9072 (0.2799)

    Table 3 shows the confusion matrices produced by the best run of the three algorithms on the first synthetic dataset. We ran the EWKM algorithm, the LAC algorithm, and the LEKM algorithm 100 times on the first synthetic dataset with parameter 2 (i.e., γ=2 in EWKM, h=2 in LAC, and λ=2 in LEKM) and chose the best run to be the run with the lowest objective function value. From Table 3 we see that the LEKM algorithm was able to recover the two clusters from the first synthetic dataset correctly. The LAC algorithm clustered one point incorrectly. The EWKM algorithm is sensitive to noise and clustered many points incorrectly.

    Table 3. The confusion matrices of the first synthetic dataset corresponding to the runs with the lowest objective function values. The parameter used in these runs is 2. The labels "1" and "2" in the header row indicate the given clusters. The labels "C1" and "C2" in the first column indicate the found clusters. (a) EWKM. (b) LAC. (c) LEKM.
    (a) EWKM           (b) LAC            (c) LEKM
          1    2             1    2             1    2
    C2    35   25      C2    59   0       C2    60   0
    C1    25   15      C1    1    40      C1    0    40

    Table 4 shows the attribute weights of the two clusters produced by the best runs of the three algorithms. We can see from the table that the attribute weights produced by the EWKM algorithm are dominated by one attribute. The attribute weights of one cluster produced by the LAC algorithm are also affected by the noise in the cluster. The attribute weights of the clusters produced by the LEKM algorithm seem reasonable, as the two clusters are formed in the full space and approximately the same attribute weights are expected.

    Table 4. The attribute weights of the two clusters corresponding to the runs with the lowest objective function values. The parameter used in these runs is 2. The labels "C1" and "C2" in the first column indicate the found clusters. (a) EWKM. (b) LAC. (c) LEKM.
    (a) EWKM                 (b) LAC                    (c) LEKM
          Weight                   Weight                     Weight
    C1    1    3.01E-36      C1    0.8931   0.1069      C1    0.5448   0.4552
    C2    1    2.85E-51      C2    0.5057   0.4943      C2    0.5055   0.4945

    Table 5 shows the average runtime of the 100 runs of the three algorithms on the first synthetic dataset. From the table we see that the EWKM algorithm converged the fastest. The LAC algorithm and the LEKM algorithm converged in about the same time.

    Table 5. The average runtime of the three algorithms on the first synthetic dataset. The numbers in parentheses are the corresponding standard deviations over the 100 runs. The numbers are in seconds.
    Parameter   EWKM               LAC                LEKM
    1           0.0005 (0.0005)    0.0021 (0.0032)    0.0016 (0.0009)
    2           0.0002 (0.0004)    0.0018 (0.0026)    0.0013 (0.0006)
    4           0.0002 (0.0004)    0.0017 (0.0025)    0.0014 (0.0011)
    8           0.0003 (0.0004)    0.0018 (0.0026)    0.0016 (0.0017)
    16          0.0002 (0.0004)    0.0018 (0.0025)    0.0016 (0.002)

    The second synthetic dataset is a 100-dimensional dataset with four clusters. Table 6 shows the sizes and dimensions of the four clusters. This dataset was also used to test the SAP algorithm developed in [13]. Table 7 summarizes the clustering results of the three algorithms. From the table we see that the LEKM algorithm produced the most accurate results when the parameter is small. When the parameter is large, the attribute weights calculated by the LEKM algorithm become approximately the same. Since the clusters are embedded in subspaces, assigning approximately the same weight to attributes prevents the LEKM algorithm from recovering these clusters.

    Table 6. A 100-dimensional dataset with 4 subspace clusters.
    Cluster   Cluster size   Subspace dimensions
    A         500            10, 15, 70
    B         300            20, 30, 80, 85
    C         500            30, 40, 70, 90, 95
    D         700            40, 45, 50, 55, 60, 80

    Table 7. The average accuracy of 100 runs of the three algorithms on the second synthetic dataset.
    Parameter   EWKM               LAC                LEKM
    1           0.557 (0.1851)     0.5534 (0.1857)    0.9123 (0.147)
    2           0.557 (0.1851)     0.5572 (0.1883)    0.928 (0.1361)
    4           0.557 (0.1851)     0.5658 (0.1902)    0.6128 (0.1626)
    8           0.557 (0.1851)     0.574 (0.2028)     0.3197 (0.1247)
    16          0.5573 (0.1854)    0.6631 (0.2532)    0.2293 (0.0914)

    Table 8 shows the confusion matrices produced by the runs of the three algorithms with the lowest objective function value. From the table we see that only three points were clustered incorrectly by the LEKM algorithm. Many points were clustered incorrectly by the EWKM algorithm and the LAC algorithm. Figures 2, 3, and 4 plot the attribute weights of the four clusters corresponding to the confusion matrices given in Table 8. From Figures 2 and 3 we can see that the attribute weights were dominated by a single attribute. Figure 4 shows that the LEKM algorithm was able to recover all the subspace dimensions correctly.

    Table 8. Confusion matrices of the second synthetic dataset produced by the runs with the lowest objective function values. In these runs, the parameter was set to 2. (a) EWKM. (b) LAC. (c) LEKM.

    Figure 2. Attribute weights of the four clusters produced by the EWKM algorithm.
    Figure 3. Attribute weights of the four clusters produced by the LAC algorithm.
    Figure 4. Attribute weights of the four clusters produced by the LEKM algorithm.

    Table 9 shows the average runtime of 100 runs of the three algorithms on the second synthetic dataset. From the table we see that the LEKM algorithm is slower than the other two algorithms. Since the center calculation of the LEKM algorithm is more complicated than that of the EWKM algorithm and the LAC algorithm, it is expected that the LEKM algorithm is slower than the other two algorithms.

    Table 9. The average runtime of 100 runs of the three algorithms on the second synthetic dataset.
    Parameter   EWKM               LAC                LEKM
    1           0.7849 (0.4221)    1.1788 (0.763)     10.4702 (0.1906)
    2           0.7687 (0.4141)    0.8862 (0.4952)    10.3953 (0.1704)
    4           0.7619 (0.4101)    0.8412 (0.4721)    10.5236 (0.2023)
    8           0.7567 (0.4074)    0.8767 (0.4816)    10.5059 (0.2014)
    16          0.7578 (0.4112)    0.8136 (0.5069)    10.4122 (0.189)

    In summary, the test results on synthetic datasets have shown that the LEKM algorithm is able to recover clusters from noisy data and recover clusters embedded in subspaces. The test results also show that the LEKM algorithm is less sensitive to noise and parameter values than the EWKM algorithm and the LAC algorithm. However, the LEKM algorithm is in general slower than the other two algorithms due to its complex center calculation.


    4.2. Experiments on real data

    To test the algorithms on real data, we obtained two cancer gene expression datasets from [8]. The first dataset contains gene expression data of human liver cancers and the second dataset contains gene expression data of breast tumors and colon tumors. Table 10 shows the information of the two real datasets. The two datasets have known labels, which tell the type of sample of each data point. The two datasets were also used to test the SAP algorithm in [13].

    Table 10. Two real gene expression datasets.
    Dataset          Samples   Dimensions   Cluster sizes
    Chen-2002        179       85           104, 76
    Chowdary-2006    104       182          62, 42

    The datasets are available at http://bioinformatics.rutgers.edu/Static/Supplements/CompCancer/datasets.htm

    Table 11 and Table 12 summarize the average accuracy and the average runtime of 100 runs of the three algorithms on the Chen-2002 dataset, respectively. From the average corrected Rand index shown in Table 11 we see that the LEKM algorithm produced more accurate results than the EWKM algorithm and the LAC algorithm did. However, the LEKM algorithm was slower than the other two algorithms.

    Table 11. The average accuracy of 100 runs of the three algorithms on the Chen-2002 dataset.
    Parameter   EWKM               LAC                LEKM
    1           0.025 (0.0395)     0.0042 (0.0617)    0.2599 (0.2973)
    2           0.0203 (0.0343)    0.0888 (0.1903)    0.2563 (0.2868)
    4           0.0135 (0.0279)    0.041 (0.1454)     0.2743 (0.2972)
    8           0.0141 (0.0449)    0.0484 (0.1761)    0.2856 (0.2993)
    16          0.0002 (0.0416)    0.0445 (0.1726)    0.2789 (0.2984)

    Table 12. The average runtime of 100 runs of the three algorithms on the Chen-2002 dataset.
    Parameter   EWKM               LAC                LEKM
    1           0.0111 (0.0031)    0.0162 (0.0083)    0.102 (0.0297)
    2           0.0123 (0.0033)    0.0124 (0.006)     0.1035 (0.0286)
    4           0.0143 (0.006)     0.0151 (0.0105)    0.1046 (0.0316)
    8           0.0122 (0.0043)    0.0137 (0.0089)    0.1068 (0.0337)
    16          0.0144 (0.007)     0.014 (0.0091)     0.105 (0.0323)

    The average accuracy and runtime of 100 runs of the three algorithms on the Chowdary-2006 dataset are shown in Table 13 and Table 14, respectively. Table 13 shows that the LEKM algorithm again produced more accurate clustering results than the other two algorithms. When the parameter was set to 1, the LAC algorithm produced better results than the EWKM algorithm. For all other parameter values, however, the EWKM algorithm produced better results than the LAC algorithm. As shown in Table 14, the LAC algorithm and the EWKM algorithm are much faster than the LEKM algorithm.

    Table 13. The average accuracy of 100 runs of the three algorithms on the Chowdary-2006 dataset.
    Parameter   EWKM              LAC               LEKM
    1           0.3952 (0.3943)   0.5197 (0.2883)   0.5826 (0.3199)
    2           0.3819 (0.3825)   0.19 (0.2568)     0.5757 (0.3261)
    4           0.3839 (0.3677)   0.0772 (0.1016)   0.5823 (0.3221)
    8           0.4188 (0.3584)   0.0595 (0.0224)   0.5756 (0.3383)
    16          0.4994 (0.3927)   0.0625 (0.0184)   0.582 (0.3363)

    Table 14. The average runtime of 100 runs of the three algorithms on the Chowdary-2006 dataset.
    Parameter   EWKM              LAC               LEKM
    1           0.0115 (0.0048)   0.0109 (0.0042)   0.1369 (0.0756)
    2           0.011 (0.0046)    0.0156 (0.0093)   0.1446 (0.0723)
    4           0.0103 (0.0042)   0.0147 (0.0076)   0.1514 (0.0805)
    8           0.0107 (0.005)    0.0141 (0.0063)   0.1524 (0.0769)
    16          0.0113 (0.0047)   0.0138 (0.0068)   0.1542 (0.0854)

    In summary, the test results on real datasets show that the LEKM algorithm produced more accurate clustering results on average than the EWKM algorithm and the LAC algorithm did. However, the LEKM algorithm was slower than the other two algorithms.


    5. Concluding remarks

    The EWKM algorithm [21] and the LAC algorithm [9] are two soft subspace clustering algorithms that are similar to each other. In both algorithms, the attribute weights of a cluster are calculated as exponential normalizations of the negative attribute dispersions in the cluster, scaled by a parameter. Setting this parameter is a challenge when the attribute dispersions in a cluster have a large range. In this paper, we proposed the LEKM (log-transformed entropy weighting k-means) algorithm, which uses log-transformed distances in the objective function so that the attribute dispersions in a cluster are smaller than those in the EWKM algorithm and the LAC algorithm. The proposed LEKM algorithm has the following two properties: first, it allows users to choose a value for the parameter easily, because the attribute dispersions in a cluster have a small range; second, it is less sensitive to noise, because data points far away from their corresponding cluster centers are given small weights in the cluster center calculation.
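
The effect of the log transformation on the weights can be illustrated with a small sketch. It assumes that the weight of an attribute is proportional to exp(-dispersion / gamma) and that the log-transformed distance has the form ln(1 + squared difference); both forms, the helper names, and the toy cluster are assumptions made for illustration, since this section does not restate the exact objective function.

```python
import math


def exp_normalized_weights(dispersions, gamma):
    """Weights proportional to exp(-dispersion / gamma), normalized to sum to 1."""
    scaled = [-d / gamma for d in dispersions]
    shift = max(scaled)  # subtract the maximum for numerical stability
    expo = [math.exp(s - shift) for s in scaled]
    total = sum(expo)
    return [e / total for e in expo]


def attribute_dispersions(points, center, log_transform=False):
    """Per-attribute dispersion of a cluster around its center."""
    result = []
    for j in range(len(center)):
        sq = [(p[j] - center[j]) ** 2 for p in points]
        if log_transform:
            # Assumed log-transformed distance: ln(1 + squared difference).
            sq = [math.log1p(v) for v in sq]
        result.append(sum(sq))
    return result


# Hypothetical cluster: attribute 0 is tight, attribute 1 moderate, attribute 2 very spread.
points = [(1.0, 10.0, 100.0), (1.1, 12.0, 300.0), (0.9, 8.0, -200.0), (1.0, 10.0, 200.0)]
center = tuple(sum(p[j] for p in points) / len(points) for j in range(3))

raw = attribute_dispersions(points, center)
logged = attribute_dispersions(points, center, log_transform=True)
w_raw = exp_normalized_weights(raw, gamma=1.0)
w_log = exp_normalized_weights(logged, gamma=1.0)

print("raw dispersions:", raw, "weights:", w_raw)
print("log dispersions:", logged, "weights:", w_log)
```

In this toy case the raw dispersions span several orders of magnitude, so a single value of the scaling parameter cannot treat all attributes sensibly, whereas the log-transformed dispersions fall in a much narrower range.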

    We tested the performance of the LEKM algorithm and compared it with the EWKM algorithm and the LAC algorithm. The test results on both synthetic and real datasets show that the LEKM algorithm is able to outperform the EWKM algorithm and the LAC algorithm in terms of accuracy. However, one limitation of the LEKM algorithm is that it is slower than the other two algorithms, because updating the cluster centers in each iteration is more complicated in the LEKM algorithm than in the other two algorithms.
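
To give a feel for why the center update is both more robust and more costly, the sketch below updates one coordinate of a cluster center under the assumption that the log-transformed distance is ln(1 + (x - z)^2). Setting the derivative of the sum of these terms to zero gives z as a weighted mean with weights 1 / (1 + (x - z)^2), which has no closed form and is solved here by fixed-point iteration. The distance form, the function name `robust_center`, and the data are illustrative assumptions, not the update rule stated in the paper.

```python
def robust_center(values, iterations=50, tol=1e-9):
    """Fixed-point update for one coordinate of a cluster center.

    Assumes the per-point cost ln(1 + (x - z)^2); stationarity then gives
    z = sum(w_i * x_i) / sum(w_i) with w_i = 1 / (1 + (x_i - z)^2).
    """
    z = sum(values) / len(values)  # start from the ordinary mean
    for _ in range(iterations):
        # Points far from the current center receive small weights.
        weights = [1.0 / (1.0 + (x - z) ** 2) for x in values]
        z_new = sum(w * x for w, x in zip(weights, values)) / sum(weights)
        if abs(z_new - z) < tol:
            return z_new
        z = z_new
    return z


values = [1.0, 1.2, 0.8, 1.1, 0.9, 25.0]  # hypothetical data; the last value is an outlier
plain_mean = sum(values) / len(values)
center = robust_center(values)
print("ordinary mean:", plain_mean, "reweighted center:", center)
```

The ordinary mean is pulled toward the outlier, while the reweighted center stays close to the bulk of the points; the price is an inner iteration for each center update, in contrast to the single averaging step of a standard k-means center.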

    Another limitation of the LEKM algorithm is that it is sensitive to initial cluster centers. This limitation is common to most of the k-means-type algorithms, which include the EWKM algorithm and the LAC algorithm. Other efficient cluster center initialization methods [24,5,6] can be used to improve the performance of the k-means-type algorithms including the LEKM algorithm.


    Acknowledgments

    The authors would like to thank the referees for their insightful comments, which greatly improved the quality of the paper.




    [1] Drainage and Irrigation Department Malaysia (2004) Annual flood report of DID for Peninsular Malaysia. DID: Kuala Lumpur. Available from: http://www.statistics.gov.my/eng/images/stories/files/journalDOSM/V104ArticleJamaliah.pdf.
    [2] Malaysian Meteorological Department (2007) Report on Heavy Rainfall that Caused Floods in Kelantan and Terengganu. MMD: Kuala Lumpur. Available from: https://reliefweb.int/sites/reliefweb.int/files/resources/EE19DAFDE99078B649257266001FED46-Full_Report.pdf.
    [3] Adnan NA, Atkinson PM (2011) Exploring the impact of climate and land use changes on streamflow trends in a monsoon catchment. Int J Clim 31: 815-831. doi: 10.1002/joc.2112
    [4] Chan NW (1997) Institutional arrangement of flood hazard management in Malaysia: an evaluation using criteria approach. Disasters 21: 206-222. doi: 10.1111/1467-7717.00057
    [5] Hussain STPR, Ismail H (2013) Flood frequency analysis of Kelantan River Basin, Malaysia. World Appl Sci J 28: 1989-1995.
    [6] Nashwan MS, Ismail T, Ahmed K (2018) Flood susceptibility assessment in Kelantan river basin using copula. Int J Eng Technol 7: 584-590. doi: 10.14419/ijet.v7i2.10447
    [7] Zhang L, Singh VP (2006) Bivariate flood frequency analysis using copula method. J Hydrol Eng 11: 150-164. doi: 10.1061/(ASCE)1084-0699(2006)11:2(150)
    [8] Zhang L (2005) Multivariate hydrological frequency analysis and risk mapping. Doctoral dissertation, Beijing Normal University.
    [9] Reddy MJ, Ganguli P (2012) Bivariate Flood Frequency Analysis of Upper Godavari River Flows Using Archimedean Copulas. Water Resour Manage 26: 3995-4018. doi: 10.1007/s11269-012-0124-z
    [10] Bobee B, Rasmussen PF (1994) Statistical analysis of annual flood series, In: Menon J (Ed.). Trend in Hydrology, 1. Council of Scientific Research Integration, India, 117-135.
    [11] Krstanovic PF, Singh VP (1987) A multivariate stochastic flood analysis using entropy. In: Singh VP (Ed.). Hydrologic Frequency Modelling, Reidel, Dordrecht, 515-539. doi: 10.1007/978-94-009-3953-0_37
    [12] Yue S (2000) The bivariate lognormal distribution to model a multivariate flood episode. Hydrol Process 14: 2575-2588. doi: 10.1002/1099-1085(20001015)14:14<2575::AID-HYP115>3.0.CO;2-L
    [13] Sandoval CE, Raynal-Villasenor J (2008) Trivariate generalized extreme value distribution in flood frequency analysis. Hydrol Sci J 53: 550-567. doi: 10.1623/hysj.53.3.550
    [14] Song S, Singh VP (2010) Metaelliptical copulas for drought frequency analysis of periodic hydrologic data. Stoch Environ Res Risk Assess 24: 425-444. doi: 10.1007/s00477-009-0331-1
    [15] De Michele C, Salvadori G (2003) A generalized Pareto intensity-duration model of storm rainfall exploiting 2-copulas. J Geophys Res 108: 4067. doi: 10.1029/2002JD002534
    [16] Sklar A (1959) Fonctions de répartition à n dimensions et leurs marges. Publications de l'Institut de Statistique de l'Université de Paris 8: 229-231.
    [17] Nelsen RB (2006) An introduction to copulas. Springer, New York.
    [18] Salvadori G (2004) Bivariate return periods via 2-copulas. Stat Methodol 1: 129-144. doi: 10.1016/j.stamet.2004.07.002
    [19] Salvadori G, De Michele C (2004) Frequency analysis via copulas: theoretical aspects and applications to hydrological events. Water Resour Res 40: W12511. doi: 10.1029/2004WR003133
    [20] Salvadori G, De Michele C (2006) Statistical characterization of temporal structure of storms. Adv Water Resour 29: 827-842. doi: 10.1016/j.advwatres.2005.07.013
    [21] Cong RG, Brady M (2011) The interdependence between Rainfall and Temperature: copula Analyses. Sci World J 2012: 405675.
    [22] Karmakar S, Simonovic SP (2008) Bivariate flood frequency analysis. Part 1: Determination of marginal by parametric and non-parametric techniques. J Flood Risk Manag 1: 190-200.
    [23] Adamowski K (1989) A monte Carlo comparison of parametric and nonparametric estimations of flood frequencies. J Hydrol 108: 295-308. doi: 10.1016/0022-1694(89)90290-4
    [24] Silverman BW (1986) Density Estimation for Statistics and Data Analysis, 1st edition. Chapman and Hall, London.
    [25] Kim KD, Heo JH (2002) Comparative study of flood quantiles estimation by nonparametric models. J Hydrol 260: 176-193. doi: 10.1016/S0022-1694(01)00613-8
    [26] Botev ZI, Grotowski JF, Kroese DP (2010) Kernel Density Estimation via Diffusion. Ann Stat 38: 2916-2957. doi: 10.1214/10-AOS799
    [27] Dooge JCE (1986) Looking for hydrologic laws. Water Resour Res 22: 46-58. doi: 10.1029/WR022i09Sp0046S
    [28] Bardsley WE (1988) Toward a General Procedure for Analysis of Extreme Random Events in the Earth Sciences. Math Geol 20: 513-528. doi: 10.1007/BF00890334
    [29] Lall U, Moon YI, Bosworth K (1993) kernel flood frequency estimators: Bandwidth selection and kernel choice. Water Resour Res 29: 1003-1015. doi: 10.1029/92WR02466
    [30] Santhosh D, Srinivas V (2013) Bivariate frequency analysis of flood using a diffusion kernel density estimators. Water Resour Res 49: 8328-8343. doi: 10.1002/2011WR010777
    [31] Moon YI, Lall U (1994) Kernel function estimator for flood frequency analysis. Water Resour Res 30: 3095-3103. doi: 10.1029/94WR01217
    [32] Lall U (1995) Nonparametric function estimation: recent hydrologic contributions, U.S. National Republic. International Union of Geodesy and Geophysics, 1991-1994. Rev Geophys 33: 1093-1099.
    [33] Karmakar S, Simonovic SP (2009) Bivariate flood frequency analysis. Part 2: A copula-based approach with mixed marginal distributions. J Flood Risk Manag 2: 32-44.
    [34] Chen SX, Huang TM (2007) Nonparametric estimation of copula functions for dependence modelling. Can J Stat 35: 265-282. doi: 10.1002/cjs.5550350205
    [35] Latif S, Mustafa F (2020) Trivariate distribution modelling of flood characteristics using copula function-A case study for Kelantan River basin in Malaysia. AIMS Geosci 6: 92-130. doi: 10.3934/geosci.2020007
    [36] Hosking JRM, Wallis JR (1987) Parameter and quantile estimations for the generalized Pareto distributions. Technometrics 29: 339-349. doi: 10.1080/00401706.1987.10488243
    [37] Yue S, Rasmussen P (2002) Bivariate frequency analysis: discussion of some useful concepts in hydrological applications. Hydrol Process 16: 2881-2898. doi: 10.1002/hyp.1185
    [38] Rao AR, Hamed KH (2000) Flood frequency analysis. CRC Press, Boca Raton, Fla.
    [39] Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Ann Math Stat 27: 832-837. doi: 10.1214/aoms/1177728190
    [40] Scott DW (1992) Multivariate Density estimation: Theory, Practice and Visualization. Wiley, New York.
    [41] Härdle W (1991) Smoothing Technique with Implementation in S. Springer, New York.
    [42] Kim KD, Heo JH (2002) Comparative study of flood quantiles estimation by nonparametric models. J Hydrol 260: 176-193. doi: 10.1016/S0022-1694(01)00613-8
    [43] Shabri A (2002) Nonparametric Kernel Estimation of Annual Maximum Stream Flow Quantiles, Matematika, 18: 99-107.
    [44] Miladinovic B (2008) Kernel density estimation of reliability with applications to extreme value distribution. Graduate Theses and Dissertations. Available from: https://scholarcommons.usf.edu/etd/408.
    [45] Azzalini A (1981) A note on the estimation of a distribution function and quantiles by a kernel method. Biometrika 68: 326-328. doi: 10.1093/biomet/68.1.326
    [46] Shiau JT (2006) Fitting drought duration and severity with two dimensional copulas. Water Resour Manag 20: 795-815. doi: 10.1007/s11269-005-9008-9
    [47] Harrell FE, Davis CE (1982) A new distribution-free quantile estimator. Biometrika 69: 635-640. doi: 10.1093/biomet/69.3.635
    [48] Brown BM, Chen SX (1999) Beta-bernstein smoothing for regression curves with compact support. Scand J Stat 26: 47-59. doi: 10.1111/1467-9469.00136
    [49] Chen SX (2000) Beta kernel estimators for density functions. Comput Stat Data Anal 31: 131-145. doi: 10.1016/S0167-9473(99)00010-9
    [50] Bouezmarni T, Rombouts JVK (2009) Nonparametric density estimation for positive time series. Comput Stat Data Anal 54: 245-261. doi: 10.1016/j.csda.2009.08.016
    [51] Charpentier A, Fermanian JD, Scaillet O (2006) The estimation of copulas: Theory and practice. In Rank J, editor. Copulas: From theory to application in finance. London: Risk Books, 35-64.
    [52] Kim TW, Valdés JB, Yoo C (2006) Nonparametric approach for bivariate drought characterisation using Palmer drought index. J Hydrol Eng 11: 134-143. doi: 10.1061/(ASCE)1084-0699(2006)11:2(134)
    [53] Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22: 79-86. doi: 10.1214/aoms/1177729694
    [54] Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19: 716-723. doi: 10.1109/TAC.1974.1100705
    [55] Schwarz GE (1978) Estimating the dimension of a model. Ann Stat 6: 461-464. doi: 10.1214/aos/1176344136
    [56] Hannan EJ, Quinn BG (1979) The Determination of the Order of an Autoregression. J R Stat Soc Ser B 41: 190-195.
    [57] Shiau JT (2003) Return period of bivariate distributed extreme hydrological events. Stoch Environ Res Risk Assess 17: 42-57. doi: 10.1007/s00477-003-0125-9
    [58] Brunner MI, Seibert J, Favre AC (2016) Bivariate return periods and their importance for flood peak and volume estimations. WIREs Water 3: 819-833. doi: 10.1002/wat2.1173
  • © 2020 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)