Text categorization (TC) is the task of automatically classifying unlabeled natural language documents into a predefined set of semantic categories. As the first and a vital step, text representation converts the content of a textual document into a compact format so that the document can be recognized and classified by classifiers [1]. The vector space model (VSM) is the most widely used text representation model in text categorization. In VSM, a document is represented as a vector in term space, such as d = {t1, t2, …, tn}, where n is the number of features in the corpus and the value of ti represents how much the term ti contributes to the semantics of document d. The terms in VSM are extracted from the training set; they can be words, phrases, n-grams, etc. [2].
Each document in a dataset is represented as a corresponding vector in the vector space, whose elements are weighted by a term weighting method. Most studies of term weighting for TC have shown that supervised term weighting methods are superior to unsupervised ones [3]. The traditional term weighting methods for TC are usually borrowed from information retrieval (IR) and are unsupervised. The simplest is the binary representation; the most popular is tf*idf. Note that the tf factor has various variants, such as raw term frequency, log(tf), log(tf + 1), or log(tf) + 1, and the idf factor (usually computed as log(N/ni)) also has a number of variants, such as log(N/ni + 1), log(N/ni) + 1, and log(N/ni) - 1 [3]. TC is a supervised learning task because the category labels of training documents are available in advance. Generally, this known information has been used in supervised term weighting methods in the following ways. One approach is to weight terms using feature selection metrics such as chi-square, information gain, or gain ratio. The purpose of feature selection is to select the features with the most discriminative ability for the current classification problem, reduce the feature dimension of the dataset, and improve classification efficiency and accuracy. Another approach is to weight terms in interaction with a text classifier. For example, in [4], terms are weighted using an iterative approach involving the k-Nearest Neighbor classifier at each step: in each iteration, the weights are adjusted according to the classification results on the evaluation set. However, this method is usually time-consuming, especially when dealing with large-scale data problems.
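As a point of reference, the sketch below enumerates the tf and idf variants just listed; the function names and the toy counts are ours, and since the base of the logarithm is not fixed in the text, natural logarithms are used here.

```python
import math

def tf_variants(tf):
    """Common transformations of the raw term frequency tf."""
    return {
        "raw": tf,
        "log_tf": math.log(tf) if tf > 0 else 0.0,
        "log_tf_plus_1": math.log(tf + 1),
        "log_tf_then_plus_1": math.log(tf) + 1 if tf > 0 else 0.0,
    }

def idf_variants(N, n_i):
    """Common idf formulations for a term occurring in n_i of N documents."""
    return {
        "idf": math.log(N / n_i),
        "idf_plus_1_inside": math.log(N / n_i + 1),
        "idf_plus_1_outside": math.log(N / n_i) + 1,
        "idf_minus_1_outside": math.log(N / n_i) - 1,
    }

# Toy usage: a term appearing 3 times in a document, in 50 of 1000 documents.
print(tf_variants(3))
print(idf_variants(1000, 50))
```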
The difference between unsupervised and supervised term weighting methods is that supervised methods use the class information in the training set. However, most existing work does not discuss how to represent test documents under supervised term weighting [5]: test documents carry no class information, yet supervised term weighting methods require it. A test document can first be represented as |C| different vectors, one per class cj, using the estimated distribution of each class, where |C| is the number of classes; it must then be reduced to a single vector that well describes the document in the vector space.
There are two major strategies, local policy and global policy [6,7,8]. Under local policy, each test document in an independent binary classification task is represented as its own vector; the representation of a document is therefore not a single independent vector but a collection of vectors tied to the specific binary classification tasks. Global policy has been more widely used: each document receives a single, globally valid representation. In most classification tasks, each document is assigned to one category and labeled with the most similar class label. For example, with the k-Nearest Neighbor classifier, the category of an unknown document is determined by finding its nearest k documents (by Euclidean, Mahalanobis, or Manhattan distance) and then voting over their categories. Thus, most classification tasks are treated as single-label tasks and use global policy [9]. In addition, there are some models based on neural networks [10,11,12]. This study focuses on models based on global policy. Youngjoong Ko proposed Word Max (W-Max), Document Max (D-Max), and Document Two Max (D-TMax) to optimize the text representation and improve classification performance [9]; the idea behind W-Max is the same as the traditional global policy. However, none of these three methods always obtains optimal values across different datasets. These methods are discussed in Section 2.
Given the above, is there a relatively general strategy for supervised term weighting schemes and, if so, which one achieves better performance? This is the first question we wish to address in this study.
In this study, we investigate several well-known document representation strategies for supervised term weighting schemes, including the global policy, W-Max, D-Max, and D-TMax. Since we are not aware of similar work at present, this investigation is significant and valuable for document representation under supervised term weighting in automatic text categorization. At the same time, a new document representation strategy for supervised term weighting schemes is proposed. These five document representation strategies are tested on two well-known document collections, Reuters-21578 (skewed category distribution) and 20 Newsgroups (uniform category distribution). We discuss the experimental results in detail and draw conclusions from different aspects.
The remainder of this paper is organized as follows. We briefly review several document representation strategies in Section 2. Section 3 introduces our proposed document representation strategy, as well as a short discussion of the method. We show experimental results in Section 4, and finally, we draw conclusions and discuss future work in Section 5.
In text categorization, an indexing procedure that converts a raw document into a vector representation is usually necessary, since text documents cannot be directly interpreted by a classifier. Document representation is thereby one of the essential components in the construction of a classifier. There are several types of document representation methods: VSM, Latent Semantic Indexing (LSI) [13], Latent Dirichlet Allocation (LDA) [14], and representations based on word vector models such as Word2Vec [15].
LSI approximates the source space with fewer dimensions using matrix algebra techniques. Probabilistic Latent Semantic Indexing (PLSI) has a more solid statistical foundation than LSI, since it is based on the likelihood principle and defines a proper generative model of the data [16]. After PLSI, Blei et al. proposed the more widely used topic model LDA, which recognizes the latent topics of documents and uses the topic probability distribution as the representation. The word2vec model and applications by Mikolov et al. have attracted a great deal of attention. Its core idea is to obtain vector representations of words from their contexts, using one of two methods: CBOW and Skip-gram. The word vectors learned by word2vec models have been shown to carry semantic meaning, so words with similar semantics have similar vector representations. In this section, we briefly review several document representation strategies based on VSM, the most classical method.
Under local policy, each test document is represented by a separate vector within each independent binary classification task.
Global policy is defined as follows.
$TW\left( t \right) = \mathop {\max }\limits_{i = 1}^{|C|} TW\left( {t,{c_i}} \right)$ | (1) |
where TW(t) is the final weight of term t, and TW(t, ci) is the weight of term t in category ci obtained with a supervised term weighting method. In the initial representation step, a test document can be represented as |C| different vectors; after an appropriate selection policy is applied, it is reduced to one vector that well describes the document. Global policy selects, for each term, the maximum term value among all categories. Although this is effective in some cases, it cannot guarantee that the most effective term weighting vector is selected for the current test sample, because test documents do not carry the class information that supervised term weighting schemes require. Besides local policy and global policy, Youngjoong Ko proposed three further solutions to this problem, W-Max, D-Max, and D-TMax [9], which are described in the following subsections. Based on the idea of global policy, there are also combination functions such as the sum $ {f}_{sum}\left({t}_{k}\right) = {\sum }_{i = 1}^{\left|C\right|}f({t}_{k}, {c}_{i}) $, the weighted sum $ {f}_{wsum}\left({t}_{k}\right) = {\sum }_{i = 1}^{\left|C\right|}P\left({c}_{i}\right)f({t}_{k}, {c}_{i}) $, and the maximum $ {f}_{max}\left({t}_{k}\right) = {\max}_{i = 1}^{\left|C\right|}f({t}_{k}, {c}_{i}) $. These functions try to capture the intuition that the best terms for a category are the ones distributed most differently in the sets of positive and negative examples of that category [17]; a small sketch of these combination functions is given below.
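As a small illustration, the three combination functions can be written compactly with NumPy; the weight matrix and the class priors below are toy values of our own choosing, not taken from the paper.

```python
import numpy as np

# Toy per-class supervised term weights f(t_k, c_i): rows are categories
# c_1..c_|C|, columns are terms t_1..t_m (values invented for illustration).
f = np.array([[1.2, 0.4, 0.9],
              [0.3, 1.5, 0.8],
              [0.7, 0.6, 1.1]])
P_c = np.array([0.5, 0.3, 0.2])              # assumed class priors P(c_i)

f_max  = f.max(axis=0)                        # Eq (1) / global policy: max over categories
f_sum  = f.sum(axis=0)                        # f_sum(t_k)
f_wsum = (P_c[:, None] * f).sum(axis=0)       # f_wsum(t_k), weighted by P(c_i)

print(f_max)    # [1.2 1.5 1.1]
print(f_sum)    # [2.2 2.5 2.8]
print(f_wsum)   # [0.83 0.77 0.91]
```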
In W-Max, each term's value in the term weighting vector is replaced by the maximum value of the corresponding dimension's term weight over all categories. Comparing this with global policy, we can see that they share the same idea.
The sum of all term weights in each term weighting vector is first calculated and then one term weighting vector with the maximum sum value is selected as the document representation vector.
The sum of all term weights in each term weighting vector is calculated and then two term weighting vectors with the two largest sum values are selected. Then the term weighting vector is constructed by choosing the higher term weighting value from the selected two term weighting vectors for each corresponding dimension's term weight.
To discuss the differences between the above strategies, we consider the following example. Assume a training set D = {d1, d2, …, dn} with m terms, where n is the number of documents and the documents belong to |C| categories. Corresponding to the category set C = {c1, c2, …, c|C|}, there is a term weighting vector set V = {v1, v2, …, v|C|}. The matrix M (consisting of v1 to v|C|) is defined as follows.
$ M = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1j} & \cdots & t_{1m} \\ t_{21} & t_{22} & \cdots & t_{2j} & \cdots & t_{2m} \\ \vdots & \vdots & & \vdots & & \vdots \\ t_{i1} & t_{i2} & \cdots & t_{ij} & \cdots & t_{im} \\ \vdots & \vdots & & \vdots & & \vdots \\ t_{|C|1} & t_{|C|2} & \cdots & t_{|C|j} & \cdots & t_{|C|m} \end{bmatrix} $ | (2) |
In Eq (2), tij is the j-th element of the term weighting vector for category ci, calculated using a certain term weighting method (such as tf*rf, tf*or, or another scheme). Assume the term weighting vector of the test set is vd = {w1, w2, …, wm}, where wk is the k-th element of vd.
When W-Max is used, wk can be defined as follows.
$ {w_k} = \mathop {\max }\limits_{i = 1}^{|C|} \left( {{w_{ik}}} \right);k \in [1, m]$ | (3) |
In Eq (3), wik is the k-th element of the vector when category ci is used as a positive category. When D-Max is used, the sum of all term weights in each vector vi is first calculated.
$ {\mathit{sum}}_{i} = \sum _{j = 1}^{m}{t}_{ij}$ | (4) |
After sorting all sumi, D-Max selects the largest value, summax. The term weighting vector corresponding to summax is selected as vd.
When D-TMax is used, the sum of all term weights in each vector vi is first calculated using Eq (4). The two largest sums, labeled sumx and sumy, are then selected, where the subscripts x and y are the indices of different categories. The k-th element of vd is calculated as in Eq (5).
$ {w}_{k} = \max\left({t}_{xk}, {t}_{yk}\right);\; x\ne y\in \left[1, |C|\right],\; k\in [1, m] $ | (5) |
Compared with the W-Max (global policy) method, the other two methods achieved good performance in Youngjoong Ko's experiments: D-TMax performed well on Reuters-21578 (skewed category distribution), while D-Max performed well on 20 Newsgroups (uniform category distribution) [8]. From the experiments on these two well-known document collections, we can conclude that the same method may behave differently on different datasets.
Our review of document representation shows that different datasets require different methods to achieve good performance. How can we know the effect of a document representation method before choosing it? In other words, when selecting a document representation strategy for an unknown dataset, which method is a better choice? This is the second question we ask in this study; we present the answer at the end of this paper. Table 1 records the numbers of documents that contain term tk and that do not contain term tk under category ci and its complement $ \stackrel{-}{{c}_{i}} $. Usually, d >> a, b, c.
category | tk | $ \stackrel{-}{{t}_{k}} $
Positive category: ci | a | b
Negative category: $ \stackrel{-}{{c}_{i}} $ | c | d
An improper choice of document representation strategy leads to inappropriate weights being assigned to terms. A test document can first be represented as |C| different vectors using the estimated distribution of each category; for some categories, the weights assigned to terms have a negative impact on the role those terms play in classification. To illustrate this, suppose the training set is skewed, with 19 documents, 5 terms, and 5 categories. The relationship between terms, documents, and categories is shown in Table 2; each number in Table 2 is the number of times a term occurs in a document.
category | document | t1 | t2 | t3 | t4 | t5 |
C1 | d1 | 0 | 0 | 2 | 19 | 3 |
C1 | d2 | 0 | 1 | 3 | 0 | 3 |
C1 | d3 | 0 | 0 | 5 | 16 | 2 |
C1 | d4 | 4 | 0 | 1 | 15 | 2 |
C1 | d5 | 0 | 0 | 2 | 18 | 3 |
C2 | d6 | 0 | 0 | 2 | 14 | 3 |
C2 | d7 | 0 | 1 | 3 | 0 | 3 |
C2 | d8 | 0 | 0 | 5 | 13 | 2 |
C2 | d9 | 4 | 0 | 1 | 11 | 2 |
C2 | d10 | 0 | 0 | 2 | 17 | 3 |
C3 | d11 | 0 | 0 | 3 | 0 | 3 |
C3 | d12 | 0 | 0 | 1 | 1 | 3 |
C3 | d13 | 0 | 1 | 1 | 0 | 2 |
C4 | d14 | 1 | 99 | 3 | 2 | 1 |
C4 | d15 | 2 | 99 | 1 | 1 | 1 |
C4 | d16 | 1 | 99 | 1 | 2 | 1 |
C5 | d17 | 4 | 0 | 3 | 0 | 3 |
C5 | d18 | 4 | 0 | 1 | 0 | 3 |
C5 | d19 | 4 | 1 | 1 | 0 | 2 |
Among supervised term weighting schemes, we use tf*rf as an example [1]. Based on the notation in Table 1, it is defined as follows.
$ \mathit{tf}*\mathit{rf} = tf*\log\left(2+\frac{a}{\max\left(1, c\right)}\right) $ | (6) |
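A minimal sketch of Eq (6) is given below; the function names and the toy counts are ours, and we assume base-2 logarithms, which is consistent with the magnitudes reported in Table 3.

```python
import math

def rf(a, c):
    """Relevance-frequency factor of Eq (6): log2(2 + a / max(1, c)), where a (resp. c)
    is the number of positive (resp. negative) documents containing the term."""
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    """tf*rf weight of a term in a document with raw term frequency tf."""
    return tf * rf(a, c)

# Toy usage: a term occurring twice in the document, present in 4 positive
# and 7 negative training documents.
print(tf_rf(2, a=4, c=7))   # 2 * log2(2 + 4/7) is roughly 2.725
```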
The term weights of t1 to t5 for each category are shown in Table 3.
category | t1 | t2 | t3 | t4 | t5 |
C1 | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
C2 | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
C3 | 1.0000 | 1.1699 | 1.1375 | 1.0704 | 1.1375 |
C4 | 1.2630 | 1.4510 | 1.0875 | 1.1520 | 1.0875 |
C5 | 1.4594 | 1.0000 | 1.1375 | 1.0000 | 1.1375 |
In Youngjoong Ko's methods, D-TMax selects the two largest sum values. For multi-class classification problems, could selecting more categories yield better performance? We now select 1 to 5 categories to test this hypothesis. Choosing 1 or 2 categories gives "D-Max" (Document Max) and "D-TMax" (Document Two Max) [9]; following this naming, we call the 3-, 4-, and 5-category versions "D-3Max" (Document Three Max), "D-4Max" (Document Four Max), and "D-5Max" (Document Five Max). For multiclass text categorization in general, we call the scheme "D-NMax" (Document Number Max) in this study. The number of selected categories and the corresponding term weighting vectors are shown in Table 4.
method | t1 | t2 | t3 | t4 | t5 |
D-Max | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
D-TMax | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
D-3Max | 1.2630 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
D-4Max | 1.4594 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
D-5Max | 1.4594 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
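For concreteness, the short script below rebuilds Table 4 from the per-category weights in Table 3; the NumPy formulation and the function name d_nmax are ours, but the input numbers are copied verbatim from Table 3.

```python
import numpy as np

# Per-category tf*rf weights from Table 3 (rows C1..C5, columns t1..t5).
M = np.array([
    [1.1155, 1.1699, 1.2538, 1.3626, 1.2538],   # C1
    [1.1155, 1.1699, 1.2538, 1.3626, 1.2538],   # C2
    [1.0000, 1.1699, 1.1375, 1.0704, 1.1375],   # C3
    [1.2630, 1.4510, 1.0875, 1.1520, 1.0875],   # C4
    [1.4594, 1.0000, 1.1375, 1.0000, 1.1375],   # C5
])

def d_nmax(M, n):
    """D-NMax: keep the n rows with the largest row sums (Eq (4)),
    then take the element-wise maximum over those rows."""
    order = np.argsort(-M.sum(axis=1))    # categories sorted by descending sum
    return M[order[:n]].max(axis=0)

for n in range(1, M.shape[0] + 1):
    print(f"D-{n}Max:", d_nmax(M, n))     # reproduces the rows of Table 4
```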
According to the per-category frequencies in Table 2, t3 and t5 are evenly distributed across the categories and have no ability to distinguish them. The distributions of t1, t2, and t4 are more discriminative, especially t2, which is heavily concentrated in C4; its weight should be the highest among the five terms. Although t1 appears mainly in C5, it also appears in individual documents of C1, C2, and C4, so it has little ability to distinguish categories. As for t4, its discriminative power is not specific to a single category: it can distinguish C1 and C2 from the other categories, but it cannot tell whether a document belongs to C1 or C2. The results in bold in Table 4 violate our intuition that t1, t2, and t4 should receive large weights and that the weight of t1 should be relatively small compared with t2 and t4, since t1 appears with a low frequency in documents compared with t2 and t4. Another unreasonable observation is that t1 occurs in some documents of categories 1 and 2 (d4 and d9) with the same frequency as in the documents of category 5. We can thus see that the results of D-Max, D-TMax, and D-5Max (W-Max) cannot boost the performance of text categorization, whereas the results of D-3Max are consistent with our intuition that t1, t2, and t4 should receive large weights and that the weight of t1 should be relatively small. To overcome the shortcomings of Youngjoong Ko's methods, in this section we explain our proposed DRGP method, which uses a traversal procedure to appropriately weigh the contribution of each term. DRGP selects the appropriate value of N (in D-NMax) to enhance the performance of text categorization. While our previous work in [18] provided experimental evidence for the effectiveness of DRGP under particular experimental circumstances, here we formally address how and why this new method is proposed, using analytical explanation and empirical observation. The results in Section 4 show that the proposed strategy significantly outperforms the other methods.
In the DRGP method, we traverse the term weighting vectors generated by each class and compare their weighting effects on the training set; the term weighting vector that produces the best effect on the training set is selected as the term weighting vector for the test set. The idea is mainly inspired by Youngjoong Ko and by Miao and Kamel [19], who proposed a pairwise optimized Rocchio algorithm that dynamically adjusts the prototype position between pairs of categories on the training set and records the best prototype position for the test set. The main process of DRGP is summarized in Algorithm 1.
Algorithm 1: document representation strategy based on global policy (DRGP)
Input:
  fea: feature matrix of the training set
  gnd: vector of labels for the documents in the training set
Output:
  selectedC: the most appropriate N value (in D-NMax) for the current dataset
Local variables:
  |C|: total number of categories; M: total number of features;
  termWeightingVec1: the set of |C| original term weighting vectors;
  termWeightingVec1i: i-th original term weighting vector;
  termWeightingVec2i: i-th reconstructed term weighting vector;
  sumVeci: sum of all term weights in the i-th term weighting vector;
  sortSum: list of the sum values sorted in descending order;
  weightedFeai: fea weighted by termWeightingVec2i;
  MicroF1i: result of 10-fold cross validation on weightedFeai
begin
1: apply the supervised term weighting method to fea and obtain termWeightingVec1;
2: for i = 1 to |C|
3:   for j = 1 to M
4:     compute sumVeci for termWeightingVec1i;
5:   end for
6: end for
7: sort all sumVeci and obtain sortSum;
8: for i = 1 to |C|
9:   for j = 1 to M
10:    for k = 1 to i
11:      tempC = sortSum[k];
12:      construct termWeightingVec2i as follows: take the j-th dimension of each of the selected k term weighting vectors, i.e., termWeightingVec1[tempC][j];
13:      set the j-th value of termWeightingVec2i to the maximum over all termWeightingVec1[tempC][j];
14:    end for
15:  end for
16: end for
17: for i = 1 to |C|
18:   compute weightedFeai;
19: end for
20: for i = 1 to |C|
21:   compute MicroF1i;
22: end for
23: record the i corresponding to the maximum MicroF1i and assign it to selectedC;
end
After Algorithm 1 has been executed, we obtain selectedC, i.e., the value of N in D-NMax. We can then use Eq (7) to calculate the term weighting vector of the test set. The sum of all term weights in each vector vi is first calculated using Eq (4); the N largest sums, labeled sum1, sum2, …, sumN, are then selected, where the subscripts 1, 2, …, N are the indices of different categories. The k-th element of vd is calculated as in Eq (7).
$ {w}_{k} = \max({t}_{1k}, {t}_{2k}, \dots, {t}_{ik}, \dots, {t}_{Nk});\; i\in [1, N],\; k\in [1, m] $ | (7) |
In Algorithm 1, steps 1 to 6 apply the term weighting method to the original dataset to obtain the |C| original term weighting vectors and sum each of them to obtain sumVeci. Step 7 sorts all sumVeci to obtain sortSum, and steps 8 to 16 reconstruct |C| term weighting vectors as follows: the first 1 to |C| vectors of sortSum are selected in turn, and one term weighting vector is constructed each time, whose j-th element is the maximum of the j-th dimensions of the currently selected k term weighting vectors. Steps 17 to 19 use each newly constructed vector termWeightingVec2i to weight the fea matrix, yielding the corresponding weightedFeai. Steps 20 to 22 evaluate each weightedFeai by ten-fold cross-validation and record the corresponding MicroF1i. Step 23 records the maximum MicroF1 and the index i of the corresponding weight vector as selectedC. MicroF1 is a popular performance measure in text categorization; its formula is presented in Section 4.
However, the algorithm has a shortcoming: its high time complexity. The most expensive part is calculating MicroF1i (steps 20 to 22), which uses ten-fold cross-validation. According to Lin's 2006 Machine Learning Summer School lecture, the time complexity of the support vector machine algorithm is O(N3), so the time complexity of this algorithm can be estimated as O(|C|*N3). Compared with algorithms that trade time for space, the DRGP algorithm trades time for precision. It is therefore better suited to classification systems that use a fixed training set, i.e., systems where the optimal model is found once on the training set and then applied for a long time without changing the parameters.
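The following sketch outlines Algorithm 1 in Python with scikit-learn. It assumes the per-category term weighting vectors (termWeightingVec1) have already been computed by a supervised scheme such as tf*rf, and that fea is a dense NumPy matrix; the choice of LinearSVC, the fold count, and the function name are our own stand-ins rather than part of the original LIBSVM-based implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def drgp_select_n(fea, gnd, class_term_weights, folds=10):
    """Pick the D-NMax parameter N by cross-validated MicroF1 on the training set.

    fea:  (n_docs, m) raw term-frequency matrix of the training set (dense array)
    gnd:  (n_docs,)   class labels
    class_term_weights: (|C|, m) per-category term weighting vectors (termWeightingVec1)
    """
    order = np.argsort(-class_term_weights.sum(axis=1))     # steps 2-7: sort by row sums
    best_n, best_f1 = 1, -1.0
    for n in range(1, class_term_weights.shape[0] + 1):      # steps 8-22: try every N
        w = class_term_weights[order[:n]].max(axis=0)        # D-NMax vector (termWeightingVec2)
        weighted_fea = fea * w                                # steps 17-19: weight the training matrix
        f1 = cross_val_score(LinearSVC(), weighted_fea, gnd,
                             cv=folds, scoring="f1_micro").mean()
        if f1 > best_f1:                                      # step 23: keep the best N
            best_n, best_f1 = n, f1
    return best_n

# Hypothetical usage (arrays not defined here):
# selectedC = drgp_select_n(fea, gnd, termWeightingVec1)
```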
The 20 Newsgroups corpus is a commonly used benchmark dataset in TC [20]. The corpus contains 20,000 newsgroup documents distributed nearly uniformly across 20 classes. In this study, we use the version of 20 Newsgroups sorted by date. After removing duplicates and headers, the remaining 18,846 documents are partitioned into 11,314 (about 60 percent) training documents and 7532 (about 40 percent) testing documents. After preprocessing, there are 26,214 distinct words in this dataset.
The Reuters-21578 corpus is used in many experiments and contains 21,578 documents in 135 categories [21,22]. We use its ModApte version, which has 5946 training documents and 2347 testing documents. In this study, we choose the top 10 largest categories, which contain 5228 training documents and 2057 testing documents. After preprocessing, the resulting vocabulary has 18,221 distinct words. Compared with 20 Newsgroups, it is a skewed dataset: 80 percent of the categories contain less than 7.5 percent of the instances.
The statistics of datasets are listed in Table 5.
Datasets | # of documents | distinct words | # of classes | # of training | # of testing |
20 Newsgroups | 18,846 | 26,214 | 20 | 11,314 | 7532 |
Reuters21578 | 7285 | 18,221 | 10 | 5228 | 2057 |
Several document representation strategies, listed in Table 6, are compared in this study. Besides D-Max, D-TMax, and W-Max, we also investigate D-NMax in the experiments, listing the results for every number of selected categories from 1 to the maximum, so that the effect of changing the number of selected categories can be seen.
Denoted by | Description |
D-Max | The sum of all term weights in each term weighting vector is first calculated and then one term weighting vector with the maximum sum value is selected as the document representation vector. |
D-TMax | The sum of all term weights in each term weighting vector is calculated and then two term weighting vectors with the two largest sum values are selected. Then the term weighting vector is constructed by choosing the higher term weighting value from the selected two term weighting vectors for each corresponding dimension's term weight. |
D-NMax | The sum of all term weights in each term weighting vector is calculated and then N term weighting vectors with the N largest sum values are selected. Then the term weighting vector is constructed by choosing the highest term weighting value from the selected N term weighting vectors for each corresponding dimension's term weight. |
W-Max (global policy) | Each term's value of the term weighting vector will be replaced by the maximum value of the corresponding dimension's term weight in all categories.
To evaluate the classification performance of the proposed method, we choose the well-established Support Vector Machine (SVM) learning algorithm [23,24]. Although other algorithms such as k-Nearest Neighbor, Decision Tree, and Naive Bayes are also widely used, they are not included because they cannot exploit real-valued term weights beyond the binary representation (see an exception in [25]). Finally, the SVM learning algorithm scales to large classification problems with thousands of features and examples.
SVM is a relatively efficient machine learning algorithm with good performance, based on the structural risk minimization principle from computational learning theory [26]. It can handle large-scale, high-dimensional datasets with high classification accuracy. According to the kernel function, SVMs can be divided into nonlinear (e.g., polynomial, sigmoid, radial basis function) and linear methods. Leopold and Kindermann pointed out that term weighting schemes dominate the performance of SVM classifiers rather than the kernel functions [27]. Moreover, the literature has shown that the linear SVM is superior to the nonlinear SVM [28]. We therefore select the linear kernel in this study and set the other SVM parameters to their default values. The SVM software used is LIBSVM-3.14 [29].
The standard measures for determining the performance of a classification task are precision and recall [30,31]. However, it is well known that high precision may come with low recall, so considering precision and recall separately is not effective [1]. The widely used F1 measure combines precision and recall. For a given category i, precision, recall, and the F1 measure are defined as follows.
$ {\mathit{Precision}}_{i} = {TP}_{i}/({TP}_{i}+{FP}_{i}) $ | (8) |
$ {\mathit{Recall}}_{i} = {TP}_{i}/({TP}_{i}+{FN}_{i}) $ | (9) |
$ {F}_{1}^{i} = \left(2*{Precision}_{i}*{Recall}_{i}\right)/{(Precision}_{i}+{Recall}_{i}) $ | (10) |
where $ {TP}_{i} $ is the number of documents assigned correctly to class i, $ {FP}_{i} $ is the number of documents that do not belong to class i but are assigned to class i. $ {FN}_{i} $ is the number of documents that actually belong to class i but are not assigned to the class i.
The F1 measure is estimated by MicroF1 and MacroF1 [32]. In this study, MicroF1 and MacroF1 are employed to measure the performance of the proposed method. They are computed as in Eqs (11) and (12).
$ {\mathit{MicroF}}_{1} = \left(2*Precision*Recall\right)/(Precision+Recall) $ | (11) |
$ {MacroF}_{1} = \frac{1}{m}*\sum _{i = 1}^{m}{F}_{1}^{i} $ | (12) |
In Eq (12), m is the number of categories. MicroF1 assigns equal weight to each document and can be considered an average over all document/category pairs, with the Precision and Recall in Eq (11) computed from the counts pooled over all categories. MacroF1 assigns equal weight to each category and is therefore influenced by the results on rare categories.
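As a minimal sketch of Eqs (8) to (12), the following function computes both averages from per-class counts; the function name and the counts in the usage line are our own, invented for illustration.

```python
import numpy as np

def micro_macro_f1(tp, fp, fn):
    """Micro- and macro-averaged F1 from per-class counts (Eqs (8)-(12)).
    tp, fp, fn are sequences with one entry per class."""
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # Per-class precision, recall and F1 (Eqs (8)-(10)).
    prec_i = tp / (tp + fp)
    rec_i = tp / (tp + fn)
    f1_i = 2 * prec_i * rec_i / (prec_i + rec_i)
    # Micro-averaging pools the counts over all classes before computing P and R.
    P = tp.sum() / (tp.sum() + fp.sum())
    R = tp.sum() / (tp.sum() + fn.sum())
    micro_f1 = 2 * P * R / (P + R)
    macro_f1 = f1_i.mean()          # Eq (12): unweighted mean over classes
    return micro_f1, macro_f1

# Toy usage with three classes.
print(micro_macro_f1(tp=[50, 30, 10], fp=[5, 10, 2], fn=[8, 6, 12]))
```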
To verify the effectiveness of the proposed method, besides the tf*rf scheme mentioned above, we also include tf*or, which has been widely used in previous studies [33]. Its formula is
$ tf*or = tf*\mathrm{log}\left(\frac{a*d}{b*c}\right) $ | (13) |
The main purpose of the experiments is to address the two questions raised above, i.e., to explore which document representation strategy is superior for supervised term weighting schemes and to find a way to choose a document representation strategy in advance. To this end, we compare the methods on two popular benchmark corpora, 20 Newsgroups and Reuters-21578, using SVM, in terms of the MicroF1 and MacroF1 measures. In addition to the results of DRGP, the results obtained when selecting other numbers of categories are also listed. The experimental results and discussion on these two corpora are given in Sections 4.5.1-4.5.3.
The returned value of selectedC is 20 when DRGP is used on 20 Newsgroups weighted with the tf*or term weighting method. Eq (7) can then be used to calculate the term weighting vector of the test set. Figure 1 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*or and the SVM classifier with a linear kernel on 20 Newsgroups. The best MicroF1 (0.7884) and MacroF1 (0.7837) are achieved by DRGP and W-Max. D-Max obtains 0.5539 (MicroF1) and 0.5521 (MacroF1), and D-TMax obtains 0.6036 (MicroF1) and 0.6027 (MacroF1). The results and trends of MicroF1 and MacroF1 on 20 Newsgroups are essentially the same, mainly because 20 Newsgroups is a balanced dataset.
The returned value of selectedC is 20 when DRGP is used on 20 Newsgroups weighted with the tf*rf term weighting method. Eq (7) can then be used to calculate the term weighting vector of the test set. Figure 2 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*rf and the SVM classifier with a linear kernel on 20 Newsgroups. The best MicroF1 (0.7958) and MacroF1 (0.7909) are achieved by DRGP and W-Max. D-Max obtains 0.7562 (MicroF1) and 0.7502 (MacroF1), and D-TMax obtains 0.7592 (MicroF1) and 0.7529 (MacroF1).
The returned value of selectedC is 6 when DRGP is used on Reuters-21578 weighted with the tf*or term weighting method; in other words, the most appropriate N value for this dataset is 6. Eq (7) can then be used to calculate the term weighting vector of the test set: the sum of all term weights in each term weighting vector is calculated, the six vectors corresponding to the six largest sums are selected, and the final vector is constructed by choosing, for each dimension, the highest value among the selected six vectors. Figure 3 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*or and the SVM classifier with a linear kernel on Reuters-21578. The best MicroF1 (0.9203) and MacroF1 (0.8919) are achieved by DRGP. D-Max obtains 0.8769 (MicroF1) and 0.8284 (MacroF1), D-TMax obtains 0.8697 (MicroF1) and 0.825 (MacroF1), and W-Max obtains 0.8898 (MicroF1) and 0.882 (MacroF1).
The returned value of selectedC is 6 when DRGP is used on Reuters-21578 weighted with the tf*rf term weighting method. Eq (7) can then be used to calculate the term weighting vector of the test set. Figure 4 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*rf and the SVM classifier with a linear kernel on Reuters-21578. The best MicroF1 (0.9283) and MacroF1 (0.9086) are achieved by DRGP. D-Max obtains 0.9258 (MicroF1) and 0.9077 (MacroF1), D-TMax obtains 0.9258 (MicroF1) and 0.8966 (MacroF1), and W-Max obtains 0.9269 (MicroF1) and 0.9049 (MacroF1).
The above results show that DRGP and W-Max achieve the best performance on 20 Newsgroups, while DRGP achieves the best performance on Reuters-21578. In addition, the results and trends of MicroF1 and MacroF1 on 20 Newsgroups are basically consistent, which is natural because 20 Newsgroups is balanced while Reuters-21578 is skewed. Moreover, many documents in Reuters-21578 carry two or more labels; for example, if a document with two labels is represented using the class distribution of only one of them, a classifier may have difficulty assigning it to the other class.
On skewed datasets, the D-Max, D-TMax, and W-Max (global policy) representation strategies are therefore not suitable. Note that although our experiments use the same corpora and the same evaluation measures as Youngjoong Ko's [8], there are minor differences in data preparation, such as the stemming and stop word lists used for text preprocessing and the feature selection measures. On Reuters-21578, the experimental results show the same phenomenon: the results obtained by D-TMax are higher than those of D-Max, and the results obtained by W-Max are the worst.
From the experimental results, we observe that on the balanced dataset (20 Newsgroups), MicroF1 and MacroF1 increase as more categories are selected, whereas on the skewed dataset (Reuters-21578) they first increase and then decrease. Therefore, to obtain better results, it is necessary to select the appropriate method according to the characteristics of the dataset. The proposed DRGP method achieves the best performance on both balanced and skewed datasets; on the skewed dataset in particular, the results are greatly improved compared with those of the traditional methods. Note that here we only use the tf*or and tf*rf weighting methods and the SVM classifier; our method can be extended to other supervised term weighting methods, and other classifiers can be used for cross validation or classification.
Document representation is one of the most important parts of constructing a text classifier. For supervised term weighting methods, besides the global policy and local policy, there are strategies such as D-Max, D-TMax, and W-Max. Faced with so many strategies, it is not obvious how to choose one to achieve the best results; in other words, the value of N in D-NMax cannot be determined in advance.
In this study, we investigated and addressed this problem. We found that the representation method should be chosen according to the characteristics of the corpus to achieve better classification performance. Compared with methods such as D-Max, D-TMax, and W-Max, the method proposed in this study is not only a weighting method but also a selection strategy: it selects the appropriate value of N (in D-NMax) according to the distribution of each category in the dataset. Testing the proposed method with two representative supervised term weighting methods (tf*rf and tf*or) on two datasets (20 Newsgroups and Reuters-21578) shows that it is effective on both balanced and unbalanced datasets.
Based on the original document representation methods, the proposed method introduces the idea of traversal: through cross validation, it first finds the optimal document representation model on the training set and then applies the selected model to the test set. Because different classifiers have different working mechanisms, the choice of classifier may affect the final results, but the proposed method has no special requirements for the classifier; in this study we use SVM. The method can be extended to other supervised feature weighting methods, and other classifiers can be used for cross validation or classification. Compared with the original document representation methods, the proposed method incurs an additional computational cost (cross validation on the training set). In future research, we will study new optimization methods to reduce this cost.
At the end of this study, we answer the questions we raised in the beginning of the paper as conclusions:
1) Is there a relatively general strategy for supervised term weighting schemes, and, if yes, which one can achieve better performance?
Through a series of evaluations in text categorization on both a balanced dataset and a skewed dataset, we find that W-Max, D-Max, and D-TMax perform differently on different types of datasets. The proposed DRGP and W-Max achieve the best performance on 20 Newsgroups, and the proposed DRGP achieves the best performance on Reuters-21578.
2) How can we know the effect of document representation method before we choose it? In other words, when selecting a document representation strategy for an unknown data set, which method is the best choice?
In this study, we propose the DRGP method, which can select an appropriate representation strategy for an unknown dataset. Whether the dataset is uniform or not, it traverses the constructed optimization function to find the best document representation strategy on the training set and then applies it to the test set. DRGP exhibits stable and consistent improvement over most of the previous document representations considered in the experiments.
This work was supported by the National Natural Science Foundation of China under Grant 61977015 and the Jilin Province Development and Reform Commission Project of China under Grant 2020C017-3. We would like to thank these organizations for their support.
All authors declare no conflicts of interest in this paper.