Text categorization (TC) is the task of automatically classifying unlabeled natural language documents into a predefined set of semantic categories. As the first and a vital step, text representation converts the content of a textual document into a compact format so that the document can be recognized and classified by classifiers [1]. The vector space model (VSM) is the most widely used text representation model in text categorization. In VSM, a document is represented as a vector in term space, such as d = {t1, t2, …, tn}, where n is the number of features in the corpus and the value of ti represents how much the term ti contributes to the semantics of document d. The terms in VSM are extracted from the training set; they can be words, phrases, n-grams, etc. [2].
Each document in a dataset is represented as a corresponding vector in the vector space, whose elements are weighted by a term weighting method. Most studies of term weighting for TC have shown that supervised term weighting methods are superior to unsupervised ones [3]. The traditional term weighting methods for TC are usually borrowed from information retrieval (IR) and are unsupervised. The simplest is the binary representation; the most popular is tf*idf. Note that the tf factor has various variants, such as raw term frequency, log(tf), log(tf + 1), or log(tf) + 1, and the idf factor (usually computed as log(N/ni)) also has a number of variants, such as log(N/ni + 1), log(N/ni) + 1, and log(N/ni) - 1 [3]. TC is a supervised learning task because the category labels of training documents are available in advance. Generally, this known information has been used in supervised term weighting methods in the following ways. One approach is to weight terms using feature selection metrics such as chi-square, information gain, or gain ratio. The purpose of feature selection is to select the features with the most discriminative ability for the current classification problem, reduce the feature dimension of the dataset, and improve classification efficiency and accuracy. Another approach is to weight terms in interaction with a text classifier. For example, in [4], terms are weighted using an iterative approach involving the k-Nearest Neighbor classifier at each step: in each iteration, the weights are adjusted according to the classification results on the evaluation set. However, this method is usually time-consuming, especially when dealing with large-scale data problems.
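As a point of reference, the sketch below enumerates the tf and idf variants just listed; the function names and the toy counts are ours, and since the base of the logarithm is not fixed in the text, natural logarithms are used here.

```python
import math

def tf_variants(tf):
    """Common transformations of the raw term frequency tf."""
    return {
        "raw": tf,
        "log_tf": math.log(tf) if tf > 0 else 0.0,
        "log_tf_plus_1": math.log(tf + 1),
        "log_tf_then_plus_1": math.log(tf) + 1 if tf > 0 else 0.0,
    }

def idf_variants(N, n_i):
    """Common idf formulations for a term occurring in n_i of N documents."""
    return {
        "idf": math.log(N / n_i),
        "idf_plus_1_inside": math.log(N / n_i + 1),
        "idf_plus_1_outside": math.log(N / n_i) + 1,
        "idf_minus_1_outside": math.log(N / n_i) - 1,
    }

# Toy usage: a term appearing 3 times in a document, in 50 of 1000 documents.
print(tf_variants(3))
print(idf_variants(1000, 50))
```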
The difference between unsupervised and supervised term weighting methods is that supervised methods use the class information in the training set. However, most existing work does not discuss how to represent test documents under supervised term weighting [5]: test documents carry no class information, yet supervised term weighting methods require it. A test document can first be represented as |C| different vectors, one per class cj, using the estimated distribution of each class, where |C| is the number of classes; it must then be reduced to a single vector that well describes the document in the vector space.
There are two major strategies, local policy and global policy [6,7,8]. Under local policy, each test document in an independent binary classification task is represented as its own vector; the representation of a document is therefore not a single independent vector but a collection of vectors tied to the specific binary classification tasks. Global policy has been more widely used: each document receives a single, globally valid representation. In most classification tasks, each document is assigned to one category and labeled with the most similar class label. For example, with the k-Nearest Neighbor classifier, the category of an unknown document is determined by finding its nearest k documents (by Euclidean, Mahalanobis, or Manhattan distance) and then voting over their categories. Thus, most classification tasks are treated as single-label tasks and use global policy [9]. In addition, there are some models based on neural networks [10,11,12]. This study focuses on models based on global policy. Youngjoong Ko proposed Word Max (W-Max), Document Max (D-Max), and Document Two Max (D-TMax) to optimize the text representation and improve classification performance [9]; the idea behind W-Max is the same as the traditional global policy. However, none of these three methods always obtains optimal values across different datasets. These methods are discussed in Section 2.
Given the above, is there a relatively general strategy for supervised term weighting schemes and, if so, which one achieves better performance? This is the first question we wish to address in this study.
In this study, we investigate several well-known document representation strategies for supervised term weighting schemes, including the global policy, W-Max, D-Max, and D-TMax. Since we are not aware of similar work at present, this investigation is significant and valuable for document representation under supervised term weighting in automatic text categorization. At the same time, a new document representation strategy for supervised term weighting schemes is proposed. These five document representation strategies are tested on two well-known document collections, Reuters-21578 (skewed category distribution) and 20 Newsgroups (uniform category distribution). We discuss the experimental results in detail and draw conclusions from different aspects.
The remainder of this paper is organized as follows. We briefly review several document representation strategies in Section 2. Section 3 introduces our proposed document representation strategy, as well as a short discussion of the method. We show experimental results in Section 4, and finally, we draw conclusions and discuss future work in Section 5.
In text categorization, an indexing procedure that converts a raw document into a vector representation is usually necessary, since text documents cannot be directly interpreted by a classifier. Document representation is thereby one of the essential components in the construction of a classifier. There are several types of document representation methods: VSM, Latent Semantic Indexing (LSI) [13], Latent Dirichlet Allocation (LDA) [14], and representations based on word vector models such as Word2Vec [15].
LSI approximates the source space with fewer dimensions using matrix algebra techniques. Probabilistic Latent Semantic Indexing (PLSI) has a more solid statistical foundation than LSI, since it is based on the likelihood principle and defines a proper generative model of the data [16]. After PLSI, Blei et al. proposed the more widely used topic model LDA, which recognizes the latent topics of documents and uses the topic probability distribution as the representation. The word2vec model and applications by Mikolov et al. have attracted a great deal of attention. Its core idea is to obtain vector representations of words from their contexts, using one of two methods: CBOW and Skip-gram. The word vectors learned by word2vec models have been shown to carry semantic meaning, so words with similar semantics have similar vector representations. In this section, we briefly review several document representation strategies based on VSM, the most classical method.
Under local policy, each test document is represented by a separate vector within each independent binary classification task.
Global policy is defined as follows.
$TW\left( t \right) = \mathop {\max }\limits_{i = 1}^{|C|} TW\left( {t,{c_i}} \right)$ | (1) |
where TW(t) is the final weight of term t, and TW(t, ci) is the weight of term t in category ci obtained with a supervised term weighting method. In the initial representation step, a test document can be represented as |C| different vectors; after an appropriate selection policy is applied, it is reduced to one vector that well describes the document. Global policy selects, for each term, the maximum term value among all categories. Although this is effective in some cases, it cannot guarantee that the most effective term weighting vector is selected for the current test sample, because test documents do not carry the class information that supervised term weighting schemes require. Besides local policy and global policy, Youngjoong Ko proposed three further solutions to this problem, W-Max, D-Max, and D-TMax [9], which are described in the following subsections. Based on the idea of global policy, there are also combination functions such as the sum $ {f}_{sum}\left({t}_{k}\right) = {\sum }_{i = 1}^{\left|C\right|}f({t}_{k}, {c}_{i}) $, the weighted sum $ {f}_{wsum}\left({t}_{k}\right) = {\sum }_{i = 1}^{\left|C\right|}P\left({c}_{i}\right)f({t}_{k}, {c}_{i}) $, and the maximum $ {f}_{max}\left({t}_{k}\right) = {\max}_{i = 1}^{\left|C\right|}f({t}_{k}, {c}_{i}) $. These functions try to capture the intuition that the best terms for a category are the ones distributed most differently in the sets of positive and negative examples of that category [17]; a small sketch of these combination functions is given below.
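As a small illustration, the three combination functions can be written compactly with NumPy; the weight matrix and the class priors below are toy values of our own choosing, not taken from the paper.

```python
import numpy as np

# Toy per-class supervised term weights f(t_k, c_i): rows are categories
# c_1..c_|C|, columns are terms t_1..t_m (values invented for illustration).
f = np.array([[1.2, 0.4, 0.9],
              [0.3, 1.5, 0.8],
              [0.7, 0.6, 1.1]])
P_c = np.array([0.5, 0.3, 0.2])              # assumed class priors P(c_i)

f_max  = f.max(axis=0)                        # Eq (1) / global policy: max over categories
f_sum  = f.sum(axis=0)                        # f_sum(t_k)
f_wsum = (P_c[:, None] * f).sum(axis=0)       # f_wsum(t_k), weighted by P(c_i)

print(f_max)    # [1.2 1.5 1.1]
print(f_sum)    # [2.2 2.5 2.8]
print(f_wsum)   # [0.83 0.77 0.91]
```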
In W-Max, each term's value in the term weighting vector is replaced by the maximum value of the corresponding dimension's term weight over all categories. Comparing this with global policy, we can see that they share the same idea.
The sum of all term weights in each term weighting vector is first calculated and then one term weighting vector with the maximum sum value is selected as the document representation vector.
The sum of all term weights in each term weighting vector is calculated and then two term weighting vectors with the two largest sum values are selected. Then the term weighting vector is constructed by choosing the higher term weighting value from the selected two term weighting vectors for each corresponding dimension's term weight.
To discuss the differences between the above strategies, we consider the following example. Assume a training set D = {d1, d2, …, dn} with m terms, where n is the number of documents and the documents belong to |C| categories. Corresponding to the category set C = {c1, c2, …, c|C|}, there is a term weighting vector set V = {v1, v2, …, v|C|}. The matrix M (consisting of v1 to v|C|) is defined as follows.
$ M = \begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1j} & \cdots & t_{1m} \\ t_{21} & t_{22} & \cdots & t_{2j} & \cdots & t_{2m} \\ \vdots & \vdots & & \vdots & & \vdots \\ t_{i1} & t_{i2} & \cdots & t_{ij} & \cdots & t_{im} \\ \vdots & \vdots & & \vdots & & \vdots \\ t_{|C|1} & t_{|C|2} & \cdots & t_{|C|j} & \cdots & t_{|C|m} \end{bmatrix} $ | (2) |
In Eq (2), tij is the j-th element of the term weighting vector for category ci, calculated using a certain term weighting method (such as tf*rf, tf*or, or another scheme). Assume the term weighting vector of the test set is vd = {w1, w2, …, wm}, where wk is the k-th element of vd.
When W-Max is used, wk can be defined as follows.
$ {w_k} = \mathop {\max }\limits_{i = 1}^{|C|} \left( {{w_{ik}}} \right);k \in [1, m]$ | (3) |
In Eq (3), wik is the k-th element of the vector when category ci is used as a positive category. When D-Max is used, the sum of all term weights in each vector vi is first calculated.
$ {\mathit{sum}}_{i} = \sum _{j = 1}^{m}{t}_{ij}$ | (4) |
After sorting all sumi, D-Max selects the largest value, summax. The term weighting vector corresponding to summax is selected as vd.
When D-TMax is used, the sum of all term weights in each vector vi is first calculated using Eq (4). The two largest sums, labeled sumx and sumy, are then selected, where the subscripts x and y are the indices of different categories. The k-th element of vd is calculated as in Eq (5).
$ {w}_{k} = \max\left({t}_{xk}, {t}_{yk}\right);\; x\ne y\in \left[1, |C|\right],\; k\in [1, m] $ | (5) |
Compared with the W-Max (global policy) method, the other two methods achieved good performance in Youngjoong Ko's experiments: D-TMax performed well on Reuters-21578 (skewed category distribution), while D-Max performed well on 20 Newsgroups (uniform category distribution) [8]. From the experiments on these two well-known document collections, we can conclude that the same method may behave differently on different datasets.
Our review of document representation shows that different datasets require different methods to achieve good performance. How can we know the effect of a document representation method before choosing it? In other words, when selecting a document representation strategy for an unknown dataset, which method is a better choice? This is the second question we ask in this study; we present the answer at the end of this paper. Table 1 records the numbers of documents that contain term tk and that do not contain term tk under category ci and its complement $ \stackrel{-}{{c}_{i}} $. Usually, d >> a, b, c.
category | tk | $ \stackrel{-}{{t}_{k}} $
Positive category: ci | a | b
Negative category: $ \stackrel{-}{{c}_{i}} $ | c | d
An improper choice of document representation strategy leads to inappropriate weights being assigned to terms. A test document can first be represented as |C| different vectors using the estimated distribution of each category; for some categories, the weights assigned to terms have a negative impact on the role those terms play in classification. To illustrate this, suppose the training set is skewed, with 19 documents, 5 terms, and 5 categories. The relationship between terms, documents, and categories is shown in Table 2; each number in Table 2 is the number of times a term occurs in a document.
category | document | t1 | t2 | t3 | t4 | t5 |
C1 | d1 | 0 | 0 | 2 | 19 | 3 |
C1 | d2 | 0 | 1 | 3 | 0 | 3 |
C1 | d3 | 0 | 0 | 5 | 16 | 2 |
C1 | d4 | 4 | 0 | 1 | 15 | 2 |
C1 | d5 | 0 | 0 | 2 | 18 | 3 |
C2 | d6 | 0 | 0 | 2 | 14 | 3 |
C2 | d7 | 0 | 1 | 3 | 0 | 3 |
C2 | d8 | 0 | 0 | 5 | 13 | 2 |
C2 | d9 | 4 | 0 | 1 | 11 | 2 |
C2 | d10 | 0 | 0 | 2 | 17 | 3 |
C3 | d11 | 0 | 0 | 3 | 0 | 3 |
C3 | d12 | 0 | 0 | 1 | 1 | 3 |
C3 | d13 | 0 | 1 | 1 | 0 | 2 |
C4 | d14 | 1 | 99 | 3 | 2 | 1 |
C4 | d15 | 2 | 99 | 1 | 1 | 1 |
C4 | d16 | 1 | 99 | 1 | 2 | 1 |
C5 | d17 | 4 | 0 | 3 | 0 | 3 |
C5 | d18 | 4 | 0 | 1 | 0 | 3 |
C5 | d19 | 4 | 1 | 1 | 0 | 2 |
Among supervised term weighting schemes, we use tf*rf as an example [1]. Based on the notation in Table 1, it is defined as follows.
$ \mathit{tf}*\mathit{rf} = tf*\log\left(2+\frac{a}{\max\left(1, c\right)}\right) $ | (6) |
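A minimal sketch of Eq (6) is given below; the function names and the toy counts are ours, and we assume base-2 logarithms, which is consistent with the magnitudes reported in Table 3.

```python
import math

def rf(a, c):
    """Relevance-frequency factor of Eq (6): log2(2 + a / max(1, c)), where a (resp. c)
    is the number of positive (resp. negative) documents containing the term."""
    return math.log2(2 + a / max(1, c))

def tf_rf(tf, a, c):
    """tf*rf weight of a term in a document with raw term frequency tf."""
    return tf * rf(a, c)

# Toy usage: a term occurring twice in the document, present in 4 positive
# and 7 negative training documents.
print(tf_rf(2, a=4, c=7))   # 2 * log2(2 + 4/7) is roughly 2.725
```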
The term weights of t1 to t5 for each category are shown in Table 3.
category | t1 | t2 | t3 | t4 | t5 |
C1 | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
C2 | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
C3 | 1.0000 | 1.1699 | 1.1375 | 1.0704 | 1.1375 |
C4 | 1.2630 | 1.4510 | 1.0875 | 1.1520 | 1.0875 |
C5 | 1.4594 | 1.0000 | 1.1375 | 1.0000 | 1.1375 |
In Youngjoong Ko's methods, D-TMax selects the two largest sum values. For multi-class classification problems, could selecting more categories yield better performance? We now select 1 to 5 categories to test this hypothesis. Choosing 1 or 2 categories gives "D-Max" (Document Max) and "D-TMax" (Document Two Max) [9]; following this naming, we call the 3-, 4-, and 5-category versions "D-3Max" (Document Three Max), "D-4Max" (Document Four Max), and "D-5Max" (Document Five Max). For multiclass text categorization in general, we call the scheme "D-NMax" (Document Number Max) in this study. The number of selected categories and the corresponding term weighting vectors are shown in Table 4.
method | t1 | t2 | t3 | t4 | t5 |
D-Max | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
D-TMax | 1.1155 | 1.1699 | 1.2538 | 1.3626 | 1.2538 |
D-3Max | 1.2630 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
D-4Max | 1.4594 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
D-5Max | 1.4594 | 1.4510 | 1.2538 | 1.3626 | 1.2538 |
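For concreteness, the short script below rebuilds Table 4 from the per-category weights in Table 3; the NumPy formulation and the function name d_nmax are ours, but the input numbers are copied verbatim from Table 3.

```python
import numpy as np

# Per-category tf*rf weights from Table 3 (rows C1..C5, columns t1..t5).
M = np.array([
    [1.1155, 1.1699, 1.2538, 1.3626, 1.2538],   # C1
    [1.1155, 1.1699, 1.2538, 1.3626, 1.2538],   # C2
    [1.0000, 1.1699, 1.1375, 1.0704, 1.1375],   # C3
    [1.2630, 1.4510, 1.0875, 1.1520, 1.0875],   # C4
    [1.4594, 1.0000, 1.1375, 1.0000, 1.1375],   # C5
])

def d_nmax(M, n):
    """D-NMax: keep the n rows with the largest row sums (Eq (4)),
    then take the element-wise maximum over those rows."""
    order = np.argsort(-M.sum(axis=1))    # categories sorted by descending sum
    return M[order[:n]].max(axis=0)

for n in range(1, M.shape[0] + 1):
    print(f"D-{n}Max:", d_nmax(M, n))     # reproduces the rows of Table 4
```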
According to the per-category frequencies in Table 2, t3 and t5 are evenly distributed across the categories and have no ability to distinguish them. The distributions of t1, t2, and t4 are more discriminative, especially t2, which is heavily concentrated in C4; its weight should be the highest among the five terms. Although t1 appears mainly in C5, it also appears in individual documents of C1, C2, and C4, so it has little ability to distinguish categories. As for t4, its discriminative power is not specific to a single category: it can distinguish C1 and C2 from the other categories, but it cannot tell whether a document belongs to C1 or C2. The results in bold in Table 4 violate our intuition that t1, t2, and t4 should receive large weights and that the weight of t1 should be relatively small compared with t2 and t4, since t1 appears with a low frequency in documents compared with t2 and t4. Another unreasonable observation is that t1 occurs in some documents of categories 1 and 2 (d4 and d9) with the same frequency as in the documents of category 5. We can thus see that the results of D-Max, D-TMax, and D-5Max (W-Max) cannot boost the performance of text categorization, whereas the results of D-3Max are consistent with our intuition that t1, t2, and t4 should receive large weights and that the weight of t1 should be relatively small. To overcome the shortcomings of Youngjoong Ko's methods, in this section we explain our proposed DRGP method, which uses a traversal procedure to appropriately weigh the contribution of each term. DRGP selects the appropriate value of N (in D-NMax) to enhance the performance of text categorization. While our previous work in [18] provided experimental evidence for the effectiveness of DRGP under particular experimental circumstances, here we formally address how and why this new method is proposed, using analytical explanation and empirical observation. The results in Section 4 show that the proposed strategy significantly outperforms the other methods.
In the DRGP method, we traverse the term weighting vectors generated by each class and compare their weighting effects on the training set; the term weighting vector that produces the best effect on the training set is selected as the term weighting vector for the test set. The idea is mainly inspired by Youngjoong Ko and by Miao and Kamel [19], who proposed a pairwise optimized Rocchio algorithm that dynamically adjusts the prototype position between pairs of categories on the training set and records the best prototype position for the test set. The main process of DRGP is summarized in Algorithm 1.
Algorithm 1: document representation strategy based on global policy (DRGP)
Input:
  fea: feature matrix of the training set
  gnd: vector of labels for the documents in the training set
Output:
  selectedC: the most appropriate N value (in D-NMax) for the current dataset
Local variables:
  |C|: total number of categories; M: total number of features;
  termWeightingVec1: the set of |C| original term weighting vectors;
  termWeightingVec1i: i-th original term weighting vector;
  termWeightingVec2i: i-th reconstructed term weighting vector;
  sumVeci: sum of all term weights in the i-th term weighting vector;
  sortSum: list of the sum values sorted in descending order;
  weightedFeai: fea weighted by termWeightingVec2i;
  MicroF1i: result of 10-fold cross validation on weightedFeai
begin
1: apply the supervised term weighting method to fea and obtain termWeightingVec1;
2: for i = 1 to |C|
3:   for j = 1 to M
4:     compute sumVeci for termWeightingVec1i;
5:   end for
6: end for
7: sort all sumVeci and obtain sortSum;
8: for i = 1 to |C|
9:   for j = 1 to M
10:    for k = 1 to i
11:      tempC = sortSum[k];
12:      construct termWeightingVec2i as follows: take the j-th dimension of each of the selected k term weighting vectors, i.e., termWeightingVec1[tempC][j];
13:      set the j-th value of termWeightingVec2i to the maximum over all termWeightingVec1[tempC][j];
14:    end for
15:  end for
16: end for
17: for i = 1 to |C|
18:   compute weightedFeai;
19: end for
20: for i = 1 to |C|
21:   compute MicroF1i;
22: end for
23: record the i corresponding to the maximum MicroF1i and assign it to selectedC;
end
After Algorithm 1 has been executed, we obtain selectedC, i.e., the value of N in D-NMax. We can then use Eq (7) to calculate the term weighting vector of the test set. The sum of all term weights in each vector vi is first calculated using Eq (4); the N largest sums, labeled sum1, sum2, …, sumN, are then selected, where the subscripts 1, 2, …, N are the indices of different categories. The k-th element of vd is calculated as in Eq (7).
$ {w}_{k} = \max({t}_{1k}, {t}_{2k}, \dots, {t}_{ik}, \dots, {t}_{Nk});\; i\in [1, N],\; k\in [1, m] $ | (7) |
In Algorithm 1, steps 1 to 6 apply the term weighting method to the original dataset to obtain the |C| original term weighting vectors and sum each of them to obtain sumVeci. Step 7 sorts all sumVeci to obtain sortSum, and steps 8 to 16 reconstruct |C| term weighting vectors as follows: the first 1 to |C| vectors of sortSum are selected in turn, and one term weighting vector is constructed each time, whose j-th element is the maximum of the j-th dimensions of the currently selected k term weighting vectors. Steps 17 to 19 use each newly constructed vector termWeightingVec2i to weight the fea matrix, yielding the corresponding weightedFeai. Steps 20 to 22 evaluate each weightedFeai by ten-fold cross-validation and record the corresponding MicroF1i. Step 23 records the maximum MicroF1 and the index i of the corresponding weight vector as selectedC. MicroF1 is a popular performance measure in text categorization; its formula is presented in Section 4.
However, the algorithm has a shortcoming: its high time complexity. The most expensive part is calculating MicroF1i (steps 20 to 22), which uses ten-fold cross-validation. According to Lin's 2006 Machine Learning Summer School lecture, the time complexity of the support vector machine algorithm is O(N3), so the time complexity of this algorithm can be estimated as O(|C|*N3). Compared with algorithms that trade time for space, the DRGP algorithm trades time for precision. It is therefore better suited to classification systems that use a fixed training set, i.e., systems where the optimal model is found once on the training set and then applied for a long time without changing the parameters.
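The following sketch outlines Algorithm 1 in Python with scikit-learn. It assumes the per-category term weighting vectors (termWeightingVec1) have already been computed by a supervised scheme such as tf*rf, and that fea is a dense NumPy matrix; the choice of LinearSVC, the fold count, and the function name are our own stand-ins rather than part of the original LIBSVM-based implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def drgp_select_n(fea, gnd, class_term_weights, folds=10):
    """Pick the D-NMax parameter N by cross-validated MicroF1 on the training set.

    fea:  (n_docs, m) raw term-frequency matrix of the training set (dense array)
    gnd:  (n_docs,)   class labels
    class_term_weights: (|C|, m) per-category term weighting vectors (termWeightingVec1)
    """
    order = np.argsort(-class_term_weights.sum(axis=1))     # steps 2-7: sort by row sums
    best_n, best_f1 = 1, -1.0
    for n in range(1, class_term_weights.shape[0] + 1):      # steps 8-22: try every N
        w = class_term_weights[order[:n]].max(axis=0)        # D-NMax vector (termWeightingVec2)
        weighted_fea = fea * w                                # steps 17-19: weight the training matrix
        f1 = cross_val_score(LinearSVC(), weighted_fea, gnd,
                             cv=folds, scoring="f1_micro").mean()
        if f1 > best_f1:                                      # step 23: keep the best N
            best_n, best_f1 = n, f1
    return best_n

# Hypothetical usage (arrays not defined here):
# selectedC = drgp_select_n(fea, gnd, termWeightingVec1)
```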
The 20 Newsgroups corpus is a commonly used benchmark dataset in TC [20]. The corpus contains 20,000 newsgroup documents distributed nearly uniformly across 20 classes. In this study, we use the version of 20 Newsgroups sorted by date. After removing duplicates and headers, the remaining 18,846 documents are partitioned into 11,314 (about 60 percent) training documents and 7532 (about 40 percent) testing documents. After preprocessing, there are 26,214 distinct words in this dataset.
The Reuters-21578 corpus is used in many experiments and contains 21,578 documents in 135 categories [21,22]. We use its ModApte version, which has 5946 training documents and 2347 testing documents. In this study, we choose the top 10 largest categories, which contain 5228 training documents and 2057 testing documents. After preprocessing, the resulting vocabulary has 18,221 distinct words. Compared with 20 Newsgroups, it is a skewed dataset: 80 percent of the categories contain less than 7.5 percent of the instances.
The statistics of datasets are listed in Table 5.
Datasets | # of documents | distinct words | # of classes | # of training | # of testing |
20 Newsgroups | 18,846 | 26,214 | 20 | 11,314 | 7532 |
Reuters21578 | 7285 | 18,221 | 10 | 5228 | 2057 |
Several document representation strategies, listed in Table 6, are compared in this study. Besides D-Max, D-TMax, and W-Max, we also investigate D-NMax in the experiments, listing the results for every number of selected categories from 1 to the maximum, so that the effect of changing the number of selected categories can be seen.
Denoted by | Description |
D-Max | The sum of all term weights in each term weighting vector is first calculated and then one term weighting vector with the maximum sum value is selected as the document representation vector. |
D-TMax | The sum of all term weights in each term weighting vector is calculated and then two term weighting vectors with the two largest sum values are selected. Then the term weighting vector is constructed by choosing the higher term weighting value from the selected two term weighting vectors for each corresponding dimension's term weight. |
D-NMax | The sum of all term weights in each term weighting vector is calculated and then N term weighting vectors with the N largest sum values are selected. Then the term weighting vector is constructed by choosing the highest term weighting value from the selected N term weighting vectors for each corresponding dimension's term weight. |
W-Max (global policy) | Each term's value of the term weighting vector will be replaced by the maximum value of the corresponding dimension's term weight in all categories.
To evaluate the classification performance of the proposed method, we choose the well-established Support Vector Machine (SVM) learning algorithm [23,24]. Although other algorithms such as k-Nearest Neighbor, Decision Tree, and Naive Bayes are also widely used, they are not included because they cannot exploit real-valued term weights beyond the binary representation (see an exception in [25]). Finally, the SVM learning algorithm scales to large classification problems with thousands of features and examples.
SVM is a relatively efficient machine learning algorithm with good performance, based on the structural risk minimization principle from computational learning theory [26]. It can handle large-scale, high-dimensional datasets with high classification accuracy. According to the kernel function, SVMs can be divided into nonlinear (e.g., polynomial, sigmoid, radial basis function) and linear methods. Leopold and Kindermann pointed out that term weighting schemes dominate the performance of SVM classifiers rather than the kernel functions [27]. Moreover, the literature has shown that the linear SVM is superior to the nonlinear SVM [28]. We therefore select the linear kernel in this study and set the other SVM parameters to their default values. The SVM software used is LIBSVM-3.14 [29].
The standard measures for determining the performance of a classification task are precision and recall [30,31]. However, it is well known that high precision may come with low recall, so considering precision and recall separately is not effective [1]. The widely used F1 measure combines precision and recall. For a given category i, precision, recall, and the F1 measure are defined as follows.
$ {\mathit{Precision}}_{i} = {TP}_{i}/({TP}_{i}+{FP}_{i}) $ | (8) |
$ {\mathit{Recall}}_{i} = {TP}_{i}/({TP}_{i}+{FN}_{i}) $ | (9) |
$ {F}_{1}^{i} = \left(2*{Precision}_{i}*{Recall}_{i}\right)/{(Precision}_{i}+{Recall}_{i}) $ | (10) |
where $ {TP}_{i} $ is the number of documents assigned correctly to class i, $ {FP}_{i} $ is the number of documents that do not belong to class i but are assigned to class i. $ {FN}_{i} $ is the number of documents that actually belong to class i but are not assigned to the class i.
The F1 measure is estimated by MicroF1 and MacroF1 [32]. In this study, MicroF1 and MacroF1 are employed to measure the performance of the proposed method. They are computed as in Eqs (11) and (12).
$ {\mathit{MicroF}}_{1} = \left(2*Precision*Recall\right)/(Precision+Recall) $ | (11) |
$ {MacroF}_{1} = \frac{1}{m}*\sum _{i = 1}^{m}{F}_{1}^{i} $ | (12) |
In Eq (12), m is the number of categories. MicroF1 assigns equal weight to each document and can be considered an average over all document/category pairs, with the Precision and Recall in Eq (11) computed from the counts pooled over all categories. MacroF1 assigns equal weight to each category and is therefore influenced by the results on rare categories.
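As a minimal sketch of Eqs (8) to (12), the following function computes both averages from per-class counts; the function name and the counts in the usage line are our own, invented for illustration.

```python
import numpy as np

def micro_macro_f1(tp, fp, fn):
    """Micro- and macro-averaged F1 from per-class counts (Eqs (8)-(12)).
    tp, fp, fn are sequences with one entry per class."""
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # Per-class precision, recall and F1 (Eqs (8)-(10)).
    prec_i = tp / (tp + fp)
    rec_i = tp / (tp + fn)
    f1_i = 2 * prec_i * rec_i / (prec_i + rec_i)
    # Micro-averaging pools the counts over all classes before computing P and R.
    P = tp.sum() / (tp.sum() + fp.sum())
    R = tp.sum() / (tp.sum() + fn.sum())
    micro_f1 = 2 * P * R / (P + R)
    macro_f1 = f1_i.mean()          # Eq (12): unweighted mean over classes
    return micro_f1, macro_f1

# Toy usage with three classes.
print(micro_macro_f1(tp=[50, 30, 10], fp=[5, 10, 2], fn=[8, 6, 12]))
```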
To verify the effectiveness of the proposed method, besides the tf*rf scheme mentioned above, we also include tf*or, which has been widely used in previous studies [33]. Its formula is
$ tf*or = tf*\mathrm{log}\left(\frac{a*d}{b*c}\right) $ | (13) |
The main purpose of the experiments is to address the two questions raised above, i.e., to explore which document representation strategy is superior for supervised term weighting schemes and to find a way to choose a document representation strategy in advance. To this end, we compare the methods on two popular benchmark corpora, 20 Newsgroups and Reuters-21578, using SVM, in terms of the MicroF1 and MacroF1 measures. In addition to the results of DRGP, the results obtained when selecting other numbers of categories are also listed. The experimental results and discussion on these two corpora are given in Sections 4.5.1-4.5.3.
The returned value of selectedC is 20 when DRGP is used on 20 Newsgroups weighted with the tf*or term weighting method. Eq (7) can then be used to calculate the term weighting vector of the test set. Figure 1 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*or and the SVM classifier with a linear kernel on 20 Newsgroups. The best MicroF1 (0.7884) and MacroF1 (0.7837) are achieved by DRGP and W-Max. D-Max obtains 0.5539 (MicroF1) and 0.5521 (MacroF1), and D-TMax obtains 0.6036 (MicroF1) and 0.6027 (MacroF1). The results and trends of MicroF1 and MacroF1 on 20 Newsgroups are essentially the same, mainly because 20 Newsgroups is a balanced dataset.
The returned value of selectedC is 20 when DRGP is used on 20 Newsgroups weighted with the tf*rf term weighting method. Eq (7) can then be used to calculate the term weighting vector of the test set. Figure 2 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*rf and the SVM classifier with a linear kernel on 20 Newsgroups. The best MicroF1 (0.7958) and MacroF1 (0.7909) are achieved by DRGP and W-Max. D-Max obtains 0.7562 (MicroF1) and 0.7502 (MacroF1), and D-TMax obtains 0.7592 (MicroF1) and 0.7529 (MacroF1).
The returned value of selectedC is 6 when DRGP is used on Reuters-21578 weighted with the tf*or term weighting method; in other words, the most appropriate N value for this dataset is 6. Eq (7) can then be used to calculate the term weighting vector of the test set: the sum of all term weights in each term weighting vector is calculated, the six vectors corresponding to the six largest sums are selected, and the final vector is constructed by choosing, for each dimension, the highest value among the selected six vectors. Figure 3 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*or and the SVM classifier with a linear kernel on Reuters-21578. The best MicroF1 (0.9203) and MacroF1 (0.8919) are achieved by DRGP. D-Max obtains 0.8769 (MicroF1) and 0.8284 (MacroF1), D-TMax obtains 0.8697 (MicroF1) and 0.825 (MacroF1), and W-Max obtains 0.8898 (MicroF1) and 0.882 (MacroF1).
The returned value of selectedC is 6 when DRGP is used on Reuters-21578 weighted with the tf*rf term weighting method. Eq (7) can then be used to calculate the term weighting vector of the test set. Figure 4 shows the MicroF1 and MacroF1 results when using the supervised term weighting method tf*rf and the SVM classifier with a linear kernel on Reuters-21578. The best MicroF1 (0.9283) and MacroF1 (0.9086) are achieved by DRGP. D-Max obtains 0.9258 (MicroF1) and 0.9077 (MacroF1), D-TMax obtains 0.9258 (MicroF1) and 0.8966 (MacroF1), and W-Max obtains 0.9269 (MicroF1) and 0.9049 (MacroF1).
The above results show that DRGP and W-Max achieve the best performance on 20 Newsgroups, while DRGP achieves the best performance on Reuters-21578. In addition, the results and trends of MicroF1 and MacroF1 on 20 Newsgroups are basically consistent, which is natural because 20 Newsgroups is balanced while Reuters-21578 is skewed. Moreover, many documents in Reuters-21578 carry two or more labels; for example, if a document with two labels is represented using the class distribution of only one of them, a classifier may have difficulty assigning it to the other class.
On skewed datasets, the D-Max, D-TMax, and W-Max (global policy) representation strategies are therefore not suitable. Note that although our experiments use the same corpora and the same evaluation measures as Youngjoong Ko's [8], there are minor differences in data preparation, such as the stemming and stop word lists used for text preprocessing and the feature selection measures. On Reuters-21578, the experimental results show the same phenomenon: the results obtained by D-TMax are higher than those of D-Max, and the results obtained by W-Max are the worst.
From the experimental results, we observe that on the balanced dataset (20 Newsgroups), MicroF1 and MacroF1 increase as more categories are selected, whereas on the skewed dataset (Reuters-21578) they first increase and then decrease. Therefore, to obtain better results, it is necessary to select the appropriate method according to the characteristics of the dataset. The proposed DRGP method achieves the best performance on both balanced and skewed datasets; on the skewed dataset in particular, the results are greatly improved compared with those of the traditional methods. Note that here we only use the tf*or and tf*rf weighting methods and the SVM classifier; our method can be extended to other supervised term weighting methods, and other classifiers can be used for cross validation or classification.
Document representation is one of the most important parts of constructing a text classifier. For supervised term weighting methods, besides the global policy and local policy, there are strategies such as D-Max, D-TMax, and W-Max. Faced with so many strategies, it is not obvious how to choose one to achieve the best results; in other words, the value of N in D-NMax cannot be determined in advance.
In this study, we investigated and addressed this problem. We found that the representation method should be chosen according to the characteristics of the corpus to achieve better classification performance. Compared with methods such as D-Max, D-TMax, and W-Max, the method proposed in this study is not only a weighting method but also a selection strategy: it selects the appropriate value of N (in D-NMax) according to the distribution of each category in the dataset. Testing the proposed method with two representative supervised term weighting methods (tf*rf and tf*or) on two datasets (20 Newsgroups and Reuters-21578) shows that it is effective on both balanced and unbalanced datasets.
Based on the original document representation methods, the proposed method introduces the idea of traversal: through cross validation, it first finds the optimal document representation model on the training set and then applies the selected model to the test set. Because different classifiers have different working mechanisms, the choice of classifier may affect the final results, but the proposed method has no special requirements for the classifier; in this study we use SVM. The method can be extended to other supervised feature weighting methods, and other classifiers can be used for cross validation or classification. Compared with the original document representation methods, the proposed method incurs an additional computational cost (cross validation on the training set). In future research, we will study new optimization methods to reduce this cost.
At the end of this study, we answer the questions we raised in the beginning of the paper as conclusions:
1) Is there a relatively general strategy for supervised term weighting schemes, and, if yes, which one can achieve better performance?
Through a series of evaluations in text categorization on both a balanced dataset and a skewed dataset, we find that W-Max, D-Max, and D-TMax perform differently on different types of datasets. The proposed DRGP and W-Max achieve the best performance on 20 Newsgroups, and the proposed DRGP achieves the best performance on Reuters-21578.
2) How can we know the effect of document representation method before we choose it? In other words, when selecting a document representation strategy for an unknown data set, which method is the best choice?
In this study, we propose the DRGP method, which can select an appropriate representation strategy for an unknown dataset. Whether the dataset is uniform or not, it traverses the constructed optimization function to find the best document representation strategy on the training set and then applies it to the test set. DRGP exhibits stable and consistent improvement over most of the previous document representations considered in the experiments.
This work was supported by the National Natural Science Foundation of China under Grant 61977015 and the Jilin Province Development and Reform Commission Project of China under Grant 2020C017-3. We would like to thank these organizations for their support.
All authors declare no conflicts of interest in this paper.