Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets

Henri Tiittanen; Liisa Holm; Petri Törönen

doi:10.3934/aci.2022003

Applied Computing and Intelligence

2022, Volume 2, Issue 1: 49-62. doi: 10.3934/aci.2022003


Research article


1. Institute of Biotechnology, Helsinki Institute of Life Sciences (HiLife), University of Helsinki, Helsinki, Finland
2. Organismal and Evolutionary Biology Research Program, Faculty of Biosciences, University of Helsinki, Helsinki, Finland

Academic Editor: Pasi Franti

Received: 11 February 2022 Revised: 30 March 2022 Accepted: 30 March 2022 Published: 31 March 2022

Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires cross validation methods designed specifically for multilabel data. In this article, we show that the most widely used cross validation split quality measure does not behave adequately on multilabel data with strong class imbalance. We present improved measures and an algorithm, optisplit, for optimizing cross validation splits. An extensive comparison of various types of cross validation methods shows that optisplit produces more even cross validation splits than the existing methods and that it is among the fastest methods with good splitting performance.
Keywords:
- stratified cross validation
- multilabel learning
- multilabel cross validation
- classification
- gene ontology
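
To make the notion of split quality concrete, here is a minimal sketch of how the evenness of a multilabel split could be scored. It is not the measure or the optisplit algorithm from the article, neither of which is specified on this page. The function `relative_label_imbalance`, its helper `positive_fraction`, and the toy data are hypothetical and chosen only for illustration: the score compares each label's positive fraction in every fold with its fraction in the whole dataset, relative to the latter, so that rare labels are not drowned out by frequent ones.

```python
from typing import Sequence


def positive_fraction(rows: Sequence[Sequence[int]], label: int) -> float:
    """Fraction of rows in which the given label is set to 1."""
    return sum(row[label] for row in rows) / len(rows)


def relative_label_imbalance(
    labels: Sequence[Sequence[int]], folds: Sequence[Sequence[int]]
) -> float:
    """Hypothetical split quality score (illustrative, not from the article).

    For every label and fold, take the absolute difference between the
    label's positive fraction in the fold and in the full dataset, divided
    by the full-dataset fraction. Return the mean over labels and folds.
    0.0 means every fold reproduces the global label frequencies exactly.
    """
    n_labels = len(labels[0])
    total = 0.0
    counted = 0
    for label in range(n_labels):
        global_frac = positive_fraction(labels, label)
        if global_frac == 0.0:
            continue  # label never occurs, so there is nothing to stratify
        for fold in folds:
            fold_rows = [labels[i] for i in fold]
            fold_frac = positive_fraction(fold_rows, label)
            total += abs(fold_frac - global_frac) / global_frac
            counted += 1
    return total / counted if counted else 0.0


# Toy data (assumed): 8 examples, 2 labels; label 1 is rare (2 of 8 positive).
Y = [
    [1, 1], [1, 0], [0, 0], [0, 0],
    [1, 1], [1, 0], [0, 0], [0, 0],
]

even_folds = [[0, 1, 2, 3], [4, 5, 6, 7]]    # each fold mirrors the dataset
skewed_folds = [[0, 4, 1, 2], [5, 3, 6, 7]]  # both rare positives in fold 0

even_score = relative_label_imbalance(Y, even_folds)
skewed_score = relative_label_imbalance(Y, skewed_folds)
print(even_score, skewed_score)
```

On this toy data the even split scores 0.0 and the skewed split scores higher, with the rare label contributing the largest relative deviations. Normalizing by the global fraction is one possible design choice for keeping rare labels visible in the score; whether it matches the measures proposed in the article is an assumption.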
Citation: Henri Tiittanen, Liisa Holm, Petri Törönen. Novel split quality measures for stratified multilabel cross validation with application to large and sparse gene ontology datasets[J]. Applied Computing and Intelligence, 2022, 2(1): 49-62. doi: 10.3934/aci.2022003



© 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)