
Speech problems are linked to negative effects on quality of life, significant indirect work-related costs, short-term demands, and projected primary health care costs of approximately $5 billion a year worldwide [1]. Management of dysphonia may include medical therapy, surgery, and/or speech therapy. Speech therapy can be used as the primary mode of treatment, as an alternative to other treatments, or as an adjunct to them. Voice therapy has been shown to be effective in patients with muscle tension dysphonia, benign phonotraumatic vocal fold lesions, age-related degeneration of the vocal folds, neurological disorders (including Parkinson's disease), and reflux-associated voice disorders. To date, most studies of voice therapy have been carried out in university tertiary voice clinics, whereas further studies on the use of speech therapy have been conducted by otolaryngologists, who are subject to recall bias [1]. There is wide variation in what is perceived as a 'normal voice'. It is difficult to determine its essential properties because a continuum exists between a normal and a disordered voice. A normal voice is essentially unremarkable in quality and allows adequate communication without unnecessary effort or discomfort. Hoarseness is a word that describes an abnormal, harsh, breathy, weak or strained voice quality. A voice problem, or dysphonia, can be defined as any impairment, activity limitation or participation restriction arising from a structural or functional anomaly of the voice mechanism (World Health Organization) [2]. Voice production can be characterized by its vocal parameters: fundamental frequency, intensity, vibration and intonation. The perceptual correlate of frequency is pitch, which should be appropriate for age and sex, and the perceptual correlate of intensity is loudness, which should be suitable for the environment [3].
A person's voice conveys features such as gender, age, emotional state and cultural heritage [4]. It represents individual identity and makes it possible to differentiate between individuals. The voice reflects different aspects of the individual's physical, social, cultural and psychological development at different stages of life: infancy, puberty, adulthood and old age [5]. A good voice fully satisfies the professional and/or personal needs of the individual and is produced comfortably. Voice quality may be affected by hormonal changes, asthma, disease, vascular, neurological and emotional disorders, surgery, or other general health-related factors [3]. There are, however, no universal criteria to determine the characteristics and limits of a normal voice, and certain shifts in voice during vocalization are expected and socially acceptable. Some changes, however, cannot be explained as indicators of social or emotive expression, even when such shifts are taken into account. Such changes are called dysphonia [4]. Voice disorders manifest in various ways, including the presence of sensory and auditory symptoms, deviations in vocal quality, and functional and/or structural laryngeal changes that may involve behavioral and/or organic factors associated with their genesis and maintenance [5]. These disorders can have a negative impact on the patient's quality of life, compromising social, emotional, and work-related situations [6,7]. Patients with voice disorders may experience various symptoms, of which hoarseness, sore throat, vocal fatigue, and throat clearing are the most common. These symptoms may be associated with intense voice use, upper respiratory tract infections, stress, and smoking [8].
Because the manifestation of a voice disorder is multidimensional, its assessment must include a variety of factors, including perceptual voice assessment, visual laryngeal inspection, acoustic analysis, aerodynamic assessment, and vocal self-assessment [9]. Voice pathologies can be detected using computer-aided classification tools, and research in voice pathology has recently focused on machine learning techniques. These tools can support early diagnosis and adequate treatment of voice pathologies. In clinical practice, voice pathology is detected by several procedures, including acoustic analysis. Voice disorder services are used to study the acoustic behavior of voices affected by different forms of vocal impairment, both in hospitals and in electronic voice disorder detection systems. The assessment of voice impairment, such as dysphonia, is an essential part of the medical evaluation and treatment of the human voice. In addition to endoscopic examination of the larynx and vocal folds, visual and acoustic measurement techniques are crucial components in the clinical evaluation of dysphonia. Acoustic analysis consists of calculating the relevant parameters from the voice signal in order to identify alterations of the vocal tract, in compliance with the SIFEL recommendations [10] and following the instructions of the Phoniatrics Committee of the European Society of Laryngology. In contrast to other medical tests, such as direct observation of the vocal folds, it is a non-invasive clinical examination [11,12]. The use of classifier systems for medical diagnosis is gradually increasing. Recent advances in artificial intelligence have led to the development of expert systems and decision support systems (DSS) for medical applications. Expert systems and various artificial intelligence detection methods have the potential to be good medical tools.
Classification systems may contribute to the increase in precision, accuracy and reliability of diagnosis and the reduction of possible errors [13].
The first database used in this review is the Saarbruecken Voice Database (SVD) [14], a collection of voice recordings from over 2000 people. A recording session contains the following recordings: 1) the vowels [i, a, u] produced at normal, high and low pitch; 2) the vowels [i, a, u] with rising-falling pitch; 3) the sentence "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). For each of these components, the voice signal and the EGG signal were stored in individual files. The database includes a text file containing all relevant information about the dataset. These characteristics make it a good choice for experimenters. All recorded SVD voices were sampled at 50 kHz with 16-bit resolution. In some recording sessions not all vowels are included in each version, depending on the quality of their recording. The database is available through the web interface of the 'Saarbruecken Voice Server'. It consists of multiple web pages that are used to choose parameters for the database query, to play recordings directly, and to select the recording session files to be exported once the desired parameters have been chosen.
The second database used in this review is the Massachusetts Eye and Ear Infirmary (MEEI) database [15]. It contains over 1,400 voice samples of the sustained vowel /a/ and the first part of the Rainbow passage, and was created by the MEEI Voice and Speech Lab. It is distributed commercially by Kay Elemetrics [16], and its samples were recorded in two different environments. Normal samples were recorded at a sampling frequency of 50 kHz, while pathological samples were recorded at 25 kHz or 50 kHz. It is used in most voice pathology detection and classification experiments, although the different conditions under which normal and pathological voices were captured are a significant drawback. In this collection, several tools, such as stroboscopy, acoustic and aerodynamic measures, and physical examination of the neck and mouth, were used to assess speech disorders (this information was provided by Kay Elemetrics).
The third database used in this review is the Arabic Voice Pathology Database (AVPD) [15]. Samples of words and voices were recorded in several sessions at the Communication & Swallowing Disorders Unit of King Abdul Aziz University Hospital in Riyadh, Saudi Arabia. Experienced phoneticians collected the patients' voices in a sound-treated room using a standard recording protocol. The database protocol was designed to avoid specific shortcomings of the MEEI database [17]. The AVPD provides recordings of sustained vowels and speech from patients with vocal fold disorders, together with the same recordings from normal speakers. Vocal fold pathologies were identified after clinical examination with a laryngeal stroboscope. For pathological samples, the perceptual severity of the voice disorder was rated on a scale of 1–3, where 3 is the most severe. The severity rating of each sample was based on the judgment of three medical experts. The recorded texts are of three types: (1) three sustained vowels with onset and offset information; (2) isolated Arabic words, including several common words; and (3) continuous speech. The text was specifically chosen to cover all Arabic phonemes. Most speakers recorded three utterances of each vowel: /a/, /u/ and /i/. Isolated words and continuous speech were recorded only once to avoid overburdening the patients. For both normal and pathological samples in the AVPD, the sampling frequency is 50 kHz.
This paper provides a meta-analysis of research articles that directly target voice disorders, the databases used for their detection, and the machine learning techniques applied, as illustrated in figure 1. The aim of this review is to investigate, summarize, analyze and discuss a series of research articles with regard to their details, findings and accuracy. Our search was based on research papers from databases such as PubMed, IEEE Xplore and ScienceDirect, up to June 2020. In this paper, we primarily aim to assess the current efficacy of the various machine learning methods used to detect voice disorders and to explore the progress made, the shortcomings and problems encountered, and future research needs. To the best of our knowledge, this is the first literature review that covers all three of the most popular databases available for voice disorders, i.e., SVD [14], MEEI [15] and AVPD [15]. The important contributions of this paper are:
● Meta-analysis on the detection of voice disorder using SVD [14], MEEI [15] and AVPD [15] databases.
● Review outcomes and accuracy of 45 relevant articles.
● Identify the gap for research in this field.
This paper is organized as follows. Section 1 provides a short introduction to voice disorders and the databases we have targeted. Section 2 describes the methodology used to conduct this review of the literature. The findings of this systematic assessment are presented in Section 3. Section 4 deals with our main research concerns. Section 5 concludes the paper with limitations, research gaps and recommendations for further investigation.
The population, intervention, comparison and outcome (PICO) method [18] was used for this meta-analysis. The search strategy was set up according to PICO:
● P = (Population) = people with voice disorders
● I = (Intervention) = detection with data given in the form of voices. Here data extraction is done from SVD [14], MEEI [15] and AVPD [15].
● C = (Comparison) = different Machine learning algorithms
● O = (Outcome) = report accuracies and compare them.
A set of search strings was generated by combining suitable synonyms and alternate terms with Boolean operators: AND restricts and narrows the search, while OR expands and broadens it [18]. With the help of these Boolean operators the search term was formulated as: (voice disorder) AND (SVD/MEEI/AVPD) AND ("computer vision" OR "neural network" OR "artificial intelligence" OR "pattern recognition" OR "machine learning"). Peer-reviewed publications were searched in three large databases: PubMed, IEEE Xplore and ScienceDirect. In ScienceDirect the search was restricted to review articles, research articles, conference abstracts, correspondences, data articles, discussions and case reports. All three databases were searched up to June 2020. We searched each of the three databases three times, once for each dataset targeted in our meta-analysis. The searches returned results from PubMed (n = 12), IEEE Xplore (n = 19) and ScienceDirect (n = 103), for a total of n = 134 results in the initial search. A total of 45 studies were included, and the whole process is shown in the flowchart in figure 2. The number of studies using each database is SVD (n = 20), MEEI (n = 31) and AVPD (n = 6), which is represented as a pie chart in figure 1. The pie chart shows that MEEI is the most used database for voice pathology detection.
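The way the Boolean operators combine the term groups can be sketched in a few lines of Python. This is only an illustration: the helper `build_query` is hypothetical, and reading "SVD/MEEI/AVPD" as one query per dataset is an assumption based on the statement that each database was searched once per dataset. The hit counts are the ones reported above.

```python
def build_query(topic, dataset, methods):
    """Join term groups: OR inside the method group broadens the search,
    AND between groups narrows it. (Hypothetical helper, for illustration.)"""
    method_group = " OR ".join(f'"{m}"' for m in methods)
    return f"({topic}) AND ({dataset}) AND ({method_group})"


METHODS = [
    "computer vision",
    "neural network",
    "artificial intelligence",
    "pattern recognition",
    "machine learning",
]

# Assumption: one query per dataset, run against each bibliographic database.
queries = {d: build_query("voice disorder", d, METHODS) for d in ("SVD", "MEEI", "AVPD")}

# Initial hit counts as reported in the text.
hits = {"PubMed": 12, "IEEE Xplore": 19, "ScienceDirect": 103}
total_hits = sum(hits.values())

if __name__ == "__main__":
    for name, query in queries.items():
        print(name, "->", query)
    print("total initial results:", total_hits)
```

The exact query syntax accepted by PubMed, IEEE Xplore and ScienceDirect differs in detail, so the generated strings would need adapting to each search interface.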
Search results were stored and organized using EndNote Web (Clarivate Analytics), and a table of the data extracted from every selected paper was created. For articles deemed potentially eligible, full texts were uploaded into EndNote Web. The first search applied the search terms to each selected database and covered the full text of both journal and conference documents. This procedure returned thousands of irrelevant results, so the search was restricted to the title and the content type of the document. Additional studies were identified from the reference lists of the relevant studies found. After collecting the primary search results, we screened the titles and abstracts for relevant studies. A full-text review was then carried out to assess the relevant studies.
This study focused on peer-reviewed articles that used machine learning to recognize voice disorders in voice recordings, as described in figure 2. We concentrated on research papers that meet the following criteria. The first criterion includes only articles that solely used voice recordings from the SVD [14], MEEI [15] or AVPD [15] database to detect voice disorders. The second criterion ensures that the selected research papers use approaches based on machine learning; it eliminates any papers that do not include machine learning or an algorithm by which the disease is identified. It also excludes papers based solely on qualitative examination that are not analyzed on the basis of accuracy and quantitative analysis. The third criterion requires that the chosen research papers also include voice filtering and segmentation techniques or software for detecting the disease. Together, the criteria ensure that the accuracy of the applied machine learning techniques is reported in all selected articles, so that they can be reviewed quantitatively. The inclusion and exclusion criteria were used to filter out irrelevant research papers and are outlined below:
Inclusion Criteria:
● Research articles based on voice recordings as data in order to predict the disorder.
● Studies that use any of the following databases: SVD [14], MEEI [15] or AVPD [15].
● Research articles that use machine learning techniques.
● Articles that include voice filtering and segmentation techniques, or an application or software, in order to detect the disease through voice.
● Articles written in English.
● Articles published in either a journal or conference proceedings.
Exclusion Criteria:
● Research articles that do not include voice recordings as data.
● Studies that did not use the SVD [14], MEEI [15] or AVPD [15] database.
● Research articles that do not use any machine learning.
● Articles that do not use voice filtering and segmentation.
● Research not written in English.
● Research not published in any journal or conference proceedings.
Table 1 shows that all the selected and screened studies were published between 2002 and 2020. Most of the publications are from the last five years, as can be seen in figure 3, which suggests that the detection of voice disorders through machine learning techniques and their application in clinical settings is an area of growing interest for researchers.
No | Author/Year | ML technique/Classifier | Feature Selection/Filter | Overall Accuracy | Overall Sensitivity | Overall Specificity |
SVD Dataset | ||||||
1. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 99.53% | 91.22% | 94.27% |
2. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 99.68% | NA | NA |
3. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 90.97% | NA | NA |
4. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 80.02% | 71.0%–84.7% | 70%–76.6% |
5. | Fonseca et al./2020 [23] | SVM | SE, ZCRs, SH | 95% | NA | NA |
6. | Garcia et al./2019 [24] | Gaussian Mixture Regression | GBR scale | NA | NA | NA |
7. | Guedes et al./2019 [25] | DLN (LSTM; CNN) | PCA | 80; 78; 66; 67; 63; 66 | 80; 78; 66; 67; 63; 66 | 80; 80; 67; 67; 69; 71 |
8. | Hammami et al./2020 [26] | SVM | HOS; DWT | 99.3%; 93.1% | 96.4%; 92.8% | 99.4%; 93.3% |
9. | Panek et al./2016 [27] | K-Means Clustering | PCA | 100% | NA | NA |
10. | Moon et al./2018 [28] | LR DT RF SVM DNN | HOS and DEO | 82.77%;80.25%;84.87%;86.13%;87.4% | NA | NA |
11. | Ezzine et al./2018 [29] | ANN SVM | glottal flow features | 99.27%;98.43% | NA | NA |
12. | Markaki et al./2009 [30] | RBF Kernel with SVM | Mutual information between subjective voice quality and computed features | 94.1% | NA | NA |
13. | Markaki et al./2011 [31] | SVM | Mutual information between voice classes (normophonic/dysphonic) | 94.1% | NA | NA |
14. | Miramont et al./2020 [32] | SVM | CPP, SDNPCV, NPCV, HNR, Noise Level, D2 (10, 25), Shimmer, and K2 (6, 25). | 86.53% | NA | NA |
15. | Muhammad et al. /2017 [33] | SVM | Glottal source excitation | 93.2 ± 0.01 | 94.3 | 92.3 |
16. | Shia et al. /2017 [34] | FFNN | DWT | 93.3% | NA | NA |
17. | Kadiri et al./2020 [35] | SVM | Glottal source features and MFCC | 76.19% | NA | NA |
18. | Zhang et al./2020 [36] | DNN | Pitch extraction and line spectrum pair | NA | NA | NA |
19. | Teixeira et al./2018 [37] | SVM | Jitter, shimmer and HNR, MFCC | 71% | NA | NA |
20. | Teixeira et al./2017 [38] | MLP-ANN | Jitter, shimmer and HNR | 100%(female); 90%(male) | NA | NA |
MEEI Dataset | ||||||
1. | A. A. Dibazar et al. /2002 [39] | HMM | mel frequency filter | 99.4% with E = 8% | NA | NA |
2. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 99.54% | 99.96% | 99.96% |
3. | Akbari et al./2014 [40] | MC-LDA; ML-NN | wavelet packet- based features | 96.67% | NA | NA |
4. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 88.21% | 88.90% | 89.21% |
5. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 99.80% | NA | NA |
6. | Zulfiqar Ali et al./2016 [41] | GMM | Estimation of Auditory Spectrum and Cepstral Coefficients | 99.56% | NA | NA |
7. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 94.6% | 94.7%–99.1% | 50.9%–94.5% |
8. | Amami et al./2017 [42] | SVM | DBSCAN and MFCCs | 98% | NA | NA |
9. | Londono et al./2010 [43] | HMM | MFCCs | 82.14 ± 2.2 | 81.13 ± 3.6 | 83.33 ± 3.4 |
10. | Arjmandi et al./2011 [44] | 1)QD classfier 2)NM classifier 3)Parzen Classifier; KNN; SVM; ML NN | MDVP parameters | 78.9%;87.20%;85.50%;88.86%;89.29%;88.7% | 88%;70.9%;73.5%;78.30%;82.25%;83% | 66%;97%;93.85%;96.17%;94.3%;85.1% |
11. | Barreira et al./2020 [45] | Gaussian Naïve Bayes | HASS-KLD, H-KLD, MFCCs, Sample skewness | 99.55% | 100% | 98% |
12. | Francis et al./2016 [46] | ANN | MMTLS | 96.48% | NA | NA |
13. | Cordeiro et al./2017 [47] | SVM, GMM, DA | MFCC, LSF | 98.7% | NA | NA |
14. | Cordeiro et al./2018 [48] | SVM | RPPC | 94.2% | NA | NA |
15. | Fang et al. /2019 [49] | DNN SVM GMM | MFCCs | 99.14 ± 1.9%;98.28 ± 2.3%;98.26 ±1.8% | NA | NA |
16. | Muhammad et al. /2013 [14] | SVM | MPEG-7 low level audio feature | 99.994% ±0.011 | 1 | 0.999 |
17. | Muhammad et al. /2013 [50] | SVM | VTAI Feature Extraction | 99.02% ± 0.01 | 99.8% ± 0.02 | 97.5% ± 0.04 |
18. | Ghasemzadeh et al. 2015 [51] | GA and LDA with SVM | Nonlinear features | 98.4% | 99.3 ± 1.2 | 94 ± 5.7 |
19. | Llorente et al./2009 [52] | MLP-NN | MFCCs | 96 ± 1.3 | 0.99 | 0.82 |
20. | Hariharan et al./2013 [53] | LS-SVM; kNN PNN; CART | Wavelet packet transform based energy/entropy | 92.24 ± 0.24;89.82 ± 0.28;89.54 ± 1.34;86.97 ± 0.20 | 93.02 ± 0.33;91.96 ± 0.56;90.62 ± 2.46;87.71 ± 0.28 | 91.49 ± 0.22;87.89 ± 0.43;88.59 ± 0.47;86.27 ± 0.42 |
21. | Ezzine et al./2018 [29] | ANN SVM | glottal flow features | 93.6%, 93.57% | NA | NA |
22. | Mahmood /2019 [54] | Naïve Bayes ANN SVM RF | MFCC | 72.70%, 93.72%, 99.78%, 99.91% | NA | NA |
23. | Mekyska et al./2015 [55] | SVM RF | spectra, inferior colliculus coefficients, bicepstrum, approximate entropy, empirical mode decomposition | 99.9 ± 0.4; 100.0 ± 0.0 | 99.8 ± 0.5; 100.0 ± 0.0 | 99.9 ± 0.7; 100.0 ± 0.0 |
24. | Miramont et al./2020 [32] | SVM | CPP, SDNPCV, NPCV, HNR, Noise Level, D2 (10, 25), Shimmer, and K2 (6, 25). | 87.06% | NA | NA |
25. | Muhammad et al. /2014 [56] | SVM | MPEG-7 feature | 99.994% | 1 | 0.999 |
26. | Muhammad et al. /2017 [33] | SVM | Glottal source excitation | 99.4 ± 0.02 | 99.4% | 98.9% |
27. | Nayak et al./2005 [57] | ANN | DWT coefficients as a feature vector | 80–85% | NA | NA |
28. | Henriquez et al./2009 [58] | NN | first- and second- order Rényi entropies, correlation entropy, correlation dimension | 99.69% | NA | NA |
29. | Salehi et al./2015 [59] | SVM | Parametric wavelet by adaptation wavelet transform | 98.30% | NA | NA |
30. | Lechon et al./2006 [60] | NN | MFCC | 89.6 ± 2.49% | NA | NA |
31. | Travieso et al./2017 [61] | HMM; Linear SVM; Kernel SVM | Nonlinear Dynamic Parameterization | 93.55 ± 3.24; 96.73 ± 3.42; 99.87 ± 0.39 | NA | NA |
AVPD Dataset | ||||||
1. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 96.02% | 91.22% | 94.27% |
2. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 72.53% | NA | NA |
3. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 91.16% | NA | NA |
4. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 83.6% | 67.9%–78.4% | 75.9%–89.74% |
5. | Mesallam et al./2017 [62] | SVM GMM VQ HMM | MFCC | 93.6%, 91.6%, 90.3%, 88.9% | NA | NA |
6. | Muhammad et al. /2017 [33] | SVM | Glottal source excitation | 91.5 ± 0.09 | 92.2% | 91.1% |
ANN = Artificial Neural Network, CART = Classification and Regression Tree, CNN = Convolutional Neural Network, DA = Discriminant Analysis, DBSCAN = Density Based Spatial Clustering of Applications with Noise, DEO = Differential Energy Operator, DLN = Deep Learning Network, DNN = Deep Neural Network, DT = Decision Tree, DWT = Discrete Wavelet Transform, FFNN = Feed Forward Neural Network, GA = Genetic Algorithm, GMM = Gaussian Mixture Model, HASS-KLD = Higher Amplitude Suppression Spectrum Kullback–Leibler Divergence, H-KLD = Histogram Kullback–Leibler Divergence, HMM = Hidden Markov Model, HNR = Harmonic to Noise Ratio, HOS = High Order Statistics features, KNN = K-Nearest Neighbor Classifier, LDA = Linear Discriminant Analysis, LR = Logistic Regression, LSF = Line Spectral Frequencies, LSTM = Long Short Term Memory, MC-LDA = Multi-Class Linear Discriminant Analysis, MDVP = Multidimensional Voice Program parameters, MFCCs = Mel-frequency cepstral coefficients, ML-NN = Multilayer Neural Network, MMTLS = Modified Mellin Transform of Log Spectrum, NA = Not Available, NM = Nearest Mean Classifier, NN = Neural Network, PCA = Principal Component Analysis, PNN = Probabilistic Neural Network, QD = Quadratic Discriminant Classifier, RF = Random Forest, RPPC = Relative Power of the Periodic Component, SE = Signal Energy, SH = Signal Entropy, SVM = Support Vector Machine, VQ = Vector Quantizer, VTAI = Vocal Tract Area Irregularity, ZCRs = Zero-Crossing Rates.
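For readers comparing the three metric columns of the table, the following minimal sketch shows how overall accuracy, sensitivity and specificity are conventionally derived from the counts of a binary (pathological vs. normal) confusion matrix. The counts are hypothetical and are not taken from any reviewed study; individual papers may also average these metrics over cross-validation folds, which this sketch does not attempt.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics.

    tp: pathological voices correctly flagged as pathological
    fn: pathological voices missed (labelled normal)
    tn: normal voices correctly labelled normal
    fp: normal voices wrongly flagged as pathological
    """
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,       # share of all samples classified correctly
        "sensitivity": tp / (tp + fn),       # true positive rate
        "specificity": tn / (tn + fp),       # true negative rate
    }


# Hypothetical counts for illustration only.
metrics = binary_metrics(tp=80, fn=20, tn=45, fp=5)

if __name__ == "__main__":
    for name, value in metrics.items():
        print(f"{name}: {value:.2%}")
```

Because accuracy mixes both classes, a study on an imbalanced dataset can report high accuracy with low specificity, which is one reason the NA entries in the sensitivity and specificity columns limit direct comparison.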
Table 1 shows that SVM is the most used algorithm for the diagnosis of voice disorders in all three datasets. The recognition of voice disorders plays an important role in our lives today. Many of these disorders should be treated at an early stage, before they progress to a critical condition. SVMs have become a popular tool for discriminative classification. Speech synthesis is a promising field for recent SVM applications [64].
Support Vector Machine (SVM) is a well-established classification approach that has attracted great scientific interest, especially for classification, regression and other learning tasks. An SVM is trained on samples with known class labels, and choosing the inputs it uses is known as feature selection or feature extraction. Feature selection and SVM classification have been used together even when no prediction of unknown samples is required, because they can identify the key feature sets involved in distinguishing the classes. The SVM maps the input space to a high-dimensional feature space. It determines the boundary between the regions belonging to the two classes by computing an optimal separating hyperplane. The hyperplane is chosen to maximize the distance to the nearest training samples. SVM models were initially defined for linearly separable classes. Because the feature space is high-dimensional, the mapping functions cannot be used directly to find the separating hyperplane. Instead, the non-linear mapping is computed implicitly using special non-linear functions known as kernels. Kernels have the advantage of operating in the input space, where the classification problem can be solved with a weighted sum of kernel functions evaluated at the support vectors. By using different kernel functions, the SVM algorithm can create a range of learning machines. SVM has been reported to achieve better accuracy and more promising results than artificial neural networks [63]. SVMs have become a common tool for machine learning tasks such as classification, regression and novelty detection. They perform well on many real-world problems, and the method is theoretically well motivated. The architecture of the learning machine does not have to be found through experimentation [66], and there are very few free parameters. While SVMs with non-linear kernels are extremely powerful classifiers, they have some downsides: 1) to find the best model, various kernel configurations and model parameters must be tested; 2) training can take a very long time, particularly if there are many features or examples in the dataset; 3) it is difficult to understand their inner workings, because the underlying models are based on complex mathematical structures and their results are difficult to interpret. In addition, care is needed in evaluation: for example, selecting features using all available data and only afterwards training and testing the classifier yields an optimistically biased error estimate [65].
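The weighted sum of kernel evaluations at the support vectors can be made concrete with a short sketch. It assumes an RBF kernel and uses hand-picked, hypothetical support vectors, coefficients and bias purely for illustration; a real SVM would obtain these values by solving the training optimization problem, which is not shown here.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2), evaluated in the input space."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_function(x, support_vectors, dual_coefs, bias, kernel=rbf_kernel):
    """f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, dual_coefs)) + bias


def predict(x, support_vectors, dual_coefs, bias):
    """Class label is the sign of the decision function (+1 or -1)."""
    return 1 if decision_function(x, support_vectors, dual_coefs, bias) >= 0 else -1


# Hypothetical, hand-picked model (NOT the result of training):
# one support vector per class in a 2-D feature space.
SUPPORT_VECTORS = [(0.0, 0.0), (2.0, 2.0)]
DUAL_COEFS = [-1.0, 1.0]  # alpha_i * y_i for each support vector
BIAS = 0.0

label_near_first = predict((0.1, 0.1), SUPPORT_VECTORS, DUAL_COEFS, BIAS)
label_near_second = predict((1.9, 1.9), SUPPORT_VECTORS, DUAL_COEFS, BIAS)

if __name__ == "__main__":
    print(label_near_first, label_near_second)
```

Swapping `rbf_kernel` for another kernel function changes the shape of the decision boundary without changing the prediction code, which is the sense in which different kernels yield different learning machines.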
Figures 4, 5 and 6 present a quantitative analysis that shows the importance of SVM, the algorithm most widely used for the detection of voice disorders. SVM and its applications in medicine have been a research topic for many years. Researchers prefer SVM as a machine learning algorithm because of its high reported accuracy. Figures 4, 5 and 6 show that, with SVM as the common algorithm, different accuracies are obtained on the SVD [14], MEEI [15] and AVPD [15] databases as the features vary.
Figures 7, 8 and 9 present a quantitative analysis of the other algorithms in all selected databases. Apart from SVM, some algorithms also achieve good accuracies. For example, in graph 5 of SVD, Zulfiqar Ali et al. [22] used GMM and obtained an accuracy of 80.02%, with sensitivity of 71.0%–84.7% and specificity of 70%–76.6% (Table 1). A Gaussian mixture model (GMM) is a parametric probability density function represented as a weighted sum of Gaussian component densities. GMMs are commonly used as a parametric model of the probability distribution of continuous measurements or features in a biometric system, such as vocal-tract related spectral features in a speech recognition system. The GMM parameters are estimated from training data using the iterative Expectation-Maximization (EM) algorithm or Maximum A Posteriori (MAP) estimation from a well-trained prior model [67]. Moon et al. [28] used the Random Forest (RF) algorithm to detect voice disorders and obtained an accuracy of 84.87%; overall sensitivity and specificity were not reported. RF is an ensemble of classification or regression trees [68], each trained on a dataset of the same size as the training set, called a bootstrap. Once a tree is built, the records of the original dataset that are not contained in its bootstrap, the out-of-bag (OOB) samples, are used as its test set. The OOB estimate of the generalization error is the classification error rate over all these test sets. In 1996, Breiman [69] showed that, for bagged classifiers, the OOB error is as accurate as using a test set of the same size as the training set. The OOB estimate therefore removes the need for a separate test set. In SVD, the highest reported accuracy is 99% [20]. After SVM, GMM [22,24] and RT [29], convolutional neural networks have been used in the detection of voice disorders with good outcomes.
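The definition of a GMM as a weighted sum of Gaussian components can be written out directly. The sketch below is univariate for brevity, whereas the reviewed systems model multivariate features such as MFCC vectors; all weights, means and variances are hypothetical, and parameter estimation with EM or MAP is not shown. Classification by comparing the likelihood under one GMM per class is a common way such models are used, presented here as an assumption rather than as the method of any specific study.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a univariate Gaussian N(mean, variance) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def gmm_pdf(x, weights, means, variances):
    """GMM density: weighted sum of Gaussian component densities (weights sum to 1)."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def classify(x, class_models):
    """Pick the class whose GMM assigns the higher likelihood to x."""
    return max(class_models, key=lambda label: gmm_pdf(x, *class_models[label]))


# Hypothetical two-component models for a single scalar feature.
CLASS_MODELS = {
    "normal": ([0.6, 0.4], [0.0, 1.0], [1.0, 0.5]),
    "pathological": ([0.5, 0.5], [4.0, 6.0], [1.0, 2.0]),
}

label_low = classify(0.5, CLASS_MODELS)
label_high = classify(5.0, CLASS_MODELS)

if __name__ == "__main__":
    print(label_low, label_high)
```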
Convolutional neural networks (CNNs), a class of models that is influential in various computer vision tasks, are attracting interest across a range of domains, including radiology. A CNN is designed to learn spatial hierarchies of features automatically and adaptively through backpropagation, using numerous building blocks, including convolution layers, pooling layers and fully connected layers [70]. CNN is a deep learning method that is commonly used to solve difficult problems and that overcomes the limitations of traditional machine learning approaches [71]. In [25], CNN is used and the reported accuracy is 78%.
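A minimal forward-pass sketch may help make the three CNN building blocks concrete. It is written in plain Python under stated assumptions: the input signal, the kernel and the dense-layer weights are hand-picked hypothetical values, the functions `conv1d`, `max_pool1d` and `dense` are illustrative helpers rather than a library API, and training by backpropagation is not shown.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution layer (cross-correlation form, no padding)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size):
    """Non-overlapping max pooling: keep the largest value in each window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

def dense(values, weights, bias):
    """Fully connected layer with a single output unit."""
    return sum(v * w for v, w in zip(values, weights)) + bias

# Hypothetical input and parameters, chosen by hand for illustration.
signal = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
kernel = [1.0, 0.0, -1.0]

features = max_pool1d(relu(conv1d(signal, kernel)), size=2)
score = dense(features, weights=[0.5, 0.5, 0.5], bias=-1.0)
```

In a trained network the kernel and dense weights would be learned from data; here they are fixed so that each layer's effect on the signal can be traced by hand.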
In Figure 8 for MEEI, Naïve Bayes [54] has the lowest reported accuracy, 72.70%. Other than Naïve Bayes, algorithms such as HMM [39,43], LDA [40], GMM [22,41,49], RF [54], PNN [53], KNN [53] and ANN [29,49] all have accuracies between 90% and 100%, which is again considered a good reported outcome in terms of accuracy.
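The accuracy, sensitivity and specificity figures compared above all derive from the counts of a binary confusion matrix. The following sketch shows the arithmetic; the function name `detection_metrics` and the label vectors are hypothetical and do not come from any of the reviewed studies.

```python
def detection_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    # Sensitivity: share of pathological samples that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of normal samples that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return accuracy, sensitivity, specificity

# Hypothetical labels: 1 = pathological voice, 0 = normal voice.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
acc, sens, spec = detection_metrics(y_true, y_pred)
```

A study that reports only accuracy leaves the balance between missed pathologies and false alarms unknown, which is why the entries marked as not available limit comparison between classifiers.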
The SVD [14], MEEI [15] and AVPD [15] databases are the central focus of this meta-analysis. Table 2 lists the basic differences between the three databases, including their language, location, sampling frequency and the recorded text.
Comparative Characteristic | SVD | MEEI | AVPD |
Language | German | English | Arabic |
Location | Saarland University, Germany | Massachusetts Eye & Ear Infirmary (MEEI) voice and speech laboratory, USA | King Abdul-Aziz University Hospital, Saudi Arabia |
Sampling frequency | 50 kHz | 10 kHz, 25 kHz, 50 kHz | 48 kHz |
Text | Vowel /a/, Vowel /i/, Vowel /u/, sentence | Vowel /a/, Rainbow passage | Vowel /a/, Vowel /i/, Vowel /u/, Al-Fateha, Arabic digits, common words |
Perceptual severity plays a major role in pathology evaluation, yet it is not available in either the SVD or the MEEI database. In an automatic disorder detection system, a confusion matrix provides information on correctly and incorrectly classified subjects, and perceptual severity can help determine the cause of a misclassification. Automatic systems sometimes cannot differentiate between normal subjects and mildly severe pathological ones. This is why perceptual severity is also recorded in the AVPD, on a scale of 1–3, in which 3 is a highly severe voice disorder. In addition, the normal AVPD subjects are recorded after clinical assessment under the same conditions as those used for the pathological subjects [76]. In the MEEI database, the normal subjects are not clinically examined, although they have no reported history of voice problems [72]. No such information is provided in the SVD database. In the AVPD, unlike the MEEI database, all normal and pathological samples are recorded at a single sampling frequency. Deliyski et al. concluded that the accuracy and reliability of acoustic analysis are affected by the sampling frequency [73]. Moreover, the MEEI database contains one vowel, whereas three vowels are recorded in the AVPD. Although three vowels are also recorded in the SVD, they are recorded only once. In the AVPD, the three vowels are recorded with repetition, because some studies have suggested using more than one sample of the same vowel to model intraspeaker variability [74,75]. Another important feature of the AVPD is the total length of the recorded sample, which is 60 seconds. Every text recorded in the AVPD has the same duration for normal and disordered subjects, whereas in the MEEI database the recording times vary between normal and pathological subjects.
Furthermore, the duration of connected speech (the sentence) in the SVD database is only 2 seconds, which is not enough to build an automatic detection system based on connected speech. In addition, the SVD database cannot be used for a text-independent system. The connected speech in the AVPD (Al-Fateha) is 18 seconds long on average and comprises seven sentences; it is segmented into two parts so that text-independent systems can be developed [76].
The detailed quantitative analysis shows that only one unsupervised technique has been used, and only on SVD, by Panek et al. (2016) [27]; the reported accuracy is up to 99%, although sensitivity and specificity are not reported. No other researcher has used an unsupervised technique for voice pathology detection. In that study, validating PCA with k-means clustering and cross validation discards 10% of the signal (retaining 90% of the variance) from the initial feature vector and produces worse results than the analysis with the original 28 features. For the recordings of women, the analysis based on kPCA that included all analyzed pitches gave the most accurate indication of the patient's health condition. The analogous analysis of male recordings showed 100% accuracy for the 28-feature vectors, for the relevant number of principal components for each pitch, and for the kPCA result for each vowel. The k-means algorithm provides perfect separation of the data for male recordings, in contrast to the female analysis using 28 parameters and PCA. This problem was overcome by kPCA, a non-linear data transformation, which reached 99% classification accuracy. This indicates that linear separation of the data was not adequate. In addition, the k-means algorithm assigns each object to the closest cluster by distance [27]. It is therefore suggested that researchers focus more on unsupervised techniques and evaluate them on these databases.
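The distance-based assignment performed by k-means can be sketched in a few lines. This is only an illustration under stated assumptions: the two-dimensional points are hypothetical, squared Euclidean distance is assumed, the centroids are initialized from the first k points for determinism, and the PCA and kPCA steps of [27] are not reproduced.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and mean update."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster with the closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(1.0, 1.0), (5.0, 5.2), (1.2, 0.8), (0.9, 1.1), (5.1, 4.9), (4.8, 5.0)]
labels, centroids = kmeans(points, k=2)
```

Because no class labels are used during clustering, the resulting groups must afterwards be compared with the clinical diagnosis to obtain an accuracy figure.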
Voice disease can be caused by tissue diseases, systemic changes, mechanical stress, surface discomfort, changes in tissue, changes in neurology and muscle, and other factors [53]. Vocal pathology affects the agility, strength and shape of the vocal folds, resulting in abnormal noise and reduced acoustic tone. Until now, vocal problems have been evaluated through subjective and objective approaches [78]. The first category (subjective assessment) is the auditory and visual analysis of the vocal folds in a hospital [77]. The second category (objective evaluation) is based on automatic computer-based processing of acoustic signals to measure and identify the underlying vocal pathology, which may not even be detectable by a human [62]. This type of assessment is therefore inherently non-subjective. In practice, voices can now easily be captured and stored globally via cloud technologies using many intelligent devices. Several databases have been commonly used by researchers for the objective assessment of voice pathology: the Massachusetts Eye and Ear Infirmary (MEEI) database [15], the Saarbrücken Voice Database (SVD) [14], and the Arabic Voice Pathology Database (AVPD) [15]. These repositories also have some pitfalls. For example, certain databases are highly unevenly distributed between healthy and pathological groups, and datasets show troubling differences in the number of samples per type of pathology (e.g., several pathologies have fewer than three samples in the database). Some repositories do not give details on the severity of the disease or on pathology symptoms during phonation, so some samples may appear healthy despite being labelled pathological, and vice versa. In addition, some recordings are labelled with more than one type of pathology, and it is particularly challenging to combine or exclude samples in different languages [77].
This systematic review has several limitations. First, the number of included publications is low. Second, only articles published in English were selected, which can restrict the representation of work from non-English-speaking countries and limit the generalizability of the results. Third, the search strategy may well have missed some relevant studies, since studies published in conference proceedings were mostly excluded.
We discussed the strengths and weaknesses of SVD, MEEI and AVPD. After detailed analysis of the studies, including the techniques used and the outcome measurements, it was concluded that the Support Vector Machine (SVM) is the most commonly used algorithm for the detection of voice disorders. The amount of work done in this field shows that the clinical diagnosis of voice disorders through machine learning algorithms has been an area of interest for many researchers. It was also noticed that researchers focus on supervised rather than unsupervised techniques for the clinical diagnosis of voice disorders. The first identified gap is that researchers should focus more on unsupervised techniques in future, so that the results of both approaches can be compared to determine which provides the best outcomes. The second identified gap is that more work needs to be done on the AVPD database to evaluate its data with more feature extraction methods.
The authors have no conflict of interest in the conducted study.
[1] | M. Shen, M. W. Wei, L. H. Zhu, M. Z. Wang, Classification of encrypted traffic with second-order markov chains and application attribute bigrams, IEEE Trans. Inf. Forensics Secur., 12 (2017), 1830-1843. |
[2] | Y. N. Dong, J. J. Zhao, J. Jin, Novel feature selection and classification of internet video traffic based on a hierarchical scheme, Comput. Networks, 119 (2017), 102-111. |
[3] | S. E. Middleton, S. Modafferi, Scalable classification of QoS for real-time interactive applications from IP traffic measurements, Comput. Networks, 107 (2016), 121-132. |
[4] | P. Burnap, M. L. Williams, Cyber hate speech on twitter: An application of machine classification and statistical modeling for policy and decision making, Policy Internet, 7 (2015), 223-242. |
[5] | M. Korczyński, A. Duda, Markov chain fingerprinting to classify encrypted traffic, In IEEE INFOCOM 2014-IEEE Conference on Computer Communications, 2014, 781-789. |
[6] | P. Velan, M. Cermák, P. Čeleda, M. Drašar, A survey of methods for encrypted traffic classification and analysis, Int. J. Network Manage., 25 (2015), 355-374. |
[7] | Z. Cao, G. Xiong, Y. Zhao, Z. Z. Li, L. Guo, A survey on encrypted traffic classification, In International Conference on Applications and Techniques in Information Security, Springer, Berlin, Heidelberg, 2014, 73-81. |
[8] | W. B. Diab, S. Tohme, C. Bassil, Critical vpn security analysis and new approach for securing voip communications over vpn networks, In Proceedings of the 3rd ACM workshop on Wireless multimedia networking and performance modeling, 2007, 92-96. |
[9] | J. Khalife, A. Hajjar, J. Diaz-Verdejo, A multilevel taxonomy and requirements for an optimal traffic-classification model, Int. J. Network Manage., 24 (2014), 101-120. |
[10] | N. Namdev, S. Agrawal, S. Silkari, Recent advancement in machine learning based internet traffic classification, Proc. Comput. Sci., 60 (2015), 784-791. |
[11] | M. Finsterbusch, C. Richter, E. Rocha, J. A. Muller, K. Hanssgen, A survey of payload-based traffic classification approaches, IEEE Commun. Surv. Tutorials, 16 (2013), 1135-1156. |
[12] | K. S. Shim, J. H. Ham, Baraka D. Sija, M. S. Kim, Application traffic classification using payload size sequence signature, Int. J. Network Manage., 27 (2017), e1981. |
[13] | T. T. Nguyen, G. Armitage, A survey of techniques for internet traffic classification using machine learning, IEEE Commun. Surv. Tutorials, 10 (2008), 56-76. |
[14] | J. Erman, M. Arlitt, A. Mahanti, Traffic classification using clustering algorithms, In Proceedings of the 2006 SIGCOMM workshop on Mining network data, 2006, 281-286. |
[15] | R. Keralapura, A. Nucci, C. Chuah, A novel self-learning architecture for p2p traffic classification in high speed networks, Comput. Networks, 54 (2010), 1055-1068. |
[16] | J. Zhang, Y. Xiang, W. L. Zhou, Y. Wang, Unsupervised traffic classification using flow statistical properties and IP packet payload, J. Comput. Syst. Sci., 79 (2013), 573-585. |
[17] | Y. Wang, Y. Xiang, J. Zhang, W. L. Zhou, G. Y. Wei, L. T. Yang, Internet traffic classification using constrained clustering, IEEE Trans. Parallel Distrib. Syst., 25 (2014), 2932-2943. |
[18] | A. Este, F. Gringoli, L. Salgarelli, Support vector machines for TCP traffic classification, Comput. Networks, 53 (2009), 2476-2490. |
[19] | A. Finamore, M. Mellia, M. Meo, D. Rossi, Kiss: Stochastic packet inspection classifier for udp traffic, IEEE/ACM Trans. Networking, 18 (2010), 1505-1515. |
[20] | L. Zhenxiang, H. Mingbo, L. Song, W. Xin, Research of P2P traffic comprehensive identification method, In 2011 International Conference on Network Computing and Information Security, 2011, 307-310. |
[21] | D. J. Arndt, A. Nur Zincir-Heywood, A comparison of three machine learning techniques for encrypted network traffic analysis, In 2011 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 2011, 107-114. |
[22] | R. Alshammari, A. Nur Zincir-Heywood, Can encrypted traffic be identified without port numbers, IP addresses and payload inspection?, Comput. Networks, 55 (2011), 1326-1350. |
[23] | T. T. Nguyen, G. Armitage, P. Branch, S. Zander, Timely and continuous machine-learning-based classification for interactive IP traffic, IEEE/ACM Trans. Networking, 20 (2012), 1880-1894. |
[24] | W. Ye, K. Cho, Hybrid P2P traffic classification with heuristic rules and machine learning, Soft Comput., 18 (2014), 1815-1827. |
[25] | L. Peng, B. Yang, Y. H. Chen, Effective packet number for early stage internet traffic identification, Neurocomputing, 156 (2015), 252-267. |
[26] | J. Zhang, C. Chen, Y. Xiang, W. L. Zhou, Semi-supervised and compound classification of network traffic, In 2012 32nd International Conference on Distributed Computing Systems Workshops, 2012, 617-621. |
[27] | J. Yuan, Z. Li, R. Yuan, Information entropy based clustering method for unsupervised internet traffic classification, In 2008 IEEE International Conference on Communications, 2008, 1588-1592. |
[28] | M. Zhang, H. L. Zhang, B. Zhang, G. Lu, Encrypted traffic classification based on an improved clustering algorithm, In International Conference on Trustworthy Computing and Services, Springer, Berlin, Heidelberg, 2012, 124-131. |
[29] | V. Paxson, Empirically derived analytic models of wide-area TCP connections, IEEE/ACM Trans. Networking, 2 (1994), 316-336. |
[30] | V. Paxson, S. Floyd, Wide area traffic: The failure of Poisson modeling, IEEE/ACM Trans. Networking, 3 (1995), 226-244. |
[31] | A. McGregor, M. Hall, P. Lorier, J. Brunskill, Flow clustering using machine learning techniques, In International workshop on passive and active network measurement, Springer, Berlin, Heidelberg, 2004, 205-214. |
[32] | T. Auld, A. W. Moore, S. F. Gull, Bayesian neural networks for internet traffic classification, IEEE Trans. Neural Networks, 18 (2007), 223-239. |
[33] | J. Erman, A. Mahanti, M. Arlitt, I. Cohen, C. Williamson, Offline/realtime traffic classification using semi-supervised learning, Perform. Eval., 64 (2007), 1194-1213. |
[34] | W. Li, M. Canini, A. W. Moore, R. Bolla, Efficient application identification and the temporal and spatial stability of classification schema, Comput. Networks, 53 (2009), 790-809. |
[35] | C. Bacquet, K. Gumus, D. Tizer, A. Nur Zincir-Heywood, M. I. Heywood, A comparison of unsupervised learning techniques for encrypted traffic identification, J. Inf. Assur. Secur., 5 (2010), 464-472. |
[36] | D. Arndt, How to: Calculating flow statistics using netmate, 2011. Available from: http://dan.arndt.ca/nims/calculating-flow-statistics-using-netmate/. |
[37] | J. Zhang, C. Chen, Y. Xiang, W. L. Zhou, Y. Xiang, Internet traffic classification by aggregating correlated naive bayes predictions, IEEE Trans. Inf. Forensics Secur., 8 (2013), 5-15. |
[38] | N. F. Huang, G. Y. Jai, H. C. Chao, Y. J. Tzang, H. Y. Chang, Application traffic classification at the early stage by characterizing application rounds, Inf. Sci., 232 (2013), 130-142. |
[39] | Y. J. Fu, H. Xiong, X. Lu, J. Yang, C. Chen, Service usage classification with encrypted internet traffic in mobile messaging apps, IEEE Trans. Mobile Comput., 15 (2016), 2851-2864. |
[40] | M. Conti, L. V. Mancini, R. Spolaor, N. V. Verde, Analyzing android encrypted network traffic to identify user actions, IEEE Trans. Inf. Forensics Secur., 11 (2016), 114-125. |
[41] | Z. Liu, R. Wang, D. Tang, Extending labeled mobile network traffic data by three levels traffic identification fusion, Future Gener. Comput. Syst., 88 (2018), 453-466. |
[42] | G. Aceto, D. Ciuonzo, A. Montieri, A. Pescap, Multi-classification approaches for classifying mobile app traffic, J. Network Comput. Appl., 103 (2018), 131-145. |
[43] | K. L. Dias, M. A. Pongelupe, W. M. Caminhas, L. de Errico, An innovative approach for real-time network traffic classification, Comput. Networks, 158 (2019), 143-157. |
[44] | A. J. Pinheiro, J. de M. Bezerra, C. A. Burgardt, D. R. Campelo, Identifying IoT devices and events based on packet length from encrypted traffic, Comput. Commun., 144 (2019), 8-17. |
[45] | Y. M. Choi, On the accuracy of signature-based traffic identification technique in IP networks, In 2007 2nd IEEE/IFIP International Workshop on Broadband Convergence Networks, 2007, 1-12. |
[46] | B. C. Park, Y. J. Won, M. S. Kim, J. W. Hong, Towards automated application signature generation for traffic identification, In NOMS 2008-2008 IEEE Network Operations and Management Symposium, 2008, 160-167. |
[47] | T. Okabe, T. Kitamura, T. Shizuno, Statistical traffic identification method based on flow-level behavior for fair VoIP service, In 1st IEEE Workshop on VoIP Management and Security, 2006, 35-40. |
[48] | D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, P. Tofanelli, Revealing skype traffic: When randomness plays with you, SIGCOMM '07: Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications, 2007, 37-48. |
[49] | R. Alshammari, A. Nur Zincir-Heywood, An investigation on the identification of VoIP traffic: Case study on Gtalk and Skype, In 2010 International Conference on Network and Service Management, 2010, 310-313. |
[50] | H. A. H. Ibrahim, S. M. Nor, A. Mohammed, A. B. Mohammed, Taxonomy of machine learning algorithms to classify real time interactive applications, Int. J. Comput. Networks Wireless Commun., 2 (2012), 69-73. |
[51] | D. Adami, C. Callegari, S. Giordano, M. Pagano, T. Pepe, Skype-Hunter: A real-time system for the detection and classification of Skype traffic, Int. J. Commun. Syst., 25 (2012), 386-403. |
[52] | L. A. Khan, M. S. Baig, A. M. Youssef, Speaker recognition from encrypted VoIP communications, Digital Invest., 7 (2010), 65-73. |
[53] | T. Yildirim, P. J. Radcliffe, VoIP traffic classification in IPSec tunnels, In 2010 International Conference on Electronics and Information Engineering, 2010, 151-157. |
[54] | B. Li, M. Ma, Z. G. Jin, A VoIP traffic identification scheme based on host and flow behavior analysis, J. Network Syst. Manage., 19 (2011), 111-129. |
[55] | R. Alshammari, A. Nur Zincir-Heywood, Identification of VoIP encrypted traffic using a machine learning approach, J. King Saud Univ. Comput. Inf. Sci., 27 (2015), 77-92. |
[56] | T. Qin, L. Wang, Z. L. Liu, X. H. Guan, Robust application identification methods for P2P and VoIP traffic classification in backbone network, Knowl. Based Syst., 82 (2015), 152-162. |
[57] | M. M. Rathore, A. Ahmad, A. Paul, S. Rho, Exploiting encrypted and tunneled multimedia calls in high-speed big data environment, Multimedia Tools and Appl., 77 (2018), 4959-4984. |
[58] | G. Draper-Gil, A. H. Lashkari, M. S. Mamun, A. A. Ghorbani, Characterization of encrypted and vpn traffic using time-related features, In Proceedings of the 2nd international conference on information systems security and privacy (ICISSP), 2016, 407-414. |
[59] | H. L. Arash, G. Draper-Gil, M. S. Mamun, Ali A. Ghorbani, CICFlowMeter: Network traffic flow generator and analyser, Available from: https://www.unb.ca/cic/research/applications.html, 2017. |
[60] | J. R. Quinlan, C4.5: Program for machine learning, San Mateo, California, Morgan Kaufmann Publishers, 1993. |
[61] | W. X. Sun, J. Chen, J. Q. Li, Decision tree and PCA-based fault diagnosis of rotating machinery, Mech. Syst. Signal Process., 21 (2007), 1300-1317. |
[62] | L. Breiman, Random forests, Mach. Learn., 45 (2001), 5-32. |
[63] | Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci., 55 (1997), 23-27. |
[64] | L. Breiman, Bagging predictors, Mach. Learn., 24 (1996), 123-140. |
No | Author/Year | ML technique/Classifier | Feature Selection/Filter | Overall Accuracy | Overall Sensitivity | Overall Specificity |
SVD Dataset | ||||||
1. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 99.53% | 91.22% | 94.27% |
2. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 99.68% | NA | NA |
3. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 90.97% | NA | NA |
4. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 80.02% | 71.0%–84.7% | 70%–76.6% |
5. | Fonseca et al./2020 [23] | SVM | SE, ZCRs, SH | 95% | NA | NA |
6. | Garcia et al./2019 [24] | Gaussian Mixture Regression | GBR scale | NA | NA | NA |
7. | Guedes et al./2019 [25] | DLN (LSTM; CNN) | PCA | 80; 78;66; 67;63; 66 | 80;78;66; 67;63; 66 | 80; 80;67; 67;69; 71 |
8. | Hammami et al./2020 [26] | SVM | HOS; DWT | 99.3%; 93.1% | 96.4%; 92.8% | 99.4%; 93.3% |
9. | Panek et al./2016 [27] | K-Means Clustering | PCA | 100% | NA | NA |
10. | Moon et al./2018 [28] | LR DT RF SVM DNN | HOS and DEO | 82.77%;80.25%;84.87%;86.13%;87.4% | NA | NA |
11. | Ezzine et al./2018 [29] | ANN SVM | glottal flow features | 99.27%;98.43% | NA | NA |
12. | Markaki et al./2009 [30] | RBF Kernel with SVM | Mutual information b/w subjective voice quality and computed features | 94.1% | NA | NA |
13. | Markaki et al./2011 [31] | SVM | Mutual information b/w voice classes (normophonic/dysphonic) | 94.1% | NA | NA |
14. | Miramont et al./2020 [32] | SVM | CPP, SDNPCV, NPCV, HNR, Noise Level, D2 (10, 25), Shimmer, and K2 (6, 25). | 86.53% | NA | NA |
15. | Muhammad et al. /2017 [33] | SVM | Glottal source excitation | 93.2 ± 0.01 | 94.3 | 92.3 |
16. | Shia et al. /2017 [34] | FFNN | DWT | 93.3% | NA | NA |
17. | Kadiri et al./2020 [35] | SVM | Glottal source features and MFCC | 76.19% | NA | NA |
18. | Zhang et al./2020 [36] | DNN | Pitch extraction and line spectrum pair | NA | NA | NA |
19. | Teixeira et al./2018 [37] | SVM | Jitter, shimmer and HNR, MFCC | 71% | NA | NA |
20. | Teixeira et al./2017 [38] | MLP-ANN | Jitter, shimmer and HNR | 100%(female); 90%(male) | NA | NA |
MEEI Dataset | ||||||
1. | A. A. Dibazar et al. /2002 [39] | HMM | mel frequency filter | 99.4% with E = 8% | NA | NA |
2. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 99.54% | 99.96% | 99.96% |
3. | Akbari et al./2014 [40] | MC-LDA; ML-NN | wavelet packet- based features | 96.67% | NA | NA |
4. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 88.21% | 88.90% | 89.21% |
5. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 99.80% | NA | NA |
6. | Zulfiqar Ali et al./2016 [41] | GMM | Estimation of Auditory Spectrum and Cepstral Coefficients | 99.56% | NA | NA |
7. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 94.6% | 94.7%–99.1% | 50.9%–94.5% |
8. | Amami et al./2017 [42] | SVM | DBSCAN and MFCCs | 98% | NA | NA |
9. | Londono et al./2010 [43] | HMM | MFCCs | 82.1472.2 | 81.1373.6 | 83.3373.4 |
10. | Arjmandi et al./2011 [44] | 1) QD classifier 2) NM classifier 3) Parzen classifier; KNN; SVM; ML NN | MDVP parameters | 78.9%;87.20%;85.50%;88.86%;89.29%;88.7% | 88%;70.9%;73.5%;78.30%;82.25%;83% | 66%;97%;93.85%;96.17%;94.3%;85.1% |
11. | Barreira et al./2020 [45] | Gaussian Naïve Bayes | HASS-KLD, H-KLD, MFCCs, Sample skewness | 99.55% | 100% | 98% |
12. | Francis et al./2016 [46] | ANN | MMTLS | 96.48% | NA | NA |
13. | Cordeiro et al./2017 [47] | SVM, GMM, DA | MFCC, LSF | 98.7% | NA | NA |
14. | Cordeiro et al./2018 [48] | SVM | RPPC | 94.2% | NA | NA |
15. | Fang et al. /2019 [49] | DNN SVM GMM | MFCCs | 99.14 ± 1.9%;98.28 ± 2.3%;98.26 ±1.8% | NA | NA |
16. | Muhammad et al. /2013 [14] | SVM | MPEG-7 low level audio feature | 99.994% ±0.011 | 1 | 0.999 |
17. | Muhammad et al. /2013 [50] | SVM | VTAI Feature Extraction | 99.02% ± 0.01 | 99.8% ± 0.02 | 97.5% ± 0.04 |
18. | Ghasemzadeh et al. 2015 [51] | GA and LDA with SVM | Nonlinear features | 98.4% | 99.3 ± 1.2 | 94 ± 5.7 |
19. | Llorente et al./2009 [52] | MLP-NN | MFCCs | 96 ± 1.3 | 0.99 | 0.82 |
20. | Hariharan et al./2013 [53] | LS-SVM; kNN PNN; CART | Wavelet packet transform based energy/entropy | 92.24 ± 0.24;89.82 ± 0.28;89.54 ± 1.34;86.97 ± 0.20 | 93.02 ± 0.33;91.96 ± 0.56;90.62 ± 2.46;87.71 ± 0.28 | 91.49 ± 0.22;87.89 ± 0.43;88.59 ± 0.47;86.27 ± 0.42 |
21. | Ezzine et al./2018 [29] | ANN SVM | glottal flow features | 93.6%, 93.57% | NA | NA |
22. | Mahmood /2019 [54] | Naïve Bayes ANN SVM RF | MFCC | 72.70%, 93.72%, 99.78%, 99.91% | NA | NA |
23. | Mekyska et al./2015 [55] | SVM RF | spectra, inferior colliculus coefficients, bicepstrum, approximate entropy, empirical mode decomposition | 99.9 ± 0.4; 100.0 ± 0.0 | 99.8 ± 0.5; 100.0 ± 0.0 | 99.9 ± 0.7; 100.0 ± 0.0 |
24. | Miramont et al./2020 [32] | SVM | CPP, SDNPCV, NPCV, HNR, Noise Level, D2 (10, 25), Shimmer, and K2 (6, 25). | 87.06% | NA | NA |
25. | Muhammad et al. /2014 [56] | SVM | MPEG-7 feature | 99.994% | 1 | 0.999 |
26. | Muhammad et al. /2017 [33] | SVM | Glottal source excitation | 99.4 ± 0.02 | 99.4% | 98.9% |
27. | Nayak et al./2005 [57] | ANN | DWT coefficients as a feature vector | 80–85% | NA | NA |
28. | Henriquez et al./2009 [58] | NN | first- and second- order Rényi entropies, correlation entropy, correlation dimension | 99.69% | NA | NA |
29. | Salehi et al./2015 [59] | SVM | Parametric wavelet by adaptation wavelet transform | 98.30% | NA | NA |
30. | Lechon et al./2006 [60] | NN | MFCC | 89.6 ± 2.49% | NA | NA |
31. | Travieso et al./2017 [61] | HMM; Linear SVM; Kernal SVM | Nonlinear Dynamic Parameterization | 93.55 ± 3.24;96.73 ± 3.42;99.87 ± 0.39 | NA | NA |
AVPD Dataset | ||||||
1. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 96.02% | 91.22% | 94.27% |
2. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 72.53% | NA | NA |
3. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 91.16% | NA | NA |
4. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 83.6% | 67.9%–78.4% | 75.9%–89.74% |
5. | Mesallam et al./2017 [62] | SVM; GMM; VQ; HMM | MFCC | 93.6%; 91.6%; 90.3%; 88.9% | NA | NA |
6. | Muhammad et al./2017 [33] | SVM | Glottal source excitation | 91.5 ± 0.09 | 92.2% | 91.1% |
ANN = Artificial Neural Network, CART = Classification and Regression Tree, CNN = Convolutional Neural Network, DA = Discriminant Analysis, DBSCAN = Density-Based Spatial Clustering of Applications with Noise, DEO = Differential Energy Operator, DLN = Deep Learning Network, DNN = Deep Neural Network, DT = Decision Tree, DWT = Discrete Wavelet Transform, FFNN = Feed Forward Neural Network, GA = Genetic Algorithm, GMM = Gaussian Mixture Model, HASS-KLD = Higher Amplitude Suppression Spectrum Kullback–Leibler Divergence, H-KLD = Histogram Kullback–Leibler Divergence, HMM = Hidden Markov Model, HNR = Harmonic to Noise Ratio, HOS = High Order Statistics features, KNN = K-Nearest Neighbor Classifier, LDA = Linear Discriminant Analysis, LR = Logistic Regression, LSF = Line Spectral Frequencies, LSTM = Long Short-Term Memory, MC-LDA = Multi-Class Linear Discriminant Analysis, MDVP = Multidimensional Voice Program parameters, MFCCs = Mel-Frequency Cepstral Coefficients, ML-NN = Multilayer Neural Network, MMTLS = Modified Mellin Transform of Log Spectrum, NA = Not Available, NM = Nearest Mean Classifier, NN = Neural Network, PCA = Principal Component Analysis, PNN = Probabilistic Neural Network, QD = Quadratic Discriminant Classifier, RF = Random Forest, RPPC = Relative Power of the Periodic Component, SE = Signal Energy, SH = Signal Entropy, SVM = Support Vector Machine, VQ = Vector Quantizer, VTAI = Vocal Tract Area Irregularity, ZCRs = Zero-Crossing Rates.
Comparative Characteristic | SVD | MEEI | AVPD |
Language | German | English | Arabic |
Location | Saarland University, Germany | Massachusetts Eye & Ear Infirmary (MEEI) Voice and Speech Laboratory, USA | King Abdul-Aziz University Hospital, Saudi Arabia |
Sampling frequency | 50 kHz | 10 kHz, 25 kHz, 50 kHz | 48 kHz |
Text | Vowel /a/, vowel /i/, vowel /u/, sentence | Vowel /a/, Rainbow passage | Vowel /a/, vowel /i/, vowel /u/, Al-Fateha, Arabic digits, common words |
No | Author/Year | ML technique/Classifier | Feature Selection/Filter | Overall Accuracy | Overall Sensitivity | Overall Specificity |
SVD Dataset | ||||||
1.. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 99.53% | 91.22% | 94.27% |
2. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 99.68% | NA | NA |
3. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 90.97% | NA | NA |
4. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 80.02% | 71.0%–84.7% | 70%–76.6% |
5. | Fonseca et al./2020 [23] | SVM | SE, ZCRs, SH | 95% | NA | NA |
6. | Garcia et al./2019 [24] | Gaussian Mixture Regression | GBR scale | NA | NA | NA |
7. | Guedes et al./2019 [25] | DLN (LSTM; CNN) | PCA | 80; 78;66; 67;63; 66 | 80;78;66; 67;63; 66 | 80; 80;67; 67;69; 71 |
8. | Hammami et al./2020 [26] | SVM | HOS; DWT | 99.3%; 93.1% | 96.4%; 92.8% | 99.4%; 93.3% |
9. | Panek et al./2016 [27] | K-Means Clustering | PCA | 100% | NA | NA |
10. | Moon et al./2018 [28] | LR DT RF SVM DNN | HOS and DEO | 82.77%;80.25%;84.87%;86.13%;87.4% | NA | NA |
11. | Ezzine et al./2018 [29] | ANN SVM | glottal flow features | 99.27%;98.43% | NA | NA |
12. | Markaki et al./2009 [30] | RBF Kernal with SVM | Mutual Information b/w subjective voice quality and computed features | 94.1% | NA | NA |
13. | Markaki et al./2011 [31] | SVM | mutual information p/w voice classes (normophonic/dysphonic) | 94.1% | NA | NA |
14. | Miramont et al./2020 [32] | SVM | CPP, SDNPCV, NPCV, HNR, Noise Level, D2 (10, 25), Shimmer, and K2 (6, 25). | 86.53% | NA | NA |
15. | Muhammad etal. /2017 [33] | SVM | Glottal source excitation | 93.2 ± 0.01 | 94.3 | 92.3 |
16. | Shia et al. /2017 [34] | FFNN | DWT | 93.3% | NA | NA |
17. | Kadiri et al./2020 [35] | SVM | Glottal source features and MFCC | 76.19% | NA | NA |
18. | Zhang et al./2020 [36] | DNN | Pitch extraction and line spectrum pair | NA | NA | NA |
19. | Teixeira et al./2018 [37] | SVM | Jitter, shimmer and HNR, MFCC | 71% | NA | NA |
20. | Teixeira et al./2017 [38] | MLP-ANN | Jitter, shimmer and HNR | 100%(female); 90%(male) | NA | NA |
MEEI Dataset | ||||||
1. | A. A. Dibazar et al. /2002 [39] | HMM | me1 frequency filter | 99.4% with E = 8% | NA | NA |
2. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 99.54% | 99.96% | 99.96% |
3. | Akbari et al./2014 [40] | MC-LDA; ML-NN | wavelet packet- based features | 96.67% | NA | NA |
4. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 88.21% | 88.90% | 89.21% |
5. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlationfunctions | 99.80% | NA | NA |
6. | Zulfiqar Ali et al./2016 [41] | GMM | Estimation of Auditory Spectrum and Cepstral Coefficients | 99.56% | NA | NA |
7. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 94.6% | 94.7%–99.1% | 50.9%–94.5% |
8. | Amami et al./2017 [42] | SVM | DBSCAN and MFCCs | 98% | NA | NA |
9. | Londono et al./2010 [43] | HMM | MFCCs | 82.1472.2 | 81.1373.6 | 83.3373.4 |
10. | Arjmandi et al./2011 [44] | 1)QD classfier 2)NM classifier 3)Parzen Classifier; KNN; SVM; ML NN | MDVP parameters | 78.9%;87.20%;85.50%;88.86%;89.29%;88.7% | 88%;70.9%;73.5%;78.30%;82.25%;83% | 66%;97%;93.85%;96.17%;94.3%;85.1% |
11. | Barreira et al./2020 [45] | Gaussian Naïve Bayes | HASS-KLD, H-KLD, MFCCs, Sample skewness | 99.55% | 100% | 98% |
12. | Francis et al./2016 [46] | ANN | MMTLS | 96.48% | NA | NA |
13. | Cordeiro et al./2017 [47] | SVM, GMM, DA | MFCC, LSF | 98.7% | NA | NA |
14. | Cordeiro et al./2018 [48] | SVM | RPPC | 94.2% | NA | NA |
15. | Fang et al. /2019 [49] | DNN SVM GMM | MFCCs | 99.14 ± 1.9%;98.28 ± 2.3%;98.26 ±1.8% | NA | NA |
16. | Muhammad et al. /2013 [14] | SVM | MPEG-7 low level audio feature | 99.994% ±0.011 | 1 | 0.999 |
17. | Muhammad et al. /2013 [50] | SVM | VTAI Feature Extraction | 99.02% ± 0.01 | 99.8% ± 0.02 | 97.5% ± 0.04 |
18. | Ghasemzadeh et al. 2015 [51] | GA and LDA with SVM | Nonlinear features | 98.4% | 99.3 ± 1.2 | 94 ± 5.7 |
19. | Llorente et al./2009 [52] | MLP-NN | MFCCs | 96 ± 1.3 | 0.99 | 0.82 |
20. | Hariharan et al./2013 [53] | LS-SVM; kNN PNN; CART | Wavelet packet transform based energy/entropy | 92.24 ± 0.24;89.82 ± 0.28;89.54 ± 1.34;86.97 ± 0.20 | 93.02 ± 0.33;91.96 ± 0.56;90.62 ± 2.46;87.71 ± 0.28 | 91.49 ± 0.22;87.89 ± 0.43;88.59 ± 0.47;86.27 ± 0.42 |
21. | Ezzine et al./2018 [29] | ANN SVM | glottal flow features | 93.6%, 93.57% | NA | NA |
22. | Mahmood /2019 [54] | Naïve Bayes ANN SVM RF | MFCC | 72.70%, 93.72%, 99.78%, 99.91% | NA | NA |
23. | Mekyska et al./2015 [55] | SVM RF | spectra, inferior colliculus coefficients, bicepstrum, approximate entropy, empirical mode decomposition | 99.9 ± 0.4100.0 ± 0.0 | 99.8 ± 0.5100.0 ± 0.0 | 99.9 ± 0.7100.0 ± 0.0 |
24. | Miramont et al./2020 [32] | SVM | CPP, SDNPCV, NPCV, HNR, Noise Level, D2 (10, 25), Shimmer, and K2 (6, 25). | 87.06% | NA | NA |
25. | Muhammad et al. /2014 [56] | SVM | MPEG-7 feature | 99.994% | 1 | 0.999 |
26. | Muhammad et al. /2017 [33] | SVM | Glottal source excitation | 99.4 ± 0.02 | 99.4% | 98.9% |
27. | Nayak et al./2005 [57] | ANN | DWT coefficients as a feature vector | 80–85% | NA | NA |
28. | Henriquez et al./2009 [58] | NN | first- and second- order Rényi entropies, correlation entropy, correlation dimension | 99.69% | NA | NA |
29. | Salehi et al./2015 [59] | SVM | Parametric wavelet by adaptation wavelet transform | 98.30% | NA | NA |
30 | Lechon et al./2006 [60] | MFCC | NN | 89.6 ± 2.49% | NA | NA |
31. | Travieso et al./2017 [61] | HMM; Linear SVM; Kernal SVM | Nonlinear Dynamic Parameterization | 93.55 ± 3.24;96.73 ± 3.42;99.87 ± 0.39 | NA | NA |
AVPD Dataset | ||||||
1. | A. Al-Nasheri et al. /2017 [19] | SVM | peak, lag, entropy/eight band pass filter | 96.02% | 91.22% | 94.27% |
2. | Al-Nasheri et al./2017 [20] | SVM | MDVP parameters | 72.53% | NA | NA |
3. | Al-Nasheri et al./2017 [21] | SVM | Eight frequency bands using correlation functions | 91.16% | NA | NA |
4. | Zulfiqar Ali et al./2017 [22] | GMM | MFCCs | 83.6% | 67.9%–7.8.4% | 75.9%–89.74% |
5. | Mesallam et al./2017 [62] | SVM GMM VQ HMM | MFCC | 93.6%, 91.6%, 90.3%, 88.9% | NA | NA |
6. | Muhammad et al. /2017 [33] | SVM | Glottal source excitation | 91.5 ± 0.09 | 92.2% | 91.1% |
ANN = Artificial Neural Network, CART= Classification and Regression Tree, CNN = Convolutional Neural Network, DA = Discriminant analysis, DBSCAN = Density Based Spatial Clustering of Applications with Noise. DEO = Differential Energy Operator, DLN = Deep Learning Network, DNN = Deep Neural Network, DT = Decision Tree, DWT = Discrete Wavelet Transform, FFNN = Feed Forward Neural Network, GA = Genetic Algorithm, GMM = Gaussian Mixture Model, HASS-KLD = Higher amplitude suppression spectrum Kullback–Leibler divergence, H-KLD = Histogram Kullback–Leibler Divergence, HMM = Hidden Markov Model, HNR = Harmonic to Noise Ratio, HOS = High Order Statistics features, KNN = K-Nearest Neighbor Classifier, LDA = Linear Discriminant analysis, LR = Logistic Regression, LSF = Line spectral frequencies, LTSM = Long Short Term Memory, MC-LDA = Multi-Class Linear Discriminant Analysis, MDVP = Multidimensional Voice Program parameters, MFCCs = Mel-frequency cepstral coefficients, ML-NN = Multilayer Neural Network, MMTLS = Modified Mellin Transform of Log Spectrum, NA = Not Available, NM = Nearest Mean Classifier, NN = Neural Network, PCA= Principal Component Analysis, PNN = Probabilistic Neural Network, QD = Quadratic Discriminant Classifier, RF = Random Forest, RPPC = Relative Power of the Periodic Component, SE = Signal Energy, SH = Signal Entropy, SVM = Support Vector Machine, VQ = Vector Quantizer, VTAI = Vocal Tract Area Irregularity, ZCRs = Zero-Crossing Rates. |
Comparative Characteristic | SVD | MEEI | AVPD |
language | German | English | Arabic |
Location | Saarland University, Germany | Massachusetts Eye & Ear Infirmary (MEEI)voice and speech laboratory, USA | King Abdul-Aziz University Hospital, Saudi Arabia |
Sampling frequency | 50 KHz | 10 KHz 25 KHz 50 KHz |
48 KHz |
Text | Vowel /a/ -Vowel /i/ -Vowel /u/ -sentence |
-Vowel /a/ -Rainbow passage |
-Vowel /a/ -Vowel /i/ -Vowel /u/ -Al-Fateha -Arabic digits -Common words |