Polyphonic sound event localization and detection based on Multiple Attention Fusion ResNet

Shouming Zhang; Yaling Zhang; Yixiao Liao; Kunkun Pang; Zhiyong Wan; Songbin Zhou; Shouming Zhang; Yaling Zhang; Yixiao Liao; Kunkun Pang; Zhiyong Wan; Songbin Zhou

doi:10.3934/mbe.2024089

Mathematical Biosciences and Engineering

2024, Volume 21, Issue 2: 2004-2023. doi: 10.3934/mbe.2024089

Previous Article Next Article

Research article Special Issues

Polyphonic sound event localization and detection based on Multiple Attention Fusion ResNet

1.
Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
2.
Institute of Intelligent Manufacturing, Guangdong Academy of Science, Guangdong Key Laboratory of Modern Control Technology, Guangzhou 510030, China

Academic Editor: Guenther Palm

Received: 13 August 2023 Revised: 22 December 2023 Accepted: 29 December 2023 Published: 08 January 2024

Sound event localization and detection have been applied in various fields. Due to the polyphony and noise interference, it becomes challenging to accurately predict the sound event and their occurrence locations. Aiming at this problem, we propose a Multiple Attention Fusion ResNet, which uses ResNet34 as the base network. Given the situation that the sound duration is not fixed, and there are multiple polyphonic and noise, we introduce the Gated Channel Transform to enhance the residual basic block. This enables the model to capture contextual information, evaluate channel weights, and reduce the interference caused by polyphony and noise. Furthermore, Split Attention is introduced to the model for capturing cross-channel information, which enhances the ability to distinguish the polyphony. Finally, Coordinate Attention is introduced to the model so that the model can focus on both the channel information and spatial location information of sound events. Experiments were conducted on two different datasets, TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021. The results demonstrate that the proposed model significantly outperforms state-of-the-art methods under multiple polyphonic and noise-directional interference environments and it achieves competitive performance under a single polyphonic environment.
- sound event localization and detection,
- attention,
- Gated Channel Transformation,
- deep learning
Citation: Shouming Zhang, Yaling Zhang, Yixiao Liao, Kunkun Pang, Zhiyong Wan, Songbin Zhou. Polyphonic sound event localization and detection based on Multiple Attention Fusion ResNet[J]. Mathematical Biosciences and Engineering, 2024, 21(2): 2004-2023. doi: 10.3934/mbe.2024089

Related Papers:

Abstract

Sound event localization and detection have been applied in various fields. Due to the polyphony and noise interference, it becomes challenging to accurately predict the sound event and their occurrence locations. Aiming at this problem, we propose a Multiple Attention Fusion ResNet, which uses ResNet34 as the base network. Given the situation that the sound duration is not fixed, and there are multiple polyphonic and noise, we introduce the Gated Channel Transform to enhance the residual basic block. This enables the model to capture contextual information, evaluate channel weights, and reduce the interference caused by polyphony and noise. Furthermore, Split Attention is introduced to the model for capturing cross-channel information, which enhances the ability to distinguish the polyphony. Finally, Coordinate Attention is introduced to the model so that the model can focus on both the channel information and spatial location information of sound events. Experiments were conducted on two different datasets, TAU-NIGENS Spatial Sound Events 2020, and TAU-NIGENS Spatial Sound Events 2021. The results demonstrate that the proposed model significantly outperforms state-of-the-art methods under multiple polyphonic and noise-directional interference environments and it achieves competitive performance under a single polyphonic environment.

References

[1]	T. K. Chan, C. S. Chin, A comprehensive review of polyphonic sound event detection, IEEE Access, 8 (2020), 103339–103373. https://doi.org/10.1109/ACCESS.2020.2999388 doi: 10.1109/ACCESS.2020.2999388
[2]	A. Mesaros, T. Heittola, T. Virtanen, M. D. Plumbley, Sound event detection: A tutorial, IEEE Signal Process Mag., 38 (2021), 67–83. https://doi.org/10.1109/MSP.2021.3090678 doi: 10.1109/MSP.2021.3090678
[3]	J. P. Bello, C. Silva, O. Nov, R. L. Dubois, A. Arora, J. Salamon, et al., Sonyc: A system for monitoring, analyzing, and mitigating urban noise pollution, Commun. ACM, 62 (2019), 68–77. https://doi.org/10.1145/3224204 doi: 10.1145/3224204
[4]	T. Hu, C. Zhang, B. Cheng, X. P. Wu, Research on abnormal audio event detection based on convolutional neural network (in Chinese), J. Signal Process., 34 (2018), 357–367. https://doi.org/10.16798/j.issn.1003-0530.2018.03.013 doi: 10.16798/j.issn.1003-0530.2018.03.013
[5]	D. Stowell, M. Wood, Y. Stylianou, H. Glotin, Bird detection in audio: A survey and a challenge, in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), (2016), 1–6. https://doi.org/10.1109/MLSP.2016.7738875
[6]	K. K. Lell, A. Pja, Automatic COVID-19 disease diagnosis using 1D convolutional neural network and augmentation with human respiratory sound based on parameters: Cough, breath, and voice, AIMS Public Health, 8 (2021), 240. https://doi.org/10.3934/publichealth.2021019
[7]	K. K. Lella, A. Pja, Automatic diagnosis of COVID-19 disease using deep convolutional neural network with multi-feature channel from respiratory sound data: Cough, voice, and breath, Alexandria Eng. J., 61 (2022), 1319–1334. https://doi.org/10.1016/j.aej.2021.06.024
[8]	G. Chen, M. Liu, J. Chen, Frequency-temporal-logic-based bearing fault diagnosis and fault interpretation using Bayesian optimization with Bayesian neural network, Mech. Syst. Signal Process., 145 (2020), 1–21. https://doi.org/10.1016/j.ymssp.2020.106951 doi: 10.1016/j.ymssp.2020.106951
[9]	S. Adavanne, A. Politis, T. Virtanen, A multi-room reverberant dataset for sound event localization and detection, preprint, arXiv: 1905.08546.
[10]	S. R. Eddy, What is a hidden Markov model, Nat. Biotechnol., 22 (2004), 1315–1316. https://doi.org/10.1038/nbt1004-1315 doi: 10.1038/nbt1004-1315
[11]	J. Wang, S. Sun, Y. Ning, M. Zhang, W. Pang, Ultrasonic TDoA indoor localization based on Piezoelectric Micromachined Ultrasonic Transducers, in 2021 IEEE International Ultrasonics Symposium (IUS), (2021), 1–3. https://doi.org/10.1109/IUS52206.2021.9593813
[12]	C. Liu, J. Yun, J. Su, Direct solution for fixed source location using well-posed TDOA and FDOA measurements, J. Syst. Eng. Electron., 31 (2020), 666–673. https://doi.org/10.23919/JSEE.2020.000042 doi: 10.23919/JSEE.2020.000042
[13]	T. Hayashi, S. Watanabe, T. Toda, T. Hori, J. Le Roux, K. Takeda, BLSTM-HMM hybrid system combined with sound activity detection network for polyphonic sound event detection, in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2017), 766–770. https://doi.org/10.1109/ICASSP.2017.7952259
[14]	H. Zhu, H. Wan, Single sound source localization using convolutional neural networks trained with spiral source, in 2020 5th International Conference on Automation, Control and Robotics Engineering (CACRE), (2020), 720–724. https://doi.org/10.1109/CACRE50138.2020.9230056
[15]	S. Adavanne, A. Politis, J. Nikunen, T. Virtanen, Sound event localization and detection of overlapping sources using convolutional recurrent neural networks, IEEE J. Sel. Top. Signal Process., 13 (2019), 34–48. https://doi.org/10.1109/JSTSP.2018.2885636 doi: 10.1109/JSTSP.2018.2885636
[16]	T. Komatsu, M. Togami, T. Takahashi, Sound event localization and detection using convolutional recurrent neural networks and gated linear units, in 2020 28th European Signal Processing Conference (EUSIPCO), (2021), 41–45. https://doi.org/10.23919/Eusipco47968.2020.9287372
[17]	V. Spoorthy, S. G. Koolagudi, A transpose-SELDNet for polyphonic sound event localization and detection, in 2023 IEEE 8th International Conference for Convergence in Technology (I2CT), (2023), 1–6. https://doi.org/10.1109/I2CT57861.2023.10126251
[18]	J. S. Kim, H. J. Park, W. Shin, S. W. Han, AD-YOLO: You look only once in training multiple sound event localization and detection, in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2023), 1–5. https://doi.org/10.1109/ICASSP49357.2023.10096460
[19]	J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 779–788. https://doi.org/10.1109/CVPR.2016.91
[20]	H. Zhang, I. McLoughlin, Y. Song, Robust sound event recognition using convolutional neural networks, in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (2015), 559–563. https://doi.org/10.1109/ICASSP.2015.7178031
[21]	H. Phan, L. Pham, P. Koch, N. Q. K. Duong, I. McLoughlin, A. Mertins, On multitask loss function for audio event detection and localization, preprint, arXiv: 2009.05527.
[22]	S. Adavanne, A. Politi, T. Virtanen, Localization, detection and tracking of multiple moving sound sources with a convolutional recurrent neural network, preprint, arXiv: 1904.12769.
[23]	Z. X. Han, Research on robot sound source localization method based on beamforming (in Chinese), Nanjing Univ. Inf. Sci. Technol., 2022 (2022). https://doi.org/10.27248/d.cnki.gnjqc.2021.000637 doi: 10.27248/d.cnki.gnjqc.2021.000637
[24]	T. N. T. Nguyen, K. N. Watcharasupat, N. K. Nguyen, D. L. Jones, W. S. Gan, SALSA: Spatial cue-augmented log-spectrogram features for polyphonic sound event localization and detection, IEEE/ACM Trans. Audio Speech Lang. Process., 30 (2022), 1749–1762. https://doi.org/10.1109/TASLP.2022.3173054 doi: 10.1109/TASLP.2022.3173054
[25]	A. Politis, A. Mesaros, S. Adavanne, T. Heittola, T. Virtanen, Overview and evaluation of sound event localization and detection in DCASE 2019, IEEE/ACM Trans. Audio Speech Lang. Process., 29 (2021), 684–698. https://doi.org/10.1109/TASLP.2020.3047233 doi: 10.1109/TASLP.2020.3047233
[26]	Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, M. D. Plumbley, Polyphonic sound event detection and localization using a two-stage strategy, preprint, arXiv: 1905.00268.
[27]	K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 770–778. https://doi.org/10.1109/CVPR.2016.90
[28]	J. Naranjo-Alcazar, S. Perez-Castanos, J. Ferrandis, P. Zuccarello, M. Cobos, Sound event localization and detection using squeeze-excitation residual CNNs, preprint, arXiv: 2006.14436.
[29]	R. Ranjan, S. Jayabalan, T. Nguyen, W. Gan, Sound event detection and direction of arrival estimation using Residual Net and recurrent neural networks, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019), (2019), 214–218. https://doi.org/10.33682/93dp-f064
[30]	J. Hu, L. Shen, S. Albanie, G. Sun, E. Wu, Squeeze-and-excitation networks, IEEE Trans. Pattern Anal. Mach. Intell., 42 (2020), 2011–2023. https://doi.org/10.1109/TPAMI.2019.2913372 doi: 10.1109/TPAMI.2019.2913372
[31]	D. L. Huang, R. F. Perez, Sseldnet: A fully end-to-end sample-level framework for sound event localization and detection, in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021), (2021), 1–5.
[32]	S. Woo, J. Park, J. Y. Lee, I. S. Kweon, CBAM: Convolutional Block Attention Module, in Proceedings of the European Conference on Computer Vision (ECCV), (2018), 3–19. https://doi.org/10.1007/978-3-030-01234-2_1
[33]	J. W. Kim, G. W. Lee, C. S. Park, H. K. Kim, Sound event detection using EfficientNet-B2 with an attentional pyramid network, in 2023 IEEE International Conference on Consumer Electronics (ICCE), (2023), 1–2. https://doi.org/10.1109/ICCE56470.2023.10043590
[34]	C. Xu, H. Liu, Y. Min, Y. Zhen, Sound event localization and detection based on dual attention (in Chinese), Comput. Eng. Appl., 2022 (2022), 1–11.
[35]	Q. Wang, B. Wu, P. Zhu, P. Li, W. Zuo, Q. Hu, ECA-Net: Efficient channel attention for deep convolutional neural networks, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 11534–11542. https://doi.org/10.1109/CVPR42600.2020.01155
[36]	J. Jia, M. Sun, G. Wu, W. Qiu, W. G. Qiu, DeepDN_iGlu: Prediction of lysine glutarylation sites based on attention residual learning method and DenseNet, Math. Biosci. Eng., 20 (2023), 2815–2830. https://doi.org/10.3934/mbe.2023132 doi: 10.3934/mbe.2023132
[37]	Z. Yang, L. Zhu, Y. Wu, Y. Yang, Gated Channel Transformation for visual recognition, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020), 11791–11800. https://doi.org/10.1109/CVPR42600.2020.01181
[38]	H. Zhang, C. Wu, Z. Zhang, Y. Zhu, H. Lin, Z. Zhang, et al., ResNeSt: Split-attention networks, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (2022), 2735–2745. https://doi.org/10.1109/CVPRW56347.2022.00309
[39]	Q. Hou, D. Zhou, J. Feng, Coordinate attention for efficient mobile network design, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 13708–13717. https://doi.org/10.1109/CVPR46437.2021.01350
[40]	A. Politis, S. Adavanne, T. Virtanen, A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection, preprint, arXiv: 2006.01919.
[41]	A. Politis, S. Adavanne, D. Krause, A. Deleforge, P. Srivastava, T. Virtanen, A dataset of dynamic reverberant sound scenes with directional interferers for sound event localization and detection, preprint, arXiv: 2106.06999.
[42]	A. Politis, A. Mesaros, S. Adavanne, T. Heittola, T. Virtanen, Overview and evaluation of sound event localization and detection in DCASE 2019, IEEE/ACM Trans. Audio Speech Lang. Process., 29 (2021), 684–698. https://doi.org/10.1109/TASLP.2020.3047233 doi: 10.1109/TASLP.2020.3047233
[43]	K. Liu, X. Zhao, Y. Hu, Y. Fu, Modeling the effects of individual and group heterogeneity on multi-aspect rating behavior, Front. Data Domputing, 2 (2020), 59–77. https://doi.org/10.11871/jfdc.issn.2096-742X.2020.02.005 doi: 10.11871/jfdc.issn.2096-742X.2020.02.005
[44]	Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-scale pretrained audio neural networks for audio pattern recognition, IEEE/ACM Trans. Audio Speech Lang. Process., 28 (2020), 2880–2894. https://doi.org/10.1109/TASLP.2020.3030497 doi: 10.1109/TASLP.2020.3030497

Reader Comments

Your name:*

Email:*
© 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)